How should I prepare text classification data?

Is there any chance you can give me tips on how to prepare my text classification data?

When you create a dataset, you should consider two things: what data you need and what file format to use.

What data do I need?
The most important part is, of course, to consider the problem you’re trying to solve. Think of the future use case in terms of three questions: “What question will I ask my AI model?”, “What data will I send to help answer the question?”, and “What response would I like to get back?”.

The dataset you prepare should contain answers to these questions.

Let’s take a simple example: training a model to predict the sentiment (positive or negative) of a piece of text.

  • What question will I ask my AI model?
    Can you predict the sentiment of this text?
    The answer gives you a rough idea of the two parts of the dataset:
    the text (input) and the sentiment (output).
  • What data will I send to help answer the question?
    One paragraph of text.
  • What response would I like to get back?
    The sentiment, labelled so that I can understand it: positive/negative, true/false, 1/0, etc.

Now you have the specification of your dataset!

File format
As you can read here, the file format you need to create is a CSV file. The easiest way for most people is to use a spreadsheet editor (Google Sheets, MS Excel, or similar) and save the file as CSV.

So for the example above, you should have a spreadsheet where the first column contains one paragraph of text and the second column contains the true sentiment for the text on the same row.
Use the same notation format for all rows (you can’t mix positive/negative with true/false).

Use a header row
Add a first row with headers, so that the platform knows what to call the different features.
When you have this, all you have to do is save it as a CSV file and upload it to the Peltarion Platform.
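If you prefer to build the file in code rather than a spreadsheet, the same layout (a header row, one text column, one label column with consistent notation) can be sketched with Python’s standard `csv` module. The file name and example rows below are made up for illustration:

```python
import csv

# Hypothetical example rows: (text, sentiment).
# The label notation is consistent across all rows (positive/negative only).
rows = [
    ("I loved every minute of this film.", "positive"),
    ("The service was slow and the food was cold.", "negative"),
]

with open("sentiment_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "sentiment"])  # header row names the features
    writer.writerows(rows)
```

The resulting CSV file has the header row first and one example per line, ready to upload.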

Remember that a model can only be trained on rows without empty values, but the Platform will help you clean the dataset once you’ve uploaded it.
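If you’d rather clean the file yourself before uploading, here is a minimal sketch (the function name and file paths are hypothetical) that copies a CSV and keeps only rows where every cell is filled in:

```python
import csv

def drop_incomplete_rows(in_path, out_path):
    """Copy a CSV file, keeping the header and only rows with no empty cells."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # always keep the header row
        for row in reader:
            # Keep the row only if every cell contains non-whitespace text.
            if all(cell.strip() for cell in row):
                writer.writerow(row)
```

For example, `drop_incomplete_rows("raw.csv", "clean.csv")` writes a copy of `raw.csv` with any partially empty rows removed.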

Good luck!