What are smart ways to create synthetic text data?

I’m using textual data for a classification model, but I have unbalanced classes. What are some smart ways to create synthetic data for the smallest classes?

Handling unbalanced classes or limited data by augmenting your existing data or generating synthetic data is not as well researched for text as it is for images. There is no clear “best way” of augmenting or creating new data. That said, there are quite a few promising techniques that would make sense to try out in this case.

Before getting into that, I just want to mention that the Peltarion Platform deals with imbalanced classes using class weights. This can be sufficient in some cases, but in others you would want to look into methods more in line with your question.
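
To make the idea of class weights concrete, here is a minimal sketch of the common inverse-frequency ("balanced") heuristic. This is an assumption for illustration only: I am not claiming it is the exact formula the Peltarion Platform uses, and the function name and example labels are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get a weight above 1, frequent classes a weight below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up example: 10 minority samples versus 90 majority samples.
labels = ["complaint"] * 10 + ["other"] * 90
weights = balanced_class_weights(labels)
print(weights)
```

Most training libraries accept a mapping like this as a per-class loss weight, so errors on the small class count for more during training.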

There are many different ways of generating or augmenting the data in the minority classes. This post is a nice overview of common approaches. The approaches that I would suggest from this overview are:

  1. Back-Translation. Although it can be effective, please be aware that it can be computationally intensive, as two translation models are required (one to translate into another language and one to translate back) to generate “new” data. If you do not have the infrastructure to support this, I would not recommend this approach.

  2. TF-IDF based word replacement, where words are replaced based on their TF-IDF scores. Not nearly as computationally expensive as the approach above, but not trivially cheap either.

  3. Error injection/word swap. These are simpler approaches that are very cheap and easy to use, but I would not expect any massive improvements from them over simply using class weights or over-sampling the minority classes.
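
To give a feel for how cheap the third option is, here is a rough standard-library sketch that over-samples a minority class with word swaps and injected typos. All function names, parameters, and example sentences are hypothetical, and the default rates are arbitrary starting points rather than tuned values.

```python
import random


def swap_words(text, n_swaps=1, rng=random):
    """Swap n_swaps randomly chosen pairs of adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def inject_typos(text, rate=0.05, rng=random):
    """Transpose neighbouring letters with probability `rate` per position."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair we just transposed
        else:
            i += 1
    return "".join(chars)


def oversample_with_noise(texts, target_size, seed=0):
    """Keep the originals and add noisy copies until target_size is reached."""
    if not texts:
        raise ValueError("need at least one text to augment")
    rng = random.Random(seed)
    augmented = list(texts)
    while len(augmented) < target_size:
        base = rng.choice(texts)
        augmented.append(inject_typos(swap_words(base, rng=rng), rng=rng))
    return augmented


minority = ["the delivery was late again", "package arrived damaged"]
for sample in oversample_with_noise(minority, target_size=6):
    print(sample)
```

Only augment the training split. If noisy copies of a sentence end up in the validation data as well, the scores will look better than they really are.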

If you are familiar with Python, I would suggest looking into the Python package and repository nlpaug, where all of the above-mentioned techniques are implemented and easy to apply to any text data out of the box.
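
As a starting point, a word-swap augmenter in nlpaug might be used roughly like this. It is a sketch assuming nlpaug is installed (`pip install nlpaug`); the helper name `augment_with_swaps` is made up, and as far as I recall the return type of `augment` differs between nlpaug releases, so the code accepts both a string and a list.

```python
def augment_with_swaps(texts, copies_per_text=2):
    """Return word-swapped variants of each text using nlpaug."""
    # Imported lazily so the module loads even where nlpaug is missing.
    import nlpaug.augmenter.word as naw

    aug = naw.RandomWordAug(action="swap")
    augmented = []
    for text in texts:
        for _ in range(copies_per_text):
            result = aug.augment(text)
            # Newer releases return a list, older ones a plain string.
            if isinstance(result, list):
                augmented.extend(result)
            else:
                augmented.append(result)
    return augmented


if __name__ == "__main__":
    for variant in augment_with_swaps(["the delivery was late again"]):
        print(variant)
```

The same module also provides augmenters for back-translation and TF-IDF replacement, but those need downloaded translation models or a TF-IDF model fitted on your own corpus first, so check the nlpaug documentation for their setup.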

There are also more novel approaches to this issue, such as learnable augmentation techniques (as in Text AutoAugment) or generative models. However, for most cases I would expect them to be overkill. Please be aware that such approaches cannot be applied out of the box and require a lot of work to set up for your data.

Hopefully this has answered your question, otherwise feel free to ask if anything is unclear!

Thanks @Axel ! Enough leads to dive into and see how they can help me. I’ll be trying out some of them and see how they’ll work for me. Great stuff!

Hi Nasnl & Axel,
I just wanted to add that the back-translate part can quite easily be done in a “no-code” way using Google Translate and Google Sheets.

Just add a column for the language you want to translate “through” and then an extra column to translate back.
The trick is playing around a little with what language to translate through, and for some cases maybe translating through 2 languages if the translation is too good.
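
For anyone who would rather script it than use a sheet, the same round trip could be sketched in Python along these lines. Note that `translate(text, src, dst)` is a hypothetical callable you would have to supply yourself (a translation API client or a local model), and the language codes `"vi"` and `"fi"` are arbitrary examples, not recommendations.

```python
def back_translate(text, translate, source="en", pivots=("vi",)):
    """Translate through each pivot language in turn, then back to source.

    `translate(text, src, dst)` is a hypothetical callable supplied by the
    caller; nothing here talks to a real translation service.
    """
    current, current_lang = text, source
    for lang in pivots:
        current = translate(current, current_lang, lang)
        current_lang = lang
    return translate(current, current_lang, source)


def paraphrases(texts, translate, pivot_options=(("vi",), ("vi", "fi"))):
    """Try pivot chains in order and keep the first result that differs."""
    results = []
    for text in texts:
        for pivots in pivot_options:
            candidate = back_translate(text, translate, pivots=pivots)
            # A "too good" translation returns the original text, which
            # adds nothing, so fall through to the next (longer) chain.
            if candidate.strip().lower() != text.strip().lower():
                results.append(candidate)
                break
    return results
```

The second chain only runs when the first one hands back the original sentence, which is the “translate through 2 languages” trick in code form.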

Happy modeling!
/Björn

Hi Bjorn,

That is a very creative solution and easy to deploy for practical projects. Nice.

I tried a few languages and I can recommend back-translations to/from Vietnamese. Those translations are notoriously bad… which is a good thing here. :smiley:

Jeroen

Haha, let’s hope they don’t improve that then :slight_smile:
