What are the best practices in tabular data?

Can you share some tips on how I can start with tabular data? Sometimes I get multiple columns and I do not know if it’s relevant or not.

Multiple columns can be relevant or not; it’s one of those “it depends” questions :slight_smile:

What you need to consider is the business problem you are trying to solve. At a minimum you need two columns: one that represents the input data (what you will always know) and one that represents the output data (the value you want the model to learn to predict).

A column in a spreadsheet is typically called a “feature” in data science, simply because it contains the properties of the data that you deem valuable.

Let’s take an example: you want to predict the selling price of specific items at an auction.

By asking the experts, you learn that what matters most is the color, the condition and the purchase price. Those would be your 3 input features. Your output feature is the price the item was sold for, because that’s what you want to train the model to predict.

A dataset for this problem is then typically a spreadsheet with the columns: Color, Condition, Purchase price, Selling price.
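If it helps to see that layout concretely, here is a minimal sketch using only Python's standard library. The rows are invented purely for illustration, and the helper name `to_csv_text` is my own, not something the platform provides.

```python
import csv
import io

# Column order follows the example: three input features, then the output feature.
COLUMNS = ["Color", "Condition", "Purchase price", "Selling price"]

# Hypothetical rows; the values are made up for illustration only.
rows = [
    {"Color": "red", "Condition": "good", "Purchase price": 100.0, "Selling price": 150.0},
    {"Color": "blue", "Condition": "worn", "Purchase price": 80.0, "Selling price": 60.0},
    {"Color": "green", "Condition": "mint", "Purchase price": 200.0, "Selling price": 340.0},
]


def to_csv_text(rows, columns):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(rows, COLUMNS)
print(csv_text)
```

Each row is one auctioned item, and each column is one feature.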

Make sure that all values in the same column are of the same type (categorical data for color and condition, and values in the same currency for purchase and selling price).
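A quick consistency check along those lines could look like the sketch below. The function name `check_columns` and the sample rows are assumptions of mine. Note that it only verifies that price values are plain numbers; whether they are all in the same currency is something you still have to confirm yourself.

```python
def check_columns(rows, categorical, numeric):
    """Return a list of problems found; an empty list means the rows look consistent."""
    problems = []
    for i, row in enumerate(rows, start=1):
        for name in categorical:
            value = row.get(name)
            # Categorical values should be non-empty text labels.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: {name!r} should be a text label, got {value!r}")
        for name in numeric:
            value = row.get(name)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"row {i}: {name!r} should be a number, got {value!r}")
    return problems


# Hypothetical rows; the second one mixes a currency label into a price column.
rows = [
    {"Color": "red", "Condition": "good", "Purchase price": 100.0, "Selling price": 150.0},
    {"Color": "blue", "Condition": "worn", "Purchase price": "80 USD", "Selling price": 60.0},
]

problems = check_columns(
    rows,
    categorical=["Color", "Condition"],
    numeric=["Purchase price", "Selling price"],
)
for problem in problems:
    print(problem)
```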

Save that spreadsheet as a CSV and import it into the platform, choose your input and output features accordingly and you’re all set!
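On the platform you pick the input and output features in the interface, but the same idea can be sketched in plain Python. This is only an illustration of the split, not how the platform does it; `split_features` and the sample CSV text are hypothetical.

```python
import csv
import io

# Hypothetical CSV content, as it might look after exporting the spreadsheet.
csv_text = """Color,Condition,Purchase price,Selling price
red,good,100,150
blue,worn,80,60
"""

INPUT_FEATURES = ["Color", "Condition", "Purchase price"]
OUTPUT_FEATURE = "Selling price"


def split_features(csv_text, input_features, output_feature):
    """Read CSV text and separate each row into its inputs and its output value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    inputs, outputs = [], []
    for row in reader:
        # Input values are kept as the strings read from the CSV.
        inputs.append({name: row[name] for name in input_features})
        # The output is the number the model should learn to predict.
        outputs.append(float(row[output_feature]))
    return inputs, outputs


inputs, outputs = split_features(csv_text, INPUT_FEATURES, OUTPUT_FEATURE)
print(inputs)
print(outputs)
```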

Even if your problem is something completely different than my example above, just think in the same way, by starting with the business problem, then “ask the experts” and then create a dataset with the right input and output features.

Good luck!
