Do you have tons of data and want to take advantage of data science and AI? Both are slowly but surely coming to the retail and e-commerce sector, where businesses constantly generate huge amounts of data. Before going into any modeling or data analysis, you need to prepare the data in a way that machines can work with it. We decided to write an article about the main approaches for preparing categorical variables, called data encoding, mostly as seen in survey data.
What does data preparation mean?
Computers like variables as numbers, so all textual values need to be converted to numerical equivalents before they can be used by machine learning algorithms and deep learning neural networks. Simply put, they don’t like text. Here is an example.
We have variables like:
- “Brand of car” with values like “Mercedes”, “BMW” and others
- “Retailer name” with values like “Kaufland”, “Metro” or “Mr. Bricolage”
- “Sex” with values “Male”, “Female” and many others 🙂
We noticed that there are many easy-to-use tools for getting started with your data and making your first predictions, which is great, and we love to see more and more businesses looking into data to make their everyday decisions. A common beginner mistake we notice is to grab all the data, put it into one modeler, and start watching the accuracy. Everything seems perfect, but what we can recommend is to first prepare the data into a format that computers can read. Data preparation is the first step after data observation. Normally you are given raw data, and that usually looks something like this.
Normally we split the data into three main blocks – demographic, behavioral and transactional data. What we noticed is that in demographic data a lot of the information comes in text format. All this looks wonderful, however you can’t run a machine learning algorithm on it; you can only use simple statistics and decision trees.
Main data preparation techniques
Ordinal encoder
The simplest technique is ordinal (or label) encoding: for each text category we assign a unique number, like:
- Programmer = 1,
- Doctor= 2
- Dentist = 3
- Teacher = 4 and so on
That is indeed one of the most widely used techniques and covers up to 90% of all cases. So if you remember only this one, you will have very good coverage 🙂
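As a minimal sketch of this idea (the toy dataframe and column name are made up for illustration), the mapping above can be done with a plain pandas `map`:

```python
import pandas as pd

# Toy survey column; the occupations match the list above
df = pd.DataFrame({"Occupation": ["Programmer", "Doctor", "Dentist", "Teacher", "Doctor"]})

# Assign each text category a unique number
mapping = {"Programmer": 1, "Doctor": 2, "Dentist": 3, "Teacher": 4}
df["Occupation_code"] = df["Occupation"].map(mapping)
```

Any value not present in the mapping would become `NaN`, which is a quick way to spot unexpected categories.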
If you are using Python, the most common libraries for encoding are scikit-learn and category_encoders. For reference, here are the encoders provided by category_encoders:
- Backward Difference Coding
- CatBoost Encoder
- Helmert Coding
- James-Stein Encoder
- Leave One Out
- One Hot
- Polynomial Coding
- Sum Coding
- Target Encoder
- Weight of Evidence
For Python practitioners, here is an example using scikit-learn.
Import and create an ordinal encoder:
from sklearn.preprocessing import OrdinalEncoder
ordinalencoder_X = OrdinalEncoder()
Define the columns you need to encode:
categorical_columns = ['Gender','Occupation','Work-time','Family-status']
Encode the selected columns:
df[categorical_columns] = ordinalencoder_X.fit_transform(df[categorical_columns])
That is it! Shout-out to decision trees – they work fantastically with just an ordinal encoder.
One hot encoder
Now this is a slightly less intuitive approach. Each value in the category gets its own column, and for a given row all columns are 0 except the column named after that row’s value, which is 1. Here is a visual:
You may ask why we need such an encoding. The answer is simple: if you assign numbers to fruits (see the example above), it makes no sense to say one fruit is bigger or smaller than another, yet that ordering can be inferred by various decision tree techniques later on. If you want to dive deeper, we found one good explanation on Stack Overflow.
Some Python code:
one_hot = pd.get_dummies(df['B'])
This gives you the one-hot columns for B. To see the whole encoded dataframe, drop the original column and join the new ones:
df = df.drop('B',axis = 1)
df = df.join(one_hot)
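Putting the pieces together, here is a self-contained sketch on a toy dataframe (the column names and values are invented for illustration):

```python
import pandas as pd

# Toy dataframe; column 'B' holds the categorical values to one-hot encode
df = pd.DataFrame({"A": [10, 20, 30], "B": ["apple", "banana", "apple"]})

# One 0/1 column per distinct value of B
one_hot = pd.get_dummies(df["B"]).astype(int)
df = df.drop("B", axis=1).join(one_hot)
```

After this, the dataframe has columns A, apple and banana, and each row has exactly one 1 among the new columns.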
Less known but still useful encoders
Binary encoder
This technique first converts the categories to numbers (an ordinal step), then converts those numbers to binary and creates as many columns as needed to fit the largest binary number. For example, 6 values fit into 3 columns, since 2 to the power of 3 = 8 ≥ 6.
It is used when one-hot encoding takes too much memory (imagine a separate column for every value) – why not use a binary encoder instead? For 8 values, only 3 columns are needed to represent them, each with its own weight, allowing for more compact models. It is still more expressive than ordinal encoding, where a single column has to express all the values, and it should not perform significantly worse than one-hot.
Instantiate an encoder – here we use BinaryEncoder:
import category_encoders as ce
ce_binary = ce.BinaryEncoder(cols=['color'])
Fit and transform, and presto, you’ve got encoded data:
df = ce_binary.fit_transform(df)
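If you’d rather see what binary encoding does under the hood, here is a plain pandas/NumPy sketch of the same idea (our own illustration, not the library’s actual implementation):

```python
import pandas as pd

# Toy column with 4 distinct values -> 2 binary columns are enough (2^2 = 4)
df = pd.DataFrame({"color": ["red", "green", "blue", "yellow", "red"]})

# Step 1: ordinal-encode the categories (alphabetical: blue=0, green=1, red=2, yellow=3)
codes = df["color"].astype("category").cat.codes.to_numpy()

# Step 2: write each code in binary, one column per bit (least significant bit first)
n_bits = max(int(codes.max()).bit_length(), 1)
for bit in range(n_bits):
    df[f"color_{bit}"] = (codes >> bit) & 1
```

Four distinct values produce just 2 columns here, where one-hot would have produced 4.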
BaseN encoder
N is the base we raise to a power: if N is 1 it is one-hot encoding, and if N is 2 it is basically the same as binary encoding. You will rarely need a custom base; however, the underlying principle is the same. The higher the base, the fewer columns you get: you lose some power to express each value, but you gain space, which may become necessary if you are constrained by RAM.
ce_basen = ce.BaseNEncoder(cols=['color'], base=4)
Hashing encoder
Each value is passed through a hash function, for example md5, and the hashed values are stored. This matters because with binary and base-N encoding the number of distinct values still determines the number of columns. Not so with the hashing encoder: each string or number passed through the hash function is mapped to a fixed number of columns (you can pass it as a parameter; the default is 8). If you have an extraordinarily large number of distinct values, you may want to use this one.
ce_hash = ce.HashingEncoder(cols=['color'])
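The principle can be sketched with Python’s standard hashlib (a simplified illustration of the idea; the library’s actual HashingEncoder is more involved):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
n_components = 8  # fixed number of output columns, matching the encoder's default

def bucket(value, n_components=8):
    # md5 the string, then map the digest to one of n_components buckets
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

buckets = df["color"].apply(bucket)
for i in range(n_components):
    df[f"col_{i}"] = (buckets == i).astype(int)
```

Note that the number of output columns never changes no matter how many distinct values appear, which is the whole point, at the cost of possible collisions between values.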
Frequency encoder
For each value you encode its relative frequency in the data. In the example the temperature is frequency encoded: Hot appears 40% of the time, so its value is 0.4; Cold appears 20% of the time, so it is 0.2, and so on.
Unlike the previous encoders, where you manipulate the number of columns used to express a category, here the number of columns stays the same. Instead of the category value you just put in a number, and unlike the ordinal encoder it is not just 1 to n. If you are going to stick with one column to express all the values, you can at least make the numbers somewhat useful. In cases where the frequency is related to the target variable, it helps the model understand and assign the weight, in direct or inverse proportion depending on the nature of the data.
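A minimal frequency-encoding sketch in pandas, using the percentages from the example above (the toy data is our own):

```python
import pandas as pd

# Toy data: Hot appears 40% of the time, Warm 40%, Cold 20%
df = pd.DataFrame({"Temperature": ["Hot", "Cold", "Hot", "Warm", "Warm"]})

# Relative frequency of each value, mapped back onto the column
freq = df["Temperature"].value_counts(normalize=True)
df["Temperature_freq"] = df["Temperature"].map(freq)
```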
Mean (target) encoder
For each value in the category you want to encode, calculate the average of the target variable and replace the value with that number. Here the mean for Hot is 0.75, which means that for every row with hot temperature we summed the target and divided by the number of rows. Be careful, because this can easily lead to overfitting.
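A minimal mean-encoding sketch in pandas, reproducing the 0.75 from the example (the toy target values are our own):

```python
import pandas as pd

# Toy data: for 'Hot' the target is 1 in three rows out of four -> mean 0.75
df = pd.DataFrame({
    "Temperature": ["Hot", "Hot", "Hot", "Hot", "Cold"],
    "Target": [1, 1, 1, 0, 0],
})

# Average target per category, mapped back onto the column
means = df.groupby("Temperature")["Target"].mean()
df["Temperature_mean"] = df["Temperature"].map(means)
```

In practice, the mean should be computed on the training set only and then applied to the test set, otherwise the encoding leaks the target.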
In real life you will most often use the ordinal and one-hot encoders, so make sure to try them out. If you are up for a data science competition, you can experiment and use the mean encoder, for example. If you really want to go into detail, there is a quick visualization created by Feature Labs that you can use. We hope this article answers some of the questions we raised at the beginning. If you have any comments or suggestions, let us know in the comments section.