Managing categorical data
In many classification problems, the target dataset is made up of categorical labels that cannot immediately be processed by every algorithm. An encoding is needed, and scikit-learn offers at least two valid options. Let's consider a very small dataset made of 10 categorical samples with 2 features each:
import numpy as np
X = np.random.uniform(0.0, 1.0, size=(10, 2))
Y = np.random.choice(('Male', 'Female'), size=(10))
print(X[0])
array([ 0.8236887 , 0.11975305])
print(Y[0])
'Male'
The first option is to use the LabelEncoder class, which adopts a dictionary-oriented approach: each category label is associated with a progressive integer, that is, an index into an instance array called classes_:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
yt = le.fit_transform(Y)
print(yt)
[0 0 0 1 0 1 1 0 0 1]
le.classes_
array(['Female', 'Male'], dtype='|S6')
The inverse transformation can be obtained in this simple way:
output = [1, 0, 1, 1, 0, 0]
decoded_output = [le.classes_[int(i)] for i in output]
print(decoded_output)
['Male', 'Female', 'Male', 'Male', 'Female', 'Female']
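As a minimal sketch, the same decoding can also be done with the encoder's own inverse_transform method, which avoids indexing classes_ by hand. The label array below is illustrative and is not the random sample generated earlier:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels (assumed for this sketch, not the random Y above)
labels = np.array(['Male', 'Female', 'Female', 'Male'])

le = LabelEncoder()
encoded = le.fit_transform(labels)  # classes are sorted: 'Female' -> 0, 'Male' -> 1

# inverse_transform maps integer codes back to the original labels
output = np.array([1, 0, 1, 1, 0, 0])
decoded = le.inverse_transform(output)
print(decoded)
```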
This approach is simple and works well in many cases, but it has a drawback: all labels are turned into sequential numbers. A classifier that works with real values will then consider similar numbers according to their distance, without any concern for semantics. For this reason, it's often preferable to use so-called one-hot encoding, which binarizes the data. For labels, this can be achieved by using the LabelBinarizer class:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
Yb = lb.fit_transform(Y)
array([[1],
[0],
[1],
[1],
[1],
[1],
[0],
[1],
[1],
[1]])
lb.inverse_transform(Yb)
array(['Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male',
'Male', 'Male'], dtype='|S6')
In this case, each categorical label is first turned into a positive integer and then transformed into a vector where only one feature is 1 while all the others are 0. (With only two classes, as in the output above, LabelBinarizer returns a single column of 0s and 1s; with three or more classes, it returns one column per class.) This means, for example, that a softmax distribution with a peak corresponding to the main class can easily be turned into a discrete vector whose only non-null element corresponds to the right class. For example, consider the following code:
import numpy as np
Y = lb.fit_transform(Y)
array([[0, 1, 0, 0, 0],
[0, 0, 0, 1, 0],
[1, 0, 0, 0, 0]])
# model is assumed to be an already trained classifier that outputs class probabilities
Yp = model.predict(X[:1])
array([[0.002, 0.991, 0.001, 0.005, 0.001]])
Ypr = np.round(Yp)
array([[ 0., 1., 0., 0., 0.]])
lb.inverse_transform(Ypr)
array(['Female'], dtype='|S6')
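The following sketch makes the same idea self-contained without a trained model. The three color labels are hypothetical, and the probability rows are made-up stand-ins for a softmax output. It uses np.argmax rather than np.round, because rounding produces an all-zero row whenever no probability exceeds 0.5:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical three-class labels (not from the dataset above)
labels = np.array(['red', 'green', 'blue', 'green'])

lb3 = LabelBinarizer()
Y_onehot = lb3.fit_transform(labels)  # one column per class, sorted: blue, green, red

# Made-up probabilities standing in for a model's softmax output
Yp = np.array([[0.10, 0.85, 0.05],
               [0.40, 0.35, 0.25]])  # second row: no value above 0.5

# Build one-hot rows from the index of the largest probability in each row
Ypr = np.zeros_like(Yp)
Ypr[np.arange(len(Yp)), Yp.argmax(axis=1)] = 1.0

decoded = lb3.inverse_transform(Ypr)
print(decoded)
```

With np.round, the second row would become all zeros and carry no class information, while the argmax version still selects the most probable class.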
Another approach to categorical features can be adopted when they're structured as a list of dictionaries. The dictionaries need not be dense: each one may hold values for only a few of the features. For example:
data = [
{ 'feature_1': 10.0, 'feature_2': 15.0 },
{ 'feature_1': -5.0, 'feature_3': 22.0 },
{ 'feature_3': -2.0, 'feature_4': 10.0 }
]
In this case, scikit-learn offers the DictVectorizer and FeatureHasher classes; both produce sparse matrices of real numbers that can be fed into models that accept sparse input. The latter has limited memory consumption because it stores no vocabulary and instead adopts MurmurHash3 (refer to https://en.wikipedia.org/wiki/MurmurHash for further information), a general-purpose, non-cryptographic hash function with a 32-bit output, which is therefore not collision-resistant. The code for these two methods is as follows:
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
dv = DictVectorizer()
Y_dict = dv.fit_transform(data)
Y_dict.todense()
matrix([[ 10., 15., 0., 0.],
[ -5., 0., 22., 0.],
[ 0., 0., -2., 10.]])
dv.vocabulary_
{'feature_1': 0, 'feature_2': 1, 'feature_3': 2, 'feature_4': 3}
fh = FeatureHasher()
Y_hashed = fh.fit_transform(data)
Y_hashed.todense()
matrix([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
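The hashed output above is mostly elided because FeatureHasher defaults to 2**20 columns. As a rough sketch of the trade-off between the two classes, the example below shrinks the hashed width to 8 columns (an arbitrary choice for display) and transforms a sample containing a feature that was never seen during fitting; the name feature_9 is hypothetical:

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

data = [
    {'feature_1': 10.0, 'feature_2': 15.0},
    {'feature_1': -5.0, 'feature_3': 22.0},
    {'feature_3': -2.0, 'feature_4': 10.0},
]

# A new sample with a feature name that does not appear in data (hypothetical)
new_sample = [{'feature_1': 1.0, 'feature_9': 3.0}]

# DictVectorizer keeps a vocabulary, so unseen feature names are silently dropped
dv = DictVectorizer(sparse=False)
dv.fit(data)
X_dv = dv.transform(new_sample)
print(X_dv)

# FeatureHasher keeps no vocabulary: the width is fixed up front and any
# feature name is hashed into one of the n_features columns
fh_small = FeatureHasher(n_features=8)
X_small = fh_small.transform(data)
X_new = fh_small.transform(new_sample)
print(X_small.shape)
print(X_small.toarray())
```

With so few columns, distinct feature names can collide in the same column, which is the price paid for not storing a vocabulary.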
In both cases, I suggest you read the official scikit-learn documentation so that you know all of the possible options and parameters.
When working with categorical features (normally converted into positive integers through LabelEncoder), it's also possible to filter the dataset in order to apply one-hot encoding by using the OneHotEncoder class. In the following example, the first feature is a binary index that indicates 'Male' or 'Female':
from sklearn.preprocessing import OneHotEncoder
data = [
[0, 10],
[1, 11],
[1, 8],
[0, 12],
[0, 15]
]
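# Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22;
# newer releases select columns with sklearn.compose.ColumnTransformer instead
oh = OneHotEncoder(categorical_features=[0])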
Y_oh = oh.fit_transform(data)
Y_oh.todense()
matrix([[ 1., 0., 10.],
[ 0., 1., 11.],
[ 0., 1., 8.],
[ 1., 0., 12.],
[ 1., 0., 15.]])
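The categorical_features argument used above is no longer available in current scikit-learn releases (it was deprecated in 0.20 and removed in 0.22). A minimal sketch of one way to obtain the same result today, assuming a release that provides sklearn.compose.ColumnTransformer, is the following:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = np.array([
    [0, 10],
    [1, 11],
    [1, 8],
    [0, 12],
    [0, 15],
])

# One-hot encode column 0 only; pass the remaining column through unchanged
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), [0])],
    remainder='passthrough',
)
Y_ct = ct.fit_transform(data)

# The result can be sparse or dense depending on its density, so normalize it
if hasattr(Y_ct, 'toarray'):
    Y_ct = Y_ct.toarray()
print(Y_ct)
```

The encoded columns come first and the passed-through column is appended at the end, which matches the layout of the matrix shown above.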
Because these approaches increase the number of values (one-hot encoding adds a column for every distinct category), DictVectorizer, FeatureHasher, and OneHotEncoder return sparse matrices by default, based on the SciPy implementation. See https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html for further information.
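To give a feel for why sparse storage matters, the sketch below one-hot encodes a hypothetical column with 100 distinct categories over 1,000 samples and compares the number of stored values with the size of the equivalent dense matrix:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column: 1,000 samples cycling through 100 integer codes
codes = (np.arange(1000) % 100).reshape(-1, 1)

encoded = OneHotEncoder().fit_transform(codes)

is_sparse = sparse.issparse(encoded)              # OneHotEncoder returns sparse output by default
stored = encoded.nnz                              # one stored value per sample
dense_size = encoded.shape[0] * encoded.shape[1]  # values a dense matrix would hold

print(is_sparse, encoded.shape, stored, dense_size)
```

Only one value per row is non-zero, so the sparse representation stores 1,000 entries where a dense array would hold 100,000.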