Extracting Feature Names from the ColumnTransformer
While using scikit-learn's ColumnTransformer, I ran into a “trivial” issue: after the features are transformed, they no longer have names in the resulting NumPy array.
On stackoverflow.com, it seems that a few other people have faced this issue and struggled with it. This post presents my version of a solution to this problem.
Code
Please find the whole code in my GitHub repo.
Inspiration
This code builds on previous work done in this post by Johannes Haupt, with some improvements that better fit my use case. The main improvement is in handling transformers that create bins (e.g. KBinsDiscretizer).
Transformers
For most transformers used in a ColumnTransformer, the scikit-learn library offers get_feature_names(). Sadly, KBinsDiscretizer does not offer this method, and the names OneHotEncoder produces by default use generic placeholders such as x0 rather than the original feature names.
Improvements
KBinsDiscretizer: this transformer was not covered by the original post from Johannes. The current solution generates a feature name for every bin, following the pattern feature_name: [bin_left_edge, bin_right_edge).
OneHotEncoder: this transformer was covered by the aforementioned post, but there was room for one improvement. As an example, if the OneHotEncoder were applied to the feature sex, which has three values (male, female and other), the result would look something like this:
['onehot__x0_male',
'onehot__x0_female',
'onehot__x0_other']
The current solution keeps track of both the original feature name (instead of replacing it with x0) and the value that is encoded for the feature.
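The two naming schemes above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the repo: the helper names bin_feature_names and onehot_feature_names are hypothetical, and the hard-coded inputs stand in for what a fitted KBinsDiscretizer exposes as bin_edges_ and a fitted OneHotEncoder exposes as categories_.

```python
def bin_feature_names(transformer_name, feature_name, bin_edges):
    """Return one name per bin, following 'feature_name: [left, right)'."""
    return [
        f"{transformer_name}__{feature_name}: [{float(left)}, {float(right)})"
        for left, right in zip(bin_edges[:-1], bin_edges[1:])
    ]


def onehot_feature_names(transformer_name, feature_names, categories):
    """Return one name per (feature, category) pair, keeping the real feature name."""
    return [
        f"{transformer_name}__{feature}_{category}"
        for feature, feature_categories in zip(feature_names, categories)
        for category in feature_categories
    ]


# Hypothetical inputs standing in for the fitted attributes
# KBinsDiscretizer.bin_edges_ and OneHotEncoder.categories_.
binned = bin_feature_names("binned_numeric", "DrivAge", [18.0, 26.0, 30.0, 34.0])
onehot = onehot_feature_names("onehot", ["sex"], [["male", "female", "other"]])

print(binned)
print(onehot)
```

The transformer name is prepended with a double underscore, the same separator that appears in the names produced by the original post.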
Example
This example is taken from the scikit-learn example pages. It covers preprocessing a dataset and training a model relevant to the insurance industry. Here is a short extract:
The first transformer discretizes the scalar value of DrivAge (driver age) into 10 bins. The second one-hot encodes the values of the features VehBrand, VehPower and VehGas. The last two transformers do not really change the structure of the original features.
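For reference, a ColumnTransformer matching this description could be set up roughly as follows. This is a sketch reconstructed from the description and from the scikit-learn example, not necessarily the exact extract: the transformer names binned_numeric, onehot_categorical and log_scaled_numeric and the variable name column_trans appear in this post, while passthrough_numeric, log_scale_transformer, the order of the transformers and remainder="drop" are assumptions.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

# Log-transform, then standardize: one output column per input column.
log_scale_transformer = make_pipeline(
    FunctionTransformer(np.log, validate=False),
    StandardScaler(),
)

column_trans = ColumnTransformer(
    [
        # Discretize the driver age into 10 bins.
        ("binned_numeric", KBinsDiscretizer(n_bins=10), ["DrivAge"]),
        # One-hot encode the categorical features.
        ("onehot_categorical", OneHotEncoder(), ["VehBrand", "VehPower", "VehGas"]),
        # Assumed name: this column is passed through unchanged.
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",  # assumption: columns not listed above are discarded
)
```

The transformer still has to be fitted on the dataset before any feature names can be extracted from it.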
Result
feature_names = get_feature_names(column_trans)
feature_names
Output (shortened):
['binned_numeric__DrivAge: [18.0, 26.0)',
'binned_numeric__DrivAge: [26.0, 30.0)',
'binned_numeric__DrivAge: [30.0, 34.0)',
.
.
.
'binned_numeric__DrivAge: [65.0, 99.0)',
'onehot_categorical__VehBrand_B1',
'onehot_categorical__VehBrand_B10',
'onehot_categorical__VehBrand_B11',
.
.
.
'onehot_categorical__VehBrand_B6',
.
.
.
'onehot_categorical__VehPower_15.0',
'onehot_categorical__VehGas_Diesel',
'onehot_categorical__VehGas_Regular',
'BonusMalus',
'log_scaled_numeric__Density']
Just by looking at the output, it is clear which transformer generated a particular feature (binned_numeric, onehot_categorical or log_scaled_numeric), what the original feature name is (DrivAge, VehBrand, VehPower, VehGas, Density) and what information is encoded in that feature (e.g. segment [18, 26), brand B1, gas Diesel, etc.).
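Because the transformer name is separated from the rest by a double underscore, the names can also be split programmatically, for example to group features by the transformer that produced them. A minimal sketch, assuming that convention holds and that passed-through features carry no prefix (as BonusMalus does in the output); split_feature_name is a hypothetical helper, not part of the repo code:

```python
def split_feature_name(name):
    """Split 'transformer__rest' into (transformer, rest).

    Names without a '__' separator (e.g. passed-through columns)
    are returned as (None, name).
    """
    transformer, separator, rest = name.partition("__")
    if not separator:
        return None, name
    return transformer, rest


parts = [
    split_feature_name(name)
    for name in [
        "binned_numeric__DrivAge: [18.0, 26.0)",
        "onehot_categorical__VehGas_Diesel",
        "BonusMalus",
    ]
]
print(parts)
```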