Extracting Feature Names from the ColumnTransformer

Bujar Bakiu
2 min read · Sep 29, 2021



While using scikit-learn's ColumnTransformer, I ran into a “trivial” issue: after the features are transformed, they no longer have names in the resulting numpy array.

On stackoverflow.com, it seems that a few other people have faced this issue and struggled with it. This post presents my way of solving the problem.

Code

The whole code is available in my GitHub repo.

Inspiration

This code builds on earlier work from a post by Johannes Haupt. It makes some changes that fit my use case better. The main improvement is the handling of transformers that create bins (e.g. KBinsDiscretizer).

Transformers

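For most transformers used in a ColumnTransformer, the scikit-learn library offers get_feature_names(). Sadly, this does not cover every case: KBinsDiscretizer does not have this method, and the names that OneHotEncoder produces by default use placeholders such as x0 instead of the original feature name.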

Improvements

KBinsDiscretizer: this transformer was not covered by the original post from Johannes. The current solution generates one feature name for every bin, following the pattern feature_name: [bin_left_edge, bin_right_edge).
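To make the naming pattern concrete, here is a minimal sketch of how such bin names could be built. The helper bin_feature_names and the edge values are hypothetical and only illustrate the pattern; this is not the code from the repo. On a fitted KBinsDiscretizer, the edges would come from its bin_edges_ attribute.

```python
from typing import List, Sequence


def bin_feature_names(feature_name: str, bin_edges: Sequence[float]) -> List[str]:
    """Build one name per bin, following 'feature_name: [left, right)'."""
    names = []
    # Consecutive edges delimit one bin: n + 1 edges describe n bins.
    for left, right in zip(bin_edges[:-1], bin_edges[1:]):
        names.append(f"{feature_name}: [{left}, {right})")
    return names


# Made-up edges for illustration; on a fitted KBinsDiscretizer they would
# come from its bin_edges_ attribute (one array of edges per input column).
driv_age_names = bin_feature_names("DrivAge", [18.0, 26.0, 30.0, 34.0])
print(driv_age_names)
```

In the final output, the transformer name (for example binned_numeric) is added in front of each of these names as a prefix.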

OneHotEncoder: this transformer was covered by the aforementioned post, but there was room for one improvement. As an example, if the OneHotEncoder were applied to the feature sex, which has three values (male, female and other), the result would look something like this:

['onehot__x0_male',
'onehot__x0_female',
'onehot__x0_other']

The current solution keeps the original feature name (instead of replacing it with x0) together with the value that is encoded for that feature.
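As a rough illustration of this idea, the sketch below pairs each original feature name with its categories. The helper onehot_feature_names is hypothetical and not the code from the repo. On a fitted OneHotEncoder, the categories would come from its categories_ attribute.

```python
from typing import List, Sequence


def onehot_feature_names(
    feature_names: Sequence[str], categories: Sequence[Sequence[object]]
) -> List[str]:
    """Build 'feature_category' names, one per encoded category."""
    names = []
    # Each input feature is paired with its own list of categories.
    for feature, feature_categories in zip(feature_names, categories):
        for category in feature_categories:
            names.append(f"{feature}_{category}")
    return names


# Categories are written out by hand here; a fitted OneHotEncoder exposes
# them through its categories_ attribute (one array per input column).
sex_names = onehot_feature_names(["sex"], [["female", "male", "other"]])
print(sex_names)
```

With the transformer name in front, this gives names such as onehot__sex_male rather than onehot__x0_male.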

Example

This example is taken from the scikit-learn example pages. It preprocesses a dataset and trains a model that is relevant to the insurance industry. Here is a short extract:

The first transformer discretizes the numeric value of DrivAge (driver age) into 10 bins. The second one-hot encodes the values of the features VehBrand, VehPower and VehGas. The last two transformers do not really change the structure of the original features.
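As a rough sketch of the setup just described (not the exact code from the scikit-learn example), a ColumnTransformer along these lines could look as follows. The toy DataFrame and its values are made up for illustration, and n_bins is set to 3 instead of 10 because the toy data has only six rows. The transformer names are taken from the output shown in the Result section, except passthrough_numeric, which is an assumed name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

# Toy data standing in for the insurance dataset (all values are made up).
df = pd.DataFrame(
    {
        "DrivAge": [18, 25, 33, 41, 58, 70],
        "VehBrand": ["B1", "B2", "B1", "B12", "B2", "B1"],
        "VehPower": [4, 5, 4, 6, 5, 4],
        "VehGas": ["Diesel", "Regular", "Regular", "Diesel", "Regular", "Diesel"],
        "BonusMalus": [50, 60, 50, 95, 50, 72],
        "Density": [1200.0, 54.0, 76.0, 3000.0, 27.0, 450.0],
    }
)

# Log-transform, then standardize: keeps a single output column.
log_scale_transformer = make_pipeline(FunctionTransformer(np.log), StandardScaler())

column_trans = ColumnTransformer(
    [
        # n_bins=3 only because the toy data is tiny; the text uses 10 bins.
        ("binned_numeric", KBinsDiscretizer(n_bins=3), ["DrivAge"]),
        ("onehot_categorical", OneHotEncoder(), ["VehBrand", "VehPower", "VehGas"]),
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",
)

X = column_trans.fit_transform(df)
# 3 bins + (3 + 3 + 2) one-hot columns + 1 passthrough + 1 log-scaled column
print(X.shape)
```

The result has no column names of its own, which is exactly the gap that the feature-name extraction fills.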

Result

feature_names = get_feature_names(column_trans)
feature_names

Output (shortened):


['binned_numeric__DrivAge: [18.0, 26.0)',
'binned_numeric__DrivAge: [26.0, 30.0)',
'binned_numeric__DrivAge: [30.0, 34.0)',
.
.
.
'binned_numeric__DrivAge: [65.0, 99.0)',
'onehot_categorical__VehBrand_B1',
'onehot_categorical__VehBrand_B10',
'onehot_categorical__VehBrand_B11',
.
.
.
'onehot_categorical__VehBrand_B6',
.
.
.
'onehot_categorical__VehPower_15.0',
'onehot_categorical__VehGas_Diesel',
'onehot_categorical__VehGas_Regular',
'BonusMalus',
'log_scaled_numeric__Density']

Just by looking at the output, it is clear which transformer generated a particular feature (binned_numeric, onehot_categorical or log_scaled_numeric), what the original feature name is (DrivAge, VehBrand, VehPower, VehGas, Density) and what information is encoded in that feature (e.g. the interval [18, 26), brand B1, gas Diesel, etc.). Passthrough features such as BonusMalus simply keep their original name.
