scikit-learn's ColumnTransformer, I run into a “trivial” issue. After transforming the features, they do not have names in the new
On stackoverflow.com, it seems that a few other people have faced this issue and struggled with it. This post describes my approach to solving it.
The full code is available in my GitHub repo.
This code builds on earlier work from this post by Johannes Haupt, with some changes that better fit my use case. The main improvement is in handling transformers that create bins (e.g:
For most transformers in ColumnTransformer, the scikit-learn library offers get_feature_names(). Sadly, in the case of KBinsDiscretizer, this method does not exist.
KBinsDiscretizer: this transformer was not covered by the original post from Johannes. In this case, the current solution generates one feature name per bin, following this pattern:
feature_name : + [
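As a rough illustration of how per-bin names can be derived, here is a minimal sketch that reads the fitted `bin_edges_` attribute of KBinsDiscretizer and emits one label per bin. The helper name `bin_feature_names`, the toy data, and the `uniform` strategy are assumptions made for this example; this is not the code from the repo.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer


def bin_feature_names(discretizer, input_features):
    """Hypothetical helper: build one 'feature: [lower, upper)' label per bin."""
    names = []
    # bin_edges_ holds one array of edges per input feature.
    for feature, edges in zip(input_features, discretizer.bin_edges_):
        # Consecutive edges delimit a bin: (e0, e1), (e1, e2), ...
        for lower, upper in zip(edges[:-1], edges[1:]):
            names.append(f"{feature}: [{float(lower)}, {float(upper)})")
    return names


# Toy data (made up): a single column of driver ages.
ages = np.array([[18.0], [22.0], [26.0], [30.0], [40.0], [65.0], [99.0]])
disc = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform").fit(ages)

names = bin_feature_names(disc, ["DrivAge"])
print(names)
```

Because the labels come from the fitted edges, they stay correct even when the discretizer ends up with fewer bins than requested.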
OneHotEncoder: this transformer was covered by the aforementioned post, but there was room for one improvement. For example, if OneHotEncoder were applied to the feature
sex which has three values (
other), the result would look something like this:
The current solution keeps track of the feature name (instead of replacing it with x0) as well as the value that is encoded for the feature.
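One way to keep both the original feature name and the encoded value is to read the fitted `categories_` attribute of OneHotEncoder. The sketch below assumes a hypothetical helper `onehot_feature_names`, illustrative category values, and a `feature: value` label format; the format used in the repo may differ.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def onehot_feature_names(encoder, input_features):
    """Hypothetical helper: build one 'feature: value' label per encoded category.

    Assumes the encoder was created without the `drop` option, so every
    category in categories_ corresponds to one output column.
    """
    names = []
    # categories_ holds one array of category values per input feature.
    for feature, categories in zip(input_features, encoder.categories_):
        names.extend(f"{feature}: {category}" for category in categories)
    return names


# Toy data (made up): one categorical column with three distinct values.
X = np.array([["female"], ["male"], ["other"], ["male"]])
enc = OneHotEncoder().fit(X)

names = onehot_feature_names(enc, ["sex"])
print(names)
```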
This example is taken from the scikit-learn example pages. It covers preprocessing a dataset and training a model that is relevant to the insurance industry. Here is a short extract:
The first transformer discretizes the scalar value of DrivAge (driver age) into 10 bins. The second one-hot encodes the values of the features
VehGas. The last two transformers do not really change the structure of the original features.
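For readers who want something runnable, the following sketch builds a ColumnTransformer along the lines described above. It is not the original extract: the synthetic DataFrame, the columns `VehBrand` and `BonusMalus`, and the transformer names `onehot_categorical` and `passthrough_numeric` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

# Synthetic stand-in for the insurance data; all values are made up.
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame(
    {
        "DrivAge": rng.uniform(18, 99, size=n_rows),
        "VehBrand": rng.choice(["B1", "B2", "B3"], size=n_rows),
        "VehGas": rng.choice(["Diesel", "Regular"], size=n_rows),
        "BonusMalus": rng.integers(50, 150, size=n_rows),
        "Density": rng.uniform(1, 20000, size=n_rows),
    }
)

# Log-transform, then standardize; the column count stays the same.
log_scale_transformer = make_pipeline(
    FunctionTransformer(np.log, validate=False), StandardScaler()
)

column_trans = ColumnTransformer(
    [
        # 1 column in, 10 one-hot bin columns out.
        ("binned_numeric", KBinsDiscretizer(n_bins=10), ["DrivAge"]),
        # One output column per distinct category value.
        ("onehot_categorical", OneHotEncoder(), ["VehBrand", "VehGas"]),
        # These two keep a single column each.
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",
)

transformed = column_trans.fit_transform(df)
print(transformed.shape)
```

The transformed matrix has more columns than the input DataFrame and carries no column labels, which is exactly why a naming helper is needed.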
feature_names = get_feature_names(column_trans)
['binned_numeric__DrivAge: [18.0, 26.0)',
'binned_numeric__DrivAge: [26.0, 30.0)',
'binned_numeric__DrivAge: [30.0, 34.0)',
'binned_numeric__DrivAge: [65.0, 99.0)',
Just by looking at the output, it is clear which transformer generated a particular feature (e.g. log_scaled_numeric), what the original feature name is (e.g. Density), and what information is encoded in that feature (e.g. segment [18, 26), brand B1, gas Diesel, etc.).