ColumnTransformer fails with CountVectorizer/HashingVectorizer in a pipeline (multiple text features)


Problem description

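Similar to this question (ColumnTransformer fails with CountVectorizer in a pipeline), I want to apply CountVectorizer/HashingVectorizer to text columns using a ColumnTransformer inside a pipeline. However, I have several text features rather than just one. If I pass a single feature as a string (not as a list, as suggested in the solution to the other question), it works fine. How do I pass multiple text features?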

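from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# X_train (a DataFrame containing the columns below) and y_train are defined elsewhere.
numeric_features = ['x0', 'x1', 'y0', 'y1']
categorical_features = []
text_features = ['text_feature', 'another_text_feature']

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('encoder', OneHotEncoder())])
text_transformer = Pipeline(steps=[('hashing', HashingVectorizer())])

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features),
    # Problem: a list of two text columns is handed to a single vectorizer.
    ('text', text_transformer, text_features)
])

steps = [('preprocessor', preprocessor),
         ('clf', SGDClassifier())]

pipeline = Pipeline(steps=steps)

pipeline.fit(X_train, y_train)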

Recommended answer

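Just use a separate transformer for each text feature, and give each one its column as a single string rather than a list, so that the vectorizer receives a one-dimensional sequence of documents.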

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features), 
    ('categorical', categorical_transformer, categorical_features),
    ('text', text_transformer, 'text_feature'),
    ('more_text', text_transformer, 'another_text_feature'),
])

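(Transformers are cloned during fitting, so you end up with two separate copies of text_transformer and everything works. If you are uneasy about specifying the same transformer twice like this, you can always copy/clone it manually before defining the ColumnTransformer.)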
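
As a rough end-to-end illustration of the answer, here is a minimal sketch on toy data. The DataFrame contents, the reduced column set, the `hash_<column>` step names, and `n_features=2**8` are all assumptions made for the example and do not come from the original question. It first shows what seems to go wrong when a two-column selection reaches a single vectorizer, then builds one ColumnTransformer entry per text column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented purely for illustration.
X_train = pd.DataFrame({
    "x0": [0.1, 0.4, 0.9, 0.3],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "text_feature": ["red apple", "green pear", "red cherry", "green lime"],
    "another_text_feature": ["sweet", "sour", "sweet and tart", "very sour"],
})
y_train = [1, 0, 1, 0]

numeric_features = ["x0", "x1"]
text_features = ["text_feature", "another_text_feature"]

# A 2-D selection makes the vectorizer iterate over the column labels,
# so it yields one row per column instead of one row per sample.
wrong = HashingVectorizer(n_features=2**8).transform(X_train[text_features])
wrong_shape = wrong.shape

# One (name, transformer, column) entry per text column. The column is a
# plain string, so each vectorizer receives a 1-D Series of documents.
text_transformers = [
    (f"hash_{col}", HashingVectorizer(n_features=2**8), col)
    for col in text_features
]

preprocessor = ColumnTransformer(transformers=[
    ("numeric", StandardScaler(), numeric_features),
    *text_transformers,
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", SGDClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# 2 scaled numeric columns + 256 hashed columns per text feature.
features = pipeline.named_steps["preprocessor"].transform(X_train)
predictions = pipeline.predict(X_train)
```

Building the entries in a list comprehension creates a fresh vectorizer per column, which sidesteps the question of reusing one instance, and it scales to any number of text columns without repeating lines by hand.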
