Get the count of matching words in a pandas column using a predefined list

Problem description

I have a DataFrame that contains index and text columns.

For example:

index | text
1     | "I have a pen, but I lost it today."
2     | "I have pineapple and pen, but I lost it today."

Now I have a long list, and I want to match each word in text against that list.

Suppose:

long_list = ['pen', 'pineapple']

I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, if there are matches, returns the count.

index | text                                             | count
1     | "I have a pen, but I lost it today."             | 1
2     | "I have pineapple and pen, but I lost it today." | 2

This is what I tried:

def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1
            
    df['count'] = count   
    return df

count_word = FunctionTransformer(count_words, validate=False)

An example of how I develop my other FunctionTransformers:

def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df

convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)

Recommended answer

Inspired by @Quang Hoang's answer:

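import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Example data from the question
df = pd.DataFrame(
    {'text': ["I have a pen, but I lost it today.",
              "I have pineapple and pen, but I lost it today."]},
    index=[1, 2],
)

y = ['pen', 'pineapple']

def count_strings(X, y):
    # Join the words into one regex alternation, e.g. 'pen|pineapple'
    pattern = '|'.join(y)
    # Count the non-overlapping regex matches in each row of the text column
    return X['text'].str.count(pattern)

string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
df['count'] = string_transformer.fit_transform(X=df)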

Result:

    text                                              count
1   "I have a pen, but I lost it today."                1
2   "I have pineapple and pen, but I lost it today."    2
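
One caveat: str.count counts regex matches, not words, so a pattern like 'pen|pineapple' would also count "pen" inside longer words such as "pencil" or "opened". If whole-word matching is what the question intends, a minimal sketch could look like the following. The helper name count_whole_words and the sample sentences are made up for illustration and are not part of the original answer.

```python
import re

import pandas as pd


def count_whole_words(texts, words):
    """Count whole-word occurrences of any entry of `words` in each string."""
    # re.escape protects against regex metacharacters in the word list;
    # \b anchors every alternative at word boundaries.
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    return texts.str.count(pattern)


# Hypothetical sample data: row 2 contains "opened" and "pencil",
# which a plain 'pen|pineapple' pattern would count as well.
texts = pd.Series(
    ["I have a pen, but I lost it today.",
     "I opened a pencil case and found a pen."],
    index=[1, 2],
)
counts = count_whole_words(texts, ['pen', 'pineapple'])
print(counts.tolist())  # [1, 1]
```

With the unanchored pattern, the second row would be counted as 3 instead of 1, so which variant is right depends on whether substrings should count.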

For the following df2:

#df2
      text
1     "I have a pen, but I lost it today. pen pen"
2     "I have pineapple and pen, but I lost it today."

we get

string_transformer.transform(X=df2)
#result
1    3
2    2
Name: text, dtype: int64

This shows that we converted the function to an sklearn-style object. To abstract this even further, we can pass the column name as a keyword argument to count_strings.
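
A possible sketch of that last step, passing the column name through kw_args, is shown here. The names count_strings_in, words, column, and generic_transformer are chosen for this illustration and do not come from the original answer.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings_in(X, words, column):
    # Same idea as count_strings, but the column to scan is a parameter.
    pattern = '|'.join(words)
    return X[column].str.count(pattern)


# Both the word list and the column name are supplied as keyword arguments.
generic_transformer = FunctionTransformer(
    count_strings_in,
    kw_args={'words': ['pen', 'pineapple'], 'column': 'text'},
)

df2 = pd.DataFrame(
    {'text': ["I have a pen, but I lost it today. pen pen",
              "I have pineapple and pen, but I lost it today."]},
    index=[1, 2],
)
result = generic_transformer.fit_transform(df2)
print(list(result))  # [3, 2]
```

The same function can then be reused for any text column by changing only the kw_args dictionary.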
