When adding spaCy output to an existing DataFrame, columns do not align


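Problem description

I have a CSV containing a column of article titles, and I am using spaCy to extract any person names that appear in the titles. When I try to add a new column to the CSV with the names spaCy extracted, they do not align with the rows they were extracted from.

I believe this is because the spaCy results have their own index, independent of the index of the original data.

I tried adding , index=df.index) to the new-column line, but got "ValueError: Length of passed values is 2, index implies 10."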
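Both symptoms can be reproduced without spaCy. The following is a minimal sketch, assuming made-up titles and two plain strings in place of the real CSV and the spaCy entities: a Series built from a short list gets its own index starting at 0, and pandas aligns on index labels when assigning a column.

```python
import pandas as pd

# Made-up stand-ins for the article titles and the two names that were found.
df = pd.DataFrame({"article_title": [f"title {i}" for i in range(5)]})
people = ["Hannah Ward", "Dylan Mulvaney"]

# A Series built from a plain list gets its own fresh index: 0, 1.
names = pd.Series(people)
print(list(names.index))  # [0, 1]

# Column assignment aligns on index labels, so the two names land in
# rows 0 and 1 and every other row becomes NaN.
df["artist_names"] = names
print(df)

# Forcing the DataFrame's index onto the shorter list fails outright,
# because two values cannot fill five index labels.
error_message = None
try:
    pd.Series(people, index=df.index)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The exact wording of the ValueError may differ between pandas versions, but the cause is the same: the flat list has fewer elements than the DataFrame has rows.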

How can I align the spaCy output with the rows it came from?

Here is my code:

import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
                  usecols=['article_title']))
article = [_ for _ in df['article_title']]

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)

import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())

This is the resulting DataFrame:

                                       article_title       artist_names
0  "They’re like, is that? Oh it’s!" – ...               (Hannah, Ward)
1  Billed as London’s biggest public festival of ...  (Dylan, Mulvaney)
2  Transport yourself back to the dusky skies and...                NaN
3  Turning to art at the beginning of quarantine ...                NaN
4  Dylan Mulvaney, head of design at Gretel, expl...                NaN

This is what I expected:

                                       article_title       artist_names
0  "They’re like, is that? Oh it’s!" – ...               (Hannah, Ward)
1  Billed as London’s biggest public festival of ...                NaN
2  Transport yourself back to the dusky skies and...                NaN
3  Turning to art at the beginning of quarantine ...                NaN
4  Dylan Mulvaney, head of design at Gretel, expl...   (Dylan, Mulvaney)

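As you can see, the (Dylan, Mulvaney) value in the artist_names column belongs to the 5th article title, but it ends up in the second row. How can I make them align?

Thank you for your help.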

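Recommended answer

I would loop over the articles, detect the entities in each article separately, and put the detected entities in a list with one element per article: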

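import spacy

nlp = spacy.load('en_core_web_lg')
article = df['article_title'].tolist()

# One list of PERSON entities per title, in the same order as the rows of df.
entities_by_article = []
for doc in nlp.pipe(article):
    people = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.append(ent)
    entities_by_article.append(people)

# Reuse df's index so each list lands on the row it was extracted from.
df['artist_names'] = pd.Series(entities_by_article, index=df.index)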
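Because the list now has exactly one element per row, it lines up with the DataFrame by position. One caveat is worth sketching, assuming a DataFrame whose index is no longer 0..n-1 (for example after filtering rows): a Series built from a plain list still gets a fresh 0..n-1 index, so label alignment can misplace values again. The titles, index labels, and name strings below are hypothetical stand-ins for the real data and spaCy results.

```python
import pandas as pd

# Hypothetical data: three titles left after filtering, index labels 2, 5, 7.
df = pd.DataFrame(
    {"article_title": ["title a", "title b", "title c"]},
    index=[2, 5, 7],
)
# Stand-in for the per-article spaCy results: one list per row, in row order.
entities_by_article = [["Hannah Ward"], [], ["Dylan Mulvaney"]]

# Fresh 0..2 index: only label 2 overlaps with df, so values are misplaced or lost.
df["misaligned"] = pd.Series(entities_by_article)

# Reusing df's index places each list on the row it came from.
df["artist_names"] = pd.Series(entities_by_article, index=df.index)

print(df)
```

With a default index, as in the question where the CSV is read directly, both forms give the same result; passing index=df.index simply keeps the assignment positional regardless of how the index looks.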

Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping over a list of texts, and could be replaced by:

for a in article:
  doc = nlp(a)
  ## rest of code within loop
