ValueError: Unable to set entity for token 27 which is included in more than one span in entities


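Problem description

I am trying to convert a dataset to .spacy by first converting each text to a doc and then adding it to a DocBin. The whole dataset file is accessible via GoogleDocs.

I run the following function: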

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data, outputFile):
    nlp = spacy.blank("en") # load a new spacy model
    doc_bin = DocBin() # create a DocBin object

    for text, annot in tqdm(data): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text    
        ents = []
        
        for start, end, label in annot["entities"]: # add character indexes
            # supported modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # to avoid having the traceback; 
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents # label the text with the ents
        doc_bin.add(doc)
        
    doc_bin.to_disk(f"./{outputFile}.spacy") # save the docbin object
    return f"Processed {len(doc_bin)}"

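After running the function on the dataset, I get this traceback: ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.

After looking closely through the dataset file for the text that raises this traceback, I found the following: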

[('HereLongText..(abstract)',
  {'entities': [('0', '27', 'SpecificDisease'),
    ('80', '93', 'SpecificDisease'),
    ('260', '278', 'SpecificDisease'),
    ('615', '628', 'SpecificDisease'),
    ('673', '691', 'SpecificDisease'),
    ('754', '772', 'SpecificDisease')]})]

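I do not know how to solve this problem.

Recommended answer

I think this should make your problem clear. Below is a slightly modified version of your code that raises the same error.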

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data, outputFile):
    nlp = spacy.blank("en")  # load a new spacy model
    doc_bin = DocBin()  # create a DocBin object

    for text, annot in tqdm(data):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []

        for start, end, label in annot["entities"]:  # add character indexes
            # supported modes: strict, contract, expand

            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # to avoid having the traceback;
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        doc_bin.add(doc)

    doc_bin.to_disk(f"./{outputFile}.spacy")  # save the docbin object
    return f"Processed {len(doc_bin)}"


data = [("I like cheese", 
    {"entities": [
        (0, 1, "Sample"),
        (0, 1, "Sample"), # Same thing twice
        ]})]

converter(data, "out.txt")

Note that in this example there are two annotations for exactly the same span. If you remove one of them, the error no longer occurs.
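To check whether exact duplicates are the cause, the annotations can be deduplicated before any spans are built. This is a minimal sketch in plain Python, assuming each annotation is a (start, end, label) tuple as in the data above. dedupe_entities is a hypothetical helper name, not part of spaCy.

```python
def dedupe_entities(entities):
    """Return the annotations with exact duplicates removed, order preserved."""
    seen = set()
    unique = []
    for start, end, label in entities:
        # int() also accepts offsets that were stored as strings
        key = (int(start), int(end), label)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Same annotations as in the example above
entities = [(0, 1, "Sample"), (0, 1, "Sample")]
print(dedupe_entities(entities))
```

The cleaned list could then be passed to converter in place of the original annot["entities"].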

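You are probably getting the error because your annotations overlap. doc.ents does not accept overlapping spans, so such annotations cannot be used as they are.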
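Overlaps that are not exact duplicates can be located on the character offsets before any spans are created. The sketch below is plain Python and rests on assumptions: annotations are (start, end, label) tuples with end-exclusive offsets, find_overlaps and drop_overlaps are hypothetical helper names, and the sample annotations are made up for illustration. drop_overlaps keeps the longest annotation when two collide, which is only one possible policy. spaCy's spacy.util.filter_spans applies a comparable longest-first filter to Span objects.

```python
def find_overlaps(entities):
    """Return pairs of annotations whose character ranges intersect."""
    ordered = sorted((int(s), int(e), label) for s, e, label in entities)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so nothing later can overlap `first`
            overlaps.append((first, second))
    return overlaps


def drop_overlaps(entities):
    """Keep the longest annotations and skip any that collide with a kept one."""
    by_length = sorted(
        ((int(s), int(e), label) for s, e, label in entities),
        key=lambda ent: (ent[1] - ent[0], -ent[0]),  # longest first, then earliest
        reverse=True,
    )
    kept = []
    for start, end, label in by_length:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Made-up annotations: the second one overlaps the first
entities = [
    (0, 27, "SpecificDisease"),
    (10, 27, "SpecificDisease"),
    (80, 93, "SpecificDisease"),
]
print(find_overlaps(entities))
print(drop_overlaps(entities))
```

Running find_overlaps over each item of the dataset shows which texts need their annotations corrected. Dropping overlaps automatically discards labels, so fixing the source annotations by hand may be the better choice when the colliding spans carry different labels.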
