Python: Split unicode string on word boundaries

Question

I need to take a string and shorten it to 140 characters.

Currently I'm doing:

import re

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)  # normalize whitespace
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)  # characters left for the text itself
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break  # the next word no longer fits
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

This works great for English and English-like strings, but it fails for a Chinese string because tweet.split() just returns a list with a single element:

>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']
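The single-element result follows from how split() works: with no arguments it breaks only on runs of whitespace, and the Chinese text contains none. A minimal sketch of that behaviour, written for Python 3 rather than the 2.5.4 used in the question; the variable names `english` and `chinese` are illustrative, and `chinese` is just the opening of the string above:

```python
# Illustrative sample strings (names are assumptions, not from the post).
english = u"So this works great for English"
chinese = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機"

# split() with no arguments separates only on whitespace.
english_words = english.split()
chinese_words = chinese.split()

print(len(english_words))  # 6
print(len(chinese_words))  # 1

# The Chinese text has no whitespace at all, so there is nothing to split on.
has_space = any(ch.isspace() for ch in chinese)
print(has_space)  # False
```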

How should I do this so that it handles I18N? Does this even make sense in all languages?

I'm on Python 2.5.4, if that matters.

Recommended answer

After speaking with some native Cantonese, Mandarin, and Japanese speakers, it seems that doing this correctly is hard, but my current algorithm still makes sense to them in the context of internet posts.

That is, they are used to the "split on space and add … at the end" treatment.

So I'm going to be lazy and stick with it until I get complaints from people who don't understand it.

The only changes to my original implementation would be to not force a space after the last word, since it is unneeded in any language, and to use the Unicode ellipsis character … (U+2026, &#x2026;) instead of three dots (...), which saves 2 characters.
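As a rough illustration of those two changes, here is one way the revised routine might look. This is only a sketch under stated assumptions, not the author's actual code: it targets Python 3, the function name `shorten` and its `limit` parameter are made up for the example, and `url` is assumed to have been shortened already (standing in for the `utils.shorten_urls(post['url'])` call in the question).

```python
import re

ELLIPSIS = u"\u2026"  # one-character ellipsis, instead of three dots


def shorten(tweet, url, limit=140):
    """Hypothetical helper: trim tweet on spaces so it fits within limit.

    `url` is assumed to be already shortened by the caller.
    """
    tweet = re.sub(r"\s+", " ", tweet).strip()  # normalize whitespace
    if len(tweet) <= limit:
        return tweet

    footer = ELLIPSIS + " " + url
    avail = limit - len(footer)  # room left for the text itself

    kept = []
    used = 0
    for word in tweet.split(" "):
        # A separating space is charged only between words,
        # so the last kept word carries no trailing space.
        cost = len(word) + (1 if kept else 0)
        if used + cost > avail:
            break
        kept.append(word)
        used += cost

    return " ".join(kept) + footer
```

Note that the behaviour for text without spaces is unchanged: if such a string is longer than the available room, no word fits and only the footer is returned. That is the limitation the answer chooses to live with.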
