Python: Split unicode string on word boundaries

2022-07-14 · Python questions · 得得之家

Question

I need to take a string and shorten it to 140 characters.

Currently I'm doing:

import re

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)  # normalize whitespace
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

This works great for English and English-like strings, but fails for a Chinese string, because the text contains no whitespace and tweet.split() returns a list with a single element:

>>> s = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域，預計約30分鐘後抵達浦東國際機場，開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']

How should I do this so that it handles I18N? Does this even make sense in all languages?

I'm on Python 2.5.4, if that matters.

Recommended answer

After speaking with some native Cantonese, Mandarin, and Japanese speakers, it seems that doing the correct thing is hard, but my current algorithm still makes sense to them in the context of internet posts.

Meaning, they are used to the "split on space and add … at the end" treatment.

So I'm going to be lazy and stick with it until I get complaints from people who don't understand it.

The only change to my original implementation would be to not force a space after the last word, since it is unneeded in any language, and to use the single Unicode character … (U+2026) instead of three dots (...) to save 2 characters.
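The two changes described above could look roughly like the following. This is only a sketch, not the answerer's actual code: it is written for Python 3 rather than the Python 2.5.4 mentioned in the question, the function name `shorten` and its `limit` parameter are made up for illustration, and `url` is assumed to have already been shortened (standing in for the `utils.shorten_urls(post['url'])` call in the question).

```python
import re

ELLIPSIS = "\u2026"  # one character, 2 shorter than the three dots "..."


def shorten(tweet, url, limit=140):
    """Trim tweet on whitespace so that text + ellipsis + url fits in limit.

    Hypothetical helper: `url` is assumed to be already shortened.
    """
    if len(tweet) <= limit:
        return tweet
    tweet = re.sub(r"\s+", " ", tweet).strip()  # normalize whitespace
    footer = ELLIPSIS + " " + url
    avail = limit - len(footer)
    kept = []
    used = 0
    for word in tweet.split():
        # A separating space is charged only between words,
        # so the last kept word carries no trailing space.
        cost = len(word) + (1 if kept else 0)
        if used + cost > avail:
            break
        kept.append(word)
        used += cost
    return " ".join(kept) + footer
```

As in the original loop, a long string that contains no whitespace at all (such as the Chinese example in the question) is treated as one oversized word, so the result is reduced to the footer alone. The sketch also does not guard against a URL that is itself longer than the limit.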
