Regex for accepting only Persian characters


Problem description

    I'm working on a form where one of its custom validators should only accept Persian characters. I used the following code:

    var myregex = new Regex(@"^[\u0600-\u06FF]+$");
    if (myregex.IsMatch(mytextBox.Text))
    {
        args.IsValid = true;
    }
    else
    {
        args.IsValid = false;
    }

    However, it seems that it can only detect Arabic characters, as it doesn't cover all Persian characters (it lacks these four: گ,چ,پ,ژ ).

    Is there a way to solve this problem?

    Solution

    TL;DR

    The character sets that MUST be used for Farsi are as follows:

    • Use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ for letters or, depending on your regex flavor, use codepoints (not all engines support the \uXXXX notation):

      ^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$

    • Use ^[۰۱۲۳۴۵۶۷۸۹]+$ for numbers or, depending on your regex flavor:

      ^[\u06F0-\u06F9]+$


    • Use [ ‬ٌ ‬ًّ ‬َ ‬ِ ‬ُ ‬ْ ‬] for vowels or regarding your regex flavor:

      [\u202C\u064B\u064C\u064E-\u0652]


    You can also combine these sets. You may additionally want to add other Arabic letters, such as Hamza (ء), to your character set.
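As a minimal sketch of how the letter set above could replace the range in the question's validator, assuming .NET's System.Text.RegularExpressions (which accepts \uXXXX escapes in patterns). The class and method names (PersianValidator, IsPersianLetters) are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class PersianValidator
{
    // The 32 Persian letters from the TL;DR, written with \u escapes.
    // Note: in .NET, $ also matches just before a trailing newline;
    // \z is the stricter end-of-string anchor if that matters for your input.
    private static readonly Regex PersianLetters = new Regex(
        @"^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698" +
        @"\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$");

    // Returns true only when every character is one of the 32 letters.
    public static bool IsPersianLetters(string input)
    {
        return input != null && PersianLetters.IsMatch(input);
    }
}
```

In the question's validator this would reduce to args.IsValid = PersianValidator.IsPersianLetters(mytextBox.Text);. Spaces are not part of the set, so multi-word input would need the class extended.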

    Why are [\u0600-\u06FF] and [آ-ی] both wrong?

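    Although \u0600-\u06FF includes:

    • گ with codepoint 06AF
    • چ with codepoint 0686
    • پ with codepoint 067E
    • ژ with codepoint 0698

    both [\u0600-\u06FF] and [آ-ی] are simply WRONG for matching Persian letters only.

    \u0600-\u06FF contains 209 more characters than you need, and it includes numbers too!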

    Whole story

    This answer exists to fix a common misconception. Codepoints 0600 through 06FF do not denote the Persian (Farsi) alphabet, and neither does [آ-ی]:

    [u0600-u0605 ؐ-ؚu061Cـ ۖ-u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏
    ۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ
    ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ
    ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ
    ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ u061D]
    

    255 characters fall under the Arabic block (0600–06FF). The Farsi alphabet has 32 letters; adding the Farsi digits brings the total to 42. If we add vowels (originally Arabic vowels, rarely used in Farsi) without Tanvin (ً, ٍ, ٌ) and Tashdid (ّ), both of which belong to Arabic diacritics rather than Farsi, we end up with 46 characters. This means \u0600-\u06FF contains 209 more characters than you need!

    ۷ (codepoint 06F7) is the Farsi representation of the number 7, and ٧ (codepoint 0667) is the Arabic representation of the same number. Likewise, ۶ is the Farsi representation of the number 6 and ٦ is its Arabic representation. All of them reside in the 0600 through 06FF range.
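The digit distinction can be checked directly. Below is a small sketch, again assuming .NET's regex engine, with illustrative names (DigitCheck, IsPersianDigits, IsInArabicBlock); the strings use \u escapes so the two look-alike digits cannot be confused in an editor:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for comparing the narrow digit class with the whole block.
public static class DigitCheck
{
    // Persian (Extended Arabic-Indic) digits only: U+06F0 through U+06F9.
    private static readonly Regex PersianDigits = new Regex(@"^[\u06F0-\u06F9]+$");

    // The whole Arabic block, as used in the question.
    private static readonly Regex ArabicBlock = new Regex(@"^[\u0600-\u06FF]+$");

    public static bool IsPersianDigits(string s)
    {
        return s != null && PersianDigits.IsMatch(s);
    }

    public static bool IsInArabicBlock(string s)
    {
        return s != null && ArabicBlock.IsMatch(s);
    }
}
```

With these, DigitCheck.IsPersianDigits("\u06F7") is true and DigitCheck.IsPersianDigits("\u0667") is false, while DigitCheck.IsInArabicBlock accepts both. Note also that \d in .NET matches any Unicode decimal digit unless RegexOptions.ECMAScript is set, so it does not separate the two digit sets either.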

    The shapes of the Persian digits four (۴), five (۵), and six (۶) are different from the shapes used in Arabic and the other numbers have different codepoints.

    You can also see a number of other characters that do not exist in Farsi / Persian, and nobody wants to accept them when validating a first name or surname.
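For name validation it can help to report which characters were rejected. The following is a sketch under the same .NET assumption, using a negated form of the 32-letter class; NameCheck and FindOffendingCodePoints are hypothetical names. It works per UTF-16 code unit, which is enough for the Arabic block:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that lists characters outside the 32-letter set.
public static class NameCheck
{
    // Negated version of the letter class from the TL;DR.
    private static readonly Regex NonPersianLetter = new Regex(
        @"[^\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698" +
        @"\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]");

    // Returns each rejected character as a "U+XXXX" string, in input order.
    public static List<string> FindOffendingCodePoints(string input)
    {
        var result = new List<string>();
        foreach (Match m in NonPersianLetter.Matches(input))
        {
            result.Add("U+" + ((int)m.Value[0]).ToString("X4"));
        }
        return result;
    }
}
```

For example, a name typed with the Arabic yeh, "\u0639\u0644\u064A", is reported as containing U+064A, whereas the same name with the Persian yeh, "\u0639\u0644\u06CC", yields an empty list.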

    [آ-ی] also includes 117 characters, which is far more than validation needs. You can see them all using Unicode CLDR.
