为什么在切换要比较的字符串顺序时,Difference

Why does the Difference function give different results when switching order of strings to compare?(为什么在切换要比较的字符串顺序时,Difference 函数会给出不同的结果?)
本文介绍了为什么在切换要比较的字符串顺序时,Difference 函数会给出不同的结果?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 SQL Server 中,如果我执行以下操作:

In SQL Server, if I do the following:

Difference ('Kennady', 'Kary') : I get 2

如果我这样做:

Difference ('Kary', 'Kennady') : I get 3.

我认为差异函数会查看底层的 Soundex 值,并给出一个 0-4 的数字,表示有多少字符是相同的.

I thought the Difference function looks at the Soundex values under the hood, and gives a 0-4 number of how many characters in place are the same.

SELECT SOUNDEX('Kennady') AS [SoundEx Kennady]
    , SOUNDEX('Kary') AS [SoundEx Kary]
    , DIFFERENCE ('Kennady', 'Kary') AS [Difference Kennady vs Kary]
    , DIFFERENCE ('Kary', 'Kennady') AS [Difference Kary vs Kennady];

推荐答案

这是严格的观察.文档 非常清楚:

This is strictly observational. The documentation is pretty clear:

返回的整数是 SOUNDEX 值中的字符数那是一样的.返回值范围从 0 到 4:0表示弱相似或无相似,4 表示强相似或相同的值.

The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.

根据本文档,返回值不应因参数的顺序而异.

According to this documentation, the return value should not differ based on the order of the arguments.

来自我的查询:Kennady"--> K530 和Kary"--> K600.它们有两个共同的字符,所以值应该是 2.

From my queries: "Kennady" --> K530 and "Kary" --> K600. These have two characters in common, so the value should be 2.

现在,我注意到Kenn"--> K500.将Kennady"截断为Kary"的长度会得到值3".嗯.

Now, I notice that "Kenn" --> K500. Truncating "Kennady" to the length of "Kary" results in the value "3". Hmmm.

因此,我认为 DIFFERENCE() 是使用第一个参数的长度来截断第二个参数.这使得参数的顺序很重要.先把较长的论点放在首位.

Hence, I think that DIFFERENCE() is using the length of the first argument to truncate the second argument. That makes the order of the arguments important. Put the longer argument first.

我在其他一些字符串上试过这个.相同的模式似乎有效.我还没有找到任何说明这种情况的文件.

I tried this out on some other strings. The same patterns seems to work. I have not found any documentation that specifies that this is the case.

我想微软会称其为功能"而不是错误";)

I suppose Microsoft would call this a "feature" and not a "bug" ;).

以上推测并不完全正确.考虑以下

The above speculation is not quite correct. Consider the following

  • leepaupauld --> L114
  • 利奥波德 --> L143
  • leepaup --> L110

然而,

  • difference(leepaupauld, leopold) = 4 (!)
  • 差异(利奥波德,利波保德)= 3
  • difference(leepaup, leopold) = 3 (!)
  • 差异(利奥波德,利帕普)= 2

考虑到字符串的 soundex 值,(!) 是我的判断,即结果根本没有意义.

The (!) is my judgement that the result makes no sense at all, given the soundex values for the strings.

所以,问题不在于长度.这是@jpw 在评论中指向的底层方法.问题似乎是一个字符串中的重复匹配值.但是,根据文档,这些不应该多次匹配同一个字符.

So, the issue isn't the length. It is the underlying method, which @jpw points to in the comment. The problem appears to be duplicate matching values in one string. However, according to the documentation, these should not match the same character multiple times.

我的建议:使用 Levenshtein 距离.这说得通.它在更长的字符串上效果更好.这是理智的.它不是内置的,但很容易在网络上找到任何数据库的实现.

My advice: Use Levenshtein distance. It makes sense. It works better on longer strings. It is sane. It is not built in, but it is easy enough to find an implementation on the web for any database.

这篇关于为什么在切换要比较的字符串顺序时,Difference 函数会给出不同的结果?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!

本站部分内容来源互联网,如果有图片或者内容侵犯您的权益请联系我们删除!

相关文档推荐

Execute complex raw SQL query in EF6(在EF6中执行复杂的原始SQL查询)
SSIS: Model design issue causing duplications - can two fact tables be connected?(SSIS:模型设计问题导致重复-两个事实表可以连接吗?)
SQL Server Graph Database - shortest path using multiple edge types(SQL Server图形数据库-使用多种边类型的最短路径)
Invalid column name when using EF Core filtered includes(使用EF核心过滤包括时无效的列名)
How should make faster SQL Server filtering procedure with many parameters(如何让多参数的SQL Server过滤程序更快)
How can I generate an entity–relationship (ER) diagram of a database using Microsoft SQL Server Management Studio?(如何使用Microsoft SQL Server Management Studio生成数据库的实体关系(ER)图?)