Levenshtein distance

来自牛奶河Wiki
阿奔讨论 | 贡献2024年6月5日 (三) 17:03的版本 →‎应用场景
跳到导航 跳到搜索

MED(Minimum Edit Distance,编辑距离),由俄罗斯数学家弗拉基米尔·莱文斯坦(Vladimir Levenshtein)在 1965 年提出,也因此而得名 Levenshtein Distance(莱文斯坦距离)。

在信息论、语言学和计算机科学领域,Levenshtein Distance 是用来度量两个序列相似程度的指标。通俗地来讲,编辑距离指的是在两个单词之间,由其中一个单词转换为另一个单词所需要的最少单字符编辑操作次数(插入(Insertion)、删除(Deletion)、替换(Substitution))。

如:ld[hello, hell] = 1(插入"o"), ld[sit, seat] = 2

应用场景

  • 拼写检查

采用数据对齐的方式解决数据不完全一致问题,选取匹配度比较高(LD值比较小)的数据。如解决地址信息填写不规范问题

  • 数据匹配

词典应用的拼写提示。例如输入了 throwab,就能提示出 throwable。遍历 t 开头的单词库,寻找匹配度比较高(LD值比较小)的单词进行提示

  • 匹配侦测

例如可以把匹配度高于某一个阈值的代码、文章初筛出来以便于进一步确认是否抄袭

Time Complexity

The Levenshtein distance algorithm implemented with dynamic programming has a time complexity of O(mn), where:

  • m is the length of the first string.
  • n is the length of the second string.

This means the running time of the algorithm grows proportionally to the product of the string lengths.

Here's why:

1. The algorithm uses a 2D DP table with dimensions (m+1) x (n+1). 2. It iterates through each cell of this table, performing constant-time operations like comparisons and minimum value calculations. 3. The total number of cells to be filled is (m+1) * (n+1), which translates to O(mn) complexity.

Therefore, the time it takes to calculate the Levenshtein distance between two strings increases linearly with the combined length of both strings.

Code

ld(x, y) = min -> ld(x-1, y)+1 OR ld(x, y-1)+1 OR ld(x-1, y-1)+[0|1]
[0|1]: if x[-1] = y[-1], then 0 else 1

Python

// Levenshtein Distance, recursion
// Less efficient because there are multiple double counts
def ld(str1, str2):
    m = len(str1)
    n = len(str2)

    if m == 0: return n
    if n == 0: return m
    if str1 == str2: return 0
    d = 0 if str1[-1] == str2[-1] else 1

    return min(ld(str1, str2[:-1]) + 1, ld(str1[:-1], str2) + 1, ld(str1[:-1], str2[:-1]) + d)

Java

// Levenshtein Distance, m * n array
// The space used is larger
public static int levenshtein2 (String str1, String str2) {
    if (isnull(str1)) return str2.length();
    if (isnull(str2)) return str1.length();
    if (str1.equals(str2)) return 0;

    int m, n, d, i, j;
    m = str1.length();
    n = str2.length();
    int[][] dp = new int[m+1][n+1];
    for (i = 0; i<=m; i++) dp[i][0] = i;
    for (j = 1; j<=n; j++) dp[0][j] = j;

    for (i = 1; i<=m; i++) {
        for (j = 1; j<=n; j++) {
            d = (str1.charAt(i-1) == str2.charAt(j-1)) ? 0 : 1;
            dp[i][j] = min(dp[i-1][j-1]+d, min(dp[i][j-1]+1, dp[i-1][j]+1));
        }
    }

    return dp[m][n];
}
// Levenshtein Distance, min(m, n) * 2 array
public static int levenshtein (String str1, String str2) {
    if (isnull(str1)) return str2.length();
    if (isnull(str2)) return str1.length();
    if (str1.equals(str2)) return 0;

    // less space used - <<
    if (str1.length() > str2.length()) return levenshtein(str2, str1);
    // less space used - >>

    int m, n, d, i, j;
    m = str1.length();
    n = str2.length();
    int[][] dp = new int[m+1][2];
    for (i = 0; i<=m; i++) dp[i][0] = i;
    dp[1][1] = 1;

    for (j = 1; j<=n; j++) {
        for (i = 1; i<=m; i++) {
            d = (str1.charAt(i-1) == str2.charAt(j-1)) ? 0 : 1;
            dp[i][1] = min(dp[i-1][0]+d, min(dp[i][0]+1, dp[i-1][1]+1));
        }
        for (i = 0; i<=m; i++) {
            dp[i][0] = dp[i][1];
        }
        dp[0][1] = dp[0][0] + 1;
    }

    return dp[m][1];
}