Approximate string-matching algorithms

Nowadays computers keep huge amounts of data. However, the retrieval of textual information is difficult when misspelled or not exactly known. This article deals with approximate string-matching algorithms. It gives a simplified but clear description of algorithms with examples.

First, we are going to explain the procedures constructing the phonetic codes for the searched text that sound the same, but spelled differently. Then we explain procedures that give different types of textual similarity metric used in a fault-tolerant search. In conclusion, the algorithms are compared so that the reader can choose the right one. See also resources later in this article.

Comments and questions are welcome.

Phonetic coding

The procedures discussed here are used to simplify searches in databases when one knows how the text is pronounced but not how it is spelled. For this purpose the phonetic codes are constructed for the searched text, while the database is preventively indexed using those codes. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and can be filed together. The use of phonetic codes reduces matching problems from wrong or different spelling.

Phonetic codes are very useful in order to match lists of either personal names or enterprise names. They are also used for speech recognition and in search engines of databases like the anti-terrorist ones.

Soundex

Soundex builds a phonetic code for people’s name. The resulting code is made of one letter and three digits, each digit being one of six consonant sounds. The algorithm was first applied in the U.S. census in 1880.

Procedure
  1. Take the first letter.
  2. Translate remaining characters:
    B, F, P, V → 1
    C, G, J, K, Q, S, X, Z → 2
    D, T → 3
    L → 4
    M, N → 5
    R → 6
  3. Drop adjacent letters having the same code.
  4. Drop non-initial A, E, I, O, U, Y, W and H.
  5. Take first four characters padding with zeros.
Examples
  1. ALEXANDRE → A4E2A536E → A4E2A536E → A42536 → A425.
  2. ALEKSANDER → A4E22A53E6 → A4E2A53E6 → A42536 → A425.

Daitch-Mokotoff Soundex

In 1985 the new Daitch-Mokotoff Soundex algorithm was applied for phonetic coding of Slavic and Yiddish surnames with similar pronunciations and different spellings. The most important improvements are the following: the code is composed of six characters; there are two different codes when a name has two possible pronunciations; the code is composed of ten figures from 0 to 9.

NYSIIS

The New York State Identification and Intelligence System was developed in 1970 through rigorous empirical analysis. In this algorithm a phonetic code of up to 6 letters is constructed.

Procedure
  1. Translate first characters of name:
    MAC → MCC
    PH → FF
    KN → NN
    PF → FF
    K → C
    SCH → SSS
  2. Translate last characters of name:
    EE → Y
    IE → Y
    DT, RT, RD, NT, ND → D
  3. Translate remaining characters by following rules,
    incrementing by one character each time:
    EV → AF; else E, I, O, U → A
    Q → G
    Z → S
    M → N
    KN → N; else K → C
    SCH → SSS
    PH → FF
    H → previous character, if previous or next are nonvowel
    W → previous character, if previous is vowel
  4. Check last character:
    if last character is S, remove it
    if last characters are AY, replace with Y
    if last character is A, remove it
  5. Drop second letter of doubled letters.
  6. Take first six characters.
Examples
  1. ALEXANDRE → ALAXANDRA → ALAXANDR → ALAXANDR → ALAXAN
  2. ALEKSANDER → ALACSANDAR → ALACSANDAR → ALACSANDAR → ALACSA

Metaphone

The algorithm phonetically codes words by reducing them to 16 consonant sounds: B, X, S, K, J, T, F, H, L, M, N, P, R, 0, W, Y. Zero represents the “th” sound; X stands for “sh”.

Procedure
  1. Drop second letter of doubled letters, except C.
  2. If the word begins with KN, GN, PN, AE, WR, drop first letter.
  3. Drop B at the end of a word after M.
  4. C → X in CIA or CH; C → S in CI, CE or CY; C → K otherwise.
  5. D → J in DGE, DGY or DGI; D → T otherwise.
  6. Drop G in GH and if not at end or before a vowel in GN or GNED; G → J before I or E or Y if not double GG; G → K otherwise.
  7. Drop H after vowel and if no vowel follows.
  8. Drop K after C.
  9. P → F in PH.
  10. Q → K.
  11. S → X in SH or SIO or SIA.
  12. T → X in TIA or TIO; T → 0 in TH; T is dropped in TCH.
  13. V → F.
  14. If the word begins with WH, drop H; drop W if not followed by a vowel.
  15. If the word begins with X, then X → S; X → KS otherwise.
  16. Drop Y if not followed by a vowel.
  17. Z → S.
  18. Vowels are kept only when they are the first letter.
  19. In all the other cases the letters do not change.
Examples
  1. ALEXANDRE → ALEKSANTRE → ALKSNTR
  2. ALEKSANDER → ALEKSANTER → ALKSNTR

Double metaphone

Double metaphone is the improved version of Metaphone. This algorithm phonetically codes words by reducing them to 12 consonant sounds. It returns two keys if a word has two possible pronunciations.

Caverphone

The Caverphone phonetic matching algorithm was created in the Caversham Project at the University of Otago in New Zealand in 2002. The algorithm allows accents present in the study area (southern part of the city of Dunedin, New Zealand).

A new version, Caverphone 2.0, has been built as a more general phonetic match.

Similarity metric

Other procedures discussed here use different types of textual similarity metric. They can be used in a fault-tolerant search.

Most of those procedures calculate the similarity of two strings as number between 0 and 1. Value 0 means the strings are completely different. Value 1 means perfect match of the strings. Intermediate values correspond to a partial match.

The similarity metric algorithms are useful for spell checking, speech recognition, DNA analysis, plagiarism detection, to find difference between files and for remote screen update problem.

Hamming distance

For two strings of the same length, the Hamming distance is the number of positions in which the two strings have different characters.

Examples
  1. Hamming distance from ALEXANDRE to ALEKSANDR is 6 (ALE coincide, the resting 6 characters are different).

Edit distance

The edit distance of two strings is defined as the minimum number of insertions, deletions and substitutions required to transform the first string into the second. The distance 0 means the strings are identical.

The algorithm was devised in 1965 by the sovietic scientist Vladimir Levenshtein. The edit distance is also called the Levenshtein distance.

There are modern versions of Edit distance algorithm assigning different weights to insertions, deletions and substitutions.

Examples
  1. Edit distance from ALEXANDRE to ALEKSANDER is 4 (substitute X by K, insert S after K, insert E after D, delete E at the end).

Trigram

The trigram similarity between two strings is determined by the number of matching letter triples in both strings. The algorithm can be generalized to n-grams, where n=1, 2, …

The strings string1 and string2 are padded to left and to right by one space symbol. Then the strings are divided into letter triplets (trigrams). Finally, the similarity is calculated as:

s = 2*m / (a + b).

Here:

m is number of matching trigrams
a is number of trigrams in string1
b is number of trigrams in string2

Examples
  1. The similarity of ALEXANDRE and ALEKSANDER is 2*3 / (9+10) = 0.32 (matching _AL, ALE, AND).

Ratcliff/Obershelp pattern recognition

The Ratcliff/Obershelp algorithm computes the similarity of two strings as the doubled number of matching characters divided by the total number of characters in the two strings. Matching characters are those in the longest common subsequence plus, recursively, matching characters in the unmatched region on either side of the longest common subsequence.

Examples
  1. The similarity of ALEXANDRE and ALEKSANDER is 2 * (3+3+1+1) / (9+10) = 0.84 (matching ALE, AND, E, R).

Jaro-Winkler

The Jaro-Winkler similarity was developed at the U.S. Census and used in post-enumeration analysis.

Given strings string1 and string2, their similarity is:

s = m/3a + m/3b + (m-t)/3m.

Here:

m is number of matching characters
a is length of string1
b is length of string2
t is number of transpositions

Two characters are considered matched only if they are no further apart than (max(a,b)/2 – 1). The first matched character on string1 is compared to the first matched character on string2; the second matched character on string1 is compared to the second matched character on string2, etc. The number of mismatched characters is divided by two to yield the number of transpositions.

The improved Jaro-Winkler method uses weights different from 1/3. It also penalizes less some types of error: visual scanning and keypunch errors, errors at the end of a string.

Examples
  1. The similarity of ALEXANDRE and ALEKSANDER is (8/9 + 8/10 + (8-1)/8) / 3 = 0.85 (matching A, L, E, A, N, D, R, E; 1 transposition).

Typewriting errors

There are some algorithms taking into account errors resulting from the pressing of erroneous buttons on the keyboard: V instead of B, 6 instead of Y, etc.

Comparison of algorithms

The choice of one from the following string-matching algorithms mainly depends on the nature of error influencing the text data.

AlgorithmTypeWhen to use
Soundexphoneticmisspelled English words
Daitch-Mokotoffphoneticdifferently spelled European surnames
NYSIISphoneticmisspelled English and some foreign words
Metaphone, Double metaphonephoneticmisspelled English words
Caverphone 2.0phoneticmisspelled English words
Hamming, Edit distancesimilaritylocal spelling errors
Trigram, n-gramsimilarityspelling errors, edited text
Ratcliff/Obershelpsimilarityspelling errors, edited text
Jaro-Winklersimilarityspelling and typewriting errors

Resources

Handbooks, lectures

  1. The dictionary of algorithms and data structures.
    http://www.nist.gov/dads/
  2. Handbook of algorithms and data structures by Gaston H. Gonnet and Ricardo Baeza-Yates.
    http://www.dcc.uchile.cl/~rbaeza/handbook/
  3. Exact string matching algorithms: lectures of Laboratoire d’Informatique de Rouen, France.
    http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Papers

  1. Soundex reference.
    http://www.archives.gov/
  2. Usage of Soundex and NYSIIS in NameSearch system.
    http://www.name-searching.com/
  3. Article on Soundex and NYSIIS by Andrew Coates.
    http://www.civilsolutions.com.au/
  4. Introduction to Daitch-Mokotoff soundex system by Gary Mokotoff.
    http://www.avotaynu.com/soundex.html
  5. Description of Caverphone 2.0.
    http://caversham.otago.ac.nz/
  6. Metaphone reference.
    http://aspell.sourceforge.net/
  7. Double metaphone reference.
    http://www.cuj.com/
  8. Article by Reinhard Rapp on Trigram method.
    http://heise.de/ct/english/97/04/386/
  9. Article by Bruce L. Lambert on usage of N-gram and Edit distance in drug naming.
    http://www.hc-sc.gc.ca/
  10. Paper by William E. Winkler and Yves Thibaudeau on Jaro-Winkler string comparator.
    http://www.census.gov/srd/papers/pdf/rr91-9.pdf
  11. Articles on Fuzzy string searching, Phonetic algorithm, Soundex, NYSIIS, Metaphone, Daitch-Mokotoff Soundex, Levenshtein distance and Trigram search in Wikipedia, the free encyclopedia.
    http://en.wikipedia.org/
  12. Article on SQL Server 2005 Fuzzy Lookup and Grouping technology on MSDN Magazine.
    http://msdn.microsoft.com/

Implementations

  1. Implementation of Soundex in C.
    http://physics.nist.gov/cuu/Reference/soundex.html
  2. Implementation of Metaphone and Double metaphone in Basic, C, Perl, and C++.
    http://aspell.sourceforge.net/metaphone/
  3. Implementation of Edit distance in Java, C++ and Visual Basic.
    http://www.merriampark.com/ld.htm
  4. Dynamic programming algorithm for Edit distance.
    http://www.csse.monash.edu.au/
  5. Fuzzy string search in TCL.
    http://mini.net/tcl/3841
  6. Description of PHP string functions levenshtein, soundex, similar_text and metaphone.
    http://www.php.net/manual/en/ref.strings.php
  7. Fuzzy Matching & Deduplication SourceGorge project.
    http://sourceforge.net/projects/dedupe/

Leave a Reply

Your email address will not be published. Required fields are marked *