SQLTeam.com | Weblogs | Forums

Levenshtein Distance Algorithm


#1

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540

The above URL giving clear picture about Levenshtein algorithm between two strings and it seems to be working fine as expected. How do I apply this into one table? My requirement here is, I want to show matched strings from Name Column in my DB table. Let's assume my DB table has 10 rows and should be displayed as follows..

ABCD
AB_B

QWE_
QWER

PQRSWE
P__S_W_

AS23_R
AS_3T_

Is there any possiblity to apply this algorithm for DB table?


#2

well, of course you can apply it to a DB table, Why else would it be in SQL Server? If you install Master data Services, you can use this algorithm and others:

DQS matching uses nGrams. This algorithm tokenizes strings to all possible substrings of length n.
These tokens are called nGrams. Then it checks how many different
nGrams two strings have in common. The more they have in common, the
more similar the strings are. By dividing the number of nGrams in common
with the number of all possible nGrams, you can get a coefficient with a
value between 0 and 1 that estimates the similarity of the strings. The
closer the coefficient is to the value 1, the more similar the strings
are.
SSIS fuzzy transformations use a much more sophisticated algorithm.
The algorithm was developed by Microsoft Research and is not public.
However, some details are described in the Books Online for SQL Server
2012 article “Fuzzy Lookup Transformation” at http://msdn.microsoft.com.ezproxy.torontopubliclibrary.ca/en-us/library/ms137786.aspx.
In addition to DQS and SSIS, MDS has some public string similarity
algorithms as well. You can find these algorithms implemented in the MDS
database through the mdq.Similarity function. The function implements
the following four public algorithms:

  • Levenshtein distance (also called edit distance)
  • Jaccard index
  • Jaro-Winkler distance
  • Simil (longest common substring; also called Ratcliff/Obershelp)

In addition, you can get nGrams with the mdq.NGrams function, also
provided in the MDS database. Finally, there is also a function that
finds similar dates called mdq.SimilarityDate. You can use all of the
MDS functions in Transact-SQL queries.