Trying to eliminate similar strings

micky · September 8, 2015, 10:04am

Hi All, I have a table that looks like this

ID | Name |
1 Bob Green
2 B Green
3 S Red
4 Sam Red

is it at all possible to somehow merge those 2 names into one field, perhaps Bob Green with a new ID of 5?

I don't even know where to start for something like this.

Thanks for all your help.

Kristen · September 8, 2015, 10:16am

Good luck!

You may have two, different, people both called "Bob Green" of course ...

A cheap-and-cheerful method is to use SOUNDEX. I doubt that is robust enough for what you are attempting.

More sophisticated than that is the Levenshtein Edit Distance Algorithm. There are some implementations on this forum, but I didn't manage to find anything recent, and there are several versions posted by people with only a single-post-count which I am suspicious about, but in case of interest here is one such thread (on the old forum here) :
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540

Kristen · September 8, 2015, 10:19am

When we have done this sort of "cleanup" in the past we have given a score to each possibly-matched-pair. Some scores have been deemed good enough for the machine to decide that its a true match, others have needed a Human to verify it ...

For example, one of our rules was that if we had an approximately matching Organisation Name, with an exactly matching Phone Number (after stripping all non-numeric characters) and ZIP Code that would be considered to be a match. Afterwards we found out that lots of Organisations used an Agent to handle their business, and the Phone Number / Zip Code was for the agent ... and the Organisation Names were actually distinct rather than duplicates