SQLTeam.com | Weblogs | Forums

Fuzzy Search of SQL Database


#1

Hi

I am wanting to perform a fuzzy search of our SQL Server database, but due to the string I am searching I don't think
Soundex and Difference will work.

I want to search our database for the following string '2PG 3A C6 VGA MJ HDMI I/L 1MLGST WH'

The database may contain various combinations of this string possibly with more characters such as '2PG IND3A 1RJ45 C6 1VGA 1MJ INLINE 1MLWIELGST W' I have tried to look at standardising the text but I was up to 50 replace statements and thought I was going the wrong way.

Is there anything better I can do?


#2

Levenshtein distance algorithm maybe?


#3

Thanks, Kristen.

I have found the SQL from an old SQL Team Forum (http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=66781), but I am struggling to make it run. I have created the stored procedure. When I execute it I am seeing no results, so I guess I am doing something wrong?

exec edit_distance_within @s = '2PG 3A C6 VGA MJ HDMI I/L 1MLGST WH', @t = '2PG IND3A 1RJ45 C6 1VGA 1MJ INLINE 1MLWIELGST W'


#4

I've only used it as a Function rather than a Procedure. I get "22" for those two values - dunno if that is "close" or "far" though!


#5

not sure if this will help, but I have done something similar to this with some tweaks to search every column in all tables. you might have to run this on each database you have or work it into one of those database loop scripts. one thing you will have to add is a filter to exclude out non string column types. I used this to figure out which column in which table in GP Dynamics' cryptic table names was being changed from the client tool.

create table #randomator(searchstring varchar(2500) not null)

insert into #randomator
select '2PG 3A C6 VGA MJ HDMI I/L 1MLGST WH'



select 'select top 10 ''' + t.name + ''', * from ' + t.name + ' where ' + c.name + ' like ''%' + r.searchstring + '%'''
  from sys.tables t join sys.columns c on t.object_id = c.object_id
cross apply (
 SELECT Split.a.value('.', 'VARCHAR(100)') AS searchstring  
		  FROM (SELECT CAST ('<M>' + REPLACE(rtrim(ltrim(searchstring)), ' ', '</M><M>') + '</M>' AS XML) AS searchstring  
					   --4855
					   --select *
				  FROM #randomator
				) A CROSS APPLY searchstring.nodes ('/M') AS Split(a)
) r

drop table #randomator

this will spit out a list of rows. copy them all to tsql and run the queries


#6

When searching All Columns in All Tables then perhaps that should restrict to only those with some sort of String datatype? Likely to also need to use QuoteName() on the Table/Column (and Schema) names ... sooner or later ... and likely to also need Schema Name too ... sooner or later ...

I'd post my search all tables and columns script, but its massive and handles all sorts of edge-condition searches that I've encountered over the years ...


#7

indeed needs all the trimmings you mentioned.


#8

Thanks for the suggestions.

I have the Levenshtein distance algorithm (suggested by Kristen) working and it appears to be giving me the results I expected, but I will take a look at the other suggestions.