Weighted probability a row is the same

bluebsh · August 12, 2015, 6:06pm

I need to be able to search a table for data that is similar, but have it weighted as to which is the most likely candidate for a match...

Say I have a table that has this information..

FirstName, LastName, DateOfBirth, SocialSecurityNumber, Address

a search query can specify any or all of those pieces of information... I want to return what it thinks is the most likely the correct one based on weighted column values... like SSN is a much higher weight then say first and last name, but first, dob and address have a pretty high weight but not as high as ssn...

so say

column with weight

FirstName - 10
LastName - 10
Dob - 20
SSN - 100
Address - 30

so a first name and DOB would have a total weight of 30, so it's more likely then a first and last only which only has a weight of 20, but not as likely if I searched by ssn only which has 100 total weight.

how would you go about doing something similar to this in table logic? I have a few ideas, and actually have a working piece of code that is very rough by just making a query for each combination and assigning the result a weight union it and then sorting it.. but that is a long drawn out way of doing it.

any ideas to simplify this concept?

JamesK · August 12, 2015, 6:49pm

Something like this?

SELECT
	.....,
	10 * CASE WHEN @FirstName = FirstName THEN 1 ELSE 0 +
	10 * CASE WHEN @LastName = LastName THEN 1 ELSE 0 +
	20 * CASE WHEN @DOB = DOB THEN 1 ELSE 0 +
	100 * CASE WHEN @SSN = SSN THEN 1 ELSE 0 +
	30 * CASE WHEN @Address = [Address] THEN 1 ELSE 0 END AS
		CombinedWeight
FROM
	....
ORDER BY
	CombinedWeight DESC;

djj55 · August 12, 2015, 7:24pm

@JamesK, what no END?

Thanks I am always trying to learn

JamesK · August 12, 2015, 7:36pm

"No END" to my woes! Thanks djj! Edited my original post.

Kristen · August 12, 2015, 11:39pm

I would modify JamesK's approach as follows (no need for the multiplication that I can see?)

SELECT
	.....,
	CASE WHEN @FirstName = FirstName THEN 10 ELSE 0 +
	CASE WHEN @LastName = LastName THEN 10 ELSE 0 +
	CASE WHEN @DOB = DOB THEN 20 ELSE 0 +
	CASE WHEN @SSN = SSN THEN 100 ELSE 0 +
	CASE WHEN @Address = [Address] THEN 30 ELSE 0 END AS
		CombinedWeight
FROM  YourTable
WHERE    FirstName = @FirstName
      OR LastName = @LastName
      OR DOB = @DOB
      OR SSN = @SSN
      OR [Address] = @Address
ORDER BY
	CombinedWeight DESC;

the problem is that the WHERE clause is very inefficient.

An alternative, which might be faster, would be:

CREATE TABLE #TEMP
(
    T_ID int,     -- Whatever is the PKey for your table
    PRIMARY KEY
    (
        T_ID
    )
)
--
DECLARE @intTotalResults int = 0,
        @MaxResultsRequired int = 500 -- Number of results required
--
    -- Highest weighting queries first
IF @SSN IS NOT NULL
BEGIN
	INSERT INTO #TEMP
	SELECT YourPKey
	FROM YourTable
	WHERE SSN = @SSN
        SELECT @intTotalResults = @intTotalResults + @@ROWCOUNT
--
        IF @intTotalResults > @MaxResultsRequired GOTO DisplayResults
END
--
IF @Address IS NOT NULL
BEGIN
	INSERT INTO #TEMP
	SELECT YourPKey
	FROM YourTable
	WHERE     [Address] = @Address
              AND YourPKey NOT IN (SELECT T_ID FROM #TEMP)
        SELECT @intTotalResults = @intTotalResults + @@ROWCOUNT
--
        IF @intTotalResults > @MaxResultsRequired GOTO DisplayResults
END
--
etc.
--
DisplayResults:
--
SELECT
	.....,
	CASE WHEN @FirstName = FirstName THEN 10 ELSE 0 +
	CASE WHEN @LastName = LastName THEN 10 ELSE 0 +
	CASE WHEN @DOB = DOB THEN 20 ELSE 0 +
	CASE WHEN @SSN = SSN THEN 100 ELSE 0 +
	CASE WHEN @Address = [Address] THEN 30 ELSE 0 END AS
		CombinedWeight
FROM YourTable
WHERE     YourPKey IN (SELECT T_ID FROM #TEMP)
ORDER BY
	CombinedWeight DESC

this would require that the various search columns are indexed.

In practice, when we do this type of thing, we also match on "Near misses" so if you search for "John" for the first name then anything which is "John" gets a Weight=10, but variations like "Jon" get 5 and anything matching the SOUNDEX for John gets, say, 2.

We store all this "searchable field data" in a table and we refresh that using a TRIGGER whenever a record is modified. i.e. we store the SSN, DOB, Address etc. into that table - one row per item, including the SOUNDEX.

We then search that "searchable field data" table instead of the actual table. Its all indexed, and includes a SOUNDEX value for the data too. We also have an Alias table for word substitutions, that includes a dictionary of all root words (English only for us ) and aliases that the Client can add - e.g. if they have a product name and users may search for alternative spellings or competitors equivalent words, or a part number where a competitor's part numbering system is widely used in the industry. We split things like address into individual words and store those in the "searchable field data" so that search for YORK matches York Road, York Street as well as the town/city of York ... or New York. If the user searches for multiple words, e.g. "New" and "York" then that will give a higher weighting for "New York" because it matches two words in the address, rather than just one, but lower down the results list the user will see "York Street" and "York Road" and then anything matching the SOUNDEX of York etc.

bluebsh · August 13, 2015, 1:58pm

ah, ok... well that's pretty much what I am already doing...

was just hoping they came up with a tool or new statement that helped do this easier in newer versions of sql server

Kristen · August 13, 2015, 2:18pm

I don't use it, but the Free Text tools might offer something? Inability to search on short keywords is a turn-off for me - dunno if there is still that limitation though.