How to return closest or best match to a UK postcode?

QuattroDave · June 5, 2015, 10:44am

Morning all,

I've been playing around with this this problem this morning. So far I've tried using difference and soundex but can't seem to get either to function as I'd expect... Basically I have a big table with loads of UK postcodes and they're geographic location (longitude & latitude). Un-fortunately it doesn't contain all uk postcodes and seeing as it doesn't need to be hugely accurate I was just going to return the closest match...

Example search for 'AL1 1DQ' (which isn't in the db) then it might return the longitude & latitude for 'AL1 1DP' or 'AL1 1DR' or 'AL1 1CY' basically what ever it deems the closest match...

Using this at the moment:

SELECT TOP 1 *
FROM Geolocation2
WHERE SOUNDEX('AL1 1DQ') = SOUNDEX(Postcode)

Annoyingly it returns 'AL9 6ZJ' although I'm not sure why......

Thanks

Dave

viggneshwar · June 5, 2015, 10:56am

SOUNDEX mainly considers the first letter.

SOUNDEX will return A400 for the below
select SOUNDEX('AL1 1DQ')
select SOUNDEX('AL9 6ZJ')
select SOUNDEX('AL')

Please refer the below article
https://msdn.microsoft.com/en-us/library/ms187384.aspx

QuattroDave · June 5, 2015, 10:59am

Ah, I didn't realise that.... Any pointers or suggestions I should look at next..?

Thanks

Dave

gbritton · June 5, 2015, 11:54am

SSIS Fuzzy Lookup and Grouping.

bitsmed · June 5, 2015, 12:42pm

I have had success with this alternative: Levenshtein disctance

QuattroDave · June 5, 2015, 1:15pm

gbritton & bitsmed:
Thank you both for your in put, both look to be very usable options, what's the performance like with these options...?

I'm wondering if we're getting a bit complex, 99% of the time the simple statement:

SELECT TOP 1 longitude, latitude
FROM Geolocation2
WHERE postcode = '??????'

works just fine only every now and again does it not find a post code... What would you think about the approach of if it can't find a postcode remove the last character and look again and so on and so on...?

Thanks

Dave

gbritton · June 5, 2015, 1:27pm

Fuzzy lookup is very good. It can use Levenshtein distance among other algorithms.

fwiw performance depends on your match tolerance. If you say 50% is good enough, many more rows have to be checked than if you say I need 95%. You'll need to set it up with your data and experiment to find a balance between good matching and performance.

What would you think about the approach of if it can't find a postcode
remove the last character and look again and so on and so on...?

Frankly not much. The algos already mentioned do that and much, much more.

bitsmed · June 5, 2015, 1:34pm

I agree with @gbritton. Fuzzy takes longer than your commenly direct hit, and you should only do the fuzzy search in the cases where you don't get direct hit. And by your own statement, the fussy search would only be activated i 1 % of the cases.

SwePeso · June 7, 2015, 8:56pm

See http://weblogs.sqlteam.com/peterl/archive/2008/08/13/Validate-UK-postcode.aspx

QuattroDave · June 8, 2015, 12:42pm

Thanks for your reply, unfortunately I don't want to validate a postcode. I have a table with nearly all the 2 million UK postcodes in it with they're geographic location, but for the odd one or two that are missing I just want just want to retrieve the geographic location of the closest matched postcode...

I am seriously tempted to try the Levenshtein approch as it looks a bit easier to implement than Fuzzy....

Thanks

Dave

gbritton · June 8, 2015, 1:05pm

Really? Fuzzy Matching is just point and click with SSDT

ian_pettman · June 9, 2015, 12:10pm

Well (in the UK) if you have a post code for a radio station, hospital or any other organisation that receives a large amount of email then the post code can be quite different from the place next door BBC London Portland Place London W1A 1AA but if you find the post code for 3 Portland Place W1B 1HR... the best you can do is find the geocenter for the prefix part or pay the post office for a licence....