Inner join with LIKE statement. Is there a more efficient way?

SergioM · November 23, 2015, 6:56pm

I want to link two databases that do not share a key. To do that, the closest possible match is that the manufacturer & part number fields from one table are likely to appear in the title of the other table.

It works, but the problem is that table has about 65K rows and the other has 2M. Even fractions of this search take 10-20 minutes. Is there a more efficient way to do the same thing?

SELECT 
	BM.[ID]
	,BM.[BHID]
	,BM.[PartNoManufacturer]
FROM [MyDB].[dbo].[BH_Master] BM
INNER JOIN [OtherDB].[dbo].[product] P
	ON p.ProductName LIKE '%' + BM.PartNoManufacturer + '%'
WHERE p.ProductName LIKE '%' + BM.Manufacturer + '%'

gbritton · November 23, 2015, 7:04pm

The problem is this part:

LIKE '%' + BM.PartNoManufacturer + '%'

that means that SQL has to do nested loops, resulting in 65k * 2M rows to be read and compared. If your data supports it, remove the leading '%' from the LIKE clauses and index the PartNoManufacturer and Manufacturer columns (as well as the ProductName column)

ScottPletcher · November 23, 2015, 7:32pm

About all you can do with the current structure is create a separate covering index so that SQL won't have to scan the entire table. If you can build indexes online, remove the comment markers around that part:

CREATE NONCLUSTERED INDEX BH_Master__IX_Search ON dbo.BH_Master ( ID, BHID ) INCLUDE ( Manufacturer, PartNoManufacturer ) WITH ( FILLFACTOR = 95, /* ONLINE = ON, */ SORT_IN_TEMPDB = ON ) ON [PRIMARY]; --replace [PRIMARY] with a different filegroup if you want to store it elsewhere

SergioM · November 23, 2015, 7:53pm

I tried without them & after ~35 minutes, I decided to cancel the query.

It took longer than the previous ones - which may have to do with me picking Top 20 instead of top 2. Either way, I'll have to take you at your word in its efficiency. Thanks though

SergioM · November 23, 2015, 7:54pm

TBH, I have no experience with clustered indexes, but given the scenario, it's time to learn. Thanks for giving me a starting point.

gbritton · November 23, 2015, 7:54pm

Don't take my word for it! Indexing won't help if you keep the leading '%' in the LIKE clause. Check the execution plan. I'm betting a couple of nested loops

djj55 · November 23, 2015, 8:15pm

I do not suppose the ProductName is anything like Manufacturer_Part? i.e. then manufacturer name is always first and the part is always second?

JamesK · November 23, 2015, 8:20pm

To add to what @gbritton said, think of indexes - clustered or non-clustered, it doesn't matter for purposes of this discussion - as entries in a telephone book, which are sorted alphabetically by last name, firstname..

If you want to look for the last name Bell, it is easy to find - go to the page that starts with B, and then find the subset of that starts with E and so on. If you want to find all the first names that start with "James", that is a little harder, but the index still helps. Under each letter, you can go to the section that starts with J.

Now, if you were to say that you want to find all the last names that have the string "bell" in there - for example, Campbell - then, you can no longer go to the page that starts with B. You have to scan the entire phone book to find such last names. That is to say, you cannot use the index.

If you were to say that your ProductName always STARTS WITH the Part Manufacturer Number, then things become easier. Instead of

ON p.ProductName LIKE '%' + BM.PartNoManufacturer + '%'

you would use

ON p.ProductName LIKE BM.PartNoManufacturer + '%'

And then, like magic, if there is an index on ProductName, SQL Server will be easily able to go right to the page where Bell is listed.

ScottPletcher · November 23, 2015, 9:10pm

A covering index can still reduce the number of pages that must be scanned to satisfy the query, sometimes very significantly.

gbritton · November 23, 2015, 9:11pm

fair point!

jcelko · November 27, 2015, 8:00pm

People do not understand that encoding schemes have to be carefully designed. If you can find a good history library classifications before the Dewey Decimal Classification (DDC), read it. One of the signs of a bad design (or no design at all) is using complex regular expressions. For example, DDC is a hierarchical encoding. so I can use

ddc LIKE '5__.%' for science
ddc LIKE ''51_.%' for math
ddc LIKE ''512.%' for algebra

These searches are fast since the wild card is on the right end of the string.

Tables model sets, so their names should be plural or collective nouns. Do you really have only one product? Columns model attributes, so just as you have "product_name", you need "manufacturer_" as a data element name. Joins on names instead of a structure attribute is weird.

Can you post the specs with sample data?

odeddror · November 28, 2015, 3:03pm

Sergio,
You will need to create a dataset for BM.Manufacturer with proper index then do the join
(CTE will not work because it will have to iterate the entire table)

Thanks,
Ed Dror