I want to link two databases that do not share a key. The closest possible match is that the manufacturer and part-number fields from one table are likely to appear in the product title of the other table.
It works, but one table has about 65K rows and the other has 2M, and even fractions of this search take 10-20 minutes. Is there a more efficient way to do the same thing?
SELECT
     BM.[ID]
    ,BM.[BHID]
    ,BM.[PartNoManufacturer]
FROM [MyDB].[dbo].[BH_Master] BM
INNER JOIN [OtherDB].[dbo].[product] P
    ON P.ProductName LIKE '%' + BM.PartNoManufacturer + '%'
WHERE P.ProductName LIKE '%' + BM.Manufacturer + '%'
The leading '%' makes both LIKE predicates non-sargable, so SQL Server has to do nested loops, resulting in roughly 65K × 2M rows being read and compared. If your data supports it, remove the leading '%' from the LIKE clauses and index the PartNoManufacturer and Manufacturer columns (as well as the ProductName column).
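A minimal sketch of that rewrite, using the names from the question (the index names are illustrative, and whether every ProductName actually begins with the part number is an assumption about your data; note that both strings cannot normally be prefixes of the same title, so only the part-number predicate is anchored here):

```sql
-- Illustrative index names; these make the prefix predicate seekable
CREATE INDEX IX_BH_Master_PartNoManufacturer
    ON dbo.BH_Master (PartNoManufacturer);
CREATE INDEX IX_product_ProductName
    ON dbo.product (ProductName);

-- No leading '%' on the join predicate: the optimizer can now seek
-- the index on ProductName instead of scanning all 2M rows per match
SELECT BM.[ID], BM.[BHID], BM.[PartNoManufacturer]
FROM [MyDB].[dbo].[BH_Master] BM
INNER JOIN [OtherDB].[dbo].[product] P
    ON P.ProductName LIKE BM.PartNoManufacturer + '%'
WHERE P.ProductName LIKE '%' + BM.Manufacturer + '%';  -- manufacturer is not
                                                       -- usually a prefix too,
                                                       -- so this one keeps '%'
```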
About all you can do with the current structure is create a separate covering index so that SQL Server won't have to scan the entire base table. If your edition can build indexes online, remove the comment markers around that option:
CREATE NONCLUSTERED INDEX BH_Master__IX_Search
    ON dbo.BH_Master ( ID, BHID )
    INCLUDE ( Manufacturer, PartNoManufacturer )
    WITH ( FILLFACTOR = 95, /* ONLINE = ON, */ SORT_IN_TEMPDB = ON )
    ON [PRIMARY]; -- replace [PRIMARY] with a different filegroup if you want to store it elsewhere
I tried without the leading wildcards, and after ~35 minutes I decided to cancel the query.
It took longer than the previous ones - which may have to do with me picking TOP 20 instead of TOP 2. Either way, I'll have to take you at your word on its efficiency. Thanks though.
Don't take my word for it! Indexing won't help if you keep the leading '%' in the LIKE clause. Check the execution plan - I'm betting it shows a couple of nested loop joins over full scans.
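One way to see the plan without waiting for the query to finish, assuming you are in SSMS or sqlcmd:

```sql
-- Show the estimated text plan without executing the query
-- (in SSMS, Ctrl+L / "Display Estimated Execution Plan" does the same)
SET SHOWPLAN_TEXT ON;
GO
SELECT BM.ID, BM.BHID, BM.PartNoManufacturer
FROM MyDB.dbo.BH_Master BM
INNER JOIN OtherDB.dbo.product P
    ON P.ProductName LIKE '%' + BM.PartNoManufacturer + '%'
WHERE P.ProductName LIKE '%' + BM.Manufacturer + '%';
GO
SET SHOWPLAN_TEXT OFF;
GO
-- A "Nested Loops" operator sitting over a "Table Scan" / "Index Scan"
-- of product, executed once per BH_Master row, is the 65K x 2M pattern
-- described above.
```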
To add to what @gbritton said, think of indexes - clustered or non-clustered, it doesn't matter for purposes of this discussion - as entries in a telephone book, which are sorted alphabetically by last name, then first name.
If you want to look up the last name Bell, it is easy to find: go to the page that starts with B, then find the subset that starts with Be, and so on. If you want to find everyone with the first name James, that is a little harder, but the index still helps: within each last name, the entries are sorted by first name, so you can skip straight to the Js.
Now, if you want to find all the last names that contain the string "bell" - for example, Campbell - then you can no longer go to the page that starts with B. You have to scan the entire phone book to find such last names. That is to say, you cannot use the index.
If you were to say that your ProductName always STARTS WITH the Part Manufacturer Number, then things become easier. Instead of
ON p.ProductName LIKE '%' + BM.PartNoManufacturer + '%'
you would use
ON p.ProductName LIKE BM.PartNoManufacturer + '%'
And then, like magic, if there is an index on ProductName, SQL Server will easily be able to go right to the page where Bell is listed.
People do not understand that encoding schemes have to be carefully designed. If you can find a good history of library classifications before the Dewey Decimal Classification (DDC), read it. One of the signs of a bad design (or of no design at all) is having to use complex regular expressions to query the data. DDC, by contrast, is a hierarchical encoding, so I can write:
ddc LIKE '5__.%' for science
ddc LIKE '51_.%' for math
ddc LIKE '512.%' for algebra
These searches are fast since the '%' wildcard is at the right end of the string, so an index on ddc can still be used.
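As a sketch of how that plays out, against a hypothetical Books table (the table, column, and index names here are mine, not from the thread):

```sql
CREATE TABLE Books (
    isbn  CHAR(13)     NOT NULL PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    ddc   VARCHAR(12)  NOT NULL  -- Dewey code, e.g. '512.44'
);
CREATE INDEX IX_Books_ddc ON Books (ddc);

-- Each predicate is anchored on a literal prefix, so the index on ddc
-- narrows the search to one branch of the hierarchy instead of the whole table
SELECT isbn, title FROM Books WHERE ddc LIKE '5__.%';  -- science
SELECT isbn, title FROM Books WHERE ddc LIKE '51_.%';  -- math
SELECT isbn, title FROM Books WHERE ddc LIKE '512.%';  -- algebra
```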
Tables model sets, so their names should be plural or collective nouns - do you really have only one product? Columns model attributes, so just as you have "product_name", you need "manufacturer_name" as a data element name. Joining on names instead of a structural attribute is weird.
Sergio,
You will need to materialize a dataset of the BM.Manufacturer values with a proper index, then do the join.
(A CTE will not work, because it would still have to iterate over the entire table.)
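One way to read that suggestion, as a sketch (the temp table and index names are illustrative, and the anchored predicate assumes titles start with the part number, per the earlier answers):

```sql
-- 1. Materialize the distinct search terms once
SELECT DISTINCT BM.Manufacturer, BM.PartNoManufacturer
INTO #Search
FROM MyDB.dbo.BH_Master BM;

-- 2. Index the materialized set
CREATE CLUSTERED INDEX IX_Search
    ON #Search (PartNoManufacturer, Manufacturer);

-- 3. Join the small indexed copy against the 2M-row table
SELECT S.Manufacturer, S.PartNoManufacturer, P.ProductName
FROM #Search S
INNER JOIN OtherDB.dbo.product P
    ON P.ProductName LIKE S.PartNoManufacturer + '%'   -- anchored: seekable
WHERE P.ProductName LIKE '%' + S.Manufacturer + '%';
```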