Is there a good way to search against unstructured data?

SergioM · February 17, 2022, 8:48pm

I've imported my emails directly to a SQL server database. So I have one table with just blobs of unstructured data. And I also have a product table in my database.

Example:

Is there an efficient (or semi-reasonable) way to query to see if each SKU (product id) exists in that unstructured blob? I have 50K products, and 1K emails like this.

If it is not reasonable to do in TSQL, I would use Python/Regex, but this would be easier.

Lets assume that the data looks like this.


CREATE TABLE #email_content (email_blob nvarchar(max));
INSERT INTO #email_content (email_blob) VALUES ('Some content to extract a unique thing UniqueID123.  More data, not structured');

CREATE TABLE #product_table (sku nvarchar(max));
INSERT INTO #product_table (sku) 
VALUES ('UniqueID123')
	, ('UniqueID456')
	, ('UniqueID789')
	, ('UniqueIDABC')
	, ('UniqueIDDEF')
	, ('UniqueIDGHI')
	;

EDIT - this is my first crack at it. Though I don't know if it can be made more efficient

SELECT
	[eco].[email_blob]
	, [pta].[sku]
FROM #email_content eco
CROSS APPLY #product_table pta
WHERE [eco].[email_blob] LIKE '%' + [pta].[sku] +  '%'

ScottPletcher · February 17, 2022, 9:19pm

For only 1K emails, that general approach should be workable.

Can more than one (of the 50K) skus be mentioned in the same email?

SergioM · February 17, 2022, 9:21pm

Great. I do think 1K is the upper limit. Once data is extracted I plan on flagging those rows so that they are not re-processed.

Yes, in reality I simplified the use-case for the post. But a SKU can appear more than once & there are actually several unique identifiers that I will be comparing against. I just figured it was the same thing, but again.

ScottPletcher · February 17, 2022, 9:31pm

If there can be more than one sku, you'll need some type of loop to find them all. A well-written function can actually perform well for that.