Mapping problem

barnabeck · June 11, 2017, 8:19pm

I have 2 tables:
Table "TAG" lists all the TAG numbers used to identify items of an order. In the displayed code underneath TAG is Table "Y"

TAG
-----
IT 00123
IT 00124
...

Table "Files" lists all certificates names that belong to these items. The file names are typed in manually by the worker and therefore often carry typos or are uploaded multiple times. The filenames are composed by serial number + TAG numbers. In the displayed code underneath FILES is Table "Z"

Files
------
X1234/IT00123
X54321/IT 00124
...

Goal: Build a query that:

finds all TAGs that are not mapped by any of the files
finds all files with multiple mapping to the same TAG number
find those files that are not mapped to any TAG number

I do this:

;with Score as (
Select WS_TAGNUMBER, z.NAME from (
Select WS_TAGNUMBER
from WS_TagNoSalesLine
Where WS_SALESLINEREFSALESID = @Salesid)y
full outer join
(
Select NAME from (
Select b.name  FROM SALESLINE a
                  inner join DOCUREF b on a.RECID = b.REFRECID AND b.ACTUALCOMPANYID = 'ES75' and b.REFTABLEID = '359'
	          inner join DOCUVALUE c on b.VALUERECID =  c.RECID
		  inner join DOCUPARAMETERS d on d.DATAAREAID = 'ES75'
WHERE a.SALESID = @SALESID and a.DATAAREAID = 'ES75' and c.FILETYPE = 'DOC')x)z on REPLACE(z.NAME,' ','')+'XYZ' like '%' + REPLACE(y.WS_TAGNUMBER,@TAGsInvariable,'') +'XYZ'+'%')

select *, ROW_NUMBER() OVER (ORDER BY Cnt desc, WS_TAGNUMBER) AS ord from (
select WS_TAGNUMBER, COUNT(WS_TAGNUMBER) as Cnt, stuff((select ', ' + NAME from Score a
   WHERE a.WS_TAGNUMBER = Score.WS_TAGNUMBER
   ORDER BY NAME
   for xml path('')),1,1,'') AS FileNames
from Score
GROUP BY WS_TAGNUMBER)X

Here comes the question:
This does Query the job in 12 seconds for 1250 TAG numbers; the problem is that any orphaned FileName is not displayed, as it does not correspond to a TAG number and therefore looses its content if I group by TAGNUMBER.
Therefore I substitute WS_TAGNUMBER in the second line by COALESCE(WS_TAGNUMBER,''), but then all of the sudden the query takes more then 2 hours to execute.

Anyone has a bright idea, or a different approach?
Martin

Kristen · June 12, 2017, 8:01am

You need the WHERE clause to be SARGable - when you put a Function Call in there, such as COALESCE(WS_TAGNUMBER,'') then it is not longer SARGable - e.g. SQL will not use an index, if available.

You could do OR - but I expect that will be the same

WHERE a.WS_TAGNUMBER = Score.WS_TAGNUMBER
    OR (a.WS_TAGNUMBER IS NULL AND Score.WS_TAGNUMBER = '')

but I think the more likely performant approach would be to use a UNION ALL. Repeat the query in the UNION ALL statement but with a WHERE clause to (just) handle the situation where a.WS_TAGNUMBER IS NULL

This is also likely to perform very badly - also not SARGABLE

on REPLACE(z.NAME,' ','')+'XYZ' like '%' + REPLACE(y.WS_TAGNUMBER,@TAGsInvariable,'') +'XYZ'+'%')

Also make sure that all your JOIN and WHERE clause conditions are "covered" by indexes

I don't know if it is significant, but we use a more complete construction for XML concatenation of a delimited list:

			FOR XML PATH('')
				, TYPE
		).value('.', 'varchar(max)')	-- NOTE: "'value" is case sensitive!!
	        , 1, 1, '')

you might find that your, slimmer version, mangles values - might be serendipity that your score.NAME does not, yet!, exhibit that problem.

ScottPletcher · June 12, 2017, 3:50pm

Typically SQL handles this code:

WHERE a.WS_TAGNUMBER = Score.WS_TAGNUMBER
OR (a.WS_TAGNUMBER IS NULL AND Score.WS_TAGNUMBER = '')

perfectly. If a UNION is best, the SQL plan will do so automatically.

barnabeck · June 13, 2017, 7:35am

@Kristen
Thank you for the explanations on SARGability. I followed your suggestion and did it with UNION all. As you expected

WHERE a.WS_TAGNUMBER = Score.WS_TAGNUMBER
OR (a.WS_TAGNUMBER IS NULL AND Score.WS_TAGNUMBER = '')

gave the same bad performance. Here is my solution that takes 44 seconds (for 1250 TAGS) which is not too good but I can live with it.

Score as (
Select WS_TAGNUMBER as TAG, Name 
from TAGS
left outer join FILES on REPLACE(NAME,' ','')+'XYZ' like '%' + REPLACE(WS_TAGNUMBER,@TAGsInvariable,'') +'XYZ'+'%'
UNION ALL 
SELECT '', Name
from FILES
left outer join TAGS on REPLACE(NAME,' ','')+'XYZ' like '%' + REPLACE(WS_TAGNUMBER,@TAGsInvariable,'') +'XYZ'+'%'
WHERE TAGS.WS_TAGNUMBER is NULL)

select TAG, COUNT(COALESCE(TAG,'')) as Cnt,
   stuff((select ', ' + NAME from Score a
          WHERE a.TAG = Score.TAG
          ORDER BY NAME
          FOR XML PATH(''), TYPE).value('.', 'varchar(max)') , 1, 1, '') AS Files
from Score
GROUP BY TAG

Thank you for the comment on the Stuff concatenation, I don't know where I picked that code up, but I can't exclude that I had been using this - slimmer/uncomplete/erroneous?- construction on several occasions, without being aware that this might falsify the result

Kristen · June 13, 2017, 11:26am

Almost certainly the REPLACE and the leading '%' on the LIKE are the problems there.

I'm guess how this is structured in your data, but if you could maintain the TAGs as a SubTable of FILES, rather than a space/etc. delimited list, then the search could be "instant"

Its a trade off between ease of entry etc. and speed of searching / reporting.

One option would be to leave it as it is, but to put a TRIGGER on the FILES table and have that "split" the NAME into individual elements and UpSert them in a FilesNameElements sub-table - and then report on that.

barnabeck · June 13, 2017, 11:50am

Yes, you are right and I had the same sensation. So I changed the LIKE operator to a RIGHT operator and join the 2 tables based on the comparison of the last X-characters. This is sufficiently fast, although I won't catch all the possible typing errors (extra letter on the file name's end). People are generally speaking too creative in finding ways to fool any almost troubled proved algorithm; so I leave it here.

Kristen · June 13, 2017, 1:55pm

Foolproof eh? "However big the fool, there is always a bigger fool to admire him"

Kristen · June 13, 2017, 1:57pm

I wonder if you could INDEX the REVERSE of the field (e.g. as a Computed Column, and then index that) ... and then compare the Start of the column - in particular if you can say

WHERE ColA LIKE ColB + '%'

that s should perform very well. It should be even better if instead of "Column-B" you have "@Parameter-B"