CSV to XML to list processing

jp.mcc · August 5, 2016, 9:51am

Using this bit of code:

DECLARE @t TABLE( ID INT IDENTITY, data VARCHAR(50), data2 VARCHAR(50))
INSERT INTO @t(data,data2) SELECT 'AA,AB,AC,AD', 'BA,BB,BC,BD'

SELECT F1.id,
 F1.data,
 f1.data2,
 O.splitdata
FROM
(SELECT *,cast('<X>'+replace(F.data,',','</X><X>')+'</X>' as XML) as xmlfilter from @t F)F1
CROSS APPLY
(SELECT fdata.D.value('.','varchar(50)') as splitdata FROM f1.xmlfilter.nodes('X') as fdata(D)) O

We get these results:

id	data		data2		splitdata
1	AA,AB,AC,AD	BA,BB,BC,BD	AA
1	AA,AB,AC,AD	BA,BB,BC,BD	AB
1	AA,AB,AC,AD	BA,BB,BC,BD	AC
1	AA,AB,AC,AD	BA,BB,BC,BD	AD

But what I really want is to match the 2 columns "data" and "data2" together, to get this:

id	data		data2		splitdata	split2
1	AA,AB,AC,AD	BA,BB,BC,BD	AA		BA
1	AA,AB,AC,AD	BA,BB,BC,BD	AB		BB
1	AA,AB,AC,AD	BA,BB,BC,BD	AC		BC
1	AA,AB,AC,AD	BA,BB,BC,BD	AD		BD

i.e. a match of the 2 csv columns across each element number.
I've been trying all sorts of XML combinations and using ROW_NUMBER() to get an original order number of the csv elements, but can't get it working.

Any help gratefully received.

viggneshwar · August 5, 2016, 11:22am

The same query should be processed for data2 and joined with the first result.

JeffModen · August 6, 2016, 11:24pm

XML splitters are terrible for performance because you first have to do an expansive (causes string growth) replacement of the delimiters as well as concatenating end tags onto the string, which is also slow because it's also expansive). You'll also run into problems with special characters.

See the following article to verify my claims, please. This is also where you'll find the DelimitedSplit8K function that you'll need for the code below. If your DBA objects to using a function, tell him/her not to worry because DelimitedSplit8K is a very high performance iTVF (inline Table Valued Function), which is unlike either a slow UDF or even slower mTVF (multi-statement Table Valued Function).
http://www.sqlservercentral.com/articles/Tally+Table/72993/

Here's one possible solution. It only makes one PASS at the table and will easily handle all the special characters that XML will not without some form of de-entitization, normally in the form of using the TYPE hint, which cause it to run almost twice as slow as it currently does.

--===== Create more than 1 row of test data.
     -- NOTE THAT THIS IS NOT A PART OF THE SOLUTION!
     -- We're just creating test data here.
DECLARE @t TABLE(ID INT IDENTITY, data VARCHAR(50), data2 VARCHAR(50))
;
 INSERT INTO @t(data,data2)
 VALUES  ('AA,AB,AC,AD', 'BA,BB,BC,BD')
        ,('AA,AB,AC,AD', 'CA,CB,CC,CD')
        ,('One,Two,Three,Four', 'AYE,BEE,SEA,DEE')
;
--===== Solve the problem using the DelimitedSplit8K function
   WITH cteSplit AS
(--==== Stack the split data vertically
 SELECT t.ID, s.ItemNumber, s.Item, s.Col
   FROM @t t
  CROSS APPLY 
        (--==== Split each column and add a column identifier
         SELECT *,Col=1 FROM dbo.DelimitedSplit8K(data ,',')
          UNION ALL
         SELECT *,Col=2 FROM dbo.DelimitedSplit8K(data2,',')
        ) s (ItemNumber,Item,Col)
)--==== Pivot the data back to being horizontal for each ID
     -- using a high performance CROSSTAB
 SELECT  ID
        ,ItemNumber
        ,data  = MAX(CASE WHEN Col=1 THEN Item ELSE '' END)
        ,data2 = MAX(CASE WHEN Col=2 THEN Item ELSE '' END)
   FROM cteSplit
  GROUP BY ItemNumber,ID
;

jp.mcc · August 12, 2016, 11:43am

Thanks for that, it helped a lot. But as usual, things move on a bit and I'm now trying to use more combinations and I'm just not getting my expected results. Can you help a little more please?

Given this:
DECLARE @t TABLE(ID INT IDENTITY, [colour] VARCHAR(50), [cm] VARCHAR(50), [size] VARCHAR(50), [sm] VARCHAR(50));

INSERT INTO @t([colour],[cm],[size],[sm])
VALUES  ('Red,Yellow,Green,Blue', '-R,-Y,-G,-B', 'Small,Medium,Large', '-S,-M,-L')
       ,('Yellow,Green,Blue', '-Y,-G,-B', '','')
       ,('','','Medium,Large','-M,-L')
       ,('','','','');

I would really like to see these results:

id   col_x            col_y
1    Red Small        -R-S
1    Red Medium       -R-M
1    Red Large        -R-L
1    Yellow Small     -Y-S
1    Yellow Medium    -Y-M
1    Yellow Large     -Y-L
1    Green Small      -G-S
1    Green Medium     -G-M
1    Green Large      -G-L
1    Blue Small       -B-S
1    Blue Medium      -B-M
1    Blue Large       -B-L
2    Yellow           -Y
2    Green            -G
2    Blue             -B
3    Medium           -M
3    Large            -L
4

So, for however many elements per row are in [colour] concatenated with however many elements in [size], and similarly for [cm] and [sm] (ie 4 x 3 = 12 for the 1st row in this example). And in some cases there are no elements in [size] and [sm], or [colour] and [cm]. But there are always the same number of elements in [colour] as in [cm] and similarly in [size] and [sm].

I hope that's clear.

Many thanks