SQLTeam.com | Weblogs | Forums

Need t-sql code to Parse string


#1

We are running SQL 2008 R2, and need efficient code to parse the [Journal Message] field. Below are the original field and the desired result. I used underscores to show where the separations should go.

 Journal Message
Admitted Rozgonyi, Alexandra (Card #106958) at  CO01 0116 Lobby to North Wing Entry [In]     
Admitted Wickman, Jared W (Card #63586) at  MA01 14/01 RECEPTION [In]     
Admitted CARNIADO, SANTIAGO (Card #63110) at  MX05 P08 Ratings Entry (W) [Out]     

    Action_Name_Card_Code_Building_Direction
Admitted_Rozgonyi, Alexandra_106958_CO01_0116 Lobby to North Wing Entry_In
Admitted_Wickman, Jared W_63586_MA01_14/01 RECEPTION_In     
Admitted_CARNIADO, SANTIAGO_63110_MX05_P08 Ratings Entry (W)_Out

#2

So the SPEC is this?

[Action] = "Admitted" break on first space, [Action] NEVER contains a space?
[Name] = "Rozgonyi, Alexandra " break on " (Card" - [Name] is ALWAYS followed by "(Card" ?
[Card] = "106958" break on ")", [Card] is always numeric, preceded by "#" and followed by ")"?

[Code] = "CO01" break on "space", [Code] never contains embedded space?
[Direction] = "In" - always the final value between "[...]", and is ALWAYS present on all rows?
[Building] = whatever is left, between [Code] and [Direction]
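If those rules hold, they could be sanity-checked against the live table before committing to a parser. A sketch - the table and column names here (dbo.BadgeJournal, [Journal Message]) are placeholders for whatever the real ones are:

```sql
-- Count rows that do NOT fit the assumed pattern; expect 0 if the spec holds.
-- '[[]' escapes a literal '[' inside LIKE.
Select Count(*) As NonConformingRows
  From dbo.BadgeJournal
 Where [Journal Message] Not Like '% (Card #%) at %[[]In]%'
   And [Journal Message] Not Like '% (Card #%) at %[[]Out]%';
```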


#3

Hi Kristen,
Thanks for responding. This data records the use of badge access cards when passing through secure doors. Currently the table contains 12 million records. [Action] has two possibilities, Admitted or Rejected, neither containing spaces. Yes, [Name] is always followed by "(Card #", and yes, [Card] is always numeric, preceded by "#" and followed by ")". And true, [Code] never contains embedded spaces. Lastly, [Direction] is always In or Out.


#4

Here is my attempt - it requires a string splitter you can get here: http://www.sqlservercentral.com/articles/Tally+Table/72993/

Declare @testTable Table(Journal varchar(255));

 Insert Into @testTable (Journal)
 Values ('Admitted Rozgonyi, Alexandra (Card #106958) at  CO01 0116 Lobby to North Wing Entry [In]')
      , ('Admitted Wickman, Jared W (Card #63586) at  MA01 14/01 RECEPTION [In]')
      , ('Admitted CARNIADO, SANTIAGO (Card #63110) at  MX05 P08 Ratings Entry (W) [Out]');

 Select t.Journal
      , a1.Item As Action
      , b2.Item As Name
      , c2.Item As Card
      , d2.Item As Code
      , rtrim(d3.Item) As Building
      , e1.Item As Direction
   From @testTable t
  Outer Apply (Select a.Item From dbo.DelimitedSplit8K(t.Journal, ' ') a Where a.ItemNumber = 1) As a1

  Outer Apply (Select b.Item From dbo.DelimitedSplit8K(t.Journal, '(') b Where b.ItemNumber = 1) As b1
  Outer Apply (Select rtrim(stuff(b1.Item, 1, charindex(' ', b1.Item), '')) As Item) As b2  -- strip [Action], keep middle initials

  Outer Apply (Select d.Item From dbo.DelimitedSplit8K(t.Journal, '#') d Where d.ItemNumber = 2) As c1
  Outer Apply (Select e.Item From dbo.DelimitedSplit8K(c1.Item, ')') e Where e.ItemNumber = 1) As c2

  Outer Apply (Select ltrim(substring(t.Journal, charindex(' at ', t.Journal, 1) + 4, 255)) As Item) As d1
  Outer Apply (Select f.Item From dbo.DelimitedSplit8K(d1.Item, ' ') f Where f.ItemNumber = 1) As d2  -- [Code] = first word after ' at '
  Outer Apply (Select g.Item From dbo.DelimitedSplit8K(stuff(d1.Item, 1, charindex(' ', d1.Item), ''), '[') g Where g.ItemNumber = 1) As d3  -- [Building] = the rest, up to '['

  Outer Apply (Select h.Item From dbo.DelimitedSplit8K(replace(t.Journal, ']', ''), '[') h Where h.ItemNumber = 2) As e1

As long as the pattern of the data remains the same this will work - but if the data can contain the characters we are splitting on it becomes a bit more difficult. In those cases you may have to use CHARINDEX or PATINDEX to get a portion of the string that can then be parsed (see the Building column in the query above).
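For instance, the card number alone can be lifted with CHARINDEX/SUBSTRING and no splitter at all - a sketch assuming the "(Card #...)" pattern is always present, run against the @testTable from the script above:

```sql
-- Sketch: pull the digits between '#' and ')' with plain string functions.
Select t.Journal
     , substring(t.Journal
                , charindex('#', t.Journal) + 1
                , charindex(')', t.Journal, charindex('#', t.Journal))
                  - charindex('#', t.Journal) - 1) As Card
  From @testTable t;
```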


#5

WOW, thanks. My eyes popped out when I got this to work. It's so fast too. :slight_smile: Wow, if you have a bitcoin address, I'll send you 5 dollars. This has been a long-standing problem. THANKS


#6

I have an alternative method that does not use the delimited split function above. This method uses a similar function I created called fnGetStringElement. The difference is that this function is a scalar function instead of an inline table-valued function.

I have not fully tested performance - but in several cases I have found that this function outperforms the inline table-valued version.

Here is the function:

  CREATE Function [dbo].[fnGetStringElement] (
        @pString varchar(8000)
      , @pDelimiter char(1)
      , @pElement int)
Returns varchar(8000)
   With Schemabinding
     As
  Begin

Declare @returnValue varchar(8000) = '';

   With e1(n) 
     As ( --=== Create Ten 1's
 Select 1 Union All Select 1 Union All
 Select 1 Union All Select 1 Union All
 Select 1 Union All Select 1 Union All
 Select 1 Union All Select 1 Union All
 Select 1 Union All Select 1 --10
        )
      , e2(n) As (Select 1 From e1 a, e1 b) -- 100
      , e3(n) As (Select 1 From e2 a, e2 b) -- 10,000
      , cteTally (n)
     As (
 Select Top (datalength(isnull(@pString, 0)))
        row_number() over(Order By (Select Null))
   From e3
        )
      , cteStart(n1)
     As (
 Select 1
  Union All
 Select t.n + 1
   From cteTally t
  Where substring(@pString, t.n, 1) = @pDelimiter
        )
      , cteEnd (n1, l1)
     As (
 Select s.n1
      , coalesce(nullif(charindex(@pDelimiter, @pString, s.n1), 0) - s.n1, 8000)
   From cteStart s
        )
      , cteSplit         --==== Do the split
     As (
 Select row_number() over(Order By e.n1) As ItemNumber
      , substring(@pString, e.n1, e.l1) As Item
   From cteEnd e
        )
 Select @returnValue = ltrim(rtrim(Item))
   From cteSplit
  Where ItemNumber = @pElement;

 Return @returnValue;
    End

GO

And here is how it would be used:

Declare @testTable Table(Journal varchar(255));

 Insert Into @testTable (Journal)
 Values ('Admitted Rozgonyi, Alexandra (Card #106958) at  CO01 0116 Lobby to North Wing Entry [In]')
      , ('Admitted Wickman, Jared W (Card #63586) at  MA01 14/01 RECEPTION [In]')
      , ('Admitted CARNIADO, SANTIAGO (Card #63110) at  MX05 P08 Ratings Entry (W) [Out]');

 Select t.Journal
      , a1.Item As Action
      , b2.Item As Name
      , c2.Item As Card
      , d2.Item As Code
      , d3.Item As Building
      , e1.Item As Direction
   From @testTable t
  Outer Apply (Select dbo.fnGetStringElement(t.Journal, ' ', 1) As Item) As a1
  Outer Apply (Select dbo.fnGetStringElement(t.Journal, '(', 1) As Item) As b1
  Outer Apply (Select rtrim(stuff(b1.Item, 1, charindex(' ', b1.Item), '')) As Item) As b2  -- strip [Action], keep middle initials
  Outer Apply (Select dbo.fnGetStringElement(dbo.fnGetStringElement(t.Journal, '#', 2), ')', 1) As Item) As c2
  Outer Apply (Select ltrim(substring(t.Journal, charindex(' at ', t.Journal, 1) + 4, 255)) As Item) As d1
  Outer Apply (Select dbo.fnGetStringElement(d1.Item, ' ', 1) As Item) As d2  -- [Code] = first word after ' at '
  Outer Apply (Select dbo.fnGetStringElement(stuff(d1.Item, 1, charindex(' ', d1.Item), ''), '[', 1) As Item) As d3  -- [Building] = the rest
  Outer Apply (Select dbo.fnGetStringElement(replace(t.Journal, ']', ''), '[', 2) As Item) As e1;

If you compare the execution plans, this version has a single scan of the table, with constant scans for each OUTER APPLY, and almost all of the cost is associated with the table scan and the nested loop joins that combine the results.

Change from OUTER APPLY to CROSS APPLY and you eliminate everything except the table scan and the compute scalar.

Of course, further testing across larger sets of data is really needed to make sure this works as well as expected.

Note: changing the function above to use a physical Tally table could improve performance even more.
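For reference, a minimal physical Tally table would look something like this (names illustrative; 8,000 rows covers a varchar(8000) input):

```sql
-- One-time setup of a physical Tally table.
Create Table dbo.Tally (N int Not Null Primary Key Clustered);

Insert Into dbo.Tally (N)
Select Top (8000) row_number() over(Order By (Select Null))
  From sys.all_columns a Cross Join sys.all_columns b;
```

Inside the function, the cteTally CTE would then reduce to `Select N From dbo.Tally Where N <= datalength(@pString)`.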


#7

Is this a sensible method? Performant?

DECLARE @testTable TABLE
(
	Journal varchar(255)
	, [Action] varchar(100)
	, [Name] varchar(300)
	, [Card] varchar(60)
	, [Code] varchar(40)
	, [Building] varchar(300)
	, [Direction] varchar(30)
);

INSERT INTO @testTable
(
	Journal
)
VALUES  ('Admitted Rozgonyi, Alexandra (Card #106958) at  CO01 0116 Lobby to North Wing Entry [In]')
      , ('Admitted Wickman, Jared W (Card #63586) at  MA01 14/01 RECEPTION [In]')
      , ('Admitted CARNIADO, SANTIAGO (Card #63110) at  MX05 P08 Ratings Entry (W) [Out]');
DECLARE	@intOffset1	int
	, @intOffset2	int
	, @intOffset3	int
	, @intOffset4	int
	, @intOffset5	int
	, @intOffset6	int
	, @strRemainder1	varchar(8000)
	, @strRemainder2	varchar(8000)
	, @strRemainder3	varchar(8000)
	, @strRemainder4	varchar(8000)
UPDATE	U
SET	@intOffset1 = CHARINDEX(' ', Journal, 0)
	, [Action] = LEFT(Journal, @intOffset1-1)
	, @strRemainder1 = STUFF(Journal, 1, @intOffset1, '')
	, @intOffset2 = CHARINDEX('(Card #', @strRemainder1, 0)
	, [Name] = LEFT(@strRemainder1, @intOffset2-2)
	, @strRemainder2 = STUFF(@strRemainder1, 1, @intOffset2+6, '')
	, @intOffset3 = CHARINDEX(') at ', @strRemainder2, 0)
	, [Card] = LEFT(@strRemainder2, @intOffset3-1)
	, @strRemainder3 = STUFF(@strRemainder2, 1, @intOffset3+1, '')
	, @intOffset4 = PATINDEX('% [A-Z][A-Z][0-9][0-9] %', @strRemainder3)
	, [Code] = SUBSTRING(@strRemainder3, @intOffset4+1, 4)
	, @strRemainder4 = STUFF(@strRemainder3, 1, @intOffset4+5, '')
	, @intOffset5 = CHARINDEX('[', @strRemainder4, 0)
	, [Direction] = REPLACE(STUFF(@strRemainder4, 1, @intOffset5, ''), ']', '')
	, [Building] = LEFT(@strRemainder4, @intOffset5-2)
FROM	@testTable AS U

SELECT	*
FROM	@testTable

#8

Kristen,
Your version has worked very well. I am very grateful for the time and help. This version does not require additional functions and the results are parsed perfectly even for cases not shared in my sample. I will study your code. :slight_smile:


#9

It might run like a 3-legged-dog though ... hopefully someone here will have an opinion on that [compared to the FUNCTION() suggestions]


#10

Kristen - what does the execution plan look like for your version? In the versions using either function, CHARINDEX alone identifies where to split the data, plus REPLACE and SUBSTRING for a couple of items.

I would be concerned with all of the additional processing required for each string-function call in your version - and the passing of that data through the variables to the next call.

The solution above that uses the function fnGetStringElement - changing it to use CROSS APPLY instead of OUTER APPLY reduces the execution plan to 2 operations: a table scan to get the data and a Compute Scalar to get the values. That Compute Scalar is executed once over the full set of data.

An interesting note - changing the scalar function to use a physical Tally table results in an execution plan that has one table scan and 3 compute scalar operations. The execution IO and TIME are more than 3 times faster using the in-memory tally table on my system.

All of this is academic until tested against a million rows to see the actual impact.


#11

Using

SET STATISTICS IO ON; SET STATISTICS TIME ON

I get:

Table '#A2C82376'. Scan count 1, logical reads 4, physical reads 0 ....

not sure that tells us anything - it made one pass of the data, but might have spent hours fiddling with the local working variables etc.

With:

SET SHOWPLAN_TEXT ON

I get:

|--Table Update(OBJECT:(@testTable AS [U])
	, SET:([@intOffset1] = [Expr1003]
		,[@strRemainder1] = [Expr1005]
		,[@intOffset2] = [Expr1006]
		,[@strRemainder2] = [Expr1008]
		,[@intOffset3] = [Expr1009]
		,[@strRemainder3] = [Expr1011]
		,[@intOffset4] = [Expr1012]
		,[@strRemainder4] = [Expr1014]
		,[@intOffset5] = [Expr1015]
		,@testTable.[Action] as [U].[Action] = [Expr1004]
		,@testTable.[Name] as [U].[Name] = [Expr1007]
		,@testTable.[Card] as [U].[Card] = [Expr1010]
		,@testTable.[Code] as [U].[Code] = [Expr1013]
		,@testTable.[Direction] as [U].[Direction] = [Expr1016]
		,@testTable.[Building] as [U].[Building] = [Expr1017]))
|--Compute Scalar(DEFINE:([Expr1003]=[@intOffset1] = charindex(' ',@testTable.[Journal] as [U].[Journal],(0))
		, [Expr1004]=CONVERT_IMPLICIT(varchar(100),substring(@testTable.[Journal] as [U].[Journal],(1),[@intOffset1]-(1)),0)
		, [Expr1005]=[@strRemainder1] = stuff(@testTable.[Journal] as [U].[Journal],(1),[@intOffset1],'')
		, [Expr1006]=[@intOffset2] = charindex('(Card #',[@strRemainder1],(0))
		, [Expr1007]=CONVERT_IMPLICIT(varchar(300),substring([@strRemainder1],(1),[@intOffset2]-(2)),0)
		, [Expr1008]=[@strRemainder2] = stuff([@strRemainder1],(1),[@intOffset2]+(6),'')
		, [Expr1009]=[@intOffset3] = charindex(') at ',[@strRemainder2],(0))
		, [Expr1010]=CONVERT_IMPLICIT(varchar(60),substring([@strRemainder2],(1),[@intOffset3]-(1)),0)
		, [Expr1011]=[@strRemainder3] = stuff([@strRemainder2],(1),[@intOffset3]+(1),'')
		, [Expr1012]=[@intOffset4] = patindex('% [A-Z][A-Z][0-9][0-9] %',[@strRemainder3])
		, [Expr1013]=CONVERT_IMPLICIT(varchar(40),substring([@strRemainder3],[@intOffset4]+(1),(4)),0)
		, [Expr1014]=[@strRemainder4] = stuff([@strRemainder3],(1),[@intOffset4]+(5),'')
		, [Expr1015]=[@intOffset5] = charindex('[',[@strRemainder4],(0))
		, [Expr1016]=CONVERT_IMPLICIT(varchar(30),replace(stuff([@strRemainder4],(1),[@intOffset5],''),']',''),0)
		, [Expr1017]=CONVERT_IMPLICIT(varchar(300),substring([@strRemainder4],(1),[@intOffset5]-(2)),0)))
    |--Table Scan(OBJECT:(@testTable AS [U]))

and I'm not sure that tells us anything either?


#12

I would bet the graphical execution plan shows a table scan - and multiple scalar computations. The CONVERT_IMPLICIT entries are also concerning because SQL Server is now converting the data multiple times as it moves between expressions.
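Those implicit conversions can at least be made explicit, sized to the destination columns, with CAST - a cosmetic sketch against the first assignment of the quirky update (it does not change behavior, just makes the conversion visible in the plan):

```sql
-- Sketch: explicit CAST to the destination column's type ([Action] is
-- varchar(100) in the test table above).
Declare @intOffset1 int;

Update U
   Set @intOffset1 = charindex(' ', Journal)
     , [Action] = Cast(left(Journal, @intOffset1 - 1) As varchar(100))
  From @testTable As U;
```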

For the 3 row sample - what were the execution times?


#13

6 ms to INSERT the data into the @testTable, and 0 ms to do the UPDATE. Not at all helpful ...


#14

Added to the INSERT - after inserting the 3 rows I then inserted them cross joined to a tally table to create 24,003 rows.

That INSERT took 80 ms and the UPDATE took 600 ms.

UPDATE:
Table '#B8B76495'. Scan count 1, logical reads 46994, physical reads 0, ...

 SQL Server Execution Times:
   CPU time = 593 ms,  elapsed time = 601 ms.

SHOWPLAN looks to be exactly the same
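For anyone reproducing this, the inflation INSERT was along these lines - a sketch, the exact tally source may differ (3 seed rows x 8,000 tally rows, plus the 3 originals, = 24,003):

```sql
-- Sketch: multiply the 3 sample rows via a cross join to an 8,000-row tally.
Insert Into @testTable (Journal)
Select t.Journal
  From @testTable t
 Cross Join (Select Top (8000) row_number() over(Order By (Select Null)) As N
               From sys.all_columns a Cross Join sys.all_columns b) tally;
```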


#15

In attempting to find a BASELINE I substituted this:

UPDATE	U
SET	[Action] = 'Admitted'
	, [Name] = 'Rozgonyi, Alexandra'
	, [Card] = '#106958'
	, [Code] = 'CO01'
	, [Direction] = 'In'
	, [Building] = '0116 Lobby to North Wing Entry'
FROM	@testTable AS U

Not sure that updating to CONST strings is valid, as compared to updating to SUBSTRING bits of the [Journal], but it's a start!

UPDATE:
Table '#A14CAA96'. Scan count 1, logical reads 48894, physical reads 0, ...

 SQL Server Execution Times:
   CPU time = 407 ms,  elapsed time = 453 ms.

SHOWPLAN:
|--Table Update(OBJECT:(@testTable AS [U])
	, SET:(@testTable.[Action] as [U].[Action] = [Expr1003]
	,@testTable.[Name] as [U].[Name] = [Expr1004]
	,@testTable.[Card] as [U].[Card] = [Expr1005]
	,@testTable.[Code] as [U].[Code] = [Expr1006]
	,@testTable.[Direction] as [U].[Direction] = [Expr1007]
	,@testTable.[Building] as [U].[Building] = [Expr1008]))
|--Compute Scalar(
	DEFINE:([Expr1003]='Admitted'
	, [Expr1004]='Rozgonyi, Alexandra'
	, [Expr1005]='#106958'
	, [Expr1006]='CO01'
	, [Expr1007]='In'
	, [Expr1008]='0116 Lobby to North Wing Entry'))
|--Table Scan(OBJECT:(@testTable AS [U]))

#16

Okay - the method using the function fnGetStringElement and CROSS APPLY takes much less time.

(3 row(s) affected)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

(3 row(s) affected)
Table '#B31A2D39'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(3 row(s) affected)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 4 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Here is the plan:

  |--Table Update(OBJECT:(@testTable AS [t]), SET:(@testTable.[Action] as [t].[Action] = [Expr1008],@testTable.[Name] as [t].[Name] = [Expr1009],@testTable.[Code] as [t].[Code] = [Expr1010],@testTable.[Building] as [t].[Building] = [Expr1011],@testTable.[Direction] as [t].[Direction] = [Expr1012]))
       |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(varchar(100),[Reports].[dbo].[fnGetStringElement](@testTable.[Journal] as [t].[Journal],' ',(1)),0), [Expr1009]=CONVERT_IMPLICIT(varchar(300),replace([Reports].[dbo].[fnGetStringElement](replace([Reports].[dbo].[fnGetStringElement](@testTable.[Journal] as [t].[Journal],'(',(1)),', ',','),' ',(2)),',',', '),0), [Expr1010]=CONVERT_IMPLICIT(varchar(40),[Reports].[dbo].[fnGetStringElement]([Reports].[dbo].[fnGetStringElement](@testTable.[Journal] as [t].[Journal],'#',(2)),')',(1)),0), [Expr1011]=CONVERT_IMPLICIT(varchar(300),[Reports].[dbo].[fnGetStringElement](substring(@testTable.[Journal] as [t].[Journal],charindex(' at ',@testTable.[Journal] as [t].[Journal],(1))+(4),(255)),'[',(1)),0), [Expr1012]=CONVERT_IMPLICIT(varchar(30),[Reports].[dbo].[fnGetStringElement](replace(@testTable.[Journal] as [t].[Journal],']',''),'[',(2)),0)))
            |--Table Scan(OBJECT:(@testTable AS [t]))

Looking at the plan - we see 2 things going on...the table scan and the compute scalar...and this took only 4ms.


#17

have you got a comparative run time on your machine (for my code)?

(My timing was for a table with 24,003 rows)


#18

No - I do not have a comparison run time... I also do not have a table with 24,000 rows at this time. It will take some time for me to put together something to test...

Any way you can take the function in this thread and run it against your test table?


#19

Hehehe! So I can ... what WAS I thinking?!! ... I'll report back ...


#20

I got some time to test further...

I created a table with 300,003 rows and tested all 3 methods. As I expected, the quirky update method from @Kristen performed best at about 1.5 seconds; the splitter version (DelimitedSplit8K) took about 5.5 seconds; and the scalar version took about 100 seconds.

Quirky Update:

 SQL Server Execution Times:
   CPU time = 1482 ms,  elapsed time = 1491 ms.

DelimitedSplit8K:

 SQL Server Execution Times:
   CPU time = 9079 ms,  elapsed time = 5637 ms.

If you can use the update method, that will scale best - but if you are just looking to get output to send to a client, the splitter version would probably be better. However, you want to be careful with the splitter version, as it is quite a bit more expensive:

Table 'testTable2'. Scan count 9, logical reads 907794, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 916077, logical reads 1832172, physical reads 0, read-ahead reads 3474, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Both the quirky update and the scalar versions have minimal IO - in fact they have the same:

Table 'testTable2'. Scan count 1, logical reads 7095, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The difference for the scalar version is all in CPU time and that is significantly higher than any of the other methods.

Just my observation... none of the methods took too long, even at 100+ seconds for the scalar method. If I had to parse this kind of data from a large set of inputs every day then I would have to consider performance - with that said, my preference would be the version that is simpler to manage and maintain.