Inserting strings Before and After a Match-String

Kristen · January 3, 2018, 2:57pm

I want to insert strings before and after a matched string so that, for example, aaa yyy bbb becomes aaa [yyy] bbb

I used to do this with

REPLACE(MyColumn, 'YYY', '[YYY]')

but a disadvantage of this is that if the match is case INsensitive the replacement may be the wrong case - as in this example because aaa yyy bbb would become aaa [YYY] bbb

It would also be helpful, as a refinement, if the Match could be a wildcard (e.g. to match a whole word, rather than partial words).

The replace needs work for multiple occurrences in the string

The code I've come up with seems very clunky, and there must be a better way?

DECLARE @strMatch	varchar(100) = 'yyy'
	, @strInsBefore	varchar(100) = '['
	, @strInsAfter	varchar(100) = ']'
	, @intOffset	int = 0
	, @intStrLen	int	-- length of search string (assumed no variable-length wildcards)
	, @intStrLenReplaced int
	, @intLenReplace int	-- total additional length of replacement before/after strings
	, @intLoop	int = 1
	, @intRowCount	int = 1	-- Force first iteration

CREATE TABLE #TEST
(
	[T_ID]		int identity(1, 1) NOT NULL
	, T_offset	int NULL
	, T_Len		int NULL
	, T_MatchPosition int NULL
	, T_RightLen	int NULL
	, StringOrig	varchar(30) NOT NULL
	, StringNew	varchar(30) NULL
	, StringExpected varchar(30) NOT NULL
)

The Test data and Expected Results:

INSERT INTO #test(StringOrig, 			StringExpected)
	  SELECT 'xxx yyy zzz',				'xxx [yyy] zzz'			-- Embedded
UNION ALL SELECT 'yyy zzz',				'[yyy] zzz'				-- Leading
UNION ALL SELECT 'xxx yyy',				'xxx [yyy]'				-- Trailing
UNION ALL SELECT 'xxx yyy zzz yyy zzz',	'xxx [yyy] zzz [yyy] zzz' -- Multiple Embedded
UNION ALL SELECT 'xxx yyy zzz yyy',		'xxx [yyy] zzz [yyy]'	-- Embedded and Trailing
UNION ALL SELECT 'yyy zzz yyy zzz',		'[yyy] zzz [yyy] zzz'	-- Leading and Embedded
UNION ALL SELECT 'xxx yyyyyy zzz',		'xxx [yyy][yyy] zzz'	-- Adjacent matches
UNION ALL SELECT 'xxx zzz',				'xxx zzz'				-- No match
-- EDIT: Additional tests:
UNION ALL SELECT 'xxx [yyy] zzz',		'xxx [[yyy]] zzz'		-- Already replaced
UNION ALL SELECT 'xxx YyY zzz',			'xxx [YyY] zzz'			-- Mixed-Case Test

Assign working values etc.

UPDATE	U
SET	StringNew = StringOrig	-- Copy Original String into New String
	-- Assign working values
	, T_offset = 1
	, T_Len = LEN(StringOrig)
FROM	#TEST AS U

SELECT	@intStrLen = LEN(@strMatch)
	, @intLenReplace = LEN(@strInsBefore) + LEN(@strInsAfter)
	, @intStrLenReplaced = @intLenReplace

--DEBUG:
SELECT	[@intStrLen]=@intStrLen, [@intLenReplace]=@intLenReplace, [@intStrLenReplaced]=@intStrLenReplaced

Match and Replace loop:

WHILE @intLoop <= 10 AND @intRowCount >= 1
BEGIN
	UPDATE	U
	SET	StringNew = CASE WHEN MatchPosition IS NOT NULL
				THEN LEFT(StringNew, MatchPosition - 1)
					+ @strInsBefore
					+ SUBSTRING(StringNew, MatchPosition,  @intStrLen)
					+ @strInsAfter
					+ CASE WHEN RightLen < 1 THEN '' ELSE RIGHT(StringNew, RightLen) END
				ELSE StringNew
				END
		, T_offset = MatchPosition + @intStrLenReplaced
		, T_Len = T_Len + @intLenReplace
		-- For debugging use only:
		, T_MatchPosition = MatchPosition
		, T_RightLen = RightLen
	FROM	#TEST AS U
		CROSS APPLY
		(
			SELECT	[MatchPosition] = NullIf(PATINDEX('%' + @strMatch + '%'
							, SUBSTRING(StringNew, T_offset, T_Len)
							), 0)
						+ T_offset - 1
		) AS X1
		CROSS APPLY
		(
			SELECT	[RightLen] = T_Len - MatchPosition - @intStrLen+1
		) AS X2
	WHERE	StringNew LIKE '%' + @strMatch + '%'
		AND T_offset IS NOT NULL
	SELECT	@intRowCount = @@ROWCOUNT

	-- DEBUG:
	PRINT CONCAT('Loop=', @intLoop, ', @intRowCount=', @intRowCount)
	SELECT [@intLoop]=@intLoop, * FROM #TEST

	SELECT	@intLoop = @intLoop + 1	-- Probably don't need a finite limit, @intRowCount=0 will do, so this just to stop "runaway" during DEV
END

Results output:

SELECT	[Diff] = CASE WHEN StringNew = StringExpected COLLATE Latin1_General_BIN THEN 'OK' ELSE 'Diff' END
	, *
FROM	#TEST
GO
DROP TABLE #TEST
GO

andrewbb · January 3, 2018, 3:20pm

I think Kristen needs a vacation. She coded too much over the holidays.

gbritton · January 3, 2018, 3:35pm

[quote="Kristen, post:1, topic:12159"]
THEN LEFT(StringNew, MatchPosition - 1)
					+ @strInsBefore
					+ SUBSTRING(StringNew, MatchPosition,  @intStrLen)
					+ @strInsAfter
					+ CASE WHEN RightLen &lt; 1 THEN '' ELSE RIGHT(StringNew, RightLen) END
[/quote]

In this section, you could use QUOTENAME instead of Before and After variables. Also, the CASE for RightLen may not be necessary. If RightLen >= 0, RIght(, RightLen) will do the right thing. OTOH if RightLen can be less than 0 you could do:

RIGHT(StringNew, IIF(RightLen < 0, 0, RightLen))

which is a little more compact.

Also, I wonder if you could borrow Jeff's technique for CSV parsing, using a tally table, instead of an explicit loop. Just a thought

Kristen · January 3, 2018, 3:41pm

Sorry, just an example, in practice likely to be a longer strnig - e.g. to "inject" something that will highlight the string - such as an HTML <SPAN> with CLASS to provide background colour etc.

Good idea, thanks.

That was in my mind, but I was hoping someone here would know of a way to do that, or have some code lying around already

If I could get a "list" of all matches for PATINDEX() I could then replace them right-to-left (so that earlier inserts don't knock-on later ones) e.g. using STUFF

Hmmm ... now I've written that it doesn't sound so hard, and I do have a Pattern Matching function that uses a Tally Table - which enables it to just use LIKE rather than PATINDEX() ... I've have a go

andrewbb · January 3, 2018, 3:44pm

This sounds like a job for C# (or similar), not SQL.

SQL editing HTML belongs in software, not database. Arguably CSV parsing as well, but it depends.

Kristen · January 3, 2018, 3:45pm

Yeah, I do know that, but in this case that's not an option.

andrewbb · January 3, 2018, 3:46pm

Sounds like they are making you do all the work.

ScottPletcher · January 3, 2018, 4:41pm

What if the value already has the matching chars around it?

For an initial value of:
'xxx [yyy] zzz'
should the result be:
'xxx [[yyy]] zzz' OR
'xxx [yyy] zzz'?

Kristen · January 3, 2018, 4:49pm

^^^^ This

I think if I did NOT want that behaviour I would have to use a pattern to exclude that, so perhaps something like

'%[^\[A-Z]YYY[^\]A-Z]%'

to match where "YYY" is a "whole word" and not already replaced

UNION ALL SELECT 'xxx [yyy] zzz',			'xxx [[yyy]] zzz'			-- Already replaced
...
SELECT	StringOrig
	, [Matched] = CASE WHEN ' ' + StringOrig + ' ' LIKE '%[^\[A-Z]YYY[^\]A-Z]%' ESCAPE '\' THEN 1 ELSE 0 END -- ' (Fix for MarkDown error)
FROM	#TEST

which also excludes 'xxx yyyyyy zzz' because "YYY" is not a "whole word"

yosiasz · January 3, 2018, 4:56pm

there could be so many permutations but first cut at this in a simple way

;with cteBeforeAfter
as
(
SELECT 'xxx [yyy] zzz' as b,				'xxx [yyy] zzz' as a			-- Embedded
UNION ALL SELECT 'yyy zzz',				'[yyy] zzz'				-- Leading
UNION ALL SELECT 'xxx yyy',				'xxx [yyy]'				-- Trailing
UNION ALL SELECT 'xxx yyy zzz yyy zzz',	'xxx [yyy] zzz [yyy] zzz' -- Multiple Embedded
UNION ALL SELECT 'xxx yyy zzz yyy',		'xxx [yyy] zzz [yyy]'	-- Embedded and Trailing
UNION ALL SELECT 'yyy zzz yyy zzz',		'[yyy] zzz [yyy] zzz'	-- Leading and Embedded
UNION ALL SELECT 'xxx yyyyyy zzz',		'xxx [yyy][yyy] zzz'	-- Adjacent matches
UNION ALL SELECT 'xxx zzz',				'xxx zzz'				-- No match
)
select b, 
       a, 
	   replace(replace(b,'[yyy]','yyy'),'yyy','[yyy]') as swing	   
  From cteBeforeAfter

Kristen · January 3, 2018, 5:02pm

Thanks, but that does not preserve the case of the REPLACE match

UNION ALL SELECT 'xxx YyY zzz',	'xxx [YyY] zzz' -- Mixed-Case Test

EDIT: New test data added to O/P

gbritton · January 3, 2018, 8:14pm

fwiw, emitting/parsing CSV is easy in SQL Server 2017 using Python to do the dirty work.

ScottPletcher · January 3, 2018, 8:52pm

You can use recursion for this. I wouldn't suggest this on an extremely large volume of data, but if the data size is reasonable, this should work ok. Note that the STUFFs must proceed from the last match back to the first, otherwise the calcs for which byte to stuff at get very complicated.

This code assumes the original string plus the added chars don't exceed 8000 bytes. If they do, naturally you'd have to switch to varchar(max), which would hurt performance that much more.

((Code still in progress))

ScottPletcher · January 3, 2018, 9:10pm

DECLARE @strMatch	varchar(100) = 'yyy'
	, @strInsBefore	varchar(100) = '['
	, @strInsAfter	varchar(100) = ']'
--SELECT @strInsBefore = '{BEFORE', @strInsAfter = 'AFTER}' --second test values

;WITH 
cte_find_strings_prep AS (
    SELECT T_ID, StringOrig, start_byte
    FROM #test
    CROSS APPLY (
        SELECT CHARINDEX(@strMatch, StringOrig) AS start_byte
    ) AS ca1
    WHERE start_byte > 0
    UNION ALL
    SELECT T_ID, StringOrig, start_byte2
    FROM cte_find_strings_prep
    CROSS APPLY (
        SELECT CHARINDEX(@strMatch, StringOrig, start_byte + LEN(@strMatch)) AS start_byte2
    ) AS ca2
    WHERE start_byte2 > 0
)
,
cte_find_strings AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY T_ID ORDER BY start_byte DESC) AS row_num
    FROM cte_find_strings_prep
)
,
cte_stuff AS (
    SELECT T_ID, start_byte, row_num, 
        STUFF(STUFF(StringPadded, cfs.start_byte + LEN(@strMatch), 0, @strInsAfter),
            cfs.start_byte, 0, @strInsBefore) AS StringNew    
    FROM cte_find_strings cfs
    CROSS APPLY (
        SELECT CAST(StringOrig + SPACE(8000) AS varchar(8000)) AS StringPadded
    ) AS ca1
    WHERE row_num = 1
    UNION ALL
    SELECT cfs.T_ID, cfs.start_byte, cfs.row_num AS row_num, 
        STUFF(STUFF(StringNew, cfs.start_byte + LEN(@strMatch), 0, @strInsAfter),
            cfs.start_byte, 0, @strInsBefore) AS StringNew
    FROM cte_stuff cs
    INNER JOIN cte_find_strings cfs ON cs.T_ID = cfs.T_ID AND cfs.row_num = cs.row_num + 1
)
SELECT t.*, ISNULL(RTRIM(cs.StringNew), t.StringOrig) AS StringNew
FROM #TEST t
LEFT OUTER JOIN cte_stuff cs ON t.T_ID = cs.T_ID AND
    cs.row_num = (SELECT MAX(row_num) FROM cte_find_strings cfs WHERE cfs.T_ID = cs.T_ID)
ORDER BY t.T_ID

andrewbb · January 4, 2018, 9:04am

Don't use Python. It's not strongly typed. Yes, I know the 'compiler figures it out'. The point is: the programmer doesn't know what type it is. You can pass around anything.

It's a tool designed by hackers for hackers.

gbritton · January 4, 2018, 1:36pm

Careful Andrew! Your prejudices are showing.

Python was designed by Guido van Rossum, a language guru on the same level as K&R, Stroustrup, the GoF and many others. I'd call them all hackers, in the Richard Stallman sense of the word. And, it is strongly typed (don't know where you got the other idea). I'm a Python programmer and I always know what the type is.

Many major websites are built on Django, which is written in Python. Nasa, Instagram, Pinterest, Mozilla and oh, about 5000 others. Fortune 100 business use Python in mission-critical systems. Hardly the space of a hacker tool. Data scientists around the world use Python in their daily work. The package management systems of virtually all Linux distros, and of course MacOS, are written in Python.

There was no good way to parse or emit complete RFC-compliant CSV in SQL Server before 2017, except by using a CLR assembly or an SSIS package. Now, you can do it with a few lines of Python called from T-SQL. Welcome to the new world!

bitsmed · January 5, 2018, 12:39am

Alternative you can use the DelimitedSplit8K by @JeffModen

Query

update a
   set a.stringnew=stuff((select ' '
                                +case
                                    when replace(replace(b.item,@strInsBefore,''),@strinsafter,'')=@strmatch
                                    then @strinsbefore+b.Item+@strinsafter
                                    else b.item
                                 end
                            from dbo.DelimitedSplit8K(a.stringorig,' ') as b
                           order by b.itemnumber
                             for xml path('')
                         )
                        ,1
                        ,1
                        ,''
                        )
  from #test as a
;

Edit: Forgot to sort, so added "order by".