Find rows with duplicate sub strings

kpgraci · June 6, 2020, 11:45am

Given a table with a column of varchar(max), I want to find all rows have a certain sub string that occurs more than once.

For example:
row 1: hello test world
row 2: hello world hello
row 3: hello test world test
row 3: hello test test test

I want to find all rows that contain the sub string 'test' at least twice, so rows 3 and 4 only in this case.

A simple attempt:

select * from mytable where mycol like '%test%%test%'

No real idea how to proceed beyond that.

Any help appreciated!

harishgg1 · June 6, 2020, 12:57pm

hi

there are several ways to do this

please see this article in which there are several
approaches

harishgg1 · June 6, 2020, 1:15pm

hi

one way is using XML method .. hope this helps

;WITH [cte]
AS (
       SELECT
            string1
            , LTRIM(RTRIM([m].[n].[value]('.[1]', 'varchar(8000)'))) AS [string_split]
       FROM
        (   SELECT
                string1
                , CAST('<XMLRoot><RowData>' + REPLACE(#temp.string1, 'test', '</RowData><RowData>')
                       + '</RowData></XMLRoot>' AS XML) AS [x]
            FROM
                #temp) AS [t]
        CROSS APPLY [x].[nodes]('/XMLRoot/RowData') AS [m]([n])
   )
     select 'SQL Output',string1,count(*)-1 from   [cte]
	 group by string1

harishgg1 · June 6, 2020, 4:20pm

drop table #temp

create table #temp 
(
string1  VARCHAR(200) 
)

insert into #temp select 'hello test world'
insert into #temp select 'hello world hello'
insert into #temp select 'hello test world test'
insert into #temp select 'hello test test test'

declare @string varchar(200) = 'test'

select 'String',@string

select 'Data',* from #temp

;WITH [cte]
AS (
       SELECT
            string1
            , LTRIM(RTRIM([m].[n].[value]('.[1]', 'varchar(8000)'))) AS [string_split]
       FROM
        (   SELECT
                string1
                , CAST('<XMLRoot><RowData>' + REPLACE(#temp.string1, 'test', '</RowData><RowData>')
                       + '</RowData></XMLRoot>' AS XML) AS [x]
            FROM
                #temp) AS [t]
        CROSS APPLY [x].[nodes]('/XMLRoot/RowData') AS [m]([n])
   )
     select 'SQL Output greater than or equal to 2',string1,count(*)-1 from   [cte]
	 group by string1
	 having count(*)-1 >= 2

yosiasz · June 6, 2020, 9:28pm

another way

select distinct string1
from (
select ROW_NUMBER() OVER (
      PARTITION BY string1
      ORDER BY string1
   ) row_num, *
  from #temp
  cross apply string_split(string1, ' ') a
  where value = @string
  ) a
  where row_num >= 2

ScottPletcher · June 7, 2020, 3:28am

Those seem overly complicated to me. How about this?:

DECLARE @number_of_string_matches_required smallint
DECLARE @string_to_find varchar(200)

SET @string_to_find = 'test'
SET @number_of_string_matches_required = 2

;WITH cte_test_data AS (
    SELECT * FROM (VALUES
        (1, 'hello test world'),
        (2, 'hello world hello'),
        (3, 'hello test world test'),
        (4, 'hello test test test'))
    AS data(row#, string)
)
SELECT *
FROM cte_test_data
WHERE (LEN(string) - LEN(REPLACE(string, @string_to_find, ''))) / 
    LEN(@string_to_find) >= @number_of_string_matches_required
ORDER BY row#

harishgg1 · June 7, 2020, 5:47am

hi Scott

I like your way .... had something like this in mind .. but could not implement

Mateen · October 3, 2020, 2:54pm

Fascinating!

Easy to remember trick to find nth number of occurrence of a string.

Love you.

ScottPletcher · October 4, 2020, 7:36pm

Thanks! Yep, it comes in very handy at times.