Data Quality checking of strings, delimited by [_]

Hello all!

I would like some help creating a data quality check query, I have a tag which is made up of five elements separated by [] so one check would be:
WHERE first_source_tag NOT LIKE '%[
]%[]%[]%[_]%'

However, I then want to check the different elements so I need to know how to tell SQL that I want to check say the fifth element to check that it is not a date in the future and not a date in the wrong format, does anyone know if and how this would be possible?

Thanks in advance!

My initial reaction is that your design is wrong and should be corrected. If you have five different attributes of an entity, then there should be five attributes in your relation; five columns in your table. While it would be possible to parse the string using the delimiter and string functions (like charindex and substring), you'd be better off splitting the data at the application layer and letting SQL enforce data integrity for each element.

1 Like

You can use CROSS APPLYs to assign variable names to intermediate values to make the process of validation easier and "cleaner" to read. You didn't provide many details, but here's the general outline:

SELECT ...
FROM table_name
CROSS APPLY (
    SELECT RIGHT(first_source_tag, CHARINDEX('_', REVERSE(first_source_tag)) - 1) AS tag5
) AS assign_alias_names
CROSS APPLY (
    SELECT <check_tag5_for_valid_date>
) AS assign_alias_names2
...

Thanks Stephen, I would love that but I work in an international company and I don't have database administration rights as that sits in another country. :frowning:

Thanks Scott, I will have a look at this soon! :slight_smile:

You really need to provide sample data as follows

use sqlteam
go

create table #tags(first_source_tag  nvarchar(150))

insert into #tags
select '2019-01-01[_]2019-01[_]I_am_not_a_date[_]2020-01-01[_]wack_data_design'

SELECT Split.a.value('.', 'VARCHAR(100)') AS Data , 
        case 
		when ISDATE(Split.a.value('.', 'VARCHAR(100)')) = 1 then 
		   case 
		      when cast(Split.a.value('.', 'VARCHAR(100)') as date) > getdate() then 1
			  else 0
			  end
        else 0
		end as isFutureDate
 FROM  
 (
     SELECT CAST ('<M>' + REPLACE(first_source_tag, '[_]', '</M><M>') + '</M>' AS XML) AS Data  
     FROM  #tags
 ) AS A CROSS APPLY Data.nodes ('/M') AS Split(a); 

drop table #tags