Identifying case anomalies

Hello,

How can I identify records where there is a case anomaly?

Example: RastaPickles, rastaPickles, rastapickles and Rastapickles are all treated the same by SQL.

I'm trying to clean up some data that has been imported via a spreadsheet.

Is there a way?

Play with the collation in an ORDER BY context:

with cte(col) as (select 'RastaPickles' union all select 'rastaPickles' union all select 'rastapickles' union all select 'Rastapickles')
select *, cast(col as varbinary(128)) binary_value
, rank() OVER (order by col collate SQL_Latin1_General_CP1_CS_AS) CS_order
, rank() OVER (order by col collate SQL_Latin1_General_CP1_CI_AS) CI_order
from cte order by col collate SQL_Latin1_General_CP1_CS_AS

You'd have to define "anomaly" though, meaning which letter case, for which letters, is the valid case you want, and then marking others as invalid/anomalous.

1 Like

Very good but if I have 20,000 records and 3,000 of them might be iffy, is there a better way than having to qualify them in a SELECT statement?

Not that I can think of.

Again, you have to establish the valid casing, and if in doubt, make its binary representation the standard. After that, you'd have to compare using a case-insensitive collation and update the anomalies to match the standard.

Here's an example, using the lowest case-sensitive sort value for each group:

with cte(col) as (select 'RastaPickles' union all select 'rastaPickles' union all select 'rastapickles' union all select 'Rastapickles'
union all select 'RobVolk' union all select 'robVolk' union all select 'robvolk' union all select 'Robvolk')
,cte2 as (
select *, cast(col as varbinary(128)) binary_value
, rank() OVER (partition by upper(col) order by col collate SQL_Latin1_General_CP1_CS_AS) CS_order
, rank() OVER (partition by upper(col) order by col collate SQL_Latin1_General_CP1_CI_AS) CI_order
from cte)
select *, min(col) over (partition by upper(col) order by cs_order) standard_value
from cte2
1 Like

The update query

update MyTable set MyField = 'RastaPickles' where MyField = 'rastaPickles'

will update all fields (RASTApickles, RaStaPicles, ...) to the value 'RastaPickles' whatever characters are in capitals or not.

So you can create a worktable and populate it with the distinct values of the column. Correct these values to the correct writing and then create an update query joining the worktable to the target table.

1 Like