SQLTeam.com | Weblogs | Forums

Trying to detect incorrect capitalisation in names


#1

Hello,

I have been asked to create a test to check whether names have the first letter capitalised and then the remainder of the name in lower case.

I have tried to come up with a script to try this but as expected since SQL doesn't recognise the difference between lower case and upper case, it isn't working.

Does anyone have an idea how they would use SQL to do this?

My attempt:

select client_id, crms_number, first_name, last_name,
SUBSTRING(first_name, 2, 1),
lower(SUBSTRING(first_name, 2, 1)),
case when SUBSTRING(first_name, 2, 1) = lower(SUBSTRING(first_name, 2, 1)) then 1 else 0 end as [Agree]
from dbo.ZZ_crms_student

Thanks in advance for your help!


#2

Presumably you have a Case Insensitive collation so your comparison of Upper and Lower case will always be true :frowning: You can fix that by forcing the Collation (of the comparison) to BINARY:

when SUBSTRING(first_name, 2, 1) = lower(SUBSTRING(first_name, 2, 1)) COLLATE Latin1_General_BIN2

What about name prefixes like "d'Arcy" that would not be capitalised?


#3
CREATE TABLE #SimpleText (
	FirstName VARCHAR(20),
    LastName VARCHAR(20)
	);
INSERT #SimpleText(FirstName, LastName) VALUES
    ('bob', 'jones'),
    ('Alice', 'Cooper'),
    ('matt', 'Miller'),
    ('Steve', 'jones');


-- if all you want is detection, this should work.
SELECT 
   st.FirstName, 
   st.LastName,
   is_FN_cap = CASE WHEN ASCII(SUBSTRING(st.FirstName, 1, 1)) BETWEEN 65 AND 90 THEN 1 ELSE 0 END,
   is_LN_cap = CASE WHEN ASCII(SUBSTRING(st.LastName, 1, 1)) BETWEEN 65 AND 90 THEN 1 ELSE 0 END
FROM 
    #SimpleText st;

--  IF you want TO actually make a correction, you can do something like this.
SELECT 
  st.FirstName,
  st.LastName,
  ic.is_FN_cap,
  ic.is_LN_cap,
  f_fix = STUFF(st.FirstName, 1, 1, u.f_upper),
  l_fix = STUFF(st.LastName, 1, 1, u.l_upper)
FROM 
    #SimpleText st
    CROSS APPLY ( VALUES (SUBSTRING(st.FirstName, 1, 1), SUBSTRING(st.LastName, 1, 1)) ) fl (f_initial, l_initial)
    CROSS APPLY ( VALUES (
                CASE WHEN ASCII(fl.f_initial) BETWEEN 65 AND 90 THEN 1 ELSE 0 END,
                CASE WHEN ASCII(fl.l_initial) BETWEEN 65 AND 90 THEN 1 ELSE 0 END
                ) ) ic (is_FN_cap, is_LN_cap)
    CROSS APPLY ( VALUES (
                CASE WHEN ASCII(fl.f_initial) BETWEEN 65 AND 90 THEN fl.f_initial ELSE CHAR(ASCII(fl.f_initial) - 32) END,
                CASE WHEN ASCII(fl.l_initial) BETWEEN 65 AND 90 THEN fl.l_initial ELSE CHAR(ASCII(fl.l_initial) - 32) END
                ) ) u (f_upper, l_upper);

First result set...

FirstName            LastName             is_FN_cap   is_LN_cap
-------------------- -------------------- ----------- -----------
bob                  jones                0           0
Alice                Cooper               1           1
matt                 Miller               0           1
Steve                jones                1           0

Second result set...

FirstName            LastName             is_FN_cap   is_LN_cap   f_fix                l_fix
-------------------- -------------------- ----------- ----------- -------------------- --------------------
bob                  jones                0           0           Bob                  Jones
Alice                Cooper               1           1           Alice                Cooper
matt                 Miller               0           1           Matt                 Miller
Steve                jones                1           0           Steve                Jones

#4

Thanks @Kristen, this seems to work. I don't know how so I will google this to try and understand. Thanks! On your second point, it's worth thinking about at this stage but this is just a test to check for errors so we can ignore ones that are correct, even though I might come up with a more sophisticated methodology in the future!

Thanks! :grinning:


#5

Thanks @Jason_A_Long, this also works perfectly. Thanks so much for taking the time to write the syntax for this. Really appreciated. I also don't know what ASCII means so will google that too!

Good day!


#6

My pleasure. I'm glad I could help.
ASCII (American Standard Code for Information Interchange) is a type of standardized 8 bit character encoding. So any time you're using the CHAR or VARCHAR data types, you using ascii characters (I think... it may vary by collation setting...) and opposed to NCHAR and NVARCHAR which is UNICODE, a 16 bit character encoding (which is why unicode takes up twice as much space.

Fun history, but not a practical answer either... Both ASCII & UNICODE have a numeric value assigned to each character... The ASCII function (for char & varchar) allows to to will accept a character as an input and display that characters assigned number. There is also a reciprocal function, CHAR() which will take a numeric value and return the corresponding text character. In case you were wandering, yes, there are similar functions for Unicode. UNICODE() & NCHAR().

The solution works because, even though SQL Server may be using a case insensitive collation, the standard still assigns every letter 2 different numerical values... one for the upper and one for the lower. which means that you can always detect a case difference by comparing those numerical values.

SELECT 
	t.n,
	ASCII_Char = CHAR(t.n),
	Unicode_Char = NCHAR(t.n)
FROM
	(
	SELECT TOP 5000
		n = ROW_NUMBER() OVER (ORDER BY ac.object_id)
	FROM
		sys.all_columns ac
	) t