Having Trouble Removing Hidden Character in SQL Data

jgoodsonusa · February 26, 2020, 5:16pm

When cleaning up data held in a nVarchar(4000) column, we discovered a specific unwanted character that we can't remove without removing a legitimate character. We call this unwanted character the "long space". it has an ascii value of 63

select ascii(' ') -- returns 63

Unfortunately a "?" has the same ascii value, so any efforts to remove the long space result in removing question marks.

select ascii('?') -- returns 63

We are stumped!! Guidance appreciated!!!

Below are some examples that we have experimented with

declare @string1 varchar(100) =  'Green &#160;Hello world'
select charindex(' ',@string1,1)  -- returns 6

declare @string2 varchar(100) =  '?Green &#160;Hello world'
select charindex(' ',@string2,1)  -- returns 1

select ascii(' ')  -- returns 63
select ascii('?')  -- returns 63

mike01 · February 26, 2020, 8:05pm

When I run your statements on SQL 2016, I get different results

declare @string1 varchar(100) = 'Green Hello world'
select charindex(' ',@string1,1) -- returns 6

declare @string2 varchar(100) = '?Green Hello world'
select charindex(' ',@string2,1) -- returns 7

select ascii(' ') -- returns 32
select ascii('?') -- returns 63

Solomon.Rutzky · February 27, 2020, 6:46am

Hello. There are two main issues here:

You are mixing NVARCHAR data with VARCHAR literals and variables. VARCHAR is an 8-bit datatype that can only hold characters found on the code page associated with the collation being used. For string literals and variables, the collation being used is the database's default collation. Outside of UTF-8, which is the 8-bit Unicode encoding (and is only an option starting with SQL Server 2019), single byte code pages (the majority of them) can only contain, at most, 256 characters. NVARCHAR is Unicode (which contains a potential 1,114,112 characters ; but only about 138k of them have been mapped to actual characters).
What all of that means is: when you store Unicode values in non-UTF8 VARCHAR variables, literals, or even columns, then you lose the original Unicode character IF that character does not exist on the code page being used by the collation of that VARCHAR destination. That is why your ascii(' ') test returned 63. It's not that the original character and ? both share the same numeric value (as that is not possible), but the original character got converted into a "?", which is the default replacement character if a character can't be found in the destination code page.
Since you are dealing with Unicode / NVARCHAR data, you need to do 3 things:
1. Use NVARCHAR for variables, not VARCHAR
2. Prefix string literals with an upper-case "N"
3. Use the UNICODE() function instead of the ASCII() function to get the numeric value of the first character in a string.
The following example should help:
```
SELECT NCHAR(2345), ASCII(NCHAR(2345)), 'ऩ', N'ऩ';
-- ऩ	63	?	ऩ
SELECT ASCII('ऩ'), ASCII(N'ऩ'), UNICODE('ऩ'), UNICODE(N'ऩ');
-- 63	63	63	2345
```
When needing to deal with a special code point / character, use a binary collation (i.e. one ending in _BIN2). You use this in a COLLATE clause added to any predicate where you want to override what the collation would have been (or prevent an error if there was no initial collation). For example:
```
DECLARE @Test NVARCHAR(500) = 
   N'aaa bbb' + NCHAR(0xFFFE) + N'ccc';

SELECT @Test,
       REPLACE(@Test, NCHAR(0xFFFE), N'!'),
       REPLACE(@Test COLLATE Latin1_General_100_BIN2,
               NCHAR(0xFFFE), N'!');
-- aaa bbbccc		aaa bbbccc		aaa bbb!ccc
```

For more info on working with collations (especially in SQL Server), please visit:
Collations Info

jgoodsonusa · February 27, 2020, 2:17pm

Awesome!!! Thank you VERY MUCH!!!!

Solomon.Rutzky · February 27, 2020, 6:05pm

You're welcome. To be clear, using a binary collation won't help if you forget to indicate that the data is NVARCHAR, either by declaring the variable to be NVARCHAR or prefixing the string literal with an upper-case "N". For example (and this time I used Code Point U+2007 which is "Figure Space", another type of space character):

SELECT CHARINDEX(' ', 'aa?bb cc');
-- 3

SELECT CHARINDEX(' ' COLLATE Latin1_General_100_BIN2, 'aa?bb cc');
-- 3

SELECT CHARINDEX(N' ', N'aa?bb cc');
-- 6

Also, you might find it helpful to read:

How Many Bytes Per Character in SQL Server: a Completely Complete Guide

Solomon.Rutzky · February 27, 2020, 6:09pm

Mike, for whatever reason, when the O.P. copied and pasted that code into the editor, it did not keep the original characters, and so changed the actual character that is causing the issue to a regular space. Try running my example from this reply: Having Trouble Removing Hidden Character in SQL Data

Solomon.Rutzky · March 3, 2020, 3:33pm

You're welcome. If that info solved your issue, please don't forgot to accept my first post as the "answer". Thanks!