Find non-ASCII characters in varchar columns using SQL Server

Solution 1

Try something like this:

DECLARE @YourTable table (PK int, col1 varchar(20), col2 varchar(20), col3 varchar(20));
INSERT @YourTable VALUES (1, 'ok','ok','ok');
INSERT @YourTable VALUES (2, 'BA'+char(182)+'D','ok','ok');
INSERT @YourTable VALUES (3, 'ok',char(182)+'BAD','ok');
INSERT @YourTable VALUES (4, 'ok','ok','B'+char(182)+'AD');
INSERT @YourTable VALUES (5, char(182)+'BAD','ok',char(182)+'BAD');
INSERT @YourTable VALUES (6, 'BAD'+char(182),'B'+char(182)+'AD','BAD'+char(182)+char(182)+char(182));

--if you have a Numbers table use that, otherwise make one using a CTE
WITH AllNumbers AS
(   SELECT 1 AS Number
    UNION ALL
    SELECT Number+1
        FROM AllNumbers
        WHERE Number<1000
)
SELECT 
    pk, 'Col1' BadValueColumn, CONVERT(varchar(20),col1) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col1)
    WHERE ASCII(SUBSTRING(y.col1, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col1, n.Number, 1))>127
UNION
SELECT 
    pk, 'Col2' BadValueColumn, CONVERT(varchar(20),col2) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col2)
    WHERE ASCII(SUBSTRING(y.col2, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col2, n.Number, 1))>127
UNION
SELECT 
    pk, 'Col3' BadValueColumn, CONVERT(varchar(20),col3) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable           y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col3)
    WHERE ASCII(SUBSTRING(y.col3, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col3, n.Number, 1))>127
order by 1
OPTION (MAXRECURSION 1000);

OUTPUT:

pk          BadValueColumn BadValue
----------- -------------- --------------------
2           Col1           BA¶D
3           Col2           ¶BAD
4           Col3           B¶AD
5           Col1           ¶BAD
5           Col3           ¶BAD
6           Col1           BAD¶
6           Col2           B¶AD
6           Col3           BAD¶¶¶

(8 row(s) affected)
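
If you would rather avoid the recursive CTE (and with it the MAXRECURSION hint), a numbers table can be built once and reused. The following is only a sketch: #Numbers is a hypothetical temp table name, the cross join of sys.all_objects is just one common way to generate enough rows, and the sample data is cut down to a single column.

```sql
-- Build a 1..1000 numbers table once, without recursion.
-- #Numbers is a hypothetical name; a permanent table would work the same way.
SELECT TOP (1000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Number
INTO   #Numbers
FROM   sys.all_objects AS a
       CROSS JOIN sys.all_objects AS b;

-- Reduced sample data, modelled on the table variable above.
DECLARE @YourTable table (PK int, col1 varchar(20));
INSERT @YourTable VALUES (1, 'ok');
INSERT @YourTable VALUES (2, 'BA' + CHAR(182) + 'D');

-- Same per-character test as above; DISTINCT collapses rows that
-- contain more than one bad character.
SELECT DISTINCT y.PK, y.col1 AS BadValue
FROM   @YourTable AS y
       INNER JOIN #Numbers AS n ON n.Number <= LEN(y.col1)
WHERE  ASCII(SUBSTRING(y.col1, n.Number, 1)) < 32
    OR ASCII(SUBSTRING(y.col1, n.Number, 1)) > 127;

DROP TABLE #Numbers;
```

With a numbers table in place, neither the CTE nor the OPTION (MAXRECURSION 1000) line is needed.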

Solution 2

Here is a solution for the single column search using PATINDEX.
It also displays the StartPosition, InvalidCharacter and ASCII code.

select line,
  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) as [Position],
  substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1) as [InvalidCharacter],
  ascii(substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1)) as [ASCIICode]
from  staging.APARMRE1
where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) >0
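
To see why the binary collation matters, here is a minimal sketch on made-up data (the @Sample table and its rows are assumptions, not taken from the question). Under a binary collation the range ' -~' is compared by code value, so it covers codes 32 through 126. Under a dictionary-order collation the same range follows the collation's sort order and can match a different set of characters. Since '!' directly follows the space numerically, '[^ -~]' should behave the same as '[^ !-~]'.

```sql
-- Hypothetical sample rows; assumes a single-byte code page such as 1252,
-- so that CHAR(182) yields a one-byte character.
DECLARE @Sample table (Id int, Line varchar(50));
INSERT @Sample VALUES (1, 'plain text');
INSERT @Sample VALUES (2, 'tab' + CHAR(9) + 'inside');
INSERT @Sample VALUES (3, 'BA' + CHAR(182) + 'D');

SELECT Id,
       Line,
       -- range evaluated by code value (32..126)
       PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, Line) AS BinaryPosition,
       -- range evaluated by the column's own collation; results may differ
       PATINDEX('%[^ -~]%', Line) AS DefaultCollationPosition
FROM   @Sample;
```

BinaryPosition should come back as 0 for the clean row and as the position of the first offending character for the other two. What DefaultCollationPosition returns depends on the database's default collation.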

Solution 3

I've been running this bit of code with success:

declare @UnicodeData table (
     data nvarchar(500)
)
insert into 
    @UnicodeData
values 
    (N'Horse�')
    ,(N'Dog')
    ,(N'Cat')

select
    data
from
    @UnicodeData 
where
    data collate LATIN1_GENERAL_BIN != cast(data as varchar(max))

This works well for known columns.

For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.

declare 
    @sql    varchar(max)    = ''
    ,@table sysname         = 'mytable' -- enter your table here

;with ColumnData as (
    select
        RowId               = row_number() over (order by c.COLUMN_NAME)
        ,c.COLUMN_NAME
        ,ColumnName         = '[' + c.COLUMN_NAME + ']'
        ,TableName          = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']' 
    from
        INFORMATION_SCHEMA.COLUMNS c
    where
        c.DATA_TYPE         = 'nvarchar'
        and c.TABLE_NAME    = @table
)
select
    @sql = @sql + 'select FieldName = ''' + c.ColumnName + ''',         InvalidCharacter = [' + c.COLUMN_NAME + ']  from ' + c.TableName + ' where ' + c.ColumnName + ' collate LATIN1_GENERAL_BIN != cast(' + c.ColumnName + ' as varchar(max)) '  +  case when c.RowId <> (select max(RowId) from ColumnData) then  ' union all ' else '' end + char(13)
from
    ColumnData c

-- check
-- print @sql
exec (@sql)

I'm not a fan of dynamic SQL but it does have its uses for exploratory queries like this.
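
The same idea can be pointed at varchar columns, as the question asked, by swapping the cast comparison for the PATINDEX test. This is a rough sketch rather than a tested script: it assumes SQL Server 2017 or later (for STRING_AGG), 'mytable' is a placeholder, and it uses QUOTENAME instead of hand-built brackets so that unusual column names are quoted safely.

```sql
DECLARE @table sysname = N'mytable';   -- placeholder: enter your table here
DECLARE @sql   nvarchar(max);

-- One SELECT per varchar column, glued together with UNION ALL.
SELECT @sql = STRING_AGG(
           CAST(N'SELECT FieldName = ' + QUOTENAME(c.COLUMN_NAME, '''')
              + N', InvalidText = CAST(' + QUOTENAME(c.COLUMN_NAME) + N' AS varchar(max))'
              + N' FROM ' + QUOTENAME(c.TABLE_SCHEMA) + N'.' + QUOTENAME(c.TABLE_NAME)
              + N' WHERE PATINDEX(''%[^ -~]%'' COLLATE Latin1_General_BIN, '
              + QUOTENAME(c.COLUMN_NAME) + N') > 0'
                AS nvarchar(max)),
           N' UNION ALL ')
FROM   INFORMATION_SCHEMA.COLUMNS AS c
WHERE  c.DATA_TYPE  = 'varchar'
  AND  c.TABLE_NAME = @table;

-- @sql stays NULL when the table has no varchar columns.
IF @sql IS NOT NULL
BEGIN
    PRINT @sql;                      -- inspect the generated statement
    EXEC sys.sp_executesql @sql;
END;
```

To return the identity value alongside each hit, the key column would have to be added to the generated SELECT list; its name is table-specific, so it is left out here.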

Solution 4

This script searches for non-ASCII characters in one column. It generates a string of all valid characters, here code points 32 to 127, each prefixed with the escape character. Then it searches for rows that contain a character not in that list:

declare @str varchar(200); -- 96 characters, each preceded by the '|' escape = 192 characters
declare @i int;
set @str = '';
set @i = 32;
while @i <= 127
    begin
    set @str = @str + '|' + char(@i);
    set @i = @i + 1;
    end;

select  col1
from    YourTable
where   col1 like '%[^' + @str + ']%' escape '|';

Solution 5

I ran the various solutions on some real-world data: 12M rows, varchar length ~30, around 9k dodgy rows, no full-text index in play. The PATINDEX solution is the fastest, and it also selects the most rows.

(I ran KM.'s solution first to put the cache in a known state, then ran the 3 queries, and finally ran KM.'s solution again. The last 2 runs of KM.'s solution gave times within 2 seconds of each other.)

patindex solution by Gerhard Weiss -- Runtime 0:38, returns 9144 rows

select dodgyColumn from myTable fcc
WHERE  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,dodgyColumn ) >0

the substring-numbers solution by MT. -- Runtime 1:16, returned 8996 rows

select dodgyColumn from myTable fcc
INNER JOIN dbo.Numbers32k dn ON dn.number<(len(fcc.dodgyColumn ))
WHERE ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))<32 
    OR ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))>127

udf solution by Deon Robertson -- Runtime 3:47, returns 7316 rows

select dodgyColumn 
from myTable 
where dbo.udf_test_ContainsNonASCIIChars(dodgyColumn , 1) = 1
Author by

craphunter

Updated on July 05, 2022

Comments

  • craphunter
    craphunter almost 2 years

    How can rows with non-ASCII characters be returned using SQL Server?
    If you can show how to do it for one column, that would be great.

    I am doing something like this now, but it is not working:

    select *
    from Staging.APARMRE1 as ar
    where ar.Line like '%[^!-~ ]%'
    

    For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:

    • The identity field for that record. (This will allow the whole record to be reviewed with another query.)
    • The column name
    • The text with the invalid character
     Id | FieldName | InvalidText       |
    ----+-----------+-------------------+
     25 | LastName  | Solís             |
     56 | FirstName | François          |
    100 | Address1  | 123 Ümlaut street |
    

    Invalid characters would be any outside the range of SPACE (32₁₀) through ~ (126₁₀).

  • craphunter
    craphunter over 13 years
    This works with one minor change: varchar(128) needs to be bigger because 2 characters are stored for each code point. I made it varchar(200). It does take some time to run through my database. I am also surprised that a range cannot be used to simplify this process, i.e. like '%[^| -|~]%' escape '|'. I tried to get a range working, but it does not return the correct information.
  • craphunter
    craphunter over 13 years
    I also changed 127 to 126. I did not want the DEL character.
  • Twelfth
    Twelfth over 13 years
    Interesting approach, KM. For my own curiosity, can I ask why the line "OPTION (MAXRECURSION 1000)" at the end of your statement is needed and what it will do in this case?
  • Twelfth
    Twelfth over 13 years
    Comment on myself...the case statement version, I mentioned a single row having multiple columns with bad values. If both first_name and last_name had a bad value in it...I think the case statement will find the first_name portion and show it correctly, but would end there and not show the last_name value correctly. Probably not an optimal solution....the subquery version at the bottom of my post that unions all the tables values into id,columnname,value format appears to be much more functional and easier to follow
  • KM.
    KM. over 13 years
    "OPTION (MAXRECURSION 1000)" is necessary for the CTE, which recursively builds a set of rows from 1 to 1000. The default limit is 100, so any CTE that recurses more deeply than that requires this option to be set. If you had a numbers table (stackoverflow.com/q/1393951/65223), you would not need the CTE or this "OPTION (MAXRECURSION 1000)" line.
  • StevenWhite
    StevenWhite about 11 years
    This is really interesting. Would you explain how this works?
  • Anssssss
    Anssssss about 11 years
    Gerhard is providing a regular expression to the PATINDEX function. The regex is [^ !-~]. I'm not sure why he includes the exclamation character in there since it is right after the space character numerically. The point is that the regex finds things that are characters not in the range of Space-Tilde (32-126).
  • thrawnis
    thrawnis about 7 years
    Simple and quick. Thanks!
  • Daz
    Daz over 4 years
    It's worth noting that the PATINDEX function doesn't accept arbitrary regular expression patterns. It has its own syntax, which is similar to regular expressions in some respects.
  • Chris Diver
    Chris Diver about 4 years
    @vash great solution, love it.
  • Stewart
    Stewart over 3 years
    While I sometimes edit answers to include semicolons that have been left off, it wouldn't be right to do so here, as the answer would no longer be accurate as to the code you're using. But it's important not to leave them off. See: stackoverflow.com/questions/710683/…