How do I remove extended ASCII characters from a string in T-SQL?

85,355

Solution 1

OK, give this a try. It seems the same issue they have. Anyway you need to modify it based on your requirements.

CREATE FUNCTION RemoveNonASCII 
(
    @nstring nvarchar(255)
)
RETURNS varchar(255)
AS
BEGIN

    DECLARE @Result varchar(255)
    SET @Result = ''

    DECLARE @nchar nvarchar(1)
    DECLARE @position int

    SET @position = 1
    WHILE @position <= LEN(@nstring)
    BEGIN
        SET @nchar = SUBSTRING(@nstring, @position, 1)
        --Unicode & ASCII are the same from 1 to 255.
        --Only Unicode goes beyond 255
        --0 to 31 are non-printable characters
        IF UNICODE(@nchar) between 32 and 255
            SET @Result = @Result + @nchar
        SET @position = @position + 1
    END

    RETURN @Result

END
GO

Check it out at SqlServerCentral

Solution 2

The accepted answer is using a loop which should be avoided...

My solution is completely inlineable, it's easy to create an UDF (or maybe even better: an inline TVF) from this.

The idea: Create a set of running numbers (here it's limited with the count of objects in sys.objects, but there are tons of example how to create a numbers tally on the fly). In the second CTE the strings are splitted to single characters. The final select comes back with the cleaned string.

DECLARE @tbl TABLE(ID INT IDENTITY, EvilString NVARCHAR(100));
INSERT INTO @tbl(EvilString) VALUES('ËËËËeeeeËËËË'),('ËaËËbËeeeeËËËcË');

WITH RunningNumbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
    FROM sys.objects
)
,SingleChars AS
(
    SELECT tbl.ID,rn.Nmbr,SUBSTRING(tbl.EvilString,rn.Nmbr,1) AS Chr
    FROM @tbl AS tbl
    CROSS APPLY (SELECT TOP(LEN(tbl.EvilString)) Nmbr FROM RunningNumbers) AS rn 
)
SELECT ID,EvilString
      ,(
        SELECT '' + Chr 
        FROM SingleChars AS sc
        WHERE sc.ID=tbl.ID AND ASCII(Chr)<128
        ORDER BY sc.Nmbr
        FOR XML PATH('')
      ) AS GoodString
FROM @tbl As tbl

The result

1   ËËËËeeeeËËËË    eeee
2   ËaËËbËeeeeËËËcË abeeeec

Here is another answer from me where this approach is used to replace all special characters with secure characters to get plain latin

Solution 3

Just correcting above code (it was cutting DOTs)

CREATE FUNCTION [dbo].[RemoveNonASCII] 
(
    @nstring nvarchar(255)
)
RETURNS nvarchar(255)
AS
BEGIN

    DECLARE @Result nvarchar(255)
    SET @Result = ''

    DECLARE @nchar nvarchar(1)
    DECLARE @position int

    SET @position = 1
    WHILE @position <= LEN(@nstring)
    BEGIN
        SET @nchar = SUBSTRING(@nstring, @position, 1)
        --Unicode & ASCII are the same from 1 to 255.
        --Only Unicode goes beyond 255
        --0 to 31 are non-printable characters
        IF (UNICODE(@nchar) between 192 and 198) or (UNICODE(@nchar) between 225 and 230) -- letter A or a with accents
            SET @nchar = 'a'
        IF (UNICODE(@nchar) between 200 and 203) or (UNICODE(@nchar) between 232 and 235) -- letter E or e with accents
            SET @nchar = 'e'
        IF (UNICODE(@nchar) between 204 and 207) or (UNICODE(@nchar) between 236 and 239) -- letter I or i with accents
            SET @nchar = 'i'
        IF (UNICODE(@nchar) between 210 and 214) or (UNICODE(@nchar) between 242 and 246) or (UNICODE(@nchar)=240) -- letter O or o with accents
            SET @nchar = 'o'
        IF (UNICODE(@nchar) between 217 and 220) or (UNICODE(@nchar) between 249 and 252)  -- letter U or u with accents
            SET @nchar = 'u'
        IF (UNICODE(@nchar)=199)  or (UNICODE(@nchar)=231)  -- letter Ç or ç 
            SET @nchar = 'c'
        IF (UNICODE(@nchar)=209)  or (UNICODE(@nchar)=241)  -- letter Ñ or ñ 
            SET @nchar = 'n'
        IF (UNICODE(@nchar) between 45 and 46) or (UNICODE(@nchar) between 48 and 57) or (UNICODE(@nchar)  between 64 and 90) or (UNICODE(@nchar) = 95) or (UNICODE(@nchar)  between 97 and 122)
            SET @Result = @Result + @nchar
        SET @position = @position + 1
    END
    set @Result = lower(@Result) -- e-mails in lower case
    RETURN @Result

END

Solution 4

Thank you for sharing your code.

I needed something like it, not just for cleaning up e-mail addresses but for general purpose, filtering user web site input before it reaches SAP ERP via integration modules.

Running in Brazil, it must obey language accents...

Here goes the resulting code.

Maybe it can help someone someday, like it did to me.

     IF EXISTS
     (
         SELECT * 
           FROM sys.objects 
          WHERE object_id = OBJECT_ID(N'[dbo].[fnRemoveNonASCII]') 
            AND type IN (N'FN')
     )
         DROP FUNCTION dbo.fnRemoveNonASCII
     GO

     CREATE FUNCTION [dbo].[fnRemoveNonASCII] 
     (
         @nstring nvarchar(MAX) 
     )
     RETURNS nvarchar(MAX)
     AS
     BEGIN

         DECLARE @nchar nvarchar(1)      -- individual char in string
         DECLARE @nUnicode nvarchar(3)   -- ASCII for individual char in string
         DECLARE @position int           -- subscript to control loop in the string
         DECLARE @Result nvarchar(MAX)   -- return valus

         SET @Result = ''
         SET @position = 1

         WHILE @position <= LEN(@nstring)

            BEGIN

                --Unicode & ASCII are the same from 1 to 255.
                --Only Unicode goes beyond 255
                --0 to 31 are non-printable characters

                SET @nchar = SUBSTRING(@nstring, @position, 1)

                SET @nUnicode = UNICODE(@nChar) 

                IF @nUnicode = 10 
                   OR @nUnicode = 13
                   OR @nUnicode BETWEEN 32  AND 126
                   OR @nUnicode = 160
                   OR @nUnicode BETWEEN 192 AND 207
                   OR @nUnicode BETWEEN 210 AND 213
                   OR @nUnicode BETWEEN 217 AND 219
                   OR @nUnicode BETWEEN 224 AND 227
                   OR @nUnicode BETWEEN 231 AND 234
                   OR @nUnicode = 236
                   OR @nUnicode = 237
                   OR @nUnicode BETWEEN 242 AND 245
                   OR @nUnicode = 247
                   OR @nUnicode = 249
                   OR @nUnicode = 250        

                   SET @Result = @Result + @nchar

                ELSE IF @nUnicode = 9 -- TAB
                   SET @Result =  @Result + '  '

                ELSE
                   SET @Result =  @Result + ' '

                SET @position = @position + 1
         END

         RETURN @Result
     END

     /*
     ---------------------------------------------------------------------------------------------------------------

     -- Tabela dos caracteres Unicode/ASCII exportáveis

     &#9;   |   &#x9;      |    %9     =    TAB
     &#10;     |    &#xa;      |    %a     =    0A   Line Feed
     &#13;     |    &#xd;      |    %d     =    0D   Carriage Return
     &#32;      |   &#x20;  |   %20 =   <space>  
     &#33;     |    &#x21;  |   %21 =   !            
     &#34;     |    &#x22;  |   %22 =   "            
     &#35;     |    &#x23;  |   %23 =   #            
     &#36;     |    &#x24;  |   %24 =   $            
     &#37;     |    &#x25;  |   %25 =   %            
     &#38;     |    &#x26;  |   %26 =   &            
     &#39;     |    &#x27;  |   %27 =   '            
     &#40;     |    &#x28;  |   %28 =   (            
     &#41;     |    &#x29;  |   %29 =   )            
     &#42;     |    &#x2a;  |   %2a =   *            
     &#43;     |    &#x2b;  |   %2b =   +            
     &#44;     |    &#x2c;  |   %2c =   ,            
     &#45;     |    &#x2d;  |   %2d =   -            
     &#46;     |    &#x2e;  |   %2e =   .            
     &#47;     |    &#x2f;  |   %2f =   /            
     &#48;     |    &#x30;  |   %30 =   0            
     &#49;     |    &#x31;  |   %31 =   1            
     &#50;     |    &#x32;  |   %32 =   2            
     &#51;     |    &#x33;  |   %33 =   3            
     &#52;     |    &#x34;  |   %34 =   4            
     &#53;     |    &#x35;  |   %35 =   5            
     &#54;     |    &#x36;  |   %36 =   6            
     &#55;     |    &#x37;  |   %37 =   7            
     &#56;     |    &#x38;  |   %38 =   8            
     &#57;     |    &#x39;  |   %39 =   9            
     &#58;     |    &#x3a;  |   %3a =   :            
     &#59;     |    &#x3b;  |   %3b =   ;            
     &#60;     |    &#x3c;  |   %3c =   <            
     &#61;     |    &#x3d;  |   %3d =   =            
     &#62;     |    &#x3e;  |   %3e =   >            
     &#63;     |    &#x3f;  |   %3f =   ?            
     &#64;     |    &#x40;  |   %40 =   @   
     &#65;     |    &#x41;  |   %41 =   A
     &#66;     |    &#x42;  |   %42 =   B
     &#67;     |    &#x43;  |   %43 =   C
     &#68;     |    &#x44;  |   %44 =   D
     &#69;     |    &#x45;  |   %45 =   E
     &#70;     |    &#x46;  |   %46 =   F
     &#71;     |    &#x47;  |   %47 =   G
     &#72;     |    &#x48;  |   %48 =   H
     &#73;     |    &#x49;  |   %49 =   I
     &#74;     |    &#x4a;  |   %4a =   J
     &#75;     |    &#x4b;  |   %4b =   K
     &#76;     |    &#x4c;  |   %4c =   L
     &#77;     |    &#x4d;  |   %4d =   M
     &#78;     |    &#x4e;  |   %4e =   N
     &#79;     |    &#x4f;  |   %4f =   O 
     &#80;     |    &#x50;  |   %50 =   P
     &#81;     |    &#x51;  |   %51 =   Q
     &#82;     |    &#x52;  |   %52 =   R
     &#83;     |    &#x53;  |   %53 =   S
     &#84;     |    &#x54;  |   %54 =   T
     &#85;     |    &#x55;  |   %55 =   U
     &#86;     |    &#x56;  |   %56 =   V
     &#87;     |    &#x57;  |   %57 =   W
     &#88;     |    &#x58;  |   %58 =   X
     &#89;     |    &#x59;  |   %59 =   Y
     &#90;     |    &#x5a;  |   %5a =   Z
     &#91;     |    &#x5b;  |   %5b =   [
     &#92;     |    &#x5c;  |   %5c =   \
     &#93;     |    &#x5d;  |   %5d =   ]
     &#94;     |    &#x5e;  |   %5e =   ^
     &#95;     |    &#x5f;  |   %5f =   _
     &#96;     |    &#x60;  |   %60 =   `
     &#97;     |    &#x61;  |   %61 =   a
     &#98;     |    &#x62;  |   %62 =   b
     &#99;     |    &#x63;  |   %63 =   c
     &#100; |   &#x64;  |   %64 =   d
     &#101; |   &#x65;  |   %65 =   e
     &#102; |   &#x66;  |   %66 =   f
     &#103; |   &#x67;  |   %67 =   g
     &#104; |   &#x68;  |   %68 =   h
     &#105; |   &#x69;  |   %69 =   i
     &#106; |   &#x6a;  |   %6a =   j
     &#107; |   &#x6b;  |   %6b =   k
     &#108; |   &#x6c;  |   %6c =   l
     &#109; |   &#x6d;  |   %6d =   m
     &#110; |   &#x6e;  |   %6e =   n
     &#111; |   &#x6f;  |   %6f =   o
     &#112; |   &#x70;  |   %70 =   p
     &#113; |   &#x71;  |   %71 =   q
     &#114; |   &#x72;  |   %72 =   r
     &#115; |   &#x73;  |   %73 =   s
     &#116; |   &#x74;  |   %74 =   t
     &#117; |   &#x75;  |   %75 =   u
     &#118; |   &#x76;  |   %76 =   v
     &#119; |   &#x77;  |   %77 =   w
     &#120; |   &#x78;  |   %78 =   x
     &#121; |   &#x79;  |   %79 =   y
     &#122; |   &#x7a;  |   %7a =   z
     &#123; |   &#x7b;  |   %7b =   {
     &#124; |   &#x7c;  |   %7c =   |
     &#125; |   &#x7d;  |   %7d =   }
     &#126; |   &#x7e;  |   %7e =   ~
     &#160; |   &#xa0;  |   %a0 =   <nbsp>      
     &#192; |   &#xc0;  |   %c0 =   À
     &#193; |   &#xc1;  |   %c1 =   Á
     &#194; |   &#xc2;  |   %c2 =   Â
     &#195; |   &#xc3;  |   %c3 =   Ã
     &#196; |   &#xc4;  |   %c4 =   Ä
     &#197; |   &#xc5;  |   %c5 =   Å
     &#198; |   &#xc6;  |   %c6 =   Æ
     &#199; |   &#xc7;  |   %c7 =   Ç
     &#200; |   &#xc8;  |   %c8 =   È
     &#201; |   &#xc9;  |   %c9 =   É
     &#202; |   &#xca;  |   %ca =   Ê
     &#203; |   &#xcb;  |   %cb =   Ë
     &#204; |   &#xcc;  |   %cc =   Ì
     &#205; |   &#xcd;  |   %cd =   Í
     &#206; |   &#xce;  |   %ce =   Î
     &#207; |   &#xcf;  |   %cf =   Ï
     &#210; |   &#xd2;  |   %d2 =   Ò
     &#211; |   &#xd3;  |   %d3 =   Ó
     &#212; |   &#xd4;  |   %d4 =   Ô
     &#213; |   &#xd5;  |   %d5 =   Õ
     &#217; |   &#xd9;  |   %d9 =   Ù
     &#218; |   &#xda;  |   %da =   Ú
     &#219; |   &#xdb;  |   %db =   Û
     &#224; |   &#xe0;  |   %e0 =   à
     &#225; |   &#xe1;  |   %e1 =   á
     &#226; |   &#xe2;  |   %e2 =   â
     &#227; |   &#xe3;  |   %e3 =   ã
     &#231; |   &#xe7;  |   %e7 =   ç
     &#232; |   &#xe8;  |   %e8 =   è
     &#233; |   &#xe9;  |   %e9 =   é
     &#234; |   &#xea;  |   %ea =   ê
     &#236; |   &#xec;  |   %ec =   ì
     &#237; |   &#xed;  |   %ed =   í
     &#242; |   &#xf2;  |   %f2 =   ò
     &#243; |   &#xf3;  |   %f3 =   ó
     &#244; |   &#xf4;  |   %f4 =   ô
     &#245; |   &#xf5;  |   %f5 =   õ
     &#247; |   &#xf7;  |   %f7 =   ÷
     &#249; |   &#xf9;  |   %f9 =   ù
     &#250; |   &#xfa;  |   %fa =   ú
     */

     GO     
Share:
85,355
Codeman
Author by

Codeman

I'm a software engineer in Seattle

Updated on August 31, 2020

Comments

  • Codeman
    Codeman over 3 years

    I need to filter out (remove) extended ASCII characters from a SELECT statement in T-SQL.

    I'm using a stored procedure to do so.

    Expected input:

    ËËËËeeeeËËËË
    

    Expected output:

    eeee
    

    All that I've found is for MySQL.

    I'm using :

    Microsoft SQL Server Management Studio  11.0.2100.60
    Microsoft .NET Framework    4.0.30319.17929
    
  • dan04
    dan04 almost 11 years
    ASCII only goes to 127.
  • sotn
    sotn about 7 years
    @dan04 is right. You need to change that 255 to 127 imo
  • Mike
    Mike almost 7 years
    @dan04 ASCII does go up to 255. Below 32 or above 127 are the extended ASCII characters. Since the OP was asking for removing them it should be changed to be between 32 and 127, but only because that is what is asked not because they are not ASCII. asciitable.com
  • Mike
    Mike almost 7 years
    Your answer doesn't seem to work for my problem over here: stackoverflow.com/questions/44804879/…
  • Dan Walters
    Dan Walters almost 5 years
    I couldn't get this to work against chars like '�´AƒTÏ)Œ£3`ŒœŠ’9¤'
  • Dan Walters
    Dan Walters almost 5 years
    this is great. any way to keep spaces?