SQL Server Code Pages and Collations

16,159

Solution 1

While MS SQL supports both code pages and Unicode unhelpfully it doesn't provide any functions to convert between the two so figuring out what character is represented by a value in a different code page is a pig.

There are two potential methods I've seen to handle conversions, one is detailed here http://www.codeguru.com/cpp/data/data-misc/values/article.php/c4571 and involves bolting a custom conversion program onto the database and using that for conversions.

The other is to construct a db table consisting of

[CodePage], [ANSI Value], [UnicodeValue]

with the unicode value stored as either the int representing the unicode character to be converted using nchar()or the nchar itself

Your using the collation SQL_Ukrainian_CP1251_CS_AS which is code page 1251 (CP1251 from the centre of the string). You can grab its translation table here http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT

Its a TSV so after trimming the top off the raw data should import fairly cleanly.

Personally I'd lean more towards the latter than the former especially for a production server as the former may introduce instability.

Solution 2

Actually I have found an answer to my question now. A bit clunky but does the job unless there's a better way out there?

SET NOCOUNT ON;

CREATE TABLE #Collations
(
     code TINYINT PRIMARY KEY
);

WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),   --2
        E02(N) AS (SELECT 1 FROM E00 a, E00 b), --4
        E04(N) AS (SELECT 1 FROM E02 a, E02 b), --16
        E08(N) AS (SELECT 1 FROM E04 a, E04 b) --256
INSERT INTO #Collations
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1
FROM E08    

DECLARE @AlterScript NVARCHAR(MAX) = ''

SELECT @AlterScript = @AlterScript + ' 
RAISERROR(''Processing' + name + ''',0,1) WITH NOWAIT;
ALTER TABLE #Collations ADD ' + name + ' CHAR(1) COLLATE ' + name + ';
EXEC(''UPDATE #Collations SET ' + name + '=CAST(code AS BINARY(1))'');
EXEC(''UPDATE #Collations SET ' + name + '=NULL WHERE ASCII(' + name + ') <> code'');
'
FROM   sys.fn_helpcollations()
WHERE  name LIKE '%CS_AS'
       AND name NOT IN    /*Unicode Only Collations*/
                        ( 'Assamese_100_CS_AS', 'Bengali_100_CS_AS',
                         'Divehi_90_CS_AS', 'Divehi_100_CS_AS' ,
                         'Indic_General_90_CS_AS', 'Indic_General_100_CS_AS',
                             'Khmer_100_CS_AS', 'Lao_100_CS_AS',
                         'Maltese_100_CS_AS', 'Maori_100_CS_AS',
                         'Nepali_100_CS_AS', 'Pashto_100_CS_AS',
                         'Syriac_90_CS_AS', 'Syriac_100_CS_AS',
                         'Tibetan_100_CS_AS' )


EXEC (@AlterScript)

SELECT * FROM #Collations

DROP TABLE #Collations
Share:
16,159
Martin Smith
Author by

Martin Smith

Updated on June 09, 2022

Comments

  • Martin Smith
    Martin Smith almost 2 years

    Is there any way in SQL Server of determining what a character in a code page would represent without actually creating a test database of that collation?

    Example. If I create a test database with collation SQL_Ukrainian_CP1251_CS_AS and then do CHAR(255) it returns я.

    If I try the following on a database with SQL_Latin1_General_CP1_CS_AS collation however

    SELECT CHAR(255) COLLATE SQL_Ukrainian_CP1251_CS_AS
    

    It returns y

    SELECT CHAR(255)
    

    Returns ÿ so it is obviously going first via the database's default collation then trying to find the closest equivalent to that in the explicit collation. Can this be avoided?