Accented characters not correctly imported with BULK INSERT


Solution 1

This was answered in the comments. Did you try it?

http://msdn.microsoft.com/en-us/library/ms189941.aspx

Option DATAFILETYPE ='widenative'

Based on the comment from Esailija: did the text get mangled before or after the bulk import? I agree it sounds like the CSV file itself is single-byte. Unicode data requires the option DATAFILETYPE = 'widenative'. But if the CSV file is already single-byte, there is no magic translation back.
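To see why 'widenative' matters, compare the raw bytes of the same string in the two layouts. This is a minimal Python sketch, not part of the original answer; 'widenative' expects UTF-16LE (two bytes per character), while a typical ANSI CSV is one byte per character, so the two file shapes are not interchangeable:

```python
text = "café"
wide = text.encode("utf-16-le")   # the layout DATAFILETYPE='widenative' expects
narrow = text.encode("cp1252")    # a typical single-byte Windows ANSI file
print(len(wide), len(narrow))     # 8 bytes vs 4 bytes for the same 4 characters
```

If the file on disk is the 4-byte kind, telling BULK INSERT it is 'widenative' just misreads it; the option only helps when the source really is wide.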

What is too bad is that é is extended ASCII and supported by the SQL char type, so this is more evidence the problem is in the CSV:
SELECT CAST('é' AS char(1))
Notice this works, because é is extended ASCII (< 255).

Sounds like you need to go back to the source.
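If the source encoding is unknown, one quick check before going back to the supplier is to sniff the file's byte order mark. A small Python sketch (the fallback order here is an assumption, not part of the original answer):

```python
import codecs

def sniff_encoding(raw: bytes) -> str:
    """Guess a file's encoding from its BOM, with a heuristic fallback."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"      # the shape BULK INSERT 'widenative' expects
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # UTF-8 with signature
    try:
        raw.decode("utf-8")
        return "utf-8"          # decodes cleanly as UTF-8 without BOM
    except UnicodeDecodeError:
        return "cp1252"         # assumption: a Windows ANSI single-byte file

# e.g. sniff_encoding(open("xy.csv", "rb").read(4096))
```

A BOM-less single-byte file cannot be identified with certainty; the cp1252 fallback is only a guess appropriate for Western-European Windows systems.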

The ? in SQL means an unknown character, the same way � does in Notepad.

Solution 2

In my case I was able to fix the encoding problem with the CODEPAGE option:

BULK INSERT #CSV
FROM 'D:\XY\xy.csv'
WITH
(
    CODEPAGE = 'ACP',
    DATAFILETYPE = 'char',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2
)

Possible values: CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' }

You can find more information about the option here: BULK INSERT
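For intuition: 'ACP' tells BULK INSERT to interpret the file with the Windows ANSI code page (cp1252 on Western-European systems), whereas the default is the OEM code page (cp437, as monty notes in the comments below). A Python sketch of the difference, added for illustration:

```python
raw = b"caf\xe9"                 # the bytes as they sit in an ANSI CSV
print(raw.decode("cp1252"))      # 'café' — what CODEPAGE = 'ACP' gives you
print(raw.decode("cp437"))       # 'cafΘ' — what the OEM default makes of it
```

Same bytes, different code page, different characters — which is why naming the right code page fixes the import without touching the file.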

Solution 3

I still cannot believe that after all these years Microsoft has not fixed this obvious bug. There should be no problem with èéêë etc., because they are all extended ASCII (< 255). This question is posed over and over again on many sites, and it has yet to be answered.

My data is in a table in Excel. Having generated the INSERT INTO statements, the table is parsed a second time looking for ASCII > 'z', generating an UPDATE table SET column statement to overwrite the imported data. Cumbersome, but workable.

Solution 4

I've done it! After all these years, and we were all looking in the wrong place. No work needed, no rewriting scripts...

The problem lies with SSMS... if you choose "New Query" by right-clicking on "Queries", you get to rename the file but not create it; that is done for you...

But... if you press Ctrl+N you get a new query window to edit, but no file is created... So you save it yourself and choose the encoding from the Save button... towards the bottom of the list you'll find UTF-8 (without signature), code page 65001.
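"With signature" versus "without signature" is just the three-byte UTF-8 BOM at the start of the file. A Python sketch (added for illustration) of what that Save dialog choice actually changes:

```python
import codecs
# "UTF-8 (without signature)" is plain UTF-8; "with signature" prepends the BOM.
with_sig = "SELECT 'é'".encode("utf-8-sig")
without  = "SELECT 'é'".encode("utf-8")
print(with_sig.startswith(codecs.BOM_UTF8))   # True  — BOM bytes EF BB BF present
print(without.startswith(codecs.BOM_UTF8))    # False — file starts with the text
```

The accented characters themselves are encoded identically either way; only the leading signature bytes differ.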

And that is it...

Script after script: open a new query window with Ctrl+N, copy and paste from an existing query, and save as directed above. And as if by magic, it works.

If, like me, you have tables in Excel: parse the table, writing the output to the first column of a new workbook with one sheet in it, then Save As and choose UTF-8 encoding.

Speed things up with a template file containing a comment, something like "-- utf-8". Save it as UTF-8, then paste a file listing of *.sql into Excel and build a list with =CONCATENATE("ren templatefile.txt ", CHAR(34), A1, CHAR(34)) in B1, dragging it down.

After all these years of manual solutions I am literally sweating with excitement at the discovery. Thank you for getting me so upset

Author by

tough

Updated on June 04, 2022

Comments

  • tough
    tough almost 2 years

    I am importing a source CSV file. I don't know the source encoding, and I can only see either � (ANSI encoding) or (UTF8-without-BOM encoding) when I open the file with Notepad++ (related question).

    This file has been imported into the mssql-2008 database using BULK INSERT:

    DECLARE @bulkinsert NVARCHAR(2000)
    SET @bulkinsert =
        N'BULK INSERT #TempData FROM ''' +
        @FilePath +
        N''' WITH (FIRSTROW = 2, FIELDTERMINATOR = ''","'', ROWTERMINATOR = ''\n'')'
    EXEC sp_executesql @bulkinsert
    

    This is then copied from #TempData to the regular table1, into column1 (varchar()). Now when I look into table1, I see some ? in place of those characters.

    I have tried casting to nvarchar(), but it does not help.

    When I dug into what those characters really are, with the support of the link we download at the same time, I saw that the characters were é, ä, å and so on.

    I would use REPLACE to fix the data, but I would need to write some ugly code and look into individual word patterns to replace, so that seems difficult.

    Database/table collation: SQL_Latin1_General_CP1_CI_AS; column1 is varchar(80).

    Can I change these characters to English-like characters, or to the original characters, instead of ? marks?

    I have looked at Collation and Unicode Support, which did not help me. I understood what it says about encoding, but it did not tell me what to do. I have looked into most of the posts here on Stack Overflow; yes, there are some posts about it, but they did not match my search.

    I am unable to figure out where the problem lies.

  • Esailija
    Esailija over 11 years
    It doesn't matter; even if he does everything correctly, he will end up with � because that's what the data is. You cannot fix a literal � in any way. This is not the case where � is just a superficial indication of error; it is the case where � is the literal data.
  • paparazzo
    paparazzo over 11 years
    @Esailija OK, I think I agree. I still contend the OP should have tested with DATAFILETYPE = 'widenative', as that would be needed once the data is fixed. Hopefully you will get the check on the other question.
  • tough
    tough over 11 years
    @Blam I will test and see if it works with 'widenative' as the DATAFILETYPE, and also change the #TempData definition, to see if I can really get it right; else I will have to totally agree with Esailija's points. Kudos for all the help. I wish I could give one more.
  • paparazzo
    paparazzo over 11 years
    I agree with Esailija. You need to fix both. Get unicode and load unicode. I hope you accept his other answer.
  • monty
    monty over 5 years
    Yep, bulk insert uses, by default, OEM codepage (likely codepage #437).