Excel save behaviour of CSV file with UTF8 encoding vs UTF8-Bom encoding

29,691

What happens is the following.

  1. Excel looks at your systems Regional settings to find what list-separator is configured, in your case, a comma. Here it's important to point out that CSV is not a single, well-defined format. Any separator could be used...(Edit: Well, there is actually a guideline that pins the separator down to being a comma, but this is not a binding standard, and the guideline is not strictly followed in practice.)
  2. Then, excel transforms the data, in-memory, into an XLSX file. You do your excel-ly things. Keep in mind that this is effectively a file-type conversion, and Excel does not remember what encoding your source file was in, or what separator was being used.
  3. You make canges, and click save. At that moment, Excel does no longer have a CSV file, but a XLSX file, and needs to convert it back to something "CSV-like". Because it knows the encoding needs to be UTF-8 with BOM, it decides to Save As "Unicode text" (it does this without explicitly mentioning this). In this format, they use UCS-2 LE with BOM as encoding, and tab as the separator. This is not the CSV standard you had in mind, or the standard your original file was formatted as. It's weird and counterintuitive, but strictly speaking not wrong.

So, what can you do?
In theory you should be able to use the Save as CSV option, and through the Tools->"Web options" menu (in the save as dialog) set the encoding to UTF-8. The only problem here is that this doesn't seem to work properly. My guess is that it is a broken feature. More on this on Stack Overflow: Excel to CSV with UTF-8 encoding

So, what can you really do?
Just use UTF-8 BOM, use save, and open the resulting CSV file to replace all tabs with commas.

And what should you really do?
Don't use Excel to load and edit CSV files. Or at least be a lot more careful about it...
Why not?
I've written a (way too long) article about it over here.

Share:
29,691

Related videos on Youtube

Admin
Author by

Admin

Updated on September 18, 2022

Comments

  • Admin
    Admin over 1 year

    My original csv files are encoded with UTF8-BOM, this is so Excel can recognise the Unicode charcters (If its just encoded with UTF8 then Excel does not recognise the unicode characters).

    My issue is that with a UTF8 encoded CSV file, when you click save in Excel it gives you a popup option to keep the original formatting. However if the file is UTF8-BOM encoded then clicking the save is instant and there is no option to keep the formatting. The CSV file then has it commas replaced with tabs.

  • underscore_d
    underscore_d over 6 years
    Do you mean that this implicit conversion to Excel format only occurs if the CSV that was originally opened contained a BOM? The OP indicates that it's not an issue if the file lacks a BOM. Anyway - great article, thanks; I'll be forwarding that to some colleagues!
  • Wouter
    Wouter over 6 years
    Good question. The conversion always happens. There is no way Excel can suddenly display a flat-text file in a table-format without interpreting the text first. It is guessing what the separator is, and guessing what the encoding is. When there's no BOM, as OP indicates, it gets the encoding part wrong. tbc...
  • Wouter
    Wouter over 6 years
    When Excel saves a non-UTF8 file (at this point it doesn't know your source was UTF-8) it saves this as "CSV" instead of "Unicode Text". These are the save methods you can also choose from when using "Save as". So at this point, Excel doesn't let you choose, but defaults to "CSV". This is a format that will use your system defined separator, while "Unicode Text" does not, and always uses tabs. Yes... it's a total mess :) So, again, you realy realy should not be letting Excel make best-effort guesses as to the structure of your data... :) Glad you liked the article!