
CSV: An Encoding Nightmare

UPDATE 2019-01-16: In the three years since this article was written, parts of it, in particular the discussion of UTF-8, are thankfully no longer accurate.

It would appear that in a recent update Microsoft has added support for safely reading and writing UTF-8 CSVs in Excel. There is a new format in the save dialog, CSV UTF-8 (Comma delimited), which is distinct from Comma Separated Values, which is also still there.

In my testing it appears to both load and save safely. Best of all, if there is a BOM it leaves it in place, and if there isn't one it doesn't add one. Very nice handling indeed.

UPDATE 2022-07-22:

It would appear that the 2019 update above was incorrect. Current versions of Excel will only reliably load a UTF-8 CSV if a UTF-8 BOM is in place.


The company I work for manages a self-service data import system, handling information from school districts’ “Student Information Systems”. There are hundreds if not thousands of SISs out there, but what the vast majority have in common is the ability to export CSV. It is the lingua franca; everyone can get us CSVs.

CSV to the casual observer seems a simple portable format, but its looks are deceiving. If the data is pure ASCII (bytes 0-127) you’ll be fine. However, if there is a need for any sort of non-ASCII character, there is some work ahead.

In our case, handling school data from around the world, correctly handling non-ASCII characters is of the utmost importance.

Excel: A World of Hurt

The biggest problem is not CSV itself, but that the primary tool used to interact with it is Excel. Excel handles CSV encodings badly.

Creating a new document in Excel and saving it as "Comma Separated Values (.csv)" uses your locale's Windows or Mac codepage, Win-1252 and MacRoman respectively in the United States. If your codepage doesn't support a character in your document, it will be silently replaced with an underscore _ character.

Because it uses codepages and not a Unicode encoding, processing becomes a painful chore. There is no way to tell the difference between different 8-bit codepages programmatically.

One can use heuristics to sort them into an order of likelihood, but there is no way to ever know for sure. We chose to present the user with previews of the most likely codepages, and let them pick the correct one.
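A minimal sketch of that preview-and-pick idea, in Python (the codepage list and file name here are illustrative, not our actual system):

# Hypothetical sketch of the "preview each likely codepage" approach.
CANDIDATE_CODEPAGES = ["cp1252", "mac_roman", "cp1251", "iso-8859-15"]

def codepage_previews(raw: bytes, max_chars: int = 200) -> dict:
    """Decode the same bytes with each candidate codepage and return previews."""
    previews = {}
    for codepage in CANDIDATE_CODEPAGES:
        try:
            previews[codepage] = raw.decode(codepage)[:max_chars]
        except UnicodeDecodeError:
            continue  # this codepage cannot decode these bytes at all; drop it
    return previews

# Most single-byte codepages will decode *something*, which is exactly why a
# human has to look at the previews and pick the one that reads correctly.
with open("students.csv", "rb") as f:
    for name, preview in codepage_previews(f.read()).items():
        print(f"--- {name} ---\n{preview}\n")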

Excel for Mac: Broken Beyond Belief

The Macintosh version of Microsoft Excel is particularly harrowing.

As I mentioned above, Excel saves your CSV in your locale's codepage. One should note that the Mac codepages fell out of use with OS 9, almost 15 years ago. Microsoft did not get that memo.

That alone makes the CSVs the Mac version of Excel exports unusable on Windows, but the problem is even more unfortunate than that: the Mac version can only read the locale's Windows codepage.

That means the Mac version of Excel cannot read CSVs it wrote. That's pathetic.

If you had any extended characters when you saved, they are scrambled when you reopen it. This problem has persisted in every version of Mac Excel up to the current Excel 2016.

One simply cannot safely open a CSV created with the Mac version of Excel, on any platform, anywhere.

UTF-8? What’s that?

UTF-8 is the encoding of the 21st Century. It quickly overtook every other encoding, and almost every modern application supports it. Surely this will be easy and we'll be fine? Not so fast.

If one attempts to open a CSV file encoded as UTF-8 without a Byte Order Mark (BOM), as the Unicode standard recommends, any non-ASCII characters are again scrambled.

This is the CSV format Apple’s Numbers exports by default, UTF-8 sans BOM.

If we try again with a UTF-8 BOM prepended to the file, Excel will read it. This is deceptive: once saved, the text remains correctly encoded UTF-8, but bizarrely the BOM is stripped, so the file can no longer be read correctly.

Many naïve applications output UTF-8 + BOM CSVs, which Excel reads correctly but will not write back correctly. I've seen this cause all sorts of headaches because it appears to work but does not.
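If you do have to produce UTF-8 for Excel to read, a small Python sketch of the BOM handling (file names are placeholders) looks something like this; note it only helps the read side, not the round trip described above:

import codecs

rows = "name,school\nRenée,École Centrale\n"

# "utf-8-sig" prepends the BOM (EF BB BF) on write and strips it on read,
# which is what lets Excel detect the file as UTF-8.
with open("with_bom.csv", "w", encoding="utf-8-sig", newline="") as f:
    f.write(rows)

# After a user round-trips the file through Excel, the BOM may be gone,
# so check for it rather than assuming it survived.
with open("with_bom.csv", "rb") as f:
    has_bom = f.read(3) == codecs.BOM_UTF8
print("BOM present:", has_bom)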

The Functional Workaround

As of this writing, there exists a single usable CSV format that Microsoft Excel can both read and write safely across platforms: tab-delimited UTF-16LE with a leading Byte Order Mark.

To produce this format from Excel in the first place, you have to use the Save As dialog and choose "UTF-16 Unicode Text (.txt)". Frustratingly, it adds a .txt extension by default, which can be confusing for non-technical users. However, if you give Windows users clear instructions to save their filename in quotes with a CSV extension, such as "example.csv", it is workable.

If you operate an application with a CSV exporter please use this as your default export format. It will save us all many headaches in the long run.
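A minimal sketch of such an exporter in Python (the column names and file name are made up for illustration):

import csv

rows = [
    ["student_id", "name", "school"],
    ["1001", "Renée García", "École Centrale"],
]

# Tab-delimited, UTF-16LE, with a leading BOM: the one format Excel reads and
# writes safely on both platforms.
with open("export.csv", "w", encoding="utf-16-le", newline="") as f:
    f.write("\ufeff")                            # the leading Byte Order Mark
    writer = csv.writer(f, dialect="excel-tab")  # tabs, Excel-style quoting, CRLF
    writer.writerows(rows)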


Comment by: Sam on

Simple instructions:
Use Google Docs or LibreOffice. Easy and FREE.
Also: when your system already exports a CSV file and you frequently need to edit them, you're doing it wrong.

Comment by: Jesse G. Donat on

You forget how much mistrust there is for free software, particularly when dealing with confidential information. On top of that, most of the people we're dealing with aren't at liberty to install things on their computers.

Comment by: evonsdesigns on

I was able to import correctly into Excel by adding `\uFEFF` to the beginning of the file I was sending to the client from the server.

Comment by: Jesse G Donat on

`\uFEFF` is a UTF-16BE (Big Endian) Byte Order Mark (BOM) used to indicate encoding.

Was your CSV content similarly UTF-16BE? I just tested a sample UTF-16BE CSV on Windows and Mac. Both come out scrambled, Mac being far more so.

I suspect that the content you prepended the BOM to are either Latin-1 or UTF-8 encoded text, in which case it might open correctly because you've forced Excel into an error mode, but it will not save/reopen correctly.
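For what it's worth, the same U+FEFF character encodes to different bytes in each encoding, which is why the BOM you prepend has to match the encoding of the rest of the file; a quick Python illustration:

# U+FEFF (the BOM) produces a different byte sequence per encoding, so a
# prepended BOM only signals the encoding the rest of the file is actually in.
for enc in ("utf-8", "utf-16-be", "utf-16-le"):
    print(enc, "\ufeff".encode(enc).hex(" "))
# utf-8      ef bb bf
# utf-16-be  fe ff
# utf-16-le  ff fe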

Comment by: David Giorgi on

If you have an old version of Excel on Mac (e.g. Microsoft Office 2008), use LibreOffice instead:

1] Open the XLSX file in LibreOffice
2] Export as CSV, setting the filter to UTF-8 encoding
3] Open it again in LibreOffice and save as XLSX

Comment by: J Peraita on

If, instead of double clicking to open the file, you use Excel's "Data", "From text" option to import the file, you can select the encoding. Usually you'll get the results you want... It handles UTF-8 with no BOM nicely (in Windows, at least).

Comment by: leejh3224 on

Amazing! You saved my time! Thanks a lot for a great tip!!!!!

Comment by: Cristian on

You could at least use the extension .tsv (tab-separated values) instead of the misleading .csv.

Comment by: anninosaka on

Use Table Tools for Mac (free in the App Store) for CSV files. You can easily select the encoding.

Comment by: Murphy on

Thanks for this, it saved me a lot of time and effort!

Comment by: Malc on

The database software I am using has options to export to CSV with "default", Unicode, UTF-8, UTF-8 - BOM, ISO-8859-1, ISO-8859-9, ISO-8859-15, ISO-8859-16 and ASCII encodings, so unless the "default" or "Unicode" option is UTF-16LE, it looks like I will have to keep avoiding accented characters.

Comment by: Ev. Drikos on

Nice tip. It helped me exchange tab-delimited files containing Greek text between the Windows and Mac versions of Excel, and at the same time such "*.csv" files are imported into Excel for Windows silently with a double click.

Regarding Excel for Mac, a VB script can likely save tab-delimited Unicode files with the extension ".csv" and also bypass any warning messages on save and on exit. As an amateur in VB, I added such a script along with a custom button to a template named "Personal.xlsb".


Thanks,
Ev. Drikos

Comment by: heriberto perez on

This approach didn't work for me in the end. As I need to support Latin characters such as ñ, é, etc., I ended up using standard UTF-8 and injecting the BOM character "\uFEFF", and now I see the right characters. The only disadvantage is that it creates an empty line at the beginning of the file :(

Comment by: Jesse G Donat on

UTF-8 with a BOM isn't a real solution. It just smells like one from a distance. It will give you trouble if you open the CSV with Excel and then save. The BOM gets stripped, and if you open it with Excel a second time, your characters are now messed up.

UTF-16 supports every character UTF-8 does. It just does it in a totally incompatible way, where characters take two bytes (or four, for characters outside the Basic Multilingual Plane).

Making your code read UTF-16 is however less easy, particularly if you're not familiar with working with multiple encodings.

Presuming you're not operating natively in UTF-16, at some point in the process you need to convert UTF-16 to something you do operate in like UTF-8. You could even do something as simple as just converting the uploaded file at the time of upload with a tool like iconv.
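For example, a rough Python sketch of that conversion step (paths are placeholders), roughly equivalent to running the upload through iconv:

# Convert an uploaded UTF-16 (BOM-prefixed) CSV to UTF-8 as soon as it arrives,
# so the rest of the pipeline only ever deals with UTF-8.
# Roughly: iconv -f UTF-16 -t UTF-8 upload.csv > upload.utf8.csv
def convert_upload_to_utf8(src_path: str, dst_path: str) -> None:
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(src.read())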

Comment by: Frank on

Hey there. You just helped me find the holy grail for generating CSV in Swift:

I ended up prefixing my CSV string with "\u{FEFF}" and exporting it as UTF16LittleEndian.

Now it works with Excel on both Mac and Windows, and with Apple Numbers.

Comment by: Bas on

This seems to work for me on the command line:

$ file export_from_excel.csv
export_from_excel.csv: Non-ISO extended-ASCII text, with CR line terminators

$ cat export_from_excel.csv | dos2unix -c mac | iconv -f WINDOWS-1252 -t utf8 > export_in_utf8_with_standard_newline.csv

$ file export_in_utf8_with_standard_newline.csv
export_in_utf8_with_standard_newline.csv: UTF-8 Unicode text

Comment by: Annette Wilson on

Is this still up to date? I don't observe the round-trip problem described here in Excel for Mac 16.19. When I open a file in CSV UTF-8 with BOM, edit it, save it, close it and reopen it, the BOM has been preserved and the non-ASCII content remains intact.

Comment by: Jesse G Donat on

You are correct. It appears that in a recent update they have added a "CSV UTF-8 (Comma delimited)" format, distinct from their usual "Comma Separated Values", which handles UTF-8 loading and saving safely.

This is wonderful news and I would like to thank you for bringing this to my attention. I have amended the post with a note at the top describing the fix.

Comment by: Abul Hasan Lakhani on

I just tested Excel for Office 365 with a CSV formatted as UTF-8 without a BOM, and after you save the file, it adds the BOM if it's not there. Can someone verify please?

Comment by: Maarten D on

Maarten D's Gravatar the new "CSV UTF-8" option in Excel is indeed a good step forward.
But it is still missing an important option, saving all fields in double quotes. many cloud-based databases require it.
Sigh, why can't Excel just to it like OpenOffice Calc and similar programs?

Comment by: Michael Johansen on

Just want to say thanks. This post solved a proper headache for me! Excellent writeup.

Comment by: 123 on

Life is not to surpass others, but to surpass oneself
