Converting CSV to TSV

29,305

Solution 1

Python

Add to file named csv2tab, and make it executable

touch csv2tab && chmod u+x csv2tab

Add to it

#!/usr/bin/env python
import csv, sys
csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))

Test runs

$ echo 'A,,C,"D,E,F","G",I,"K,L,M",Z' | ./csv2tab                     
A       C   D,E,F   G   I   K,L,M   Z

$ ./csv2tab < data.csv > data.tsv && head data.tsv                                                   
1A      C   D,E,F   G   I   K,L,M   Z
2A      C   D,E,F   G   I   K,L,M   Z
3A      C   D,E,F   G   I   K,L,M   Z

Solution 2

Using csvkit (Python), for example:

$ csvformat -T in.csv > out.txt

Does streaming, with correct CSV and TSV quoting and escaping

It's in apt and other package managers

Solution 3

For fun, sed.

sed -E 's/("([^"]*)")?,/\2\t/g' file

If your sed doesn't support -E, try with -r. If your sed doesn't support \t for a literal tab, try putting a literal tab (in many shells, ctrl-v tab) or in Bash, use a $'...' C-style string (in which case the backslash in \2 needs to be doubled). If you want to keep the quotes, use \1 instead of \2 (in which case the inner pair of parentheses is useless, and can be removed).

If your sed doesn't support either -E or -r, try

sed 's/\("\([^"]*\)"\)\?,/\2\t/g' file

again possibly with the tweaks suggested above if \t is not supported.

For additional fun, here's the same thing with the Bash "here-string" syntax, just to demonstrate what it looks like. Notice how the literal backslashes we want sed to receive are now doubled:

sed $'s/\\("\\([^"]*\\)"\\)\\?,/\\2\t/g' file

This makes no attempt to handle escaped double quotes inside double quotes; some CSV dialects support this by doubling the quoted double quote (sic).

Solution 4

One option might be perl's Text::CSV module e.g.

perl -MText::CSV -lne 'BEGIN { $csv = Text::CSV->new() }
  print join "\t", $csv->fields() if $csv->parse($_)
' somefile

to demonstrate

echo 'A,,C,"D,E,F","G",I,"K,L,M",Z' |
  perl -MText::CSV -lne 'BEGIN { $csv = Text::CSV->new() }
  print join "\t", $csv->fields() if $csv->parse($_)
'
A       C   D,E,F   G   I   K,L,M   Z

Solution 5

I authored an open-source CSV to TSV converter that handles the transformations described. It's quite fast, may be worth a look if there's an on-going need to convert large CSV files. Tool is part of eBay's TSV utilities toolkit (csv2tsv documentation here). Default options suffice for the input described:

$ csv2tsv file.csv > file.tsv

A consideration when converting CSV to TSV is handling of field and record delimiters (comma and newline) in the data. CSV uses an escape syntax. If the goal is to use the output with Unix tools like cut, awk, etc., the output needs to be free of escapes. Most solutions listed here produce CSV style escapes when delimiters are in the data. csv2tsv is differentiated from other solutions in that it produces TSV without escapes. See the documentation for details.

To see what a particular solution does, convert a CSV containing commas, tabs, quotes, and newlines in the data. For example:

$ echo $'Line,Field1,Field2\n1,"Comma: |,|","Quote: |""|"\n"2","TAB: |\t|","Newline: |\n|"' | <conversion-script-or-command>

Solutions generating escapes will put double quotes around the fields containing quotes, newlines, or tabs.

Share:
29,305

Related videos on Youtube

Admin
Author by

Admin

Updated on September 18, 2022

Comments

  • Admin
    Admin over 1 year

    I have a number of large CSV files and would like them in TSV (tab separated format). The complication is that there are commas in the fields of the CSV file, eg:

     A,,C,"D,E,F","G",I,"K,L,M",Z
    

    Expected output:

     A      C   D,E,F   G   I   K,L,M   Z
    

    (where whitespace in between are 'hard' tabs)

    I have Perl, Python, and coreutils installed on this server.

  • George Vasiliou
    George Vasiliou about 7 years
    I think that i tried about 100 different sed scripts to achieve this one but all my attempts failed. This is awesome.
  • Pont
    Pont about 7 years
    I think the true thermonuclear flyswatter would be writing a Java utility to do it via LibreOffice's UNO API :).
  • agc
    agc about 6 years
    When the input data 'A,,C,"D,E,F","G",I,"K,L,M","Z,A"' is input to this answer, then the "Z,A" is incorrectly replaced with Z A, rather than the correct Z,A.
  • kev
    kev about 4 years
    i get the following error when I try your solution: No protocol specified import-im6.q16: unable to open X server :0' @ error/import.c/ImportImageCommand/358. csv2tab.sh: line 3: syntax error near unexpected token sys.stdout,' csv2tab.sh: line 3: csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))'
  • OneCricketeer
    OneCricketeer about 4 years
    @kev This code doesn't require an X Server, so seems you may have some other issue
  • Ricardo Mayerhofer
    Ricardo Mayerhofer almost 4 years
    Command line I used to fix the quoting: csvformat -U 3 -Q "" -T in.csv > out.tsv
  • JJoao
    JJoao almost 4 years
    obfuscated but very cool!
  • nisetama
    nisetama almost 4 years
    This version also works if a line ends with a double-quoted field: echo '1337,"test input"'|sed -r 's/""/\v/g;s/("([^"]+)")?,/\2\t/g;s/"([^"]+)"$/\1/;s/\v/"/g' (the dollar sign is not required, but it's used to clarify the intent of the command). Both your and my version fail when a field consists of 4 double quotes (i.e. a single double quote in escaped form): "next field is a single escaped double quote","""".
  • Mr. Lance E Sloan
    Mr. Lance E Sloan over 3 years
    I didn't see anything in the original question about disabling headers. When I decided to use mlr to convert a CSV file (with headers) to an equivalent TSV file (keeping headers, naturally), I just used: mlr --c2t cat input.csv > output.tsv The results were perfect.
  • nstatam
    nstatam over 2 years
    your command works with my sed version (4.4). Unfortunatly with another Linux the sed version is 4.2.2 and don't work with the same data, even after replace -E by -r. Do you have a clue how to fix that ? Thanks
  • tripleee
    tripleee over 2 years
    @nstatam If you get an error message for both -E and -r, you will need to refactor to a POSIX BRE regex (see updated answer now). If you don't, I'm guessing you transcribed the script incorrectly. Did you try with a literal tab instead of the symbolic sequence \t? Which Linux distro exactly is this?
  • tom
    tom over 2 years
    Great answer, but note that it doesn't strip the double quotes from the final field if there is no trailing comma (e.g. "a","b" becomes a<TAB>"b"). This can be fixed by appending a comma at the start, then removing the corresponding trailing tab afterward: sed -E 's/$/,/; s/("([^"]*)")?,/\2\t/g; s/\t$//'