Can I import a CSV file and automatically infer the delimiter?

65,209

Solution 1

To solve the problem, I have created a function which reads the first line of a file (header) and detects the delimiter.

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header=myCsvfile.readline()
        if header.find(";")!=-1:
            return ";"
        if header.find(",")!=-1:
            return ","
    #default delimiter (MS Office export)
    return ";"

Solution 2

The csv module seems to recommend using the csv sniffer for this problem.

They give the following example, which I've adapted for your case.

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

Let's try it out.

[9:13am][wlynch@watermelon /tmp] cat example 
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

And our sample inputs

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv 
test,box,foo
round,the,bend

[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv 
round;the;bend
who;are;you

[9:22am][wlynch@watermelon /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

And if we execute the example program:

[9:14am][wlynch@watermelon /tmp] ./example 
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

It's also probably worth noting what version of python I'm using.

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2

Solution 3

Given a project that deals with both , (comma) and | (vertical bar) delimited CSV files, which are well formed, I tried the following (as given at https://docs.python.org/2/library/csv.html#csv.Sniffer):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

However, on a |-delimited file, the "Could not determine delimiter" exception was returned. It seemed reasonable to speculate that the sniff heuristic might work best if each line has the same number of delimiters (not counting whatever might be enclosed in quotes). So, instead of reading the first 1024 bytes of the file, I tried reading the first two lines in their entirety:

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

So far, this is working well for me.

Solution 4

And if you're using DictReader you can do that:

#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

I used this with Python 3.5 and it worked this way.

Solution 5

I don't think there can be a perfectly general solution to this (one of the reasons I might use , as a delimiter is that some of my data fields need to be able to include ;...). A simple heuristic for deciding might be to simply read the first line (or more), count how many , and ; characters it contains (possibly ignoring those inside quotes, if whatever creates your .csv files quotes entries properly and consistently), and guess that the more frequent of the two is the right delimiter.

Share:
65,209
rom
Author by

rom

Updated on July 09, 2022

Comments

  • rom
    rom almost 2 years

    I want to import two kinds of CSV files, some use ";" for delimiter and others use ",". So far I have been switching between the next two lines:

    reader=csv.reader(f,delimiter=';')
    

    or

    reader=csv.reader(f,delimiter=',')
    

    Is it possible not to specify the delimiter and to let the program check for the right delimiter?

    The solutions below (Blender and sharth) seem to work well for comma-separated files (generated with Libroffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file:

    ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
    1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
    1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
    
  • rom
    rom about 11 years
    That works for the comma-separated file but the semicolon-separated file cannot be read properly (Could not determine delimiter). See my edit above...
  • Bill Lynch
    Bill Lynch about 11 years
    I've included an example with a comma separated, semicolon separated, and the example file that you suggest in the question.
  • Sohaib
    Sohaib over 9 years
    @sharth What is the file has to be delimited on either a space or a tab? Can this be done?
  • Bill Lynch
    Bill Lynch over 9 years
    @Sohaib: If you mean a single file that has both space and tab delimiters, then I don't know if csv supports that. But that is getting pretty far from the definition of a csv file. I would recommend either doing research, or asking a separate question here.
  • ghosh'.
    ghosh'. almost 9 years
    This is not working for me. I have a CSV file which also has right and left square brackets ( [ , ] ) and sniffer could not detect this as delimiters. I have tried dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=',[]')
  • Bill Lynch
    Bill Lynch almost 9 years
    @SoumyabrataGhosh: I encourage you to post a separate question with your issue.
  • mauve
    mauve about 8 years
    This was so helpful to me! I was having trouble with data in which one of the "pegged" values was numbers with commas in them, so it kept failing. This limiting it to the first 2 lines really helped.
  • tashuhka
    tashuhka over 7 years
    Your function will not work if the delimiter is part of a value, even if it is scaped or quoted. For example, a line like "Hi Peter;", "How are you?", "Bye John!" will return ; as delimiter, which is wrong.
  • 3isenHeim
    3isenHeim over 7 years
    Great, worked for me with my |-separated "csv" files. Thanks :)
  • alvaro562003
    alvaro562003 about 7 years
    I used it in python 2.7