How do you dynamically identify unknown delimiters in a data file?

30,837

Solution 1

If you're using python, I'd suggest just calling re.split on the line with all valid expected separators:

>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']

The only issue would be if one of the files used a separator as part of the data.

If you must identify the separator, your best bet is to count everything excluding spaces. If there are almost no occurrences, then it's probably space, otherwise, it's the max of the mapped characters.

Unfortunately, there's really no way to be sure. You may have space separated data filled with commas, or you may have | separated data filled with semicolons. It may not always work.

Solution 2

How about trying Python CSV's standard: http://docs.python.org/library/csv.html#csv.Sniffer

import csv

sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print dialect.delimiter
# returns ','

Solution 3

We can determine the delimiter right most of the time based on some prior information (such as list of common delimiter) and frequency counting that all the lines give the same number of delimiter

def head(filename: str, n: int):
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines


def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    common_delimiters= [',',';','\t',' ','|',':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            if all([ ref == sample_lines[i].count(d) for i in range(1,n)]):
                return d
    return ','

Often n=2 lines should be enough, check more lines for a more robust answers. Of course there are cases (often artificial ones) those lead to a false detection but it is unlikely happened in practice.

Here I use an efficient python implementation of head function that only read n-first line of a file. See my answer on How to read first N-lines of a file

Solution 4

I ended up going with the regex, because of the problem of spaces. Here's my finished code, in case anyone's interested, or could use anything else in it. On a tangential note, it would be neat to find a way to dynamically identify column order, but I realize that's a bit more tricky. In the meantime, I'm falling back on old tricks to sort that out.

for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
            #couldn't quite figure out a way to make this a single block 
            #(rather than three separate if/elifs. But you can see the split is
            #generalized already, so if anyone can come up with a better way,
            #I'm all ears!! :)
            for row in open(infile,'r').readlines():
                if infile.find('comma') > -1: 
                    datefmt = "%m/%d/%Y"
                    last, first, gender, color, dobraw = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                elif infile.find('space') > -1: 
                    datefmt = "%m-%d-%Y"
                    last, first, unused, gender, dobraw, color = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
elif infile.find('pipe') > -1: datefmt = "%m-%d-%Y" last, first, unused, gender, color, dobraw = \ [x.strip() for x in re.split(r'[ ,|;"\t]+', row)] #There is also a way to do this with csv.Sniffer, but the #spaces around the pipe delimiter also confuse sniffer, so #I couldn't use it. else: raise ValueError(infile + "is not an acceptable input file.")

Share:
30,837

Related videos on Youtube

Greg Gauthier
Author by

Greg Gauthier

I have 20 years experience in a wide variety of technical environments, both enterprise and startup. I enjoy tinkering, and spend most of my time on Python, and Java. Though, I have dabbled in other languages over the years. My favorite operating system is Debian! :)

Updated on July 09, 2022

Comments

  • Greg Gauthier
    Greg Gauthier almost 2 years

    I have three input data files. Each uses a different delimiter for the data contained therein. Data file one looks like this:

    apples | bananas | oranges | grapes

    data file two looks like this:

    quarter, dime, nickel, penny

    data file three looks like this:

    horse cow pig chicken goat

    (the change in the number of columns is also intentional)

    The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:

    def count_chars(s):
        valid_seps=[' ','|',',',';','\t']
        cnt = {}
        for c in s:
            if c in valid_seps: cnt[c] = cnt.get(c,0) + 1
        return cnt
    
    infile = 'pipe.txt' #or 'comma.txt' or 'space.txt'
    records = open(infile,'r').read()
    print count_chars(records)

    It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.

    But I can't think of a better way to do this.

    Any suggestions?

  • Greg Gauthier
    Greg Gauthier over 13 years
    That doesn't really solve the problem. <br/> What I end up with, in that case, is every single character in the file split into its own list, like: "['a'] ['p'] ['p'] ['l'] ['e'] ['s'] [' '] ['|'](...and so forth...). What I'd like, instead, is each line broken into a list like, "['apples', 'bananas', 'oranges', 'grapes']"
  • JoshD
    JoshD over 13 years
    I assume you're trying to identify the separator so you can separate the data. Why do you want to identify the separator?
  • JoshD
    JoshD over 13 years
    @Greg Gauthier: I'm terribly sorry. I meant to say re.split. I've changed the answer to reflect the proper method.
  • Greg Gauthier
    Greg Gauthier over 13 years
    <pre><code>infile = 'Data/pipe.txt' records = open(infile,'r').read() for line in records: print line.split('|,; \t')</pre></code>
  • JoshD
    JoshD over 13 years
    @Greg: I still see what you have. I've updated my answer. I had used the wrong split, and have corrected it with an example. I hope this clears things up.
  • John Machin
    John Machin over 13 years
    -1 This has no chance of working. The first arg of str.split is a string representing a single delimiter; it is NOT a string of multiple 1-character delimiters. Unless the input string actually includes '|, \t;', it will be returned unchanged. How the OP got "every single character ... split into its own list", I can't imagine.
  • Greg Gauthier
    Greg Gauthier over 13 years
    @JoshD - Dude, THANK YOU! That works! There's a bit of confusion with the spaces around the other seps (it puts null elements into the lists). BUT, this gets me farther than I got before! :)
  • JoshD
    JoshD over 13 years
    @John Machin: Yes. I've since corrected it to the split I intended. It seems to have worked for him. I do appreciate you pointing this out, though.
  • JoshD
    JoshD over 13 years
    @Greg Gauthier, You might try adding a + (see answer) in the regular expression. Then it will match consecutive delimiters and remove most the empty list items.
  • Greg Gauthier
    Greg Gauthier over 13 years
    ooh. That one is interesting! Is it available in version 2.6?
  • Greg Gauthier
    Greg Gauthier over 13 years
    Thanks again, Josh. My original attempt looked like this: new_line = [i for i in re.split(r'[ ,|;"\t]', line) if i != ''] --- but extending the regex is really a much cleaner approach.
  • Tim Pietzcker
    Tim Pietzcker over 13 years
    +1: Definitely use the csv module for this. Parsing delimited files, especially if they might contain escaped delimiters, delimiters within quoted strings, newlines within quoted strings etc. is no job for a regex. A regex solution will fail sooner or later, and the bugs will be subtle and mind-numbing to find.
  • Matthew Schinckel
    Matthew Schinckel over 13 years
    Wow, I didn't know this existed.
  • John Ledbetter
    John Ledbetter over 13 years
    This is a great answer -- but it won't work for the OPs first example. An input of apples | bananas | oranges | grapes claims that the delimiter is ' '. If you remove the spaces from around the pipes, it will work as expected.
  • agmezr
    agmezr about 7 years
    This is an interesting function on the csv module, but be careful, if you have ; as a separator (another common separator for an csv) and there is a comma on any other value, the Sniffer will return , as the separator. Example sniffer.sniff('quarter,cent;dime;nickel;penny').delimiter will return ,
  • Michael Gendin
    Michael Gendin over 6 years
    But if you have an idea about what your delimiter might look like you can set the ones that have precedence: sniffer.preferred = [';', '|']
  • Vaidøtas I.
    Vaidøtas I. over 4 years
    This does not really answer the question of identifying the separator/delimiter that works. I tried the code and got a delimited list but not the delimiter itself. How do I get the delimiter?
  • BeastOfCaerbannog
    BeastOfCaerbannog about 2 years
    Excellent! Works more reliably than other methods I tried.
  • Wojciech Jakubas
    Wojciech Jakubas almost 2 years
    I was going to create something with similar principles so thanks for the sample code ... but you have to be careful with enclosed string that contains line break CR/LF. It Is valid csv but must be enclosed.