Regular expression to match comma separated list of key=value where value can contain commas

Solution 1

Just for comparison purposes, here's a regex that seems to solve the problem as well:

([^=]+)    # key
=          # equals is how we tokenise the original string
([^=]+)    # value
(?:,|$)    # value terminator, either comma or end of string

The trick here is to restrict what you're capturing in your second group. .+ swallows the = sign, which is the character we can use to distinguish keys from values. The full regex uses no lookaround or backreferences (so it should be compatible with something like re2, if that's desirable) and works on abarnert's examples.

Usage as follows:

re.findall(r'([^=]+)=([^=]+)(?:,|$)', 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')

Which returns:

[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
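
The commented layout above can be compiled as-is with Python's re.VERBOSE flag. A minimal sketch, assuming that neither keys nor values ever contain an = sign; the names PAIR_RE and parse_pairs_regex are illustrative and not part of the original answer:

```python
import re

# Same pattern as above, kept in its commented form via re.VERBOSE.
PAIR_RE = re.compile(r"""
    ([^=]+)    # key
    =          # separator between key and value
    ([^=]+)    # value (may contain commas, but not '=')
    (?:,|$)    # value terminator: a comma or end of string
""", re.VERBOSE)


def parse_pairs_regex(text):
    """Return a list of (key, value) tuples found in text."""
    return PAIR_RE.findall(text)


pairs = parse_pairs_regex('foo=bar,breakfast=spam,eggs')
# pairs == [('foo', 'bar'), ('breakfast', 'spam,eggs')]
```

Because the pattern only excludes =, keys are not restricted to any particular character set here; tightening the first group (for example to [a-z_]+) is a choice left to the caller.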

Solution 2

daramarak's answer either very nearly works, or works as-is; it's hard to tell from the way the sample output is formatted and the vague descriptions of the steps. But if it's the very-nearly-works version, it's easy to fix.

Putting it into code:

>>> bits=[x.rsplit(',', 1) for x in s.split('=')]
>>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]

The first line is (I believe) daramarak's answer. By itself, the first line gives you pairs of (value_i, key_i+1) instead of (key_i, value_i). The second line is the most obvious fix for that. With more intermediate steps, and a bit of output, to see how it works:

>>> s = 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam'
>>> bits0 = s.split('=')
>>> bits0
['foo', 'bar,breakfast', 'spam,eggs,blt', 'bacon,lettuce,tomato,spam', 'spam']
>>> bits = [x.rsplit(',', 1) for x in bits0]
>>> bits
[['foo'], ['bar', 'breakfast'], ['spam,eggs', 'blt'], ['bacon,lettuce,tomato', 'spam'], ['spam']]
>>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]
>>> kv
[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
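
One caveat with the split/rsplit approach: the last chunk after the final = is a value only, with no key following it, so applying rsplit to it would cut a final value such as 'spam,eggs' down to 'spam'. A minimal sketch that leaves the last chunk unsplit, assuming well-formed input where = never appears inside a key or value; the name parse_pairs_split is illustrative:

```python
def parse_pairs_split(text):
    """Parse 'k1=v1,k2=v2,...' where values may contain commas."""
    chunks = text.split('=')
    if len(chunks) < 2:
        return []  # no '=' at all, so there are no pairs

    keys = [chunks[0]]      # the first chunk is just the first key
    values = []
    # Every middle chunk has the form "<value>,<next key>".
    for chunk in chunks[1:-1]:
        value, key = chunk.rsplit(',', 1)
        values.append(value)
        keys.append(key)
    values.append(chunks[-1])  # the last chunk is just the last value

    return list(zip(keys, values))


pairs = parse_pairs_split('foo=bar,breakfast=spam,eggs')
# pairs == [('foo', 'bar'), ('breakfast', 'spam,eggs')]
```

A middle chunk without a comma (for instance from 'a=b=c') raises ValueError in this sketch, which may or may not be the behaviour you want for malformed input.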

Solution 3

Could I suggest that you use split operations as before, but split at the equals signs first and then split each piece at its rightmost comma, flattening the result into a single list of left and right strings?

input = "bob=whatever,king=kong,banana=herb,good,yellow,thorn=hurts"

The first split turns this into:

first_split = input.split("=")
# first_split = ['bob', 'whatever,king', 'kong,banana', 'herb,good,yellow,thorn', 'hurts']

then splitting at rightmost comma gives you:

second_split = [item for sublist in first_split for item in sublist.rsplit(",", 1)]
# second_split = ['bob', 'whatever', 'king', 'kong', 'banana', 'herb,good,yellow', 'thorn', 'hurts']

then you just gather the pairs like this:

pairs = dict(zip(second_split[::2],second_split[1::2]))

Author: Kimvais

Updated on June 04, 2022

Comments

  • Kimvais
    Kimvais almost 2 years

    I have a naive "parser" that simply does something like:
    [x.split('=') for x in mystring.split(',')]

    However mystring can be something like
    'foo=bar,breakfast=spam,eggs'

    Obviously, the naive splitter will not handle that. I am limited to the Python 2.6 standard library for this,
    so pyparsing, for example, cannot be used.

    Expected output is
    [('foo', 'bar'), ('breakfast', 'spam,eggs')]

    I'm trying to do this with regex, but am facing the following problems:

    My first attempt,
    r'([a-z_]+)=(.+),?'
    gave me
    [('foo', 'bar,breakfast=spam,eggs')]

    Making .+ non-greedy does not solve the problem on its own.

    So I'm guessing I have to somehow make the trailing comma (or $) mandatory.
    Doing just that does not really work either:
    r'([a-z_]+)=(.+?)(?:,|$)'
    With that, everything after the first comma in a value that contains one is dropped,
    e.g. [('foo', 'bar'), ('breakfast', 'spam')]

    I think I must use some sort of look-behind(?) operation.
    The Question(s)
    1. Which one do I use? or
    2. How do I do that/this?

    Edit:

    Based on daramarak's answer below,
    I ended up doing pretty much the same thing as abarnert later suggested, in a slightly more verbose form:

    vals = [x.rsplit(',', 1) for x in (data.split('='))]
    ret = list()
    while vals:
        value = vals.pop()[0]
        key = vals[-1].pop()
        ret.append((key, value))
        if len(vals[-1]) == 0:
            break
    

    EDIT 2:

    Just to satisfy my curiosity, is this actually possible with pure regular expressions? I.e., so that re.findall() would return a list of 2-tuples?

  • Kimvais
    Kimvais over 11 years
    I think that this could work, but how exactly would you just gather the pairs so that it works for all inputs?
  • abarnert
    abarnert over 11 years
    This would be much easier to read with separators in the example output. It's hard to tell from bob whatever king kong banana… that bob is a key, whatever is a value, etc. If I understand what you're doing, the last step will not work, because you're actually going to have (value_n, key_n+1) pairs, not (key_n, value_n) pairs. But it's close. See my answer.
  • abarnert
    abarnert over 11 years
    Whoever downvoted this, I don't think it deserved it. It's a bit vague, but the approach is pretty clearly right, and it's just missing the last step, which would be obvious once you printed out the results.
  • daramarak
    daramarak over 11 years
    I have been vague here on purpose; I do not think writing the code is necessary here. It seemed clear to me that OP would be able to do the programming, and I simply wanted OP to look at the problem in another way.
  • daramarak
    daramarak over 11 years
    Clarified my answer to show my meaning. But you manage to flatten the structure and pair it in one go. That is +1 from me.
  • Kimvais
    Kimvais over 11 years
    Actually, for once I have to say that the regular expression method is much easier to read than the other way of doing things. Awesome! +1
  • abarnert
    abarnert over 11 years
    Ah, now I get what you were doing. The problem with the original version was that, without showing the commas or quotes or any other way to tell where the boundaries were, it was ambiguous. Anyway, flattening it out and then rezipping as you did is possibly clearer than the (x[i][-1], x[i+1][0]) bit I did, even if it does seem like an extra step.
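
On the lookaround question raised in the original post: a lookahead, rather than a look-behind, appears to be enough. A minimal sketch, assuming keys match [a-z_]+ as in the asker's own attempts; the names LOOKAHEAD_RE and parse_pairs_lookahead are illustrative and do not come from any of the answers:

```python
import re

# Match the value lazily, and stop only where the next key begins
# (",<key>=") or where the string ends. The lookahead consumes nothing,
# so the following key is still available for the next match.
LOOKAHEAD_RE = re.compile(r'([a-z_]+)=(.+?)(?=,[a-z_]+=|$)')


def parse_pairs_lookahead(text):
    """Return a list of (key, value) tuples found in text."""
    return LOOKAHEAD_RE.findall(text)


pairs = parse_pairs_lookahead('foo=bar,breakfast=spam,eggs')
# pairs == [('foo', 'bar'), ('breakfast', 'spam,eggs')]
```

Lookahead assertions are part of the standard re module, so this stays within the standard-library constraint, but engines without lookaround support (re2, for example) would not accept this pattern.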