How best to parse a simple grammar?

36,577

Solution 1

def parse(astr):
    astr=astr.replace(',','')
    astr=astr.replace('and','')    
    tokens=astr.split()
    dept=None
    number=None
    result=[]
    option=[]
    for tok in tokens:
        if tok=='or':
            result.append(option)
            option=[]
            continue
        if tok.isalpha():
            dept=tok
            number=None
        else:
            number=int(tok)
        if dept and number:
            option.append((dept,number))
    else:
        if option:
            result.append(option)
    return result

if __name__=='__main__':
    tests=[ ("CS 2110" , [[("CS", 2110)]]),
            ("CS 2110 and INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, 3300, 3140", [[("CS", 2110), ("CS", 3300), ("CS", 3140)]]),
            ("CS 2110 or INFO 3300", [[("CS", 2110)], [("INFO", 3300)]]),
            ("MATH 2210, 2230, 2310, or 2940", [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]])]

    for test,answer in tests:
        result=parse(test)
        if result==answer:
            print('GOOD: {0} => {1}'.format(test,answer))
        else:
            print('ERROR: {0} => {1} != {2}'.format(test,result,answer))
            break

yields

GOOD: CS 2110 => [[('CS', 2110)]]
GOOD: CS 2110 and INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, 3300, 3140 => [[('CS', 2110), ('CS', 3300), ('CS', 3140)]]
GOOD: CS 2110 or INFO 3300 => [[('CS', 2110)], [('INFO', 3300)]]
GOOD: MATH 2210, 2230, 2310, or 2940 => [[('MATH', 2210), ('MATH', 2230), ('MATH', 2310)], [('MATH', 2940)]]

Solution 2

For simple grammars I really like Parsing Expression Grammars (PEGs), which amount to a disciplined, structured way of writing a recursive-descent parser. In a dynamically typed language like Python you can do useful things without having a separate "parser generator". That means no nonsense with reduce-reduce conflicts or other arcana of LR parsing.

I did a little searching and pyPEG appears to be a nice library for Python.

Solution 3

I know that this question is about a decade old and has certainly been answered now. I am mainly posting this answer to prove myself that I have understood PEG parsers at last. I'm using the fantastic parsimonious module here.
That being said, you could come up with a parsing grammar, build an ast and visit this one to obtain the desired structure:

from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
from itertools import groupby

grammar = Grammar(
    r"""
    term            = course (operator course)*
    course          = coursename? ws coursenumber
    coursename      = ~"[A-Z]+"
    coursenumber    = ~"\d+"
    operator        = ws (and / or / comma) ws
    and             = "and"
    or              = (comma ws)? "or"
    comma           = ","
    ws              = ~"\s*"
    """
)

class CourseVisitor(NodeVisitor):
    def __init__(self):
        self.current = None
        self.courses = []
        self.listnum = 1

    def generic_visit(self, node, children):
        pass

    def visit_coursename(self, node, children):
        if node.text:
            self.current = node.text

    def visit_coursenumber(self, node, children):
        course = (self.current, int(node.text), self.listnum)
        self.courses.append(course)

    def visit_or(self, node, children):
        self.listnum += 1

courses = ["CS 2110", "CS 2110 and INFO 3300",
           "CS 2110, INFO 3300", "CS 2110, 3300, 3140",
           "CS 2110 or INFO 3300", "MATH 2210, 2230, 2310, or 2940"]

for course in courses:
    tree = grammar.parse(course)
    cv = CourseVisitor()
    cv.visit(tree)
    courses = [list(v) for _, v in groupby(cv.courses, lambda x: x[2])]
    print(courses)

Here, we walk our way from bottom to top, starting with brickets like whitespace, the operators or, and and , which will eventually lead to the course and finally the term. The visitor class builds the desired (well, kind of, one needs to get rid of the last tuple element) structure.

Solution 4

Just in the name of completness there is SLY. The creator David Beazley has a great talk about it at PyCon 2018 which is fun.

Share:
36,577
Nick Heiner
Author by

Nick Heiner

JS enthusiast by day, horse mask enthusiast by night. Talks I've Done

Updated on November 25, 2021

Comments

  • Nick Heiner
    Nick Heiner over 2 years

    Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale.

    I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them.

    From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.)

    Some sample inputs and outputs:

    "CS 2110" => ("CS", 2110) # 0
    
    "CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
    "CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
    "CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1
    
    "CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2
    
    "MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3  
    
    1. If the entire description is just a course, it is output directly.

    2. If the courses are conjoined ("and"), they are all output in the same list

    3. If the course are disjoined ("or"), they are in separate lists

    4. Here, we have both "and" and "or".

    One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3.

    What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates:

    def p_course(p):
     'course : DEPT_CODE COURSE_NUMBER'
     p[0] = (p[1], int(p[2]))
    

    With PyParse, it's less clear how to modify the output of parseString(). I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done.

     def addCourse(self, str, location, tokens):
      self.result.append((tokens[0][0], tokens[0][1]))
    
     def makeCourseList(self, str, location, tokens):
    
      dept = tokens[0][0]
      new_tokens = [(dept, tokens[0][1])]
      new_tokens.extend((dept, tok) for tok in tokens[1:])
    
      self.result.append(new_tokens)
    

    For instance, to handle "or" cases:

        def __init__(self):
                self.result = []
                # ...
      self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses)
    
    
    
     def disjunctionCourses(self, str, location, tokens):
      if len(tokens) == 1:
       return tokens
    
      print "disjunction tokens: %s" % tokens
    

    How does disjunctionCourses() know which smaller phrases to disjoin? All it gets is tokens, but what's been parsed so far is stored in result, so how can the function tell which data in result corresponds to which elements of token? I guess I could search through the tokens, then find an element of result with the same data, but that feel convoluted...

    Also, there are many descriptions that include misc text, like:

    "CS 2110 or permission of instructor"
    "INFO 3140 or equivalent experience"
    "PYSCH 2210 and sophomore standing"
    

    But it isn't critical that I parse that text.

    What's a better way to approach this problem?

  • Nick Heiner
    Nick Heiner almost 14 years
    Wow, that is much simpler than other attempts I'd made. How do you come up with that?
  • unutbu
    unutbu almost 14 years
    @Rosarch: I'm sure there are ways to improve what I've written, but I guess the key idea is that you can read the tokens from left to right and build the result by keeping track of your state. Once you've found a dept like "CS", all numbers that follow refer to "CS" until you find a different dept... I wrote the test code first, and then the parse function in many iterations to pass the tests. In my first pass at this problem I ignored "and" and "or". Then in the second pass I realized "and" is sort of unimportant, but "or" requires the use of a second list, option. Hope this helps.
  • Eric Lippert
    Eric Lippert almost 14 years
    Note that there is a huge difference between using a grammar as a generator of strings in a language and using a grammar as a recognizer of strings in a language. The former problem is very easy; as you saw I did it in a few dozen lines of code. The latter is quite difficult, particularly if the grammar is complex.
  • Josh Smeaton
    Josh Smeaton almost 14 years
    @eric fair enough. After I wrote this answer, I made a short attempt at doing it myself and discovered it was quite different, and a lot more difficult for someone that's fumbling their way through.
  • minghua
    minghua over 5 years
    The pyPEG source looks concise and well-written. Awesome.
  • Jan
    Jan about 5 years
    Have added an answer with a PEG parser.
  • CpILL
    CpILL almost 2 years
    also, the link is dead