How best to parse a simple grammar?
Solution 1
def parse(astr):
astr=astr.replace(',','')
astr=astr.replace('and','')
tokens=astr.split()
dept=None
number=None
result=[]
option=[]
for tok in tokens:
if tok=='or':
result.append(option)
option=[]
continue
if tok.isalpha():
dept=tok
number=None
else:
number=int(tok)
if dept and number:
option.append((dept,number))
else:
if option:
result.append(option)
return result
if __name__=='__main__':
tests=[ ("CS 2110" , [[("CS", 2110)]]),
("CS 2110 and INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
("CS 2110, INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
("CS 2110, 3300, 3140", [[("CS", 2110), ("CS", 3300), ("CS", 3140)]]),
("CS 2110 or INFO 3300", [[("CS", 2110)], [("INFO", 3300)]]),
("MATH 2210, 2230, 2310, or 2940", [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]])]
for test,answer in tests:
result=parse(test)
if result==answer:
print('GOOD: {0} => {1}'.format(test,answer))
else:
print('ERROR: {0} => {1} != {2}'.format(test,result,answer))
break
yields
GOOD: CS 2110 => [[('CS', 2110)]]
GOOD: CS 2110 and INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, 3300, 3140 => [[('CS', 2110), ('CS', 3300), ('CS', 3140)]]
GOOD: CS 2110 or INFO 3300 => [[('CS', 2110)], [('INFO', 3300)]]
GOOD: MATH 2210, 2230, 2310, or 2940 => [[('MATH', 2210), ('MATH', 2230), ('MATH', 2310)], [('MATH', 2940)]]
Solution 2
For simple grammars I really like Parsing Expression Grammars (PEGs), which amount to a disciplined, structured way of writing a recursive-descent parser. In a dynamically typed language like Python you can do useful things without having a separate "parser generator". That means no nonsense with reduce-reduce conflicts or other arcana of LR parsing.
I did a little searching and pyPEG appears to be a nice library for Python.
Solution 3
I know that this question is about a decade old and has certainly been answered now. I am mainly posting this answer to prove myself that I have understood PEG
parsers at last. I'm using the fantastic parsimonious
module here.
That being said, you could come up with a parsing grammar, build an ast and visit this one to obtain the desired structure:
from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
from itertools import groupby
grammar = Grammar(
r"""
term = course (operator course)*
course = coursename? ws coursenumber
coursename = ~"[A-Z]+"
coursenumber = ~"\d+"
operator = ws (and / or / comma) ws
and = "and"
or = (comma ws)? "or"
comma = ","
ws = ~"\s*"
"""
)
class CourseVisitor(NodeVisitor):
def __init__(self):
self.current = None
self.courses = []
self.listnum = 1
def generic_visit(self, node, children):
pass
def visit_coursename(self, node, children):
if node.text:
self.current = node.text
def visit_coursenumber(self, node, children):
course = (self.current, int(node.text), self.listnum)
self.courses.append(course)
def visit_or(self, node, children):
self.listnum += 1
courses = ["CS 2110", "CS 2110 and INFO 3300",
"CS 2110, INFO 3300", "CS 2110, 3300, 3140",
"CS 2110 or INFO 3300", "MATH 2210, 2230, 2310, or 2940"]
for course in courses:
tree = grammar.parse(course)
cv = CourseVisitor()
cv.visit(tree)
courses = [list(v) for _, v in groupby(cv.courses, lambda x: x[2])]
print(courses)
Here, we walk our way from bottom to top, starting with brickets like whitespace, the operators or
, and
and ,
which will eventually lead to the course and finally the term
. The visitor class builds the desired (well, kind of, one needs to get rid of the last tuple element) structure.
Solution 4
Just in the name of completness there is SLY. The creator David Beazley has a great talk about it at PyCon 2018 which is fun.
Nick Heiner
JS enthusiast by day, horse mask enthusiast by night. Talks I've Done
Updated on November 25, 2021Comments
-
Nick Heiner over 2 years
Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale.
I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them.
From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.)
Some sample inputs and outputs:
"CS 2110" => ("CS", 2110) # 0 "CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1 "CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2 "MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3
If the entire description is just a course, it is output directly.
If the courses are conjoined ("and"), they are all output in the same list
If the course are disjoined ("or"), they are in separate lists
Here, we have both "and" and "or".
One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3.
What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates:
def p_course(p): 'course : DEPT_CODE COURSE_NUMBER' p[0] = (p[1], int(p[2]))
With PyParse, it's less clear how to modify the output of
parseString()
. I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done.def addCourse(self, str, location, tokens): self.result.append((tokens[0][0], tokens[0][1])) def makeCourseList(self, str, location, tokens): dept = tokens[0][0] new_tokens = [(dept, tokens[0][1])] new_tokens.extend((dept, tok) for tok in tokens[1:]) self.result.append(new_tokens)
For instance, to handle "or" cases:
def __init__(self): self.result = [] # ... self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses) def disjunctionCourses(self, str, location, tokens): if len(tokens) == 1: return tokens print "disjunction tokens: %s" % tokens
How does
disjunctionCourses()
know which smaller phrases to disjoin? All it gets is tokens, but what's been parsed so far is stored inresult
, so how can the function tell which data inresult
corresponds to which elements oftoken
? I guess I could search through the tokens, then find an element ofresult
with the same data, but that feel convoluted...Also, there are many descriptions that include misc text, like:
"CS 2110 or permission of instructor" "INFO 3140 or equivalent experience" "PYSCH 2210 and sophomore standing"
But it isn't critical that I parse that text.
What's a better way to approach this problem?
-
Nick Heiner almost 14 yearsWow, that is much simpler than other attempts I'd made. How do you come up with that?
-
unutbu almost 14 years@Rosarch: I'm sure there are ways to improve what I've written, but I guess the key idea is that you can read the tokens from left to right and build the result by keeping track of your state. Once you've found a dept like "CS", all numbers that follow refer to "CS" until you find a different dept... I wrote the test code first, and then the parse function in many iterations to pass the tests. In my first pass at this problem I ignored "and" and "or". Then in the second pass I realized "and" is sort of unimportant, but "or" requires the use of a second list,
option
. Hope this helps. -
Eric Lippert almost 14 yearsNote that there is a huge difference between using a grammar as a generator of strings in a language and using a grammar as a recognizer of strings in a language. The former problem is very easy; as you saw I did it in a few dozen lines of code. The latter is quite difficult, particularly if the grammar is complex.
-
Josh Smeaton almost 14 years@eric fair enough. After I wrote this answer, I made a short attempt at doing it myself and discovered it was quite different, and a lot more difficult for someone that's fumbling their way through.
-
minghua over 5 yearsThe pyPEG source looks concise and well-written. Awesome.
-
Jan about 5 yearsHave added an answer with a
PEG
parser. -
CpILL almost 2 yearsalso, the link is dead