Parsing mbox files in Python
I haven't tested this, but something like this might work for you. Just open the file (in binary mode so your byte counts are correct), and scan through it, finding messages.
def is_mail_start(line):
    # the file is opened in binary mode, so compare against bytes
    return line.startswith(b"From ")

def build_index(fname):
    with open(fname, "rb") as f:
        i = 0  # byte offset of the current message
        b = 0  # byte length of the current message
        # find start of first message, skipping anything that precedes it
        for line in f:
            if is_mail_start(line):
                b = len(line)
                break
            i += len(line)
        else:
            return  # no message found in the file
        # find start of each message, and yield up (index, length) of previous message
        for line in f:
            if is_mail_start(line):
                yield (i, b)
                i += b
                b = 0
            b += len(line)
        yield (i, b)  # yield up (index, length) of last message

# get index as a list
mbox_index = list(build_index(fname))
Once you have the index, you can use the .seek() method on a file object to seek there, and .read(length) on the file object to read just one message. I'm not sure how you will use the mailbox module with a string, though; I think it is meant to work on a mailbox in-place. Maybe there is some other mail-parsing module you can use.
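A minimal sketch of that lookup, assuming Python 3 and (index, length) pairs like the ones build_index produces. The helper name read_message is hypothetical, and email.message_from_bytes from the standard library stands in for the mailbox module here, since it can parse a single message from raw bytes:

```python
import email


def read_message(fname, offset, length):
    """Read one message's raw bytes from an mbox file and parse it.

    offset and length are assumed to be one (index, length) pair
    from the index built over the same file.
    """
    with open(fname, "rb") as f:
        f.seek(offset)          # jump straight to the "From " line
        raw = f.read(length)    # read exactly this message's bytes
    # The parser records a leading "From " line as the envelope header.
    return email.message_from_bytes(raw)
```

For example, `offset, length = mbox_index[5]` followed by `read_message(fname, offset, length)["subject"]` should fetch the subject of the sixth message without scanning the rest of the file.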

Mark Fletcher
Updated on June 19, 2022
Comments
-
Mark Fletcher 6 months
Python newbie here. I want to walk through a large mbox file, parsing email messages. I can do that with:
import sys
import mailbox

def gen_summary(filename):
    mbox = mailbox.mbox(filename)
    for message in mbox:
        subj = message['subject']
        print(subj)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print('Usage: python genarchivesum.py mbox')
        sys.exit(1)
    gen_summary(sys.argv[1])
But I need more control. I need to be able to get the byte position of the start of a given email in the mbox file and I also need to get the number of bytes in the message (as represented on disk). And then in the future, instead of iterating from the beginning of the mbox file, I need to be able to seek to a given message and just parse that (hence one of the needs of getting the byte position on disk). These are large mbox files and efficiency is a concern.
The purpose of all this is so that I can generate a summary file, which contains some small bits about each email in the mbox, and then in the future efficiently look up individual emails within the mbox.
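One way such a summary file could look, as a rough sketch assuming Python 3: a single pass over the mbox that writes one CSV row per message with its byte offset, its byte length, and its subject. The names write_summary and _write_row and the CSV layout are made up for illustration, and the sketch assumes every message starts with a line beginning "From ":

```python
import csv
from email.parser import BytesParser


def _write_row(writer, start, lines):
    """Write (offset, length, subject) for one message's raw lines."""
    raw = b"".join(lines)
    # headersonly=True skips body parsing; only the headers are needed here.
    msg = BytesParser().parsebytes(raw, headersonly=True)
    writer.writerow([start, len(raw), msg["subject"]])


def write_summary(mbox_path, summary_path):
    """Scan an mbox file once and write a CSV summary of its messages."""
    with open(mbox_path, "rb") as f, open(summary_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["offset", "length", "subject"])
        offset = 0    # byte offset of the line currently being read
        start = None  # byte offset of the current message (None before the first)
        lines = []    # raw lines of the current message
        for line in f:
            if line.startswith(b"From "):
                if start is not None:
                    _write_row(writer, start, lines)
                start, lines = offset, []
            if start is not None:
                lines.append(line)
            offset += len(line)
        if start is not None:
            _write_row(writer, start, lines)  # last message in the file
```

A later lookup could then read the offset and length columns from the summary and seek directly to one message instead of iterating from the beginning of the mbox.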
-
Mark Fletcher over 10 years
Ok, thanks. I guess I'll use something like this strategy. btw, the start of an email in an mbox begins with 'From ' (without the :). I can use email.Parser to parse the email. Thanks.
-
steveha over 10 years
I'll edit the answer to take out the ':'. I did say I didn't test it... Good luck with your project, and have a great weekend!
-
adammenges over 7 years
For what it's worth, for future users, it's actually both, at least on the latest version of OSX.
def is_mail_start(line):
    return line.startswith("From") and not line.startswith("From:")
-
steveha over 7 years
If the From that marks the start is always followed by a space, you could just search for the string "From " (note the space at the end). This wouldn't match From: with a colon.