Parsing mbox files in Python

I haven't tested this, but something like this might work for you. Just open the file (in binary mode so your byte counts are correct), and scan through it, finding messages.

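def is_mail_start(line):
    # the file is opened in binary mode, so compare against bytes
    return line.startswith(b"From ")

def build_index(fname):
    """Yield (offset, length) in bytes for each message in the mbox file."""
    with open(fname, "rb") as f:
        start = None  # byte offset of the current message; None until the first "From " line
        pos = 0       # byte offset of the line being examined
        for line in f:
            if is_mail_start(line):
                if start is not None:
                    # the previous message ends where this one begins
                    yield (start, pos - start)
                start = pos
            pos += len(line)
        if start is not None:
            yield (start, pos - start)  # the last message runs to the end of the file

# get index as a list; fname is the path to your mbox file
mbox_index = list(build_index(fname))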

Once you have the index, you can use the .seek() method on a file object to seek there, and .read(length) on the file object to read just one message. I'm not sure how you will use the mailbox module with a string, though; I think it is meant to work on a mailbox in-place. Maybe there is some other mail-parsing module you can use.
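Here is a minimal sketch of that seek-and-read step, not run against a real mailbox. `read_message` is a made-up helper name, and it assumes the (offset, length) pair comes from an index like the one built above. Instead of the mailbox module, it parses the bytes with the standard library's email package:

```python
import email
from email import policy

def read_message(fname, offset, length):
    """Read one message's raw bytes from an mbox file and parse them."""
    with open(fname, "rb") as f:
        f.seek(offset)          # jump straight to the start of the message
        raw = f.read(length)    # read only this message's bytes
    # The first line is the mbox "From " separator, not a header, so drop it before parsing.
    _, _, rest = raw.partition(b"\n")
    return email.message_from_bytes(rest, policy=policy.default)
```

With an index entry `(offset, length)`, `read_message(fname, offset, length)["subject"]` would then give that message's subject without scanning the rest of the file.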

Author by

Mark Fletcher

Updated on June 19, 2022

Comments

  • Mark Fletcher
    Mark Fletcher 6 months

    Python newbie here. I want to walk through a large mbox file, parsing email messages. I can do that with:

    import sys
    import mailbox

    def gen_summary(filename):
        mbox = mailbox.mbox(filename)
        for message in mbox:
            subj = message['subject']
            print(subj)

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print('Usage: python genarchivesum.py mbox')
            sys.exit(1)
        gen_summary(sys.argv[1])

    But I need more control. I need to be able to get the byte position of the start of a given email in the mbox file and I also need to get the number of bytes in the message (as represented on disk). And then in the future, instead of iterating from the beginning of the mbox file, I need to be able to seek to a given message and just parse that (hence one of the needs of getting the byte position on disk). These are large mbox files and efficiency is a concern.

    The purpose of all this is so that I can generate a summary file, which contains some small bits about each email in the mbox, and then in the future efficiently look up individual emails within the mbox.

  • Mark Fletcher
    Mark Fletcher over 10 years
    Ok, thanks. I guess I'll use something like this strategy. btw, the start of an email in an mbox begins with 'From ' (without the :). I can use email.Parser to parse the email. Thanks.
  • steveha
    steveha over 10 years
    I'll edit the answer to take out the ':'. I did say I didn't test it... Good luck with your project, and have a great weekend!
  • adammenges
    adammenges over 7 years
    For what it's worth, for future users, it's actually both, at least on the latest version of OSX. def is_mail_start(line): return line.startswith("From") and not line.startswith("From:")
  • steveha
    steveha over 7 years
    If the From that marks the start is always followed by a space, you could just search for the string "From " (note the space at the end). This wouldn't match From: with a colon.
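A quick check of the point made in the last comment, using made-up sample lines (bytes, because the file is read in binary mode). The trailing space alone is enough to tell the separator line apart from a `From:` header:

```python
def is_mail_start(line):
    # The trailing space is part of the prefix, so b"From: ..." header lines do not match.
    return line.startswith(b"From ")

# Hypothetical sample lines, for illustration only.
samples = [
    b"From alice@example.com Thu Jan  1 00:00:00 2015\n",  # mbox separator line
    b"From: alice@example.com\n",                           # ordinary header line
    b">From here on, the text is quoted\n",                 # body line quoted with ">"
]

for s in samples:
    print(is_mail_start(s), s)
```

Only the first sample matches, so the extra `not line.startswith("From:")` test is not needed as long as the prefix includes the space.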