How can I get an email message's text content using Python?
Solution 1
In a multipart e-mail, email.message.Message.get_payload() returns a list with one item for each part. The easiest way is to walk the message and get the payload on each part:
import email

msg = email.message_from_string(raw_message)
for part in msg.walk():
    # each part is either non-multipart, or another multipart message
    # that contains further parts... Message is organized like a tree
    if part.get_content_type() == 'text/plain':
        print(part.get_payload())  # prints the raw text
For a non-multipart message, no need to do all the walking. You can go straight to get_payload(), regardless of content_type.
msg = email.message_from_string(raw_message)
msg.get_payload()
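To make the difference between the two cases concrete, here is a minimal sketch written for Python 3. The two sample messages and the variable names are made up for illustration; only message_from_string(), is_multipart(), get_payload() and get_content_type() come from the email package.

```python
import email

# Hypothetical sample messages, built inline so the sketch is self-contained.
simple_raw = (
    "From: a@example.com\n"
    "Subject: plain\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Hello, world.\n"
)
multipart_raw = (
    "From: a@example.com\n"
    "Subject: two parts\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--BOUNDARY\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--BOUNDARY--\n"
)

simple = email.message_from_string(simple_raw)
multi = email.message_from_string(multipart_raw)

# Non-multipart: get_payload() hands back the body as a string.
simple_payload = simple.get_payload()

# Multipart: get_payload() hands back a list of sub-messages, one per part.
multi_payload = multi.get_payload()
part_types = [p.get_content_type() for p in multi_payload]
```

Checking is_multipart() first tells you which of the two shapes to expect before you touch the payload.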
If the content is encoded, you need to pass None as the first parameter to get_payload(), followed by True (the decode flag is the second parameter). For example, suppose that my e-mail contains an MS Word document attachment:
msg = email.message_from_string(raw_message)
for part in msg.walk():
    if part.get_content_type() == 'application/msword':
        name = part.get_param('name') or 'MyDoc.doc'
        f = open(name, 'wb')
        # You need None as the first param because part.is_multipart() is False
        f.write(part.get_payload(None, True))
        f.close()
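The same decode flag applies to text parts. The following is a rough sketch for Python 3, where the decoded payload comes back as bytes and still has to be converted with the part's charset. The sample message and the UTF-8 fallback are assumptions made for the example. It also shows why a bare positional True fails: it lands in the index parameter, not the decode flag.

```python
import base64
import email

# Hypothetical message whose body is base64-encoded UTF-8 text.
encoded_body = base64.b64encode("caf\u00e9 au lait".encode("utf-8")).decode("ascii")
raw_message = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded_body + "\n"
)

msg = email.message_from_string(raw_message)

raw_text = msg.get_payload()                   # still the base64 text
decoded_bytes = msg.get_payload(decode=True)   # transfer encoding undone, bytes
charset = msg.get_content_charset() or "utf-8" # assumption: fall back to UTF-8
text = decoded_bytes.decode(charset, "replace")

# Passing True positionally fills the index parameter, which is only
# meaningful for multipart payloads, so a non-multipart part raises.
try:
    msg.get_payload(True)
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)
```

Writing get_payload(decode=True) avoids the positional pitfall altogether.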
As for getting a reasonable plain-text approximation of an HTML part, I've found that html2text works pretty darn well.
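Putting the pieces together, one possible shape for the whole lookup (text/plain first, then text/html, then the raw message) is sketched below for Python 3. The names decode_part, extract_text and _TextCollector are invented for this example. To stay self-contained it uses a very crude stdlib HTMLParser stand-in for the HTML step; html2text would give much better results, and the UTF-8 fallback is an assumption.

```python
import email
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Very rough HTML-to-text stand-in: keeps text nodes, drops tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def decode_part(part):
    """Undo the transfer encoding, then decode with the part's charset."""
    payload = part.get_payload(decode=True)
    charset = part.get_content_charset() or "utf-8"  # assumed fallback
    return payload.decode(charset, "replace")


def extract_text(raw_message):
    msg = email.message_from_string(raw_message)
    # Only leaf parts carry content; multipart containers are skipped.
    leaves = [p for p in msg.walk() if not p.is_multipart()]
    for part in leaves:
        if part.get_content_type() == "text/plain":
            return decode_part(part)
    for part in leaves:
        if part.get_content_type() == "text/html":
            collector = _TextCollector()
            collector.feed(decode_part(part))
            collector.close()
            # Collapse runs of whitespace left behind by the markup.
            return " ".join("".join(collector.chunks).split())
    return str(msg)  # last-resort fallback


html_only = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>Hello <b>world</b></p>\n"
)
rendered = extract_text(html_only)
```

This returns the first text/plain part it finds, which is fine for the common one-plain, one-HTML layout but is not a general rule for picking the "right" part.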
Solution 2
Flat is better than nested ;)
from email.mime.multipart import MIMEMultipart
assert isinstance(msg, MIMEMultipart)
for _ in [k.get_payload() for k in msg.walk() if k.get_content_type() == 'text/plain']:
    print(_)
Chris R
Updated on July 28, 2020
Comments
-
Chris R almost 4 years
Given an RFC822 message in Python 2.6, how can I get the right text/plain content part? Basically, the algorithm I want is this:
message = email.message_from_string(raw_message)
if has_mime_part(message, "text/plain"):
    mime_part = get_mime_part(message, "text/plain")
    text_content = decode_mime_part(mime_part)
elif has_mime_part(message, "text/html"):
    mime_part = get_mime_part(message, "text/html")
    html = decode_mime_part(mime_part)
    text_content = render_html_to_plaintext(html)
else:
    # fallback
    text_content = str(message)
return text_content
Of these things, I have get_mime_part and has_mime_part down pat, but I'm not quite sure how to get the decoded text from the MIME part. I can get the encoded text using get_payload(), but if I try to use the decode parameter of the get_payload() method (see the doc) I get an error when I call it on the text/plain part:
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/email/message.py", line 189, in get_payload
    raise TypeError('Expected list, got %s' % type(self._payload))
TypeError: Expected list, got <type 'str'>
In addition, I don't know how to take HTML and render it to text as closely as possible.
-
beldaz almost 11 years: I found a useful solution to something similar in ginstrom.com/scribbles/2007/11/19/…
-
-
Chris R over 14 years: That is an excellent explanation... that covers exactly what I've already got; I can, as noted, locate and extract the bare payload of the part. However, I cannot decode the part if it's encoded, nor can I render the text/html part to text if no text/plain part is available.
-
Chris R over 14 years: (on re-read -- sorry, coffee is lacking!) Well, okay, so you've solved my HTML to text problem :)
-
Jarret Hardie over 14 years: My bad... clearly not enough coffee last night when I answered. I've amended the answer, hopefully with what you need.
-
Chris R over 14 years: Cool. How can I check if the part is encoded? Where do I see the part's Content-Transfer-Encoding attribute?
-
Jarret Hardie over 14 years: Use part.get_param('content-transfer-encoding') to see the attribute
-
Wodin about 10 years: Actually, use part.get("content-transfer-encoding"), since it's just a header, not part of the Content-Type header. Also, instead of part.get_payload(None, True) you can use part.get_payload(decode=True), which I think is a little clearer.
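A small sketch of that difference, written for Python 3 with a made-up quoted-printable message: get() reads the Content-Transfer-Encoding header itself, while get_param() looks for a parameter inside Content-Type and so finds nothing.

```python
import email

# Hypothetical single-part message using quoted-printable encoding.
raw_message = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "na=EFve\n"
)
part = email.message_from_string(raw_message)

# Header lookup is case-insensitive and returns the header's value.
cte_header = part.get("content-transfer-encoding")

# get_param() searches the Content-Type parameters, so this is None.
cte_param = part.get_param("content-transfer-encoding")

# decode=True undoes the quoted-printable encoding and returns bytes.
body = part.get_payload(decode=True).decode(part.get_content_charset())
```

Note that get() returns None when the header is absent, which per the MIME default means the part is plain 7bit and needs no decoding.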
tripleee over 8 years: This blindly extracts all `text/plain` parts with no attention to which one is "right".
-
guneysus over 8 years: @tripleee Generally we use one plain part, one HTML part, and several image parts. Even if there is more than one plain part, how do you know which one is right?
-
tripleee over 8 years: In the typical case, with a toplevel multipart/alternative where only one part is text/plain, that one. In the more general case, I don't think there is a single right answer, because it depends on the purpose of your application and the preferences of the recipient.
tripleee over 8 years: In all fairness, the accepted answer has the same problem.
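For the typical multipart/alternative case mentioned in the comments, newer Python versions (3.6 and later, so not the Python 2.6 of the question) can do the selection for you. The sketch below assumes the message is parsed with email.policy.default so that the result is an EmailMessage; the sample message is made up for illustration.

```python
import email
from email import policy

# Hypothetical multipart/alternative message with one plain and one HTML part.
raw_message = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "plain body\n"
    "--BOUNDARY\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>html body</p>\n"
    "--BOUNDARY--\n"
)

# policy.default makes the parser build EmailMessage objects.
msg = email.message_from_string(raw_message, policy=policy.default)

# Ask for the best body candidate, preferring plain text over HTML.
body_part = msg.get_body(preferencelist=("plain", "html"))

# get_content() undoes the transfer encoding and decodes the charset.
text = body_part.get_content() if body_part is not None else None
```

get_body() returns None when no candidate matches, so the caller still needs a fallback for messages with neither a plain nor an HTML body.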