Parse email content from quoted reply

Solution 1

I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread:

If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, its parent, and the thread it belongs to. For more information on this, see RFC 822, RFC 2822, this interesting article on threading, or this article on threading. Once you have re-assembled the thread, you can then remove the external text (such as the To, From, and CC lines) and you're done.
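As a rough illustration of the header-based approach, here is a minimal Ruby sketch that follows In-Reply-To links to find the ancestors of a message. The hash keys (`:message_id`, `:in_reply_to`, `:body`) and the helper name `ancestors_of` are assumptions made for this example, not part of any mail library; in practice you would fill them in from the parsed headers.

```ruby
# Walks In-Reply-To links upward and returns the ancestors of a message,
# nearest parent first. Stops when a parent is not in the given set.
# Each message is a plain hash with assumed keys :message_id, :in_reply_to, :body.
def ancestors_of(message, messages)
  by_id = messages.each_with_object({}) { |msg, h| h[msg[:message_id]] = msg }
  chain = []
  seen = { message[:message_id] => true } # guards against looping headers
  current = by_id[message[:in_reply_to]]
  while current && !seen[current[:message_id]]
    chain << current
    seen[current[:message_id]] = true
    current = by_id[current[:in_reply_to]]
  end
  chain
end

a = { message_id: "<1@example.com>", in_reply_to: nil, body: "Original question" }
b = { message_id: "<2@example.com>", in_reply_to: "<1@example.com>", body: "First reply" }
c = { message_id: "<3@example.com>", in_reply_to: "<2@example.com>", body: "Second reply" }

ancestors_of(c, [a, b, c]).map { |msg| msg[:body] }
# => ["First reply", "Original question"]
```

The bodies of the ancestors are the candidates for text that may be quoted in the message you are cleaning up.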

If the messages you are working with do not have the headers, you can instead use similarity matching to find the text that is repeated from earlier messages and treat that as the quoted part. For this you might want to look into a Levenshtein distance algorithm such as this one on Code Project or this one.
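A minimal sketch of the similarity idea in Ruby: compute the Levenshtein distance between each line of the reply and each line of an earlier message, and drop the lines that are near matches. The helper names (`normalize_line`, `strip_repeated_lines`) and the 0.2 threshold are assumptions chosen for illustration, and comparing line by line is only one of several ways to apply the distance.

```ruby
# Classic dynamic-programming edit distance between two strings.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Drops leading ">" quote markers and surrounding whitespace before comparing.
def normalize_line(line)
  line.sub(/\A(?:\s*>)+/, "").strip.downcase
end

# Keeps only the lines of `reply` that are not close matches of a line in
# `earlier`. `max_ratio` (distance / longer length) is an assumed threshold.
def strip_repeated_lines(reply, earlier, max_ratio = 0.2)
  earlier_lines = earlier.lines.map { |l| normalize_line(l) }.reject(&:empty?)
  kept = reply.lines.reject do |line|
    candidate = normalize_line(line)
    next false if candidate.empty?

    earlier_lines.any? do |old|
      longest = [candidate.length, old.length].max
      levenshtein(candidate, old).to_f / longest <= max_ratio
    end
  end
  kept.join.strip
end
```

Comparing every line against every earlier line is quadratic, so for long threads you would probably want to hash exact matches first and fall back to the distance only for the remainder.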

No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.

When you don't have the thread:

If you are stuck with only one message from the thread, you're going to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:

  1. A horizontal line (as seen in Outlook)
  2. Angle brackets (">") at the start of each quoted line
  3. "---Original Message---"
  4. "On such-and-such day, so-and-so wrote:"

Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!
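The "remove the text from there down" step can be sketched in Ruby as follows. The four patterns are my own guesses at the four styles listed above (for example, treating the Outlook line as a run of underscores) and the method name `text_above_quote` is hypothetical; real clients vary, so expect to tune them against real samples.

```ruby
# Returns everything above the first line that looks like the start of a quote.
def text_above_quote(body)
  # Assumed patterns, one per quoting style in the list above.
  quote_start_patterns = [
    /\A_{5,}\s*\z/,                        # 1. a horizontal rule of underscores
    /\A\s*>/,                              # 2. angle-bracket quoting
    /\A-+\s*Original Message\s*-+\s*\z/i,  # 3. "---Original Message---"
    /\AOn\b.*\bwrote:\s*\z/i               # 4. "On <date>, <someone> wrote:"
  ]

  kept = body.lines.take_while do |line|
    stripped = line.chomp
    quote_start_patterns.none? { |pattern| stripped =~ pattern }
  end
  kept.join.strip
end

text_above_quote("Thanks, that works.\n\nOn Mon, 5 Jan 2009, Bob wrote:\n> Did it work?\n")
# => "Thanks, that works."
```

Because it stops at the first marker, this sketch shares the limitation described above: anything typed below or between quoted lines is discarded.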

Solution 2

First of all, this is a tricky task.

You should collect typical responses from different e-mail clients and prepare correct regular expressions (or whatever) to parse them. I've collected responses from Outlook, Thunderbird, Gmail, Apple Mail, and mail.ru.

I am using regular expressions to parse responses in the following manner: if an expression did not match, I try to use the next one.

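// Tried in order; the first pattern that matches marks the start of the quoted text.
new Regex("From:\\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);                      // "From: address" header line
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);                        // "2009/1/13 <address>" style line
new Regex(Regex.Escape(_mail) + "\\s+wrote:", RegexOptions.IgnoreCase);                     // "address wrote:"
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline); // "On ... wrote:", possibly wrapped onto two lines
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase);                         // "-----Original Message-----"
new Regex("from:\\s*$", RegexOptions.IgnoreCase);                                           // a trailing "From:"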

To remove any remaining quoted lines at the end:

new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);

Here is my small collection of test responses (samples divided by --- ):

From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <[email protected]>

>  text
----
[email protected] wrote:
> text
----
      [email protected] wrote:         text
text
----
2009/1/13 <[email protected]>

>  text
----
 [email protected] wrote:         text
 text
----
2009/1/13 <[email protected]>

> text
> text
----
2009/1/13 <[email protected]>

> text
> text
----
[email protected] wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, [email protected] <[email protected]> wrote:

> text
> text

Solution 3

Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:

def extract_reply(text, address)
  # Regex literals are used so that \s reaches the regex engine; inside a
  # double-quoted Ruby string, "\s" is just a space character.
  escaped = Regexp.escape(address)
  regex_arr = [
    /From:\s*#{escaped}/i,
    /<#{escaped}>/i,
    /#{escaped}\s+wrote:/i,
    /^.*On.*(\n)?wrote:$/i,
    /-+original\s+message-+\s*$/i,
    /from:\s*$/i
  ]

  text_length = text.length
  # Find the position of the match closest to the top of the message
  index = regex_arr.inject(text_length) do |min, regex|
    [(text.index(regex) || text_length), min].min
  end

  text[0, index].strip
end

It's worked pretty well so far.

Solution 4

By far the easiest way to do this is by placing a marker in your content, such as:

--- Please reply above this line ---

As you have no doubt noticed, parsing out quoted text is not a trivial task as different email clients quote text in different ways. To solve this problem properly you need to account for and test in every email client.

Facebook can do this, but unless your project has a big budget, you probably can't.

Oleg has solved the problem using regexes to find the "On 13 Jul 2012, at 13:09, xxx wrote:" text. However, if the user deletes this text, or replies at the bottom of the email, as many people do, this solution will not work.

Likewise, if the email client uses a different date string, or doesn't include a date string, the regex will fail.
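A minimal Ruby sketch of the marker approach, assuming the marker text shown above. The method name `reply_above_marker` is hypothetical, and returning `nil` when the marker is missing is a design choice for this example so that the caller can fall back to another strategy (or notify the sender). The trailing clean-up is a heuristic for clients that insert an "... wrote:" line and ">" prefixes just before the quoted marker.

```ruby
# Returns the text above the marker, or nil when the marker is missing
# (for example because the user deleted it).
def reply_above_marker(body, marker = "--- Please reply above this line ---")
  position = body.index(marker)
  return nil unless position

  lines = body[0, position].lines
  # Trim trailing blank lines, bare ">" prefixes and a "... wrote:" attribution
  # line that the replying client may have placed just before the marker.
  lines.pop while !lines.empty? && lines.last =~ /\A[\s>]*\z|\bwrote:\s*\z/i
  lines.join.strip
end

body = "Thanks, fixed!\n\nOn 13 Jul 2012, at 13:09, Support wrote:\n" \
       "> --- Please reply above this line ---\n> Original ticket text\n"
reply_above_marker(body)
# => "Thanks, fixed!"
```

An attribution that the client wraps onto two lines would only be partly trimmed by this sketch, so it is worth testing against the clients your users actually have.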

Solution 5

There is no universal indicator of a reply in an e-mail. The best you can do is try to catch the most common patterns and add new ones as you come across them.

Keep in mind that some people insert replies inside the quoted text (my boss, for example, answers questions on the same line as I asked them), so whatever you do, you might lose some information you would have liked to keep.
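One way to lose less of an interleaved reply is to drop only the lines that look quoted instead of cutting everything below the first marker. The Ruby sketch below does that; the method name `keep_unquoted_lines` and both patterns are assumptions for illustration. It keeps answers typed on their own lines between quoted paragraphs, but an answer typed on the same line as a ">" quote is still lost.

```ruby
# Drops ">"-prefixed lines and "... wrote:" attribution lines, keeps the rest.
def keep_unquoted_lines(body)
  kept = body.lines.reject do |line|
    line =~ /\A\s*>/ || line =~ /\bwrote:\s*\z/i
  end
  # Collapse the gaps left behind by removed lines.
  kept.join.gsub(/\n{3,}/, "\n\n").strip
end

body = "On Tue, Alice wrote:\n> Can you come Friday?\nYes.\n" \
       "> Will you bring the slides?\nNo, Bob has them.\n"
keep_unquoted_lines(body)
# => "Yes.\nNo, Bob has them."
```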

Author by

sqlconsumer.net

a C# developer located in beautiful Nashville, TN.

Updated on July 08, 2022

Comments

  • sqlconsumer.net
    sqlconsumer.net almost 2 years

    I'm trying to figure out how to parse out the text of an email from any quoted reply text that it might include. I've noticed that usually email clients will put an "On such and such date so and so wrote" or prefix the lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea on how to programmatically detect reply text? I am using C# to write this parser.

  • kenny
    kenny over 15 years
    Gmail does it... at least it seems to do it. From what I remember there is some thread ID that doesn't change between the original and replies...
  • 3Doubloons
    3Doubloons over 15 years
    Gmail might add '>'s, as do other email clients, but it's not a standard of emails and not something you can count on.
  • Matthieu
    Matthieu over 12 years
    You should make a ruby question and answer it with this code instead of posting it on a c# question.
  • Trent
    Trent over 12 years
    @Matthieu, it's not just a C# question, but an email and email-parsing question. Totally relevant in my opinion.
  • Matthieu
    Matthieu over 12 years
    @Trent : the C# tag should be dropped then.
  • bratsche
    bratsche about 12 years
    The funny thing is I found this question by Googling for the topic (not the language), and I actually needed to implement something in Ruby. So, cheers!
  • superluminary
    superluminary almost 12 years
    This is the best response so far. Regex is pretty language agnostic. Thanks for posting
  • jpw
    jpw about 10 years
    This approach fails with replies to replies unless you put that line each time you reply.
  • superluminary
    superluminary about 10 years
    Yes, it has drawbacks. If the user deletes the reply above the line string then your reply will fail. I catch this case and send the user a direct message letting them know their message failed, with a link to reply via the web app. Most users seem to be able to use it without too much trouble.
  • user4271704
    user4271704 over 8 years
    Can anyone help with a PHP version?
  • harsimranb
    harsimranb over 8 years
    What if I don't know the email address?
  • pableiros
    pableiros over 7 years
    Links to external resources are encouraged, but please add context around the link so your fellow users will have some idea what it is and why it’s there. Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline.
  • Benni
    Benni over 7 years
    This should be the accepted answer. However, I would add the information that the answer will not succeed if the line is removed.
  • superluminary
    superluminary over 7 years
    @Benni - yes, it will fail if the line is removed. Unfortunately, there is no one standard way of quoting text across email clients. In the case where the line is removed, you might treat all the text as a reply. I don't think a perfect solution is possible in this case.
  • Benni
    Benni over 7 years
    @superluminary I meant, I would add it to the line. So it's something like -- Please reply above this line. DO NOT REMOVE IT! --. Also, What I experienced is that it won't always work since some email clients add a xxx wrote on <datetime>: line before the whole quote and therefore before that line. This line could be parsed with regex, however it may be in different languages and in a different format since email clients differ.
  • maembe
    maembe about 5 years
    @Shyamal-Parikh this won't work for html emails, but typically a plaintext message is also included with email messages
  • Greg Veres
    Greg Veres almost 5 years
    Are you keeping that library up to date? I came searching because the C# library doesn't properly parse out a simple email from Outlook from Office 365. Then I looked in the Ruby source code and found that there was an identical test case in their test cases, so clearly they think they should parse it.