Python: How to convert markdown formatted text to text

Solution 1

The Markdown and BeautifulSoup (now called beautifulsoup4) modules will help do what you describe.

Once you have converted the markdown to HTML, you can use an HTML parser to strip out the plain text.

Your code might look something like this:

from bs4 import BeautifulSoup
from markdown import markdown

# Render the markdown to HTML, then keep only the text nodes.
html = markdown(some_markdown_string)
text = ''.join(BeautifulSoup(html, 'html.parser').find_all(string=True))

Solution 2

Although this is a very old question, I'd like to suggest a solution I came up with recently. It neither uses BeautifulSoup nor incurs the overhead of converting to HTML and back.

The markdown module's core class, Markdown, has an output_formats attribute. It is not configurable, but it can be patched like almost anything in Python. The attribute is a dict mapping each output format name to a rendering function. By default it contains two output formats, 'html' and 'xhtml'. A plaintext rendering function is easy to write and can be added alongside them:

from markdown import Markdown
from io import StringIO


def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


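# Patch Markdown: register the plain-text renderer as a new output format.
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
# The plain output has no wrapping top-level tags for Markdown to strip.
__md.stripTopLevelTags = False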


def unmark(text):
    return __md.convert(text)

The unmark function takes markdown text as input and returns the text with all markdown formatting stripped out.
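To see why the recursive walk recovers the text in reading order, here is a minimal sketch that runs the same text/children/tail traversal on a hand-built element tree. It assumes only the standard library; the name collect_text and the sample paragraph are illustrative and are not part of the markdown module.

```python
from io import StringIO
import xml.etree.ElementTree as ET


def collect_text(element, stream=None):
    """Write an element's text, then its children, then its tail, in document order."""
    if stream is None:
        stream = StringIO()
    if element.text:
        # Text between the opening tag and the first child.
        stream.write(element.text)
    for child in element:
        collect_text(child, stream)
    if element.tail:
        # Text that follows this element's closing tag inside its parent.
        stream.write(element.tail)
    return stream.getvalue()


# Hand-built stand-in for a rendered paragraph (an assumption for illustration).
root = ET.fromstring("<p>Hello <strong>bold</strong> and <em>italic</em> text</p>")
plain = collect_text(root)
print(plain)
```

Each element contributes its own text first and its tail last, so inline elements such as strong and em drop out while the words around them stay in place.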

Solution 3

This is similar to Jason's answer, but handles comments correctly.

import markdown # pip install markdown
from bs4 import BeautifulSoup # pip install beautifulsoup4

def md_to_text(md):
    html = markdown.markdown(md)
    soup = BeautifulSoup(html, features='html.parser')
    return soup.get_text()

def example():
    md = '**A** [B](http://example.com) <!-- C -->'
    text = md_to_text(md)
    print(text)
    # Output: A B

Solution 4

I commented and then removed it, because I think I finally see the rub here: it may be easier to convert your markdown text to HTML and then strip the HTML from the text. I'm not aware of anything that removes markdown from text effectively, but there are many HTML-to-plain-text solutions.
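As one possible HTML-to-plain-text step, here is a rough sketch that uses only the standard library's html.parser. The names TextExtractor and html_to_text are hypothetical, and the sample HTML is hand-written to stand in for whatever a markdown converter would produce.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; tags and comments are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Comments are routed to
        # handle_comment, which is left as the default no-op.
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hand-written HTML standing in for converter output (an assumption).
sample = '<p><strong>A</strong> <a href="http://example.com">B</a> <!-- C --></p>'
result = html_to_text(sample)
print(repr(result))
```

Whitespace around removed tags is kept as is, so a caller would probably want to strip or normalize the result before using it as a summary.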

Author by

Krish

Updated on July 09, 2022

Comments

  • Krish
    Krish almost 2 years

    I need to convert markdown text to plain text format to display summary in my website. I want the code in python.

    • naught101
      naught101 about 10 years
      Not python, but you could pass it to pandoc: pandoc --to=plain leaves some formatting (header underlines), but not much.
  • Krish
    Krish about 15 years
    That seems to convert to HTML. I need to convert to plain text, like Stack Overflow does in the homepage question summaries, where the formatting is removed.
  • Krish
    Krish about 15 years
    Thanks coonj. Good to know about BeautifulSoup.
  • Frerich Raabe
    Frerich Raabe over 4 years
    Looks great, thanks a lot for taking the time to add an answer even though the question is so old already. Much appreciated!
  • Renato Byrro
    Renato Byrro almost 4 years
    Converting back and forth from Markdown to HTML is too much, there's a good alternative below that sticks to Markdown only.
  • gargoylebident
    gargoylebident almost 3 years
    So much for praising Markdown for being "basically plain text." Might as well use Word if it's that hard to strip off.
  • Leonardo Maffei
    Leonardo Maffei over 2 years
    Thank you for this awesome answer. I was going to implement it myself, but this snippet saved me a good deal of time.
  • Hans Z
    Hans Z over 2 years
    This is definitely preferable to the accepted answer! Thanks.