Converting a pdf to text/html in python so I can parse it

Solution 1

It's not exactly magic. I suggest

  • downloading the PDF file to a temp directory,
  • calling out to an external program to extract the text into a (temp) text file,
  • reading the text file.

For the text extraction step there are a number of command-line utilities to choose from (and possibly others, perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out to the external program, use subprocess.Popen or subprocess.call().
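
For example, here is a minimal sketch of those three steps, assuming the pdftotext utility (from Poppler/Xpdf) is installed and on your PATH; the function name and file names are placeholders, not anything from the question:

import os
import subprocess
import tempfile
import urllib2

def pdf_url_to_text(pdf_url):
    # Work in a fresh temp directory.
    tmp_dir = tempfile.mkdtemp()
    pdf_path = os.path.join(tmp_dir, "document.pdf")
    txt_path = os.path.join(tmp_dir, "document.txt")

    # 1. Download the PDF file to the temp directory.
    localfile = open(pdf_path, "wb")
    localfile.write(urllib2.urlopen(pdf_url).read())
    localfile.close()

    # 2. Call out to the external program to extract the text.
    subprocess.call(["pdftotext", pdf_path, txt_path])

    # 3. Read the extracted text back in.
    return open(txt_path).read()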

Solution 2

Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.

To use it, once you have the file saved to disk you would run pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you don't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
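
For instance, here is a short sketch of that workflow; lxml.etree.tostring() is used as one way to serialize the tree, and the file name is a placeholder standing in for a PDF already saved to disk as in the question's code:

import pdfquery
import lxml.etree
from BeautifulSoup import BeautifulSoup

name_pdf = "1999_1.pdf"  # placeholder: a PDF saved by the question's code

pdf = pdfquery.PDFQuery(name_pdf)
pdf.load()  # parse the PDF's layout into an lxml tree

# Serialize the tree to XML and hand it to BeautifulSoup for parsing.
xml = lxml.etree.tostring(pdf.tree)
soup = BeautifulSoup(xml)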

If you don't mind using jQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:

balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
Author

Thomas Jensen

Political scientist trying to learn python and R. My blog: polstat.org/blog

Updated on September 29, 2020

Comments

  • Thomas Jensen over 3 years

    I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:

    EDIT: I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):

    import mechanize
    import urllib2
    import re
    from BeautifulSoup import BeautifulSoup
    
    adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
    
    url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"
    
    def get_pdf(soup2, y, p):
        # Collect the links to REPORT documents on the dossier page.
        links = soup2.findAll("a", "com_acronym")
        new_links = [a["href"] for a in links if "REPORT" in a["href"]]
        if not new_links:
            print "No A number"
            return
        for href in new_links:
            page = br.open(str(href)).read()
            bs = BeautifulSoup(page)
            pdf_link = None
            for a in bs.findAll("a"):
                if re.search("PDF", str(a)) is not None:
                    pdf_link = "http://www.europarl.europa.eu/" + a["href"]
            if pdf_link is None:
                continue
    
            # Save the PDF to disk (binary mode, since it's a PDF).
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y, p)
            localfile = open(name_pdf, "wb")
            localfile.write(pdf.read())
            localfile.close()
    
            # Feed the PDF's URL to Adobe's online conversion form.
            br.open(adobe)
            br.select_form(name="convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()
    
            soup = BeautifulSoup(pdf_html)
    
    
    page = range(1, 2)        # can be set to 400 to get every document for a given year
    year = range(1999, 2000)  # can be set to 2011 to get documents from all years
    
    for y in year:
        for p in page:
            br = mechanize.Browser()
            br.open(url)
            br.select_form(name="byReferenceForm")
            br.form["year"] = str(y)
            br.form["sequence"] = str(p)
            response = br.submit()
            soup1 = BeautifulSoup(response)
            if soup1.find(text="No search result") is not None:
                print "%s %s No page, skipping..." % (y, p)
            else:
                print "%s %s Writing dossier..." % (y, p)
                link = None
                for l in br.links(url_regex="file.jsp"):
                    link = l  # keep the last matching link
                response2 = br.follow_link(link).read()
                soup2 = BeautifulSoup(response2)
                get_pdf(soup2, y, p)
    

    In the get_pdf() function I would like to convert the PDF file to text in Python so I can parse the text for information about the legislative procedure. Can anyone explain how this can be done?

    Thomas

  • Thomas Jensen over 13 years
    Thanks for the answer. In the end I chose to just use the Adobe online conversion tool (see the code above).
  • rikb over 7 years
    For me, pdfquery has been an excellent answer to my PDF parsing issues. My most recent problem was getting field entries from a PDF form; it worked like a charm. A solid +1 to you @JackCushman!
  • Deepa MG almost 6 years
    @Jack Cushman can you please add some examples and documentation to the repository? It's very hard for newcomers to understand and get started with pdfquery.
  • Abhishek Poojary over 5 years
    Hi Jack, I am using pdfquery to extract data from PDFs and it's going very well. I now want to convert the XML output of pdfquery into HTML; basically, I am looking to generate an HTML page equivalent to the original PDF file. Can you point me in the right direction to achieve this?