Extract hyperlinks from PDF in Python

python pdf hyperlink pypdf2 pdfminer

Solution 1

This is an old question, but a lot of people seem to look at it (including me, while trying to solve the same problem), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on the fly.

It is possible to get the hyperlinks using PDFMiner. The complication, as with so much about PDFs, is that there is no real relationship between a link annotation and the text of the link, except that both occupy the same region of the page.

Here is the code I used to get the links on a PDFPage:

annotationList = []
if page.annots:
    for annotation in page.annots.resolve():
        annotationDict = annotation.resolve()
        if str(annotationDict["Subtype"]) != "/Link":
            # Skip over any annotations that are not links
            continue
        position = annotationDict["Rect"]
        uriDict = annotationDict["A"].resolve()
        # This has always been true so far.
        assert str(uriDict["S"]) == "/URI"
        # Some of my URI's have spaces.
        uri = uriDict["URI"].replace(" ", "%20")
        annotationList.append((position, uri))
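
The space replacement above only handles one kind of unsafe character. A more general cleanup could look like the sketch below. The helper name clean_uri and the SAFE_CHARS constant are made up for illustration, and it assumes the raw value may arrive as bytes rather than str depending on the PDFMiner version (worth checking in the debugger).

```python
from urllib.parse import quote

# Characters that are legal in a URI and should not be escaped again.
# "%" is included so that existing escapes such as %20 are left alone.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def clean_uri(raw_uri):
    """Return raw_uri as str with spaces and other unsafe characters percent-encoded."""
    if isinstance(raw_uri, bytes):
        # Assumption: the bytes are UTF-8 (or plain ASCII).
        raw_uri = raw_uri.decode("utf-8", errors="replace")
    return quote(raw_uri, safe=SAFE_CHARS)


print(clean_uri("http://example.com/my file.pdf"))
```

In the loop above, this would stand in for the .replace(" ", "%20") call, for example uri = clean_uri(uriDict["URI"]).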

Then I defined a function like:

def getOverlappingLink(annotationList, element):
    for (x0, y0, x1, y1), url in annotationList:
        if x0 > element.x1 or element.x0 > x1:
            continue
        if y0 > element.y1 or element.y0 > y1:
            continue
        return url
    else:
        return None

which I used to search the annotationList I had built for the page, to see whether any hyperlink occupies the same region as an LTTextBoxHorizontal that I was inspecting.

In my case, since PDFMiner was consolidating too much text into each text box, I walked through the _objs attribute of each text box and looked through all of the LTTextLineHorizontal instances to see whether they overlapped any of the annotation positions.
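
To make the per-line matching concrete, here is a minimal sketch that runs without PDFMiner. Box is a stand-in I made up for a layout object, and overlaps and links_by_line are hypothetical helper names. Real LTTextLineHorizontal instances expose the same x0, y0, x1, y1 attributes, but return their text from get_text() rather than a text field.

```python
from collections import namedtuple

# Stand-in for a PDFMiner layout object (assumption: only the bounding
# box attributes and the text matter for this example).
Box = namedtuple("Box", "x0 y0 x1 y1 text")


def overlaps(rect, element):
    """True when the annotation rectangle and the element's bounding box intersect."""
    x0, y0, x1, y1 = rect
    return not (x0 > element.x1 or element.x0 > x1 or
                y0 > element.y1 or element.y0 > y1)


def links_by_line(annotation_list, text_lines):
    """Pair each line's text with the first overlapping URL, or None."""
    pairs = []
    for line in text_lines:
        url = next((u for rect, u in annotation_list if overlaps(rect, line)), None)
        pairs.append((line.text, url))
    return pairs


annotations = [((72, 700, 200, 712), "http://example.com/")]
lines = [
    Box(72, 698, 300, 714, "Check this link out"),
    Box(72, 680, 300, 694, "No link on this line"),
]
print(links_by_line(annotations, lines))
```

With real layout objects, iterating over the text box itself should yield the same line objects as reading _objs, which avoids touching a private attribute.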

Solution 2

A slightly modified version of Ashwin's answer:

import PyPDF2

# Legacy camelCase API (PyPDF2 < 3.0); later releases renamed these to
# PdfReader, reader.pages and get_object().
key = '/Annots'
uri = '/URI'
ank = '/A'
with open("file.pdf", 'rb') as PDFFile:
    PDF = PyPDF2.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()
    for page in range(pages):
        print("Current Page: {}".format(page))
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if key in pageObject.keys():
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                # Not every annotation has an action dictionary
                if ank in u and uri in u[ank].keys():
                    print(u[ank][uri])
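
If the lookups trip over indirect references or annotations that lack an action, a more defensive variant could look like the sketch below. The names resolve and iter_uris are hypothetical, and the code assumes only that pages and annotations behave like dictionaries and that indirect references offer get_object() (or the older getObject()).

```python
def resolve(obj):
    """Follow an indirect reference if the object offers a resolver method."""
    for name in ("get_object", "getObject"):
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    # Plain values (dict, list, str, None) are returned unchanged.
    return obj


def iter_uris(pages):
    """Yield the /URI value of every annotation that carries a URI action."""
    for page in pages:
        annots = resolve(page.get("/Annots"))
        if annots is None:
            continue  # page has no annotations
        for annot in annots:
            annot = resolve(annot)
            action = resolve(annot.get("/A"))
            if action is None:
                continue  # e.g. a note or an internal link without an action
            uri = resolve(action.get("/URI"))
            if uri is not None:
                yield uri


# Plain dictionaries stand in for parsed PDF pages here.
demo_pages = [
    {"/Annots": [{"/A": {"/S": "/URI", "/URI": "http://example.com/"}}]},
    {},
]
print(list(iter_uris(demo_pages)))
```

With the pypdf package installed, passing PdfReader("file.pdf").pages to iter_uris should work the same way, since the sketch relies only on dictionary-style lookups.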

Solution 3

I think you can do this with pyPdf if you just want to extract the links from a PDF. I am not sure where I got this from, but it sits in my code as part of something else. Hope this helps:

import pyPdf

PDFFile = open('File Location', 'rb')
PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'
for page in range(pages):
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject:
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            # Skip annotations that have no action dictionary
            if ank in u and uri in u[ank]:
                print(u[ank][uri])
PDFFile.close()

This should, I hope, give you the links in your PDF. P.S.: I haven't tested this extensively.

Solution 4

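import pikepdf

urls = []
with pikepdf.Pdf.open("pdf.pdf") as pdf_file:
    for page in pdf_file.pages:
        annots = page.get("/Annots")
        if annots is None:
            # Pages without annotations have no /Annots entry
            continue
        for annot in annots:
            action = annot.get("/A")
            url = action.get("/URI") if action is not None else None
            if url is not None:
                urls.append(str(url))
# Join once at the end instead of appending separators to the list
print(" ; ".join(urls))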

You will get a semicolon-separated list of the links in the given PDF.

Author by

Randomly Named User

Updated on June 05, 2022

Comments

Randomly Named User 5 months

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.

For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.
Randomly Named User almost 8 years

I needed both the text as well as the hyperlink, and so I extracted the text. And I'm not exactly sure what you mean by process the annotation... Could you explain that? I'm a bit of an amateur.
KenS almost 8 years

You need to use a library which will locate and return all the annotations on a given page (or in the Outlines tree) and return the dictionary describing them. This should contain both the text to be drawn, and the URL. I'm sorry but I can't tell you which library to use or how to use it, I don't know of any that will do this.
Sundeep Pidugu over 3 years

This seems to work fine, but is there any way I could extract the text that encloses the hyperlink and modify it?
shantanuo over 3 years

The PdfFileReader constructor accepts the file itself as a parameter, so the PDFFile object is not required!
gasstationwithoutpumps about 2 years

Fails with "TypeError: 'IndirectObject' object is not subscriptable" on the item lookup.