pyPdf for IndirectObject extraction
Solution 1
Each element in pdf.pages is a dictionary, so assuming the object is on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.
You can try printing it on its own, or poke at it with help and dir in a Python prompt to find out how to get the string you want.
Edit:
After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and its value can be retrieved via getData().
The following function gives a more generic way to solve this by recursively looking for the key in question:
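import pyPdf

# Open in binary mode; PDF files are not plain text.
pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))
pages = list(pdf.pages)  # walk every page so pdf.resolvedObjects gets populated

def findInDict(needle, haystack):
    """Recursively search nested dictionaries for the key `needle`."""
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Skip entries that cannot be read.
            continue
        if key == needle:
            return value
        # pyPdf's DictionaryObject subclasses dict, so one check covers both.
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()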
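The recursive search above only descends into dictionaries. As a rough sketch of the same idea that also walks lists and follows references, the version below runs on plain Python data without pyPdf. FakeIndirect, resolve and find_key are hypothetical names invented for this illustration; the only thing borrowed from the answers is the assumption that a reference exposes a getObject() method.

```python
class FakeIndirect(object):
    """Hypothetical stand-in for an indirect reference (not pyPdf's class)."""

    def __init__(self, table, idnum):
        self.table = table
        self.idnum = idnum

    def getObject(self):
        return self.table[self.idnum]


def resolve(obj):
    """Follow a reference if the object offers getObject(), else return it."""
    getter = getattr(obj, "getObject", None)
    return getter() if callable(getter) else obj


def find_key(needle, haystack, _seen=None):
    """Depth-first search for `needle` in nested dicts and lists.

    Containers already visited are remembered by id(), so reference
    cycles such as a page pointing back at its /Parent terminate.
    """
    if _seen is None:
        _seen = set()
    haystack = resolve(haystack)
    if not isinstance(haystack, (dict, list, tuple)):
        return None
    if id(haystack) in _seen:
        return None
    _seen.add(id(haystack))
    if isinstance(haystack, dict):
        if needle in haystack:
            return resolve(haystack[needle])
        children = list(haystack.values())
    else:
        children = list(haystack)
    for child in children:
        found = find_key(needle, child, _seen)
        if found is not None:
            return found
    return None


# Made-up data shaped like the objects discussed in this thread.
table = {584: "1_22_4_1"}
page = {
    "/Type": "/Page",
    "/Resources": {
        "/Properties": {"/MC0": {"/MYOBJECT": FakeIndirect(table, 584)}}
    },
}
pages_node = {"/Kids": [page]}
page["/Parent"] = pages_node  # deliberate cycle
table[558] = page

result = find_key("/MYOBJECT", table)
missing = find_key("/NOT_THERE", table)
```

With a real reader the value found would presumably still be a stream, so getData() would be called on it as in the answer above.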
Solution 2
An IndirectObject refers to an actual object (it's like a link or alias so that the total size of the PDF can be reduced when the same content appears in multiple places). The getObject method will give you the actual object.
If the object is a text object, then just doing a str() or unicode() on the object should get you the data inside of it.
Alternatively, pyPdf stores the objects in the resolvedObjects attribute. For example, a PDF that contains this object:
13 0 obj
<< /Type /Catalog /Pages 3 0 R >>
endobj
can be read like this:
>>> import pyPdf
>>> pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
>>> pages = list(pdf.pages)
>>> pdf.resolvedObjects
{0: {2: {'/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 595.2756, 841.8898]}, 3: {'/Kids': [IndirectObject(2, 0)], '/Count': 1, '/Type': '/Pages', '/MediaBox': [0, 0, 595.2756, 841.8898]}, 4: {'/Filter': '/FlateDecode'}, 5: 147, 6: {'/ColorSpace': {'/Cs1': IndirectObject(7, 0)}, '/ExtGState': {'/Gs2': IndirectObject(9, 0), '/Gs1': IndirectObject(10, 0)}, '/ProcSet': ['/PDF', '/Text'], '/Font': {'/F1.0': IndirectObject(8, 0)}}, 13: {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}}}
>>> pdf.resolvedObjects[0][13]
{'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}
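To make the "link or alias" idea concrete, here is a small model of indirect references in plain Python. It is only a sketch: Ref and objects are hypothetical names, and the table uses the nested generation-then-number layout seen in the output above. pyPdf itself is not imported.

```python
class Ref(object):
    """Hypothetical stand-in for an indirect reference such as `3 0 R`."""

    def __init__(self, idnum, generation, objects):
        self.idnum = idnum
        self.generation = generation
        self.objects = objects

    def getObject(self):
        # Look the real object up in the shared table.
        return self.objects[self.generation][self.idnum]

    def __repr__(self):
        return "Ref(%d, %d)" % (self.idnum, self.generation)


# Object table keyed as objects[generation][object number].
objects = {0: {}}
objects[0][3] = {"/Type": "/Pages", "/Count": 1}
objects[0][13] = {"/Type": "/Catalog", "/Pages": Ref(3, 0, objects)}

catalog = objects[0][13]
pages_node = catalog["/Pages"].getObject()

# A second reference to object 3 resolves to the very same object,
# which is how content can be shared instead of being stored twice.
same_node = Ref(3, 0, objects).getObject()
```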
Solution 3
Jehiah's method is good if looking everywhere for the object. My guess (looking at the PDF) is that it is always in the same place (the first page, in the 'MC0' property), and so a much simpler method of finding the string would be:
import pyPdf
pdf = pyPdf.PdfFileReader(open("file.pdf", "rb"))
pdf.getPage(0)['/Resources']['/Properties']['/MC0']['/MYOBJECT'].getData()
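A fixed chain of lookups like the one above raises KeyError as soon as one level is missing. A minimal sketch of a more forgiving variant, written against plain dictionaries rather than pyPdf, might look like this. follow_path, resolve and FakeIndirect are hypothetical helpers for illustration, not part of the library.

```python
class FakeIndirect(object):
    """Hypothetical stand-in for an indirect reference."""

    def __init__(self, target):
        self.target = target

    def getObject(self):
        return self.target


def resolve(obj):
    """Follow a reference if the object offers getObject(), else return it."""
    getter = getattr(obj, "getObject", None)
    return getter() if callable(getter) else obj


def follow_path(root, keys, default=None):
    """Walk `keys` one level at a time, returning `default` on a miss."""
    current = resolve(root)
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = resolve(current[key])
    return current


# Made-up page dictionary mirroring the path used in this answer.
page = {
    "/Resources": FakeIndirect(
        {"/Properties": {"/MC0": {"/MYOBJECT": FakeIndirect("1_22_4_1")}}}
    )
}

path = ["/Resources", "/Properties", "/MC0", "/MYOBJECT"]
value = follow_path(page, path)
fallback = follow_path(page, ["/Resources", "/Nope", "/MC0"], default="n/a")
```

The trade-off is that a typo in the path shows up as the default value instead of an exception, so the strict chain is still the better choice when the location is known to be fixed.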
Comments
-
JuanDeLosMuertos almost 2 years
Following this example, I can list all the elements in a PDF file:
import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
list(pdf.pages) # Process all the objects.
print pdf.resolvedObjects
Now I need to extract a non-standard object from the PDF file.
My object is the one named MYOBJECT, and it is a string.
The part of the script's output that concerns me is:
{'/MYOBJECT': IndirectObject(584, 0)}
The pdf file is this:
558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R
/Resources <</ColorSpace <</CS0 563 0 R>>
/ExtGState <</GS0 568 0 R>>
/Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
/ProcSet[/PDF/Text/ImageC]
/Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
/XObject<</Im0 578 0 R>>>>
/Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream
1_22_4_1 --->>>> this is the string I need to extract from the object
endstream
endobj
How can I follow the 584 value in order to refer to my string (under pyPdf, of course)?
-
stenci about 10 years
pdf.resolvedObjects[0][n] says KeyError: 0. This works for me: pdf.resolvedObjects[(0,n)]
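The two key layouts mentioned in this thread, nested [0][n] and tuple (0, n), can be handled by one small lookup. The sketch below works on plain dictionaries; lookup_resolved is a hypothetical helper name, and it assumes only the two layouts reported here.

```python
def lookup_resolved(resolved, generation, idnum):
    """Fetch an object from either layout of a resolved-objects table.

    Tries the flat layout keyed by (generation, idnum) first, then the
    nested layout resolved[generation][idnum].
    """
    if (generation, idnum) in resolved:
        return resolved[(generation, idnum)]
    by_generation = resolved.get(generation)
    if isinstance(by_generation, dict) and idnum in by_generation:
        return by_generation[idnum]
    raise KeyError((generation, idnum))


# Made-up tables showing both layouts.
nested = {0: {13: {"/Type": "/Catalog"}}}
flat = {(0, 13): {"/Type": "/Catalog"}}

from_nested = lookup_resolved(nested, 0, 13)
from_flat = lookup_resolved(flat, 0, 13)
```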
-
Sundeep Pidugu about 5 years
I get this error: NotImplementedError: unsupported filter /DCTDecode
-
Sundeep Pidugu about 5 years
How do I figure out the filters? Are these the ones you are referring to: ['/Resources']['/Properties']['/MC0']['/MYOBJECT']?
-
JoGe over 2 years
Either print the entire structure or browse the PDF using a tool like iText RUPS.
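For the "print the entire structure" route, an indented dump is usually easier to read than one long dictionary line. Here is a rough sketch on plain dictionaries and lists; describe is a hypothetical helper, and the sample data is made up to resemble the page from the question.

```python
def describe(obj, depth=0, _seen=None):
    """Return indented text lines describing nested dicts and lists."""
    _seen = set() if _seen is None else _seen
    pad = "  " * depth
    if not isinstance(obj, (dict, list)):
        return [pad + repr(obj)]
    if id(obj) in _seen:
        # Guard against cycles such as /Parent pointing back up the tree.
        return [pad + "<already shown>"]
    _seen.add(id(obj))
    if isinstance(obj, dict):
        items = [(str(key), obj[key]) for key in sorted(obj, key=str)]
    else:
        items = [("[%d]" % index, value) for index, value in enumerate(obj)]
    lines = []
    for label, value in items:
        if isinstance(value, (dict, list)):
            lines.append(pad + label + ":")
            lines.extend(describe(value, depth + 1, _seen))
        else:
            lines.append(pad + "%s: %r" % (label, value))
    return lines


sample = {
    "/Type": "/Page",
    "/MediaBox": [0, 0, 595.22, 842],
    "/Resources": {"/Properties": {"/MC0": {"/MYOBJECT": "1_22_4_1"}}},
}

lines = describe(sample)
print("\n".join(lines))
```

Reading the dump top-down gives the chain of keys to use in a lookup such as ['/Resources']['/Properties']['/MC0']['/MYOBJECT'].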