How to extract text from a PDF in Python 3.7
Solution 1
Using tika worked for me!
from tika import parser
rawText = parser.from_file('January2019.pdf')
rawList = rawText['content'].splitlines()
This made it really easy to split the bank statement into a list with one entry per line.
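The list produced by splitlines() usually contains blank entries and stray whitespace, and as far as I know tika sets 'content' to None when it finds no text at all. A minimal sketch of a cleanup step follows; clean_lines is a hypothetical helper name, not part of tika.

```python
from typing import List, Optional


def clean_lines(content: Optional[str]) -> List[str]:
    """Split extracted text into non-empty, whitespace-stripped lines."""
    if not content:  # covers both None and an empty string
        return []
    return [line.strip() for line in content.splitlines() if line.strip()]


# Usage with the tika result from the snippet above
# (assumes tika and Java are installed):
#   rawText = parser.from_file('January2019.pdf')
#   rawList = clean_lines(rawText['content'])
```

Guarding against None first avoids an AttributeError on PDFs that have no text layer.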
Solution 2
I tried many methods, including PyPDF2 and Tika, but they all failed. I finally found the pdfplumber module, which works for me; you can try it too.
Hope this will be helpful to you.
import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
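The snippet above reads only the first page. A hedged sketch for reading every page follows. join_page_text and extract_all_text are hypothetical helper names, and the emptiness check assumes extract_text() can return None for a page without a text layer, which some pdfplumber versions do.

```python
from typing import Iterable


def join_page_text(pages: Iterable) -> str:
    """Join the text of page-like objects that expose extract_text()."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # skip pages where nothing was extracted (None or "")
            parts.append(text)
    return "\n".join(parts)


def extract_all_text(path: str) -> str:
    """Open a PDF with pdfplumber and return the text of all pages."""
    import pdfplumber  # imported here so join_page_text works without it

    # The with-block closes the file even if extraction raises.
    with pdfplumber.open(path) as pdf:
        return join_page_text(pdf.pages)
```

Calling extract_all_text('pdffile.pdf') would then return one string with the pages separated by newlines.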
Solution 3
If you are looking for a maintained, bigger project, have a look at PyMuPDF. Install it with pip install pymupdf
and use it like this:
import fitz  # PyMuPDF

def get_text(filepath: str) -> str:
    with fitz.open(filepath) as doc:
        text = ""
        for page in doc:
            # get_text() is the current name of the older getText()
            text += page.get_text().strip()
        return text
Solution 4
PyPDF2 is highly unreliable for extracting text from PDFs, as is pointed out here too. It says:
While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.
You could instead install and use pdfminer:
pip install pdfminer
Alternatively, you can use another open-source utility, pdftotext, by xpdfreader. Instructions for using the utility are given on its page.
You can download the command-line tools from here and run the pdftotext.exe utility using subprocess. A detailed explanation of using subprocess is given here.
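As a rough illustration of the subprocess approach, here is a minimal sketch. It assumes a pdftotext executable is on the PATH (on Windows, pass the full path to pdftotext.exe as exe) and that it accepts "-" as the output file to write to stdout. build_pdftotext_command and pdf_to_text are hypothetical helper names.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, exe: str = "pdftotext") -> List[str]:
    """Build the argument list; '-' as the output file means stdout."""
    return [exe, "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path: str, exe: str = "pdftotext") -> str:
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, exe),
        capture_output=True,  # collect stdout and stderr
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.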
Solution 5
Here is an alternative solution in Windows 10, Python 3.8
Example test pdf: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing
# pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
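pdfminer.six also ships a high-level helper, pdfminer.high_level.extract_text, which covers roughly the same steps as the function above. A minimal sketch, assuming a reasonably recent pdfminer.six; pdfminer_to_text is a hypothetical wrapper name.

```python
import os


def pdfminer_to_text(path: str) -> str:
    """Return the text of a PDF using pdfminer.six's high-level API."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    # Imported lazily so a missing file is reported before the library is needed.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

The longer version is still useful when you need control over LAParams, page numbers, or passwords.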
RaV1oLLi
Updated on May 02, 2022
Comments
-
RaV1oLLi about 2 years
I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so that I can easily record monthly spending. Right now I am focusing just on extracting the text from the PDF file, but I don't know how to do so.
What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?
I have tried using PyPDF2, but every time I try to extract text from any page using extractText(), it returns empty strings. I have tried installing textract, but I get errors, I think because I need more libraries.
import PyPDF2
pdfFileObj = open("January2019.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
This prints empty strings when it should be printing the contents of the page
-
Error - Syntactical Remorse, about 5 years ago: Does the PDF have textual content?
-
SyntaxVoid supports Monica, about 5 years ago: Is there actual text in the PDF? Can you use your mouse to highlight and copy text from the PDF? From the official documentation of PyPDF2: 'extractText(): Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated. Returns: a unicode string object.'
-
lit, about 5 years ago: How about searching through the questions already on SO? stackoverflow.com/questions/tagged/pypdf2
-
RaV1oLLi, about 5 years ago: Yes, there is actual text all over the PDF that I can highlight.
-
RaV1oLLi, about 5 years ago: This also prints empty lines.
-
İsa GİRİŞKEN, about 5 years ago: I tried it on a CV and it didn't work. But if there is only text, it works. Is there a picture in your PDF?
-
Nick, about 5 years ago: Code-only answers are discouraged. Please add some explanation as to how this solves the problem, or how this differs from the existing answers. From Review
-
İsa GİRİŞKEN, about 5 years ago: I'm trying it on my PC, don't worry. When I find it I will let you know :) But for now, with a picture in the PDF it doesn't read the text.
-
dataviews, about 5 years ago: Finally found a solution that worked for me. None of the other PDF scanners worked for my use case, which may be due to the formatting of the actual PDF. However, this tika package worked flawlessly. You will need to install the latest version of Java, as well as the Java tika server.jar file. Once you have downloaded the Java tika server jar file, you can run java -jar java-tika-server.jar from cmd on Windows to start the local server, and then this package will work from Python.
-
Siddharth Das, about 5 years ago: It is the best thing I have found. I have tried PyPDF2 and pdfminer, but this suits my purpose because it gives line-by-line output.
-
Andrew Anderson, over 3 years ago: I can confirm that tika is a very nice choice. I like it for its simplicity and its ability to extract links from PDFs. However, I found an even better way to do the job from the Windows command line: gswin64c -sDEVICE=txtwrite -o pdf2text.txt "sample.pdf" ...provided you have gswin64c.exe installed and the path set correctly. It was already installed on my machine; I just had to set the PATH.
-
user1465073, over 3 years ago: You saved me from losing my sanity. I'm trying to open PDFs in Arabic, Chinese, and other non-English languages, and your solution preserved the characters. Thank you.
-
AHK, over 3 years ago: Could you loop this solution over multiple folders with multiple PDFs and turn the results into a dataframe or similar? I have a question about it, if you could kindly look: stackoverflow.com/questions/66224627/…
-
arjun, over 2 years ago: This solution seems more effective than PyPDF2.
-
Aska, about 2 years ago: Excellent package, much better than PyPDF2, thank you!