How do I convert a PDF to text so I can parse that text with PHP?
Solution 1
I ended up using Xpdf (which includes pdftotext). It works well, and I use it in production to extract text from millions of PDFs uploaded to our servers.
Below is the install process on CentOS Linux:
- download version 3.03 from here: http://foolabs.com/xpdf/download.html
- tar -zxvf xpdfbin-linux-3.03.tar.gz ( extract tar.gz )
- create the required directories for the install (some or all of these might exist already; the -p flag makes mkdir skip those that do)
- sudo mkdir -p /usr/local/man/man1/
- sudo mkdir -p /usr/local/man/man5/
- sudo mkdir -p /usr/local/etc/
- move the files out of the extracted folder (cd into the folder where xpdf was just extracted)
- move all the executables from the bin64 directory (xpdf, pdftotext, and the rest) to /usr/local/bin/
- move the sample-xpdfrc file to /usr/local/etc/xpdfrc, so that xpdfrc is the config file itself rather than a directory (the sample can be used as is)
- move the manual pages from the doc directory (*.1 to /usr/local/man/man1/ and *.5 to /usr/local/man/man5/)
- xpdf should now be installed and ready to use
- you can delete the downloaded tar.gz file and the folder where it was unzipped
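The manual steps above can be sketched as a single shell function. This is only a sketch under some assumptions: the name install_xpdf, its two arguments, and the extracted folder name in the usage comment are illustrative and not part of Xpdf. It copies files rather than moving them, so the extracted folder stays intact until you delete it, and it installs the sample config as a file named xpdfrc under the etc directory.

```shell
# Sketch of the manual install steps as one function.
# install_xpdf and its arguments are illustrative names, not part of Xpdf.
install_xpdf() {
    src="$1"                   # folder created by extracting the tarball
    prefix="${2:-/usr/local}"  # install root; /usr/local matches the steps above

    # Create the target directories; -p skips any that already exist.
    mkdir -p "$prefix/bin" "$prefix/man/man1" "$prefix/man/man5" "$prefix/etc" || return 1

    # 64-bit executables (xpdf, pdftotext, and the rest).
    cp "$src"/bin64/* "$prefix/bin/" || return 1

    # Sample config, installed as a file named xpdfrc.
    cp "$src/sample-xpdfrc" "$prefix/etc/xpdfrc" || return 1

    # Manual pages: *.1 into man1, *.5 into man5.
    cp "$src"/doc/*.1 "$prefix/man/man1/" || return 1
    cp "$src"/doc/*.5 "$prefix/man/man5/" || return 1
}

# Example, run as root so /usr/local is writable
# (the folder name is assumed from the tarball name):
#   install_xpdf xpdfbin-linux-3.03
```

Passing a second argument, for example a directory under your home folder, installs everything there instead, which is a convenient way to try the layout without root.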
Solution 2
Third-party software can dump the text contents of a PDF file, for example:
- xdoc2txt (Windows-only, used in WinMerge plugins)
- pdftotext, part of Xpdf
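As a rough illustration of the pdftotext option, here is a minimal wrapper, assuming pdftotext is already on the PATH. The function name pdf_to_text is made up for this example; the -layout and -enc options and the "-" output argument are standard pdftotext usage, but check the man page of your installed version.

```shell
# pdf_to_text is an illustrative helper name, not part of Xpdf.
# Usage: pdf_to_text input.pdf [output.txt]
pdf_to_text() {
    in="$1"
    out="${2:--}"   # "-" tells pdftotext to write to standard output

    # -layout  tries to keep the physical page layout (columns, indentation)
    # -enc     sets the encoding of the extracted text
    pdftotext -layout -enc UTF-8 "$in" "$out"
}

# Examples:
#   pdf_to_text report.pdf              # print the text to the terminal
#   pdf_to_text report.pdf report.txt   # write the text to a file
```

From PHP, the same command line can be run with shell_exec(), with the file name passed through escapeshellarg(), and the returned string parsed as ordinary text.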
Solution 3
You can't do that with file_get_contents() alone, because a PDF is a binary format: its text is usually stored in compressed, encoded content streams rather than as plain text. To read or modify a PDF file you can use a third-party library. Take a look at:
And don't forget
T. Brian Jones
Updated on July 06, 2022
Comments
T. Brian Jones almost 2 years
I have PDFs that are mostly simply formatted text. I would like to parse the text with PHP. I realize that the PDF is binary so I need a utility or library to convert it to text.
Any recommendations?