How to extract the text from MS Office documents in Linux?
Solution 1
I finally found the perfect tool for scripting document parsing: Apache Tika. It can parse a gazillion non-text formats into text, which is very cool!
Get Apache Tika here:
(Mac Homebrew users: brew install tika)
The command-line interface works like this:
tika --text something.docx > something.txt
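For more than one file, the same command can be wrapped in a loop. This is only a minimal sketch, assuming a `tika` command is on the PATH exactly as in the line above; the helper names `txt_name` and `tika_to_txt` are made up for illustration and are not part of Tika.

```shell
#!/bin/sh
# Hypothetical helper: map an input path to its .txt output path.
# Only the extension of the file name itself is replaced, so a dot in a
# directory name is left alone.
txt_name() {
    case "${1##*/}" in
        *.*) printf '%s.txt\n' "${1%.*}" ;;
        *)   printf '%s.txt\n' "$1" ;;
    esac
}

# Hypothetical helper: extract text from every file given as an argument.
# Assumption: `tika --text FILE` writes the extracted text to stdout.
tika_to_txt() {
    for f in "$@"; do
        out=$(txt_name "$f")
        if tika --text "$f" > "$out"; then
            echo "wrote $out"
        else
            echo "failed: $f" >&2
            rm -f "$out"    # do not leave an empty or partial file behind
        fi
    done
}

# Usage: tika_to_txt *.docx *.xlsx *.pptx
```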
Solution 2
catdoc can convert .doc, .xls, and .ppt files to text. A second option would be wvWare.
For more utilities, see http://www.linux.com/archive/articles/52385 for Word-to-text converters.
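The catdoc package ships one program per format: `catdoc` for Word, `xls2csv` for Excel, and `catppt` for PowerPoint. A small dispatcher could pick the right one from the file extension. This is a sketch under that assumption; `pick_tool` and `office_to_text` are hypothetical names, and only the older binary formats are covered, not .docx, .xlsx, or .pptx.

```shell
#!/bin/sh
# Hypothetical helper: choose an extractor from the file extension
# (matched case-insensitively). Prints the tool name, or fails if the
# extension is not one of the legacy Office formats.
pick_tool() {
    case "$1" in
        *.[dD][oO][cC]) echo catdoc ;;
        *.[xX][lL][sS]) echo xls2csv ;;
        *.[pP][pP][tT]) echo catppt ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print the text of one legacy Office file to stdout.
office_to_text() {
    tool=$(pick_tool "$1") || { echo "unsupported: $1" >&2; return 1; }
    "$tool" "$1"
}

# Usage: office_to_text report.doc > report.txt
```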
Solution 3
AbiWord can convert between any file formats it knows from the command line.
Convert from Word to plain text:
abiword --to=txt myfile.doc
Make a pdf from a Word file:
abiword --to=pdf myfile.doc
And so on. The results in these cases would be myfile.txt or myfile.pdf. If you want to specify the output name you can do that too:
abiword --to=txt --to-name=output.txt myfile.doc
Convert ODT to Word:
abiword --to=doc myfile.odt
Convert Word to ODT:
abiword --to=odt myfile.doc
In fairness to other answers, it should be noted that AbiWord uses wvWare to handle Word documents, but even the wvWare homepage recommends using AbiWord instead for most conversions.
I hate word processors. This is the main reason I have AbiWord installed.
You might also be interested in unoconv, which is a similar tool supporting formats OpenOffice knows (which would include spreadsheets and the like), but I have no experience with it personally.
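The AbiWord options shown in this answer can be combined to convert a batch of documents into a separate directory. This is a rough sketch rather than a tested recipe: `out_path` and `abiword_to_txt` are hypothetical helpers, and the only AbiWord options used are the `--to` and `--to-name` ones from the examples above.

```shell
#!/bin/sh
# Hypothetical helper: output path for a source file inside a target
# directory: out_path DIR FILE prints DIR/<basename without extension>.txt
out_path() {
    base=${2##*/}    # strip leading directories
    printf '%s/%s.txt\n' "$1" "${base%.*}"
}

# Hypothetical helper: convert every given document to plain text,
# writing the results into the directory passed as the first argument.
abiword_to_txt() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for src in "$@"; do
        abiword --to=txt --to-name="$(out_path "$outdir" "$src")" "$src" ||
            echo "failed: $src" >&2
    done
}

# Usage: abiword_to_txt text_out *.doc *.odt
```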
Solution 4
With LibreOffice you can do:
libreoffice --invisible --convert-to pdf file1.ppt file2.ppt
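That command produces PDFs, so getting text still needs one more step, for example `pdftotext` from the question. The sketch below chains the two. It makes some assumptions: it uses `--headless` and `--outdir` instead of `--invisible`, it expects LibreOffice to write the PDF under the input's base name, and `pdf_path` and `office_pdf_text` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: path of the PDF expected in the output directory:
# pdf_path DIR FILE prints DIR/<basename without extension>.pdf
pdf_path() {
    base=${2##*/}
    printf '%s/%s.pdf\n' "$1" "${base%.*}"
}

# Hypothetical helper: Office file -> PDF -> text on stdout.
office_pdf_text() {
    tmp=$(mktemp -d) || return 1
    libreoffice --headless --convert-to pdf --outdir "$tmp" "$1" >/dev/null &&
        pdftotext "$(pdf_path "$tmp" "$1")" -    # "-" sends the text to stdout
    status=$?
    rm -rf "$tmp"    # remove the intermediate PDF
    return $status
}

# Usage: office_pdf_text file1.ppt > file1.txt
```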
Phyo Arkar Lwin
Updated on September 17, 2022
Comments
-
Phyo Arkar Lwin over 1 year
I need a way to extract the text from all of the MS Office document types (Word, Excel, PowerPoint) in Linux. I envision that there might be several different approaches to accomplish this, such as a Bash or Python script, or converting them to PDF and then extracting the text using a tool such as pdftotext.
This seems like it might be a commonplace requirement. Is there an established procedure or tool to accomplish this easily?
-
Phyo Arkar Lwin almost 14 years
Catdoc! That's the thing I am looking for! Will it also work for ODF?
-
Phyo Arkar Lwin almost 14 years
Interesting, can that convert any printable stuff to PDF? Can you point me to an example doing that for DOC or XLS?
-
nahar almost 14 years
Just googled and got stosberg.net/odt2txt. Never tried it, but it seems like it does the job.
-
ptman over 13 years
unoconv seems to be the OpenOffice-related tool I couldn't remember.
-
Phyo Arkar Lwin over 13 years
Cool, thanks. catdoc is OK, but it can't convert xls and ppt to text; I use xls2csv and apache-tika for them. Check them out!
-
CarlF over 12 years
This is the exact opposite of what the OP asked for.
-
Allen over 11 years
@nahar, odt2txt only works on the ODT format, not MS .doc.
-
Scott - Слава Україні about 11 years
(1) Catdoc was proposed in an answer that was posted within an hour of the question, almost three years ago. Why are you repeating it? (2) Where can antiword be obtained? (3) What does the bottom half of your answer mean?
-
Warface about 10 years
For .docx documents it messes up :S But it's a nice solution for .doc.
-
Gagaro about 10 years
You can use the Text filter to convert to txt: libreoffice --invisible --convert-to txt:Text files
-
fotanus almost 10 years
Great, catdoc gives me a segmentation fault.
-
user2518618 over 8 years
+1: Apache Tika is a serious open source project. It also works in Windows, works from the command line, has a GUI with drag and drop, opens anything (Word, Excel, PowerPoint, PDF, SVG), and extracts the metadata of the document as well. After trying most of the tools above, Apache Tika is what I was looking for. This should be the accepted answer (I don't know if you can accept your own answer).
-
Phyo Arkar Lwin over 8 years
Did, shamelessly... :D