How to extract the text from MS Office documents in Linux?

Solution 1

I finally found the perfect tool for scripting document parsing: Apache Tika. It can parse a huge number of non-text formats into text, which is very cool!

Get Apache Tika here:

http://tika.apache.org/

(Mac Homebrew users: brew install tika)

The command-line interface works like this:

tika --text something.docx > something.txt
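
To run the same command over a whole directory, a loop along these lines might work. This is only a sketch: it assumes a `tika` wrapper is on the PATH as in the example above, the function name `extract_all` is made up for illustration, and the `EXTRACTOR` variable is a hypothetical override so the loop can be tried with a different command.

```shell
#!/usr/bin/env bash
# Sketch: convert every Office document in a directory to plain text.
# Assumes `tika --text FILE` prints the extracted text on stdout.
# EXTRACTOR is a hypothetical override (defaults to `tika`).

extract_all() {
    local dir="${1:-.}"
    local extractor="${EXTRACTOR:-tika}"
    local f
    for f in "$dir"/*.doc "$dir"/*.docx "$dir"/*.xls "$dir"/*.xlsx \
             "$dir"/*.ppt "$dir"/*.pptx; do
        [ -e "$f" ] || continue   # skip patterns that matched nothing
        # Write next to the source as NAME.EXT.txt so that a.doc and
        # a.docx do not overwrite each other's output.
        "$extractor" --text "$f" > "$f.txt" || echo "failed: $f" >&2
    done
}
```

Calling `extract_all ~/Documents` would then leave a `.txt` file beside each document it could read.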

Solution 2

Catdoc can convert doc, xls & ppt to text. A second option would be wvWare.

For more utilities, check http://www.linux.com/archive/articles/52385 for Word-to-text converters.
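
The catdoc package normally ships one command per legacy format: `catdoc` for .doc, `xls2csv` for .xls and `catppt` for .ppt. A small dispatcher like the following could pick the right one by file extension. It is a sketch only; the function names `catdoc_tool_for` and `to_text` are invented here, and it assumes all three commands are installed.

```shell
#!/usr/bin/env bash
# Sketch: choose the catdoc-family tool that matches a file's extension.
# Only the legacy binary formats are covered; anything else is rejected.

catdoc_tool_for() {
    case "${1##*.}" in
        doc|DOC) echo catdoc ;;
        xls|XLS) echo xls2csv ;;
        ppt|PPT) echo catppt ;;
        *)       return 1 ;;
    esac
}

# Print the text of one file on stdout using the selected tool.
to_text() {
    local tool
    tool=$(catdoc_tool_for "$1") || { echo "unsupported: $1" >&2; return 1; }
    "$tool" "$1"
}
```

For example, `to_text budget.xls > budget.csv` would run `xls2csv`, while a .docx file is reported as unsupported.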

Solution 3

Abiword can convert from the commandline between any file formats it knows.

Convert from Word to plain text:

abiword --to=txt myfile.doc

Make a pdf from a Word file:

abiword --to=pdf myfile.doc

And so on. The results in these cases would be myfile.txt or myfile.pdf. If you want to specify the output name you can do that too:

abiword --to=txt --to-name=output.txt myfile.doc

Convert ODT to Word:

abiword --to=doc myfile.odt

Convert Word to ODT:

abiword --to=odt myfile.doc
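
Several files can be converted in one go by combining the `--to` and `--to-name` options shown above. The following is a rough sketch; the function name `abiword_to_txt` and the output-directory layout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: convert each given Word file to plain text in one output directory.
# Usage: abiword_to_txt OUTDIR FILE...

abiword_to_txt() {
    local outdir="$1"
    shift
    mkdir -p "$outdir" || return 1
    local f base
    for f in "$@"; do
        base=$(basename "$f")
        # report.doc becomes OUTDIR/report.txt
        abiword --to=txt --to-name="$outdir/${base%.*}.txt" "$f" \
            || echo "failed: $f" >&2
    done
}
```

For example, `abiword_to_txt text/ *.doc` would write one .txt file per document into `text/`.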

In fairness to other answers, it should be noted that AbiWord uses wvWare to handle Word documents, but even the wvWare homepage recommends using AbiWord instead for most conversions.

I hate word processors. This is the main reason I have AbiWord installed.

You might also be interested in unoconv, which is a similar tool supporting formats OpenOffice knows (which would include spreadsheets and the like), but I have no experience with it personally.

Solution 4

With LibreOffice you can do:

libreoffice --invisible --convert-to pdf file1.ppt file2.ppt
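
Since the goal is text rather than PDF, the PDF step can be chained with `pdftotext`. The sketch below makes a few assumptions: `pdftotext` (from poppler-utils) is installed, `--headless` is used in place of `--invisible` for scripted runs, and the function name `office_to_text_via_pdf` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: Office file -> PDF (LibreOffice) -> plain text (pdftotext).
# Usage: office_to_text_via_pdf INPUT OUTPUT.txt

office_to_text_via_pdf() {
    local src="$1" out="$2"
    local tmp base status
    tmp=$(mktemp -d) || return 1
    base=$(basename "$src")
    # LibreOffice writes NAME.pdf into the directory given by --outdir.
    libreoffice --headless --convert-to pdf --outdir "$tmp" "$src" >/dev/null &&
        pdftotext "$tmp/${base%.*}.pdf" "$out"
    status=$?
    rm -rf "$tmp"        # remove the intermediate PDF
    return "$status"
}
```

For example, `office_to_text_via_pdf file1.ppt file1.txt` would leave only the text file behind.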

Author: Phyo Arkar Lwin

Updated on September 17, 2022

Comments

  • Phyo Arkar Lwin over 1 year

    I need a way to extract the text from all of the MS Office document types (Word, Excel, Powerpoint), in Linux. I envision that there might be several different approaches to accomplish this, such as a Bash or Python script, or converting them to PDF and then extracting the text using a tool such as pdftotext.

    This seems like it might be a commonplace requirement. Is there an established procedure or tool to accomplish this easily?

  • Phyo Arkar Lwin almost 14 years
    Catdoc! That's the thing I am looking for! Will it also work for ODF?
  • Phyo Arkar Lwin almost 14 years
    Interesting, can that convert any printable stuff to PDF? Can you point me to an example doing that for Doc or Xls?
  • nahar almost 14 years
    Just googled & got stosberg.net/odt2txt. Never tried it, but it seems like it does the job.
  • ptman over 13 years
    unoconv seems to be the OpenOffice-related tool I couldn't remember.
  • Phyo Arkar Lwin over 13 years
    Cool, thanks. catdoc is OK, but it can't convert xls or ppt to text; I use xls2csv and apache-tika for them. Check them out!
  • CarlF over 12 years
    This is the exact opposite of what the OP asked for.
  • Allen over 11 years
    @nahar, odt2txt only works on odt format, not ms doc.
  • Scott - Слава Україні about 11 years
    (1) Catdoc was proposed in an answer that was posted within an hour of the question, almost three years ago. Why are you repeating it? (2) Where can antiword be obtained? (3) What does the bottom half of your answer mean?
  • Warface about 10 years
    For .docx documents it messes up :S But a nice solution for .doc
  • Gagaro about 10 years
    You can use the Text filter to convert to txt: libreoffice --invisible --convert-to txt:Text files
  • fotanus almost 10 years
    Great, catdoc gives me a segmentation fault.
  • user2518618 over 8 years
    +1: Apache Tika is a serious open source project. It also works in Windows, works from the command line, has a GUI with drag and drop, opens anything (Word, Excel, PowerPoint, PDF, SVG), and extracts the document's metadata as well. After trying most of the tools above, Apache Tika is what I was looking for. This should be the accepted answer (I don't know if you can accept your own answer).
  • Phyo Arkar Lwin over 8 years
    Did, shamelessly... :D