How can I search a word in a Word 2007 .docx file?

47,327

Solution 1

More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.

Solution 2

After reading your post above, I made a 100% native Python docx module to solve this specific problem.

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# Search returns true if found    
search(document,'your search string')

The docx module is at https://python-docx.readthedocs.org/en/latest/

Solution 3

In this example, "Course Outline.docx" is a Word 2007 document, which does contain the word "Windows", and does not contain the phrase "random other string".

>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()

Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the 'document.xml' file in the 'word' folder. If you wanted to be more sophisticated, you could then parse the XML, but if you're just looking for a phrase (which you know won't be a tag), then you can just look in the XML for the string.

Solution 4

A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting is different, for example as in Hello World. But it can be split at any point and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!

<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">

<w:r w:rsidRPr="003F6D7A">

<w:rPr>

<w:b /> 

</w:rPr>

<w:t>Hello</w:t> 

</w:r>

<w:r>

<w:t xml:space="preserve">World.</w:t> 

</w:r>

</w:p>

You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other "dead ends" just because the OOXML spec is around 6000 pages long and MS Word can write lots of "stuff" you don't expect. So you could end up writing your own document processing library.

Or you can try using Aspose.Words.

It is available as .NET and Java products. Both can be used from Python. One via COM Interop another via JPype. See Aspose.Words Programmers Guide, Utilize Aspose.Words in Other Programming Languages (sorry I can't post a second link, stackoverflow does not let me yet).

Solution 5

You can use docx2txt to get the text inside the docx, than search in that txt

npm install -g docx2txt
docx2txt input.docx # This will  print the text to stdout
Share:
47,327
Gerry
Author by

Gerry

Updated on July 09, 2022

Comments

  • Gerry
    Gerry almost 2 years

    I'd like to search a Word 2007 file (.docx) for a text string, e.g., "some special phrase" that could/would be found from a search within Word.

    Is there a way from Python to see the text? I have no interest in formatting - I just want to classify documents as having or not having "some special phrase".

  • mikemaccana
    mikemaccana over 14 years
    It's probably easier to look for the phrase in the element text (using an XML parser) than having to worry about whether part of your text is matched by an element name.
  • user1006544
    user1006544 over 12 years
    Ya i get all the xml file.Now i want to ask you that How can we get all the values like (bold,italic ,color,fonname,space ) and all the formatting setting ,How can we get this values from xml.
  • claws
    claws over 11 years
    OOXML spec is around 6000 pages long : You've got to be kidding me :O
  • 11684
    11684 over 11 years
    Wait... You wrote an entire module just for this question?!
  • mikemaccana
    mikemaccana over 11 years
    @11684 Yes, I had the same problem as the poster, and all I could fine were horrible solutions for invoking .net or Java from Python.
  • Marc Maxmeister
    Marc Maxmeister about 11 years
    If I knew how to give you my reputation points, I would award them for this -write-a-solution- answer! So I tweeted it instead. MAJOR THANKS! (total time to solve this problem: 25mins, thanks to someone writing the code for me)
  • Vishwanath
    Vishwanath almost 11 years
    I think nailer deserves a meme. "Good guy nailer. Sees that a friend is troubled with a code. Writes a library himself."
  • mpiskore
    mpiskore over 5 years
    Since asterisk imports are an antipattern I'd suggest from docx import document, opendocx
  • Top-Master
    Top-Master about 5 years
    Although I was searching for a C++ library, this is really a motivating answer...
  • Dirk Schiller
    Dirk Schiller about 4 years
    opendocxand search doesn't work in Release v0.8.10. I couldn't find any Information about search. opendocx seems to be now Document.