How to iterate over everything in a python-docx document?

python python-docx

12,442

Solution 1

There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.

python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you: https://github.com/python-openxml/python-docx/issues/40

There is no exact counterpart I know of to deal with inline items, but I expect you could get pretty far with paragraph.runs. All inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something you could go down the the lxml level and decode some of the XML to get what you needed. If you get that far along and are still keen, if you post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" I can probably provide you with some similar code to get you what you need.

This requirement comes up from time to time so we'll definitely want to add it at some point.

Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.

Solution 2

Assuming doc is of type Document, then what you want to do is have 3 separate iterations:

One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes

The reason your code wasn't working was that paragraphs don't have references to the tables and or shapes within the document, as that is stored within the Document object.

Here is the documentation for more info: python-docx

12,442

thebitguru

@farhanahmadlinkedin

Updated on June 04, 2022

Comments

thebitguru about 2 years
I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
```
for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?
```
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?

Thanks!
thebitguru almost 10 years

Thanks for the quick response. How would I know the order that they appeared in the original document?
mleyfman almost 10 years

There does not seem to be anything in the documentation/api for this. Perhaps you could add a feature request. In it's current state python-docx appears more suited for creating .docx files rather than reading them.
mleyfman almost 10 years

You may want to look into writing your own parser, as a .docx file is essentially a XML file. Here are some starting points: virantha.com/2013/08/16/…, lxml.de
thebitguru almost 10 years

Thanks! That gives me a good next step. I will see if I can add the code for iterating the inline items within a paragraph.
RightmireM almost 6 years

Unfortunately, the code at Issue 40 no longer works with the changes in commit e784a73. Is there some updated code?
scanny almost 6 years

If you can add a post to that issue with what you're trying and what's not working I'll see if I can help. It looks to me like it works, just not all in one post of that issue.