how to create a dataframe from a table in a word document (.docx) file using pandas

13,074

docx always reads data from Word tables as text (strings).

If we want to parse data with correct dtypes we can do one of the following:

  • manually specify dtype for all columns (not flexible)
  • write our own code to guess correct dtypes (too difficult and , Pandas IO methods do it well)
  • convert data into CSV format and let pd.read_csv() guess/infer correct dtypes (I've chosen this way)

Many thanks to @Anton vBR for improving the function!


import pandas as pd
import io
import csv
from docx import Document

def read_docx_tables(filename, tab_id=None, **kwargs):
    """
    parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)

    Parameters:
        filename:   file name of a Word Document

        tab_id:     parse a single table with the index: [tab_id] (counting from 0).
                    When [None] - return a list of DataFrames (parse all tables)

        kwargs:     arguments to pass to `pd.read_csv()` function

    Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
    """
    def read_docx_tab(tab, **kwargs):
        vf = io.StringIO()
        writer = csv.writer(vf)
        for row in tab.rows:
            writer.writerow(cell.text for cell in row.cells)
        vf.seek(0)
        return pd.read_csv(vf, **kwargs)

    doc = Document(filename)
    if tab_id is None:
        return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
    else:
        try:
            return read_docx_tab(doc.tables[tab_id], **kwargs)
        except IndexError:
            print('Error: specified [tab_id]: {}  does not exist.'.format(tab_id))
            raise

NOTE: you may want to add more checks and exception catching...

Examples:

In [209]: dfs = read_docx_tables(fn)

In [210]: dfs[0]
Out[210]:
   A   B               C,X
0  1  B1                C1
1  2  B2                C2
2  3  B3  val1, val2, val3

In [211]: dfs[0].dtypes
Out[211]:
A       int64
B      object
C,X    object
dtype: object

In [212]: dfs[0].columns
Out[212]: Index(['A', 'B', 'C,X'], dtype='object')

In [213]: dfs[1]
Out[213]:
   C1  C2          C3    Text column
0  11  21         NaN  Test "quotes"
1  12  23  2017-12-31            NaN

In [214]: dfs[1].dtypes
Out[214]:
C1              int64
C2              int64
C3             object
Text column    object
dtype: object

In [215]: dfs[1].columns
Out[215]: Index(['C1', 'C2', 'C3', 'Text column'], dtype='object')

parsing dates:

In [216]: df = read_docx_tables(fn, tab_id=1, parse_dates=['C3'])

In [217]: df
Out[217]:
   C1  C2         C3    Text column
0  11  21        NaT  Test "quotes"
1  12  23 2017-12-31            NaN

In [218]: df.dtypes
Out[218]:
C1                      int64
C2                      int64
C3             datetime64[ns]
Text column            object
dtype: object
Share:
13,074
pyd
Author by

pyd

Just a Tech.

Updated on June 26, 2022

Comments

  • pyd
    pyd almost 2 years

    I have a word file (.docx) with table of data, I am trying to create a pandas data frame using that table, I have used docx and pandas module. But I could not create a data frame.

    from docx import Document
    document = Document('req.docx')
    for table in document.tables:
        for row in table.rows:       
            for cell in row.cells:        
                print (cell.text)
    

    and also tried to read table as df pd.read_table("path of the file")

    I can read the data cell by cell but I want to read the entire table or any particular column. Thanks in advance