Read text file data to pandas DataFrame

Yes, it is possible, but the approach depends heavily on the data:

  • first call read_csv, skipping the first 3 rows and the whitespace that follows each separator
  • strip the trailing whitespace from the column names
  • create the TYPE column by extracting the values between [] and forward filling them to the following rows
  • create a helper column G that distinguishes each DataFrame, using startswith and cumsum
  • finally, use contains to remove the rows whose first column starts with [, -- or *

import pandas as pd

# 'file' is the path to the .txt file
df = pd.read_csv(file, sep="!", skiprows=3, skipinitialspace=True)
df.columns = df.columns.str.strip()
# raw strings keep the regex backslashes from being read as string escapes
df['TYPE'] = df['*BOHRKOPF'].str.extract(r'\[(.*)\]', expand=False).ffill()
df['G'] = df['*BOHRKOPF'].str.startswith('*').cumsum()
df = df[~df['*BOHRKOPF'].str.contains(r'^\[|^--|^\*')]
print(df)
     *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR  \
2   A21              1  62   0.000     0.000       0.000       NaN   
4   A12             -1  62   0.000  -160.000       0.000       NaN   
5   A12              2  62   0.000  -128.000       3.000      70.0   
6   A12             -3  62   0.000   -96.000       0.000       NaN   
7   A12              4  62   0.000   -64.000       0.000       NaN   
12  O11             -9  62   0.000   -96.000       0.000       NaN   
13  O11             10  62   0.000  -128.000       5.000      70.0   

             TYPE  G  
2   NoValidForUse  0  
4             V11  0  
5             V11  0  
6             V11  0  
7             V11  0  
12            V11  1  
13            V11  1  
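
To see what the two helper columns do, here is a minimal sketch that applies the same calls to a made-up Series (the values are illustrative and not taken from the file). Because the header row is part of the data in this toy example, G starts at 1, whereas in the frame above the first header line was consumed by read_csv, so the first table gets G = 0.

```python
import pandas as pd

# Toy first column that mimics the file layout (illustrative values only).
s = pd.Series(['*HEAD', '[T1]', 'a', '[T2]', 'b', '*HEAD', '[T2]', 'c'])

# Text inside [...] on marker rows, NaN elsewhere; ffill carries it downward.
type_col = s.str.extract(r'\[(.*)\]', expand=False).ffill()

# True on every header row; the running sum numbers the tables.
group_id = s.str.startswith('*').cumsum()

demo = pd.DataFrame({'first': s, 'TYPE': type_col, 'G': group_id})
print(demo)
```

The marker and header rows still appear in the result at this point, which is why the answer filters them out with contains as the last step.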

Then filter by the G column:

df1 = df[df['G'] == 0].drop('G', axis=1)
print (df1)
    *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR  \
2  A21              1  62   0.000     0.000       0.000       NaN   
4  A12             -1  62   0.000  -160.000       0.000       NaN   
5  A12              2  62   0.000  -128.000       3.000      70.0   
6  A12             -3  62   0.000   -96.000       0.000       NaN   
7  A12              4  62   0.000   -64.000       0.000       NaN   

            TYPE  
2  NoValidForUse  
4            V11  
5            V11  
6            V11  
7            V11  

df2 = df[df['G'] == 1].drop('G', axis=1)
print (df2)
     *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR TYPE
12  O11             -9  62   0.000   -96.000       0.000       NaN  V11
13  O11             10  62   0.000  -128.000       5.000      70.0  V11

If the file contains several DataFrames, a list comprehension can build a list of them:

dfs = [v.drop('G', axis=1) for k, v in df.groupby('G')]
print (dfs[0])
    *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR  \
2  A21              1  62   0.000     0.000       0.000       NaN   
4  A12             -1  62   0.000  -160.000       0.000       NaN   
5  A12              2  62   0.000  -128.000       3.000      70.0   
6  A12             -3  62   0.000   -96.000       0.000       NaN   
7  A12              4  62   0.000   -64.000       0.000       NaN   

            TYPE  
2  NoValidForUse  
4            V11  
5            V11  
6            V11  
7            V11  

print (dfs[1])
     *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR TYPE
12  O11             -9  62   0.000   -96.000       0.000       NaN  V11
13  O11             10  62   0.000  -128.000       5.000      70.0  V11
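
For a run that does not need a file on disk, the steps above can be sketched end to end with the sample from the question wrapped in io.StringIO. Two things here are additions and not part of the answer: the values of the first column are stripped as well, and the columns listed in numeric_cols (a name chosen for this sketch) are converted with pd.to_numeric, on the assumption that they only hold numbers once the header and separator rows are gone. KOMMENTAR is left alone because it may hold free text.

```python
from io import StringIO
import pandas as pd

sample = """\
_MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21       !      1!62!     0.000!     0.000!      0.000!
[V11]
A12       !     -1!62!     0.000!  -160.000!      0.000!
A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
A12       !     -3!62!     0.000!   -96.000!      0.000!
A12       !      4!62!     0.000!   -64.000!      0.000!
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[V11]
O11       !     -9!62!     0.000!   -96.000!      0.000!
O11       !     10!62!     0.000!  -128.000!      5.000!  70.0
"""

df = pd.read_csv(StringIO(sample), sep="!", skiprows=3, skipinitialspace=True)
df.columns = df.columns.str.strip()

# Addition: strip the values of the first column too, not only the header.
key = df['*BOHRKOPF'].str.strip()
df['*BOHRKOPF'] = key
df['TYPE'] = key.str.extract(r'\[(.*)\]', expand=False).ffill()
df['G'] = key.str.startswith('*').cumsum()

# Drop marker, separator and repeated header rows; copy to get an independent frame.
df = df[~key.str.contains(r'^\[|^--|^\*')].copy()

# Assumption: these columns are purely numeric after the filtering above.
numeric_cols = ['SPINDEL', 'WK', 'DELTA-X', 'DELTA-Y', 'DURCHMESSER']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

tables = [g.drop(columns='G') for _, g in df.groupby('G')]
print(tables[1])
```

Without the conversion the columns stay as strings, because the repeated header line shares those columns until it is filtered out.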

EDIT:

temp=u"""_MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21       !      1!62!     0.000!     0.000!      0.000!
[V11]
A12       !     -1!62!     0.000!  -160.000!      0.000!
A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
A12       !     -3!62!     0.000!   -96.000!      0.000!
A12       !      4!62!     0.000!   -64.000!      0.000!
---------------------------------------------------------------------------
*BOHRKOPF !          !X-POS     !Y-POS     !           ! 
----------+----------+----------+----------+-----------+-------------------
[V11]
O11       !          !     0.000!   -96.000!           !
O11       !          !     0.000!  -128.000!           !  """

Add the parameter header=None so that the columns get default names:

from io import StringIO
import pandas as pd

# after testing, replace 'StringIO(temp)' with 'filename.csv'
# (pd.compat.StringIO was removed from newer pandas versions, so io.StringIO is used)
df = pd.read_csv(StringIO(temp), sep="!", skiprows=3, skipinitialspace=True, header=None)
df['TYPE'] = df[0].str.extract(r'\[(.*)\]', expand=False).ffill()
df['G'] = df[0].str.startswith('*').cumsum()
# do not remove the rows starting with *, they hold the header of each table
df = df[~df[0].str.contains(r'^\[|^--')]

print (df)
             0        1           2           3           4            5  \
0   *BOHRKOPF   SPINDEL          WK  DELTA-X     DELTA-Y     DURCHMESSER   
3   A21               1          62       0.000       0.000        0.000   
5   A12              -1          62       0.000    -160.000        0.000   
6   A12               2          62       0.000    -128.000        3.000   
7   A12              -3          62       0.000     -96.000        0.000   
8   A12               4          62       0.000     -64.000        0.000   
10  *BOHRKOPF       NaN  X-POS       Y-POS              NaN          NaN   
13  O11             NaN       0.000     -96.000         NaN          NaN   
14  O11             NaN       0.000    -128.000         NaN          NaN   

            6           TYPE  G  
0   KOMMENTAR            NaN  1  
3         NaN  NoValidForUse  1  
5         NaN            V11  1  
6        70.0            V11  1  
7         NaN            V11  1  
8         NaN            V11  1  
10        NaN            V11  2  
13        NaN            V11  2  
14        NaN            V11  2  

For each group in the loop, drop column G, rename all columns except the last 2 using the first row, remove that first row with iloc and finally, if necessary, remove the columns that contain only NaNs with dropna:

dfs = [v.drop('G', axis=1).rename(columns=v.iloc[0, :-2]).iloc[1:].dropna(axis=1, how='all') for k, v in df.groupby('G')]
print (dfs[0])
   *BOHRKOPF  SPINDEL  WK DELTA-X    DELTA-Y    DURCHMESSER KOMMENTAR  \
3  A21              1  62      0.000      0.000       0.000       NaN   
5  A12             -1  62      0.000   -160.000       0.000       NaN   
6  A12              2  62      0.000   -128.000       3.000      70.0   
7  A12             -3  62      0.000    -96.000       0.000       NaN   
8  A12              4  62      0.000    -64.000       0.000       NaN   

            TYPE  
3  NoValidForUse  
5            V11  
6            V11  
7            V11  
8            V11 

print (dfs[1])
    *BOHRKOPF  X-POS      Y-POS      TYPE
13  O11             0.000    -96.000  V11
14  O11             0.000   -128.000  V11
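
When the number of tables varies and their headers differ, another option is to split the text into blocks in plain Python first and build one DataFrame per block. This is a different technique from the read_csv approach above, shown only as a minimal sketch: read_tables is a hypothetical helper name, and it assumes that header lines start with "*", separator lines start with "--" and type markers are lines of the form [NAME]. All values are kept as strings.

```python
import pandas as pd

sample = """_MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21       !      1!62!     0.000!     0.000!      0.000!
[V11]
A12       !     -1!62!     0.000!  -160.000!      0.000!
A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
A12       !     -3!62!     0.000!   -96.000!      0.000!
A12       !      4!62!     0.000!   -64.000!      0.000!
---------------------------------------------------------------------------
*BOHRKOPF !          !X-POS     !Y-POS     !           ! 
----------+----------+----------+----------+-----------+-------------------
[V11]
O11       !          !     0.000!   -96.000!           !
O11       !          !     0.000!  -128.000!           !  """


def read_tables(text, sep="!"):
    """Hypothetical helper: return one DataFrame per header line ('*...')."""
    blocks = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('*'):
            blocks.append([line])        # a header line opens a new table
        elif blocks and line and not line.startswith('--'):
            blocks[-1].append(line)      # marker or data row of the open table

    tables = []
    for header_line, *body in blocks:
        header = [name.strip() for name in header_line.split(sep)]
        keep = [i for i, name in enumerate(header) if name]  # skip unnamed columns
        rows, current_type = [], None
        for line in body:
            if line.startswith('['):
                current_type = line.strip('[]')   # remember the latest [TYPE] marker
                continue
            cells = [c.strip() or None for c in line.split(sep)]
            cells += [None] * (len(header) - len(cells))     # pad short rows
            rows.append([cells[i] for i in keep] + [current_type])
        columns = [header[i] for i in keep] + ['TYPE']
        tables.append(pd.DataFrame(rows, columns=columns))
    return tables


tables = read_tables(sample)
print(tables[1])
```

Lines before the first header (the machine metadata) are ignored, and columns whose header cell is empty are skipped, so the second table keeps only *BOHRKOPF, X-POS, Y-POS and TYPE. If numbers are needed, pd.to_numeric can be applied to the relevant columns afterwards.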
Author by

Arnoldas Bankauskas

Updated on June 04, 2022

Comments

  • Arnoldas Bankauskas
    Arnoldas Bankauskas almost 2 years

    I have a specific file format with data from a CNC machine (work center), saved as .txt. I want to read this table into a pandas DataFrame, but I have never seen this format before.

    _MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
    _PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
    ---------------------------------------------------------------------------
    *BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
    ----------+----------+----------+----------+-----------+-------------------
    [NoValidForUse]
    A21       !      1!62!     0.000!     0.000!      0.000!
    [V11]
    A12       !     -1!62!     0.000!  -160.000!      0.000!
    A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
    A12       !     -3!62!     0.000!   -96.000!      0.000!
    A12       !      4!62!     0.000!   -64.000!      0.000!
    ---------------------------------------------------------------------------
    *BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
    ----------+----------+----------+----------+-----------+-------------------
    [V11]
    O11       !     -9!62!     0.000!   -96.000!      0.000!
    O11       !     10!62!     0.000!  -128.000!      5.000!  70.0
    

    Questions: 1. Is it possible to read this and convert it to a pandas DataFrame? 2. How can this be done?

    • Why a pandas DataFrame? I want to use this data for some analysis of the item's characteristics, and for analysis I always use pandas. Maybe I need to do this a different way?

    Expected output:

    Two pandas DataFrames. The first:

    ---------------------------------------------------------------------------------------
    *BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR ! TYPE
    ----------+----------+----------+----------+-----------+-------------------------------
    A21       !      1!62!     0.000!     0.000!      0.000!           !NoValidForUse
    A12       !     -1!62!     0.000!  -160.000!      0.000!           !V11
    A12       !      2!62!     0.000!  -128.000!      3.000!  70.0     !V11
    A12       !     -3!62!     0.000!   -96.000!      0.000!           !V11
    A12       !      4!62!     0.000!   -64.000!      0.000!           !V11
    

    And the second:

    ---------------------------------------------------------------------------------------
    *BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR ! TYPE
    ----------+----------+----------+----------+-----------+-------------------------------
    O11       !     -9!62!     0.000!   -96.000!      0.000!           !V11
    O11       !     10!62!     0.000!  -128.000!      5.000!  70.0     !V11
    

    The headers of DataFrame 1 and DataFrame 2 can be different:

    _MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
    _PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
    ---------------------------------------------------------------------------
    *BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
    ----------+----------+----------+----------+-----------+-------------------
    [NoValidForUse]
    A21       !      1!62!     0.000!     0.000!      0.000!
    [V11]
    A12       !     -1!62!     0.000!  -160.000!      0.000!
    A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
    A12       !     -3!62!     0.000!   -96.000!      0.000!
     ---------------------------------------------------------------------------
    *BOHRKOPF !          !X-POS     !Y-POS     !           ! 
    ----------+----------+----------+----------+-----------+-------------------
    [V11]
    O11       !          !     0.000!   -96.000!           !
    O11       !          !     0.000!  -128.000!           !  
    
    • A file can contain a different number of DataFrames, between 5 and 10, but the structure of the file is the same: the separator is "!" and header rows start with "*".
    • jezrael
      jezrael about 6 years
      What is expected output?
    • Arnoldas Bankauskas
      Arnoldas Bankauskas about 6 years
      I added more info to the post. :)
  • Arnoldas Bankauskas
    Arnoldas Bankauskas about 6 years
    This solution works if the headers are always the same as in line 4.
  • jezrael
    jezrael about 6 years
    @ArnoldasBankauskas - I see the edit. Are there only 2 DataFrames in each file, or more?
  • Arnoldas Bankauskas
    Arnoldas Bankauskas about 6 years
    A file can contain a different number of DataFrames, between 5 and 10, but the structure of the file is the same: the separator is "!" and header rows start with "*".
  • Arnoldas Bankauskas
    Arnoldas Bankauskas about 6 years
    Brilliant, now I understand how to deal with this.