Pandas: Concatenate files but skip the headers except the first file

13,746

Solution 1

I think you need numpy.concatenate with DataFrame constructor:

df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)

Another solution is replace columns names in df2 and df3:

df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)

Samples:

np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(2,3)), columns=list('ABF'))
print (df1)
   A  B  F
0  8  8  3
1  7  7  0
df2 = pd.DataFrame(np.random.randint(10, size=(1,3)), columns=list('ERT'))
print (df2)
   E  R  T
0  4  2  5
df3 = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list('HTR'))
print (df3)
   H  T  R
0  2  2  2
1  1  0  8
2  4  0  9

print (np.concatenate([df1.values, df2.values, df3.values]))
[[8 8 3]
 [7 7 0]
 [4 2 5]
 [2 2 2]
 [1 0 8]
 [4 0 9]]
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
print (df)
   A  B  F
0  8  8  3
1  7  7  0
2  4  2  5
3  2  2  2
4  1  0  8
5  4  0  9

df = pd.concat([df1,df2,df3], ignore_index=True)
print (df)
   A  B  F
0  8  8  3
1  7  7  0
2  4  2  5
3  2  2  2
4  1  0  8
5  4  0  9

Solution 2

You have to use argument skip_rows of read_csv for second and third lines like here:

import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2', skiprows=1)
df3 = pandas.read_csv('path3', skiprows=1)
df = pandas.concat([df1,df2,df3])

Solution 3

Been working on this recently myself, here's the most compact/elegant thing I came up with:

import pandas as pd
frame_list=[df1, df2, df3]
frame_mod=[frame_list[i].iloc[0:] for i in range(0,len(frame_list))]
frame_frame=pd.concat(frame_mod)
Share:
13,746

Related videos on Youtube

MCG Code
Author by

MCG Code

Updated on July 28, 2022

Comments

  • MCG Code
    MCG Code 10 months

    I have 3 files representing the same dataset split in 3 and I need to concatenate:

    import pandas
    df1 = pandas.read_csv('path1')
    df2 = pandas.read_csv('path2')
    df3 = pandas.read_csv('path3')
    df = pandas.concat([df1,df2,df3])
    

    But this will keep the headers in the middle of the dataset, I need to remove the headers (column names) from the 2nd and 3rd file. How do I do that?

    • MCG Code
      MCG Code almost 6 years
      You're right, I check the skipping of the line but not the concatenation. Definitely the skiprows code is not the right one, the dataset should have 23 columns it has almost 3 times that.
  • MCG Code
    MCG Code almost 6 years
    I agree with Jezrael, the concatenation duplicates the columns, as many times as the files. I was a bit too fast, I was happy to see the first line disappear, but didn't check on the right that column numbers became huge
  • MCG Code
    MCG Code almost 6 years
    Do you have any idea why is it required to add df2.columns = df1.columns if the files already have identical headers?
  • jezrael
    jezrael almost 6 years
    If columns are identical, then your solution should works perfectly - concat align data by columns.
  • MCG Code
    MCG Code almost 6 years
    Your code works perfectly. I'm just wandering why pandas insists on me making the df2.columns = df1.columns before using ignore_index=True
  • jezrael
    jezrael almost 6 years
    I think your columns names are different, so need my solution. But if columns names are same, then need only df = pd.concat([df1,df2,df3], ignore_index=True).
  • MCG Code
    MCG Code almost 6 years
    The columns are identical I check it with all(df2.columns == df1.columns) and is returns True. But when I run the line df = pd.concat([df1,df2,df3], ignore_index=True) it just duplicates the columns, only when I use your full code (incl the replacement of columns) that it works
  • jezrael
    jezrael almost 6 years
    I think your solumns has to be different, maybe something like 0 as number and 0 like string. I have same issue with all some times before and very long time looking for problem. Because if duplicate columns it seems cannot allign so all duplicates column names are different some way.
  • jezrael
    jezrael almost 6 years
    maybe help check print (df.columns.tolist())

Related