Pandas: Concatenate files but skip the headers except the first file

13,746

Solution 1

I think you need numpy.concatenate with DataFrame constructor:

df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)

Another solution is replace columns names in df2 and df3:

df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)

Samples:

np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(2,3)), columns=list('ABF'))
print (df1)
   A  B  F
0  8  8  3
1  7  7  0
df2 = pd.DataFrame(np.random.randint(10, size=(1,3)), columns=list('ERT'))
print (df2)
   E  R  T
0  4  2  5
df3 = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list('HTR'))
print (df3)
   H  T  R
0  2  2  2
1  1  0  8
2  4  0  9

print (np.concatenate([df1.values, df2.values, df3.values]))
[[8 8 3]
 [7 7 0]
 [4 2 5]
 [2 2 2]
 [1 0 8]
 [4 0 9]]
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
print (df)
   A  B  F
0  8  8  3
1  7  7  0
2  4  2  5
3  2  2  2
4  1  0  8
5  4  0  9

df = pd.concat([df1,df2,df3], ignore_index=True)
print (df)
   A  B  F
0  8  8  3
1  7  7  0
2  4  2  5
3  2  2  2
4  1  0  8
5  4  0  9

Solution 2

You have to use argument skip_rows of read_csv for second and third lines like here:

import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2', skiprows=1)
df3 = pandas.read_csv('path3', skiprows=1)
df = pandas.concat([df1,df2,df3])

Solution 3

Been working on this recently myself, here's the most compact/elegant thing I came up with:

import pandas as pd
frame_list=[df1, df2, df3]
frame_mod=[frame_list[i].iloc[0:] for i in range(0,len(frame_list))]
frame_frame=pd.concat(frame_mod)

13,746

MCG Code

Updated on July 28, 2022

Comments

MCG Code 10 months
I have 3 files representing the same dataset split in 3 and I need to concatenate:
```
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2')
df3 = pandas.read_csv('path3')
df = pandas.concat([df1,df2,df3])
```
But this will keep the headers in the middle of the dataset, I need to remove the headers (column names) from the 2nd and 3rd file. How do I do that?
- MCG Code almost 6 years
  
  You're right, I check the skipping of the line but not the concatenation. Definitely the skiprows code is not the right one, the dataset should have 23 columns it has almost 3 times that.
MCG Code almost 6 years

I agree with Jezrael, the concatenation duplicates the columns, as many times as the files. I was a bit too fast, I was happy to see the first line disappear, but didn't check on the right that column numbers became huge
MCG Code almost 6 years

Do you have any idea why is it required to add df2.columns = df1.columns if the files already have identical headers?
jezrael almost 6 years

If columns are identical, then your solution should works perfectly - concat align data by columns.
MCG Code almost 6 years

Your code works perfectly. I'm just wandering why pandas insists on me making the df2.columns = df1.columns before using ignore_index=True
jezrael almost 6 years

I think your columns names are different, so need my solution. But if columns names are same, then need only df = pd.concat([df1,df2,df3], ignore_index=True).
MCG Code almost 6 years

The columns are identical I check it with all(df2.columns == df1.columns) and is returns True. But when I run the line df = pd.concat([df1,df2,df3], ignore_index=True) it just duplicates the columns, only when I use your full code (incl the replacement of columns) that it works
jezrael almost 6 years

I think your solumns has to be different, maybe something like 0 as number and 0 like string. I have same issue with all some times before and very long time looking for problem. Because if duplicate columns it seems cannot allign so all duplicates column names are different some way.
jezrael almost 6 years

maybe help check print (df.columns.tolist())