Pandas: Concatenate files but skip the headers except the first file
Solution 1
I think you need numpy.concatenate with the DataFrame constructor, which stacks the raw values and ignores the column names of df2 and df3:
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
Another solution is to replace the column names in df2 and df3, so that concat aligns them with df1:
df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)
Samples:
import numpy as np
import pandas as pd
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(2,3)), columns=list('ABF'))
print (df1)
A B F
0 8 8 3
1 7 7 0
df2 = pd.DataFrame(np.random.randint(10, size=(1,3)), columns=list('ERT'))
print (df2)
E R T
0 4 2 5
df3 = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list('HTR'))
print (df3)
H T R
0 2 2 2
1 1 0 8
2 4 0 9
print (np.concatenate([df1.values, df2.values, df3.values]))
[[8 8 3]
[7 7 0]
[4 2 5]
[2 2 2]
[1 0 8]
[4 0 9]]
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
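One difference between the two approaches is worth checking when the columns hold mixed types. The sketch below uses small made-up frames (the data and the names via_numpy and via_concat are only for illustration): going through .values turns mixed columns into an object array, while renaming the columns and calling pd.concat keeps the per-column dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with one integer column and one text column each.
df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
df2 = pd.DataFrame({"C": [3], "D": ["z"]})

# Route 1: stack the raw values. Mixed columns become one object array,
# so every column of the result has object dtype.
via_numpy = pd.DataFrame(np.concatenate([df1.values, df2.values]), columns=df1.columns)

# Route 2: rename the columns, then concat. Integer column A stays integer.
df2.columns = df1.columns
via_concat = pd.concat([df1, df2], ignore_index=True)

print(via_numpy.dtypes)
print(via_concat.dtypes)
```

If all columns share one numeric dtype, as in the samples, both routes give the same result.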
Solution 2
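You have to use the skiprows argument of read_csv for the second and third files. Skipping the header line alone makes read_csv treat the first data row as the header, so also pass header=None and reuse the column names from the first file:
import pandas
df1 = pandas.read_csv('path1')
# Skip the header line and reuse df1's column names so concat can align the columns
df2 = pandas.read_csv('path2', skiprows=1, header=None, names=df1.columns)
df3 = pandas.read_csv('path3', skiprows=1, header=None, names=df1.columns)
df = pandas.concat([df1,df2,df3], ignore_index=True)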
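To try this without real files, here is a minimal sketch that reads in-memory CSV text instead of 'path1' to 'path3'. The sample contents and the helper name read_parts are hypothetical. It takes the header from the first source and, for the others, skips the header line and passes the same column names explicitly.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the three files.
parts = ["a,b\n1,2\n3,4\n", "a,b\n5,6\n", "a,b\n7,8\n9,10\n"]

def read_parts(sources):
    """Read the first source with its header; reuse those names for the rest."""
    first = pd.read_csv(sources[0])
    rest = [
        # Skip the header line and supply the column names from the first source.
        pd.read_csv(src, skiprows=1, header=None, names=list(first.columns))
        for src in sources[1:]
    ]
    return pd.concat([first, *rest], ignore_index=True)

df = read_parts([io.StringIO(text) for text in parts])
print(df)
```

read_csv accepts file paths as well as file-like objects, so the same helper should work with a list of paths.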
Solution 3
Been working on this recently myself, here's the most compact/elegant thing I came up with:
import pandas as pd
frame_list = [df1, df2, df3]
# Note: iloc[0:] selects every row, so each frame is passed through unchanged
frame_mod = [frame.iloc[0:] for frame in frame_list]
frame_frame = pd.concat(frame_mod)
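A quick check of what the slice in this snippet does, on a made-up one-column frame: iloc[0:] starts at the first row and therefore returns all rows, whereas iloc[1:] would drop the first row.

```python
import pandas as pd

# Hypothetical frame, only to illustrate the slicing.
df = pd.DataFrame({"A": [1, 2, 3]})

all_rows = df.iloc[0:]       # start at position 0: nothing is dropped
without_first = df.iloc[1:]  # start at position 1: the first row is dropped

print(all_rows.equals(df))
print(without_first["A"].tolist())
```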
Author by
MCG Code
Updated on July 28, 2022
Comments
- MCG Code (10 months): I have 3 files representing the same dataset split in 3 and I need to concatenate:
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2')
df3 = pandas.read_csv('path3')
df = pandas.concat([df1,df2,df3])
But this will keep the headers in the middle of the dataset, I need to remove the headers (column names) from the 2nd and 3rd file. How do I do that?
- MCG Code (almost 6 years): You're right, I checked that the line was skipped but not the concatenation. The skiprows code is definitely not the right one: the dataset should have 23 columns, but it has almost 3 times that.
- MCG Code (almost 6 years): I agree with Jezrael, the concatenation duplicates the columns as many times as there are files. I was a bit too fast: I was happy to see the first line disappear, but didn't check on the right that the number of columns had become huge.
- MCG Code (almost 6 years): Do you have any idea why it is required to add df2.columns = df1.columns if the files already have identical headers?
- jezrael (almost 6 years): If the columns are identical, then your solution should work perfectly, because concat aligns data by column name.
- MCG Code (almost 6 years): Your code works perfectly. I'm just wondering why pandas insists on me setting df2.columns = df1.columns before using ignore_index=True.
- jezrael (almost 6 years): I think your column names are different, so you need my solution. But if the column names are the same, then you only need df = pd.concat([df1,df2,df3], ignore_index=True).
- MCG Code (almost 6 years): The columns are identical. I checked it with all(df2.columns == df1.columns) and it returns True. But when I run the line df = pd.concat([df1,df2,df3], ignore_index=True) it just duplicates the columns. It only works when I use your full code (including the replacement of the columns).
- jezrael (almost 6 years): I think your columns have to be different, maybe something like 0 as a number and 0 as a string. I had the same issue with all some time ago and spent a very long time looking for the problem. If the columns are duplicated, it seems concat cannot align them, so the duplicated column names must differ in some way.
- jezrael (almost 6 years): Maybe it helps to check print (df.columns.tolist())
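The kind of mismatch suggested in the comments can be reproduced with a small sketch. The frames below are made up: one has a column named with the string "0", the other with the integer 0. The two labels look alike when printed, but concat treats them as different columns. Whether this is what happened in the asker's files is not known.

```python
import pandas as pd

# Hypothetical frames whose first column labels differ only in type.
df1 = pd.DataFrame({"0": [1], "x": [2]})  # label is the string "0"
df2 = pd.DataFrame({0: [3], "x": [4]})    # label is the integer 0

combined = pd.concat([df1, df2], ignore_index=True)
print(combined.columns.tolist())  # lists both labels, so 3 columns in total

# Compare labels pairwise, including their types, to find the culprit.
mismatched = [
    (a, b)
    for a, b in zip(df1.columns, df2.columns)
    if a != b or type(a) is not type(b)
]
print(mismatched)

# Overwriting the labels makes the frames align again.
df2.columns = df1.columns
fixed = pd.concat([df1, df2], ignore_index=True)
print(fixed.shape)
```

Printing the labels with repr, for example [repr(c) for c in df.columns], also exposes differences such as stray whitespace.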