How to compare two CSV files and get the difference?
Solution 1
First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
# of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
Solution 2
You can use pandas
to read in two files, join them and remove all duplicate rows:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a,b], axis=0)
ab.drop_duplicates(keep=False)
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
vishal
Hi, I am a Data science enthusiastic, enjoying the world of Data Science and exploring deep into it. From the initial days of my masters degree (M.Tech CSE with specialization in Big Data Analytics) from VIT University I have started my career in the field of Data Science and Big Data Analytics. And by participating in various Kaggle competitions. I have started the never ending and exciting learning process in Data science domain specially in Machine Learning & Deep Learning. I'm here as a lifelong learner , I thank all the members the Stackoverflow community for the great help that comes to me and I try in my turn to give my little help as I can.
Updated on June 13, 2022Comments
-
vishal almost 2 years
I have two CSV files,
a1.csv
city,state,link Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/
a2.csv
city,state,link Aguila,Arizona,http://www.co.apache.az.us
I want to get the difference.
Here is my attempt:
import pandas as pd a = pd.read_csv('a1.csv') b = pd.read_csv('a2.csv') mask = a.isin(b.to_dict(orient='list')) # Reverse the mask and remove null rows. # Upside is that index of original rows that # are now gone are preserved (see result). c = a[~mask].dropna() print c
Expected Output:
city,state,link Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
But I am getting an error:
Empty DataFrame Columns: [city, state, link] Index: []**
I want to check based on the first two rows, then if they are the same, remove it off.