How to compare two CSV files and get the difference?

python pandas csv

10,152

Solution 1

First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)

c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
                                              # of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)

Solution 2

You can use pandas to read in two files, join them and remove all duplicate rows:

import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a,b], axis=0)
ab.drop_duplicates(keep=False)

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

10,152

Author by

vishal

Hi, I am a Data science enthusiastic, enjoying the world of Data Science and exploring deep into it. From the initial days of my masters degree (M.Tech CSE with specialization in Big Data Analytics) from VIT University I have started my career in the field of Data Science and Big Data Analytics. And by participating in various Kaggle competitions. I have started the never ending and exciting learning process in Data science domain specially in Machine Learning & Deep Learning. I'm here as a lifelong learner , I thank all the members the Stackoverflow community for the great help that comes to me and I try in my turn to give my little help as I can.

Updated on June 13, 2022

Comments

vishal almost 2 years

I have two CSV files,

a1.csv

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/

a2.csv

city,state,link
Aguila,Arizona,http://www.co.apache.az.us

I want to get the difference.

Here is my attempt:

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')

mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print c

Expected Output:

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf

But I am getting an error:

Empty DataFrame
Columns: [city, state, link]
Index: []**

I want to check based on the first two rows, then if they are the same, remove it off.

Recents

Why Is PNG file with Drop Shadow in Flutter Web App Grainy?

How to troubleshoot crashes detected by Google Play Store for Flutter app

Cupertino DateTime picker interfering with scroll behaviour

Why does awk -F work for most letters, but not for the letter "t"?

Flutter change focus color and icon color but not works

How to print and connect to printer using flutter desktop via usb?

Critical issues have been reported with the following SDK versions: com.google.android.gms:play-services-safetynet:17.0.0

Flutter Dart - get localized country name from country code

navigatorState is null when using pushNamed Navigation onGenerateRoutes of GetMaterialPage

Android Sdk manager not found- Flutter doctor error

Flutter Laravel Push Notification without using any third party like(firebase,onesignal..etc)

How to change the color of ElevatedButton when entering text in TextField

pandas read ASCII formatted table

How to drop the index column while writing the DataFrame in a .csv file in Pandas?

Print OLS regression summary to text file

How to quickly get the last line from a .csv file over a network drive?

Engines in Python Pandas read_csv

Handling error "TypeError: Expected tuple, got str" loading a CSV to pandas multilevel and multiindex (pandas)

Pandas read_csv does not load a comma separated CSV properly

Create empty csv file with pandas

How to convert string labels to numeric values

When saving a pandas dataframe to csv how do I retain the columns.name?