How to compare two CSV files and get the difference?

10,152

Solution 1

First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)

c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
                                              # of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)

Solution 2

You can use pandas to read in two files, join them and remove all duplicate rows:

import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a,b], axis=0)
ab.drop_duplicates(keep=False)

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Share:
10,152
vishal
Author by

vishal

Hi, I am a Data science enthusiastic, enjoying the world of Data Science and exploring deep into it. From the initial days of my masters degree (M.Tech CSE with specialization in Big Data Analytics) from VIT University I have started my career in the field of Data Science and Big Data Analytics. And by participating in various Kaggle competitions. I have started the never ending and exciting learning process in Data science domain specially in Machine Learning & Deep Learning. I'm here as a lifelong learner , I thank all the members the Stackoverflow community for the great help that comes to me and I try in my turn to give my little help as I can.

Updated on June 13, 2022

Comments

  • vishal
    vishal almost 2 years

    I have two CSV files,

    a1.csv

    city,state,link
    Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
    AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
    Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/
    

    a2.csv

    city,state,link
    Aguila,Arizona,http://www.co.apache.az.us
    

    I want to get the difference.

    Here is my attempt:

    import pandas as pd
    
    a = pd.read_csv('a1.csv')
    b = pd.read_csv('a2.csv')
    
    mask = a.isin(b.to_dict(orient='list'))
    # Reverse the mask and remove null rows.
    # Upside is that index of original rows that
    # are now gone are preserved (see result).
    c = a[~mask].dropna()
    print c
    

    Expected Output:

    city,state,link
    Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
    AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
    

    But I am getting an error:

    Empty DataFrame
    Columns: [city, state, link]
    Index: []**
    

    I want to check based on the first two rows, then if they are the same, remove it off.