merging data and receiving a big loss of data

11,969

Can use the all argument to merge.

merge(zipcode, tel1, by.x='zip', by.y='zip_code', all.y=TRUE)

However, for rows where matches aren't found in the zipcode data, there will be NAs. Thus if you then na.rm or something to that effect, you will wind up with the same "data loss"

Check the zip codes for the rows where there are NAs in the lat and long columns after the merge:

tel1[is.na(tel1$latitude) | is.na(tel1$longitude),]

My guess is they aren't valid zip codes or the list of zipcodes you have is not complete.

Share:
11,969

Related videos on Youtube

user1741021
Author by

user1741021

Updated on September 15, 2022

Comments

  • user1741021
    user1741021 over 1 year

    I've been preparing my data and somehow I have way less data after merging my data sets.

    Since I don't have the longitude and latitude in my data I've been using the following code after I downloaded the package zipcode (tel1 is my data containing zipcodes)

    merge <- merge(zipcode,tel1,by.x=c('zip'),by.y=c('zip_code'))
    

    Before merging I had 195956 observations, while after merging it dropped down to 180090, but I don't understand why.

    In my opinion I just merged them where zip was equal to zip_code and I added the information from the dataset zipcode to my folder tel1

    Afterward I wanted to remove the rows that contain NA because the merge couldn't define any numbers or whatever. I used this code

    final <- result[complete.cases(result),]
    

    Then my number of observations dropped down to 51006 which I just can't believe. There can't be so many mismatches in my data.

    Is there any other code that I should use?

    Afterwards I've been trying to delete the duplicates with the code

     last <- with(final,final[order(state,latitude,longitude),])
    

    but the number of observations was consistent (51006).

    What did I do wrong or is there a way to get my data into an excel file again after merging the data so I could manually check if there are really so many mismatches? Thanks

    • Brandon Bertelsen
      Brandon Bertelsen over 11 years
      Try using join from the plyr package, it's much more intuitive than merge.
    • Justin
      Justin over 11 years
      you can also look at the all argument to merge. merge(zipcode, tel1, by.x='zip', by.y='zip_code', all.y=TRUE)