merging data and receiving a big loss of data
Can use the all
argument to merge
.
merge(zipcode, tel1, by.x='zip', by.y='zip_code', all.y=TRUE)
However, for rows where matches aren't found in the zipcode data, there will be NA
s. Thus if you then na.rm
or something to that effect, you will wind up with the same "data loss"
Check the zip codes for the rows where there are NAs in the lat and long columns after the merge:
tel1[is.na(tel1$latitude) | is.na(tel1$longitude),]
My guess is they aren't valid zip codes or the list of zipcodes you have is not complete.
Related videos on Youtube
user1741021
Updated on September 15, 2022Comments
-
user1741021 over 1 year
I've been preparing my data and somehow I have way less data after merging my data sets.
Since I don't have the longitude and latitude in my data I've been using the following code after I downloaded the package zipcode (
tel1
is my data containing zipcodes)merge <- merge(zipcode,tel1,by.x=c('zip'),by.y=c('zip_code'))
Before merging I had 195956 observations, while after merging it dropped down to 180090, but I don't understand why.
In my opinion I just merged them where
zip
was equal tozip_code
and I added the information from the dataset zipcode to my folder tel1Afterward I wanted to remove the rows that contain NA because the merge couldn't define any numbers or whatever. I used this code
final <- result[complete.cases(result),]
Then my number of observations dropped down to 51006 which I just can't believe. There can't be so many mismatches in my data.
Is there any other code that I should use?
Afterwards I've been trying to delete the duplicates with the code
last <- with(final,final[order(state,latitude,longitude),])
but the number of observations was consistent (51006).
What did I do wrong or is there a way to get my data into an excel file again after merging the data so I could manually check if there are really so many mismatches? Thanks
-
Brandon Bertelsen over 11 yearsTry using
join
from the plyr package, it's much more intuitive than merge. -
Justin over 11 yearsyou can also look at the
all
argument tomerge
.merge(zipcode, tel1, by.x='zip', by.y='zip_code', all.y=TRUE)
-