How to remove duplicate observations in Stata


It is no surprise that duplicates does not do what you want, because it does not fit your problem. For example, the observation with id == 2, disease == 0 is not a duplicate of any other observation. More generally, duplicates does not purport to be a general-purpose command for dropping observations you don't want.

Your criteria appear to be

  1. Keep one observation for each id.

  2. If an id has any observation with disease equal to 1, that observation is the one to be kept.

A solution to that is

bysort id (disease) : keep if _n == _N 

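That keeps the last observation for each distinct id: after sorting within id on disease, observations with disease == 1 are necessarily at the end of each group. This assumes disease is never missing, because Stata sorts missing values after all nonmissing values.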
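As a check, here is a minimal sketch that rebuilds the example data from the question and applies the command above. The input block simply retypes the eight observations shown in the question; the isid line is an added sanity check, not part of the original answer.

```stata
* Rebuild the example data from the question
clear
input id disease
1 0
1 1
1 0
2 0
2 1
3 0
4 0
4 0
end

* Sort on disease within id, then keep the last observation of each id
bysort id (disease) : keep if _n == _N

* Added check: exits with an error if id does not uniquely identify observations
isid id

list, noobs
```

The listing should show one observation per id, with disease equal to 1 for ids 1 and 2 and 0 for ids 3 and 4.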

Author by

statuser

Updated on June 04, 2022

Comments

  • statuser
    statuser almost 2 years

    Let's say I have the following data:

    id  disease
    1   0
    1   1
    1   0
    2   0
    2   1
    3   0
    4   0
    4   0
    

    I would like to remove the duplicate observations in Stata so that one observation remains per id. For example, the desired result is:

    id   disease
    1      1
    2      1
    3      0
    4      0
    

    For group id=1, keep observation 2

    For group id=2, keep observation 2

    For group id=3, keep observation 1 (because it has only 1 obs)

    For group id=4, keep observation 1 (or any of them but one obs)

    I am trying Stata's duplicates command:

    duplicates tag id if disease==0, generate(info)
    drop if info==1
    

    but it does not give the result I need.