Adding observations with specific values for variable

18,626

Solution 1

My suggestions hinge on using expand, which in turn just requires information on the number of observations to be added. I ignore your variable Lack, as Stata itself can work out where the gaps are. My procedure for imputing total_workers is based on using the inbuilt command ipolate and thus would work over gaps longer than 1 year, which don't appear in your example. The number of workers so estimated is not necessarily an integer.

For other interpolation procedures, check out cipolate, csipolate, pchipolate, all accessible via ssc desc cipolate (or equivalent).

This kind of operation depends on getting sort order exactly right, which I don't think is trivial, even with experience, so in getting the code right for similar problems, be prepared for false starts; pepper your trial code with list statements; and work on a good toy example dataset (as you kindly provided here).

. clear 

. input   firm_id year    dyrstr  total_workers

        firm_id       year     dyrstr  total_w~s
  1.     2432        2002    1980    29
  2.     2432        2003    1980    23
  3.     2432        2005    1980    283
  4.     2432        2006    1980    56
  5.     2432        2007    1980    21
  6.     2433        2004    2001    42
  7.     2433        2006    2001    29
  8.     2433        2008    2001    100
  9.     2434        2002    2002    21
 10.     2434        2003    2002    55
 11.     2434        2004    2002    22
 12.     2434        2005    2002    24
 13.     2434        2006    2002    17
 14.     2434        2007    2002    40
 15.     2434        2008    2002    110
 16.     2434        2009    2002    158
 17.     2434        2010    2002    38
 18.     2435        2002    2002    80
 19.     2435        2003    2002    86
 20.     2435        2004    2002    877
 21.     2435        2005    2002    254
 22.     2435        2006    2002    71
 23.     2435        2007    2002    116
 24.     2435        2008    2002    118
 25.     2435        2009    2002    1165
 26.     2435        2010    2002    67
 27.     2436        2002    1992    24
 28.     2436        2003    1992    25
 29.     2436        2004    1992    22
 30.     2436        2005    1992    23
 31.     2436        2006    1992    21
 32.     2436        2007    1992    100
 33.     2436        2008    1992    73
 34.     2436        2009    1992    23
 35.     2436        2010    1992    40
 36.     2437        2002    2002    30
 37.     2437        2003    2002    31
 38.     2437        2004    2002    21
 39.     2437        2006    2002    56
 40.     2437        2007    2002    20
 41. end 

 . scalar N = _N 

 . bysort firm_id (year) : gen gap = year - year[_n-1] 
 (6 missing values generated)

 . expand gap
 (6 missing counts ignored; observations not deleted)
 (4 observations created)

 . gen orig = _n <= scalar(N) 

 . bysort firm_id (year) : replace total_workers = . if !orig
 (4 real changes made, 4 to missing)

 . bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1 
 (4 real changes made)

 . bysort firm_id (year): ipolate total_workers year , gen(total_workers2)

 . list, sepby(firm_id) 

      +------------------------------------------------------------+
      | firm_id   year   dyrstr   total_~s   gap   orig   total_~2 |
      |------------------------------------------------------------|
   1. |    2432   2002     1980         29     .      1         29 |
   2. |    2432   2003     1980         23     1      1         23 |
   3. |    2432   2004     1980          .     2      0        153 |
   4. |    2432   2005     1980        283     2      1        283 |
   5. |    2432   2006     1980         56     1      1         56 |
   6. |    2432   2007     1980         21     1      1         21 |
      |------------------------------------------------------------|
   7. |    2433   2004     2001         42     .      1         42 |
   8. |    2433   2005     2001          .     2      0       35.5 |
   9. |    2433   2006     2001         29     2      1         29 |
  10. |    2433   2007     2001          .     2      0       64.5 |
  11. |    2433   2008     2001        100     2      1        100 |
      |------------------------------------------------------------|
  12. |    2434   2002     2002         21     .      1         21 |
  13. |    2434   2003     2002         55     1      1         55 |
  14. |    2434   2004     2002         22     1      1         22 |
  15. |    2434   2005     2002         24     1      1         24 |
  16. |    2434   2006     2002         17     1      1         17 |
  17. |    2434   2007     2002         40     1      1         40 |
  18. |    2434   2008     2002        110     1      1        110 |
  19. |    2434   2009     2002        158     1      1        158 |
  20. |    2434   2010     2002         38     1      1         38 |
      |------------------------------------------------------------|
  21. |    2435   2002     2002         80     .      1         80 |
  22. |    2435   2003     2002         86     1      1         86 |
  23. |    2435   2004     2002        877     1      1        877 |
  24. |    2435   2005     2002        254     1      1        254 |
  25. |    2435   2006     2002         71     1      1         71 |
  26. |    2435   2007     2002        116     1      1        116 |
  27. |    2435   2008     2002        118     1      1        118 |
  28. |    2435   2009     2002       1165     1      1       1165 |
  29. |    2435   2010     2002         67     1      1         67 |
      |------------------------------------------------------------|
  30. |    2436   2002     1992         24     .      1         24 |
  31. |    2436   2003     1992         25     1      1         25 |
  32. |    2436   2004     1992         22     1      1         22 |
  33. |    2436   2005     1992         23     1      1         23 |
  34. |    2436   2006     1992         21     1      1         21 |
  35. |    2436   2007     1992        100     1      1        100 |
  36. |    2436   2008     1992         73     1      1         73 |
  37. |    2436   2009     1992         23     1      1         23 |
  38. |    2436   2010     1992         40     1      1         40 |
      |------------------------------------------------------------|
  39. |    2437   2002     2002         30     .      1         30 |
  40. |    2437   2003     2002         31     1      1         31 |
  41. |    2437   2004     2002         21     1      1         21 |
  42. |    2437   2005     2002          .     2      0       38.5 |
  43. |    2437   2006     2002         56     2      1         56 |
  44. |    2437   2007     2002         20     1      1         20 |
      +------------------------------------------------------------+

Solution 2

You've already awarded the answer, but I have had to do similar before and always use the cross command as follows. Say I am using your dataset already & continue with the following code:

tempfile master year
save `master'
preserve
keep year
duplicates drop
save `year'

restore
//next two lines set me up to correct for different year ranges by firm; if year ranges were standard, this would be omitted
bys firm_id: egen minyear=min(year)
bys firm_id: egen maxyear=max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge m:1 firm_id year using `master', assert(1 3) nogen
drop if year<minyear | year>maxyear //this adjusts for years outside the earliest and latest years observed by firm; if year ranges standard, again omitted

Then from here, use the ipolate command in the spirit of @NickCox.

I'm particularly interested in any pros/cons regarding the use of expand and cross. (Beyond the fact that my use here specifically hinges on >0 records for each year being observed in order to construct the crossed dataset, which could be eliminated if I create the `year' tempfile differently.)

Share:
18,626
the_economist
Author by

the_economist

Updated on June 22, 2022

Comments

  • the_economist
    the_economist about 2 years

    First, have a look at some variables of my dataset:

    firm_id year    dyrstr  Lack    total_workers
    2432    2002    1980        29
    2432    2003    1980        23
    2432    2005    1980    1   283
    2432    2006    1980        56
    2432    2007    1980        21
    2433    2004    2001        42
    2433    2006    2001    1   29
    2433    2008    2001    1   100
    2434    2002    2002        21
    2434    2003    2002        55
    2434    2004    2002        22
    2434    2005    2002        24
    2434    2006    2002        17
    2434    2007    2002        40
    2434    2008    2002        110
    2434    2009    2002        158
    2434    2010    2002        38
    2435    2002    2002        80
    2435    2003    2002        86
    2435    2004    2002        877
    2435    2005    2002        254
    2435    2006    2002        71
    2435    2007    2002        116
    2435    2008    2002        118
    2435    2009    2002        1165
    2435    2010    2002        67
    2436    2002    1992        24
    2436    2003    1992        25
    2436    2004    1992        22
    2436    2005    1992        23
    2436    2006    1992        21
    2436    2007    1992        100
    2436    2008    1992        73
    2436    2009    1992        23
    2436    2010    1992        40
    2437    2002    2002        30
    2437    2003    2002        31
    2437    2004    2002        21
    2437    2006    2002    1   56
    2437    2007    2002        20
    

    The variables:

    1. firm_id is an identifier for firms
    2. year is the year of the observation
    3. dyrstr is the founding year of a firm
    4. Lack equals 1 if there is a missing observation in the year before (e.g. in line three of the dataset, Lack equals 1 because for the firm with ID 2432, there is no observation in the year 2004)
    5. total_workers is the number of workers

    I'd like to fill in the gaps, namely I'd like to create new observations as I show you in the following (only considering the firm with ID 2432):

    firm_id year    dyrstr  Lack    total_workers
    2432    2002    1980        29
    *2432*  *2004* *1980*      *156*
    2432    2003    1980        23
    2432    2005    1980    1   283
    2432    2006    1980        56
    2432    2007    1980        21
    

    The line where I've put the values of the variables in asterisks is the newly created observation. This observation should be a copy of the previous observation but with some modification.

    • firm_id should stay the same as in the line before
    • year should be the year from the previous line plus one
    • dyrstr should stay the same as in the line before
    • Lack: here it doesn't matter which value this variable has
    • total_workers equals 0.5*(value of the previous observation + value of consecutive observation)
    • all other variables of my dataset (which I didn't list here) should stay the same as in the line before

    I read something about the the command expand but help expand doesn't help me much. Hopefully one of you can help me!

    • SOConnell
      SOConnell over 10 years
      Should the range of year the same across firms (i.e., ideally, you would observe all firms for 10 years) or does the year range change by firm?
    • Nick Cox
      Nick Cox over 10 years
      Although the Stata users on SO seem happy to help with this kind of question, and I am too, please do note that by general Stack Overflow standards this is not a good question as it shows no attempt at you to provide code.
    • the_economist
      the_economist over 10 years
      The range of year differs from firm to firm.
  • Nick Cox
    Nick Cox over 10 years
    The pros and cons of expand and cross: An earlier version of my answer used fillin, which is related to cross, but then I had to do what you do, remove observations added at the beginning and end of the period which didn't belong. It comes down partly to style preference.