Adding observations with specific values for variable
Solution 1
My suggestions hinge on using expand
, which in turn just requires information on the number of observations to be added. I ignore your variable Lack
, as Stata itself can work out where the gaps are. My procedure for imputing total_workers
is based on using the inbuilt command ipolate
and thus would work over gaps longer than 1 year, which don't appear in your example. The number of workers so estimated is not necessarily an integer.
For other interpolation procedures, check out cipolate
, csipolate
, pchipolate
, all accessible via ssc desc cipolate
(or equivalent).
This kind of operation depends on getting sort
order exactly right, which I don't think is trivial, even with experience, so in getting the code right for similar problems, be prepared for false starts; pepper your trial code with list
statements; and work on a good toy example dataset (as you kindly provided here).
. clear
. input firm_id year dyrstr total_workers
firm_id year dyrstr total_w~s
1. 2432 2002 1980 29
2. 2432 2003 1980 23
3. 2432 2005 1980 283
4. 2432 2006 1980 56
5. 2432 2007 1980 21
6. 2433 2004 2001 42
7. 2433 2006 2001 29
8. 2433 2008 2001 100
9. 2434 2002 2002 21
10. 2434 2003 2002 55
11. 2434 2004 2002 22
12. 2434 2005 2002 24
13. 2434 2006 2002 17
14. 2434 2007 2002 40
15. 2434 2008 2002 110
16. 2434 2009 2002 158
17. 2434 2010 2002 38
18. 2435 2002 2002 80
19. 2435 2003 2002 86
20. 2435 2004 2002 877
21. 2435 2005 2002 254
22. 2435 2006 2002 71
23. 2435 2007 2002 116
24. 2435 2008 2002 118
25. 2435 2009 2002 1165
26. 2435 2010 2002 67
27. 2436 2002 1992 24
28. 2436 2003 1992 25
29. 2436 2004 1992 22
30. 2436 2005 1992 23
31. 2436 2006 1992 21
32. 2436 2007 1992 100
33. 2436 2008 1992 73
34. 2436 2009 1992 23
35. 2436 2010 1992 40
36. 2437 2002 2002 30
37. 2437 2003 2002 31
38. 2437 2004 2002 21
39. 2437 2006 2002 56
40. 2437 2007 2002 20
41. end
. scalar N = _N
. bysort firm_id (year) : gen gap = year - year[_n-1]
(6 missing values generated)
. expand gap
(6 missing counts ignored; observations not deleted)
(4 observations created)
. gen orig = _n <= scalar(N)
. bysort firm_id (year) : replace total_workers = . if !orig
(4 real changes made, 4 to missing)
. bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1
(4 real changes made)
. bysort firm_id (year): ipolate total_workers year , gen(total_workers2)
. list, sepby(firm_id)
+------------------------------------------------------------+
| firm_id year dyrstr total_~s gap orig total_~2 |
|------------------------------------------------------------|
1. | 2432 2002 1980 29 . 1 29 |
2. | 2432 2003 1980 23 1 1 23 |
3. | 2432 2004 1980 . 2 0 153 |
4. | 2432 2005 1980 283 2 1 283 |
5. | 2432 2006 1980 56 1 1 56 |
6. | 2432 2007 1980 21 1 1 21 |
|------------------------------------------------------------|
7. | 2433 2004 2001 42 . 1 42 |
8. | 2433 2005 2001 . 2 0 35.5 |
9. | 2433 2006 2001 29 2 1 29 |
10. | 2433 2007 2001 . 2 0 64.5 |
11. | 2433 2008 2001 100 2 1 100 |
|------------------------------------------------------------|
12. | 2434 2002 2002 21 . 1 21 |
13. | 2434 2003 2002 55 1 1 55 |
14. | 2434 2004 2002 22 1 1 22 |
15. | 2434 2005 2002 24 1 1 24 |
16. | 2434 2006 2002 17 1 1 17 |
17. | 2434 2007 2002 40 1 1 40 |
18. | 2434 2008 2002 110 1 1 110 |
19. | 2434 2009 2002 158 1 1 158 |
20. | 2434 2010 2002 38 1 1 38 |
|------------------------------------------------------------|
21. | 2435 2002 2002 80 . 1 80 |
22. | 2435 2003 2002 86 1 1 86 |
23. | 2435 2004 2002 877 1 1 877 |
24. | 2435 2005 2002 254 1 1 254 |
25. | 2435 2006 2002 71 1 1 71 |
26. | 2435 2007 2002 116 1 1 116 |
27. | 2435 2008 2002 118 1 1 118 |
28. | 2435 2009 2002 1165 1 1 1165 |
29. | 2435 2010 2002 67 1 1 67 |
|------------------------------------------------------------|
30. | 2436 2002 1992 24 . 1 24 |
31. | 2436 2003 1992 25 1 1 25 |
32. | 2436 2004 1992 22 1 1 22 |
33. | 2436 2005 1992 23 1 1 23 |
34. | 2436 2006 1992 21 1 1 21 |
35. | 2436 2007 1992 100 1 1 100 |
36. | 2436 2008 1992 73 1 1 73 |
37. | 2436 2009 1992 23 1 1 23 |
38. | 2436 2010 1992 40 1 1 40 |
|------------------------------------------------------------|
39. | 2437 2002 2002 30 . 1 30 |
40. | 2437 2003 2002 31 1 1 31 |
41. | 2437 2004 2002 21 1 1 21 |
42. | 2437 2005 2002 . 2 0 38.5 |
43. | 2437 2006 2002 56 2 1 56 |
44. | 2437 2007 2002 20 1 1 20 |
+------------------------------------------------------------+
Solution 2
You've already awarded the answer, but I have had to do similar before and always use the cross
command as follows. Say I am using your dataset already & continue with the following code:
tempfile master year
save `master'
preserve
keep year
duplicates drop
save `year'
restore
//next two lines set me up to correct for different year ranges by firm; if year ranges were standard, this would be omitted
bys firm_id: egen minyear=min(year)
bys firm_id: egen maxyear=max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge m:1 firm_id year using `master', assert(1 3) nogen
drop if year<minyear | year>maxyear //this adjusts for years outside the earliest and latest years observed by firm; if year ranges standard, again omitted
Then from here, use the ipolate
command in the spirit of @NickCox.
I'm particularly interested in any pros/cons regarding the use of expand
and cross
. (Beyond the fact that my use here specifically hinges on >0 records for each year being observed in order to construct the crossed dataset, which could be eliminated if I create the `year'
tempfile differently.)
the_economist
Updated on June 22, 2022Comments
-
the_economist about 2 years
First, have a look at some variables of my dataset:
firm_id year dyrstr Lack total_workers 2432 2002 1980 29 2432 2003 1980 23 2432 2005 1980 1 283 2432 2006 1980 56 2432 2007 1980 21 2433 2004 2001 42 2433 2006 2001 1 29 2433 2008 2001 1 100 2434 2002 2002 21 2434 2003 2002 55 2434 2004 2002 22 2434 2005 2002 24 2434 2006 2002 17 2434 2007 2002 40 2434 2008 2002 110 2434 2009 2002 158 2434 2010 2002 38 2435 2002 2002 80 2435 2003 2002 86 2435 2004 2002 877 2435 2005 2002 254 2435 2006 2002 71 2435 2007 2002 116 2435 2008 2002 118 2435 2009 2002 1165 2435 2010 2002 67 2436 2002 1992 24 2436 2003 1992 25 2436 2004 1992 22 2436 2005 1992 23 2436 2006 1992 21 2436 2007 1992 100 2436 2008 1992 73 2436 2009 1992 23 2436 2010 1992 40 2437 2002 2002 30 2437 2003 2002 31 2437 2004 2002 21 2437 2006 2002 1 56 2437 2007 2002 20
The variables:
- firm_id is an identifier for firms
- year is the year of the observation
- dyrstr is the founding year of a firm
- Lack equals 1 if there is a missing observation in the year before (e.g. in line three of the dataset, Lack equals 1 because for the firm with ID 2432, there is no observation in the year 2004)
- total_workers is the number of workers
I'd like to fill in the gaps, namely I'd like to create new observations as I show you in the following (only considering the firm with ID 2432):
firm_id year dyrstr Lack total_workers 2432 2002 1980 29 *2432* *2004* *1980* *156* 2432 2003 1980 23 2432 2005 1980 1 283 2432 2006 1980 56 2432 2007 1980 21
The line where I've put the values of the variables in asterisks is the newly created observation. This observation should be a copy of the previous observation but with some modification.
- firm_id should stay the same as in the line before
- year should be the year from the previous line plus one
- dyrstr should stay the same as in the line before
- Lack: here it doesn't matter which value this variable has
- total_workers equals 0.5*(value of the previous observation + value of consecutive observation)
- all other variables of my dataset (which I didn't list here) should stay the same as in the line before
I read something about the the command
expand
buthelp expand
doesn't help me much. Hopefully one of you can help me!-
SOConnell over 10 yearsShould the range of year the same across firms (i.e., ideally, you would observe all firms for 10 years) or does the year range change by firm?
-
Nick Cox over 10 yearsAlthough the Stata users on SO seem happy to help with this kind of question, and I am too, please do note that by general Stack Overflow standards this is not a good question as it shows no attempt at you to provide code.
-
the_economist over 10 yearsThe range of year differs from firm to firm.
-
Nick Cox over 10 yearsThe pros and cons of
expand
andcross
: An earlier version of my answer usedfillin
, which is related tocross
, but then I had to do what you do, remove observations added at the beginning and end of the period which didn't belong. It comes down partly to style preference.