SAS distinct in proc sql vs proc sort nodupkey
Solution 1
PROC SORT with NODUPKEY will always return the physical first record for each key; i.e., as you list the data, c=71 will always be kept. PROC SQL will not necessarily return any particular record; you could ask for min or max, but you could not guarantee the first record in sort order regardless of how you wrote the query; SQL will often re-sort the data as needed to accomplish the query as efficiently as possible.
They will be identical insomuch as they both return the same number of records, if that is your concern.
You cannot accomplish exactly the same thing in a straightforward manner in SQL; because SQL doesn't have a concept of row ordering, you would either need a rule for choosing which c to keep (max(c), min(c), etc.) or you would have to add a row counter and choose the lowest value of that.
For example:
data work.dataset;
input a b c;
rowcounter=_n_;
datalines;
27 93 71
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;
proc sql;
select a,b,min(rowcounter*100+c)-min(rowcounter*100) as c
from work.dataset
group by a,b;
quit;
That's using a cheat (knowing that rowcounter*100 will always dominate the size of c, which holds here because every c is between 0 and 99); of course, if your c doesn't have values appropriate for that, this won't work and you're better off merging it on separately.
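If c does not fit that trick, one alternative is to filter on the row counter itself. The sketch below assumes the rowcounter column created in the DATA step above, and it relies on PROC SQL remerging the group minimum back onto each row, which is SAS behaviour rather than standard SQL. The output name work.first_rows is made up for illustration.

```sas
/* Sketch: keep the first-listed row of each (a, b) group.          */
/* Assumes work.dataset already has rowcounter = _n_ (see above).   */
/* work.first_rows is a hypothetical output name.                   */
proc sql;
  create table work.first_rows as
  select a, b, c
  from work.dataset
  group by a, b
  /* min(rowcounter) is computed per group and remerged onto each row */
  having rowcounter = min(rowcounter)
  order by a, b;
quit;
```

For the a=27, b=93 group this keeps c=71, the row with the lowest rowcounter. SAS normally writes a note to the log saying the query required remerging summary statistics; that is expected here.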
If you are interested in the SQL solution, you may consider posting that explicitly as a separate question as the SQL-only folk will then answer it.
Solution 2
NODUPKEY will return one observation for each key. In your example only one of the two observations with a=27 and b=93 will be kept. Either c=71 or c=72 will be lost.
The NODUPREC option will remove duplicate records. Both observations with a=27 and b=93 will be kept, but only one of the two with the values a=34, b=34 and c=32.
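To see the difference on the question's data, and to check which rows NODUPKEY discards, something like the following could be run. It is only a sketch: the output names work.by_key, work.dropped and work.by_record are made up, and it uses the DUPOUT= option of PROC SORT to capture the removed observations.

```sas
/* The nine rows from the question, without a row counter,          */
/* so that the two 34 34 32 rows remain exact duplicates.           */
data work.dataset;
  input a b c;
  datalines;
27 93 71
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;

/* One row per (a, b); removed rows go to the DUPOUT= data set.     */
/* work.by_key and work.dropped are hypothetical names.             */
proc sort data=work.dataset out=work.by_key dupout=work.dropped nodupkey;
  by a b;
run;

/* Only rows that match in every variable are collapsed.            */
/* work.by_record is a hypothetical name.                           */
proc sort data=work.dataset out=work.by_record noduprecs;
  by a b;
run;
```

With this data, work.by_key should contain 6 rows and work.by_record 8. Printing work.dropped shows which observations NODUPKEY removed; under the default EQUALS behaviour those should be the later-listed rows of each duplicated key.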
user2280549
Updated on December 10, 2020
Comments
-
user2280549 over 3 years
I have the following dataset:
data work.dataset;
input a b c;
datalines;
27 93 71
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;
I want to select all distinct combinations of the first two columns, so I wrote:
proc sql;
create table work.output1 as
select distinct t1.a, t1.b
from work.dataset t1;
quit;
But now I want to know which value of c in the original dataset goes with each (a, b) combination seen in the output. Is there a way to find out? I tried the following PROC SORT, but I don't know whether it works the same way as selecting distinct records in PROC SQL.
proc sort data = work.dataset out = work.output2 NODUPKEY;
by a b;
run;
Thanks in advance for your help.
-
user2280549 over 10 years
I knew that, but the question is whether I can find out which one (c=71 or c=72) was dropped. I assume that SAS keeps the record that is "higher" in the dataset (in this particular example, 27 93 71 will be kept), but I need someone to either confirm or deny this.
-
Laurent de Walick over 10 years
In the SQL select distinct query, no record is really dropped, as variable c is not part of the select. In the PROC SORT NODUPKEY example, SAS only keeps the first observation it encounters.
-
user2280549 over 10 years
Thank you for the answers. One more question: is it possible to have PROC SQL take n columns and apply "distinct" to only a subset of them? In other words, I would add row_number to my data and select a, b and row_number, but the distinct would apply only to the first two columns.