SAS distinct in proc sql vs proc sort nodupkey

Solution 1

PROC SORT with NODUPKEY always keeps the physically first record in each BY group; with the data as you list it, c=71 will always be kept. PROC SQL will not necessarily return any particular record. You could ask for the min or max, but you cannot guarantee the first record in the original order however you write the query, because SQL will often re-sort the data as needed to run the query as efficiently as possible.

The two approaches are identical insofar as they both return the same number of records, if that is your concern.
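If you want to see which records PROC SORT discards, one option is DUPOUT=, which writes the dropped observations to a separate dataset. The following is a minimal sketch: the dataset names work.kept and work.dropped are made up for illustration, and it assumes PROC SORT's default EQUALS behavior, which preserves the input order within each BY group.

```sas
/* Sample data from the question */
data work.dataset;
  input a b c;
datalines;
27 93 71
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;

/* Keep the first observation per (a, b); write the rest to work.dropped.
   work.kept and work.dropped are illustrative names. */
proc sort data=work.dataset out=work.kept dupout=work.dropped nodupkey;
  by a b;
run;

/* Inspect the observations that were discarded */
proc print data=work.dropped;
run;
```

With this data, work.kept should hold six observations (one per distinct a, b pair), and work.dropped should hold the three later duplicates: 27 93 72, 46 68 68 and the second 34 34 32.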

You cannot do exactly the same thing in a straightforward way in SQL. Because SQL has no concept of row ordering, you would either need a rule for choosing which c to keep (max(c), min(c), etc.) or need to add a row counter and pick the lowest value of that.

For example:

data work.dataset;
  input a b c;
  rowcounter = _n_;  /* original row position, used to identify the first record per key */
datalines;
27 93 71 
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;

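proc sql;
  /* rowcounter*100 + c is smallest on the first row of each (a, b) group,
     provided 0 <= c < 100; subtracting min(rowcounter*100) leaves that row's c */
  select a, b, min(rowcounter*100 + c) - min(rowcounter*100) as c
  from work.dataset
  group by a, b;
quit;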

This relies on a cheat: it assumes that rowcounter*100 always dominates c, which holds here because every c is between 0 and 99. If your c values don't fit that pattern, this won't work, and you're better off merging c back on separately.
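The "merge it on separately" route could look something like the sketch below. It assumes the work.dataset with rowcounter created above; the output table name work.first_per_key and the column alias first_row are made up for illustration.

```sas
proc sql;
  create table work.first_per_key as
  select d.a, d.b, d.c
  from work.dataset as d
       inner join
       /* lowest rowcounter, i.e. the first original row, for each (a, b) */
       (select a, b, min(rowcounter) as first_row
        from work.dataset
        group by a, b) as f
       on d.rowcounter = f.first_row
  order by d.a, d.b;
quit;
```

Because the join is on the row counter itself rather than on arithmetic involving c, it does not depend on the range of c.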

If you are interested in the SQL solution, you may consider posting that explicitly as a separate question as the SQL-only folk will then answer it.

Solution 2

NODUPKEY will return one observation for each key. In your example only one of the two observations with a=27 and b=93 will be kept. Either c=71 or c=72 will be lost.

The NODUPREC option will remove duplicate records. Both observations with a=27 and b=93 will be kept, but only one of the two with the values a=34, b=34 and c=32.
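As a rough illustration of the second case, the sketch below removes only fully identical observations. It assumes the question's original work.dataset, which contains just a, b and c (a rowcounter column would make every record unique), and the output name work.output3 is made up here. NODUPRECS compares each observation only with the previous one written to the output, so sorting BY _ALL_ is a common way to make sure identical records end up next to each other.

```sas
/* Drop observations that are identical in every variable.
   Assumes work.dataset has only a, b, c; work.output3 is an illustrative name. */
proc sort data=work.dataset out=work.output3 noduprecs;
  by _all_;
run;
```

With the question's data this should leave eight observations: both a=27, b=93 records survive, and only the repeated 34 34 32 is removed.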

Author: user2280549

Updated on December 10, 2020

Comments

  • user2280549
    user2280549 over 3 years

    I have the following dataset:

    data work.dataset;
    input a b c;
    datalines;
    27 93 71 
    27 93 72
    46 68 75
    55 55 33
    46 68 68
    34 34 32
    45 67 88
    56 75 22
    34 34 32
    ;
    run;
    

    I want to select all distinct records from the first two columns, so I wrote:

    proc sql;
    create table work.output1 as
    select distinct t1.a,
    t1.b
    from work.dataset t1;
    quit;
    

    But now I want to know which value of c from the original dataset goes with each (a, b) combination in the output. Is there a way to find out? I tried the following PROC SORT, but I don't know whether it works the same way as selecting distinct records in PROC SQL.

    proc sort data = work.dataset out = work.output2 NODUPKEY;
    by a b;
    run;
    

    Thanks in advance for your help.

  • user2280549
    user2280549 over 10 years
    I knew that, but my question is whether I can find out which one (c=71 or c=72) was dropped. I assume that SAS keeps the record that is "higher" in the dataset, i.e. earlier (in this particular example, 27 93 71 would be kept), but I need someone to confirm or refute this.
  • Laurent de Walick
    Laurent de Walick over 10 years
    In the SQL select distinct query, no record is really dropped, because variable c is not part of the select. In the PROC SORT NODUPKEY example, SAS keeps only the first observation it encounters.
  • user2280549
    user2280549 over 10 years
    Thank you for the answers. One more question: is it possible in PROC SQL to select n columns but apply the distinct to only a subset of them? In other words, I would add row_number to my data and select a, b and row_number, but the distinct would apply only to the first two columns.