SQL Left Join first match only

sql sql-server tsql join greatest-n-per-group

174,230

Solution 1

Turns out I was doing it wrong, I needed to perform a nested select first of just the important columns, and do a distinct select off that to prevent trash columns of 'unique' data from corrupting my good data. The following appears to have resolved the issue... but I will try on the full dataset later.

SELECT DISTINCT P2.*
FROM (
  SELECT
      IDNo
    , FirstName
    , LastName
  FROM people P
) P2

Here is some play data as requested: http://sqlfiddle.com/#!3/050e0d/3

CREATE TABLE people
(
       [entry] int
     , [IDNo] varchar(3)
     , [FirstName] varchar(5)
     , [LastName] varchar(7)
);

INSERT INTO people
    (entry,[IDNo], [FirstName], [LastName])
VALUES
    (1,'uqx', 'bob', 'smith'),
    (2,'abc', 'john', 'willis'),
    (3,'ABC', 'john', 'willis'),
    (4,'aBc', 'john', 'willis'),
    (5,'WTF', 'jeff', 'bridges'),
    (6,'Sss', 'bill', 'doe'),
    (7,'sSs', 'bill', 'doe'),
    (8,'ssS', 'bill', 'doe'),
    (9,'ere', 'sally', 'abby'),
    (10,'wtf', 'jeff', 'bridges')
;

Solution 2

distinct is not a function. It always operates on all columns of the select list.

Your problem is a typical "greatest N per group" problem which can easily be solved using a window function:

select ...
from (
  select IDNo,
         FirstName,
         LastName,
         ....,
         row_number() over (partition by lower(idno) order by firstname) as rn 
  from people 
) t
where rn = 1;

Using the order by clause you can select which of the duplicates you want to pick.

The above can be used in a left join, see below:

select ...
from x
  left join (
    select IDNo,
           FirstName,
           LastName,
           ....,
           row_number() over (partition by lower(idno) order by firstname) as rn 
    from people 
  ) p on p.idno = x.idno and p.rn = 1
where ...

Solution 3

Add an identity column (PeopleID) and then use a correlated subquery to return the first value for each value.

SELECT *
FROM People p
WHERE PeopleID = (
    SELECT MIN(PeopleID) 
    FROM People 
    WHERE IDNo = p.IDNo
)

Solution 4

After careful consideration this dillema has a few different solutions:

Aggregate Everything Use an aggregate on each column to get the biggest or smallest field value. This is what I am doing since it takes 2 partially filled out records and "merges" the data.

http://sqlfiddle.com/#!3/59cde/1

SELECT
  UPPER(IDNo) AS user_id
, MAX(FirstName) AS name_first
, MAX(LastName) AS name_last
, MAX(entry) AS row_num
FROM people P
GROUP BY 
  IDNo

Get First (or Last record)

http://sqlfiddle.com/#!3/59cde/23

-- ------------------------------------------------------
-- Notes
-- entry: Auto-Number primary key some sort of unique PK is required for this method
-- IDNo:  Should be primary key in feed, but is not, we are making an upper case version
-- This gets the first entry to get last entry, change MIN() to MAX()
-- ------------------------------------------------------

SELECT 
   PC.user_id
  ,PData.FirstName
  ,PData.LastName
  ,PData.entry
FROM (
  SELECT 
      P2.user_id
     ,MIN(P2.entry) AS rownum
  FROM (
    SELECT
        UPPER(P.IDNo) AS user_id 
      , P.entry 
    FROM people P
  ) AS P2
  GROUP BY 
    P2.user_id
) AS PC
LEFT JOIN people PData
ON PData.entry = PC.rownum
ORDER BY 
   PData.entry

Solution 5

Try this

 SELECT *
 FROM people P 
 where P.IDNo in (SELECT DISTINCT IDNo
              FROM people)

View more solutions

174,230

Author by

Dave

I am IT. How do you kill that which has no life?

Updated on June 03, 2021

Comments

Dave about 3 years
I have a query against a large number of big tables (rows and columns) with a number of joins, however one of tables has some duplicate rows of data causing issues for my query. Since this is a read only realtime feed from another department I can't fix that data, however I am trying to prevent issues in my query from it.

Given that, I need to add this crap data as a left join to my good query. The data set looks like:
```
IDNo    FirstName   LastName    ...
-------------------------------------------
uqx     bob     smith
abc     john        willis
ABC     john        willis
aBc     john        willis
WTF     jeff        bridges
sss     bill        doe
ere     sally       abby
wtf     jeff        bridges
...
```
(about 2 dozen columns, and 100K rows)

My first instinct was to perform a distinct gave me about 80K rows:
```
SELECT DISTINCT P.IDNo
FROM people P
```
But when I try the following, I get all the rows back:
```
SELECT DISTINCT P.*
FROM people P
```
OR
```
SELECT 
    DISTINCT(P.IDNo) AS IDNoUnq 
    ,P.FirstName
    ,P.LastName
    ...etc.    
FROM people P
```
I then thought I would do a FIRST() aggregate function on all the columns, however that feels wrong too. Syntactically am I doing something wrong here?

Update: Just wanted to note: These records are duplicates based on a non-key / non-indexed field of ID listed above. The ID is a text field which although has the same value, it is a different case than the other data causing the issue.
Dave over 10 years

Good idea. Tried it and it still shows the duplicates. It appears IN compares non case sensitively. I tried to cheat by wrapping the P.IDNo in an Upper() and same for the distinct IDNo, but it still showed the dupes. Doh.
mucio over 10 years

if you don't control the final database you can discover that it's case sensitive (stackoverflow.com/questions/1411161/…). My suggestion is to put the IDNo in a UCASE() just to stay on the safe side
Dave over 10 years

After doing a full test, looks like the POC in the fiddle doesn't turn out as expected. This is because ANY column with a different value will make the row distinct.
Dave over 10 years

@mucio good point, when I get this solved I definitely will upper case the keys.
mucio over 10 years

that's exactly the point of DISTINCT, put everything in UCASE() and maybe also TRIM()
Dave over 10 years

@mucio So considering that, I am pretty much screwed and might as well just do a bunch of aggregate functions to maintain the integrity of the data since I can't just change all the data to upper case. I will consider the options here and see what alternative solutions I can come up with.
mucio over 10 years

I would rather go for fixing the data, like update every column to low cases (or if it's the case low case with the first letter capital), trim all the strings, etc.
HansHarhoff over 7 years

How would this perform if e.g. the people table is much larger than the x table?
a_horse_with_no_name over 7 years

@HansHarhoff: check the execution plan. But it doesn't matter if one table is bigger then the other - if the requirement is to return all rows from x and only the "latest" from people there is no other way to do it.
a_horse_with_no_name over 7 years

The query in this answer is exactly the same as select distinct IDNo, FirstName, LastName from people - the derived table won't change a thing
ggedde about 5 years

This wont allow for null rows
Keugels over 4 years

This obviously won't work. Say people IDno's is (1, 2, 2, 3, 4, 5, 5). Distinct IDno's is (1,2,3,4,5). Then you're taking all people where IDno is in (1,2,3,4,5), but all IDno's that are in (1, 2, 2, 3, 4, 5, 5) are also in (1, 2, 3, 4, 5) and vice versa. You didn't do anything here besides making the query a bit more complex.