SQL Left Join first match only
Solution 1
Turns out I was doing it wrong, I needed to perform a nested select first of just the important columns, and do a distinct select off that to prevent trash columns of 'unique' data from corrupting my good data. The following appears to have resolved the issue... but I will try on the full dataset later.
SELECT DISTINCT P2.*
FROM (
SELECT
IDNo
, FirstName
, LastName
FROM people P
) P2
Here is some play data as requested: http://sqlfiddle.com/#!3/050e0d/3
CREATE TABLE people
(
[entry] int
, [IDNo] varchar(3)
, [FirstName] varchar(5)
, [LastName] varchar(7)
);
INSERT INTO people
(entry,[IDNo], [FirstName], [LastName])
VALUES
(1,'uqx', 'bob', 'smith'),
(2,'abc', 'john', 'willis'),
(3,'ABC', 'john', 'willis'),
(4,'aBc', 'john', 'willis'),
(5,'WTF', 'jeff', 'bridges'),
(6,'Sss', 'bill', 'doe'),
(7,'sSs', 'bill', 'doe'),
(8,'ssS', 'bill', 'doe'),
(9,'ere', 'sally', 'abby'),
(10,'wtf', 'jeff', 'bridges')
;
Solution 2
distinct
is not a function. It always operates on all columns of the select list.
Your problem is a typical "greatest N per group" problem which can easily be solved using a window function:
select ...
from (
select IDNo,
FirstName,
LastName,
....,
row_number() over (partition by lower(idno) order by firstname) as rn
from people
) t
where rn = 1;
Using the order by
clause you can select which of the duplicates you want to pick.
The above can be used in a left join, see below:
select ...
from x
left join (
select IDNo,
FirstName,
LastName,
....,
row_number() over (partition by lower(idno) order by firstname) as rn
from people
) p on p.idno = x.idno and p.rn = 1
where ...
Solution 3
Add an identity column (PeopleID) and then use a correlated subquery to return the first value for each value.
SELECT *
FROM People p
WHERE PeopleID = (
SELECT MIN(PeopleID)
FROM People
WHERE IDNo = p.IDNo
)
Solution 4
After careful consideration this dillema has a few different solutions:
Aggregate Everything Use an aggregate on each column to get the biggest or smallest field value. This is what I am doing since it takes 2 partially filled out records and "merges" the data.
http://sqlfiddle.com/#!3/59cde/1
SELECT
UPPER(IDNo) AS user_id
, MAX(FirstName) AS name_first
, MAX(LastName) AS name_last
, MAX(entry) AS row_num
FROM people P
GROUP BY
IDNo
Get First (or Last record)
http://sqlfiddle.com/#!3/59cde/23
-- ------------------------------------------------------
-- Notes
-- entry: Auto-Number primary key some sort of unique PK is required for this method
-- IDNo: Should be primary key in feed, but is not, we are making an upper case version
-- This gets the first entry to get last entry, change MIN() to MAX()
-- ------------------------------------------------------
SELECT
PC.user_id
,PData.FirstName
,PData.LastName
,PData.entry
FROM (
SELECT
P2.user_id
,MIN(P2.entry) AS rownum
FROM (
SELECT
UPPER(P.IDNo) AS user_id
, P.entry
FROM people P
) AS P2
GROUP BY
P2.user_id
) AS PC
LEFT JOIN people PData
ON PData.entry = PC.rownum
ORDER BY
PData.entry
Solution 5
Try this
SELECT *
FROM people P
where P.IDNo in (SELECT DISTINCT IDNo
FROM people)
![Dave](https://i.stack.imgur.com/0ruaR.png?s=256&g=1)
Comments
-
Dave about 3 years
I have a query against a large number of big tables (rows and columns) with a number of joins, however one of tables has some duplicate rows of data causing issues for my query. Since this is a read only realtime feed from another department I can't fix that data, however I am trying to prevent issues in my query from it.
Given that, I need to add this crap data as a left join to my good query. The data set looks like:
IDNo FirstName LastName ... ------------------------------------------- uqx bob smith abc john willis ABC john willis aBc john willis WTF jeff bridges sss bill doe ere sally abby wtf jeff bridges ...
(about 2 dozen columns, and 100K rows)
My first instinct was to perform a distinct gave me about 80K rows:
SELECT DISTINCT P.IDNo FROM people P
But when I try the following, I get all the rows back:
SELECT DISTINCT P.* FROM people P
OR
SELECT DISTINCT(P.IDNo) AS IDNoUnq ,P.FirstName ,P.LastName ...etc. FROM people P
I then thought I would do a FIRST() aggregate function on all the columns, however that feels wrong too. Syntactically am I doing something wrong here?
Update: Just wanted to note: These records are duplicates based on a non-key / non-indexed field of ID listed above. The ID is a text field which although has the same value, it is a different case than the other data causing the issue.
-
Dave over 10 yearsGood idea. Tried it and it still shows the duplicates. It appears IN compares non case sensitively. I tried to cheat by wrapping the P.IDNo in an Upper() and same for the distinct IDNo, but it still showed the dupes. Doh.
-
mucio over 10 yearsif you don't control the final database you can discover that it's case sensitive (stackoverflow.com/questions/1411161/…). My suggestion is to put the
IDNo
in aUCASE()
just to stay on the safe side -
Dave over 10 yearsAfter doing a full test, looks like the POC in the fiddle doesn't turn out as expected. This is because ANY column with a different value will make the row distinct.
-
Dave over 10 years@mucio good point, when I get this solved I definitely will upper case the keys.
-
mucio over 10 yearsthat's exactly the point of
DISTINCT
, put everything inUCASE()
and maybe alsoTRIM()
-
Dave over 10 years@mucio So considering that, I am pretty much screwed and might as well just do a bunch of aggregate functions to maintain the integrity of the data since I can't just change all the data to upper case. I will consider the options here and see what alternative solutions I can come up with.
-
mucio over 10 yearsI would rather go for fixing the data, like update every column to low cases (or if it's the case low case with the first letter capital), trim all the strings, etc.
-
HansHarhoff over 7 yearsHow would this perform if e.g. the people table is much larger than the x table?
-
a_horse_with_no_name over 7 years@HansHarhoff: check the execution plan. But it doesn't matter if one table is bigger then the other - if the requirement is to return all rows from
x
and only the "latest" frompeople
there is no other way to do it. -
a_horse_with_no_name over 7 yearsThe query in this answer is exactly the same as
select distinct IDNo, FirstName, LastName from people
- the derived table won't change a thing -
ggedde about 5 yearsThis wont allow for null rows
-
Keugels over 4 yearsThis obviously won't work. Say people IDno's is (1, 2, 2, 3, 4, 5, 5). Distinct IDno's is (1,2,3,4,5). Then you're taking all people where IDno is in (1,2,3,4,5), but all IDno's that are in (1, 2, 2, 3, 4, 5, 5) are also in (1, 2, 3, 4, 5) and vice versa. You didn't do anything here besides making the query a bit more complex.