Postgres, table1 left join table2 with only 1 row per ID in table1
Solution 1
DISTINCT ON is your friend.
select distinct on (user_id) * from user_stats order by datestamp desc;
Solution 2
Basically you need to decide how to resolve ties, and you need some other column besides datestamp
which is guaranteed to be unique (at least over a given user) so it can be used as the tiebreaker. If nothing else, you can use the id
primary key column.
Another solution if you're using PostgreSQL 8.4 is windowing functions:
WITH numbered_user_stats AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY datestamp DESC) AS RowNum
FROM user_stats) AS numbered_user_stats
) SELECT u.user_id, u.datestamp, u.post_count, u.friends_count, u.favorites_count
FROM numbered_user_stats AS u
WHERE u.RowNum = 1;
Related videos on Youtube
Peck
Updated on June 04, 2022Comments
-
Peck almost 2 years
Ok, so the title is a bit convoluted. This is basically a greatest-n-per-group type problem, but I can't for the life of me figure it out.
I have a table, user_stats:
------------------+---------+--------------------------------------------------------- id | bigint | not null default nextval('user_stats_id_seq'::regclass) user_id | bigint | not null datestamp | integer | not null post_count | integer | friends_count | integer | favourites_count | integer | Indexes: "user_stats_pk" PRIMARY KEY, btree (id) "user_stats_datestamp_index" btree (datestamp) "user_stats_user_id_index" btree (user_id) Foreign-key constraints: "user_user_stats_fk" FOREIGN KEY (user_id) REFERENCES user_info(id)
I want to get the stats for each id by latest datestamp. This is a biggish table, somewhere in the neighborhood of 41m rows, so I've created a temp table of user_id, last_date using:
CREATE TEMP TABLE id_max_date AS (SELECT user_id, MAX(datestamp) AS date FROM user_stats GROUP BY user_id);
The problem is that datestamp isn't unique since there can be more than 1 stat update in a day (should have been a real timestamp but the guy who designed this was kind of an idiot and theres too much data to go back at the moment). So some IDs have multiple rows when I do the JOIN:
SELECT user_stats.user_id, user_stats.datestamp, user_stats.post_count, user_stats.friends_count, user_stats.favorites_count FROM id_max_date JOIN user_stats ON id_max_date.user_id=user_stats.user_id AND date=datestamp;
If I was doing this as subselects I guess I could LIMIT 1, but I've always heard those are horribly inefficient. Thoughts?
-
Charles almost 14 years"... I've always heard those are horribly inefficient." Don't get sucked into the cargo cult!
EXPLAIN
is your friend! Try it and find out what the query optimizer can do for you.
-
-
Jonathan Leffler almost 14 yearsI suppose those id columns have to have a use after all; I'm not sure this use was planned for, though.
-
Peck almost 14 yearsThat is exactly what I want, its specific to postgres, so not ideal, but Ill make a note around it and move along. Thank you!
-
rfusca almost 14 years@Peck - I think DISTINCT ON is one of the handiest postgres-isms. I wish more SQL implementations had something similar!
-
Bill Karwin almost 14 yearsThe permissive behavior of
GROUP BY
in MySQL and SQLite is similar. But the results might be arbitrary. These features are not supported by the SQL standard.