Transpose rows and columns (a.k.a. pivot) only with a minimum COUNT()?
Solution 1
Here's an alternative to @bluefeet's suggestion, which is somewhat similar but avoids the join (instead, the upper level grouping is applied to the already grouped result set):
SELECT
  year,
  MAX(CASE animal WHEN 'kittens' THEN avg_price END) AS "kittens",
  MAX(CASE animal WHEN 'puppies' THEN avg_price END) AS "puppies"
FROM (
  SELECT
    animal,
    year,
    COUNT(*) AS cnt,
    AVG(price) AS avg_price
  FROM tab_test
  GROUP BY
    animal,
    year
) s
WHERE cnt >= 3
GROUP BY
  year;
Solution 2
CASE
If your case is as simple as demonstrated, a CASE statement will do:
SELECT year
, sum(CASE WHEN animal = 'kittens' THEN price END) AS kittens
, sum(CASE WHEN animal = 'puppies' THEN price END) AS puppies
FROM (
SELECT year, animal, avg(price) AS price
FROM tab_test
GROUP BY year, animal
HAVING count(*) > 2
) t
GROUP BY year
ORDER BY year;
It doesn't matter whether you use sum(), max() or min() as aggregate function in the outer query: the subquery returns at most one row per (year, animal), so each conditional aggregate only ever sees a single non-NULL value and all three return it unchanged.
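The equivalence can be checked directly; here is a minimal sketch with hypothetical hard-coded values standing in for tab_test:

```sql
-- One subquery row per (year, animal): each conditional aggregate sees a
-- single non-NULL value, so sum(), max() and min() all return it as-is.
SELECT year
     , sum(CASE WHEN animal = 'kittens' THEN price END) AS via_sum
     , max(CASE WHEN animal = 'kittens' THEN price END) AS via_max
     , min(CASE WHEN animal = 'kittens' THEN price END) AS via_min
FROM  (VALUES (2000, 'kittens', 90.67)
            , (2000, 'puppies', 23.50)) v(year, animal, price)
GROUP BY year;
-- via_sum = via_max = via_min = 90.67
```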
crosstab()
With more categories, a crosstab() query is simpler. It should also be faster for bigger tables.
You need to install the additional module tablefunc (once per database). Since Postgres 9.1 that's as simple as:
CREATE EXTENSION tablefunc;
Details in this related answer:
SELECT * FROM crosstab(
'SELECT year, animal, avg(price) AS price
FROM tab_test
GROUP BY animal, year
HAVING count(*) > 2
ORDER BY 1,2'
,$$VALUES ('kittens'::text), ('puppies')$$)
AS ct ("year" text, "kittens" numeric, "puppies" numeric);
No sqlfiddle for this one because the site doesn't allow additional modules.
Benchmark
To verify my claims I ran a quick benchmark with close-to-real data in my small test database. PostgreSQL 9.1.6. Tested with EXPLAIN ANALYZE, best of 10:
Test setup with 10020 rows:
CREATE TABLE tab_test (year int, animal text, price numeric);
-- years with lots of rows
INSERT INTO tab_test
SELECT 2000 + ((g + random() * 300))::int/1000
, CASE WHEN (g + (random() * 1.5)::int) %2 = 0 THEN 'kittens' ELSE 'puppies' END
, (random() * 200)::numeric
FROM generate_series(1,10000) g;
-- .. and some years with only few rows to include cases with count < 3
INSERT INTO tab_test
SELECT 2010 + ((g + random() * 10))::int/2
, CASE WHEN (g + (random() * 1.5)::int) %2 = 0 THEN 'kittens' ELSE 'puppies' END
, (random() * 200)::numeric
FROM generate_series(1,20) g;
Results:
@bluefeet:
Total runtime: 95.401 ms

@wildplasser (different results, includes rows with count < 3):
Total runtime: 64.497 ms

@Andreiy (+ ORDER BY) & @Erwin1 - CASE (both perform about the same):
Total runtime: 39.105 ms

@Erwin2 - crosstab():
Total runtime: 17.644 ms
Largely proportional (but irrelevant) results with only 20 rows. Only @wildplasser's CTE has more overhead and spikes a little. With more than a handful of rows, crosstab() quickly takes the lead.
@Andreiy's query performs about the same as my simplified version; the aggregate function in the outer SELECT (min(), max(), sum()) makes no measurable difference (just two rows per group).
Everything as expected, no surprises, take my setup and try it @home.
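For anyone repeating the benchmark: each number above is the best "Total runtime" reported by EXPLAIN ANALYZE over 10 runs. For example, to time the crosstab() variant:

```sql
-- Run a few times and keep the best "Total runtime" (wording as of pg 9.1):
EXPLAIN ANALYZE
SELECT * FROM crosstab(
   'SELECT year, animal, avg(price) AS price
    FROM   tab_test
    GROUP  BY animal, year
    HAVING count(*) > 2
    ORDER  BY 1,2'
 , $$VALUES ('kittens'::text), ('puppies')$$)
AS ct ("year" text, "kittens" numeric, "puppies" numeric);
```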
Solution 3
Is this what you are looking for:
SELECT t1.year,
       AVG(CASE WHEN t1.animal = 'kittens' THEN t1.price END) AS "kittens",
       AVG(CASE WHEN t1.animal = 'puppies' THEN t1.price END) AS "puppies"
FROM tab_test t1
INNER JOIN (
    SELECT animal, COUNT(*) AS YearCount, year
    FROM tab_test
    GROUP BY animal, year
) t2 ON  t1.animal = t2.animal
     AND t1.year = t2.year
WHERE t2.YearCount >= 3
GROUP BY t1.year;
Solution 4
CREATE TABLE pussyriot(year INTEGER NOT NULL
, animal varchar
, price integer
);
INSERT INTO pussyriot(year , animal , price ) VALUES
(2000, 'kittens', 79)
, (2000, 'kittens', 93)
...
, (2007, 'puppies', 81)
, (2007, 'puppies', 38)
;
-- a self join is a poor man's pivot:
WITH cal AS ( -- generate calendar file
SELECT generate_series(MIN(pr.year) , MAX(pr.year)) AS year
FROM pussyriot pr
)
, fur AS (
SELECT distinct year, animal, AVG(price) AS price
FROM pussyriot
GROUP BY year, animal
-- UPDATE: added next line
HAVING COUNT(*) >= 3
)
SELECT cal.year
, pussy.price AS price_of_the_pussy
, puppy.price AS price_of_the_puppy
FROM cal
LEFT JOIN fur pussy ON pussy.year=cal.year AND pussy.animal='kittens'
LEFT JOIN fur puppy ON puppy.year=cal.year AND puppy.animal='puppies'
;
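Note the role of the cal CTE: years whose animal groups are all filtered out by the HAVING clause would otherwise disappear from the result, whereas generate_series() plus the LEFT JOINs keep every year in the range with NULL prices. A minimal sketch of that gap-filling idea, with hypothetical hard-coded values:

```sql
-- Years 2001-2003 have no qualifying group, but still show up with NULL:
WITH cal AS (SELECT generate_series(2000, 2003) AS year)
   , fur AS (SELECT 2000 AS year, 90.67 AS price)
SELECT cal.year, fur.price
FROM  cal
LEFT  JOIN fur ON fur.year = cal.year
ORDER BY cal.year;
-- 2000 -> 90.67; 2001..2003 -> NULL
```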
user1626730
Updated on July 29, 2022

Comments
-
user1626730 almost 2 years
Here's my table 'tab_test':
year  animal   price
2000  kittens  79
2000  kittens  93
2000  kittens  100
2000  puppies  15
2000  puppies  32
2001  kittens  31
2001  kittens  17
2001  puppies  65
2001  puppies  48
2002  kittens  84
2002  kittens  86
2002  puppies  15
2002  puppies  95
2003  kittens  62
2003  kittens  24
2003  puppies  36
2003  puppies  41
2004  kittens  65
2004  kittens  85
2004  puppies  58
2004  puppies  95
2005  kittens  45
2005  kittens  25
2005  puppies  15
2005  puppies  35
2006  kittens  50
2006  kittens  80
2006  puppies  95
2006  puppies  49
2007  kittens  40
2007  kittens  19
2007  puppies  81
2007  puppies  38
2008  kittens  37
2008  kittens  51
2008  puppies  29
2008  puppies  72
2009  kittens  84
2009  kittens  26
2009  puppies  49
2009  puppies  34
2010  kittens  75
2010  kittens  96
2010  puppies  18
2010  puppies  26
2011  kittens  35
2011  kittens  21
2011  puppies  90
2011  puppies  18
2012  kittens  12
2012  kittens  23
2012  puppies  74
2012  puppies  79
Here's some code that transposes the rows and columns so I get an average for 'kittens' and 'puppies':
SELECT year,
       AVG(CASE WHEN animal = 'kittens' THEN price END) AS "kittens",
       AVG(CASE WHEN animal = 'puppies' THEN price END) AS "puppies"
FROM tab_test
GROUP BY year
ORDER BY year;
The output for the code above is:
year  kittens           puppies
2000  90.6666666666667  23.5
2001  24.0              56.5
2002  85.0              55.0
2003  43.0              38.5
2004  75.0              76.5
2005  35.0              25.0
2006  65.0              72.0
2007  29.5              59.5
2008  44.0              50.5
2009  55.0              41.5
2010  85.5              22.0
2011  28.0              54.0
2012  17.5              76.5
What I'd like is a table like the second one, but only containing items which had a COUNT() of at least 3 in the first table. In other words, the goal is to have this as output:

year  kittens
2000  90.6666666666667

There were at least 3 instances of 'kittens' in the first table. Is this possible in PostgreSQL?
-
user1626730 over 11 years
The code works great, though it's just a little slower than @AndriyM's. Still, thanks for the extra tips!
-
Erwin Brandstetter over 11 years
@user1626730: Both are slower? The CASE version should be equally fast or faster - except for the missing ORDER BY in @Andriy's version. Hmm.. maybe sum() is slower than max().. but that should hardly be relevant. The crosstab() version would be faster for more complex cases / bigger tables.
-
user1626730 over 11 years
Well, the first one was between 100-200ms on SQLfiddle, while @AndriyM's was between 1-10ms. Even so, the crosstab info will probably be useful to me in the future, as I do plan on creating custom crosstabs.
-
Erwin Brandstetter over 11 years
@user1626730: These numbers are measuring artefacts. With a small data set you can hardly get reliable timings on sqlfiddle. I ran a quick benchmark locally to verify my claims and added it to my answer.
-
user1626730 over 11 years
Ah, I see. In that case, I'll try out all these queries then. Thank you.