Transpose rows and columns (a.k.a. pivot) only with a minimum COUNT()?


Solution 1

Here's an alternative to @bluefeet's suggestion, which is somewhat similar but avoids the join (instead, the upper-level grouping is applied to the already grouped result set):

SELECT
  year,
  MAX(CASE animal WHEN 'kittens' THEN avg_price END) AS "kittens",
  MAX(CASE animal WHEN 'puppies' THEN avg_price END) AS "puppies"
FROM (
  SELECT
    animal,
    year,
    COUNT(*) AS cnt,
    AVG(Price) AS avg_price
  FROM tab_test
  GROUP BY
    animal,
    year
) s
WHERE cnt >= 3
GROUP BY
  year
;

Solution 2

CASE

If your case is as simple as demonstrated, a CASE expression will do:

SELECT year
     , sum(CASE WHEN animal = 'kittens' THEN price END) AS kittens
     , sum(CASE WHEN animal = 'puppies' THEN price END) AS puppies
FROM  (
   SELECT year, animal, avg(price) AS price
   FROM   tab_test
   GROUP  BY year, animal
   HAVING count(*) > 2
   ) t
GROUP  BY year
ORDER  BY year;

It doesn't matter whether you use sum(), max(), or min() as the aggregate function in the outer query: the subquery produces at most one row per animal and year, so all three return the same value.
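On PostgreSQL 9.4 or later (newer than the versions discussed here), the aggregate FILTER clause expresses the same pivot without a CASE expression. A sketch, reusing the same subquery:

```sql
SELECT year
     , min(price) FILTER (WHERE animal = 'kittens') AS kittens
     , min(price) FILTER (WHERE animal = 'puppies') AS puppies
FROM  (  -- same inner aggregation as above
   SELECT year, animal, avg(price) AS price
   FROM   tab_test
   GROUP  BY year, animal
   HAVING count(*) > 2
   ) t
GROUP  BY year
ORDER  BY year;
```

Again, the choice of outer aggregate is immaterial, since the filter leaves at most one row per group.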

SQL Fiddle

crosstab()

With more categories, a crosstab() query becomes simpler. It should also be faster for bigger tables.

You need to install the additional module tablefunc (once per database). Since Postgres 9.1 that's as simple as:

CREATE EXTENSION tablefunc;

Details in this related answer.

SELECT * FROM crosstab(
      'SELECT year, animal, avg(price) AS price
       FROM   tab_test
       GROUP  BY animal, year
       HAVING count(*) > 2
       ORDER  BY 1,2'

      ,$$VALUES ('kittens'::text), ('puppies')$$)
AS ct ("year" int, "kittens" numeric, "puppies" numeric);
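To illustrate how this scales to more categories: a hypothetical third animal 'ponies' only adds one entry to the VALUES list and one output column. (A sketch; 'ponies' does not occur in the sample data, so its column would be all NULL.)

```sql
SELECT * FROM crosstab(
      'SELECT year, animal, avg(price) AS price
       FROM   tab_test
       GROUP  BY animal, year
       HAVING count(*) > 2
       ORDER  BY 1,2'
    , $$VALUES ('kittens'::text), ('ponies'), ('puppies')$$)
AS ct ("year" int, "kittens" numeric, "ponies" numeric, "puppies" numeric);
```

Compare that with the CASE approach, where every new category needs its own CASE branch in the SELECT list.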

No sqlfiddle for this one because the site doesn't allow additional modules.

Benchmark

To verify my claims I ran a quick benchmark with close-to-real data in my small test database, on PostgreSQL 9.1.6. Tested with EXPLAIN ANALYZE, best of 10:

Test setup with 10020 rows:

CREATE TABLE tab_test (year int, animal text, price numeric);

-- years with lots of rows
INSERT INTO tab_test
SELECT 2000 + ((g + random() * 300))::int/1000 
     , CASE WHEN (g + (random() * 1.5)::int) %2 = 0 THEN 'kittens' ELSE 'puppies' END
     , (random() * 200)::numeric
FROM   generate_series(1,10000) g;

-- .. and some years with only few rows to include cases with count < 3
INSERT INTO tab_test
SELECT 2010 + ((g + random() * 10))::int/2
     , CASE WHEN (g + (random() * 1.5)::int) %2 = 0 THEN 'kittens' ELSE 'puppies' END
     , (random() * 200)::numeric
FROM   generate_series(1,20) g;

Results:

@bluefeet
Total runtime: 95.401 ms

@wildplasser (different results, includes rows with count < 3)
Total runtime: 64.497 ms

@AndriyM (+ ORDER BY)
& @Erwin1 - CASE (both perform about the same)
Total runtime: 39.105 ms

@Erwin2 - crosstab()
Total runtime: 17.644 ms

Largely proportional (but irrelevant) results with only 20 rows; only @wildplasser's CTE has more overhead and spikes a little.

With more than a handful of rows, crosstab() quickly takes the lead. @AndriyM's query performs about the same as my simplified version; the aggregate function in the outer SELECT (min(), max(), sum()) makes no measurable difference, with just two rows per group.

Everything as expected, no surprises. Take my setup and try it at home.

Solution 3

Is this what you are looking for:

SELECT t1.year,
    AVG(CASE WHEN t1.animal = 'kittens' THEN t1.price END) AS "kittens",
    AVG(CASE WHEN t1.animal = 'puppies' THEN t1.price END) AS "puppies"
FROM tab_test t1
INNER JOIN
(
  SELECT animal, COUNT(*) AS YearCount, year
  FROM tab_test
  GROUP BY animal, year
) t2
  ON t1.animal = t2.animal
  AND t1.year = t2.year
WHERE t2.YearCount >= 3
GROUP BY t1.year;

See SQL Fiddle with Demo

Solution 4

CREATE TABLE pussyriot(year INTEGER NOT NULL
        , animal varchar
        , price integer
        );

INSERT INTO pussyriot(year , animal , price ) VALUES
 (2000, 'kittens', 79)
, (2000, 'kittens', 93)
...
, (2007, 'puppies', 81)
, (2007, 'puppies', 38)
        ;

-- a self join is a poor man's pivot:
WITH cal AS ( -- generate calendar file
        SELECT generate_series(MIN(pr.year) , MAX(pr.year)) AS year
        FROM pussyriot pr
        )
, fur AS (
        SELECT distinct year, animal, AVG(price) AS price
        FROM pussyriot
        GROUP BY year, animal
        -- UPDATE: added next line
        HAVING COUNT(*) >= 3
        )
SELECT cal.year
        , pussy.price AS price_of_the_pussy
        , puppy.price AS price_of_the_puppy
FROM cal
LEFT JOIN fur pussy ON pussy.year=cal.year AND pussy.animal='kittens'
LEFT JOIN fur puppy ON puppy.year=cal.year AND puppy.animal='puppies'
        ;
Author: user1626730

Updated on July 29, 2022
Comments

  • user1626730, almost 2 years ago:

    Here's my table 'tab_test':

    year    animal  price
    2000    kittens 79
    2000    kittens 93
    2000    kittens 100
    2000    puppies 15
    2000    puppies 32
    2001    kittens 31
    2001    kittens 17
    2001    puppies 65
    2001    puppies 48
    2002    kittens 84
    2002    kittens 86
    2002    puppies 15
    2002    puppies 95
    2003    kittens 62
    2003    kittens 24
    2003    puppies 36
    2003    puppies 41
    2004    kittens 65
    2004    kittens 85
    2004    puppies 58
    2004    puppies 95
    2005    kittens 45
    2005    kittens 25
    2005    puppies 15
    2005    puppies 35
    2006    kittens 50
    2006    kittens 80
    2006    puppies 95
    2006    puppies 49
    2007    kittens 40
    2007    kittens 19
    2007    puppies 81
    2007    puppies 38
    2008    kittens 37
    2008    kittens 51
    2008    puppies 29
    2008    puppies 72
    2009    kittens 84
    2009    kittens 26
    2009    puppies 49
    2009    puppies 34
    2010    kittens 75
    2010    kittens 96
    2010    puppies 18
    2010    puppies 26
    2011    kittens 35
    2011    kittens 21
    2011    puppies 90
    2011    puppies 18
    2012    kittens 12
    2012    kittens 23
    2012    puppies 74
    2012    puppies 79
    

    Here's some code that transposes the rows and columns so I get an average for 'kittens' and 'puppies':

    SELECT
        year,
        AVG(CASE WHEN animal = 'kittens' THEN price END) AS "kittens",
        AVG(CASE WHEN animal = 'puppies' THEN price END) AS "puppies"
    FROM tab_test
    GROUP BY year
    ORDER BY year;
    

    The output for the code above is:

        year    kittens puppies
        2000    90.6666666666667    23.5
        2001    24.0    56.5
        2002    85.0    55.0
        2003    43.0    38.5
        2004    75.0    76.5
        2005    35.0    25.0
        2006    65.0    72.0
        2007    29.5    59.5
        2008    44.0    50.5
        2009    55.0    41.5
        2010    85.5    22.0
        2011    28.0    54.0
        2012    17.5    76.5
    

    What I'd like is a table like the second one, but it would only contain items which had a COUNT() of at least 3 in the first table. In other words, the goal is to have this as output:

    year    kittens
    2000    90.6666666666667
    

    There were at least 3 instances of 'kittens' in the first table.
    Is this possible in PostgreSQL?

  • user1626730, over 11 years ago:
    The code works great, though it's just a little slower than @AndriyM's. Still, thanks for the extra tips!
  • Erwin Brandstetter, over 11 years ago:
    @user1626730: Both are slower? The CASE version should be equally fast or faster, except for the missing ORDER BY in @AndriyM's version. Hmm... maybe sum() is slower than max(), but that should hardly be relevant. The crosstab() version would be faster for more complex cases / bigger tables.
  • user1626730, over 11 years ago:
    Well, the first one was between 100-200ms on SQLfiddle, while @AndriyM's was between 1-10ms. Even so, the crosstab info will probably be useful to me in the future, as I do plan on creating custom crosstabs.
  • Erwin Brandstetter, over 11 years ago:
    @user1626730: These numbers are measurement artefacts. With a small data set you can hardly get reliable timings on sqlfiddle. I ran a quick benchmark locally to verify my claims and added it to my answer.
  • user1626730, over 11 years ago:
    Ah, I see. In that case, I'll try out all of these queries then. Thank you.