Quartiles in SQL query

39,064

Solution 1

See SqlFiddle : http://sqlfiddle.com/#!9/accca6/2/6 Note : for the sqlfiddle I've generated 100 rows, each integer between 1 and 100 has a row, but it is a random order (done in excel).

Here is the code :

SET @number_of_rows := (SELECT COUNT(*) FROM LuxLog);
SET @quartile := (ROUND(@number_of_rows*0.25));
SET @sql_q1 := (CONCAT('(SELECT "Q1" AS quartile_name , Lux, Sensor FROM LuxLog ORDER BY Lux DESC LIMIT 1 OFFSET ', @quartile,')'));
SET @sql_q3 := (CONCAT('( SELECT "Q3" AS quartile_name , Lux, Sensor FROM LuxLog ORDER BY Lux ASC LIMIT 1 OFFSET ', @quartile,');'));
SET @sql := (CONCAT(@sql_q1,' UNION ',@sql_q3));
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;

EDIT :

SET @current_sensor := 101;
SET @quartile := (ROUND((SELECT COUNT(*) FROM LuxLog WHERE Sensor = @current_sensor)*0.25));
SET @sql_q1 := (CONCAT('(SELECT "Q1" AS quartile_name , Lux, Sensor FROM LuxLog WHERE Sensor=', @current_sensor,' ORDER BY Lux DESC LIMIT 1 OFFSET ', @quartile,')'));
SET @sql_q3 := (CONCAT('( SELECT "Q3" AS quartile_name , Lux, Sensor FROM LuxLog WHERE Sensor=', @current_sensor,' ORDER BY Lux ASC LIMIT 1 OFFSET ', @quartile,');'));
SET @sql := (CONCAT(@sql_q1,' UNION ',@sql_q3));
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;

Underlying reasoning is as follows : For quartile 1 we want to get 25% from the top so we want to know how much rows there are, that's :

SET @number_of_rows := (SELECT COUNT(*) FROM LuxLog);

Now that we know the number of rows, we want to know what is 25% of that, it is this line :

SET @quartile := (ROUND(@number_of_rows*0.25));

Then to find a quartile we want to order the LuxLog table by Lux, then to get the row number "@quartile", in order to do that we set the OFFSET to @quartile to say that we want to start our select from the row number @quartile and we say limit 1 to say that we want to retrieve only one row. That's :

SET @sql_q1 := (CONCAT('(SELECT "Q1" AS quartile_name , Lux, Sensor FROM LuxLog ORDER BY Lux DESC LIMIT 1 OFFSET ', @quartile,')'));

We do (almost) the same for the other quartile, but rather than starting from the top (from higher values to lower) we start from the bottom (it explains the ASC).

But for now we just have strings stored in the variables @sql_q1 and @sql_q3, so the concatenate them, we union the results of the queries, we prepare the query and execute it.

Solution 2

Something like this should do it:

select
    ll.*,
    if (a.position is not null, 1,
        if (b.position is not null, 2, 
        if (c.position is not null, 3, 
        if (d.position is not null, 4, 0)))
    ) as quartile
from
    luxlog ll
    left outer join luxlog a on ll.position = a.position and a.lux > (select count(*)*0.00 from luxlog) and a.lux <= (select count(*)*0.25 from luxlog)
    left outer join luxlog b on ll.position = b.position and b.lux > (select count(*)*0.25 from luxlog) and b.lux <= (select count(*)*0.50 from luxlog)
    left outer join luxlog c on ll.position = c.position and c.lux > (select count(*)*0.50 from luxlog) and c.lux <= (select count(*)*0.75 from luxlog)
    left outer join luxlog d on ll.position = d.position and d.lux > (select count(*)*0.75 from luxlog)
;    

Here's the complete example:

use example;

drop table if exists luxlog;

CREATE TABLE LuxLog (
  Sensor TINYINT,
  Lux INT,
  position int,
  PRIMARY KEY(Position)
);

insert into luxlog values (0, 1, 10);
insert into luxlog values (0, 2, 20);
insert into luxlog values (0, 3, 30);
insert into luxlog values (0, 4, 40);
insert into luxlog values (0, 5, 50);
insert into luxlog values (0, 6, 60);
insert into luxlog values (0, 7, 70);
insert into luxlog values (0, 8, 80);

select count(*)*.25 from luxlog;
select count(*)*.50 from luxlog;

select
    ll.*,
    a.position,
    b.position,
    if(
        a.position is not null, 1,
        if (b.position is not null, 2, 0)
    ) as quartile
from
    luxlog ll
    left outer join luxlog a on ll.position = a.position and a.lux >= (select count(*)*0.00 from luxlog) and a.lux < (select count(*)*0.25 from luxlog)
    left outer join luxlog b on ll.position = b.position and b.lux >= (select count(*)*0.25 from luxlog) and b.lux < (select count(*)*0.50 from luxlog)
    left outer join luxlog c on ll.position = c.position and c.lux >= (select count(*)*0.50 from luxlog) and c.lux < (select count(*)*0.75 from luxlog)
    left outer join luxlog d on ll.position = d.position and d.lux >= (select count(*)*0.75 from luxlog) and d.lux < (select count(*)*1.00 from luxlog)
;    


select
    ll.*,
    if (a.position is not null, 1,
        if (b.position is not null, 2, 
        if (c.position is not null, 3, 
        if (d.position is not null, 4, 0)))
    ) as quartile
from
    luxlog ll
    left outer join luxlog a on ll.position = a.position and a.lux > (select count(*)*0.00 from luxlog) and a.lux <= (select count(*)*0.25 from luxlog)
    left outer join luxlog b on ll.position = b.position and b.lux > (select count(*)*0.25 from luxlog) and b.lux <= (select count(*)*0.50 from luxlog)
    left outer join luxlog c on ll.position = c.position and c.lux > (select count(*)*0.50 from luxlog) and c.lux <= (select count(*)*0.75 from luxlog)
    left outer join luxlog d on ll.position = d.position and d.lux > (select count(*)*0.75 from luxlog)
;    
Share:
39,064
Hamma
Author by

Hamma

Updated on October 27, 2020

Comments

  • Hamma
    Hamma over 3 years

    I have a very simple table like that:

    CREATE TABLE IF NOT EXISTS LuxLog (
      Sensor TINYINT,
      Lux INT,
      PRIMARY KEY(Sensor)
    )
    

    It contains thousands of logs from different sensors.

    I would like to have Q1 and Q3 for all sensors.

    I can do one query for every data, but it would be better for me to have one query for all sensors (getting Q1 and Q3 back from one query)

    I though it would be a fairly simple operation, as quartiles are broadly used and one of the main statistical variables in frequency calculation. The truth is that I found loads of overcomplicated solutions, while I was hoping to find something neat and simple.

    Anyone can give me a hint?

    Edit: This is a piece of code that I found online, but it is not working for me:

    SELECT  SUBSTRING_INDEX(
            SUBSTRING_INDEX(
                GROUP_CONCAT(                 -- 1) make a sorted list of values
                    Lux
                    ORDER BY Lux
                    SEPARATOR ','
                )
            ,   ','                           -- 2) cut at the comma
            ,   75/100 * COUNT(*)        --    at the position beyond the 90% portion
            )
        ,   ','                               -- 3) cut at the comma
        ,   -1                                --    right after the desired list entry
        )                 AS `75th Percentile`
        FROM    LuxLog
        WHERE   Sensor=12
        AND     Lux<>0
    

    I am getting 1 as return value, while it should be a number that can be divided by 10 (10,20,30.....1000)