Rails : How to build statistics per day/month/year or How database agnostic SQL functions are missing (ex. : STRFTIME, DATE_FORMAT, DATE_TRUNC)

11,130

Solution 1

I ended up writing my own gem. Check it out and feel free to contribute: https://github.com/lakim/sql_funk

It allows you to make calls like:

Subscriber.count_by("created_at", :group_by => "day")

Solution 2

You speak of some pretty difficult problems that Rails, unfortunately, completely overlooks. The ActiveRecord::Calculations docs are written like they're all you ever need, but databases can do much more advanced things. As Donal Fellows mentioned in his comment, the problem is much trickier than it seems.

I've developed a Rails application over the last two years that makes heavy use of aggregation, and I've tried a few different approaches to the problem. I unfortunately don't have the luxary of ignoring things like daylight savings because the statistics are "only trends". The calculations I generate are tested by my customers to exact specifications.

To expand upon the problem a bit, I think you'll find that your current solution of grouping by dates is inadequate. It seems like a natural option to use STRFTIME. The primary problem is that it doesn't let you group by arbitrary time periods. If you want to do aggregation by year, month, day, hour, and/or minute, STRFTIME will work fine. If not, you'll find yourself looking for another solution. Another huge problem is that of aggregation upon aggregation. Say, for example, you want to group by month, but you want to do it starting from the 15th of every month. How would you do it using STRFTIME? You'd have to group by each day, and then by month, but then someone account for the starting offset of the 15th day of each month. The final straw is that grouping by STRFTIME necessitates grouping by a string value, which you'll find very slow when performing aggregation upon aggregation.

The most performant and best designed solution I've come to is one based upon integer time periods. Here is an excerpt from one of my mysql queries:

SELECT
  field1, field2, field3,
  CEIL((UNIX_TIMESTAMP(CONVERT_TZ(date, '+0:00', @@session.time_zone)) + :begin_offset) / :time_interval) AS time_period
FROM
  some_table
GROUP BY 
  time_period

In this case, :time_interval is the number of seconds in the grouping period (e.g. 86400 for daily) and :begin_offset is the number of seconds to offset the period start. The CONVERT_TZ() business accounts for the way mysql interprets dates. Mysql always assumes that the date field is in the mysql local time zone. But because I store times in UTC, I must convert it from UTC to the session time zone if I want the UNIX_TIMESTAMP() function to give me a correct response. The time period ends up being an integer that describes the number of time intervals since the start of unix time. This solution is much more flexible because it lets you group by arbitrary periods and doesn't require aggregation upon aggregation.

Now, to get to my real point. For a robust solution, I'd recommend that you consider not using Rails at all to generate these queries. The biggest issue is that the performance characteristics and subtleties of aggregation are different across the databases. You might find one design that works well in your development environment but not in production, or vice-versa. You'll jump through a lot of hoops to get Rails to play nicely with both databases in query construction.

Instead I'd recommend that you generate database-specific views in your chosen database and bring those along to the correct environment. Try to model the view as you would any other ActiveRecord table (id's and all), and of course make the fields in the view identical across databases. Because these statistics are read-only queries, you can use a model to back them and pretend like they're full-fledged tables. Just raise an exception if somebody tries to save, create, update, or destroy.

Not only will you get simplified model management by doing things the Rails way, you'll also find that you can write units tests for your aggregation features in ways you wouldn't dream of in pure SQL. And if you decide to switch databases, you'll have to rewrite those views, but your tests will tell you where you're wrong, and make life so much easier.

Solution 3

I just released a gem that allows you to do this easily with MySQL. https://github.com/ankane/groupdate

You should really try to run MySQL in development, too. Your development and production environments should be as close as possible - less of a chance for something to work on development and totally break production.

Share:
11,130
lakim
Author by

lakim

Freelance Ruby and Javascript developer

Updated on June 10, 2022

Comments

  • lakim
    lakim almost 2 years

    I have been searching all over the web and I have no clue.

    • Suppose you have to build a dashboard in the admin area of your Rails app and you want to have the number of subscriptions per day.
    • Suppose that you are using SQLite3 for development, MySQL for production (pretty standard setup)

    Basically, there are two options :

    1) Retrieve all rows from the database using Subscriber.all and aggregate by day in the Rails app using the Enumerable.group_by :

    @subscribers = Subscriber.all
    @subscriptions_per_day = @subscribers.group_by { |s| s.created_at.beginning_of_day }
    

    I think this is a really bad idea. Retrieving all rows from the database can be acceptable for a small application, but it will not scale at all. Database aggregate and date functions to the rescue !

    2) Run a SQL query in the database using aggregate and date functions :

    Subscriber.select('STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions').group('day')
    

    Which will run in this SQL query :

    SELECT STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions
    FROM subscribers
    GROUP BY day
    

    Much better. Now aggregates are done in the database which is optimized for this kind of task, and only one row per day is returned from the database to the Rails app.

    ... but wait... now the app has to go live in my production env which uses MySQL ! Replace STRFTIME() with DATE_FORMAT(). What if tomorrow I switch to PostgreSQL ? Replace DATE_FORMAT() with DATE_TRUNC().

    I like to develop with SQLite. Simple and easy. I also like the idea that Rails is database agnostic. But why Rails doesn't provide a way to translate SQL functions that do the exact same thing, but have different syntax in each RDBMS (this difference is really stupid, but hey, it's too late to complain about it) ?

    I can't believe that I find so few answers on the Web for such a basic feature of a Rails app : count the subscriptions per day, month or year.

    Tell me I'm missing something :)

    EDIT

    It's been a few years since I posted this question. Experience has shown that I should use the same DB for dev and prod. So I now consider the database agnostic requirement irrelevant.

    Dev/prod parity FTW.

  • lakim
    lakim over 13 years
    I like your first option. Helps me think out of the box. But the trade-off (larger record size) is even bigger considering it will only be used by admins.
  • lakim
    lakim over 13 years
    Does anyone know a Rails plugin that translates SQL functions for each RDBMS ? I still think it would be the best option.
  • pbaumann
    pbaumann over 13 years
    Well technically rails provides that functionality (hence the conversion of a conditions hash to a SQL query with the finder). If you're confident that you'll stick with SQLite and MySQL, they're similar in their date() function... Try daily_subscriber_counts = Subscriber.count(:group => "date(created_at)")
  • Kevin Bedell
    Kevin Bedell about 12 years
    This seems like basically pretty solid advice. Twisting rails to do things your database can do in a single query seem time-consuming and error prone -- not to mention the database will can probably generate the answer faster and without chewing up memory.