Django 1.11 Annotating a Subquery Aggregate

49,589

Solution 1

It's also possible to create a subclass of Subquery, that changes the SQL it outputs. For instance, you can use:

class SQCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = models.IntegerField()

You then use this as you would the original Subquery class:

spaces = Space.objects.filter(carpark=OuterRef('pk')).values('pk')
Carpark.objects.annotate(space_count=SQCount(spaces))

You can use this trick (at least in postgres) with a range of aggregating functions: I often use it to build up an array of values, or sum them.

Solution 2

Shazaam! Per my edits, an additional column was being output from my subquery. This was to facilitate ordering (which just isn't required in a COUNT).

I just needed to remove the prescribed meta-order from the model. You can do this by just adding an empty .order_by() to the subquery. In my code terms that meant:

from django.db.models import Count, OuterRef, Subquery

spaces = Space.objects.filter(carpark=OuterRef('pk')).order_by().values('carpark')
count_spaces = spaces.annotate(c=Count('*')).values('c')
Carpark.objects.annotate(space_count=Subquery(count_spaces))

And that works. Superbly. So annoying.

Solution 3

I just bumped into a VERY similar case, where I had to get seat reservations for events where the reservation status is not cancelled. After trying to figure the problem out for hours, here's what I've seen as the root cause of the problem:

Preface: this is MariaDB, Django 1.11.

When you annotate a query, it gets a GROUP BY clause with the fields you select (basically what's in your values() query selection). After investigating with the MariaDB command line tool why I'm getting NULLs or Nones on the query results, I've came to the conclusion that the GROUP BY clause will cause the COUNT() to return NULLs.

Then, I started diving into the QuerySet interface to see how can I manually, forcibly remove the GROUP BY from the DB queries, and came up with the following code:

from django.db.models.fields import PositiveIntegerField

reserved_seats_qs = SeatReservation.objects.filter(
        performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
    ).values('id').annotate(
        count=Count('id')).values('count')
# Query workaround: remove GROUP BY from subquery. Test this
# vigorously!
reserved_seats_qs.query.group_by = []

performances_qs = Performance.objects.annotate(
    reserved_seats=Subquery(
        queryset=reserved_seats_qs,
        output_field=PositiveIntegerField()))

print(performances_qs[0].reserved_seats)

So basically, you have to manually remove/update the group_by field on the subquery's queryset in order for it to not have a GROUP BY appended on it on execution time. Also, you'll have to specify what output field the subquery will have, as it seems that Django fails to recognize it automatically, and raises exceptions on the first evaluation of the queryset. Interestingly, the second evaluation succeeds without it.

I believe this is a Django bug, or an inefficiency in subqueries. I'll create a bug report about it.

Edit: the bug report is here.

Solution 4

Problem

The problem is that Django adds GROUP BY as soon as it sees using an aggregate function.

Solution

So you can just create your own aggregate function but so that Django thinks it is not aggregate. Just like this:

total_comments = Comment.objects.filter(
    post=OuterRef('pk')
).order_by().annotate(
    total=Func(F('length'), function='SUM')
).values('total')

Post.objects.filter(length__gt=Subquery(total_comments))

This way you get the SQL query like this:

SELECT "testapp_post"."id", "testapp_post"."length"
FROM "testapp_post"
WHERE "testapp_post"."length" > (SELECT SUM(U0."length") AS "total"
                                 FROM "testapp_comment" U0
                                 WHERE U0."post_id" = "testapp_post"."id")

So you can even use aggregate subqueries in aggregate functions.

Example

You can count the number of workdays between two dates, excluding weekends and holidays, and aggregate and summarize them by employee:

class NonWorkDay(models.Model):
    date = DateField()

class WorkPeriod(models.Model):
    employee = models.ForeignKey(User, on_delete=models.CASCADE)
    start_date = DateField()
    end_date = DateField()

number_of_non_work_days = NonWorkDay.objects.filter(
    date__gte=OuterRef('start_date'),
    date__lte=OuterRef('end_date'),
).annotate(
    cnt=Func('id', function='COUNT')
).values('cnt')

WorkPeriod.objects.values('employee').order_by().annotate(
    number_of_word_days=Sum(F('end_date__year') - F('start_date__year') - number_of_non_work_days)
)

Hope this will help!

Solution 5

A solution which would work for any general aggregation could be implemented using Window classes from Django 2.0. I have added this to the Django tracker ticket as well.

This allows the aggregation of annotated values by calculating the aggregate over partitions based on the outer query model (in the GROUP BY clause), then annotating that data to every row in the subquery queryset. The subquery can then use the aggregated data from the first row returned and ignore the other rows.

Performance.objects.annotate(
    reserved_seats=Subquery(
        SeatReservation.objects.filter(
            performance=OuterRef(name='pk'),
            status__in=TAKEN_TYPES,
        ).annotate(
            reserved_seat_count=Window(
                expression=Count('pk'),
                partition_by=[F('performance')]
            ),
        ).values('reserved_seat_count')[:1],
        output_field=FloatField()
    )
)
Share:
49,589
Oli
Author by

Oli

Hi, I'm Oli and I'm a "full-stack" web-dev-op. Eurgh. I'm also allergic to jargon BS. I spend most of my professional time writing Django websites and webapps for SMEs. I write a lot of Python outside of Django sites too. I administer various Linux servers for various tasks. I contribute to the open source projects that I use when I can. I'm a full-time Linux user and that has lead to helping other people live the dream. I am an official Ubuntu Member and I earnt my ♦ on SE's own Ask Ubuntu in 2011's moderator election. That's probably where I spend most of my unpaid time. I also run thepcspy.com which has been my place to write for the last decade or so. If you need to contact me for extended help, you can do so via my website, just remember that I have bills so if I feel your request is above and beyond normal duty, I might ask for remuneration for one-on-one support. For more social contact, you can usually find me (or just my computer) lurking in the Ask Ubuntu General Chat Room and on Freenode in #ubuntu and #ubuntu-uk under the handle Oli or Oli``.

Updated on July 17, 2022

Comments

  • Oli
    Oli almost 2 years

    This is a bleeding-edge feature that I'm currently skewered upon and quickly bleeding out. I want to annotate a subquery-aggregate onto an existing queryset. Doing this before 1.11 either meant custom SQL or hammering the database. Here's the documentation for this, and the example from it:

    from django.db.models import OuterRef, Subquery, Sum
    comments = Comment.objects.filter(post=OuterRef('pk')).values('post')
    total_comments = comments.annotate(total=Sum('length')).values('total')
    Post.objects.filter(length__gt=Subquery(total_comments))
    

    They're annotating on the aggregate, which seems weird to me, but whatever.

    I'm struggling with this so I'm boiling it right back to the simplest real-world example I have data for. I have Carparks which contain many Spaces. Use Book→Author if that makes you happier but —for now— I just want to annotate on a count of the related model using Subquery*.

    spaces = Space.objects.filter(carpark=OuterRef('pk')).values('carpark')
    count_spaces = spaces.annotate(c=Count('*')).values('c')
    Carpark.objects.annotate(space_count=Subquery(count_spaces))
    

    This gives me a lovely ProgrammingError: more than one row returned by a subquery used as an expression and in my head, this error makes perfect sense. The subquery is returning a list of spaces with the annotated-on total.

    The example suggested that some sort of magic would happen and I'd end up with a number I could use. But that's not happening here? How do I annotate on aggregate Subquery data?

    Hmm, something's being added to my query's SQL...

    I built a new Carpark/Space model and it worked. So the next step is working out what's poisoning my SQL. On Laurent's advice, I took a look at the SQL and tried to make it more like the version they posted in their answer. And this is where I found the real problem:

    SELECT "bookings_carpark".*, (SELECT COUNT(U0."id") AS "c"
    FROM "bookings_space" U0
    WHERE U0."carpark_id" = ("bookings_carpark"."id")
    GROUP BY U0."carpark_id", U0."space"
    )
    AS "space_count" FROM "bookings_carpark";

    I've highlighted it but it's that subquery's GROUP BY ... U0."space". It's retuning both for some reason. Investigations continue.

    Edit 2: Okay, just looking at the subquery SQL I can see that second group by coming through ☹

    In [12]: print(Space.objects_standard.filter().values('carpark').annotate(c=Count('*')).values('c').query)
    SELECT COUNT(*) AS "c" FROM "bookings_space" GROUP BY "bookings_space"."carpark_id", "bookings_space"."space" ORDER BY "bookings_space"."carpark_id" ASC, "bookings_space"."space" ASC
    

    Edit 3: Okay! Both these models have sort orders. These are being carried through to the subquery. It's these orders that are bloating out my query and breaking it.

    I guess this might be a bug in Django but short of removing the Meta-order_by on both these models, is there any way I can unsort a query at querytime?


    *I know I could just annotate a Count for this example. My real purpose for using this is a much more complex filter-count but I can't even get this working.

  • Oli
    Oli about 7 years
    Good tip and thanks for the known-good SQL. This has actually highlighted that something extra is being added to my query (an extra group-by in the subquuery) but I'm still not sure why. I'll edit my question with the illicit SQL.
  • Oli
    Oli about 7 years
    Awarding this the bounty because it got me on track to help myself, but accepting my own answer because that's most likely to help the next person.
  • Laurent S
    Laurent S about 7 years
    I was confused why any ordering at all would be required. Would you by any chance use mysql as a backend? Django source has requires_explicit_null_ordering_when_grouping = False in db/backends/base/features.py which only mysql sets to True. I can't figure out where this is used except in a unit test (github.com/django/django/blob/…), but this might explain why you're getting this extra ORDER BY.
  • Oli
    Oli about 7 years
    No, the ordering being applied is the "natural" Model-specified ordering. But it breaks these subqueries. I think I've found a bug. It's super simple to repro.
  • karolyi
    karolyi almost 7 years
    @Oli, have a look at my solution, I'm actually discussing it with django devs. It has nothing to do with the order_by() clause.
  • Oli
    Oli almost 7 years
    @karolyi It may have been an indirect fix to underlying issues, but overriding the model-specified ordering very much did fix this for me. The additional GROUP BY was being caused by there being a superfluous sort.
  • Matthew Schinckel
    Matthew Schinckel almost 7 years
    As it turns out, the Exists subclass of Subquery removes the ordering (which was actually done for performance reasons, since there's no reason to sort the subquery). This is (very lightly) documented, but perhaps the documentation for Subquery could be improved too.
  • Matthew Schinckel
    Matthew Schinckel almost 7 years
    You may be able to write a custom subclass of Subquery, that allows you to avoid the .values().annotate().values() dance. For instance: class Count(Subquery): template = "SELECT count(*) FROM (%(subquery)s) _count"
  • karolyi
    karolyi almost 7 years
    That is a valid solution, thx. My goal was to stay on the ORM level and be SQL independent as much as possible. Hence the group_by tweak.
  • minder
    minder almost 7 years
    Thank you for this post. There's still not enough proper documentation on subqueries introduced in Django 1.11. This is so far the best problem solver.
  • pgcd
    pgcd over 6 years
    I spent several fruitless hours trying to juggle annotate()s and values()s to aggregate over nested subqueries (counting objects only if their related objects satisfied certain parameters with an __in as well) and the one from @MatthewSchinckel is the only solution that worked for me. I believe it should have a bit more visibility.
  • Oli
    Oli over 6 years
    That's an interesting idea (and moderately neat, even with the name conflict) but if the underlying bug I was hitting (model defined ordering adding am extra group-by) still exists in the Subquery SQL composition, this will hit it too.
  • Matthew Schinckel
    Matthew Schinckel over 6 years
    I'm not sure it will hit the extra group-by name, because the subquery is used within another subquery, that returns a single value.
  • Salvatore Fiorenza
    Salvatore Fiorenza almost 6 years
    Thanks for the trick! Is worth mentioning that you could combine this trick with @karolyi answer to remove any order by that you can have in your queryset, only is required to edit resolve_expression on the SQCount class.
  • Matthew Schinckel
    Matthew Schinckel almost 6 years
    IIRC not all databases support window functions (and possibly, not all aggregate functions can be used as window functions). Having said that, the window stuff is a great addition, and I look forward to having a project using a version of django that supports it!
  • Matthew Schinckel
    Matthew Schinckel almost 6 years
    You actually make a really good point here (and I think we probably haven't hammered that home enough in the django docs): subqueries will perform less well in situations where a join+aggregation, via a .annotate() would work. In general, I use them where I need to reference or aggregate over a subset of related rows.
  • vdboor
    vdboor over 5 years
    This generates an awful lot of unneeded SQL. The subquery becomes: (SELECT count(SELECT id, field1, field2, field3, ... FROM space WHERE ..)). The Solution below from @Oli performs much better, which becomes (SELECT COUNT(*) FROM space WHERE ..)
  • vdboor
    vdboor over 5 years
    This should be the accepted answer! It produces much shorter and more efficient SQL.
  • Matthew Schinckel
    Matthew Schinckel over 5 years
    Are you sure? The SQCount subquery uses an explicit template that only does a count(*): it doesn't include the other columns at all.
  • Matthew Schinckel
    Matthew Schinckel over 5 years
    ...however, I do think that the other answer (using a proper Count) is actually a better solution, because it doesn't require the subclassing.
  • themanatuf
    themanatuf almost 5 years
    Anyone using this query syntax on a Space model that has millions of rows in it and a large number of Carpark objects? I'm finding that internally the database executes the subquery once per Carpark which could lead to very long query times for large datasets. Any other comments would be appreciated!
  • Skratt
    Skratt about 3 years
    It worked extremely well for me in SQlite and far quicker than the conventional Django Count annotation ... however when I tried in a MariaDB environnement ... it crashed ! MariaDB is apparently not able / not willing to handle correlated subqueries as those are considered sub-optimal. Would anyone be able to help me figure out a way to implement those helper functions for MariaDB ? What should template be in this environnement?
  • Skratt
    Skratt about 3 years
    Ok, I was finally able to find a Database agnostic solution that works also with MariaDB. I published the solution under: stackoverflow.com/questions/61677702/…
  • slamora
    slamora almost 3 years
  • JHS
    JHS almost 3 years
    This solution fixed my Count subquery. Glad I gave it a shot, thank you. Django was putting a GROUP BY clause into my output query in a way that was just not helpful. As others have recommended, you can avoid the ORM's output SQL entirely with a subclass.