What are the pros and cons of performing calculations in SQL vs. in your application?


Solution 1

It depends on a lot of factors - but most crucially:

  • complexity of calculations (prefer doing complex crunching on an app server, since that scales out, rather than on a db server, which scales up)
  • volume of data (if you need to access/aggregate a lot of data, doing it at the db server will save bandwidth, and disk IO if the aggregates can be done inside indexes)
  • convenience (SQL is not the best language for complex work - especially not great for procedural work, but very good for set-based work; lousy error handling, though)

As always, if you do bring the data back to the app-server, minimising the columns and rows will be to your advantage. Making sure the query is tuned and appropriately indexed will help either scenario.

Re your note:

and then loop through the records

Looping through records is almost always the wrong thing to do in sql - writing a set-based operation is preferred.
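To make that concrete, here is a minimal sketch (table and column names taken from the question; the SQL in the comments is illustrative): looping in the application means shipping every row just to add them up, whereas a set-based aggregate ships a single row. The runnable part only simulates the application-side loop over already-fetched amounts.

```java
import java.math.BigDecimal;
import java.util.List;

// Set-based vs. loop, sketched. Instead of fetching every row and looping:
//     SELECT amount FROM shopkeeper WHERE ...       -- N rows over the wire
// a set-based aggregate ships one row:
//     SELECT SUM(amount) FROM shopkeeper WHERE ...  -- 1 row over the wire
// The Java below stands in for the application-side loop on fetched rows.
public class SetBasedExample {

    // What the app does after fetching N rows - the loop that becomes
    // unnecessary once the database computes the aggregate itself.
    static BigDecimal sumInApp(List<BigDecimal> fetchedAmounts) {
        BigDecimal total = BigDecimal.ZERO;
        for (BigDecimal amount : fetchedAmounts) {
            total = total.add(amount);
        }
        return total;
    }

    public static void main(String[] args) {
        List<BigDecimal> rows = List.of(
                new BigDecimal("12.34"), new BigDecimal("0.66"), new BigDecimal("7.00"));
        System.out.println(sumInApp(rows)); // 20.00
    }
}
```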

As a general rule, I prefer to keep the database's job to a minimum "store this data, fetch this data" - however, there are always examples of scenarios where an elegant query at the server can save a lot of bandwidth.

Also consider: if this is computationally expensive, can it be cached somewhere?

If you want an accurate "which is better"; code it both ways and compare it (noting that a first draft of either is likely not 100% tuned). But factor in typical usage to that: if, in reality, it is being called 5 times (separately) at once, then simulate that: don't compare just a single "1 of these vs 1 of those".
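A minimal shape for that comparison might look like the sketch below. The workload here is a stand-in; in a real test each Supplier would exercise the full query-plus-processing path, run under the concurrency you actually expect.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Tiny "code it both ways and compare" harness: warm up first, take several
// samples, report the median so one slow outlier doesn't decide the result.
public class CompareBothWays {

    static long medianNanos(Supplier<?> task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) task.get(); // let JIT/caches settle
        long[] samples = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.get();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[timedRuns / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the real "query + process in app"
        // and "calculate in SQL" variants when comparing.
        Supplier<Long> inApp = () -> {
            long acc = 0;
            for (int i = 0; i < 100_000; i++) acc += i;
            return acc;
        };
        System.out.println("in-app median ns: " + medianNanos(inApp, 3, 9));
    }
}
```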

Solution 2

Let me use a metaphor: if you want to buy a golden necklace in Paris, the goldsmith could sit in Cape Town or in Paris; that is a matter of skill and taste. But you would never ship tons of gold ore from South Africa to France for that. The ore is processed at the mining site (or at least in the general area); only the gold gets shipped. The same should be true for apps and databases.

As far as PostgreSQL is concerned, you can do almost anything on the server, quite efficiently. The RDBMS excels at complex queries. For procedural needs you can choose from a variety of server-side script languages: tcl, python, perl and many more. Mostly I use PL/pgSQL, though.

The worst case scenario would be to repeatedly go to the server for every single row of a larger set. (That would be like shipping one ton of ore at a time.)

Second in line is sending a cascade of queries, each depending on the one before, when all of it could be done in one query or procedure on the server. (That's like shipping the gold, and each of the jewels, with a separate ship, sequentially.)

Going back and forth between app and server is expensive - for server and client alike. Try to cut down on that, and you will win - ergo: use server-side procedures and / or sophisticated SQL where necessary.
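A back-of-envelope model shows why round trips dominate: with an assumed latency figure (the 5 ms used below is hypothetical, not a measurement), the latency cost scales with the number of round trips, not with the work done per trip.

```java
// Back-of-envelope latency model for "going back and forth is expensive".
// The 5 ms per-trip figure is an assumption for illustration only.
public class RoundTripCost {

    // Pure latency cost: number of round trips times per-trip latency.
    static double latencyCostMillis(int roundTrips, double latencyMillisPerTrip) {
        return roundTrips * latencyMillisPerTrip;
    }

    public static void main(String[] args) {
        double perRow = latencyCostMillis(10_000, 5.0); // one query per row
        double oneSet = latencyCostMillis(1, 5.0);      // one set-based query
        System.out.println("per-row queries: " + perRow + " ms");     // 50000.0 ms
        System.out.println("one set-based query: " + oneSet + " ms"); // 5.0 ms
    }
}
```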

We just finished a project where we packed almost all complex queries into Postgres functions. The app hands over parameters and gets the datasets it needs. Fast, clean, simple (for the app developer), I/O reduced to a minimum ... a shiny necklace with a low carbon footprint.

Solution 3

In this case you are probably slightly better off doing the calculation in SQL, as the database engine is likely to have more efficient decimal arithmetic routines than Java.
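That said, Java can do the question's cents conversion exactly too, as long as BigDecimal is used rather than double; a small sketch:

```java
import java.math.BigDecimal;

// The cents conversion from the question, done exactly in Java.
// BigDecimal avoids the binary floating-point rounding that double
// arithmetic (e.g. 19.99 * 100) would introduce.
public class CentsInJava {

    static long toCents(BigDecimal amount) {
        // Shift the decimal point two places; throws if precision would be lost.
        return amount.movePointRight(2).longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toCents(new BigDecimal("12.34"))); // 1234
        System.out.println(toCents(new BigDecimal("19.99"))); // 1999
    }
}
```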

Generally, though, for row-level calculations there is not much difference.

Where it does make a difference is:

  • Aggregate calculations like SUM(), AVG(), MIN(), MAX(): here the database engine will be an order of magnitude faster than a Java implementation.
  • Anywhere the calculation is used to filter rows. Filtering at the DB is much more efficient than reading a row and then discarding it.
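The filtering point can be sketched with an in-memory stand-in for the table: filtering in the application means every row crosses the wire first, while a WHERE clause ships only the matches. The row counts below are illustrative.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Filter-in-app vs. filter-in-DB, modelled by counting "transferred" rows.
// The List stands in for the table; the Predicate for a WHERE clause.
public class FilterPlacement {

    // App-side filter: every row is transferred, then most are discarded
    // by the application (the predicate is applied only after the transfer).
    static int rowsTransferredFilteringInApp(List<BigDecimal> table,
                                             Predicate<BigDecimal> where) {
        return table.size(); // everything crosses the wire before filtering
    }

    // DB-side filter: only matching rows are transferred.
    static int rowsTransferredFilteringInDb(List<BigDecimal> table,
                                            Predicate<BigDecimal> where) {
        int matches = 0;
        for (BigDecimal row : table) {
            if (where.test(row)) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        List<BigDecimal> table = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) table.add(BigDecimal.valueOf(i));
        Predicate<BigDecimal> bigAmounts =
                a -> a.compareTo(BigDecimal.valueOf(990)) >= 0;
        System.out.println(rowsTransferredFilteringInApp(table, bigAmounts)); // 1000
        System.out.println(rowsTransferredFilteringInDb(table, bigAmounts));  // 10
    }
}
```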

Solution 4

There's no black / white with respect to what parts of data access logic should be performed in SQL and what parts should be performed in your application. I like Marc Gravell's wording, distinguishing between

  • complex calculations
  • data-intensive calculations

The power and expressivity of SQL is heavily underestimated. Since the introduction of window functions, a lot of non-strictly set-oriented calculations can be performed very easily and elegantly in the database.
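For instance, a running total that once required a self-join or application code is a one-liner with a window function. As a hedged sketch, the SQL in the comment below assumes the question's shopkeeper table, and the Java loop is the application-side equivalent of that window calculation:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// What a window function computes, written out as application code:
//     SELECT createddate, SUM(amount) OVER (ORDER BY createddate)
//     FROM shopkeeper;
// The loop below is the in-app equivalent of that running SUM.
public class RunningTotal {

    static List<BigDecimal> runningTotals(List<BigDecimal> amountsInDateOrder) {
        List<BigDecimal> totals = new ArrayList<>();
        BigDecimal runningSum = BigDecimal.ZERO;
        for (BigDecimal amount : amountsInDateOrder) {
            runningSum = runningSum.add(amount);
            totals.add(runningSum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<BigDecimal> amounts = List.of(
                new BigDecimal("1.00"), new BigDecimal("2.50"), new BigDecimal("0.50"));
        System.out.println(runningTotals(amounts)); // [1.00, 3.50, 4.00]
    }
}
```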

Three rules of thumb should always be followed, regardless of the overall application architecture:

  • keep the amount of data transferred between database and application slim (in favour of calculating stuff in the DB)
  • keep the amount of data loaded from the disk by the database slim (in favour of letting the database optimise statements to avoid unnecessary data access)
  • don't push the database to its CPU limits with complex, concurrent calculations (in favour of pulling data into application memory and performing calculations there)

In my experience, with a decent DBA and some decent knowledge about your decent database, you won't run into your DB's CPU limits very soon.

Solution 5

In general, do things in SQL if there is a chance that other modules or components in the same or other projects will need those results. An atomic operation done server-side is also better, because you can invoke the stored procedure from any db management tool and get the final values without further processing.

In some cases this does not apply, but when it does it makes sense. Also, in general, the db box has the best hardware and performance.


Author: hellojava

Updated on July 22, 2022

Comments

  • hellojava
    hellojava almost 2 years

    The shopkeeper table has the following fields:

    id (bigint), amount (numeric(19,2)), createddate (timestamp)
    

    Let's say, I have the above table. I want to get the records for yesterday and generate a report by having the amount printed to cents.

    One way of doing this is to perform the calculations in my java application and execute a simple query

    Date previousDate; // $1 calculated in application
    
    Date todayDate;    // $2 calculated in application
    
    select amount from shopkeeper where createddate between $1 and $2
    

    and then loop through the records and convert amount to cents in my java application and generate the report

    Another way is to perform the calculations in the SQL query itself:

    select cast(amount * 100 as int) as "Cents"
    from shopkeeper
    where createddate between date_trunc('day', now()) - interval '1 day'
                          and date_trunc('day', now())
    

    and then loop through the records and generate the report

    In one way, all my processing is done in the java application and a simple query is fired. In the other case, all the conversions and calculations are done in the SQL query.

    The above use case is just an example; in a real scenario a table can have many columns that require similar processing.

    Can you please tell me which approach is better in terms of performance and other aspects and why?

    • Morg.
      Morg. over 12 years
      The date calculations will have little to no effect at all - assuming your sql engine will indeed calculate your dates only once. Having them defined in your application makes perfect sense, since they will be defined there at some point anyway, be it for the report title or other things. Multiplying the value by 100 in this case could be done on any tier, since you will be looping through those rows anyway for rendering, and *100 is unlikely to be slower on any tier except the front-end. In either case your calculations are minimal and dwarfed by the surrounding operations, so they are not a performance concern.
  • wildplasser
    wildplasser over 12 years
    Looping implies more-or-less "row-at-a-time" processing. And that means 2× network latency plus four context switches per round trip. Yes: that is expensive. A "native" DBMS operation does all the hard work to minimise disk I/Os (system calls) and manages to fetch more than one row per system call. Row-at-a-time takes at least four system calls.
  • Marc Gravell
    Marc Gravell over 12 years
    @wildplasser not necessarily; the server could be streaming rows which you consume as they arrive - a "reader" metaphor is not uncommon.
  • wildplasser
    wildplasser over 12 years
    @Marc Gravell: Well, it depends. In the case where the footprint of an application program is only one logical record, it's more or less OK. But most of the "frameworks" I know tend to suck in all the records at startup and fire them off, one by one. Locking is another pitfall.
  • Morg.
    Morg. over 12 years
    Reusability can be present at any tier and is not a reason (performance-wise) to put more calculations in SQL. "In general the db box": this is wrong, and furthermore, as Marc Gravell said, scaling does not work in the same fashion. Most databases require little hardware to be run decently, and the performance pattern has little to do with that of an application server (i.e. I'd spend 2/3rds of my budget for an SQL server on godlike IO, whereas I wouldn't spend more than a few hundred for an appserver's storage stack).
  • Sklivvz
    Sklivvz over 10 years
    I think that a good rule of thumb is: don't bring back from SQL server rows of data you don't ultimately need. For example, if you have to perform aggregate operations, they likely belong in SQL. Joins between tables or subqueries? SQL. That's also the approach we use with badges, and, so far, we are coping with scale :-)
  • Doug
    Doug over 10 years
    I'd be cautious about using this analogy to make design decisions meaningfully with other developers. Analogies are more of a rhetorical device than a logical one. Among other factors, it's a lot cheaper to ship data to an app server than it is to ship golden ore to a goldsmith.
  • Guru
    Guru over 10 years
    As someone who has written and maintained DB-based applications, with years of experience seeing complex SQL, I have to agree: maintaining complex SQL is the worst thing to do. Any SQL that has a handful of tables and joins is a candidate for being broken up and done elsewhere.
  • 200_success
    200_success over 10 years
    If your query can be reasonably expressed as an SQL query, it probably should be. SQL is usually so much simpler than Java/.NET code, and much faster than sending all the data to the client. If you are worried about the complexity of the calculations hogging the database server's CPU, you can replicate your database to a reporting database.
  • Dainius
    Dainius over 10 years
    You will send ore or gold depending on what is cheaper; if you don't have the technology to convert ore to gold, or it's too expensive (because miners want to kill those other workers), you will ship it to another location, maybe in between the goldsmith and the miners, especially if you have more than one goldsmith.
  • Dainius
    Dainius over 10 years
    If you change from jboss to ruby, it's very likely that you will change the db (and you will need to adapt these calculations anyway), and it's not that unlikely that you could change to something more different, like nosql.
  • zinking
    zinking over 10 years
    Exactly what I agree with. I don't think it is always a bad thing to do loop-based calculation in SQL @a_horse_with_no_name; sometimes this has to be done anyway. I would rather it be calculated when the data is fetched, as Erwin's metaphor indicates; otherwise you have to repeat it, at a cost, when the data is fetched back.
  • zinking
    zinking over 10 years
    "Looping is almost always the wrong thing in sql" only holds if you narrow "looping" down to how the database performs joins, causing a performance hit. But to me "looping" could also mean applying a certain calculation to all records, in which case it is not wrong at all.
  • yfeldblum
    yfeldblum over 10 years
    -1 Because it's a one-sided argument, ignores trade-offs, and sets up a straw man for the opposing side instead of considering and refuting the best case of the opposing side. "Going back and forth between app and server is expensive" - absolutely: but it is not the only thing that is expensive, and the various expenses must be weighed against each other. It may turn out that "sophisticated SQL" queries or stored procedures are the best for the particular case; but the details of the case must generally be taken into account when making that kind of determination.
  • Marc Gravell
    Marc Gravell over 10 years
    @zinking that would be a set-based operation. In that scenario you don't write the loop code - that is an implementation detail. By "looping" I mean explicit loops, for example a cursor
  • Chris Koston
    Chris Koston over 10 years
      Cool analogy, but unfortunately it's based on wrong assumptions. Shipping gold ore is very common. The gold stripping ratio is about 1:1 (gold to waste); however, it's often cheaper to process it offsite, where better equipment and quality of workmanship are available. Depending on the size of the shipment, increasing the processing efficiency by 0.1% may allow a relative increase of the revenue (despite the doubled shipping price) - as gold is quite expensive these days. Other ores, like iron for example, are typically shipped too (iron's stripping ratio is about 60%!).
  • allenwlee
    allenwlee over 9 years
    @MarcGravell can you please expand on your comment "sql is not the best language for complex work - especially not great for procedural work, but very good for set-based work"? Can you direct me to an article that explains what you mean by 'procedural' versus 'set-based' work?
  • Erwin Brandstetter
    Erwin Brandstetter over 7 years
    @Chris: OK, that's where the analogy does not work then, because with Postgres, the best "equipment and quality of workmanship" happens to be right there in the RDBMS as well, for most purposes.
  • Muhammad Omer Aslam
    Muhammad Omer Aslam about 5 years
    So should I say that moving ( 3959 * acos( cos( radians(37) ) * cos( radians( lat ) ) * cos( radians( lng ) - radians(-122) ) + sin( radians(37) ) * sin( radians( lat ) ) ) ) from mysql functions to php or any other server side would be a better option?
  • Lenny
    Lenny about 5 years
    However, if you need to do this millions of times as quickly as possible, it is much easier to spawn parallel python apps than db replicas. Up until a certain scale, leaning more on SQL is certainly faster / cheaper, but eventually there is a tipping point where it's better to do the calculation in your application.