When to build a separate reporting database?


Solution 1

In general, the more mission critical the transactional app and the more sophisticated the reporting requirements, the more splitting makes sense.

  1. When transaction performance is critical.
  2. When it's hard to get a maintenance window on the transactional app.
  3. If reporting needs to correlate results not only from this app, but from other application silos.
  4. If the reports need to support trending or other types of reporting that are best suited for a star schema/Business Intelligence environment.
  5. If the reports are long running.
  6. If the transactional app is on an expensive hardware resource (cluster, mainframe, etc.)
  7. If you need to do data cleansing/extract-transform-load operations on the transactional data (e.g., state names to canonical state abbreviations).

It adds non-trivial complexity, so imo, there has to be a good reason to split.
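As a concrete illustration of point 7, here is a minimal sketch of the kind of cleansing you might do while loading transactional rows into a reporting table. The table and column names are invented for the example, and SQLite stands in for whatever databases are actually involved.

```python
# Minimal ETL-cleansing sketch (hypothetical table/column names), mapping
# free-form state names to canonical abbreviations on the way from the
# transactional store to the reporting store.
import sqlite3

STATE_ABBREVIATIONS = {
    "california": "CA",
    "new york": "NY",
    "texas": "TX",
    # ... the rest of the mapping would go here
}

def canonical_state(raw: str) -> str:
    """Return the canonical abbreviation, or the original value if unknown."""
    return STATE_ABBREVIATIONS.get(raw.strip().lower(), raw)

def load_orders(oltp: sqlite3.Connection, reporting: sqlite3.Connection) -> None:
    """Copy orders into the reporting database, cleansing the state column."""
    rows = oltp.execute("SELECT id, customer, state, amount FROM orders")
    cleansed = [(i, c, canonical_state(s), a) for i, c, s, a in rows]
    reporting.executemany(
        "INSERT OR REPLACE INTO orders_reporting VALUES (?, ?, ?, ?)", cleansed
    )
    reporting.commit()
```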

Solution 2

Typically, I would try to report off the transactional database initially.

Ensure that any indexes you add to facilitate efficient reporting are all frequently used. The more indexes you add, the poorer performance is going to be on inserts and (if you alter keys) updates.
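One way to check whether the reporting indexes you added are actually used is to ask the database's own statistics views. A minimal sketch, assuming PostgreSQL and the psycopg2 driver; the connection string is a placeholder:

```python
# Sketch: list indexes that have rarely or never been scanned, assuming
# PostgreSQL. Connection parameters are placeholders for the example.
import psycopg2

conn = psycopg2.connect("dbname=app user=report_reader")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT schemaname, relname AS table_name,
               indexrelname AS index_name, idx_scan
        FROM pg_stat_user_indexes
        ORDER BY idx_scan ASC
        """
    )
    for schema, table, index, scans in cur.fetchall():
        if scans == 0:
            print(f"unused index candidate: {schema}.{table}.{index}")
conn.close()
```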

When you do go to a reporting database, remember there are only a few reasons you are going there:

Ultimately, the number one thing about reporting databases is that you are removing locking contention from the OLTP database. So even if your reporting database is just a straight copy of the transactional database, you're simply using it for delayed snapshots, which won't interfere with production transactions.

Next, you can have a separate indexing strategy to support the reporting usage scenarios. These extra indexes are OK to maintain in the reporting database, but would cause unnecessary overhead in the OLTP database.
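For example, a nightly job might rebuild the reporting copy and then create the reporting-only indexes there, so the OLTP database never pays for them. A minimal sketch with SQLite stand-ins and hypothetical table names:

```python
# Sketch: refresh a reporting copy (delayed snapshot) and maintain
# reporting-only indexes there. Table names are hypothetical.
import sqlite3

def refresh_reporting_copy(oltp_path: str, reporting_path: str) -> None:
    oltp = sqlite3.connect(oltp_path)
    reporting = sqlite3.connect(reporting_path)

    # Delayed snapshot: rebuild the reporting table from the OLTP data.
    reporting.execute("DROP TABLE IF EXISTS sales")
    reporting.execute(
        "CREATE TABLE sales (id INTEGER, region TEXT, sold_on TEXT, amount REAL)"
    )
    rows = oltp.execute("SELECT id, region, sold_on, amount FROM sales").fetchall()
    reporting.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

    # Reporting-only indexes: useful for the report workload, but overhead
    # we would rather not impose on OLTP inserts and updates.
    reporting.execute("CREATE INDEX idx_sales_region ON sales(region)")
    reporting.execute("CREATE INDEX idx_sales_sold_on ON sales(sold_on)")
    reporting.commit()
    oltp.close()
    reporting.close()
```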

Both of the above could be done on the same server (even in the same instance, as a separate database or just a separate schema) and you would still see benefits. Once CPU and IO are completely pegged, you definitely need to move it to a completely separate box (or upgrade your single box).

Finally, for ultimate reporting flexibility, you denormalize the data (usually into a dimensional model or star schemas) so that the reporting database holds the same data in a different model. Reporting over large amounts of data (particularly aggregates) is extremely fast in dimensional models, because star schemas are very efficient for that. A dimensional model also handles a wider variety of queries without a lot of re-indexing or analysis to change indexes, because it lends itself better to unforeseen usage patterns (the old "slice and dice every which way" request).

You can view this as a kind of mini data warehouse, where you use data warehousing techniques without necessarily implementing a full-blown data warehouse. Star schemas are also particularly easy for users to get to grips with, and data dictionaries are much simpler and easier to build for BI or reporting tools from star schemas. You could do this on the same box or a different box, just as discussed earlier.
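A toy example of what "same data, different model" looks like: a fact table keyed to small dimension tables, queried with a grouped aggregate. All names are invented for the sketch.

```python
# Sketch: a tiny star schema (one fact table, two dimensions) and the kind of
# aggregate query that dimensional models answer efficiently.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INT, month INT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_key INT, product_key INT, amount REAL);
""")
db.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
               [(1, 2020, 1), (2, 2020, 2)])
db.executemany("INSERT INTO dim_product VALUES (?, ?)",
               [(1, "widgets"), (2, "gadgets")])
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
               [(1, 1, 100.0), (1, 2, 50.0), (2, 1, 75.0)])

# "Slice and dice": total sales by month and category is a simple join + GROUP BY.
for row in db.execute("""
        SELECT d.year, d.month, p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d    ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.year, d.month, p.category
        """):
    print(row)
```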

Solution 3

This question requires experience rather than science.

As a BI architect, the approach I take in designing each BI solution for my clients is very different. I don't go through a checklist; it requires a general understanding of their system, their reporting requirements, their budget, and the available manpower.

I personally prefer to keep the reporting processes as much as possible on the database side (a best practice in the BI world): REPORTING TOOLS ARE FOR DISPLAY PURPOSES ONLY (SMALL CALCULATIONS AT MOST). This approach requires a lot of pre-processing of the data, which in turn requires staging tables, triggers, and so on.
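One common form of this "do the work in the database" approach is a pre-aggregated summary table that the reporting tool only has to display. A minimal sketch with hypothetical table names, using SQLite as a stand-in:

```python
# Sketch: pre-aggregate into a summary (staging) table so the reporting tool
# only displays rows. Table names are hypothetical.
import sqlite3

def rebuild_daily_summary(reporting: sqlite3.Connection) -> None:
    reporting.executescript("""
        DROP TABLE IF EXISTS daily_sales_summary;
        CREATE TABLE daily_sales_summary AS
            SELECT sold_on AS day,
                   region,
                   COUNT(*)    AS order_count,
                   SUM(amount) AS total_amount
            FROM sales
            GROUP BY sold_on, region;
    """)
    reporting.commit()

# The dashboard then runs only trivial queries such as:
#   SELECT day, total_amount FROM daily_sales_summary WHERE region = 'EMEA';
```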

When you said:

"I work on projects with hundreds of millions of rows with real time reporting along with hundreds of users accessing the application/database at the same time without issue."

There are a few things wrong with your statement.

  1. Hundreds of millions of rows are A LOT. Even today's in-memory tools like Cognos TM1 or QlikView would struggle to deliver such results. (Look at SAP HANA to understand how the giants in the industry handle it.)

  2. If you have hundreds of millions of rows in the database, it doesn't necessarily mean that the report needs to go through all of those records. Maybe the report worked on thousands, not millions; that's probably what you saw.

  3. Transactional reports are very different from dashboards. Most dashboard tools pre-process and cache the data (a minimal caching sketch follows this list).
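A rough sketch of that pre-process-and-cache behaviour: cache the aggregate for a short TTL so hundreds of dashboard viewers do not each hit the database. The query and connection here are placeholders.

```python
# Sketch: cache a dashboard aggregate for a short TTL so repeated viewers
# don't re-run the query against the database.
import sqlite3
import time

_CACHE: dict[str, tuple[float, list]] = {}
TTL_SECONDS = 60

def cached_query(db: sqlite3.Connection, sql: str) -> list:
    now = time.monotonic()
    hit = _CACHE.get(sql)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                      # serve from cache
    rows = db.execute(sql).fetchall()      # otherwise hit the database once
    _CACHE[sql] = (now, rows)
    return rows
```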

My point is that it all comes down to experience when deciding whether to:

  1. design a new schema
  2. create a semantic database
  3. work on the same transactional database
  4. or even just use a reporting tool (sometimes hand-written dashboards with Java/JSF/Ajax/jQuery or JSP work fine for the client)

Solution 4

The main reason you would need a separate database for reporting is when the generation of the reports interferes with the transactional responsibilities of the app. For example, if a report takes 20 minutes to generate and saturates the CPU/disk during a time of high activity, you might think of using a separate database for reporting.

As for questions, here are some basic ones:

  1. Can I do the high intensity reports during non-peak hours?
  2. Does it interfere with the users using the system?
  3. If yes to #2, what is the cost of the interference vs. the cost of another database server, refactoring code, and so on?
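As a rough illustration of question 1, a report runner can simply refuse to start heavy reports inside a peak window and queue them for later. The 08:00-18:00 window below is an assumption for the example, not a recommendation from the answer itself.

```python
# Sketch: defer heavy reports to non-peak hours. The peak window is an
# assumed example value.
from datetime import datetime, time as dtime

PEAK_START, PEAK_END = dtime(8, 0), dtime(18, 0)

def run_or_defer(report_job, queue) -> None:
    now = datetime.now().time()
    if PEAK_START <= now <= PEAK_END:
        queue.append(report_job)    # pick it up after hours
    else:
        report_job()                # safe to hammer the database now
```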

Solution 5

I would also add another reason why you might use a reporting database: the CQRS pattern (Command Query Responsibility Segregation).

If you have a large number of users accessing and writing to a small set of data, you would be wise to consider this pattern. In its simplest form, it means that all your commands (create, update, delete) are pushed to the transactional database, and all of your queries (reads) go to your reporting database. This lets you scale and evolve each side of the architecture independently.

There is much more to the pattern; I just mentioned the part that is relevant to your question about a reporting database.
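A minimal sketch of the read/write split, with SQLite connections and an orders table standing in as placeholders: commands go to the transactional database, queries to the reporting database.

```python
# Sketch of the CQRS split: commands hit the transactional (write) database,
# queries hit the reporting (read) database. Names are illustrative only.
import sqlite3

class OrderCommands:
    """Write side: all state changes go to the transactional database."""
    def __init__(self, oltp: sqlite3.Connection):
        self.oltp = oltp

    def place_order(self, customer: str, amount: float) -> None:
        self.oltp.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )
        self.oltp.commit()

class OrderQueries:
    """Read side: all reads go to the reporting database (a synced replica)."""
    def __init__(self, reporting: sqlite3.Connection):
        self.reporting = reporting

    def total_by_customer(self, customer: str) -> float:
        row = self.reporting.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
            (customer,),
        ).fetchone()
        return row[0]
```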


Comments

  • Adrian K
    Adrian K about 4 years

    We're building an application that has a database (yeah, pretty exciting huh :). The database is mainly transactional (to support the app) and also does a bit of "reporting" as part of the app - but nothing too strenuous.

Above and beyond that we have some reporting requirements - but they're pretty vague and high-level at the moment. We have a standard reporting tool that we use in-house, which we'll use to do the "heavier" reporting as the requirements solidify.

    My question is: how do you know when a separate database for reporting is required?

    What sort of questions need to be asked? What sort of things would make you decide a separate reporting database was necessary?

  • northpole
    northpole almost 14 years
    databases are designed to negate this issue. A properly set up database and reporting system should not have an issue.
  • northpole
    northpole almost 14 years
    Properly indexed tables and tuned queries would hopefully make this a non issue. I am not convinced performance would require separately maintained databases.
  • DVK
    DVK almost 14 years
    @northpole - sometimes you just can't tune any more. If a query at max tuning takes 1 hour to run, you can't wish that away. The above answer kinda assumes you already tuned as best as it could be.
  • northpole
    northpole almost 14 years
@DVK, and your next solution would be to create and maintain a separate-but-equal database? Why not consider new hardware, or a new database altogether, like an Oracle cluster? You suggest timing is your main reason, but how can you expect good performance when you have to do real-time replication in order to have current data in both?
  • northpole
    northpole almost 14 years
@DVK, I don't buy that the correct assumption is that you have already tuned the database as well as possible. I work on projects with hundreds of millions of rows with real-time reporting, along with hundreds of users accessing the application/database at the same time without issue.
  • DVK
    DVK almost 14 years
@northpole - if your first guess is "separate but equal", you are putting words into my mouth that weren't there. And I'm sure you are a better DB tuner than a bunch of fairly good developers and a set of high-level DBAs at one of the major and very technology-centric financial firms out there, especially on a DB, a codebase, and a problem you have never seen in your life.
  • northpole
    northpole almost 14 years
    @DVK, I was not trying to put words in your mouth, but I had to assume based off the info you provided. If you were trying to explain something like Cade Roux, then my apologies for generalizing your answer.
  • Corith Malin
    Corith Malin almost 14 years
Not all databases are perfect, and sometimes taking the approach of throwing more hardware at a problem is much cheaper than hiring developers, DBAs, project managers, etc. to recode the application for a more optimized database design. Your comment is similar to saying that jet engines negate all the issues of prop planes. While that is true, a prop plane is still a whole lot cheaper to build, fly, and maintain (and doesn't require as highly trained a pilot).
  • Adrian K
    Adrian K almost 14 years
@northpole - one advantage of a new database over new hardware or a cluster is cost. Yes, the additional complexity (cost) of a second DB has to be taken into account, but I doubt the costs are in the same ball-park. Also, depending on the organisation, the operational costs of managing a new DB come out of a different budget than the capital costs of new hardware - so there's also a political / fiscal dimension to the problem.
  • Adrian K
    Adrian K almost 14 years
Thanks - I like your reasoning and explanations.