How does data denormalization work with the Microservice Pattern?


Solution 1

This is subjective but the following solution worked for me, my team, and our DB team.

  • At the application layer, microservices are decomposed by semantic function.
    • e.g. a Contact service might CRUD contacts (metadata about contacts: names, phone numbers, contact info, etc.)
    • e.g. a User service might CRUD users with login credentials, authorization roles, etc.
    • e.g. a Payment service might CRUD payments and work under the hood with a 3rd-party PCI-compliant service like Stripe, etc.
  • At the DB layer, the tables can be organized however the devs/DBAs/devops people want them organized

The problem is with cascading and service boundaries: Payments might need a User to know who is making a payment. Instead of modeling your services like this:

interface PaymentService {
    PaymentInfo makePayment(User user, Payment payment);
}

Model it like so:

interface PaymentService {
    PaymentInfo makePayment(Long userId, Payment payment);
}

This way, entities that belong to other microservices are referenced inside a particular service only by ID, not by object reference. This allows DB tables to have foreign keys all over the place, but at the app layer "foreign" entities (that is, entities living in other services) are available only via ID. This stops object cascading from growing out of control and cleanly delineates service boundaries.

The cost it does incur is more network calls. For instance, if I gave each Payment entity a User reference, I could get the user for a particular payment with a single call:

User user = paymentService.getUserForPayment(payment);

But using what I'm suggesting here, you'll need two calls:

Long userId = paymentService.getPayment(payment).getUserId();
User user = userService.getUserById(userId);

This may be a deal breaker. But if you're smart about it, implement caching, and build well-engineered microservices that respond in 50-100 ms per call, I have no doubt that these extra network calls can be kept from adding noticeable latency to the application.
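To make the caching idea concrete, here is a minimal sketch of a caching decorator over the UserService used above (the CachingUserService class and its cache policy are illustrative assumptions, not part of the original design):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: caches User lookups by ID so the second network
// call in the example above is usually a local cache hit.
class CachingUserService {
    private final UserService userService;               // remote service client
    private final Map<Long, User> cache = new ConcurrentHashMap<>();

    CachingUserService(UserService userService) {
        this.userService = userService;
    }

    User getUserById(Long userId) {
        // computeIfAbsent only makes the network call on a cache miss
        return cache.computeIfAbsent(userId, userService::getUserById);
    }
}

A real cache would also need eviction and invalidation (e.g. a TTL), which is omitted here for brevity.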

Solution 2

It is indeed one of the key problems in microservices, and one that is quite conveniently omitted in most articles. Fortunately, there are solutions. As a basis for discussion, let's use the tables you provided in the question.

[diagram: the monolithic schema]

The image above shows how the tables look in a monolith: just a few tables with joins.


To refactor this to microservices we can use a few strategies:

API Join

In this strategy, the foreign keys between microservices are broken and the microservice exposes an endpoint that mimics the key. For example: the Product microservice exposes a findProductById endpoint, and the Order microservice can call this endpoint instead of performing a join.

[diagram: the API join between the Order and Product microservices]

It has an obvious downside: it is slower.
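A minimal sketch of what this looks like in code, following the Java style of Solution 1 (the ProductService, Product, and LineItem types here are illustrative assumptions):

// The Product microservice exposes the broken foreign key as an endpoint.
interface ProductService {
    Product findProductById(Long productId);
}

// Inside the Order microservice: each line item's product_id is resolved
// through the endpoint instead of a SQL join.
class OrderLineEnricher {
    private final ProductService productService;

    OrderLineEnricher(ProductService productService) {
        this.productService = productService;
    }

    Product productFor(LineItem item) {
        return productService.findProductById(item.getProductId());
    }
}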

Read-only views

In the second strategy, you create a copy of the table in the second service's database. The copy is read-only. Each microservice can perform mutating operations on its own read/write tables; on read-only tables copied from other databases, it can (obviously) only perform reads.

[diagram: each service's database holding a read-only copy of the other's table]
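How the copy stays up to date is an implementation detail; one common approach is to subscribe to change events published by the owning service. A rough sketch, in which the event and repository types are hypothetical:

// Hypothetical listener inside the Order service that keeps its local,
// read-only copy of the products table in sync with the Product service.
class ProductReplicaUpdater {
    private final ProductReplicaRepository replica;   // the local read-only copy

    ProductReplicaUpdater(ProductReplicaRepository replica) {
        this.replica = replica;
    }

    // Invoked for each change event the Product service publishes.
    void onProductChanged(ProductChangedEvent event) {
        replica.upsert(event.getProductId(), event.getName(), event.getUnitPrice());
    }
}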

High-performance read

It is possible to achieve high-performance reads by introducing a solution such as Redis or Memcached on top of the read-only view solution. Both sides of the join should be copied into a flat structure optimized for reading. You can introduce a completely new stateless microservice that reads from this storage. While it seems like a lot of hassle, it is worth noting that it can have higher performance than a monolithic solution on top of a relational database.
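For illustration, the flat structure might look like the sketch below; the OrderView shape and the key scheme are assumptions, not something the pattern prescribes:

// Denormalized, read-optimized view of one order line: both sides of the
// join are copied in, so a read becomes a single key lookup.
class OrderView {
    Long orderId;
    String userFirstName;    // copied from the users side of the join
    String userLastName;
    String productName;      // copied from the products side of the join
    int quantityOrdered;
}

// A stateless read microservice would serve these views straight from
// Redis/Memcached, e.g. under a key like "order-view:<orderId>", and
// rebuild them whenever the owning services publish changes.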


There are a few possible solutions. The ones that are simplest to implement have the lowest performance; the high-performance solutions will take a few weeks to implement.

Solution 3

I realise this is possibly not a good answer but what the heck. Your question was:

Given a database that consists entirely of related tables, how does one denormalize this into smaller fragments (groups of tables)

WRT the database design I'd say "you can't without removing foreign keys".

That is, people pushing microservices with the strict no-shared-DB rule are asking database designers to give up foreign keys (and they are doing that implicitly or explicitly). When they don't explicitly state the loss of FKs, it makes you wonder if they actually know and recognise the value of foreign keys (because it is frequently not mentioned at all).

I have seen big systems broken into groups of tables. In these cases there can be either A) no FK's allowed between the groups or B) one special group that holds "core" tables that can be referenced by FK's to tables in other groups.

... but in these systems "groups of tables" is often 50+ tables so not small enough for strict compliance with microservices.

To me, the other related issue to consider with the microservice approach to splitting the DB is its impact on reporting: the question of how all the data is brought together for reporting and/or loading into a data warehouse.

Somewhat related is the tendency to ignore built-in DB replication features in favor of messaging (and how DB-based replication of the core tables / DDD shared kernel impacts the design).

EDIT: (the cost of JOIN via REST calls)

When we split up the DB as suggested by microservices and remove FKs, we not only lose the enforced declarative business rule (of the FK) but we also lose the ability for the DB to perform the join(s) across those boundaries.

In OLTP FK values are generally not "UX Friendly" and we often want to join on them.

In the example, if we fetch the last 100 orders we probably don't want to show the customer id values in the UX. Instead we need to make a second call to the customer service to get their names. And if we also want the order lines, we need another call to the products service to show product name, SKU, etc. rather than product id.

In general we can find that when we break up the DB design in this way we need to do a lot of "JOIN via REST" calls. So what is the relative cost of doing this?
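To make that concrete, here is a sketch of what "JOIN via REST" tends to look like in application code (the service clients and accessor names are assumptions, in the style of the earlier examples):

// Hand-rolled "JOIN via REST": what one SQL join would return now costs
// one remote call per referenced service, often per row.
List<Order> orders = orderService.getLastOrders(100);
for (Order order : orders) {
    Customer customer = customerService.getCustomerById(order.getCustomerId());
    for (LineItem line : order.getLineItems()) {
        Product product = productService.getProductById(line.getProductId());
        // ...assemble the row the UX needs: customer name, product name, SKU
    }
}

Batching the ID lookups reduces the call count, but the application is still doing by hand what the database's optimiser would otherwise plan for it.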

Actual Story: Example costs for 'JOIN via REST' vs DB Joins

There are 4 microservices and they involve a lot of "JOIN via REST". A benchmark load for these 4 services comes to ~15 minutes. Those 4 microservices converted into 1 service with 4 modules against a shared DB (that allows joins) executes the same load in ~20 seconds.

This unfortunately is not a direct apples to apples comparison for DB joins vs "JOIN via REST" as in this case we also changed from a NoSQL DB to Postgres.

Is it a surprise that "JOIN via REST" performs relatively poorly when compared to a DB that has a cost-based optimiser, etc.?

To some extent, when we break up the DB like this we are also walking away from the 'cost based optimiser' and all that it does with query execution planning for us, in favor of writing our own join logic (we are somewhat writing our own relatively unsophisticated query execution plan).



Comments

  • smeeb about 2 years

    I just read an article on Microservices and PaaS Architecture. In that article, about a third of the way down, the author states (under Denormalize like Crazy):

    Refactor database schemas, and de-normalize everything, to allow complete separation and partitioning of data. That is, do not use underlying tables that serve multiple microservices. There should be no sharing of underlying tables that span multiple microservices, and no sharing of data. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or a message service interface).

While this sounds great in theory, in practice it has some serious hurdles to overcome. The biggest is that, often, databases are tightly coupled and every table has some foreign key relationship with at least one other table. Because of this it could be impossible to partition a database into n sub-databases controlled by n microservices.

    So I ask: Given a database that consists entirely of related tables, how does one denormalize this into smaller fragments (groups of tables) so that the fragments can be controlled by separate microservices?

    For instance, given the following (rather small, but exemplar) database:

    [users] table
    =============
    user_id
    user_first_name
    user_last_name
    user_email
    
    [products] table
    ================
    product_id
    product_name
    product_description
    product_unit_price
    
    [orders] table
    ==============
    order_id
    order_datetime
    user_id
    
    [products_x_orders] table (for line items in the order)
    =======================================================
    products_x_orders_id
    product_id
    order_id
    quantity_ordered
    

    Don't spend too much time critiquing my design, I did this on the fly. The point is that, to me, it makes logical sense to split this database into 3 microservices:

    1. UserService - for CRUDding users in the system; should ultimately manage the [users] table; and
    2. ProductService - for CRUDding products in the system; should ultimately manage the [products] table; and
    3. OrderService - for CRUDding orders in the system; should ultimately manage the [orders] and [products_x_orders] tables

    However all of these tables have foreign key relationships with each other. If we denormalize them and treat them as monoliths, they lose all their semantic meaning:

    [users] table
    =============
    user_id
    user_first_name
    user_last_name
    user_email
    
    [products] table
    ================
    product_id
    product_name
    product_description
    product_unit_price
    
    [orders] table
    ==============
    order_id
    order_datetime
    
    [products_x_orders] table (for line items in the order)
    =======================================================
    products_x_orders_id
    quantity_ordered
    

    Now there's no way to know who ordered what, in which quantity, or when.

    So is this article typical academic hullabaloo, or is there a real world practicality to this denormalization approach, and if so, what does it look like (bonus points for using my example in the answer)?

    • Mike Sherrill 'Cat Recall' over 9 years
      WRT "denormalize like crazy" . . . Why? I didn't see any concrete rationale in the article.
    • code over 9 years
      Have you made any progress in resolving this problem? It seems to be one of the problems most avoided by anyone pushing microservices.
    • smeeb over 9 years
      Hi there @ccit-spence - please see my answer and let me know what you think. I had to engineer this solution myself, and it's been working nicely for several months now, but I'm interested in what other devs think about it.
    • Rob Bygrave over 7 years
      Perhaps worth noting that the article refers to a DB that doesn't even support foreign key constraints (so that to me is an indicator that the author doesn't place value on foreign key constraints - perhaps doesn't even know what was lost?).
  • code over 9 years
    Are all of the services tied to the same database? In our case each service is a standalone service on its own server instance. Each service has a dedicated database to that service.
  • smeeb over 9 years
    I think you can make arguments either way. If each service has its own dedicated DB, it really does isolate the entire microservice stack (from endpoint to disk) off into its own silo, but then you can't have any meaningful foreign keys to entities living in other DBs. From a performance standpoint this may or may not be acceptable. Currently we have been using the same DB because our data is highly relational and it just made sense to let the database be monolithic and the services be finely decomposed 'microservices', but like I said, it's possible to do it either way.
  • code over 9 years
    Agreed, having them in separate databases does create an issue with foreign keys. Our goal is to create microservices that are truly isolated. Even to the point that the service has its own dedicated UI service. The only way I have come up with for relationships is to use an aggregate service. Essentially an intersection service. This provides the ability for little to no dependencies between services.
  • Ruslan Stelmachenko over 8 years
    Foreign keys don't add performance; indexes are what add performance. But an index on an FK-like column can be created in any schema, not necessarily the same one. For example: the orders table can live in its own schema and have an indexed user_id column, which is not a "true" FK but just the ID of the user obtained from the Users microservice, while the users table lives in its own schema. There is almost no performance loss, but I still can't understand how some filtering/batching can be achieved. For example: find all users who have an order that has a product with price > 100.
  • Ruslan Stelmachenko over 8 years
    But what I really want to say is: if you're already using microservices like this, you don't need the tables to be in a single DB with "real" FKs. They can each live in their own DB. They just need indexes on the "fake" FK columns. You already can't use JOINs because of microservices, so you lose nothing if you split the DB into smaller DBs.
  • code about 8 years
    @user1294787 You are right, the possibility of coupling exists. A completely decoupled system in the end will do nothing. The services that are being aggregated actually have no knowledge of the service that is aggregating them. In fact, you could have many services offering aggregation for different purposes. If the service being aggregated is no longer needed then the aggregation services themselves will also no longer be needed.
  • cecemel almost 8 years
    But what if I create an entity with an FK that doesn't exist, e.g. an order with a reference to a non-existent customer? If I want some consistency, I'll have to perform some checks by referencing other microservices, no?
  • Robert Limanto about 7 years
    Newbie here: does the User entity reside in the user service or the payment service?
  • Steve Chamaillard over 6 years
    Doesn't that couple the readers to the schema of the views they're reading? Every single article about microservices says they should have their own datastore and keep their data private...
  • Marcin Szymczak over 6 years
    Yes, that couples readers to the producer to some extent; on the bright side, readers can read only the part of the event they need and not care about the whole payload. In practice, in pretty much every big application you will need some shared state between microservices, just like in the example: an order has a product and a user. It is hard to redesign this case without shared information.
  • Rob Bygrave almost 5 years
    "You already can't use JOINs because of microservices... " ... I think this is similar to saying that we are walking away from the database query planner (cost based optimiser). That is, breaking into lots of small DB's means we lose the benefits of the cost based optimiser and now implement "JOINS" via rest / rpc etc.
  • Ruslan Stelmachenko over 2 years
    @RobBygrave Exactly! But this is the whole idea of microservices... I agree that this could be very bad for performance at times. Like in my example above, when you need to find all users who have an order containing a product with price > 100: JOINing 3 tables will be much better optimized by the DB than trying to "join" them in code-space by first selecting the IDs of all products with price > 100 (in the products ms), then selecting the IDs of all users from the orders table by those product IDs (in the orders ms), and then finally selecting all users by user ID from the users table (in the users ms).
  • Ruslan Stelmachenko over 2 years
    @RobBygrave (continued) So, in the microservices world this search query could be solved by creating a new search service (ha-ha) with a DB that contains all the required information for users, orders, and products at the same time (e.g. updated by asynchronous events from the other services) and supports efficient search queries. This could even be a non-relational DB like ElasticSearch. This of course adds quite a lot of complexity to the system, but it could also greatly improve performance (compared to a monolith with a single relational DB that allows JOINs) if the amount of data is very big.
  • Rob Bygrave over 2 years
    @RuslanStelmachenko Well, maybe, but that search service might now not have "transactional read consistency" - do we care? In terms of the "monolithic database" we should not ignore the middle ground (like a DDD shared kernel). In DB terms that means we are not concerned with many of our tables that are not really massive or high-growth, and we get benefits from keeping them together (enabling effective use of the DB query planner). It's the truly fast-growing, massive tables that we want to keep separate from everything else (in terms of writes at least).
  • Rob Bygrave over 2 years
    @RuslanStelmachenko WRT no performance loss for an index like user_id, that is somewhat true but I think that misses the point that we can lose the ability to use a covering index on the other side. For example indexes like customer id + name, product id + name. For example, if our query projection includes product id + name and only those 2 columns of the product table we can use a "covering index" on that part of the join. If we are projecting orders with lines and each line with product name then that covering index can be significant.
  • Rob Bygrave over 2 years
    > find all users who have an order that has a product with price > 100
    The "old school" approach of read/write separation still largely works and deals with this type of problem. This approach might confirm that "Product" isn't actually big and is a candidate for logical replication. So then the order microservice DB owns the order table (read + write) and is allowed a read-only, logically replicated product table. This allows the Order microservice DB to easily optimise that query. We relax the strict "order microservice can't have a copy of product" rule to allow a read-only replica.