Azure Cosmos DB - Understanding Partition Key

azure azure-cosmosdb

72,990

Solution 1

Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.

But, in a nutshell: The PartitionKey is a property that will exist on every single object that is best used to group similar objects together.

Good examples include Location (like City), Customer Id, Team, and more. Naturally, it wildly depends on your solution; so perhaps if you were to post what your object looks like we could recommend a good partition key.

EDIT: Should be noted that PartitionKey isn't required for collections under 10GB. (thanks David Makogon)

* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.

Solution 2

Partition key acts as a logical partition.

Now, what is a logical partition, you may ask? A logical partition may vary upon your requirements; suppose you have data that can be categorized on the basis of your customers, for this customer "Id" will act as a logical partition and info for the users will be placed according to their customer Id.

What effect does this have on the query?

While querying you would put your partition key as feed options and won't include it in your filter.

e.g: If your query was

SELECT * FROM T WHERE T.CustomerId= 'CustomerId';

It will be Now

var options = new FeedOptions{ PartitionKey = new PartitionKey(CustomerId)};

var query = _client.CreateDocumentQuery(CollectionUri,$"SELECT * FROM T",options).AsDocumentQuery();

Solution 3

CosmosDB can be used to store any limit of data. How it does in the back end is using partition key. Is it the same as Primary key? - NO

Primary Key: Uniquely identifies the data Partition key helps in sharding of data(For example one partition for city New York when city is a partition key).

Partitions have a limit of 10GB and the better we spread the data across partitions, the more we can use it. Though it will eventually need more connections to get data from all partitions. Example: Getting data from same partition in a query will be always faster then getting data from multiple partitions.

Solution 4

I've put together a detailed article here Azure Cosmos DB. Partitioning.

What's logical partition?

Cosmos DB designed to scale horizontally based on the distribution of data between Physical Partitions (PP) (think of it as separately deployable underlaying self-sufficient node) and logical partition - bucket of documents with same characteristic (partition key) which is supposed to be stored fully on the same PP. So LP can't have part of the data on PP1 and another on PP2.

There are two main limitation on Physical Partitions:

Max throughput: 10k RUs
Max data size (sum of sizes of all LPs stored in this PP): 50GB

Logical partition has one - 20GB limit in size.

NOTE: Since initial releases of Cosmos DB size limits grown and I won't be surprised that soon size limitations might increase.

How to select right partition key for my container?

Based on the Microsoft recommendation for maintainable data growth you should select partition key with highest cardinality (like Id of the document or a composite field). For the main reason:

Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.

It is critical to analyze application data consumption pattern when considering right partition key. In a very rare scenarios larger partitions might work though in the same time such solutions should implement data archiving to maintain DB size from a get-go (see example below explaining why). Otherwise you should be ready to increasing operational costs just to maintain same DB performance and potential PP data skew, unexpected "splits" and "hot" partitions.

Having very granular and small partitioning strategy will lead to an RU overhead (definitely not multiplication of RUs but rather couple additional RUs per request) in consumption of data distributed between number of physical partitions (PPs) but it will be neglectable comparing to issues occurring when data starts growing beyond 50-, 100-, 150GB.

Why large partitions are a terrible choice in most cases even though documentation says "select whatever works best for you"

Main reason is that Cosmos DB is designed to scale horizontally and provisioned throughput per PP is limited to the [total provisioned per container (or DB)] / [number of PP].

Once PP split occurs due to exceeding 50GB size your max throughput for existing PPs as well as two newly created PPs will be lower then it was before split.

So imagine following scenario (consider days as a measure of time between actions):

You've created container with provisioned 10k RUs and CustomerId partition key (which will generate one underlying PP1). Maximum throughput per PP is 10k/1 = 10k RUs
Gradually adding data to container you end-up with 3 big customers with C1[10GB], C2[20GB] and C3[10GB] of invoices
When another customer was onboarded to the system with C4[15GB] of data Cosmos DB will have to split PP1 data into two newly created PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k/2 = 5k RUs
Two more customers C5[10GB] C6[15GB] were added to the system and both ended-up in PP2 which lead to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs

IMPORTANT: As a result on [Day 2] C1 data was queried with up to 10k RUs but on [Day 4] with only max to 3.333k RUs which directly impacts execution time of your query

This is a main thing to remember when designing partition keys in current version of Cosmos DB (12.03.21).

Solution 5

Partition Key is used for sharding, it acts as a logical partition for your data, and provides Cosmos DB with a natural boundary for distributing data across partitions.

You can read more about it here: https://docs.microsoft.com/en-us/azure/cosmos-db/partition-data

View more solutions

72,990

Author by

Stpete111

Updated on March 28, 2021

Comments

Stpete111 over 3 years

I'm setting up our first Azure Cosmos DB - I will be importing into the first collection, the data from a table in one of our SQL Server databases. In setting up the collection, I'm having trouble understanding the meaning and the requirements around the partition key, which I specifically have to name while setting up this initial collection.

I've read the documentation here: (https://docs.microsoft.com/en-us/azure/cosmos-db/documentdb-partition-data) and still am unsure how to proceed with the naming convention of this partition key.

Can someone help me understand how I should be thinking in naming this partition key? See the screenshot below for the field I'm trying to fill in.

In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URL's and several other secondary identifiers for that record's URL. Not sure if any of that information has any bearing on how I should name my Partition Key.

EDIT: I've added a screenshot of several records from the table from which I'm importing, per request from @Porschiey.
David Makogon almost 7 years

Actually, for collections larger than 10GB, partition key is required.
Porschiey almost 7 years

@DavidMakogon you're right. Thanks. Edited to correct my answer. Thank you.
Stpete111 almost 7 years

@Porschiey great info, thanks for that, I'll watch the video. And yes mine will be over 10GB so the Partition Key will be required. I will post a screenshot of a few rows of my table.
Stpete111 almost 7 years

Screenshot of existing table added to main post. Thanks.
Porschiey almost 7 years

@Stpete111 Thanks! Well it's tough, without intimately understanding the purpose of each property there. You'll want to ask yourself "From what property do you want to individualize performance?" In other words, if you pick SourceCountry as the PartitionKey, results from other countries will not see query performance impacted if the USA has more documents. I'd either pick SourceCountry or Category - but you should play around with different approaches to see which works best for you.
Stpete111 almost 7 years

@Porschiey ok thanks for that! That's very helpful information! I will move forward with a few different scenarios. At the end of the day, I'm not sure how much query performance even matters, because we will have an Azure Search Index sitting on top of this, which is what our clients will actually be querying on.
Porschiey almost 7 years

@Stpete111 ah, makes sense. Good luck sir!
krypru over 5 years

There is no fixed storages (10GB) anymore. You have to choose partition key. But what to do if there is no need to do it?
krypru over 5 years

About last paragaph, do you mean physical or logical partition?
n0rd almost 5 years

@krypru Im in the same boat. I have a very simple collection (list of tokens, 1 field each)
jschober over 4 years

What if I have a partition key? We never created on, the items in the data explorer show "/_partitionKey". When I use that, all my queries fail. Is there a default key? That is with dotnet 3.0
Carltonp over 4 years

The video is amazing. Thanks
Anurag over 4 years

I am storing a list of documents. My structure is like this: { id: ... data: [ {age: 12, name: Anurag}, {age: 22, name: James} ] rid: ... self: ... ts: ... } How can I pass partition key(for ex. name) path in this case?
andrew almost 4 years

If I had a patient management system where users, patients, assessments all belong to a company, would that make my “company id” the logical choice for partition key? i.e viewing, searching patients and their assessments is always in the context of the same company (never “outside” of this), so company sounds like a reasonable way to group all of the documents.
Martin Godzina about 3 years

Working with batch storedprocedures is way harder, If you pick a partitionkey which is to granular ( like /id ). Storedprocedures only consume documents with the same partitionkey. Features like batch ACID transactions could not be used anymore ( batch deleting ect.)
Nick over 2 years

Sadly the video link is broken again. The MSDN page now links to this 2-minute YouTube video, but there's a better video, Partition Strategy | Azure Cosmos DB Essentials in the same playlist that goes into much more detail.
Yiping about 2 years

Will Date as int value be a good candidate in this case?