Distributed sequence number generation?

java apache-zookeeper sequences

85,242

Solution 1

OK, this is a very old question, which I'm first seeing now.

You'll need to differentiate between sequence numbers and unique IDs that are (optionally) loosely sortable by a specific criteria (typically generation time). True sequence numbers imply knowledge of what all other workers have done, and as such require shared state. There is no easy way of doing this in a distributed, high-scale manner. You could look into things like network broadcasts, windowed ranges for each worker, and distributed hash tables for unique worker IDs, but it's a lot of work.

Unique IDs are another matter, there are several good ways of generating unique IDs in a decentralized manner:

a) You could use Twitter's Snowflake ID network service. Snowflake is a:

Networked service, i.e. you make a network call to get a unique ID;
which produces 64 bit unique IDs that are ordered by generation time;
and the service is highly scalable and (potentially) highly available; each instance can generate many thousand IDs per second, and you can run multiple instances on your LAN/WAN;
written in Scala, runs on the JVM.

b) You could generate the unique IDs on the clients themselves, using an approach derived from how UUIDs and Snowflake's IDs are made. There are multiple options, but something along the lines of:

The most significant 40 or so bits: A timestamp; the generation time of the ID. (We're using the most significant bits for the timestamp to make IDs sort-able by generation time.)
The next 14 or so bits: A per-generator counter, which each generator increments by one for each new ID generated. This ensures that IDs generated at the same moment (same timestamps) do not overlap.
The last 10 or so bits: A unique value for each generator. Using this, we don't need to do any synchronization between generators (which is extremely hard), as all generators produce non-overlapping IDs because of this value.

c) You could generate the IDs on the clients, using just a timestamp and random value. This avoids the need to know all generators, and assign each generator a unique value. On the flip side, such IDs are not guaranteed to be globally unique, they're only very highly likely to be unique. (To collide, one or more generators would have to create the same random value at the exact same time.) Something along the lines of:

The most significant 32 bits: Timestamp, the generation time of the ID.
The least significant 32 bits: 32-bits of randomness, generated anew for each ID.

d) The easy way out, use UUIDs / GUIDs.

Solution 2

You could have each node have a unique ID (which you may have anyway) and then prepend that to the sequence number.

For example, node 1 generates sequence 001-00001 001-00002 001-00003 etc. and node 5 generates 005-00001 005-00002

Unique :-)

Alternately if you want some sort of a centralized system, you could consider having your sequence server give out in blocks. This reduces the overhead significantly. For example, instead of requesting a new ID from the central server for each ID that must be assigned, you request IDs in blocks of 10,000 from the central server and then only have to do another network request when you run out.

Solution 3

Now there are more options.

Though this question is "old", I got here, so I think it might be useful to leave the options I know of (so far):

You could try Hazelcast. In it's 1.9 release it includes a Distributed implementation of java.util.concurrent.AtomicLong
You can also use Zookeeper. It provides methods for creating sequence nodes (appended to znode names, though I prefer using version numbers of the nodes). Be careful with this one though: if you don't want missed numbers in your sequence, it may not be what you want.

Cheers

Solution 4

It can be done with Redisson. It implements distributed and scalable version of AtomicLong. Here is example:

Config config = new Config();
config.addAddress("some.server.com:8291");

Redisson redisson = Redisson.create(config);
RAtomicLong atomicLong = redisson.getAtomicLong("anyAtomicLong");
atomicLong.incrementAndGet();

Solution 5

If it really has to be globally sequential, and not simply unique, then I would consider creating a single, simple service for dispensing these numbers.

Distributed systems rely on lots of little services interacting, and for this simple kind of task, do you really need or would you really benefit from some other complex, distributed solution?

View more solutions

85,242

Author by

Jonathan Holloway

Consultant CTO and Architect with expertise in Java, Python, Ruby and Javascript http://www.jonathanholloway.co.uk/

Updated on January 14, 2022

Comments

Jonathan Holloway over 2 years

I've generally implemented sequence number generation using database sequences in the past.

e.g. Using Postgres SERIAL type http://www.neilconway.org/docs/sequences/

I'm curious though as how to generate sequence numbers for large distributed systems where there is no database. Does anybody have any experience or suggestions of a best practice for achieving sequence number generation in a thread safe manner for multiple clients?
Jonathan Holloway over 13 years

Zookeeper was the options I went with, there's a good description and writeup of this on the mailing list that I started - mail-archive.com/[email protected]/msg01967.h‌tml
Paolo over 13 years

Jon, thanks for pointing to that thread, that's exactly the type of solution I was thinking. BTW, did you make the code to overcome the MAX_INT limitation?
Piyush Kansal over 9 years

Cassandra supports counters (cassandra.apache.org/doc/cql3/CQL.html#counters), there are some limitations though.
Pavel almost 9 years

While UUIDs work, the problem with them is that you have to be careful how you store them if you ultimately need to index the generated keys. They will also typically take up a lot more space than a monotonically increased sequence. See percona.com/blog/2014/12/19/store-uuid-optimized-way for a discussion about storing them with MySQL.
ishan almost 9 years

i like your point about the batch id generation, but it just limits any real time calculation possibility.
Janakiram over 8 years

I have implemented a similar mechanism. In that, in addition to clients caching a block of sequences, I have added several server-hosts that cache the blocks of sequences. A (single) master generator is maintained in some highly available storage or a single-master host, accessible only the fleet of server-hosts. The server caching would also help us in more uptime inspite the single master goes down for a moment.
brucenan over 8 years

sequence numbers is easy to set position for bitmap index, but unique id sometimes too long (64bit or 128bit), how can unique id mapping to a bitmap index position? Thanks.
puneet almost 8 years

really liked option #b ..... it could allow for high scale and not cause much of concurrency issue
Onic Team over 6 years

please add some more details
Navin over 5 years

twitter/snowflake is no longer maintained
Navin over 5 years

...and what happens when the server running that service goes down?
nic ferrier over 5 years

Have an alert that tells someone to start another one? Sometimes that will be just fine. I think the answer is trying to say "keep things in perspective". The perfect distributed solution has it's own drawbacks and sometimes simpler is better.
Wpigott about 5 years

If you want an Apache2 Licensed implementation of option B, check out bitbucket.org/pythagorasio/common-libraries/src/master/… You can also get it from maven io.pythagoras.common:distributed-sequence-id-generator:1.0.0
tonix over 3 years

Did your solution with Redis scale well? If yes, for how many concurrent requests per seconds? Thank you!
tonix over 3 years

How does Stack Overflow generate sequence numbers and unique IDs for their users? It seems that their user IDs are indeed consecutive/sequential and unique. Do you think they have a single service hit by all the clients? But doesn't this lead to bottlenecks if a lot of new users register at the same point in time? Thank you!
Zohair over 3 years

Hey Tonix, we did use it for couple of months but it was not tested on a big scale. I suggest you explore Redis INCR
tonix over 3 years

What do you use now?
Zohair over 3 years

Our problem statement got obsolete - but I would definitely use Redis INCR if I had to solve this again.
Nulik about 3 years

a database is not a distributed system, it is a centralized system