Cassandra Java driver: how many contact points is reasonable?


Solution 1

I would say that configuring your client to use the same list of nodes as the list of seed nodes you configured Cassandra to use will give you the best results.

As you know, Cassandra nodes use the seed nodes to find each other and discover the topology of the ring. The driver uses only one of the nodes in the list to establish the control connection (the connection it uses to discover the cluster topology), but providing the seed nodes increases the chance that the client can keep operating if a node fails.

Solution 2

My approach is to add as many nodes as I can. The reason is simple: seeds are only necessary when the cluster boots; once the cluster is up and running, seeds are just ordinary nodes. Listing only the seeds can leave you unable to connect to a cluster that is otherwise working. So I give myself the best chance of connecting by listing a more than reasonable number of nodes. One working node is enough to get the current cluster configuration.

Solution 3

Documentation from DataStax

public Cluster.Builder addContactPoint(String address)

Adds a contact point.

Contact points are addresses of Cassandra nodes that the driver uses to discover the cluster topology. Only one contact point is required (the driver will retrieve the address of the other nodes automatically), but it is usually a good idea to provide more than one contact point, because if that single contact point is unavailable, the driver cannot initialize itself correctly.

Note that by default (that is, unless you use the withLoadBalancingPolicy(com.datastax.driver.core.policies.LoadBalancingPolicy) method of this builder), the first successfully contacted host will be used to define the local data-center for the client. It follows that if you are running Cassandra in a multiple data-center setting, it is a good idea to only provide contact points that are in the same data-center as the client, or to manually provide the load balancing policy that suits your needs.

Parameters:
    address - the address of the node to connect to
Returns:
    this Builder.
Throws:
    IllegalArgumentException - if no IP address for address could be found
    SecurityException - if a security manager is present and permission to resolve the host name is denied.

From what I understand, a single contact point is enough for the driver to discover the rest, although, as the documentation above notes, providing more than one protects you if that node is unavailable. Hope that helps. I personally use Hector; you should look into that too.

Solution 4

I read an interesting article about Netflix and their Cassandra installation.

They mention that they used their Chaos Gorilla tool to take down 33% of their Cassandra cluster and saw that their systems still worked as expected.

They have some 2,000 Cassandra nodes and took 33% down. This means 1 out of 3 nodes is gone (about 660 nodes for Netflix).

If you are really unlucky, all the contact points you specified are among those 660 nodes... Ouch.

Chances are, though, that if you never expect a dramatic event in which more than 33% of your network goes down, you can use a pretty small number of contact points, such as 6. With 6 nodes and a third of the cluster down, you would expect about 4 of them to be up on average, and it is very unlikely that all 6 are down at once.
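To put a number on "really unlucky", here is a minimal sketch that computes the probability that every contact point is down. It assumes the contact points are chosen uniformly at random without regard to racks, and that the down nodes are spread uniformly too. Neither assumption comes from the Netflix article. The class and method names are made up for illustration.

```java
final class ContactPointOdds {

    /**
     * Probability that all k distinct, uniformly chosen contact points
     * are among the `down` nodes of a cluster with `total` nodes
     * (sampling without replacement).
     */
    static double allDown(int total, int down, int k) {
        if (total <= 0 || down < 0 || down > total || k < 0 || k > total) {
            throw new IllegalArgumentException("inconsistent cluster sizes");
        }
        if (k > down) {
            return 0.0; // more contact points than down nodes: one must be up
        }
        double p = 1.0;
        for (int i = 0; i < k; i++) {
            // Chance that the (i+1)-th pick is down, given the first i were down.
            p *= (double) (down - i) / (total - i);
        }
        return p;
    }

    public static void main(String[] args) {
        // Figures from the answer: about 2,000 nodes, about 660 of them down.
        for (int k = 1; k <= 6; k++) {
            System.out.printf("k=%d  P(all down)=%.5f%n", k, allDown(2000, 660, k));
        }
    }
}
```

Under these assumptions, the chance that all 6 contact points are down is on the order of 0.1%, so a handful of well-spread contact points already covers a very severe outage.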

Now, it should certainly be chosen strategically if possible. That is, if you choose 6 nodes all in the same rack when you have 6 different racks, you probably chose wrong. Instead, you probably want to specify 1 node per rack. (That's once you grow that much, of course.)
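The "one node per rack" idea can be sketched in plain Java as below. This assumes you already keep your own inventory of node addresses grouped by rack; the driver does not supply one before it has connected. The class name, method name, and map layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RackAwareContactPoints {

    /**
     * Picks at most `max` contact points, taking the first listed node
     * of each rack so that no two contact points share a rack.
     *
     * @param nodesByRack hypothetical inventory: rack name -> node addresses
     */
    static List<String> onePerRack(Map<String, List<String>> nodesByRack, int max) {
        List<String> chosen = new ArrayList<>();
        for (List<String> nodes : nodesByRack.values()) {
            if (chosen.size() >= max) {
                break; // enough contact points already
            }
            if (!nodes.isEmpty()) {
                chosen.add(nodes.get(0));
            }
        }
        return chosen;
    }
}
```

The resulting list could then be passed to the builder from the question, for example with `addContactPoints(chosen.toArray(new String[0]))`.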

Note that if you have a replication factor of 5 and 33% of your Cassandra nodes go down, you're in trouble anyway. In that situation, many requests can no longer be served at QUORUM consistency. Notice that Netflix talks about that: their replication factor is just 3! (That is, 1/3 ≈ 0.33, whereas 1/5 = 0.2, so 20%, which is less than 33%.)

Finally, I do not know the Java driver; I use the C++ one. When a connection attempt fails, I am told, so I can try again with another set of IPs if necessary, until it works. My system keeps one connection open between client accesses, so this is a one-time process, and I can relay the fact that this server is connected to Cassandra and can therefore accept client connections. If you reconnect to Cassandra each time a client sends you a request, it may be wise not to send many IPs at all.
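The "try another set of IPs until it works" approach could look roughly like this in Java. This is only a sketch: the `connect` function stands in for whatever call actually opens the session, and it is assumed to throw a RuntimeException when none of the given addresses can be reached. The class and method names are hypothetical and are not part of any driver.

```java
import java.util.List;
import java.util.function.Function;

final class FallbackConnector {

    /**
     * Tries each set of contact points in order and returns the first
     * session that `connect` manages to open.
     *
     * @param candidateSets sets of addresses, most preferred first
     * @param connect       stand-in for the real connection call; assumed
     *                      to throw a RuntimeException on failure
     */
    static <S> S connectWithFallback(List<List<String>> candidateSets,
                                     Function<List<String>, S> connect) {
        RuntimeException last = null;
        for (List<String> contactPoints : candidateSets) {
            try {
                return connect.apply(contactPoints);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next set
            }
        }
        throw new IllegalStateException("no contact point set could be reached", last);
    }
}
```

Because the connection is opened once and then kept, the cost of walking through several sets is paid only at startup.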

Author by

henry

Updated on July 05, 2022

Comments

  • henry
    henry almost 2 years

    In Java I connect to a Cassandra cluster like this:

    Cluster cluster = Cluster.builder().addContactPoints("host-001","host-002").build();
    

    Do I need to specify all hosts of the cluster in there? What if I have a cluster of 1000 nodes? Do I randomly choose a few? How many, and do I really do that randomly?