Cassandra Scalability

Creating Multiple Local Cassandra Datacenters with Docker

The Apache Cassandra database has long included support for multiple datacenters. That is, Cassandra allows an organization to configure a cluster to actively store data across disparate datacenters. There are multiple reasons to do this, including:

  • Improved geographic responsiveness
  • Disaster recovery
  • Separation of transactional and analytic workloads

In general, when trying to wrap my head around some technical details, it’s usually helpful to simplify the scenario, removing as many variables as possible. In this case, I really want to learn more about multi-datacenter setups with Cassandra. I really don’t want to have to spend a lot of time trying to set up a multi-datacenter scenario using a cloud provider, wrangling Kubernetes, reading up on Terraform, and so forth. I really want to try to make things as simple as reasonably possible. Fortunately, most of what I want to try can be achieved with Docker. (Yes, I get that this is another layer, but it’s a little bit more narrow in scope.)

As we’ll see, by using Docker containers on a local machine, we can simulate a multi-datacenter setup.

Docker Support for Cassandra

Launching multiple Docker containers on a single box is usually very simple. However, getting all the networking bits to work so that the containers can talk to each other often trips me up. That’s why I was so happy to read an embarrassingly simple approach in the third edition of Cassandra: The Definitive Guide (on pages 226 and 227). It turns out that I had most of the key bits in place, but it was the intersection of introducing the concept of the Docker network and the official Docker Cassandra image that really got me past futzing with Docker to focusing on Cassandra itself.

The Docker community has been maintaining and hosting an official Cassandra Docker image on the Docker Hub website since 2015. They support multiple versions of Cassandra from the 2.1 series up to the most current, stable version of the 3.11 series.

The overview page provides some basic usage of how to run the container and even how to create a little local cluster. But I’ll provide the tl;dr summary in the following.

Creating Your First Multiple Local Cassandra Datacenters with Docker

One interesting byproduct of being able to create a Cassandra cluster via Docker on our local machines is that we can create the illusion of multiple, regionally separate datacenters, one on the west coast and the other on the east coast.

We’ll be “automating” this a bit with a pinch of bash scripting. Here’s the complete script:

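#!/bin/bash -x
# Define the name of our Docker network (assumed name; any unused name works)
network_name=cassandra-net
# Define the name of the seed node (left empty until the first node is launched)
seed_name=""
# Seconds to wait after each launch (assumed value; tune for your machine)
SLEEPY_TIME=60
docker network create "$network_name"
for data_center in east west ; do
  for node_num in {1..2} ; do
    # Assumed naming scheme: cassandra-<datacenter>-<number>
    node_name="cassandra-${data_center}-${node_num}"
    if [ "$seed_name" = "" ] ; then
      # We only need one seed name (not recommended for production), so use
      # whatever the first node name is.
      seed_name="$node_name"
    fi
    echo "Launching node $node_name in $data_center datacenter"
    # Image tag is an assumption: the official image, 3.11 series
    docker run \
      --detach \
      --name "$node_name" \
      --network "$network_name" \
      -e CASSANDRA_DC="$data_center" \
      -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch \
      -e CASSANDRA_SEEDS="$seed_name" \
      cassandra:3.11
    echo "Sleeping $SLEEPY_TIME seconds to let the gossip settle for Cassandra startup"
    sleep "$SLEEPY_TIME"
  done
done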
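The fixed sleep is the simplest way to give each node time to start. As an alternative, here is a rough sketch of a polling helper that retries a check until it succeeds or a timeout passes. The names wait_until and node_is_up are hypothetical, and the check assumes that nodetool exits with a non-zero status while Cassandra inside the container is not yet reachable.

```shell
#!/bin/bash
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds,
# giving up (return 1) once TIMEOUT seconds have been spent waiting.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local waited=0
  until "$@" ; do
    if [ "$waited" -ge "$timeout" ] ; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Hypothetical check: succeeds once nodetool in the named container
# can talk to the local Cassandra process.
node_is_up() {
  docker exec "$1" nodetool status > /dev/null 2>&1
}
```

Inside the loop, something like wait_until 120 5 node_is_up "$node_name" could then stand in for the sleep. The 120 and 5 are arbitrary values, not recommendations.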

Two of the -e settings passed to docker run are the key pieces from the above script:

  • CASSANDRA_DC. Sets the datacenter name of this container/node. In our example, we use east and west as the names of our datacenters. As you can see, it’s possible to dynamically assign the name; it doesn’t have to be configured or set up beforehand.
  • CASSANDRA_ENDPOINT_SNITCH. Setting this to GossipingPropertyFileSnitch is required for the CASSANDRA_DC value to take effect; with that snitch, each node advertises its datacenter to the rest of the cluster through gossip.
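As far as I can tell from the image's documentation, CASSANDRA_DC ends up as the dc option in the container's cassandra-rackdc.properties file. Here is a small sketch for checking that. The function name rackdc_value is hypothetical, and the file location /etc/cassandra in the usage line is an assumption about the official image.

```shell
#!/bin/bash
# rackdc_value NAME
# Reads a properties file on stdin and prints the value of the first
# "NAME=value" line (for example dc or rack). Comment lines are skipped
# because they do not start with the property name.
rackdc_value() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" | head -n 1
}

# Usage, with a hypothetical container name and an assumed config path:
#   docker exec cassandra-east-1 cat /etc/cassandra/cassandra-rackdc.properties | rackdc_value dc
```

If the snitch setting is missing, I would expect the value printed here not to match what was passed in CASSANDRA_DC.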

Once that script has finished running, we can test that our two datacenters are active within the same cluster:

docker exec -it $seed_name nodetool status
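The nodetool status output groups nodes under a "Datacenter:" heading and marks each node that is up and in the normal state with UN. For a quick yes/no answer, a minimal sketch like the following can count the UN nodes per datacenter. The function name count_up_nodes is hypothetical, and it assumes the usual layout of that output.

```shell
#!/bin/bash
# count_up_nodes
# Reads `nodetool status` output on stdin and prints one line per
# datacenter: the datacenter name followed by its number of UN nodes.
count_up_nodes() {
  awk '
    /^Datacenter:/ {
      dc = $2
      if (!(dc in up)) { up[dc] = 0; order[++n] = dc }
      next
    }
    /^UN[[:space:]]/ { up[dc]++ }
    END { for (i = 1; i <= n; i++) print order[i], up[order[i]] }
  '
}

# Usage (drop -it when piping; substitute the name of your seed container):
#   docker exec "$seed_name" nodetool status | count_up_nodes
```

With the script above, two nodes per datacenter should show up once everything has settled. A lower count usually just means a node is still joining.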

Did it work?


Next Steps

There are many different directions to go from here to test out Cassandra’s behavior using multiple datacenters. Here are some I plan to explore:

Well… there you have it. Hopefully that will help you to get a little more familiar with the behavior of Cassandra’s out-of-the-box multi-datacenter capabilities without having to spend too much time provisioning, configuring, and installing Cassandra on actual boxes/VMs/instances/etc. Of course, this is nowhere near a production system for a lot of reasons. It does, however, give you a nice playground in which to start exploring.

Now go back to baking sourdough bread.