Cassandra Basics - Introduction

In Greek mythology, Cassandra was cursed with the ability to predict the future, only to have her prophecies never be believed. Apache Cassandra is reportedly named after this figure as a nod to the Oracle Database, whose name evokes the oracles of ancient Greece.

Apache Cassandra is a free and open source database. Specifically, it is a distributed, highly available, eventually consistent, NoSQL, wide column store database management system. So what do all these buzzwords mean?

Distributed

Cassandra is designed to be run on several machines (nodes), specifically on commodity servers. These are servers you can provision from a cloud computing provider like Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, or DigitalOcean. A group of these nodes connected to one another running Cassandra is called a cluster. Distribution across nodes means that you can scale Cassandra by increasing the number of nodes in your cluster.

Cassandra is also masterless. This means that no single node is in charge of all of the nodes in a cluster. In Cassandra no nodes are special. This allows Cassandra to distribute work evenly across all nodes, so that no one node has to take part in every single database operation and become a bottleneck. In theory, doubling the number of nodes in a Cassandra cluster should double the amount of work that cluster can do. This sort of scaling behavior is called linear scalability.
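To make the idea of spreading work evenly across nodes a little more concrete, here is a minimal sketch in Python. It is not how Cassandra is implemented (Cassandra places data on a token ring with its own partitioner). The node names, the made-up photo keys, the owner_of and keys_per_node helpers, and the choice of SHA-256 with a modulo are all assumptions made purely for illustration.

```python
import hashlib
from collections import Counter
from typing import Sequence


def owner_of(key: str, nodes: Sequence[str]) -> str:
    """Map a key to a node by hashing it (illustrative, not Cassandra's partitioner)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")  # first 8 bytes of the hash as an integer
    return nodes[token % len(nodes)]


def keys_per_node(num_keys: int, nodes: Sequence[str]) -> Counter:
    """Count how many made-up keys each node would own."""
    return Counter(owner_of(f"photo-{i}", nodes) for i in range(num_keys))


# Hypothetical clusters: the same 30,000 keys on 3 nodes and then on 6 nodes.
three = keys_per_node(30_000, ["node-a", "node-b", "node-c"])
six = keys_per_node(
    30_000, ["node-a", "node-b", "node-c", "node-d", "node-e", "node-f"]
)

for label, counts in (("3 nodes", three), ("6 nodes", six)):
    print(label, dict(sorted(counts.items())))
```

Each node in the six-node run should own roughly half as many keys as each node in the three-node run, which is the intuition behind linear scalability. A plain modulo would reshuffle most keys whenever a node is added, which is one reason real systems like Cassandra use consistent hashing instead.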

Highly Available

In theory, highly available means that your database will never fail to execute your database operations. Of course, no system is perfect, so failures are inevitable, but in a high availability system they are extremely uncommon (>= 99.999% uptime is a good rule of thumb). Cassandra achieves this by replicating all data across several nodes and not requiring nodes to communicate with each other to execute an operation (though there are some exceptions to this). This means that Cassandra is fault tolerant: a single node crashing won't bring down your whole cluster.
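To put the uptime rule of thumb into perspective, the arithmetic can be sketched in a few lines of Python. The function names are hypothetical. The second function rests on a strong simplifying assumption that is mine, not Cassandra's: replicas fail independently of each other, which real clusters only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes per year a system may be down and still meet the uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def at_least_one_replica_up(node_availability: float, replicas: int) -> float:
    """Chance that at least one replica is reachable, assuming independent failures."""
    return 1 - (1 - node_availability) ** replicas


# "Five nines" leaves only a few minutes of downtime per year.
print(f"99.999% uptime: {allowed_downtime_minutes(99.999):.2f} minutes/year")
print(f"99.9% uptime:   {allowed_downtime_minutes(99.9):.1f} minutes/year")

# Hypothetical nodes that are each up 99% of the time, with data on 3 of them.
print(f"3 replicas at 99% each: {at_least_one_replica_up(0.99, 3):.6f}")
```

Under these assumptions, a single 99% node is far from five nines, but three replicas of the same data make it very unlikely that all copies are unavailable at once.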

Eventually Consistent

Eventual consistency is one of the prices you pay to get the availability, performance, and scalability Cassandra provides. A database system is strongly consistent if every read is guaranteed to reflect the most recent write. Cassandra is not strongly consistent. In practice, this means that if you write a value to your database, for a short period of time, when you try to read it there will be a chance that you'll read the old value and not the value you just wrote.

Thanks to eventual consistency, Cassandra can guarantee that after that short period you will be able to read your new value every time. Eventual consistency means that you will eventually be able to read the latest write: if there are two writes, the final value will eventually be the value from the later write. In a system with no consistency, earlier writes may win out over later ones.

For a practical example, imagine making an application to track the score of a soccer match. After each goal you trigger an update to the score of the game. Team A scores two goals, resulting in two writes: Team A Score = 1, then Team A Score = 2. In an eventually consistent system you are guaranteed to get a final score of 2. In a system with no consistency you will probably get a final score of 2, but you can't be sure: the Team A Score = 1 write may overwrite the second update, leaving a final score of 1.
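The soccer example can be sketched in Python. This is a toy model, not Cassandra's code: it assumes every write carries a timestamp and that the write with the newest timestamp wins, which is the general idea behind how Cassandra reconciles conflicting writes. The Write class, the timestamps, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Write:
    value: int      # Team A's score
    timestamp: int  # hypothetical time at which the goal was recorded


def last_write_wins(current: Optional[Write], incoming: Write) -> Write:
    """Keep whichever write has the newer timestamp."""
    if current is None or incoming.timestamp > current.timestamp:
        return incoming
    return current


def arrival_order_wins(current: Optional[Write], incoming: Write) -> Write:
    """No consistency rule: whatever arrives last overwrites the stored value."""
    return incoming


def replay(writes: Iterable[Write],
           resolve: Callable[[Optional[Write], Write], Write]) -> int:
    """Apply writes in the order they arrive and return the final score."""
    state: Optional[Write] = None
    for write in writes:
        state = resolve(state, write)
    return state.value


goal_1 = Write(value=1, timestamp=100)
goal_2 = Write(value=2, timestamp=200)

# The first update is delayed on the network and arrives after the second.
arrived = [goal_2, goal_1]

lww_final = replay(arrived, last_write_wins)       # final score 2
naive_final = replay(arrived, arrival_order_wins)  # final score 1
print(lww_final, naive_final)
```

With the timestamp rule the delayed Team A Score = 1 update is ignored, so the final score is 2 regardless of arrival order. Without it, the stale update silently overwrites the newer one.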

Consistency in Cassandra is a bit more complicated than what I described above. Depending on the configuration of your cluster and other factors, the consistency behavior you can expect may vary. I will go into more detail on consistency in later posts.

NoSQL

Cassandra Query Language (CQL) looks a lot like SQL, but this is deceptive: something very different is going on under the surface. The types of queries you can make in Cassandra are extremely limited compared to what you can do with SQL. If you already know SQL (like I did before I started with CQL), you will probably be frustrated at first by all the queries you can't make.

Cassandra is not relational so there are no JOINs. You can use SELECT, ORDER BY, and aggregate functions but your usage of these must be specifically tuned to the schema of the table. A lot of this guide will be about navigating these limitations and learning why they exist.

Wide Column Store

A wide column store is a NoSQL database that uses tables, rows, and columns like a relational database, but each row (record) can have different column names and types from the other rows in the same table. Wide column stores also support a huge number of these dynamic columns per row (as many as billions). Another way to think of a wide column store is as a two-dimensional key-value store, where one dimension's key is the row key and the other dimension's key is the column name.
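The "two-dimensional key-value store" picture can be sketched with nested dictionaries in Python. This only illustrates the data model, not how Cassandra stores data on disk. The row keys, column names, values, and the put and get helpers are hypothetical.

```python
from collections import defaultdict

# Outer key: row key. Inner key: column name. Value: the cell.
table = defaultdict(dict)


def put(row_key, column, value):
    """Write a single cell addressed by (row key, column name)."""
    table[row_key][column] = value


def get(row_key, column, default=None):
    """Read a single cell; a missing row or column yields the default."""
    return table.get(row_key, {}).get(column, default)


# Two rows in the same table with different sets of columns.
put("user:1", "name", "Ada")
put("user:1", "email", "ada@example.com")
put("user:2", "name", "Grace")
put("user:2", "last_login", 1700000000)

print(get("user:1", "email"))        # present on this row
print(get("user:2", "email"))        # this row has no such column
print(sorted(table["user:2"]))       # columns that exist on user:2
```

Every lookup needs both keys, and a row simply lacks the columns that were never written to it, rather than storing them as empty values.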

Though wide column stores may appear similar to relational databases on the surface, the implementation under the hood is extremely different. This is the reason behind a lot of the feature support and performance characteristics of Cassandra. For example, wide column stores are not optimized for joins in the way relational databases are, which is why join operations are not supported in Cassandra.

What will I learn in this guide?

This guide is intended for developers who are new to Cassandra on teams that are already using Cassandra. This is the position I was in when I first picked up Cassandra, so this is the guide that I wish I had when I started out. Though Cassandra is often frustrating, it's a really cool piece of technology that is rewarding to learn. It will allow you to build applications with exceptional performance and scalability.

This guide will cover topics like designing table schemas and queries to develop an application. It will not cover topics related to setting up or managing a Cassandra cluster.

Should I use Cassandra for my project?

If you are asking this question there is a good chance you shouldn't. Cassandra isn't the right choice for most projects. Cassandra is designed for extreme scale. It is more difficult to build and maintain applications with Cassandra than with other database management systems like MySQL or MongoDB. Your application needs to be pretty large to require this type of scale. You should think about using Cassandra if:

  • Your application requires a lot of writes (> ~100,000 per second)
  • Write availability is important to your application
  • Your application has strict performance requirements
  • Strong consistency is not important to your application
  • Someone on your team is familiar with Cassandra

If your application seems like a good fit for Cassandra I also recommend looking into AWS's DynamoDB. DynamoDB has very similar properties to Cassandra but it's fully managed by AWS. AWS charges a premium for this management but depending on the resources your team has to dedicate to managing Cassandra it may be well worth it.

Setup

This guide has a lot of examples that you can follow along with if you'd like. This section will help you get everything set up to run them. You can use a different setup if you prefer, but this is what I found to be the easiest way to get started.

NOTE: This setup guide is not intended for production systems, or anything beyond exploration.

Prerequisites

I am assuming you already have the following installed and set up before we begin:

  1. Docker and docker-compose
  2. Python 3+
  3. pip

Running Cassandra

There are lots of ways to run Cassandra, but I find the easiest way to get it up and running locally is to use Bitnami's Docker image and Docker Compose file. This will get Cassandra running with two commands:

curl -sSL https://raw.githubusercontent.com/bitnami/bitnami-docker-cassandra/master/docker-compose.yml > docker-compose.yml
docker-compose up -d

Installing cqlsh

You will need cqlsh to interact with Cassandra directly. You can install it using pip:

pip install cqlsh

Connecting to Cassandra

In a lot of my examples I will just provide CQL for you to execute. In order to execute this CQL you will need to connect to Cassandra with cqlsh. Here is how to connect to the local Cassandra instance we created with docker compose earlier:

cqlsh --cqlversion=3.4.4 -u cassandra -p cassandra

Making Our Keyspace

Every Cassandra table must be in a keyspace. A lot of my examples will be in a keyspace called photos. Before you can run the CQL from these examples, you need to create the photos keyspace like so (ignore the class and replication_factor for now; we will get into these later):

cassandra@cqlsh> CREATE KEYSPACE photos
   ...   WITH REPLICATION = {
   ...    'class' : 'SimpleStrategy',
   ...    'replication_factor' : 1
   ...   };

To use our new keyspace execute:

cassandra@cqlsh> USE photos;

Installing the Python Driver

Some of the examples in this guide involve Python, so you will need the Python Cassandra driver. I recommend using virtualenv or pipenv for your Python package installations, but I will leave that up to your preference and just show you the pip installation command. To install the Python Cassandra driver:

pip install cassandra-driver
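
Once the driver is installed, a quick way to check that everything is wired up is to connect from Python. The sketch below makes several assumptions: the node from the Docker Compose step is reachable at 127.0.0.1 on port 9042 (the default CQL port), and it accepts the same cassandra/cassandra credentials used with cqlsh above. The create_photos_keyspace function is a hypothetical helper, and IF NOT EXISTS is my addition so the statement can be run more than once.

```python
KEYSPACE_CQL = """
CREATE KEYSPACE IF NOT EXISTS photos
  WITH REPLICATION = {
    'class' : 'SimpleStrategy',
    'replication_factor' : 1
  }
"""


def create_photos_keyspace(host="127.0.0.1", port=9042):
    """Connect to a local node, create the photos keyspace, and return the server version."""
    # Imported here so this file can still be loaded before the driver is installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    # Assumed credentials: the same ones passed to cqlsh earlier in this setup.
    auth = PlainTextAuthProvider(username="cassandra", password="cassandra")
    cluster = Cluster(contact_points=[host], port=port, auth_provider=auth)
    try:
        session = cluster.connect()
        session.execute(KEYSPACE_CQL)
        session.set_keyspace("photos")
        row = session.execute("SELECT release_version FROM system.local").one()
        return row.release_version
    finally:
        # Always release the driver's connections, even if a query fails.
        cluster.shutdown()
```

With the container running, calling create_photos_keyspace() should return the Cassandra version string. If the connection fails, the host, port, or credentials in your setup probably differ from the assumptions above.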