Categories
Cassandra

Cassandra Secondary Indexes – An Overview

One of the primary goals of Apache Cassandra is the storage and retrieval of data as quickly as possible. One of the tools that databases provide is a mechanism called an index, which, like an index at the back of a book, provides quick access to the desired data. Cassandra provides an implementation of secondary indexes, though its design and implementation are different enough from a relational database to warrant more exploration. This is the first in a series of posts that detail secondary indexes and it will provide an overview of the basics of Cassandra’s native secondary index usage and known gotchas.

As you can imagine, the more research that is done into this topic the more still there is found to research. These are the other subjects under consideration:

  • Local vs. Global Index
  • Cardinality
  • Index maintenance
  • Performance
  • Non-native Indexes (includes SASI, custom, etc.)

Naturally, the above list of topics may change, merge, split, etc. as the research goes on, but this is the starting point.

So let’s begin…

Creating an Index

Creating a secondary index in Cassandra is straightforward. This would usually be performed using the cqlsh prompt.

Assuming you have a keyspace named my_keyspace that you’ve use-d in cqlsh, and assuming you have a users table with the user’s name, email address, state, ZIP code, etc., the statement itself to create an index is:

CREATE INDEX zip_code_idx ON users(zip_code);

Specifying the index name is optional; if you omit it, Cassandra will provide an index name with the format:

<keyspace name>.<table name>_<column name>_idx

Therefore, the following statement will create a new index named my_keyspace.users_zip_code_idx:

CREATE INDEX ON users(zip_code);

The default name isn’t half bad. It’s there if you need it.

Also note that you can create custom indexes, indexes on collections columns, and all kinds of crazy stuff that is outside the scope of this post.


Gotcha #1: Only One Column is Allowed

Unlike most relational databases, by default Cassandra will not allow you to create an index with more than one column. Attempting to execute the following will cause an error:

CREATE INDEX ON users(zip_code, state);

The specific error returned is:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Only CUSTOM indexes support multiple columns"

OK, so “custom” indexes apparently allow for this. Hopefully I can research what those are for a future article in this series.


Gotcha #2: Be Aware of Your Column’s Cardinality

Unlike most relational databases, creating indexes in Cassandra with cardinality that is either too high or too low will cause performance issues. Here is an example of creating an index with very high cardinality:

CREATE INDEX ON users(email_address);

Cassandra will allow you to create this index, but it will likely cause some headache down the road. The subject of cardinality with Cassandra indexes is a topic warranting a dedicated post and is planned for this article series.


As new data is inserted into the table that owns in the index, Cassandra will automagically update the index entries correctly.

Creating an index on a table with lots of existing data may take time. Cassandra must essentially iterate over all of the rows in the source table and insert index entries. How do you know if the index is done being created? It’s apparently possible to use nodetool to divine this information, but as yet I don’t know how reliably.

Gotcha #3: Don’t Use an Index Until It’s Fully Created

Unlike many relational databases, Cassandra will not block on the creation of the index for a table with existing data. The population of the index is done in the background. Depending on the size of the source table that population could take some time during which users querying the table using that index before it is complete may see an error similar to this:

ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures: UNKNOWN from localhost/127.0.0.1:7000" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

Work is in progress to make this behavior and/or error message better.


Using an Index

Using a secondary index in your query is about as straightforward as you would hope:

SELECT * FROM users WHERE zip_code = 95014;

The actual mechanics of how the appropriate data is located and returned will the subject of future posts in this series.


Gotcha #4: Cassandra Won’t Let Your Forget to Create an Index

In a relational database, it’s possible to filter the results by using any of the database table’s columns. Performance may be severely impacted if the columns in question are not part of an index.

For example, in a relational database, we could query on the state column even though it wasn’t indexed by simply executing this:

SELECT * FROM users WHERE state = 'CA';

However, because there isn’t an index on that column, the database will perform a full table scan which is horribly inefficient over large data sets.

Cassandra, on the other hand, seeks to maintain predictable performance for its queries. By default Cassandra considers queries against non-indexed columns to be invalid. It will literally give an error and refuse to execute the query:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

Therefore, in Cassandra, you don’t have to worry about slow queries as a result of accidentally forgetting to create an index. It will give you an alert up front.


Getting Information About an Index

The cqlsh tool provides a way to describe an index:

DESCRIBE INDEX my_keyspace.users_email_address_idx;

The output is a little underwhelming in terms of usable details:

CREATE INDEX ON users(state, email_address);

You can get similar information out of the internal system tables:

SELECT * FROM system_schema.indexes;

This returns a bit of the additional data (like COMPOSITES, whatever those are):

 keyspace_name | table_name | index_name                   | kind       | options
---------------+------------+------------------------------+------------+-----------------------------
       test_ks | test_table |                        idx_1 | COMPOSITES |     {'target': 'user_name'}
       test_ks | test_table |                        idx_3 | COMPOSITES |         {'target': 'state'}
       test_ks | test_table | test_table_email_address_idx | COMPOSITES | {'target': 'email_address'}

It might be helpful if there was a means to retrieve some more meta data about the index. For example, I’d love to have it answer some of the following questions:

  • How many primary keys are stored in the index?
  • How much disk space is it using?
  • Is it 100% complete, or is it still being built?

Some of that information is available by knowing how to interpret nodetool output and logs.

Conclusion

The foregoing was just a brief overview of Cassandra’s indexes. Trust me, there are a lot more to indexes in Cassandra than there might seem at first blush. They are extremely powerful but are a bit persnickety and can actually hurt performance if used incorrectly.

If you’re really hungry for more details on Cassandra indexes, please dig into the following blog posts: