Data modeling concept in cassandra are slightly different from Relational databases. Therefore it is worth spending time to understand this concept. In this post we are going to discuss more about different Keys/Filtering options available in Cassandra. Let's start understanding Primary, Compound, Partition, Clustering Keys, Secondary Index and Allow Filtering

Primary/Row Key

Similar to any database Cassandra can have multiple Row. Each row in Cassandra is referenced/identified by its Primary key also known as Row key. Being a NoSQL database number of Columns can vary in different Rows (i.e. one row in a table can have three columns whereas the other row in the same table can have ten columns.
Both column names and values are binary types in the Cassandra.

Compound Keys

As the name suggests, (Similar to RDBMS Composite Key) compound keys are comprised of one or more columns which are referenced in the primary key. Compound key can also be combination of Partition and clustering key where one component is called partition key whereas the other component is called clustering key. Different variations of a Compound Key are:
C1 Primary key has only one partition key and no cluster key.
(C1, C2) Column C1 is a partition key and column C2 is a cluster key.
(C1,C2,C3,…): Column C1 is a partition key and columns C2, C3 and so on make cluster key.
(C1, (C2, C3,…)): Column C1 is a partition key and columns C2,C3,… make cluster key.
((C1, C2,…), (C3,C4,…)): Columns C1, C2 make partition key and columns C3,C4,… make cluster key.

Partition Key

The purpose of partition key is to identify the partition or a node in the cluster which stores that row. When data is read or write from the cluster a function called Partitioner is used to compute the hash value of the partition key. This hash value is used to determine the node/partition which contains that row.

Clustering Key

The purpose of clustering key is to store row data in sorted order. The sorting of data is based on columns which are included in the clustering key. This arrangement makes it efficient to retrieve data using clustering key.

Order By

What if we wanted to change the default sort order from ascending to descending? There is an additional WITH clause that needs to be added while creating table for clustering key.
WITH CLUSTERING ORDER BY (column1 DESC, column2 ASC, column3 ASC)
We need to specify all the columns in the sort order if we add one column for ordering even if we want them to sorted on default order. We don't need to define partition key in the sort order. Since it’s the partition key, there is nothing to sort as hashed values won’t be close to each other in the cluster.

Secondary Index

Usually we create Secondary index to filter data for any column which is not available in the primary key. This is quite different from the RDBMS index... This is quite faster then the Allow filtering if used along with the Partition key... If we don't use partition key in the query this becomes a problematic in the long run, because Secondary index will search on all the partition if we fire a query without having partition key the where clause.

Allow Filtering

Allow filtering is expensive then the Secondary Index in some cases as it needs load all the records in memory before applying provided filter on the data. Let's consider an example, we have a table with 1 million rows
  • If 95% rows are returned as a result, we should use ALLOW FILTERING, as the query will be efficient.
  • If only 2 rows are returned as a result, we should not use ALLOW FILTERING, and the query will be extremely inefficient because Cassandra will load 999, 998 rows for nothing.
Resist uses to ALLOW FILTERING, think about the data model and what we are trying to do, if query is rejected by Cassandra because it needs filtering,

Queries

Even though Cassandra queries looks exectly similar to the RDBMS queries, they are quite different and does not support lot of RDBMS concepts because Cassandra is created for writing data faster.
Cassandra table design should be done based on queries needs to fire on that table along with filter criteria.
Lets understand this using few example of queries
PRIMARY KEY((col1, col2), col10, col4))
So the valid queries for above primary key are (excluding secondary indexes and Allow filtering)
col1 and col2
col1 and col2 and col10
col1 and col2 and col10 and col 4
Invalid queries for above primary key are :
col1 and col2 and col4
anything that does not contain both col1 and col2

Final Thought

Each update statement requires a precise set of primary keys to be specified using a WHERE clause. We need to specify all keys in a table having compound and clustering columns.