Data modeling concept in cassandra are slightly different from Relational databases. Therefore
it is worth spending time to understand this concept. In this post we are going to discuss more about different
Keys/Filtering options available in Cassandra. Let's start understanding
Primary, Compound, Partition, Clustering Keys, Secondary Index and Allow Filtering
Primary/Row Key
Similar to any database Cassandra can have multiple Row. Each row in Cassandra is referenced/identified by its Primary key also known as Row key. Being a NoSQL database number of Columns can vary in different Rows (i.e. one row in a table can have three columns whereas the other row in the same table can have ten columns.Both column names and values are binary types in the Cassandra.
Compound Keys
As the name suggests, (Similar to RDBMS Composite Key) compound keys are
comprised of one or more columns which are referenced in the primary key. Compound key can
also be combination of Partition and clustering key where one component is called partition key whereas the
other component is called clustering key. Different variations of a Compound Key are:
C1
Primary key has only one partition key and no cluster key.
(C1, C2)
Column C1 is a partition key and column C2 is a
cluster key.
(C1,C2,C3,…)
: Column C1 is a partition key and columns C2, C3
and so on make cluster key.
(C1, (C2, C3,…))
: Column C1 is a partition key and columns
C2,C3,… make cluster key.
((C1, C2,…), (C3,C4,…))
: Columns C1, C2 make partition key and columns
C3,C4,… make cluster key.
Partition Key
The purpose of partition key is to identify the partition or a node in the cluster which stores that row. When data is read or write from the cluster a function called Partitioner is used to compute the hash value of the partition key. This hash value is used to determine the node/partition which contains that row.Clustering Key
The purpose of clustering key is to store row data in sorted order. The sorting of data is based on columns which are included in the clustering key. This arrangement makes it efficient to retrieve data using clustering key.Order By
What if we wanted to change the default sort order from ascending to descending? There is an additionalWITH
clause that needs to be added while creating table for clustering key.
WITH CLUSTERING ORDER BY (column1 DESC, column2 ASC, column3 ASC)
We need to specify all the columns in the sort order if we add one column for ordering even if we want them to
sorted on default order. We don't need to define partition key in the sort order. Since it’s the partition key,
there is nothing to sort as hashed values won’t be close to each other in the cluster.
Secondary Index
Usually we create Secondary index to filter data for any column which is not available in the primary key. This is quite different from the RDBMS index... This is quite faster then the Allow filtering if used along with the Partition key... If we don't use partition key in the query this becomes a problematic in the long run, because Secondary index will search on all the partition if we fire a query without having partition key the where clause.Allow Filtering
Allow filtering is expensive then the Secondary Index in some cases as it needs load all the records in memory before applying provided filter on the data. Let's consider an example, we have a table with 1 million rows- If 95% rows are returned as a result, we should use ALLOW FILTERING, as the query will be efficient.
- If only 2 rows are returned as a result, we should not use ALLOW FILTERING, and the query will be extremely inefficient because Cassandra will load 999, 998 rows for nothing.
Resist uses to ALLOW FILTERING, think about the data model and what we are trying to do, if query is rejected by Cassandra because it needs filtering,
Queries
Even though Cassandra queries looks exectly similar to the RDBMS queries, they are quite different and does not support lot of RDBMS concepts because Cassandra is created for writing data faster.Cassandra table design should be done based on queries needs to fire on that table along with filter criteria.Lets understand this using few example of queries
PRIMARY KEY((col1, col2), col10, col4))
So the valid queries for above primary key are (excluding secondary indexes and Allow filtering)
col1 and col2
col1 and col2 and col10
col1 and col2 and col10 and col 4
Invalid queries for above primary key are :
col1 and col2 and col10
col1 and col2 and col10 and col 4
col1 and col2 and col4
anything that does not contain both col1 and col2
anything that does not contain both col1 and col2
1 Comments
Thanks for explaining this.
ReplyDeletePost a Comment