Modern Big Data requirements include:
Traditional relational databases (RDBMS) cannot meet these requirements on their own: the relational model's normalized schemas, table joins, and ACID compliance add massive overhead at Big Data scale. Supporting these features requires high-end systems that are not cost-effective. Additionally, such systems often require substantial downtime.
So, how do we overcome these challenges? We need a few strategies to manage Big Data, some of which are listed below:
This is where non-relational databases, also known as NoSQL databases, are useful.
There are four broad classes of NoSQL databases:
There are three important factors which help navigate the NoSQL landscape:
In distributed systems, consistency, availability, and partition tolerance trade off against one another (the CAP properties). This means that a single NoSQL database cannot fully guarantee all three properties at once; one of them has to be sacrificed.
As this blog is specific to Apache Cassandra, we will go through what Cassandra is, its architecture, benefits, and use cases.
Cassandra is a top-level Apache project. It is an open-source, fully distributed, linearly scalable storage system for managing large amounts of data spread across the world, with no single point of failure.
Cassandra is built using two core technologies:
Facebook integrated these two technologies and released the result as Cassandra.
Cassandra belongs to the column family type of NoSQL database, and it picks availability and partition tolerance from the CAP properties. However, one of Cassandra's stronger features is its tunable consistency.
Now, let's understand some Cassandra terminology:
First, the Cassandra driver chooses a coordinator node for the write request. The partition key is then hashed to generate a token for the partition, and the data is sent to the node that owns the range containing this token. If the replication factor (RF) is greater than one (say, two), a copy of the data also goes to the next node on the ring; the same applies for RF = 3, and so on.
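The write path above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ring` class, the `token_for` helper, and the node names are hypothetical, MD5 stands in for Cassandra's real partitioner, and replicas are simply placed on consecutive nodes of the ring.

```python
import hashlib
from bisect import bisect_left


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is a stand-in partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Hypothetical token ring: each node owns the range ending at its token."""

    def __init__(self, node_tokens):
        # Sort nodes by the token they own so the ring can be walked in order.
        self._entries = sorted((tok, name) for name, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def replicas(self, partition_key: str, rf: int):
        """Return the owner of the key's token plus the next rf - 1 nodes."""
        token = token_for(partition_key)
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect_left(self._tokens, token) % len(self._entries)
        count = min(rf, len(self._entries))
        return [
            self._entries[(start + i) % len(self._entries)][1]
            for i in range(count)
        ]


if __name__ == "__main__":
    ring = Ring({
        "node-a": 0,
        "node-b": 2**62,
        "node-c": 2**63,
        "node-d": 3 * 2**62,
    })
    # With RF = 3 the write goes to the owning node and the next two nodes.
    print(ring.replicas("user:42", rf=3))
```

Raising `rf` only extends the walk around the ring; the owning node for a given key stays the same.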
Reads work similarly. The Cassandra driver first identifies a coordinator, which requests the data from the replicas (the node holding the original and the nodes holding copies) and compares hashes of their responses according to the configured consistency level (CL). If the CL is ONE, the coordinator returns the data from a single replica without any comparison; if it is TWO, it compares two copies. When the copies match, the data is returned as is; when they differ, the coordinator returns the copy with the most recent write timestamp and repairs the stale replica. An error is returned only when fewer replicas respond than the CL requires. If the CL is QUORUM, the coordinator needs floor(RF / 2) + 1 replicas (for example, if RF = 3, a quorum is 2).
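A minimal sketch of the quorum arithmetic and of a coordinator-side read follows. The `ReplicaResponse` type and the `read` function are hypothetical names used for illustration, and the sketch assumes that disagreement between replicas is resolved by picking the value with the most recent write timestamp (last write wins).

```python
from dataclasses import dataclass


def quorum(rf: int) -> int:
    """Number of replicas in a quorum: floor(RF / 2) + 1."""
    return rf // 2 + 1


@dataclass
class ReplicaResponse:
    node: str
    value: str
    timestamp: int  # write timestamp attached to the value


def read(responses, required: int) -> str:
    """Return the newest value among the first `required` responses."""
    if len(responses) < required:
        # Not enough replicas answered to satisfy the consistency level.
        raise RuntimeError(
            f"need {required} replicas, only {len(responses)} responded"
        )
    consulted = responses[:required]
    # Assumption: the value with the latest timestamp wins.
    return max(consulted, key=lambda r: r.timestamp).value


if __name__ == "__main__":
    responses = [
        ReplicaResponse("node-a", "old", timestamp=100),
        ReplicaResponse("node-b", "new", timestamp=200),
    ]
    print(quorum(3))                     # 2
    print(read(responses, quorum(3)))    # new
```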
Write requests rely on the Hinted Handoff mechanism, a recovery mechanism for writes targeting offline nodes. In this mechanism, the coordinator node stores the data as a hint if the target node for a write is down or fails to acknowledge. The hinted data held by the coordinator is replayed to the target node if it comes back online within a specified amount of time.
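The mechanism can be sketched roughly as follows. The `Node`, `Hint`, and `Coordinator` classes are hypothetical, and the time window is simplified to "hints older than the window are dropped at replay time"; a real cluster handles hint storage and expiry in more detail than this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    online: bool = True
    data: dict = field(default_factory=dict)


@dataclass
class Hint:
    key: str
    value: str
    stored_at: float  # time at which the coordinator stored the hint


class Coordinator:
    """Hypothetical coordinator that keeps hints for offline targets."""

    def __init__(self, hint_window: float):
        self.hint_window = hint_window
        self.hints = {}  # target node name -> list of Hint

    def write(self, target: Node, key: str, value: str, now: float) -> bool:
        """Write to the target, or store a hint if it is offline."""
        if target.online:
            target.data[key] = value
            return True
        self.hints.setdefault(target.name, []).append(Hint(key, value, now))
        return False

    def replay(self, target: Node, now: float) -> int:
        """Replay unexpired hints to a recovered target; return how many."""
        if not target.online:
            return 0
        replayed = 0
        for hint in self.hints.pop(target.name, []):
            if now - hint.stored_at <= self.hint_window:
                target.data[hint.key] = hint.value
                replayed += 1
        return replayed


if __name__ == "__main__":
    node = Node("node-b", online=False)
    coordinator = Coordinator(hint_window=10.0)
    coordinator.write(node, "k1", "v1", now=0.0)   # stored as a hint
    node.online = True
    print(coordinator.replay(node, now=5.0), node.data)
```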
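When a replica is down, read requests depend on the RF and CL values. If the CL requires fewer replicas than the RF, the read can still be served by comparing the data on the remaining replicas. If the CL equals the RF, every replica must respond, so the read fails with an error.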
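The relationship between RF, CL, and unavailable replicas can be checked with a small helper. The function names here are hypothetical; the level names ONE, TWO, THREE, QUORUM, and ALL follow Cassandra's consistency levels, and the check only counts live replicas.

```python
def required_replicas(cl: str, rf: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }
    return levels[cl]


def can_serve_read(cl: str, rf: int, replicas_down: int) -> bool:
    """True if enough replicas remain online to satisfy the level."""
    return rf - replicas_down >= required_replicas(cl, rf)


if __name__ == "__main__":
    # RF = 3 with one replica offline.
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, can_serve_read(level, rf=3, replicas_down=1))
```

With RF = 3 and one replica down, ONE and QUORUM reads still succeed, while ALL (CL = RF) cannot be satisfied.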
As mentioned already, a column family maps keys to sets of 'n' typed columns.
Note: Column Family terminology (Thrift API terms) is used for Cassandra 1.1 and earlier versions.
A CQL table provides a two-dimensional view of a column family that may contain multi-dimensional data because of composite keys and collections.
CQL Table and Column Family are largely interchangeable terms supported by the Cassandra Query Language.
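To make the two views concrete, here is a simplified sketch of how rows with a composite key can be folded into one wide row per partition. The table contents, the column names, and the `to_column_family` helper are all hypothetical, and the `clustering:column` naming only approximates how the older column-family layout is usually described.

```python
from collections import defaultdict

# Hypothetical CQL-style rows; (user_id, posted_at) is the composite key.
cql_rows = [
    {"user_id": "alice", "posted_at": "2013-01-01", "body": "hello"},
    {"user_id": "alice", "posted_at": "2013-01-02", "body": "again"},
    {"user_id": "bob", "posted_at": "2013-01-01", "body": "hi"},
]


def to_column_family(rows, partition_key, clustering_key):
    """Group two-dimensional rows into one wide row per partition key."""
    wide_rows = defaultdict(dict)
    for row in rows:
        pk = row[partition_key]
        ck = row[clustering_key]
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue
            # The clustering value becomes part of the stored column name.
            wide_rows[pk][f"{ck}:{column}"] = value
    return dict(wide_rows)


if __name__ == "__main__":
    for key, columns in to_column_family(cql_rows, "user_id", "posted_at").items():
        print(key, columns)
```

Each CQL row for `alice` becomes one extra column in her single wide row, which is why the table view looks two-dimensional while the underlying data is keyed on more than one dimension.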
Cassandra 1.2+ relies on CQL schema, concepts and terminology. The following table shows the correlation between CQL API and Thrift API terms.
The following figures show examples of the CQL table view and the Column Family view.
CQL Table:
Column Family:
Cassandra is the ideal solution when you need:
It is not ideal when you need ACID-compliant transactions with rollback. Traditional RDBMS excel in such scenarios.
Typical use cases include messaging, fraud detection, personalization, playlists and collections, and recommendation engines.
The following chart, which measures the scalability and elasticity of distributed database systems, shows that Cassandra compares favorably with other NoSQL databases.
Santosh is a certified Apache Cassandra Administrator and Data Warehousing professional with expertise in various modules of Oracle BI Applications working from KPI Partners Offshore Technology Center. Apart from Oracle, Santosh has worked on Salesforce integration and analytics projects.