Cassandra – The Right Data Store for Scalability, Performance, Availability and Maintainability
Cassandra is an open source, distributed, column-based database management system (DBMS). It claims to be designed for large volumes of distributed data, with high availability, high throughput and high reliability.
When we look at the overall ecosystem here, we find that Cassandra is being challenged from three directions:
- Traditional RDBMS for mission-critical applications
- NoSQL databases for Big Data analytics
- In-memory databases
But the big question is: is Cassandra as great a database as it claims to be? Is it really a multipurpose, multi-dimensional data store? Is it hype, or is it real?
To find out, let us put Cassandra through a SPAM test. This article illustrates how Cassandra achieves Scalability, Performance, Availability and Maintainability.
1. High Availability
Before we get started with the availability test, let us understand some of the key features of Cassandra Architecture.
- Ring Structure – In contrast to the typical master-slave model, Cassandra clusters work in a ring fashion. Each node in the ring has the same role and responsibility. In other words, all nodes are the same and there is no master node that controls other nodes. Therefore, there is no single point of failure.
- Multi Data Center – The ring of a Cassandra cluster can spread across multiple data centers (DCs). Cassandra supports both virtual and physical DCs. For example, suppose we have two data centers at two different geographical locations: Douglas County (US) and Rotterdam (EU). The data is stored across both DCs, yet every node in the ring gives a consistent answer to any query, while the distinction between local-DC and remote-DC behavior is kept intact. Queries can be run against the local DC only or across all DCs in the ring.
- Replication – Cassandra also provides built-in but customizable replication, which stores redundant copies of data across nodes in a Cassandra ring. This means that if any node in a cluster goes down, one or more copies of that node’s data are available on other machines in the cluster. Likewise, if a datacenter goes down, the data is available in the other DCs. The data can be replicated to multiple nodes in a single datacenter, across multiple datacenters or across multiple cloud providers.
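The ring and replication ideas above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the node names, the md5-based `token` function and the `Ring.replicas_for` helper are all assumptions made for illustration. It places each key on a ring of equal peers and picks replicas by walking clockwise from the key's position, so losing one node still leaves other copies.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (illustrative hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A masterless ring: every node is a peer that owns one position."""

    def __init__(self, nodes):
        self._entries = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._entries]

    def replicas_for(self, key: str, replication_factor: int):
        """Return the nodes holding copies of `key`, walking clockwise on the ring."""
        if replication_factor > len(self._entries):
            raise ValueError("replication factor exceeds the number of nodes")
        # First node whose token lies after the key's token, wrapping around.
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._entries)
        return [
            self._entries[(start + i) % len(self._entries)][1]
            for i in range(replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])  # hypothetical node names
owners = ring.replicas_for("user:42", replication_factor=3)

# If the first replica goes down, the remaining replicas still hold the data.
survivors = [name for name in owners if name != owners[0]]
print(owners, survivors)
```

With a replication factor of 3, any single node failure in this model leaves two copies of every key reachable, which is the property the bullets above describe.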
Cassandra’s tunable settings allow the replication factor and the distribution of replicas to be customized through configuration.
Cassandra replication can be:
- In a single Datacenter
- Across multiple Datacenters
- Across multiple cloud providers
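Per-datacenter replication is typically declared when a keyspace is created. The sketch below only builds the CQL text for a keyspace that uses `NetworkTopologyStrategy`; the helper name `create_keyspace_cql`, the keyspace name `shop` and the datacenter names `us_dc` and `eu_dc` are hypothetical. In a real cluster the datacenter names would have to match the names the cluster itself reports.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with one replica count per datacenter."""
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    options = ", ".join(
        f"'{dc}': {count}" for dc, count in replicas_per_dc.items()
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical layout: three copies in the US datacenter, two in the EU one.
statement = create_keyspace_cql("shop", {"us_dc": 3, "eu_dc": 2})
print(statement)
```

The resulting statement could then be run through any CQL client; changing the counts in the dictionary is all it takes to express a different replication layout.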
Let us now look at multi-datacenter replication with a diagram:
[Diagram: multi-datacenter replication]
2. Scalability
Cassandra is linearly scalable. This means that capacity can be increased simply by adding new nodes. Cassandra scales horizontally in two ways: by adding more nodes to a datacenter, or by adding more datacenters to the cluster.
- Scaling in a Single Datacenter
- Scaling into Multiple Datacenters
- Node Commissioning and Decommissioning – One of the major factors in the scaling capability of a database system is how easy and smooth the ‘add’ and ‘remove’ operations are. With the introduction of virtual nodes (vnodes), adding a node to an existing cluster or removing one is greatly simplified. When a new node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across the other nodes in the cluster. A simple nodetool command (nodetool decommission) safely removes a node from the cluster. It assigns the ranges that the node was responsible for to other nodes and replicates the data appropriately.
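The effect of vnodes when a node joins can be illustrated with a small simulation. This is a sketch under stated assumptions, not Cassandra's actual token allocation: the node names `n1` to `n4`, the choice of 64 tokens per node and the md5-based `token` function are all made up for the example. The point it shows is that when a fourth node joins, the only keys that change owner are the ones the new node takes over.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    """Illustrative hash onto the ring (not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, vnodes_per_node=64):
    """Give every node several ring positions (vnodes); return sorted (token, node) pairs."""
    return sorted(
        (token(f"{node}#{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )


def assign(ring, keys):
    """Map each key to the node owning the first vnode clockwise from the key."""
    tokens = [t for t, _ in ring]
    return {
        key: ring[bisect.bisect_right(tokens, token(key)) % len(ring)][1]
        for key in keys
    }


keys = [f"row-{i}" for i in range(2000)]
before = assign(build_ring(["n1", "n2", "n3"]), keys)
after = assign(build_ring(["n1", "n2", "n3", "n4"]), keys)  # n4 joins

moved = [key for key in keys if before[key] != after[key]]
print("keys per node after join:", Counter(after.values()))
print("keys that changed owner:", len(moved))
```

Because each node owns many small ranges, the new node picks up pieces from all existing nodes rather than splitting one neighbor's range, which is why the load tends to stay roughly even in this model.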
3. Performance
In general, NoSQL databases come with a number of architectural best practices that affect performance. Being a NoSQL database through and through, Cassandra shares the performance advantages of other NoSQL databases. It incorporates all of the NoSQL best practices listed below, which helps put it ahead of other NoSQL competitors.
- Fully Distributed: Cassandra distributes data automatically across all nodes in the ring. Every node handles a proportionate share of the cluster’s activity. The masterless architecture helps deliver lower read and write latency.
- Asynchronous: Synchronous replication can result in unsatisfactorily slow response times, because the distributed DBMS spends considerable time checking that an update has been accurately and completely propagated across the network. Cassandra avoids this problem by propagating updates asynchronously, and thus provides high-performing reads and writes.
- Eventual Consistency: Cassandra’s eventually consistent data model and node repair features ensure that the cluster converges to a consistent state over time.
Cassandra’s architecture allows an authorized user to connect to any node in any data center and access data using CQL (Cassandra Query Language). Most requests for data by users at a particular site can be satisfied by data stored at that site (local read/write). This speeds up query processing, since communication and central computer delays are minimized. It may also be possible to split complex queries into subqueries that can be processed in parallel at several sites, providing an even faster response.
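How asynchronously propagated writes converge can be sketched with a toy model. Cassandra resolves conflicting versions of a value by write timestamp, with the newest write winning; everything else here (the `Cell` and `Replica` classes, the `read_with_repair` helper and the sample data) is hypothetical and far simpler than the real read path.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write time; the newest timestamp wins


class Replica:
    """One copy of the data; keeps only the newest cell per key."""

    def __init__(self, name: str):
        self.name = name
        self.data: Dict[str, Cell] = {}

    def write(self, key: str, cell: Cell) -> None:
        current = self.data.get(key)
        if current is None or cell.timestamp > current.timestamp:
            self.data[key] = cell

    def read(self, key: str) -> Optional[Cell]:
        return self.data.get(key)


def read_with_repair(replicas, key: str) -> Optional[Cell]:
    """Return the newest cell seen and push it to replicas that are behind."""
    cells = [c for c in (r.read(key) for r in replicas) if c is not None]
    if not cells:
        return None
    newest = max(cells, key=lambda c: c.timestamp)
    for replica in replicas:
        replica.write(key, newest)  # no-op on replicas that are already current
    return newest


replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
for replica in replicas:
    replica.write("cart:7", Cell("1 item", timestamp=100))

# A newer write has reached only r1 so far; r2 and r3 are temporarily stale.
replicas[0].write("cart:7", Cell("2 items", timestamp=200))

result = read_with_repair(replicas, "cart:7")
print(result, [r.read("cart:7") for r in replicas])
```

After the repairing read, all three replicas in the model hold the newest value, which is the sense in which the cluster is consistent "eventually" rather than at write time.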
4. Maintainability
- Self-Healing: Cassandra’s eventual consistency makes it easy to recover failed nodes. If the data on a node becomes corrupted, we can take the node offline, scrub the corrupted data and then attach the node back to the ring. Eventual consistency then propagates the data from the other nodes, as configured.
- Smooth Upgrade: Cassandra’s eventual consistency also supports smooth upgrades, including in-place version upgrades.
- Snapshot and Backup: Cassandra supports automated backups of data and cluster snapshots using tools such as nodetool or OpsCenter. A backup can cover one keyspace, specific keyspaces or all of them. Backups can be stored on local storage or on remote cloud storage such as an Amazon S3 bucket. OpsCenter allows full, table-level or point-in-time restoration.
- Flexi Restore: Cassandra data can be restored from a Cassandra snapshot using nodetool. It allows full, table-level or point-in-time restoration.
Considering these advantages and results, I’d say Cassandra is definitely a great database for enterprises that need to address scalability, performance, availability and maintainability requirements.
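The maintenance tasks above map onto nodetool subcommands. The sketch below only assembles the command lines and does not run anything; the helper names, the snapshot tag `nightly` and the keyspace name `shop` are hypothetical, and the exact options available should be checked against the nodetool version in use.

```python
import shlex


def snapshot_command(tag: str, keyspaces=()):
    """Command line for `nodetool snapshot` with a named tag and optional keyspaces."""
    return ["nodetool", "snapshot", "-t", tag, *keyspaces]


def scrub_command(keyspace=None):
    """Command line for `nodetool scrub`, optionally limited to one keyspace."""
    return ["nodetool", "scrub"] + ([keyspace] if keyspace else [])


def repair_command(keyspace=None):
    """Command line for `nodetool repair`, optionally limited to one keyspace."""
    return ["nodetool", "repair"] + ([keyspace] if keyspace else [])


# Hypothetical routine: snapshot one keyspace, then scrub and repair it.
commands = [
    snapshot_command("nightly", ["shop"]),
    scrub_command("shop"),
    repair_command("shop"),
]
for argv in commands:
    print(shlex.join(argv))
```

On a node where nodetool is installed, each list could be passed to `subprocess.run(argv, check=True)`; keeping the commands as argument lists avoids shell quoting issues.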
Technical Architect & Cloud Expert, RapidValue Solutions