zephyr: NoSQL - An Introduction

NoSQL:

NoSQL, often expanded as "Not Only SQL", provides a mechanism to store and retrieve data that is not modeled in the tabular format used by relational databases.

Several NoSQL solutions have matured and are widely adopted, for example Redis, Riak, HBase, Cassandra, Couchbase and MongoDB.

It is critical to understand the concepts behind NoSQL, and why and how a NoSQL solution is used in a specific application architecture, because every NoSQL solution is unique in its own way and differs from general RDBMS solutions.

Need for NoSQL:

With the explosion of web and social interactions, the volume and complexity of data have grown tremendously. Applications now need to scale seamlessly without compromising performance.

RDBMS performance starts degrading beyond a certain data volume and complexity, and applications then have to consider adopting NoSQL solutions to keep up with that growth.

Polyglot persistence:

NoSQL solutions have matured, and enterprise data architects have started implementing NoSQL in their solutions, sending a strong message that an RDBMS is not the only answer to data needs.

Problems in data persistence are unique, and each problem needs a specific solution to handle the scenario well. The concept of polyglot persistence evolved to stress that an application should use a specific persistence solution for each specific scenario.

The table below describes some scenarios in a retail web application and how different persistence solutions can satisfy those needs.

Scenario		Persistence solution
User Sessions		Redis
Financial data		RDBMS
Shopping Cart		Riak
Recommendations		Neo4j
Product Catalog		MongoDB
Analytics		Cassandra
User Activity Logs		Cassandra

[Source: http://martinfowler.com/bliki/PolyglotPersistence.html]
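In application code, the table above often ends up as a small routing layer that decides which store handles which kind of data. The sketch below is only an illustration of that idea: the scenario keys and the `store_for` helper are hypothetical names, and a real application would return a client object for each store rather than a name.

```python
# Scenario -> persistence solution, copied from the table above.
# The scenario keys are hypothetical identifiers chosen for this sketch.
PERSISTENCE_BY_SCENARIO = {
    "user_sessions": "Redis",
    "financial_data": "RDBMS",
    "shopping_cart": "Riak",
    "recommendations": "Neo4j",
    "product_catalog": "MongoDB",
    "analytics": "Cassandra",
    "user_activity_logs": "Cassandra",
}


def store_for(scenario: str) -> str:
    """Return the name of the store configured for a scenario."""
    try:
        return PERSISTENCE_BY_SCENARIO[scenario]
    except KeyError:
        raise ValueError(f"no persistence solution configured for {scenario!r}") from None


print(store_for("shopping_cart"))
```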

Coming out of relational mindset:

One of the biggest problems in adopting a NoSQL solution is getting people out of the relational mindset. Data modeling habits are deeply rooted in RDBMS and relational concepts.

It is difficult at first to conceptualize data outside the relational world. But once we understand the following concepts and look back at the data solutions we have built, many of them may not need normalized modeling.

Data is not normalized.

Data will be duplicated.

Tables are schemaless and do not follow a predefined pattern.

Data can be stored in different formats such as JSON, XML, audio, video etc.

The database may compromise on some of the ACID properties.

Data may compromise on attributes such as consistency.
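To make the first few points concrete, here is a small sketch that turns relational-style rows into one denormalized, JSON-like document. The customer, order and item data and the `to_document` helper are made up for this illustration, and the document shape is an assumption, not the format of any particular database.

```python
import json

# Normalized, relational-style rows: each fact is stored once and linked by ids.
customers = {1: {"name": "Asha", "city": "Chennai"}}
orders = {100: {"customer_id": 1}}
order_items = [
    {"order_id": 100, "sku": "BOOK-1", "qty": 2},
    {"order_id": 100, "sku": "PEN-7", "qty": 5},
]


def to_document(order_id):
    """Build one self-contained document for an order (hypothetical shape)."""
    order = orders[order_id]
    customer = customers[order["customer_id"]]
    return {
        "_id": order_id,
        # Customer details are copied into the order, so they are duplicated
        # in every order that this customer places.
        "customer": dict(customer),
        # Items are embedded instead of living in a separate table.
        "items": [
            {"sku": item["sku"], "qty": item["qty"]}
            for item in order_items
            if item["order_id"] == order_id
        ],
    }


print(json.dumps(to_document(100), indent=2))
```

The whole order can now be read with a single lookup and no joins. The price is that a change to the customer's city has to be applied to every document that carries a copy of it.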

CAP theorem:

The CAP theorem defines a set of basic attributes for any distributed system. Understanding its dimensions helps to understand any NoSQL solution better. The diagram below describes the attributes satisfied by different distributed database systems in a multi-server deployment.

The important point to note is that no distributed system can completely satisfy all three dimensions of the CAP theorem: Consistency, Availability and Partition Tolerance.

A distributed system can fully satisfy at most two of the three CAP dimensions, so people have to choose the distributed system that suits their application requirements.

It is critically important to understand the application requirements and to know where a specific NoSQL solution falls among these dimensions.
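The trade-off is easiest to see when the network between two nodes breaks. The toy model below is only a sketch under simplifying assumptions: `TwoNodeRegister`, its `mode` flag and the `Unavailable` exception are hypothetical names, and real systems make this choice in far more nuanced ways (a consistency-first system may also refuse reads, for instance).

```python
class Unavailable(Exception):
    """Raised when a node refuses a request in order to stay consistent."""


class TwoNodeRegister:
    """A single value replicated on two nodes (toy model)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [None, None]  # one copy per node
        self.partitioned = False    # True when the nodes cannot reach each other

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: replicate the write to both nodes.
            self.values = [value, value]
        elif self.mode == "CP":
            # Consistency first: refuse rather than let the copies disagree.
            raise Unavailable("cannot reach the other node")
        else:
            # Availability first: accept locally; the copies may now diverge.
            self.values[node] = value

    def read(self, node):
        return self.values[node]


ap = TwoNodeRegister("AP")
ap.write(0, "x")
ap.partitioned = True
ap.write(0, "y")  # accepted, but node 1 still holds the old value
print(ap.read(0), ap.read(1))
```

With `mode="CP"` the second write would raise `Unavailable` instead, so both nodes keep agreeing on the value at the cost of rejecting the request.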

ACID Compliance:

ACID stands for Atomicity, Consistency, Isolation and Durability. These are a set of properties that guarantee transactional behavior in RDBMS operations.
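As a small illustration of atomicity, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `accounts` table and the `make_db`, `transfer` and `balances` helpers are invented for this example; the point is only that both updates of a transfer are committed together or not at all.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two sample accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src would go below zero.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


db = make_db()
print(transfer(db, "alice", "bob", 500), balances(db))
```

The failed transfer credits bob first and only then fails on alice, yet the rollback leaves both balances exactly as they were before the call.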

RDBMS concepts focus on integrity, concurrency, consistency and data validity, but many data needs in software applications do not require this level of aggregation, integrity and validity, or can handle it in upper layers.

Compromising on any of these in the database architecture may bring the high performance and scalability that RDBMS currently lacks.

A NoSQL database, for example, is not strictly ACID compliant; it can compromise on one of the ACID attributes to achieve extreme scalability and performance.

It is critically important to understand the application requirements, the specific NoSQL solution used, and how the compromise is made.

BASE versus ACID:

Instead of adhering to ACID, NoSQL tends to be BASE compliant in order to achieve scalability and high performance. The following are the BASE attributes that NoSQL solutions try to adopt:

Basically Available
Soft state
Eventual consistency
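The three attributes can be seen together in a toy simulation. This is a sketch only: the `Replica` and `EventuallyConsistentStore` classes are hypothetical, replication is modeled as an explicit `sync()` call instead of a background process, and conflicts are resolved with a simple last-write-wins version counter.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Last-write-wins: keep the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class EventuallyConsistentStore:
    def __init__(self, replica_names):
        self.replicas = [Replica(name) for name in replica_names]
        self.pending = []  # updates not yet delivered to other replicas
        self.version = 0

    def write(self, key, value, via=0):
        """Acknowledge once one replica has the value; queue the rest."""
        self.version += 1
        self.replicas[via].apply(key, self.version, value)
        for index, replica in enumerate(self.replicas):
            if index != via:
                self.pending.append((replica, key, self.version, value))

    def read(self, key, via=0):
        return self.replicas[via].read(key)

    def sync(self):
        """Deliver all queued updates (stands in for background replication)."""
        for replica, key, version, value in self.pending:
            replica.apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore(["a", "b", "c"])
store.write("cart", "1 item", via=0)
print(store.read("cart", via=1))  # may be stale until sync() runs
store.sync()
print(store.read("cart", via=1))
```

The write is accepted as soon as one replica has it (basically available), the other replicas hold an older view for a while (soft state), and after replication catches up every replica returns the same value (eventual consistency).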

NoSQL categorization based on data modeling:

Key-value stores. Ex: Redis, Riak, Amazon SimpleDB
Column-family stores (Bigtable style). Ex: Cassandra, HBase
Document databases. Ex: CouchDB, Couchbase, MongoDB
Graph databases. Ex: Neo4j, Titan

Each of these NoSQL categories provides unique advantages for specific functionality, so selecting the right category is critical to designing for the application's needs.

At a high level, a specific NoSQL solution can be chosen based on the complexity of the data model and the querying associated with it.

The diagram below provides a good comparison of the different NoSQL databases.

[Source: https://highlyscalable.wordpress.com/2012/03/01/NoSQL-data-modeling-techniques/]

NoSQL based on system architecture:

Based on system architecture, NoSQL solutions can be categorized as follows.

P2P (Ring Topology)
Master-Slave

Each architecture has pros and cons, and a decision has to be made based on the needs.

	P2P (Ring Topology)	Master-Slave
Role	All nodes carry an equal role	Master-slave architecture with specific responsibilities on specific nodes
Consistency	Eventual	Strong
Write/Read	Reads and writes happen through all the nodes	Writes are mostly driven through restricted nodes
Availability	High availability	Availability is somewhat compromised when the master / write node fails
Data	Data is partitioned across all nodes with replication	Data is partitioned across multiple slave nodes with replication
Examples	Cassandra, Couchbase	HBase, MongoDB

Data read / writes:

The need for NoSQL solutions arises when you operate on huge volumes of data with high performance requirements for reads and writes.

Below are the typical use cases where NoSQL databases are used:

Scalable databases
High availability and fault tolerance
Ever growing set of data
Bulk read / write operations

Some NoSQL solutions are good for write-intensive workloads, some for read-intensive workloads, and some for mixed workloads. Specific analysis is needed to choose a NoSQL solution based on the needs.

Other important concepts that I would like to highlight, which apply to any NoSQL solution:

Sharding:

Sharding is one of the important concepts in NoSQL solutions, by which data is partitioned horizontally across different nodes in the cluster. This means the data is split based on some logic, say a hash code, and spread across different nodes.
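A minimal sketch of the hash-based split described above, assuming a fixed number of shards and a simple modulo on the hash. The `shard_for` and `partition` helpers are hypothetical names, and MD5 is used here only because it gives a stable value across runs, not for security.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    # hashlib returns the same digest on every run, unlike the built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they are assigned to."""
    shards = {index: [] for index in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


for shard, members in partition([f"user:{i}" for i in range(10)], 3).items():
    print(shard, members)
```

One design note: with plain modulo, changing the number of shards moves most keys to a different shard. Many systems avoid this with schemes such as consistent hashing, which is beyond this sketch.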

Replication:

Data is not only partitioned across different nodes but also replicated across different cluster nodes. The replication factor is a configuration setting in the solution. Replication provides high availability and automatic failover when a specific node goes down.
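The sketch below shows one simple way a replication factor could be applied: a key is hashed to a starting node, and copies are placed on the next nodes in order, wrapping around like a ring. The `replicas_for` and `read_targets` helpers and the node names are hypothetical, and real databases use more elaborate placement strategies.

```python
import hashlib


def _stable_hash(key: str) -> int:
    """Hash a key to an integer that is the same on every run."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(key, nodes, replication_factor):
    """Return the nodes that hold a copy of key."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and len(nodes)")
    start = _stable_hash(key) % len(nodes)
    # Walk the node list from the starting position, wrapping around.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def read_targets(key, nodes, replication_factor, down=()):
    """Replicas that can still serve key when the nodes in down have failed."""
    return [
        node
        for node in replicas_for(key, nodes, replication_factor)
        if node not in down
    ]


cluster = ["node1", "node2", "node3", "node4"]
owners = replicas_for("cart:42", cluster, 3)
print(owners)
print(read_targets("cart:42", cluster, 3, down={owners[0]}))
```

With a replication factor of 3, the key stays readable even when its first owner is down, because two other nodes still hold a copy.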

Reference:

http://martinfowler.com/

http://highscalability.com

http://nosqlguide.com/

zephyr

Saturday, January 3, 2015
