Saturday, January 3, 2015

NoSQL - An Introduction

NoSQL:

Not Only SQL, usually referred to as NoSQL, provides a mechanism to store and retrieve data that is not modeled in the tabular format used by relational databases.

Several NoSQL solutions are mature and widely adopted, for example Redis, Riak, HBase, Cassandra, Couchbase and MongoDB.

It is critical to understand the concepts of NoSQL, and why and how a NoSQL solution is used in a specific application architecture, because every NoSQL solution is unique in its own way and different from general RDBMS solutions.

Need for NoSQL:

With the explosion of the web and social interactions, the volume and complexity of data have grown tremendously. Applications now need to scale seamlessly without compromising performance.

RDBMS performance starts to degrade beyond a certain data volume and complexity, so applications have to consider adopting NoSQL solutions to keep up with that growth.

Polyglot persistence:

NoSQL solutions have matured, and enterprise data architects have started implementing them, sending a strong message that an RDBMS is not the only answer to data needs.

Data persistence problems are unique, and each one needs a solution suited to the scenario. The concept of polyglot persistence evolved to emphasize that an application should use a specific persistence solution for each specific scenario.

The table below describes some scenarios in a retail web application and the persistence solution that can serve each of them.

Scenario            | Persistence solution
User Sessions       | Redis
Financial data      | RDBMS
Shopping Cart       | Riak
Recommendations     | Neo4j
Product Catalog     | MongoDB
Analytics           | Cassandra
User Activity Logs  | Cassandra

[Source: http://martinfowler.com/bliki/PolyglotPersistence.html]
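As a minimal sketch of how an application might encode this mapping, the snippet below keeps the scenarios from the table in a lookup structure. The names `PERSISTENCE_BY_SCENARIO` and `store_for` are hypothetical and only illustrate the idea; a real application would return a client or repository object rather than a store name.

```python
# Hypothetical routing table: scenario -> persistence solution,
# copied from the retail example in the table above.
PERSISTENCE_BY_SCENARIO = {
    "user_sessions": "Redis",
    "financial_data": "RDBMS",
    "shopping_cart": "Riak",
    "recommendations": "Neo4j",
    "product_catalog": "MongoDB",
    "analytics": "Cassandra",
    "user_activity_logs": "Cassandra",
}


def store_for(scenario: str) -> str:
    """Return the name of the store chosen for a scenario."""
    try:
        return PERSISTENCE_BY_SCENARIO[scenario]
    except KeyError:
        # Fail loudly instead of silently picking a default store.
        raise ValueError(f"no persistence solution chosen for {scenario!r}")
```

Keeping the choice in one place makes it explicit that each scenario was matched to a store deliberately.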

Coming out of relational mindset:

One of the biggest obstacles to adopting a NoSQL solution is getting people out of the relational mindset. Data modeling practice is deeply rooted in RDBMS and relational concepts.

It is initially difficult to conceptualize data outside the relational world, but once we understand the following concepts and look back at the data solutions we have built, we may find that many of them do not need normalized modeling.

  • Data is not normalized.
  • Data will be duplicated.
  • Tables are schemaless and do not follow a predefined pattern.
  • Data can be stored in different formats such as JSON, XML, audio and video.
  • The database may compromise on some of the ACID properties.
  • Data may compromise on attributes such as consistency.

    CAP theorem:

    The CAP theorem defines a set of basic attributes for any distributed system. Understanding its dimensions helps in understanding any NoSQL solution. The diagram below describes the attributes satisfied by different distributed database systems in a multi-server deployment.

    The important point is that no distributed system can completely satisfy all three dimensions of the CAP theorem: Consistency, Availability and Partition tolerance.

    A distributed system can fully satisfy at most two of the three dimensions, so people have to choose the distributed system that suits their application requirements.

    It is critically important to understand the application requirements and where a specific NoSQL solution falls among these dimensions.
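The trade-off can be illustrated with a toy model. The sketch below is not how any real database is implemented; `TwoNodeStore` and its `mode` values are hypothetical names. It shows what happens to a write when the link between two replicas is cut: a store that favours consistency rejects the write, while a store that favours availability accepts it and lets the replicas diverge.

```python
class TwoNodeStore:
    """Toy store with two replicas, "a" and "b".

    mode="CP" favours consistency, mode="AP" favours availability.
    """

    def __init__(self, mode):
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False  # True when the replicas cannot talk

    def write(self, node, key, value):
        other = "b" if node == "a" else "a"
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas disagree.
                raise RuntimeError("unavailable during partition")
            # AP: accept locally; the other replica is now stale.
            self.replicas[node][key] = value
            return
        # No partition: update both replicas together.
        self.replicas[node][key] = value
        self.replicas[other][key] = value

    def read(self, node, key):
        return self.replicas[node].get(key)


cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write("a", "stock", 10)
    store.partitioned = True  # simulate a network partition

try:
    cp.write("a", "stock", 9)
    cp_write = "accepted"
except RuntimeError:
    cp_write = "rejected"

ap.write("a", "stock", 9)
```

After the partition, the CP store still holds the same value on both replicas but has turned a write away, while the AP store stayed writable at the cost of the two replicas returning different values.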

    ACID Compliance:

    ACID stands for Atomicity, Consistency, Isolation and Durability. These are a set of properties that guarantee transactional behavior in RDBMS operations.

    RDBMS concepts focus on integrity, concurrency, consistency and data validity, but many data needs in software applications do not require these guarantees, or can handle them in upper layers.

    Compromising on any of these in the database architecture may bring the high performance and scalability that RDBMS solutions currently lack.

    A NoSQL database, for example, is not strictly ACID compliant: it may compromise on one of the ACID attributes to achieve extreme scalability and performance.

    It is critically important to understand the application requirements, the specific NoSQL solution used, and how the compromise is made.
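To make the transactional guarantee concrete, here is a small sketch of atomicity using SQLite from the Python standard library. The `accounts` table, the sample names and the `transfer` helper are assumptions made for illustration. The point is that a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)        # within alice's balance
failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice
```

In the second call the credit to `bob` runs first and is then rolled back when the debit violates the constraint. That is the behavior a store gives up if it relaxes atomicity.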

    BASE versus ACID:

    Instead of adhering to ACID, NoSQL solutions tend to be BASE compliant in order to achieve scalability and high performance. BASE stands for the following attributes that NoSQL solutions try to adopt:

    • Basically Available
    • Soft state
    • Eventual consistency
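The three BASE attributes can be seen together in a toy simulation. The class below is a hypothetical sketch, not the replication protocol of any particular product. A write is acknowledged after reaching one replica, other replicas may serve stale data for a while, and all replicas agree once the queued updates have been applied.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy store where writes reach one replica first and spread later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # writes not yet applied to every replica

    def write(self, key, value):
        # Acknowledge as soon as one replica has the value
        # (basically available).
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        # May return stale data until replication catches up (soft state).
        return self.replicas[replica_index].get(key)

    def replicate(self):
        # Apply queued writes in order to the remaining replicas
        # (eventual consistency).
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
stale = store.read(2, "cart:42")   # replica 2 has not seen the write yet
store.replicate()
fresh = store.read(2, "cart:42")   # now every replica agrees
```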

    NoSQL categorization based on data modeling:

    • Key-value stores, e.g. Redis, Riak, Amazon SimpleDB
    • Column family stores (Bigtable style), e.g. Cassandra, HBase
    • Document databases, e.g. CouchDB, Couchbase, MongoDB
    • Graph databases, e.g. Neo4j, Titan
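The four categories differ mainly in the shape of the data they store. The sketch below models each shape with plain Python structures. The user, order and product values are made-up sample data, and real products add their own query languages and storage formats on top.

```python
import json

# Key-value store: the value is an opaque blob, reachable only by its key.
key_value = {"user:42": json.dumps({"name": "Asha", "city": "Chennai"})}

# Column family store: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Asha", "city": "Chennai"},
        "orders": {"2015-01-03": "order:1001"},
    }
}

# Document database: a nested, self-describing record, queryable by field.
document = {
    "_id": 42,
    "name": "Asha",
    "orders": [{"id": 1001, "items": ["product:7"]}],
}

# Graph database: nodes plus typed relationships between them.
nodes = {"user:42": {"name": "Asha"}, "product:7": {"title": "Book"}}
edges = [("user:42", "BOUGHT", "product:7")]


def bought_by(user_id):
    """Follow BOUGHT edges from a user node to the products it points at."""
    return [dst for src, rel, dst in edges if src == user_id and rel == "BOUGHT"]
```

The same facts appear in every shape. What changes is which questions are cheap to answer: a lookup by key, a read of one column family, a query on a nested field, or a traversal of relationships.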

    Each of these NoSQL categories provides unique advantages for specific functionality, so selecting the right category is critical to the design of the application.

    At a high level, a specific NoSQL solution can be chosen based on the complexity of the data model and the querying associated with it.

    The diagram below provides a good comparison of the different NoSQL databases.

    [Source: https://highlyscalable.wordpress.com/2012/03/01/NoSQL-data-modeling-techniques/]

    NoSQL based on system architecture:

    Based on the system architecture, NoSQL can be categorized into the following.

    • P2P (Ring Topology)
    • Master-Slave

    Each architecture has some pros and cons and a decision has to be made based on the needs.

                 | P2P (Ring Topology)                                    | Master-Slave
    Role         | All nodes carry an equal role                          | Master-slave architecture with specific responsibilities on specific nodes
    Consistency  | Eventual                                               | Strong
    Write/Read   | Reads and writes happen through all the nodes          | Writes are mostly driven through restricted nodes
    Availability | High availability                                      | Availability is somewhat compromised when the master / write node fails
    Data         | Data is partitioned across all nodes with replication  | Data is partitioned across multiple slave nodes with replication
    Examples     | Cassandra, Couchbase                                   | HBase, MongoDB

    Data read / writes:

    The need for NoSQL solutions arises when you operate on huge volumes of data with high performance requirements for reads and writes.

    Below are the typical use cases for NoSQL databases:

    • Scalable databases
    • High availability and fault tolerance
    • Ever-growing sets of data
    • Bulk read / write operations

    Some NoSQL solutions are good for write-intensive workloads, some for read-intensive workloads, and some for mixed workloads. Specific analysis is needed to choose a NoSQL solution based on the needs.

    Other important concepts that apply to any NoSQL solution:

    Sharding:

    Sharding is one of the important concepts in NoSQL solutions: the data is partitioned horizontally across different nodes in the cluster. The data is split by some logic, for example a hash code of the key, and spread across the nodes.
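A minimal sketch of hash-based partitioning follows. The function names `shard_for` and `partition` are hypothetical, and the simple modulo placement is an assumption chosen for brevity. With this scheme, changing the node count moves most keys, which is why production systems commonly use consistent hashing or a similar approach instead.

```python
import hashlib


def shard_for(key: str, node_count: int) -> int:
    """Map a key to a node index using a stable hash of the key."""
    # hashlib is used because Python's built-in hash() for strings
    # is randomized per process and would not give stable placement.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partition(keys, node_count):
    """Group keys by the node index they are assigned to."""
    shards = {index: [] for index in range(node_count)}
    for key in keys:
        shards[shard_for(key, node_count)].append(key)
    return shards


keys = [f"user:{i}" for i in range(100)]
shards = partition(keys, 4)
```

Every key lands on exactly one of the four nodes, and the same key always maps to the same node for a given node count.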

    Replication:

    The data is not only partitioned across nodes but also replicated across different cluster nodes. The replication factor is a configuration setting in the solution. Replication provides high availability and automatic failover when a specific node goes down.
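The sketch below illustrates one common placement idea under stated assumptions: a key hashes to a primary node, and copies go to the next nodes around a ring until the replication factor is met. `replicas_for` and `readable_replicas` are hypothetical helpers, and real systems use more elaborate placement strategies.

```python
import hashlib


def replicas_for(key, nodes, replication_factor):
    """Return the nodes holding `key`: a primary plus the next ring nodes."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and len(nodes)")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Walk the ring clockwise from the primary, wrapping around at the end.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def readable_replicas(key, nodes, replication_factor, down):
    """Replicas that can still serve `key` when the nodes in `down` failed."""
    return [
        node
        for node in replicas_for(key, nodes, replication_factor)
        if node not in down
    ]


cluster = ["node1", "node2", "node3", "node4"]
owners = replicas_for("user:42", cluster, 3)
survivors = readable_replicas("user:42", cluster, 3, down={owners[0]})
```

With a replication factor of 3, losing the primary still leaves two nodes able to serve the key, which is what makes failover possible.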

    Reference:

    http://martinfowler.com/

    http://highscalability.com

    http://nosqlguide.com/
