Friday, January 23, 2015

Two Phase Commit

Bottlenecks in database layer

Database has been seen as a most common place of bottleneck for performance across different tiers of the application. Few possible reasons restricting RDBMS performance

  • RDBMS not able to scale horizontally
  • Locking at row level / data page level / table level during database transactions
  • NoSQL on the rescue

    I have mentioned about NoSQL data stores in my previous blog which achieves horizontally scalability distribution . In this blog I would like cover how transactional behaviour is achieved with high performance in NoSQL

    Transactions in RDBMS

    Let us try to understand how transactions operates in RDBMS,transactions with ACID (Atomicity, Consistency, Isolation and Durability) compliance executes all the actions involved in the transaction in a single step, if all the actions succeeds it commits the changes otherwise all the changes are revoked. To achieve this locking happens across the tables and hence performance becomes bottleneck

    Let us take a simple transaction in order placement and analyse , An simple order management transaction involves two tables involving order and billing.

  • Confirm the product for the order by decrementing a count in the product catalogue
  • Confirm billing for payment
  • If payment succeeds, transaction as a whole has to be committed . If payment fails for some reason , change in product catalogue has to be revoked to original state so that it is available for others to consume. RDBMS achieves this whole process as a single step by locking these tables until transaction is completed and hence gets the ability to commit or revoke at the end of transaction, but this gives a overhead in performance as these tables gets locked and any read/ write on those are kept on hold unless stale read is enabled

    Restrictions with NoSQL

    Let us understand restrictions in NoSQL towards achieving this type of transactions

  • NoSQL provides locking at row level and not across rows, tables etc
  • With adoption of polyghot persistence and distributed transactions we may need to perform a transaction across different datastores as well
  • Two Phase Commit

    Two Phase Commit is an approach followed in NoSQL to achieve transaction like behaviour. As the name mentions transactions happens in two phases with the ability to commit or revoke the changes made in phase 1 during phase 2. The approach introduces a additional component transaction manager which helps to commit or roll back the changes made in each phase of the transaction

    Advantages with Two Phase Commit approach

  • Provides high performance with transactions
  • Ability to retry for failure portions of the transactions ( interesting )
  • Provides distributed transaction like capabilities across data stores
  • My personal experience with Two Phase Commit

    Recently I personally came to see a Two Phase Commit scenario handled in amazon for my order placement which became inspiration for this post.

    I placed an order ( a laptop desk ) in Amazon where my order placement was received and I went for sleep.Looks the payment got failed for some reason. Next day morning I got a notification to retry my payment, in this case Amazon instead of revoking the order placed ,Amazon holded the order for additional time say for a day or two and provided option to retry payment failure.

    My order confirmation

    Payment retry for my order

    I believe Amazon has implemented some form of two phase commit to achieve this, I was personally happy with the way amazon handled my payment failure as the order was not revoked and i was given retry option later to complete the order with my laptop desk was still reserved for me.

    This also opens the door for other mode of payments like cash on delivery etc.

    Few links on Two Phase Commit

  • Star Bucks Approach for Performance
  • MongoDB - how to perform 2PC
  • Saturday, January 3, 2015

    NoSQL - An Introduction


    Not Only SQL often mentioned as NoSQL provides a mechanism to store and retrieve data not through tabular format as in relational databases.

    There are different NoSQL solutions that are matured and being adopted widely Ex : Redis,Riak,HBase,Cassandra,Couchbase,MongoDB.

    It is critical to understand the concepts of NoSQL why and how NoSQL has been used for a specific application architecture because every NoSQL solution is unique in its own way and different from general RDBMS solutions.

    Need for NoSQL:

    With the explosion of web and social interactions the volume and complexity of data has grown tremendously huge, it is the need of the hour for each applications to scale seamlessly without any compromise in performance.

    If we look at RDBMS performance starts degrading at some point of data volume and complexity and applications has to think adopting various NoSQL solutions to match the growth of huge volume and complexity.

    Polyglot persistence:

    NoSQL Solutions has become more matured and enterprise data architects has started implementing NoSQL in their solutions giving a strong message that RDBMS is not the only solution to data needs.

    Problem in data persistence are unique and each problem needs specific solution to handle the scenario better. The concept of Polyglot persistence evolved to insist that application needs to use specific persistence solution to handle specific scenarios.

    The table below helps to describe some scenarios in a retail web application and how different persistence solution can help to satisfy those needs.

    Scenario Persistence solution
    User Sessions Re-dis
    Financial data RDBMS
    Shopping Cart Riak
    Recommendations Neo4j
    Product Catalog MongoDB
    Analytics Cassandra
    User Activity Logs Cassandra


    Coming out of relational mindset:

    One of the biggest problem with the adoption of NoSQL solution is to keep the people out of relational mindset. The minds of data modeling is deeply rooted with RDBMS and relational concepts.

    It will be difficult initially to conceptualize data out of relational world, but if we understand these concepts and look back at our data solutions made, many of them may not need the normalized modeling.

  • Data is not normalized.
  • Data will be duplicated.
  • Tables will be schema less and doesn’t follow a predefined pattern
  • Data can be stored in different formats like JSON, XML, audio, video etc.
  • Database may have some compromise on some attributes on ACID properties
  • Data may have some compromise on attributes like consistency.
  • CAP theorem:

    CAP theorem defines set of basic attributes for any distributed system. Understanding the dimensions of CAP theorem helps to understand any NoSQL solution better. The below diagram describes the attributes satisfied by different distributed database system on multiple server deployment environment.

    The important point to note here is that none of the distributed system can completely satisfy all the three dimensions of CAP theorem Consistency, Availability and Partition Tolerance.

    Any distributed system can a maximum 2 dimensions of CAP completely, depending on the application requirement people have to choose for the specific distributed system that suits their needs.

    It is critically important to understand the application requirements and understand where the specific NoSQL solution falls.

    ACID Compliance:

    ACID stands for Atomicity, Consistency, Isolation, Durability, these are set of properties that guarantee transactional behavior in RDBMS operations.

    RDBMS concepts that focuses more on integrity, concurrency, consistency and data validity, but many of the data needs in software applications may not be interested in these aggregation, integrity and validity or can handled in upper layers.

    Compromising any of these in database architecture may bring high performance and scalability that RDBMS is currently lagging.

    NoSQL database for example is not strictly ACID compliance where it can compromise on one of the attributes of ACID to achieve extreme scalability and performance.

    It is critically important to understand the application requirements and understand the specific NoSQL used and how the compromise is made.

    BASE versus ACID:

    NoSQL instead of adhering ACID compliance it tends to be BASE compliance in order to achieve scalability and high performance. The following are defined to be BASE attributes that NoSQL solution are trying to adopt

    • Basic Availability
    • Soft-state
    • Eventual consistency

    NoSQL Categorization based on data modeling

    • Key Value Stores Ex : Redis, Riak, Amazon Simple DB
    • Column Family Stores ( Big Tables ) Ex : Cassandra , HBase
    • Document databases Ex : CouchDB , Couchbase , MongoDB
    • Graph databases Ex : Neo4j, Titan

    Each of this NoSQL provide unique advantage on specific functionalities, selection of a specific NoSQL category is critical for the design of the application needs.

    At high level the specific NoSQL solution can be chosen based on the complexity and querying associated with the data model.

    The below diagram provides a good comparison on the different NoSQL databases.


    NoSQL based on system architecture:

    Based on the system architecture, NoSQL can be categorized into the following.

    • P2P ( Ring Topology )
    • Master Slave

    Each architecture has some pros and cons and a decision has to be made based on the needs.

    P2P ( Ring Topology ) Master Slave
    Role All Nodes carries equal role Master – Slave architecture with specific responsibilities on specific nodes
    Consistency Eventual Strong
    Write/Read Read and Write happens through all the nodes Mostly write is driven through restricted nodes
    Availability High Availability Availability is little compensated when master / Write node fails
    Data Data is partitioned across all nodes with replication Data is partitioned into multiple slave nodes with replication
    Examples Cassandra, Couch base HBase, MongoDB

    Data read / writes:

    The need of NoSQL type of solutions arrives when you tend to operate with huge volume of data and high requirements for performance towards read and writes.

    Below are the typical use cases where NoSQL databases will be used

    • Scalable databases
    • High availability and fault tolerance
    • Ever growing set of data
    • Bulk read / write operations

    Some NoSQL will be good for write intensive workloads and some are good for read intensive workloads and some are good for mixed workloads, specific analysis has to be done to decide on the NoSQL solution based on the needs.

    Other important concepts that I would like to highlight specific to any NoSQL solutions:


    Shrading is one of the important concept in NoSQL solution by which the data is partitioned horizontally across different nodes in the cluster. This means the data is split based on some logic say some a hash code and spread across different nodes.


    The data is not only partitioned by different nodes but also replicated across different cluster nodes. The replication factor will be a configuration in the solution. Replication ability gives high availability and automatic fail over when a specific node goes down.