zephyr
<p>I will use this space to share my thoughts on cloud computing and distributed computing technologies. If you are interested, please do subscribe.</p>
<b>Feature engineering tips for improving predictions (2016-12-19)</b>
<p>Visit my blog post on WordPress: <a href="https://asimovweb.wordpress.com/2016/12/19/feature-engineering-tips-for-improving-predictions/">https://asimovweb.wordpress.com/2016/12/19/feature-engineering-tips-for-improving-predictions/</a></p>
<b>Choose your best platform for machine learning (2016-10-06)</b>
<p>My new blog entry on choosing the best platform for a machine learning based solution:
<a href="https://asimovweb.wordpress.com/2016/10/06/choose-your-best-platform-for-machine-learning-solution/">https://asimovweb.wordpress.com/2016/10/06/choose-your-best-platform-for-machine-learning-solution/</a>Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-90774022618644624792016-10-06T08:13:00.002-07:002016-10-06T08:15:28.648-07:00Moving to wordpress blog I have moved my blog to wordpress , Please continue to follow me at <a href="https://asimovweb.wordpress.com/">https://asimovweb.wordpress.com/</a>Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-27210980390984182472016-05-01T13:28:00.001-07:002016-05-01T13:46:42.255-07:00Fast forward transformation process in data science with Apache Spark <B>Data Curation :</B>
<p>Curation is a critical process in data science that prepares data for feature extraction and for machine learning algorithms. Curation generally involves extracting, organising and integrating data from different sources, and it can be a difficult and time-consuming process depending on the complexity and volume of the data involved.</p>
<p>Most of the time data won't be readily available for the feature extraction process; it may be hidden in unstructured and complex data sources and has to undergo multiple transformations before feature extraction.</p>
<p>When the volume of data is huge, this becomes a very time-consuming step and can be a bottleneck for the whole machine learning pipeline.</p>
<B>General Tools used in Data Science : </B>
<li><a href="https://www.r-project.org/about.html">R Language</a> - Widely adopted in data science with lot of supporting libraries</li>
<li><a href="http://www.mathworks.com/">Mat lab</a> - Commercial tool with lot of builtin libraries for data science </li>
<li><a href="http://spark.apache.org/mllib/">Apache Spark</a> - New, powerful and gaining traction, Spark on Hadoop provides distributed and Resilient architecture help to fasten the curation process by multiple times.</li>
<B>Recent Study</B>
<p>One of my projects involved curating and extracting features from a huge volume of natural language conversation text. We started with the R programming language for the transformation process; R is simple and has a lot of functionality in the statistics and data science space, but it has limitations in computation and memory, and in turn in efficiency and speed. We migrated the transformation process to Apache Spark and observed a tremendous improvement in performance: for the same large volume of data, the transformation time came down from more than a day to roughly an hour.</p>
<p><B>Here are some of the benefits of Apache Spark over R that I would like to highlight.</B></p>
<li>Effective Utilization of resources:</li>
<p>By default R runs on a single core and is limited by the capabilities of that core and by memory. Even on a multi-core system R uses only one core; for memory, a 32-bit R process is limited to about 3 GB of user virtual address space, while a 64-bit R process is limited by the amount of RAM. R does have parallel packages that can help spread the processing across multiple cores.</p>
<p>Spark runs in a distributed form, with the processing done by executors, each running in its own process and utilizing its own CPU and memory. Spark brings the concept of the <a href="http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds">RDD (Resilient Distributed Dataset)</a> to achieve a distributed, resilient and scalable processing solution.</p>
<li>Optimized transformation:</li>
<p>Spark has the concept of <a href="http://spark.apache.org/docs/latest/programming-guide.html#transformations">Transformations and Actions</a>: transformations are evaluated lazily and no job is executed until an Action is called. This brings optimization when multiple transformations are chained before the Action that finally transfers results back to the driver program (a short PySpark sketch after this list illustrates the idea).</p>
<li>Integration with the Hadoop ecosystem</li>
<p>Spark integrates well into the <a href="http://spark.apache.org/docs/latest/running-on-yarn.html">Hadoop ecosystem with the YARN architecture</a> and can easily bind to HDFS and to multiple NoSQL databases such as HBase, Cassandra, etc.</p>
<li>Support for multiple languages:</li>
<p>Spark APIs support multiple programming languages such as Scala, Java and Python.</p>
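<p>To make the RDD and lazy-evaluation points above concrete, here is a minimal PySpark sketch (my own illustration, not code from the project described earlier); the HDFS path and the tokenization logic are hypothetical. The transformations only build up a plan; the single action at the end triggers one optimized distributed job on the executors.</p>
<pre><code>
from pyspark import SparkContext

sc = SparkContext(appName="curation-sketch")

# Load raw conversation text as a distributed dataset (RDD), one record per line.
lines = sc.textFile("hdfs:///data/conversations.txt")  # hypothetical path

# Transformations are lazy: nothing runs on the cluster yet.
tokens = lines.flatMap(lambda line: line.lower().split())
words = tokens.filter(lambda w: w.isalpha())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# The action triggers execution and ships only the small result to the driver.
print(counts.takeOrdered(20, key=lambda kv: -kv[1]))

sc.stop()
</code></pre>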
<b>Google Auto Awesome Video - Is it a machine learning solution? (2016-01-03)</b>
<p>Wish you all a very happy and wonderful new year 2016!</p>
<p>I am happy to start the year with a post covering some aspects of machine learning; this post was actually inspired by the new year's eve celebration.</p>
<p>During the new year's eve celebration with friends I captured some photo moments using the Google Photos app. The next morning I got a notification on my mobile: would you like to review and save a video made out of the photos from the new year's eve event, with some nice background music added? In Google's terms these are called <a href="http://www.androidcentral.com/creating-auto-awesome-videos-new-google-photos-app">Auto Awesome videos</a> in the Google Photos app.</p>
<p>I was happy to see the video that had been made automatically and was ready to share. There is also a manual mode where we can customize the photos for the video, but my interest is in the automatic creation, and I started thinking about how this design could have been done.</p>
<p>At first cut I sensed this could potentially be a machine learning implementation, and with my limited data science knowledge I thought I would offer some guesswork on how this could have been designed to run at large scale, for millions of tenants, at the server end.</p>
<p>Let us understand the requirement in detail. Given a collection of images we have to perform the following:</p>
<li>Categorize the images into groups and pick the group corresponding to a specific event, say the new year celebration in this case.</li>
<li>To improve accuracy, check for and eliminate any irrelevant images that went into the group by error.</li>
<li>Judge the mood of the event and add appropriate background music.</li>
<p>Now let us analyse the type of machine learning solutions that could have potentially been used for this design</p>
<li>The first part of the problem is categorizing the images into groups based on some parameters; a <a href="https://en.wikipedia.org/wiki/K-means_clustering">clustering algorithm</a> could be a good fit for this. Given a dataset, clustering partitions it based on features of the data. In our scenario the grouping could be based on the time a photo was taken, but I have seen cases where grouping is done based on the image background and the persons involved (a minimal clustering sketch follows this list).</li>
<li>Next is to eliminate outliers in the grouping: some photos might have accidentally gone into the group. Algorithms like <a href="https://en.wikipedia.org/wiki/Anomaly_detection">anomaly detection</a> can be used to eliminate those outlier images from the collection.</li>
<li>The final step is to understand the mood of the images and add relevant background music to the video; a <a href="http://www.technologyreview.com/view/533061/neural-network-rates-images-for-happiness-levels/">sentiment analysis</a> algorithm on pictures could potentially help to understand the mood of the images.</li>
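<p>As a toy illustration of the clustering idea above (purely my own sketch, certainly not Google's implementation), the snippet below groups photos into candidate events by capture time using k-means. The timestamps and cluster count are made up, and scikit-learn is assumed to be available.</p>
<pre><code>
# Group photos into candidate "events" by capture time (hypothetical data).
from sklearn.cluster import KMeans
import numpy as np

# Capture times in hours since the start of the week (made-up values).
capture_hours = np.array([[1.0], [1.5], [2.0], [50.2], [50.8], [51.1], [120.4]])

# Partition the photos into 3 candidate event clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(capture_hours)

for photo_idx, event_id in enumerate(kmeans.labels_):
    print(f"photo {photo_idx} -> event cluster {event_id}")
</code></pre>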
<p>Disclaimer: this is purely my own guesswork about the design, and Google might have done it in a different way :)</p>
<b>NOSQL with RDBMS fallback (2015-12-07)</b>
<p><a href="http://nosql-database.org/">NOSQL </a>adoption becoming prominent across different critical applications to reap the benefits of performance, fault tolerance, high availability for bigger volume database needs. While migrating to NOSQL one of the risk that architects feel is what if the application gets into some unseen issues and take more time to fix , as NOSQL adoption is not battle tested across different domain and sectors and how to design some fallback strategy. </p>
<p>Few factors that people may think while migrating to NOSQL</p>
<li>What will happen if we get into unexpected errors in production and it takes more time to fix them?</li>
<li>What if the product vendors themselves haven't faced such scenarios?</li>
<li>What if I have reporting or other dependent systems that are well integrated with the RDBMS and difficult to migrate off the RDBMS in the current phase?</li>
<p>Architects would like to design a fallback option with the RDBMS, where the application can switch to the RDBMS on unrecoverable NoSQL issues. This raises a few questions on how to design it.</p>
<li>How do I keep data in sync between NoSQL and the RDBMS at high data volume without losing the order of updates?</li>
<li>How do I sync up without adding much overhead to the application? Synchronous updates to both NoSQL and the RDBMS would be too much overhead.</li>
<li>How do I reliably move the data between the two systems without any loss?</li>
<li>What if the RDBMS goes down? How can I design the sync-up to be reliable even on failures?</li>
<p>I can think of the design depicted below to address these questions.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiox8J8tNcMTPiAgmh1isNfDHgo2cxPEIjprtQYU08uCZaLomFkZ9ccPtUTimrr-7EYx5wXRcxbOMYswBbHft6IM1yvau2yWJSOFOMZo5RzMraoI_XwyIJL-0C5gOiYfE6F3m0aBGXDnyx3/s1600/NOSQL+with+RDBMS+fallback.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiox8J8tNcMTPiAgmh1isNfDHgo2cxPEIjprtQYU08uCZaLomFkZ9ccPtUTimrr-7EYx5wXRcxbOMYswBbHft6IM1yvau2yWJSOFOMZo5RzMraoI_XwyIJL-0C5gOiYfE6F3m0aBGXDnyx3/s640/NOSQL+with+RDBMS+fallback.png" /></a></div>
<p>The main components involved in the design are <a href="http://kafka.apache.org/">Apache Kafka</a>, which receives the updates, and <a href="http://storm.apache.org/index.html">Apache Storm</a>, which processes the data and applies it to the RDBMS. Both of these systems are designed for big data needs in a reliable and distributed form.</p>
<p><a href="http://kafka.apache.org/">Apache Kafka</a> is a high performance message queuing system. The application posts the messages (insert / update / delete) to a Kafka topic. To improve performance through parallel processing, the topic can be partitioned by table / region / logical data design as per the NoSQL model.</p>
<p><a href="http://storm.apache.org/index.html">Apache Storm</a> is a real time processing engine that consumes messages through a Spout component, processes them through Bolts and updates the data in the RDBMS. Storm topologies support <a href="http://storm.apache.org/documentation/Guaranteeing-message-processing.html">guaranteed message processing</a> and <a href="http://storm.apache.org/documentation/Transactional-topologies.html">transactional modes of commitment</a>, which makes it suitable for handling partial failures during commits.</p>
<p><b>Benefits: </b></p>
<li>Information can be propagated to the RDBMS asynchronously and reliably.</li>
<li>Apache Kafka provides a highly available, reliable, partitioned queuing system that fits well for huge data volumes.</li>
<li>Storm does real time processing of the Kafka messages on the partitioned topic and provides a reliable way to update the RDBMS.</li>
<li>On RDBMS failures Kafka will persist the messages, and Storm can continue to sync up the messages when the RDBMS comes back.</li>
<b>Next Generation Enterprise Application Architecture (2015-12-02)</b>
<p>New generation applications are architected not only with the goal of being functionally correct and stable but also with a focus on several aspects that are becoming critical:</p>
<p><b>Scalability</b> – Elastic scalability for all the layers of the application including data tier</p>
<p><b>Fault Tolerance</b> - Ability to handle failure smartly and avoid cascading failures from and to dependent systems</p>
<p><b>High Availability</b> – Ability to keep the application highly available at all layers, including the database, even on data center failures</p>
<p><b>Efficient utilization of Infrastructure</b> - Ability to scale up and down on demand </p>
<p><b>Faster access</b> to underlying data on high load and data volumes</p>
<p>Ability to handle <b>different data formats</b> efficiently</p>
<p>A few drivers tied to this evolution are the need for (and benefits of) cloud adoption, whether private or public cloud, and the need to handle huge data volumes with fast response times in the data tiers.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq8eRmeqWC2LTr-mgn8bd7Shz9S5AqWp7iMTWtWiXn7n70xaSpVz9-DQ-1QMo8MHqMMtHX7ilSwFz_8KPE9dAX6eX2Tyjakld2d8XSnCFAZo1V2n2H3wf6FvOo_18dGGRLJz4jTP1E7eSO/s1600/nextgenarch.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq8eRmeqWC2LTr-mgn8bd7Shz9S5AqWp7iMTWtWiXn7n70xaSpVz9-DQ-1QMo8MHqMMtHX7ilSwFz_8KPE9dAX6eX2Tyjakld2d8XSnCFAZo1V2n2H3wf6FvOo_18dGGRLJz4jTP1E7eSO/s640/nextgenarch.png" /></a></div>
<table border="1" style="width:100%">
<col width="100">
<col width="175">
<col width="175">
<tr>
<th></th>
<th>Benefits</th>
<th>Solutions</th>
</tr>
<tr>
<td>Physical -> IaaS -> PaaS</td>
<td><p>Elastic Scalability </p>
<p>High Availability</p>
<p>Efficient Infrastructure utilization</p>
<p>Zero downtime deployment</p>
</td>
<td><p><a href="https://www.vmware.com/">VMWare </a>, <a href="https://www.openstack.org/">Open Stack</a> – Private Cloud IaaS</p>
<p><a href="https://aws.amazon.com/">AWS</a>, <a href="https://azure.microsoft.com/">Azure </a>– Public Cloud IaaS , PaaS</p>
<p><a href="https://www.cloudfoundry.org/">Cloud Foundry</a> – PaaS on private and public cloud</p>
</td>
</tr>
<tr>
<td>Circuit Breaker</td>
<td><p>Fault Tolerance </p>
<p>Better failure handling</p>
<p>Avoid avalanche failures</p>
</td>
<td><p><a href="https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard">Netflix Hystrix</a></p>
<p><a href="https://github.com/App-vNext/Polly">Polly</a></p>
</td>
</tr>
<tr>
<td>Service Registry</td>
<td><p>Registry for dynamic instance scaling </p>
</td>
<td><p><a href="https://github.com/Netflix/eureka">Netflix Eureka</a></p>
<p><a href="https://zookeeper.apache.org/">Apache Zookeeper</a></p>
</td>
</tr>
<tr>
<td>Intelligent Load balancing</td>
<td><p>Intelligent Load Balancing utilizing the elastic scaling and self-discovery </p>
</td>
<td><p><a href="https://github.com/Netflix/ribbon">Netflix Ribbon </a></p>
<p><a href="https://f5.com/glossary/load-balancer/">F5</a></p>
<p><a href="https://www.nginx.com/">Nginix</a></p>
</td>
</tr>
<tr>
<td>Search</td>
<td><p>Quick search needs from huge data sets, full text search, pattern matching </p>
</td>
<td><p><a href="https://www.elastic.co/">Elastic Search</a> </p>
<p><a href="http://lucene.apache.org/solr/">Solr</a></p>
</td>
</tr>
<tr>
<td>Data Grid</td>
<td><p>Faster data reads and writes, reduced read / write overhead on the database, high availability of data </p>
</td>
<td><p><a href="http://www.oracle.com/technetwork/middleware/coherence/overview/index.html">Coherance </a></p>
<p><a href="http://pivotal.io/big-data/pivotal-gemfire">Gemfire</a></p>
<p><a href="http://www.couchbase.com/">Membase</a></p>
</td>
</tr>
<tr>
<td>Queue</td>
<td><p>Reliable data transfer across different data layers </p>
</td>
<td><p><a href="http://kafka.apache.org/">Kafka </a></p>
<p><a href="https://www.rabbitmq.com/">RabbitMQ</a></p>
<p><a href="https://docs.oracle.com/javaee/6/tutorial/doc/bncdq.html">JMS</a></p>
</td>
</tr>
<tr>
<td>NoSQL</td>
<td>
<p>Big data – database needs</p>
<p>Heavy Read / Write on high data volumes</p>
<p>Faster response needs on the data</p>
<p>High Availability on data</p>
<p>Fault Tolerance on data</p>
<p>Distributed database</p>
<p>Scalable database</p>
</td>
<td><p><a href="http://www.couchbase.com">Couchbase </a></p>
<p><a href="https://www.mongodb.com">MongoDB</a></p>
<p><a href="https://hbase.apache.org/">HBase</a></p>
<p><a href="https://cassandra.apache.org/">Cassandra</a></p>
<p>Graph DB ( <a href="https://github.com/thinkaurelius/titan">Titan</a>, <a href="http://orientdb.com/docs/last/index.html">OrientDB </a>)</p>
</td>
</tr>
<tr>
<td>Hadoop</td>
<td>
<p>Distributed file processing and storage ecosystem</p>
<p>High speed batch (MapReduce) / real time ( <a href="http://storm.apache.org/">Storm</a>, <a href="http://spark.apache.org/">Spark </a>) processing</p>
</td>
<td><p>Different Hadoop distributions such as <a href="http://hortonworks.com/">Hortonworks</a>, <a href="http://www.cloudera.com/">Cloudera</a>, <a href="https://www.mapr.com/">MapR</a></p>
</td>
</tr>
</table>
<b>Kafka messaging system (2015-08-30)</b>
<b>Apache Kafka:</b>
<p><a href="http://kafka.apache.org/">Kafka </a>is an open source message queuing solution under Apache project, Kafka is new when compared to existing queue solutions like RabbitMQ, ActiveMQ, AWS SQS on product maturity but is quickly gaining momentum due to its features. In this post we will analyze some features of Kafka to see why it is gaining attention in the market.</p>
<p>The demand for processing huge data sets is growing everyday across enterprise systems and data is being processed in batch or real time and the queuing systems play an important role in connecting the data from source system / producer to destination / consumers. With huge dataset in transit enterprise are looking for message solution that can provide high throughput per second , scale horizontally, provides high availability and integrate well with other solutions.</p>
<b>Scalability:</b>
<p>This is one feature where Kafka gets an edge over other solutions: the ability to scale horizontally, which Kafka achieves by means of partitioning. We can set the number of partitions while defining a topic (queue), and these partitions get distributed across the broker nodes in the cluster. When we want to scale the system we can add more broker nodes, and the partitions get rebalanced across the added brokers.</p>
<b>Fault Tolerance and High Availability:</b>
<p>Kafka achieves high availability by means of replication: the partitions are replicated across different broker nodes, and Kafka uses ZooKeeper for its coordination. When a broker node goes down, ZooKeeper coordinates so that the data continues to be served from a replica partition on another broker, and hence high availability of the data is achieved.</p>
<b>Unit of Order:</b>
<p>Kafka guarantees ordered delivery within each partition; messages posted across different partitions are not guaranteed to be in order.</p>
<b>Reliability & Guaranteed delivery:</b>
<p>Kafka provides reliable message delivery and has options for synchronous and asynchronous acknowledgements of delivery.</p>
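<p>A small sketch of the synchronous and asynchronous acknowledgement options, again assuming the kafka-python client; the broker address, topic and payloads are placeholders. Messages sharing a key land on the same partition, which ties back to the per-partition ordering guarantee above.</p>
<pre><code>
from kafka import KafkaProducer

# acks="all" waits for the in-sync replicas, trading latency for durability.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

# Synchronous delivery: block on the future until the broker confirms the write.
md = producer.send("events", key=b"order-42", value=b"created").get(timeout=10)
print(md.topic, md.partition, md.offset)

# Asynchronous delivery: register callbacks and continue without blocking.
future = producer.send("events", key=b"order-43", value=b"created")
future.add_callback(lambda m: print("acked at offset", m.offset))
future.add_errback(lambda exc: print("delivery failed:", exc))
producer.flush()
</code></pre>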
<b>Integration with Big Data solutions:</b>
<p>Kafka comes as part of the Hadoop distributions and integrates with Hadoop MapReduce for bulk consumption in parallel; for real time stream processing needs, Kafka integrates well with systems like <a href="https://github.com/apache/storm/tree/master/external/storm-kafka">Apache Storm</a> and <a href="https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html">Spark</a>.</p>
Reference: <a href="http://kafka.apache.org/">Kafka</a>
<b>Build your own monitoring solution for Couchbase (2015-08-15)</b>
<p>Recently I was trying to build a monitoring solution for Couchbase. I followed a simple approach that worked out well and thought I would share it in this post.</p>
<ol><b>Requirements for the solution</b></ol>
<li>A simple solution that can collect metrics from the HTTP stats endpoint of Couchbase</li>
<li>A script based solution that can be customized by the operations team</li>
<li>Visualization dashboards</li>
<li>No additional software installation on the Couchbase servers</li>
<ol><b>Solution Needs</b></ol>
<li>A lightweight app server that collects metrics at a regular interval</li>
<li>A persistence layer that stores the data </li>
<li>A visualization tool that can bind well with the persistent data</li>
<p><b>Solution Architecture</b></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYczYzW-gcZqkpHoAcUxjExKFqLBr-k1P_igKuDQGz5V3JXDQ59j8MrCa3b-vmvm0MoGTieA9DoIo4D_AHQigkIvtUXRBy4Y1Zdnq1MqXQFBiA_Lx6iAtVwTUJF7phk6HKc4l-AaZEcke/s1600/nodejsbasedmonitoring.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYczYzW-gcZqkpHoAcUxjExKFqLBr-k1P_igKuDQGz5V3JXDQ59j8MrCa3b-vmvm0MoGTieA9DoIo4D_AHQigkIvtUXRBy4Y1Zdnq1MqXQFBiA_Lx6iAtVwTUJF7phk6HKc4l-AaZEcke/s640/nodejsbasedmonitoring.png" /></a></div>
<p><b>Solution Highlights</b></p>
<li>The solution collects the JSON stats data, which contains thousands of metrics as a whole, and stores it in Elasticsearch (see the sketch after this list)</li>
<li>Node.js is a lightweight server runtime based on JavaScript</li>
<li>The Couchbase stats endpoint exposes JSON based metrics, and Elasticsearch works well for storing JSON data</li>
<li>Kibana provides nice visualization for Elasticsearch through different charts</li>
<li>Node.js provides client libraries for Elasticsearch</li>
<li>Hosting Node.js, Elasticsearch and Kibana is very simple; you can easily set up all of these components in a few minutes through Docker</li>
<li>Elasticsearch is highly scalable</li>
<li>The approach can be applied to any monitoring scenario where metrics are exposed in JSON format</li>
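<p>The actual solution is written in Node.js (see the GitHub link below); the sketch here is an equivalent in Python purely for illustration, assuming the requests and elasticsearch client libraries, with a hypothetical bucket stats URL, credentials and index name.</p>
<pre><code>
import time
import requests
from elasticsearch import Elasticsearch

STATS_URL = "http://couchbase-host:8091/pools/default/buckets/default/stats"  # hypothetical
es = Elasticsearch(["http://localhost:9200"])

while True:
    # Pull the JSON stats document exposed by the Couchbase HTTP endpoint.
    stats = requests.get(STATS_URL, auth=("admin", "password"), timeout=10).json()

    # Store the raw JSON with a timestamp; Kibana can chart it from this index.
    es.index(index="couchbase-metrics", body={"collected_at": time.time(), "stats": stats})

    time.sleep(60)  # collect on a regular interval
</code></pre>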
Please find the reference implementation of the solution on
<a href="https://github.com/Aravindakum/CouchbaseMontoring-NodeJS-Kibana">GitHub</a>.
<b>Two Phase Commit (2015-01-23)</b>
<p><b>Bottlenecks in the database layer</b></p>
<p>The database has often been seen as the most common place for performance bottlenecks across the different tiers of an application. A few possible reasons restricting RDBMS performance:</p>
<li>RDBMS not able to scale horizontally</li>
<li>Locking at row level / data page level / table level during database transactions </li>
<p><b>NoSQL on the rescue</b></p>
<p>I mentioned NoSQL data stores in my <a href="http://breezeoncloud.blogspot.in/2015/01/nosql-introduction.html">previous blog</a>, which covers how they achieve horizontal scalability through distribution. In this blog I would like to cover how transactional behaviour is achieved with high performance in NoSQL.</p>
<p><b>Transactions in RDBMS</b></p>
<p>Let us try to understand how transactions operate in an RDBMS. Transactions with ACID (Atomicity, Consistency, Isolation and Durability) compliance execute all the actions involved in the transaction as a single step: if all the actions succeed the changes are committed, otherwise all the changes are revoked. To achieve this, locking happens across the tables involved, and hence performance becomes a bottleneck.</p>
<p>Let us take a simple order placement transaction and analyse it. A simple order management transaction involves two tables, order and billing:</p>
<li>Confirm the product for the order by decrementing a count in the product catalogue </li>
<li>Confirm billing for payment</li>
<p>If the payment succeeds, the transaction as a whole has to be committed. If the payment fails for some reason, the change in the product catalogue has to be reverted to its original state so that the item is available for others to consume. An RDBMS achieves this whole process as a single step by locking these tables until the transaction is completed, which gives it the ability to commit or revoke at the end of the transaction; but this adds a performance overhead, as the tables stay locked and any read / write on them is kept on hold unless stale reads are enabled.</p>
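<p>A minimal illustration of this single-step behaviour (my own sketch, using Python's built-in sqlite3 as a stand-in RDBMS; the tables and the payment check are hypothetical): the catalogue decrement and the billing insert either commit together or are rolled back together.</p>
<pre><code>
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalogue (product_id INTEGER PRIMARY KEY, stock INTEGER);
    CREATE TABLE billing   (order_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO catalogue VALUES (1, 10);
""")

def place_order(order_id, product_id, amount, payment_ok):
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE catalogue SET stock = stock - 1 WHERE product_id = ?", (product_id,))
            if not payment_ok:
                raise RuntimeError("payment failed")
            conn.execute("INSERT INTO billing VALUES (?, ?)", (order_id, amount))
    except RuntimeError:
        pass  # the rollback restored the catalogue count

place_order(100, 1, 49.99, payment_ok=False)
print(conn.execute("SELECT stock FROM catalogue WHERE product_id = 1").fetchone())  # (10,)
</code></pre>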
<p><b>Restrictions with NoSQL </b></p>
<p>Let us understand the restrictions in NoSQL towards achieving this type of transaction:</p>
<li>NoSQL provides locking at the row level, not across rows or tables</li>
<li>With the adoption of polyglot persistence and distributed transactions, we may need to perform a transaction across different datastores as well</li>
<p><b>Two Phase Commit</b></p>
<p>Two Phase Commit is an approach followed in NoSQL to achieve transaction-like behaviour. As the name suggests, the transaction happens in two phases, with the ability during phase 2 to commit or revoke the changes made in phase 1. The approach introduces an additional component, a transaction manager, which helps to commit or roll back the changes made in each phase of the transaction.</p>
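<p>A rough sketch of the pattern, loosely modeled on the MongoDB two-phase-commit tutorial linked at the end of this post; the collection names, field names and pymongo usage are my own assumptions rather than code from that tutorial.</p>
<pre><code>
from pymongo import MongoClient

db = MongoClient()["shop"]

# Phase 1: record the intent in a transaction document and apply the change.
txn_id = db.transactions.insert_one(
    {"state": "pending", "order_id": 42, "product_id": 1}
).inserted_id
db.catalogue.update_one({"_id": 1}, {"$inc": {"stock": -1}})
db.transactions.update_one({"_id": txn_id}, {"$set": {"state": "applied"}})

# Phase 2: commit on successful payment, or hold the reservation for a retry.
payment_ok = False
if payment_ok:
    db.transactions.update_one({"_id": txn_id}, {"$set": {"state": "done"}})
else:
    # Keep the reserved item on hold and allow the payment to be retried later,
    # similar to the Amazon behaviour described below.
    db.transactions.update_one({"_id": txn_id}, {"$set": {"state": "awaiting_payment_retry"}})
</code></pre>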
<p><b>Advantages with Two Phase Commit approach</b></p>
<li>Provides high performance with transactions</li>
<li>Ability to retry the failed portions of the transaction (interesting)</li>
<li>Provides distributed transaction like capabilities across data stores</li>
<p><b>My personal experience with Two Phase Commit</b></p>
<p>Recently I personally came across a Two Phase Commit scenario handled by Amazon for an order placement of mine, which became the inspiration for this post.</p>
<p>I placed an order (a laptop desk) on Amazon; my order placement was received and I went to sleep. It looks like the payment failed for some reason. The next morning I got a notification to retry my payment. In this case, instead of revoking the order placed, Amazon held the order for additional time, say a day or two, and provided an option to retry the failed payment.</p>
<p><b>My order confirmation</b></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRBHNw-dgCtX4Nuz0LDRTh8D0_7WyKAVRemeOsxRLKKcejfiwesnMwq6SQV3QfeyZqiDF1U_FdWYaYzEAhDSjOyEJK-L3zu_4PnvQZdQnG70JVXqJRV0DbpHQlO09k6cXxDJ2eeWAHbn3M/s1600/phase1-order.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRBHNw-dgCtX4Nuz0LDRTh8D0_7WyKAVRemeOsxRLKKcejfiwesnMwq6SQV3QfeyZqiDF1U_FdWYaYzEAhDSjOyEJK-L3zu_4PnvQZdQnG70JVXqJRV0DbpHQlO09k6cXxDJ2eeWAHbn3M/s320/phase1-order.png" /></a></div>
<p><b>Payment retry for my order</b></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbn5YVNsYKU09gqNHb62fmNZRAma5a5CWSxtLeuzozhnkIu56wIHjypDMnVaYUq_y4ZY0A2stGcjHVFIPI5UV17q2stBESiQuTt7SMGEgxAkE2Ru33r5d7Vix7pC8RfALijFQKLZxDlbGf/s1600/phase2-order.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbn5YVNsYKU09gqNHb62fmNZRAma5a5CWSxtLeuzozhnkIu56wIHjypDMnVaYUq_y4ZY0A2stGcjHVFIPI5UV17q2stBESiQuTt7SMGEgxAkE2Ru33r5d7Vix7pC8RfALijFQKLZxDlbGf/s320/phase2-order.png" /></a></div>
<p>I believe Amazon has implemented some form of two phase commit to achieve this. I was personally happy with the way Amazon handled my payment failure: the order was not revoked, I was given a retry option later to complete the order, and my laptop desk was still reserved for me.</p>
<p>This also opens the door for other modes of payment like cash on delivery, etc.</p>
<p>A few links on Two Phase Commit:</p>
<li><a href="http://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf">Starbucks approach for performance</a></li>
<li><a href="http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/">MongoDB - how to perform 2PC</a></li>Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-78118652619498674242015-01-03T12:29:00.002-08:002015-01-05T12:20:28.156-08:00NoSQL - An Introduction<b>NoSQL:</b>
<p>
Not Only SQL, often referred to as NoSQL, provides a mechanism to store and retrieve data other than through the tabular format used in relational databases.
</p><p>There are different NoSQL solutions that are mature and widely adopted, e.g. Redis, Riak, HBase, Cassandra, Couchbase, MongoDB.</p><p>It is critical to understand the concepts of NoSQL and why and how NoSQL is used in a specific application architecture, because every NoSQL solution is unique in its own way and different from general RDBMS solutions.
</p>
<b>Need for NoSQL:</b>
<p>
With the explosion of the web and social interactions, the volume and complexity of data has grown tremendously, and it is the need of the hour for applications to scale seamlessly without any compromise in performance. </p><p>RDBMS performance starts degrading at some point of data volume and complexity, and applications have to consider adopting various NoSQL solutions to match the growth in volume and complexity.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXMAw9YDnPT7iaybK0wghPZ6_d0VGTFSAKQPh9SiwDZ_K8CyWTanosC6Lxofvw5IEu-trhlFcG7Dig6JVt-QVUU9VFnKRm9csYDX4uzz3gzkOmFTgNBeZcxgeAf4cPi-5zQUkdhYNWPWrH/s1600/nosql+-sql.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXMAw9YDnPT7iaybK0wghPZ6_d0VGTFSAKQPh9SiwDZ_K8CyWTanosC6Lxofvw5IEu-trhlFcG7Dig6JVt-QVUU9VFnKRm9csYDX4uzz3gzkOmFTgNBeZcxgeAf4cPi-5zQUkdhYNWPWrH/s320/nosql+-sql.png" /></a></div>
<b>Polyglot persistence: </b>
<p>
NoSQL solutions have become more mature, and enterprise data architects have started implementing NoSQL in their solutions, sending a strong message that the RDBMS is not the only answer to data needs. </p><p>Problems in data persistence are unique, and each problem needs a specific solution to handle the scenario well. The concept of polyglot persistence evolved to insist that an application should use a specific persistence solution to handle each specific scenario. </p><p>The table below describes some scenarios in a retail web application and how different persistence solutions can help to satisfy those needs.
</p>
<p>
<table>
<tr>
<th>Scenario</th>
<th></th>
<th>Persistence solution</th>
</tr>
<tr>
<td>User Sessions</td>
<td></td>
<td>Redis</td>
</tr>
<tr>
<td>Financial data</td>
<td></td>
<td>RDBMS</td>
</tr>
<tr>
<td>Shopping Cart</td>
<td></td>
<td>Riak</td>
</tr>
<tr>
<td>Recommendations</td>
<td></td>
<td>Neo4j</td>
</tr>
<tr>
<td>Product Catalog</td>
<td></td>
<td>MongoDB</td>
</tr>
<tr>
<td>Analytics</td>
<td></td>
<td>Cassandra</td>
</tr>
<tr>
<td>User Activity Logs</td>
<td></td>
<td>Cassandra</td>
</tr>
</table></p><p>
<span class="c20">[Source: http://martinfowler.com/bliki/PolyglotPersistence.html]</span></p>
<b>Coming out of relational mindset:</b>
<p>
One of the biggest problems with the adoption of a NoSQL solution is getting people out of the relational mindset. The data modeling mindset is deeply rooted in RDBMS and relational concepts. </p><p>It will be difficult initially to conceptualize data outside the relational world, but if we understand these concepts and look back at the data solutions we have built, many of them may not need normalized modeling.</p>
The following are a few things that we will see when we move towards the NoSQL world (a small example follows the list):
<ul>
<li> Data is not normalized.</li>
<li> Data will be duplicated.</li>
<li> Tables are schema-less and don't follow a predefined pattern</li>
<li> Data can be stored in different formats like JSON, XML, audio, video etc.</li>
<li> The database may compromise on some of the ACID properties</li>
<li> The data may have some compromise on attributes like consistency.</li>
</ul>
</p>
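<p>As a small example of what a denormalized, schema-less record can look like (a hypothetical order document of my own making): the customer and item details live inside the order itself instead of being joined from separate normalized tables.</p>
<pre><code>
import json

# Hypothetical denormalized "order" document: customer and product details are
# duplicated per order rather than referenced through foreign keys.
order_document = {
    "order_id": 1001,
    "customer": {"id": 7, "name": "Asha", "city": "Chennai"},
    "items": [
        {"sku": "DESK-01", "title": "Laptop desk", "price": 49.99, "qty": 1},
    ],
    "status": "PLACED",
}

# The whole aggregate is read and written as one unit: no joins, no fixed schema.
print(json.dumps(order_document, indent=2))
</code></pre>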
<b>CAP theorem:</b>
<p>
CAP theorem defines a set of basic attributes for any distributed system. Understanding the dimensions of the CAP theorem helps in understanding any NoSQL solution better.
The diagram below describes the attributes satisfied by different distributed database systems in a multi-server deployment environment. </p><p>
The important point to note here is that no distributed system can completely satisfy all three dimensions of the CAP theorem: Consistency, Availability and Partition Tolerance. </p><p>Any distributed system can fully satisfy at most two dimensions of CAP; depending on the application requirements, one has to choose the specific distributed system that suits the needs.
</p><p>
It is critically important to understand the application requirements and understand where the specific NoSQL solution falls.
</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVKSMZigWxj9SQfDHij6lx7mirYVsf1sFOfWFs_JqhjeqS-yh5J2qnTCfNWNiHGG1tKHGC0OSUCiDLFL-45B5hc6fVBgsZi9JTh1ljvRj4gkTIsiGkV8uCYYqKnvedPmWWT6yuMdSYRs35/s1600/CAP.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVKSMZigWxj9SQfDHij6lx7mirYVsf1sFOfWFs_JqhjeqS-yh5J2qnTCfNWNiHGG1tKHGC0OSUCiDLFL-45B5hc6fVBgsZi9JTh1ljvRj4gkTIsiGkV8uCYYqKnvedPmWWT6yuMdSYRs35/s320/CAP.png" /></a></div>
<b>ACID Compliance:</b>
<p>
ACID stands for <b>A</b>tomicity, <b>C</b>onsistency, <b>I</b>solation, <b>D</b>urability; these are the set of properties that guarantee transactional behavior in RDBMS operations.</p><p>RDBMS concepts focus heavily on integrity, concurrency, consistency and data validity, but many of the data needs in software applications may not require this level of integrity and validity, or they can be handled in upper layers.</p><p>Compromising some of these in the database architecture may bring the high performance and scalability that an RDBMS currently lacks.
</p><p>
A NoSQL database, for example, is not strictly ACID compliant; it can compromise on one of the ACID attributes to achieve extreme scalability and performance.</p><p>
It is critically important to understand the application requirements and understand the specific NoSQL used and how the compromise is made.</p>
<b>BASE versus ACID:</b>
<p>
Instead of adhering to ACID compliance, NoSQL tends towards BASE compliance in order to achieve scalability and high performance. The following are the BASE attributes that NoSQL solutions try to adopt:</p>
<ul>
<li>
<b>B</b>asically <b>A</b>vailable</li>
<li><b>S</b>oft state</li>
<li><b>E</b>ventual consistency</li>
</ul>
<p>
<b>NoSQL Categorization based on data modeling </b></p>
<ul>
<li> Key Value Stores Ex : Redis, Riak, Amazon Simple DB</li>
<li> Column Family Stores ( Big Tables ) Ex : Cassandra , HBase</li>
<li> Document databases Ex : CouchDB , Couchbase , MongoDB</li>
<li> Graph databases Ex : Neo4j, Titan</li>
</ul>
<p>
Each of these NoSQL categories provides unique advantages for specific functionalities, and selecting the right NoSQL category is critical for the design of the application. </p><p>At a high level, the specific NoSQL solution can be chosen based on the complexity of the data model and the querying associated with it.</p>
<p>
The below diagram provides a good comparison on the different NoSQL databases.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwiVFNpcTaac_PZA8GPZ9cQwZXzqpzzeNeFmqqxL9CfNCevhDTspdkTK30R3ya4V1e6bBVdMle0LT1YKEdOf1o9JdD7SugViUaiBg2gwwJIe_LUTzXNgAFe_9UuEr3cggZYyzOySdF2T6P/s1600/NOSQL+Data+Model.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwiVFNpcTaac_PZA8GPZ9cQwZXzqpzzeNeFmqqxL9CfNCevhDTspdkTK30R3ya4V1e6bBVdMle0LT1YKEdOf1o9JdD7SugViUaiBg2gwwJIe_LUTzXNgAFe_9UuEr3cggZYyzOySdF2T6P/s320/NOSQL+Data+Model.png" /></a></div>
<span class="c20">[Source: https://highlyscalable.wordpress.com/2012/03/01/NoSQL-data-modeling-techniques/]</span></p>
<b>NoSQL based on system architecture:</b>
<p>
Based on the system architecture, NoSQL can be categorized into the following.</p>
<ul>
<li> P2P ( Ring Topology ) </li>
<li> Master Slave </li>
</ul>
<p>
Each architecture has some pros and cons and a decision has to be made based on the needs.
</p>
<p>
<table>
<tr>
<th></th>
<th></th>
<th>P2P ( Ring Topology )</th>
<th></th>
<th>Master Slave</th>
</tr>
<tr>
<th>Role</th>
<th></th>
<td>All nodes carry an equal role</td>
<th></th>
<td>Master – Slave architecture with specific responsibilities on specific nodes</td>
</tr>
<tr>
<th>Consistency</th>
<th></th>
<td>Eventual</td>
<th></th>
<td>Strong</td>
</tr>
<tr>
<th>Write/Read</th>
<th></th>
<td>Reads and writes happen through all the nodes</td>
<th></th>
<td>Writes are mostly driven through specific nodes</td>
</tr>
<tr>
<th>Availability</th>
<th></th>
<td>High Availability</td>
<th></th>
<td>Availability is somewhat compromised when the master / write node fails</td>
</tr>
<tr>
<th>Data</th>
<th></th>
<td>Data is partitioned across all nodes with replication</td>
<th></th>
<td>Data is partitioned into multiple slave nodes with replication</td>
</tr>
<tr>
<th>Examples</th>
<th></th>
<td>Cassandra, Couchbase</td>
<th></th>
<td>HBase, MongoDB</td>
</tr>
</table></p>
<b>Data read / writes:</b>
<p>The need for NoSQL-type solutions arises when you operate on huge volumes of data with high performance requirements for reads and writes.</p>
<p>Below are the typical use cases where NoSQL databases will be used</p>
<ul>
<li>Scalable databases</li>
<li>High availability and fault tolerance</li>
<li>Ever growing set of data</li>
<li>Bulk read / write operations </li>
</ul>
<p>
Some NoSQL solutions are good for write intensive workloads, some for read intensive workloads and some for mixed workloads; specific analysis has to be done to decide on the NoSQL solution based on the needs.</p>
<p>Other important concepts that I would like to highlight specific to any NoSQL solutions:</p>
<b>Sharding:</b>
<p>Sharding is one of the important concepts in NoSQL solutions, by which the data is partitioned horizontally across different nodes in the cluster. This means the data is split based on some logic, say a hash code, and spread across different nodes.</p>
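<p>A minimal sketch of hash-based sharding (my own illustration; real systems typically use consistent hashing or range partitioning): each key is routed to one of a fixed set of nodes by a stable hash.</p>
<pre><code>
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    # Stable hash (md5 here) so the same key always maps to the same node.
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]

for user_id in ("user:1", "user:2", "user:3", "user:42"):
    print(user_id, "->", shard_for(user_id))
</code></pre>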
<b>Replication:</b>
<p>The data is not only partitioned across different nodes but also replicated across different cluster nodes. The replication factor is a configuration of the solution. Replication gives high availability and automatic failover when a specific node goes down.</p>
<p class="c4"></p><p class="c9"><span class="c3"><b>Reference:</b></span></p><p class="c9"><span class="c28 c1">
<a class="c2" href="http://www.google.com/url?q=http%3A%2F%2Fmartinfowler.com%2F&sa=D&sntz=1&usg=AFQjCNHROry1Bmz_6TABvMoKIkfgjInLNQ">http://martinfowler.com/</a></span></p><p class="c9 c16"><span class="c1 c28">
<a class="c2" href="http://www.google.com/url?q=http%3A%2F%2Fhighscalability.com&sa=D&sntz=1&usg=AFQjCNE5fqFk-nViuqHMmahaEmHHyb3THw">http://highscalability.com</a></span></p><p class="c9 c16"><span class="c28 c1">
<a class="c2" href="http://www.google.com/url?q=http%3A%2F%2Fnosqlguide.com%2F&sa=D&sntz=1&usg=AFQjCNFqqFSnEw6MiD0wnFS730Mtw6284g">http://nosqlguide.com/</a></span>
<b>Configuring Apache Hadoop Cluster in a standalone machine (2013-05-02)</b>
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 14">
<meta name=Originator content="Microsoft Word 14">
<link rel=File-List href="HadoopIntroduction_files/filelist.xml">
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>CIS</o:Author>
<o:LastAuthor>CIS</o:LastAuthor>
<o:Revision>2</o:Revision>
<o:TotalTime>165</o:TotalTime>
<o:Created>2013-05-03T02:51:00Z</o:Created>
<o:LastSaved>2013-05-03T02:51:00Z</o:LastSaved>
<o:Pages>2</o:Pages>
<o:Words>1095</o:Words>
<o:Characters>6245</o:Characters>
<o:Company>Comcast</o:Company>
<o:Lines>52</o:Lines>
<o:Paragraphs>14</o:Paragraphs>
<o:CharactersWithSpaces>7326</o:CharactersWithSpaces>
<o:Version>14.00</o:Version>
</o:DocumentProperties>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif]-->
<link rel=themeData href="HadoopIntroduction_files/themedata.thmx">
<link rel=colorSchemeMapping
href="HadoopIntroduction_files/colorschememapping.xml">
<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:SpellingState>Clean</w:SpellingState>
<w:GrammarState>Clean</w:GrammarState>
<w:TrackMoves>false</w:TrackMoves>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
<w:SplitPgBreakAndParaMark/>
<w:EnableOpenTypeKerning/>
<w:DontFlipMirrorIndents/>
<w:OverrideTableStyleHps/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="--"/>
<m:smallFrac m:val="off"/>
<m:dispDef/>
<m:lMargin m:val="0"/>
<m:rMargin m:val="0"/>
<m:defJc m:val="centerGroup"/>
<m:wrapIndent m:val="1440"/>
<m:intLim m:val="subSup"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="267">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 3"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 4"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 5"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 6"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 7"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 8"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 9"/>
<w:LsdException Locked="false" Priority="39" Name="toc 1"/>
<w:LsdException Locked="false" Priority="39" Name="toc 2"/>
<w:LsdException Locked="false" Priority="39" Name="toc 3"/>
<w:LsdException Locked="false" Priority="39" Name="toc 4"/>
<w:LsdException Locked="false" Priority="39" Name="toc 5"/>
<w:LsdException Locked="false" Priority="39" Name="toc 6"/>
<w:LsdException Locked="false" Priority="39" Name="toc 7"/>
<w:LsdException Locked="false" Priority="39" Name="toc 8"/>
<w:LsdException Locked="false" Priority="39" Name="toc 9"/>
<w:LsdException Locked="false" Priority="35" QFormat="true" Name="caption"/>
<w:LsdException Locked="false" Priority="10" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Title"/>
<w:LsdException Locked="false" Priority="1" Name="Default Paragraph Font"/>
<w:LsdException Locked="false" Priority="11" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtitle"/>
<w:LsdException Locked="false" Priority="22" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Strong"/>
<w:LsdException Locked="false" Priority="20" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Emphasis"/>
<w:LsdException Locked="false" Priority="59" SemiHidden="false"
UnhideWhenUsed="false" Name="Table Grid"/>
<w:LsdException Locked="false" UnhideWhenUsed="false" Name="Placeholder Text"/>
<w:LsdException Locked="false" Priority="1" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="No Spacing"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 1"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 1"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 1"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 1"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 1"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 1"/>
<w:LsdException Locked="false" UnhideWhenUsed="false" Name="Revision"/>
<w:LsdException Locked="false" Priority="34" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="List Paragraph"/>
<w:LsdException Locked="false" Priority="29" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Quote"/>
<w:LsdException Locked="false" Priority="30" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Quote"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 1"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 1"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 1"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 1"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 1"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 1"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 1"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 2"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 2"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 2"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 2"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 2"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 2"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 2"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 2"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 2"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 2"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 2"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 2"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 2"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 2"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 3"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 3"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 3"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 3"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 3"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 3"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 3"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 3"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 3"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 3"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 3"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 3"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 3"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 3"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 4"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 4"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 4"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 4"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 4"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 4"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 4"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 4"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 4"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 4"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 4"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 4"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 5"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 5"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 5"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 5"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 5"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 5"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 5"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 5"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 5"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 5"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 6"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 6"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 6"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 6"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 6"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/>
<w:LsdException Locked="false" Priority="19" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/>
<w:LsdException Locked="false" Priority="21" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/>
<w:LsdException Locked="false" Priority="31" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/>
<w:LsdException Locked="false" Priority="32" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/>
<w:LsdException Locked="false" Priority="33" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Book Title"/>
<w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
<w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
</w:LatentStyles>
</xml><![endif]-->
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;
mso-font-charset:2;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:0 268435456 0 0 -2147483648 0;}
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;
mso-font-charset:2;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:0 268435456 0 0 -2147483648 0;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;
mso-font-charset:0;
mso-generic-font-family:swiss;
mso-font-pitch:variable;
mso-font-signature:-536870145 1073786111 1 0 415 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-parent:"";
margin-top:0in;
margin-right:0in;
margin-bottom:10.0pt;
margin-left:0in;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:Calibri;
mso-fareast-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
mso-themecolor:hyperlink;
text-decoration:underline;
text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-noshow:yes;
mso-style-priority:99;
color:purple;
mso-themecolor:followedhyperlink;
text-decoration:underline;
text-underline:single;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
mso-style-unhide:no;
mso-style-qformat:yes;
margin-top:0in;
margin-right:0in;
margin-bottom:10.0pt;
margin-left:.5in;
mso-add-space:auto;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:Calibri;
mso-fareast-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
p.MsoListParagraphCxSpFirst, li.MsoListParagraphCxSpFirst, div.MsoListParagraphCxSpFirst
{mso-style-priority:34;
mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-type:export-only;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
mso-add-space:auto;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:Calibri;
mso-fareast-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
p.MsoListParagraphCxSpMiddle, li.MsoListParagraphCxSpMiddle, div.MsoListParagraphCxSpMiddle
{mso-style-priority:34;
mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-type:export-only;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
mso-add-space:auto;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:Calibri;
mso-fareast-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
p.MsoListParagraphCxSpLast, li.MsoListParagraphCxSpLast, div.MsoListParagraphCxSpLast
{mso-style-priority:34;
mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-type:export-only;
margin-top:0in;
margin-right:0in;
margin-bottom:10.0pt;
margin-left:.5in;
mso-add-space:auto;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:Calibri;
mso-fareast-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
span.SpellE
{mso-style-name:"";
mso-spl-e:yes;}
span.GramE
{mso-style-name:"";
mso-gram-e:yes;}
.MsoChpDefault
{mso-style-type:export-only;
mso-default-props:yes;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:Calibri;
mso-fareast-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
.MsoPapDefault
{mso-style-type:export-only;
margin-bottom:10.0pt;
line-height:115%;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:73671106;
mso-list-type:hybrid;
mso-list-template-ids:-1263355800 -419936922 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:.75in;
text-indent:-.25in;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:1.25in;
text-indent:-.25in;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:1.75in;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:2.25in;
text-indent:-.25in;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:2.75in;
text-indent:-.25in;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:3.25in;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:3.75in;
text-indent:-.25in;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:4.25in;
text-indent:-.25in;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:4.75in;
text-indent:-9.0pt;}
@list l1
{mso-list-id:93792557;
mso-list-type:hybrid;
mso-list-template-ids:-1363742102 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l1:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2
{mso-list-id:736785779;
mso-list-type:hybrid;
mso-list-template-ids:610018190 503486840 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l2:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:38.25pt;
text-indent:-.25in;}
@list l2:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:74.25pt;
text-indent:-.25in;}
@list l2:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:110.25pt;
text-indent:-9.0pt;}
@list l2:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:146.25pt;
text-indent:-.25in;}
@list l2:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:182.25pt;
text-indent:-.25in;}
@list l2:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:218.25pt;
text-indent:-9.0pt;}
@list l2:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:254.25pt;
text-indent:-.25in;}
@list l2:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:290.25pt;
text-indent:-.25in;}
@list l2:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:326.25pt;
text-indent:-9.0pt;}
@list l3
{mso-list-id:1134367605;
mso-list-type:hybrid;
mso-list-template-ids:-1850603560 197682394 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l3:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:38.25pt;
text-indent:-.25in;}
@list l3:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:74.25pt;
text-indent:-.25in;}
@list l3:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:110.25pt;
text-indent:-9.0pt;}
@list l3:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:146.25pt;
text-indent:-.25in;}
@list l3:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:182.25pt;
text-indent:-.25in;}
@list l3:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:218.25pt;
text-indent:-9.0pt;}
@list l3:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:254.25pt;
text-indent:-.25in;}
@list l3:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
margin-left:290.25pt;
text-indent:-.25in;}
@list l3:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
margin-left:326.25pt;
text-indent:-9.0pt;}
@list l4
{mso-list-id:1184593562;
mso-list-type:hybrid;
mso-list-template-ids:-1676251714 67698697 67698691 67698693 67698689 67698691 67698693 67698689 67698691 67698693;}
@list l4:level1
{mso-level-number-format:bullet;
mso-level-text:\F076;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l4:level2
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l4:level3
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l4:level4
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l4:level5
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l4:level6
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l4:level7
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l4:level8
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l4:level9
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-priority:99;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin-top:0in;
mso-para-margin-right:0in;
mso-para-margin-bottom:10.0pt;
mso-para-margin-left:0in;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;}
</style>
<![endif]--><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026"/>
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1"/>
</o:shapelayout></xml><![endif]-->
<strong>Introduction</strong><br />
In this post I explain how to set up and configure an Apache Hadoop cluster with two or more nodes on a single machine, typically your Windows laptop or desktop. This will help you build MapReduce programs, run them in a realistic cluster-like environment, and understand Hadoop better.<br />
Apache Hadoop is a free, open source software release for reliable and scalable distributed computing. It is a framework that allows distributed processing of large data sets across clusters of computers.<br />
<br />
<strong>During this Hadoop cluster setup, the following activities will be performed at a high level</strong><br />
1. Creating base nodes for the cluster<br />
2. Setting up the base operating system for the cluster<br />
3. Setting up Hadoop dependencies on the nodes<br />
4. Configuring Hadoop users and access<br />
5. Setting up password-less authentication across the cluster nodes<br />
6. Configuring Hadoop roles for the nodes<br />
7. Running the Hadoop daemons for each role<br />
8. Browsing the Hadoop HDFS and job tracker sites<br />
<br />
<strong>Creating base nodes for the cluster:</strong><br />
If you plan to try this setup on your local Windows laptop or desktop, download <a href="http://www.vmware.com/products/player">VMware Player</a>, a free tool that helps you set up virtual machines with their own local IPs, so at the end you have a simple network of servers that can talk to each other. Laptops nowadays come with multiple cores and 4 GB of memory, so it is easy to set up at least 3 nodes on your personal laptop or desktop.<br />
<br />
<strong>Setup a Linux flavor of OS on the base nodes:</strong><br />
On the base VM nodes you have set up with VMware Player, install a Linux-based OS from an ISO file. I chose Ubuntu Server, which is free to <a href="http://www.ubuntu.com/download/server">download</a>. Download the ISO and complete the VM creation with VMware Player.<br />
Once the OS installation is done, you will end up with a root or sudo user for the server. You can get the IP address of each server by typing the command <i>ifconfig</i>; note down the IP addresses of all the servers.<br />
<br />
<strong>Setup Hadoop and its dependencies:</strong><br />
We now have the servers set up with an OS and a sudo user to operate on, so we can start setting up Hadoop on the nodes.<br />
Apache Hadoop has the following dependencies:<br />
1. Java version 6 or higher<br />
2. SSH<br />
<a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Download</a> Java and set it up on each server. I set up the JRE under /opt/jre1.6.0_45 and set JAVA_HOME in ~/.bashrc; you can verify the setup by typing <i>java -version</i> and checking the version details displayed.<br />
SSH can be installed with the command <i>sudo apt-get install openssh-server</i>.<br />
Verify SSH by running <i>ssh localhost</i> on that machine itself.<br />
Download a stable version of <a href="http://hadoop.apache.org/releases.html">Hadoop</a>; I chose the 1.0.x line for this setup.<br />
If you have downloaded the .tar.gz file, you can use the command <i>tar -zxvf {file.tar.gz}</i> to extract the contents. I extracted it to /opt/hadoop-1.0.4.<br />
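Putting the dependency setup together, here is a minimal sketch for an Ubuntu node; the Hadoop archive name and the JRE path are the ones used above, so adjust them to whatever you actually downloaded:<br />
<br />
# Install the SSH server<br />
sudo apt-get install openssh-server<br />
# Unpack the Hadoop release under /opt (assumes the tarball is in the current directory)<br />
sudo tar -zxvf hadoop-1.0.4.tar.gz -C /opt<br />
# Point JAVA_HOME at the JRE and put it on the PATH for this user<br />
echo 'export JAVA_HOME=/opt/jre1.6.0_45' >> ~/.bashrc<br />
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc<br />
source ~/.bashrc<br />
# Verify the setup<br />
java -version<br />
ssh localhost<br />
<br />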
<strong>Configure Hadoop</strong><br />
With Hadoop and its dependencies in place, we can now start configuring Hadoop on the server; this involves the following activities (a consolidated sketch follows the list):<br />
1. Create a new user, say hadoop. On Ubuntu I used the command <i>adduser hadoop</i>.<br />
2. Give sudo access to the user by editing the /etc/sudoers file:<br />
a. Run <i>sudo visudo</i><br />
b. Add the line <i>hadoop ALL=(ALL:ALL) ALL</i> to the file<br />
3. Give the hadoop user full permission on /opt/hadoop-1.0.4, where the Hadoop binaries are installed:<br />
a. <i>chown -R hadoop:hadoop /opt/hadoop-1.0.4</i><br />
b. <i>chmod -R 777 /opt/hadoop-1.0.4</i><br />
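The same user setup, as a sketch you can run on each node (paths as used above):<br />
<br />
# Create a dedicated hadoop user<br />
sudo adduser hadoop<br />
# Grant it sudo rights: run "sudo visudo" and append the line<br />
#   hadoop ALL=(ALL:ALL) ALL<br />
# Hand the Hadoop installation over to the new user<br />
sudo chown -R hadoop:hadoop /opt/hadoop-1.0.4<br />
sudo chmod -R 777 /opt/hadoop-1.0.4<br />
<br />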
You have to repeat the above steps on all the nodes in the cluster, or simply clone the virtual machines, making sure each virtual machine gets a different IP address. Assume you have created 3 nodes for this cluster.<br />
Now that we have 3 nodes, we have to decide on their roles: one node is the master node, playing the roles of namenode and jobtracker, and the other nodes play the roles of datanode and tasktracker. We can call the nodes hdpMaster, hdpSlave1 and hdpSlave2.<br />
<strong>Configuring authenticated SSH access between the master and the other nodes</strong><br />
We need to configure password-less SSH access for the hadoop user from the master node to the rest of the slave nodes. Perform the following steps to set this up.<br />
$ <i>ssh-keygen -t rsa</i> (generates the key file)<br />
Copy the key file to all the slave machines:<br />
$ <i>scp .ssh/id_rsa.pub hadoop@192.168.8.129:~hadoop/.ssh/authorized_keys</i> (Slave1)<br />
$ <i>scp .ssh/id_rsa.pub hadoop@192.168.8.130:~hadoop/.ssh/authorized_keys</i> (Slave2)<br />
You should also be able to ssh into the master itself without a password; if not, add the key to its own authorized keys:<br />
$ <i>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys</i><br />
Once the key is added to the authorized keys, password-less access to the machines will be possible.<br />
Verify that you are able to connect over ssh to localhost and to all the slaves using the ssh command:<br />
<i>ssh localhost</i><br />
<i>ssh slave1IP</i><br />
<i>ssh slave2IP</i><br />
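As a consolidated sketch, the whole key exchange looks like this when run as the hadoop user on the master node (the slave IPs are the illustrative ones above, and the ~/.ssh directory is assumed to already exist on the slaves):<br />
<br />
# Generate an RSA key pair on the master (accept the default location, empty passphrase)<br />
ssh-keygen -t rsa<br />
# Authorize the key on the master itself<br />
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys<br />
# Push the public key to each slave's authorized keys<br />
scp ~/.ssh/id_rsa.pub hadoop@192.168.8.129:~/.ssh/authorized_keys<br />
scp ~/.ssh/id_rsa.pub hadoop@192.168.8.130:~/.ssh/authorized_keys<br />
# Verify password-less logins<br />
ssh localhost<br />
ssh 192.168.8.129<br />
ssh 192.168.8.130<br />
<br />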
<br />
<strong>Host entries for the servers:</strong><br />
Update the hosts file at /etc/hosts with the hostnames if you want to address the servers by their hostnames.<br />
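On each node the /etc/hosts entries would look something like this (the master IP is illustrative; the slave IPs are the ones used earlier):<br />
<br />
192.168.8.128   hdpMaster<br />
192.168.8.129   hdpSlave1<br />
192.168.8.130   hdpSlave2<br />
<br />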
<strong>Configure Hadoop roles for master and slaves:</strong><br />
Everything is now in place for Hadoop to start; we are at the last step of configuring the roles for the nodes and starting the cluster.<br />
On the master node, perform the following steps (a sketch of the resulting configuration files follows the list):<br />
1. Go to the HadoopHome/conf location<br />
2. Update hadoop-env.sh with the JAVA_HOME location pointing to the Java installation path<br />
3. Update core-site.xml to the following<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1UjKP1OMKIY7a4B_hxs-l-pLPOKmgAwInQiOFI_LBIHRheR_ArIu12d79xgc9blqGkX4STIeC_wPZx0-vt09p71zxNtrp9TURCaWEumE6Dy3kL3YWJfWRAs1HA1wv_k_kdQXd0W5Zz3ea/s1600/core-site.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1UjKP1OMKIY7a4B_hxs-l-pLPOKmgAwInQiOFI_LBIHRheR_ArIu12d79xgc9blqGkX4STIeC_wPZx0-vt09p71zxNtrp9TURCaWEumE6Dy3kL3YWJfWRAs1HA1wv_k_kdQXd0W5Zz3ea/s320/core-site.png" /></a><br />
4. Update hdfs-site.xml to the following<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnZ6sg4ndMY1dwRb57edLGLFrsECKmnmzKt41qAHjuM1WADSJFE7Lj7HA1OdPYRjfy8pQNAG1DzBmIkugnlWRKbu7Q8ra9XbCWUCGl-fIXvrCPI52jmbXvnmtS1GtPV6VVR8Gyrby_PG5C/s1600/hdfssitexml.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnZ6sg4ndMY1dwRb57edLGLFrsECKmnmzKt41qAHjuM1WADSJFE7Lj7HA1OdPYRjfy8pQNAG1DzBmIkugnlWRKbu7Q8ra9XbCWUCGl-fIXvrCPI52jmbXvnmtS1GtPV6VVR8Gyrby_PG5C/s320/hdfssitexml.png" /></a><br />
5. Update mapred-site.xml to the following<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJlSpf4hQO15JLHUEbv1EIcw8tPFUGwRB8BQeQ47bzZXAGIKZEju8jjkm10BPKOHmJWtmQ-sd4io9VbOp5rQcbz2HcxKHeCZIwKJbxtxeZmkx_aEoIWfE2heeAhjtyih6umugpsOCyE9Dg/s1600/mapredxml.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJlSpf4hQO15JLHUEbv1EIcw8tPFUGwRB8BQeQ47bzZXAGIKZEju8jjkm10BPKOHmJWtmQ-sd4io9VbOp5rQcbz2HcxKHeCZIwKJbxtxeZmkx_aEoIWfE2heeAhjtyih6umugpsOCyE9Dg/s320/mapredxml.png" /></a><br />
6. Update the masters file with the master hostname<br />
7. Update the slaves file with all the slave hostnames<br />
Repeat steps 1-4 on all the slave nodes.<br />
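For reference, on a Hadoop 1.x cluster the configuration files in the screenshots typically carry entries along the following lines. This is only a sketch: hdpMaster is the master hostname from /etc/hosts, the ports 9000/9001 and the replication factor of 2 are conventional choices I am assuming here, and the actual values for this cluster are the ones shown in the screenshots above. The commands assume you are in the HadoopHome/conf directory.<br />
<br />
# In hadoop-env.sh, point JAVA_HOME at the Java installation used earlier<br />
echo 'export JAVA_HOME=/opt/jre1.6.0_45' >> hadoop-env.sh<br />
# core-site.xml : property fs.default.name = hdfs://hdpMaster:9000 (namenode address)<br />
# hdfs-site.xml : property dfs.replication = 2 (number of HDFS block replicas)<br />
# mapred-site.xml : property mapred.job.tracker = hdpMaster:9001 (jobtracker address)<br />
# masters file holds the master hostname, slaves file holds one slave hostname per line<br />
echo 'hdpMaster' > masters<br />
printf 'hdpSlave1\nhdpSlave2\n' > slaves<br />
<br />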
The Hadoop cluster is now configured for HDFS and MapReduce, so we can start the corresponding daemons on the cluster.<br />
Step 1: Go to the HadoopHome location<br />
Step 2: Format the namenode by running the command <i>bin/hadoop namenode -format</i><br />
Step 3: Go to the bin folder and run the namenode and datanode daemons, then run the jobtracker and tasktracker daemons<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4uKfxqPQ-S6My0es6VZcqRfhRJf-OPYlzpUOgD4jynhHbdiG28xunsFNrPqYaFMJX-9HbiVH37nxlS89NHVNRIdeV-3H0rDvPmFsbSIncdivDsD3rumXdItvboGLGCL7WkQwWgRF_MOZ5/s1600/starthadoop_sh.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4uKfxqPQ-S6My0es6VZcqRfhRJf-OPYlzpUOgD4jynhHbdiG28xunsFNrPqYaFMJX-9HbiVH37nxlS89NHVNRIdeV-3H0rDvPmFsbSIncdivDsD3rumXdItvboGLGCL7WkQwWgRF_MOZ5/s320/starthadoop_sh.png" /></a>
Option 1: Run ./start-all.sh on the master node; this will start all the daemons on all the nodes of the cluster as configured in the masters and slaves files.<br />
Option 2: Run ./start-dfs.sh on the master node to start the namenode and datanodes, then run ./start-mapred.sh to start the jobtracker and tasktrackers on the nodes.<br />
Option 3: Run the daemons individually.<br />
On the master node:<br />
./hadoop-daemon.sh start namenode<br />
./hadoop-daemon.sh start jobtracker<br />
On the slave nodes:<br />
./hadoop-daemon.sh start datanode<br />
./hadoop-daemon.sh start tasktracker<br />
You can check the logs of the nodes, and any errors during initialization, under HadoopHome/logs on each of the nodes.<br />
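End to end, a minimal start-up sequence on the master node looks like this (a sketch; run it as the hadoop user from the Hadoop home directory, e.g. /opt/hadoop-1.0.4):<br />
<br />
# One-time step: format the HDFS namenode<br />
bin/hadoop namenode -format<br />
# Start HDFS (namenode on the master, datanodes on the slaves listed in conf/slaves)<br />
bin/start-dfs.sh<br />
# Start MapReduce (jobtracker on the master, tasktrackers on the slaves)<br />
bin/start-mapred.sh<br />
# If a daemon does not come up, look at its log file<br />
ls logs/<br />
<br />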
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP-3dfnN39s4aTTZW_s26UiTUcyAEkxB9fDOuyeK0O_sHUQjwt7uVihCjN7krZnVcaWpobSR0wlTYw0I4MFpysHUje_uvtij8NZ6xVJ5H9Pr7dICU6itbTveN8_TLUE71p9iJhXHSFOloX/s1600/VMNodes.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP-3dfnN39s4aTTZW_s26UiTUcyAEkxB9fDOuyeK0O_sHUQjwt7uVihCjN7krZnVcaWpobSR0wlTYw0I4MFpysHUje_uvtij8NZ6xVJ5H9Pr7dICU6itbTveN8_TLUE71p9iJhXHSFOloX/s320/VMNodes.png" /></a>
If everything went fine, you should be able to browse the following sites to track HDFS and Hadoop jobs:<br />
<a href="http://masternode:50070/dfshealth.jsp">http://masternode:50070/dfshealth.jsp</a> - to track HDFS and its health<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxrGTqbzv0MIIWekvwPlgSV_aMkNBR2hilbAJ5lOCnp3iadSrOivWF-h72ACXCrnqIas9zZPMExrz04uDcU-x93kVLo-7Obx_ZhEVgXFoykErGFmqKWG_tcbUMDiwNRBljPndHspxZ21cJ/s1600/hdfssite.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxrGTqbzv0MIIWekvwPlgSV_aMkNBR2hilbAJ5lOCnp3iadSrOivWF-h72ACXCrnqIas9zZPMExrz04uDcU-x93kVLo-7Obx_ZhEVgXFoykErGFmqKWG_tcbUMDiwNRBljPndHspxZ21cJ/s320/hdfssite.png" /></a><br />
<a href="http://masternode:50030/jobtracker.jsp">http://masternode:50030/jobtracker.jsp</a> - to track running jobs and their status<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi0QoQ0uP5UzoDy-qbvlPHmyuAn20En5IWDrca0mfXG3xx59CxPTpUFu0xFqDNhtoyeqUJ_P7Awv8du_NYeVlIyIXAZXPsnkxudWo3un2y4XIB97sr5qG9kqMPn9a2iegCtbomFwo1Eyi_/s1600/mapredsite.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi0QoQ0uP5UzoDy-qbvlPHmyuAn20En5IWDrca0mfXG3xx59CxPTpUFu0xFqDNhtoyeqUJ_P7Awv8du_NYeVlIyIXAZXPsnkxudWo3un2y4XIB97sr5qG9kqMPn9a2iegCtbomFwo1Eyi_/s320/mapredsite.png" /></a><br />
<br />
Reference: <a href="http://hadoop.apache.org/docs/stable/cluster_setup.html">Apache Hadoop cluster setup</a>
</html>Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-23620862835002232042011-02-27T09:20:00.000-08:002011-02-27T09:26:18.871-08:00A2A 'Cloud Comparison' - Database as a ServiceThis is part of my series of articles on A2A Cloud Comparison. In my previous articles I was explaining my views of A2A Comparison with <a href="http://breezeoncloud.blogspot.com/2010/10/a2a-cloud-comparison-compute-service.html">Compute </a>and <a href="http://breezeoncloud.blogspot.com/2010/11/a2a-cloud-comparison-storage-services.html">Storage</a>. In this article I will provide my views on Database as a Service with Amazon and Azure. <br /><br /><strong>Introduction</strong><br /> We all know how data is crucial to an application take an example whether it is a banking application or an online music store application, data is very important to the whole system. Say you have recently registered and created a user on a specific site and if the user identity is not found next time when you login to the site think how much hesitation will get and you will think twice before continuing to use the site. Think of what will happen if you lose some data in a critical financial application. Losing the data will incur heavy loss to the system or make the application really obsolete. The reason why I am talking about data criticality is because in this blog I am going to talk about the database as a service offering from the cloud computing providers.<br />When we talk about data most of the applications store their data in a database and managing the database will be a crucial task for the system. Database administration helps to manage the database and assures to keep the database updated and highly available. I want to list some to tasks performed as part of database administration<br />1. Patching the database software up to date<br />2. Taking backups of the database<br />3. Maintaining the backup for the specified retention period<br />4. Point in time recovery<br /><br /><strong>Database as Service</strong> <br />What if all the database administration tasks have been taken care and have ability to scale the capacity with high availability and reliability? Database as a Service is the answer for that. <br /><br /><strong>Amazon and Azure Offerings</strong><br /> Both Amazon and Azure provides offerings in the database as a service space and are differentiated in some ways. Amazon provides its offering as RDS (Relational Database as a Service) while Microsoft Azure provides its offering as SQL Azure.<br /> Amazon operates RDS in Infrastructure as a Service space while Microsoft SQL Azure operates at Platform as a Service space, I will be explaining it in detail below. Following the general cloud pricing model this service will also be charged in a Pay as you use model.<br /><br /><strong>RDS:</strong><br />Amazon offering for Database as a service called <a href="http://aws.amazon.com/rds/">RDS </a>(Relational Database as a Service) provides database service for MYSQL database. 
Recently Amazon has made an announcement that will extend <a href="http://aws.amazon.com/rds/oracle/?utm_source=OraclePR&utm_medium=RDSLandingPage&utm_campaign=Oracle">RDS for Oracle database</a>, that means you will be able to create an Oracle database with all the setup ready in matter of minutes and you can able to create and delete the instances with hourly chargeback model and with all database administration tasks taken care..Sounds interesting?<br />Every RDS instance in Amazon will get a dedicated virtual server instance, database storages with all the data backup and retention policies configured, this is why I called RDS operating in Infrastructure as a Service space and because of its underlying virtualization model the instance can be migrated to a bigger server configuration if needed. Database servers can also be configured for Read replication or Multi Availability Zone deployment for high availability and Disaster Recovery. <br />Recently I have to validate the performance of Oracle database in a specific use case for a POC, for scenarios like this it will be difficult in non cloud model because Oracle software licenses will be charged for duration of a year at least and the licenses are Processor based or Socket based. It will be difficult to compromise with express edition or a single socket license as we have to validate performance scenario and now with cloud model it is easy to execute, create and use it for the period needed and release it when POC is done , as simple as that.<br /><br /><strong>SQL Azure:</strong><br /> Microsoft offering for Database as a Service called <a href="http://social.technet.microsoft.com/wiki/contents/articles/inside-sql-azure.aspx">SQL Azure </a>provides service for SQL Server database. With SQL Azure we will be able to create databases for 1GB, 5GB up to a maximum of 50GB. We can create a smaller DB during creation and can later alter to a maximum of 50GB with all the database management tasks taken care operating in a pay as you use model.<br /> Microsoft operates SQL Azure in a way bit different from Amazon RDS. Unlike RDS SQL Azure does not spare a dedicated virtual server for databases instead multiple SQL Azure databases will be hosted in a bigger SQL Server instance and will be operating more like a shared multi tenant environment with all the tenant specific security measures taken care, this architecture will be abstracted from the end user as the end user will be able to operate the database in a usual way and the user is assured with high availability and scalability.<br /> One thing that has to take care in SQL Azure is that it can scale to a maximum of 50GB as of now and beyond that we have to plan for horizontal scaling of database in our application architecture. <br /><br /><strong>References:</strong><br />http://social.technet.microsoft.com/wiki/contents/articles/inside-sql-azure.aspx<br />http://aws.amazon.com/rds/Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com2tag:blogger.com,1999:blog-9141828500783422104.post-7679139375512013472010-11-08T00:14:00.000-08:002010-11-08T00:20:25.069-08:00A2A ‘Cloud Comparison’ – Storage ServicesThis is part of my series of article 'A2A Cloud Comparison' ; in my <a href="http://breezeoncloud.blogspot.com/2010/10/a2a-cloud-comparison-compute-service.html">previous article </a> I have compared Amazon and Azure on Computing Services space. 
In this article I have given my view on Cloud Storage Services in general and the corresponding services by Amazon and Azure Cloud Providers.<br /><strong>Storage in Cloud</strong> <br />One of the important services that are provided by Cloud is the Storage Service. Cloud Storage provides enormous amount of storage space that is accessible over internet with features added on top of it. Also as with other cloud services this comes with Pay as you use model. Let us understand why the storage services in cloud is going to be important, year by year the cost of storage disks keep on reducing but still the enterprise storage cost keep on increasing year by year, the problem with conventional storage costing is that even though the hardware cost keeps on reducing cost on operation and maintenance keeps the total cost increased, also it is difficult to keep with the exponential need in the storage needs. Cloud Storage Services tries to address all these problems.<br /><br /><strong>Understanding Storage in Cloud:</strong> <br />Cloud Storage operates on a base concept called Storage Virtualization. Storage Virtualization system provides a logical data store that maps over the physical storage system through a mapping table. <br />Storage Virtualization in general achieves the following<br />1. Location independence – Abstracts the physical location and thus enables data movement across different physical locations.<br />2. Replication – Enables replication of the storage data across multiple locations<br />3. Data migration – Enables movement of storage data to a faster / better infrastructure if needed.<br />4. Dynamic scaling - Enables to scale the capacity of the storage space when needed<br /><br /><strong>Storage Services in Amazon and Azure</strong><br /> Amazon, Azure the top public cloud computing service providers provides services in Storage segment. Both of them provide similar type of services in storage segment. These storage services can be accessed by a REST based API or web service API calls.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiDi0SMgdU_cnXLPM4uWwxqayxTlGEbaSHUwbp2sOcghd4_IDpM0TSMrT1vW5IMZFe79yvgAz4tJR4kssP9Xewp8j2pvRcMutkWsrQkO6tA0oM92PKFBVU-wgimCg4WndpEFmcCVdkFEzl/s1600/storage1.JPG"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 95px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiDi0SMgdU_cnXLPM4uWwxqayxTlGEbaSHUwbp2sOcghd4_IDpM0TSMrT1vW5IMZFe79yvgAz4tJR4kssP9Xewp8j2pvRcMutkWsrQkO6tA0oM92PKFBVU-wgimCg4WndpEFmcCVdkFEzl/s320/storage1.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5537090070248438098" /></a><br /> <br />Let us try to compare the cost of these storage services by these vendors. Generally the cost of these services will vary based on geographic location and also will be revised (generally reduced), the costing I am mentioning is as of today. 
<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxghmn3Y_z0D4z3MVaigz72EpTYWuq3X2LMuEuv9WWHI4PvDcBD6FXtfZEAPg51BB_5E1LXR5qWjjGdz2ClLbFLJiABADEiOAmREW-eb513jvhwATM0zi-gU2dzaxG9fwKaWmc61fb8xGq/s1600/storage2.JPG"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 77px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxghmn3Y_z0D4z3MVaigz72EpTYWuq3X2LMuEuv9WWHI4PvDcBD6FXtfZEAPg51BB_5E1LXR5qWjjGdz2ClLbFLJiABADEiOAmREW-eb513jvhwATM0zi-gU2dzaxG9fwKaWmc61fb8xGq/s320/storage2.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5537090300633625266" /></a><br />(Please note that billing fees are subject to change.)<br /><br />Please refer to the following links for the detailed pricing<br /><a href="http://aws.amazon.com/s3/#pricing">http://aws.amazon.com/s3/#pricing</a><br /><a href="http://www.microsoft.com/windowsazure/pricing/">http://www.microsoft.com/windowsazure/pricing/</a><br /><br /><strong>Security options</strong> <br />Security in Data Transition:<br /> Security in data transition can be achieved by means of secured http channel.<br /> Security in Data Source:<br /> Highly sensitive data that needs to be secured at the source can be achieved by means of data encryptions.<br /> Security in Access:<br /> Cloud providers are coming up with Authentication, Authorization mechanism by which access to these resources can be secured.<br /> Security in Virtualization:<br /> Virtual Servers in the same physical servers are properly secured by means of virtual firewall by the cloud providers and hence data is kept secured between virtual servers on same physical server.<br /> <br /><strong>Best Practices</strong><br />1. Choose the Cloud Storage Data centre location closer to the end user<br />2. Segregate the data into different buckets(Amazon) or Containers(Azure) so that different level of security access can be achieved<br />3. 
Partition the data properly to achieve higher throughput and efficiency.<br /><br /><strong>CDN Integration</strong><br /> Both Amazon and Azure provides Content Delivery Network (CDN) that can be integrated with their storage services to provide closer delivery of data to the clients with higher performance and better reliability.<br /><br /><strong>Tools</strong> <br />There are few cloud storage explorer management tools that are available that facilitates a user to view the data on cloud storage<br />Cloudberry Explorer - <a href="http://cloudberrylab.com/">http://cloudberrylab.com/</a><br />Explorer Tools: S3Fox, BucketExplorer, awszone.com<br />Azure Storage Explorer - <a href="http://www.cerebrata.com/Blog/file.axd?file=2009%2F10%2Fcomparing_azure_storage_management_tools.pdf">http://www.cerebrata.com/Blog/file.axd?file=2009%2F10%2Fcomparing_azure_storage_management_tools.pdf</a><br />Azure Storage Manager - <a href="http://azurestoragemanager.codeplex.com/">http://azurestoragemanager.codeplex.com/</a><br /><br /><strong>Other Cloud Storage Providers in the market:</strong><br />Nirvanix - <a href="http://www.nirvanix.com/">http://www.nirvanix.com/</a><br />EMS Automos - <a href="http://www.atmosonline.com/">http://www.atmosonline.com/</a>Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com1tag:blogger.com,1999:blog-9141828500783422104.post-52271100946588899002010-10-19T01:16:00.000-07:002010-10-19T01:19:09.894-07:00A2A ‘Cloud Comparison’ – Compute Service<strong>A2A ‘Cloud Comparison’ – Compute Service</strong><br /><br />As many of us know Amazon and Azure are among the major providers in the public cloud service space. This will be series of blogs depicting my views on comparing Amazon to Azure (A 2 A) on public cloud services on various dimensions of their services like Compute, Storage, Bandwidth, Pricing, Security, DB Services, and CDN etc. Please tag to this space to follow closely on the series.<br />In this current blog I have taken compute service offering from both of these providers and provided their features as per my knowledge.<br /><br /><strong>Amazon EC2 Compute Instances: </strong><br />Amazon provides services in Infrastructure services space where in compute instances it provides compute services in terms of virtual servers, the compute instances so called EC2 (Elastic Compute) instances provide different flavours in terms of hardware configuration and software configuration, some of the flavours in hardware are Micro, Small, Large, XLarge, High CPU, High Memory etc., you can find more details of it here at <a href="http://aws.amazon.com/ec2/instance-types/">http://aws.amazon.com/ec2/instance-types/</a> , the costing of the instances varies based on the flavour. Each instance flavour can differ in terms of hardware configuration and software configuration. Amazon as a provider provides instance for some predefined software like Windows Server 2003, 2008, SQL Server editions, RHEL etc. 
In addition amazon has partnership with major vendors like IBM, Oracle and provides pre built alliances, for example you can have a prebuilt appliance with oracle 11g with different hardware configuration provided by Oracle, and similarly for IBM you have appliances provided by them.<br />Details of the partnership EC2 instances for IBM and Oracle are available under<br /><a href="http://aws.amazon.com/solutions/global-solution-providers/oracle/">http://aws.amazon.com/solutions/global-solution-providers/oracle/</a><br /><a href="http://aws.amazon.com/solutions/global-solution-providers/ibm/">http://aws.amazon.com/solutions/global-solution-providers/ibm/</a><br /><br />Some of the benefits you can find with Amazon Compute instances are<br />1. Prebuilt appliance and save your time and avoid expertise from setting up with proper environment<br />2. Some of the instances through partnership comes as Pay as you use model and hence avoids licensing and costing issues, suppose you want to test or do some POC with IBM Web sphere Portal Server for a week or even a day you can very well find the instance and use it with amazon ec2 instance in no matter of time.<br />3. Many software vendors started providing their products through Amazon instances with the correct environment set, this way it becomes easy for customers to try out any software of their interest with less turnaround time.<br />4. Start and terminate instances when ever needed and pay only for used time.<br />5. Set firewall and other security for the instances as you need.<br />6. Ability to monitor the health status of the instances. <br />7. Easy to migrate existing applications with same flavour on the cloud platform.<br /><br /><strong>Azure Compute instances:</strong><br />Microsoft Azure as we know operates in Platform Services layer, in the sense user won’t be exposed to the server directly, but when it hosts the application it provides a virtual server for running the application. Similar to Amazon Azure also provides some option on the virtual server configuration like Small, Medium, Large, Extra Large etc. Details of the instance can be found at <a href="http://www.microsoft.com/windowsazure/windowsazure/default.aspx">http://www.microsoft.com/windowsazure/windowsazure/default.aspx</a> With respect to operational model Azure provides compute instances in two different flavours as web role and worker role. Web role instances are used when the applications needs front end handlers handled by IIS web server and worker roles are used when the application needs a back end handling process ex: a batch job application or a windows service application.<br /><br />Benefits of Azure Compute Instances:<br />1. Instances are self health monitored by Azure Fabric <br />2. Auto scaling can be enabled on the instances.<br />3. Control security policy over the instance.<br />4. Easy to build and migrate applications based on IIS7 and ASP.Net<br />5. Development fabric on Windows Azure SDK provides a simulated environment for service deployments and role instances on local machine.<br /><br />To make a comparison study on Amazon and Azure with respect to compute instances<br />1. Amazon Compute instances are at infrastructure level and hence have more control over the instances, while for Azure Compute the control is limited as it provides platform services. 
In Azure, some of the overhead such as application monitoring and high availability is taken care of by the Azure Fabric, whereas in Amazon we have to integrate a few Amazon services to achieve this ourselves.<br />2. Azure allows deploying only one role per compute instance, whereas in Amazon you can deploy multiple applications/services as we do with normal servers. For example, if you want to deploy an ASP.NET-based application and a WCF service at the back end, you may need two compute instances in Azure, whereas on an Amazon EC2 instance you can deploy them in the same virtual machine.<br />3. Azure instances are self-monitored and controlled by the Fabric, whereas for Amazon EC2 instances we add a service called CloudWatch to monitor specific instances.<br />4. Azure instances only allow running applications on a Windows environment, whereas in Amazon we can run applications on both Windows and Linux environments.<br />5. Applications with third-party dependencies or commercial off-the-shelf products are less suitable for migration to Azure, as those dependencies need to be available on the Azure platform and licensing of the products needs to be worked out.<br />Microsoft is planning to release virtual servers (VM Role) in the Infrastructure as a Service space, similar to Amazon, in the near future <a href="http://blogs.msdn.com/b/usisvde/archive/2010/03/29/vm-support-in-windows-azure.aspx">http://blogs.msdn.com/b/usisvde/archive/2010/03/29/vm-support-in-windows-azure.aspx</a>; with the VM role the Azure platform will gain more power and more benefits for migrating Microsoft-based applications to the cloud.Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-59011634079891593642010-08-15T02:57:00.000-07:002010-08-15T03:02:34.701-07:00Cloud Hosting versus Web HostingI have talked about cloud services in terms of infrastructure services, platform services and software services. When detailing infrastructure services, many people have this doubt in mind: how do cloud infrastructure services (e.g. Amazon) differ from normal web hosting providers? If I use virtual servers from a web hosting provider, how is cloud hosting different? I thought I would share my thoughts on that.<br /><br />Let us take some time to talk about web hosting providers. There are different types of providers: with a dedicated hosting provider we can rent physical servers not shared with anybody and have full control over the server, including server administration; with a shared web hosting provider we share the server space with others to be cost efficient but have less control over the server, such as its administration; we can also procure and use virtual servers.<br /><br />So what is the big deal with cloud infrastructure providers like Amazon? They also provide servers in the form of virtual servers, so what is the big difference, and what makes Amazon a cloud provider?<br /><br />When we say cloud provider, the main difference to observe is the utility model of computing in all dimensions of usage: chargeback happens based on how much we utilize cloud elements like compute units, bandwidth, power usage, storage, etc. Also, the turnaround time to set up infrastructure with Amazon is much shorter and simpler compared to web hosting providers.<br /><br />For example, when we rent a server with a hosting provider we have to commit to their space, say in terms of months or years.
It is not easy to dynamically expand and reduce the space depending on our needs; we need a minimum commitment for a specific duration. We also have to get dedicated internet bandwidth for our application's needs. Similarly, we have to procure storage for our needs, and all of this requires a minimum commitment with the provider. We cannot dynamically scale up and down instantly with general web hosting providers.<br /><br />Taking Amazon as an example of a cloud infrastructure hosting provider, we can see how the cloud elements are used in a utilization model. Amazon provides a simple web interface in the form of a Firefox plugin called 'ElasticFox', with which any user having an Amazon account can securely create and destroy instances, attach and detach disks, and set security settings for the instances. It also exposes an SDK to operate on these elements, so you can operate on them programmatically based on your application's needs. So you basically pay for what you use and scale dynamically for seasonal needs. It also assures high availability.<br /><br />In addition to these, Amazon's infrastructure services provide blob storage, SimpleDB, the Amazon Relational Database Service, simple queue services and notification services. Using all of these services in a utilization model, you can build an architecture that uses this infrastructure effectively to operate your application in an OPEX (Operational Expense) model rather than a CAPEX (Capital Expense) model.<br /><br />Cloud platform services like Microsoft Azure and Google App Engine provide more services built on top of infrastructure services, and cloud software service providers like Salesforce provide services at a higher level than platform services.<br /><br />Getting the power of cloud computing ...Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com5tag:blogger.com,1999:blog-9141828500783422104.post-47263182506927928272010-05-10T09:36:00.000-07:002010-05-10T09:39:45.673-07:00SaaS(Software as a Service) versus CloudMany of us will have this doubt in mind: what is a SaaS (Software as a Service) application, and what is a cloud application? Can I say all SaaS applications are cloud applications, or is the reverse true? How are the two interrelated?<br /> Do all cloud applications provide a SaaS type of service? When I develop an application, say on Windows Azure or Amazon EC2, will I get the application in a SaaS model? The answer is a big no. <br /> When we define cloud computing we list SaaS as one of its services, so what does that mean? Let us try to understand what SaaS is. SaaS is Software as a Service, in which the application is made available as a service: when a new customer wants to use the application, he can simply pay and on-board as a tenant, perform whatever level of customization is offered, and use it with the specified level of data security and isolation he needs. So how is this SaaS type of application related to cloud computing?<br /> A SaaS type of application is comparatively difficult to design and implement because of its extensive functionality. High availability and massive scalability are some of the basic requirements of SaaS applications, and cloud computing techniques help to solve high availability and scalability in a simple way.
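<br /><br />To make the tenancy idea above a little more concrete, here is a minimal, illustrative sketch (my own example, not from the original post) of one common SaaS data-isolation pattern: a shared table in which every row carries a tenant identifier and every query is filtered by the calling tenant. The table and column names are assumptions chosen purely for the illustration.<br /><pre>
import sqlite3

# Shared-schema multi-tenancy: one table, one tenant_id column per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (tenant_id TEXT, order_id TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('tenant_a', 'o-1', 10.0)")
conn.execute("INSERT INTO orders VALUES ('tenant_b', 'o-2', 99.0)")

def orders_for(tenant_id):
    # Data isolation: a tenant only ever sees rows tagged with its own id.
    rows = conn.execute(
        "SELECT order_id, amount FROM orders WHERE tenant_id = ?",
        (tenant_id,),
    )
    return rows.fetchall()

print(orders_for("tenant_a"))  # [('o-1', 10.0)]
</pre>Real SaaS applications layer far more on top of this (per-tenant customization, encryption, separate schemas or databases for stricter isolation), but the principle of scoping every operation to a tenant stays the same.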
<br /> To put it simply, cloud computing enables building SaaS applications easily; SaaS enablement is achieved more readily through cloud computing techniques.Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-84117361909681681032010-03-16T09:16:00.000-07:002010-03-16T09:54:52.720-07:00Cloud computing segmentsCloud computing solutions can be classified into three broad segments: solutions on a private cloud, a public cloud or a hybrid cloud. <br /><br />Enterprise customers who already have datacenters in place will prefer to migrate those datacenters to a private cloud; likewise, companies for which security is the top-most concern, and which accept no excuses until they find a proven record of public cloud usage, will prefer to take the path towards private cloud solutions. <br /><br />Companies that operate on huge amounts of data, manipulate that data on a temporary or permanent basis, and want to share the data across the business will prefer to utilize public cloud storage and computing. For example, media and entertainment companies will prefer to move towards the public cloud, where they can store and share huge volumes of data in a widely distributed public cloud.<br /><br />Companies that have medium-sized data centers and want to extend to the public cloud on demand, and for less critical applications, will prefer to build hybrid solutions. For example, corporates can run a medium-sized data center for highly critical applications and use the public cloud for less critical applications; that way they can manage concerns about security as well as costs.<br /><br />In forthcoming posts I will talk about the players in the private cloud, public cloud, etc.Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-13949213449955944382009-11-14T09:35:00.000-08:002009-11-14T10:08:15.907-08:00Pay per second for mobile calls - inspired from cloud model ???When I was watching television recently I noticed a few ads from various Indian telecom service providers about a pay-per-second billing scheme. Initially a new service provider came up with the pay-per-second billing model, and consequently all the others tended to follow suit in order to keep up with the competition. OK, I understand you are wondering why I am talking about this here, right?<br />I was able to relate this model of billing to the cloud model, so I thought I would put my thoughts down here to give you better clarity on the cloud model.<br />Before this 'pay per second' model, the mobile providers' unit of billing was, say, 30 seconds; so if you made a call and completed it in 1 second you still had to pay for the whole 30-second unit, and the question arose: why do I have to pay for the remaining 29 seconds which I haven't used?<br />Similarly, I can compare this with the cloud model of billing. Earlier, applications hosted on servers had the server's resources reserved whether those resources were utilized or not; the resources might have been utilized effectively only during peak load periods and under-utilized the rest of the time. Now, with the cloud model, you pay only for the resources you have actually utilized, and only for that specific period.<br />Did these service providers get inspired by the cloud model?
:)Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-89880754331335158682009-11-11T08:53:00.000-08:002009-11-12T08:59:53.935-08:00Why Cloud the buzz word now...In my previous post I gave an introduction to cloud computing; now I can explain briefly why 'cloud' is the technology buzzword of today. From an investor's perspective, one of the biggest benefits cloud computing provides is that it reduces Capital Expense (CapEx).<br /><br /> I can explain it more clearly with an example. Suppose you had an innovative idea to develop an e-business application. You feel confident about the application, and your business analysis says the application has to support around 1000 simultaneous users. You have developed the application and now you have to make it ready for those 1000 simultaneous users.<br /><br />What can you do now? Do some performance tuning to make the application ready, do capacity planning for 1000 users, procure hardware to serve that many users, and invest a huge amount to procure that hardware. OK, you have done everything and the application is hosted. What if the application does not take off as you expected, or is used by only 100 users and not the 1000 you predicted? The huge sum you invested in hardware is under-utilized and you are not making the money you predicted.<br /><br />I can offer a simple solution for this: once you have the application ready, you can host it in an environment that takes care of backing up the application, manages failover and handles scaling based on demand. If all of this can be done at a cost of less than 10 Indian rupees (around 0.12 USD) per hour, what do you do? Yes, the answer is the cloud: public cloud providers like Amazon, Microsoft and Google are currently providing cloud environments at the kind of cost I have mentioned. See how much the investment risk has been reduced, and how much the capital expense has been reduced.<br /><br />Also, if the application is a hit and you want to scale it, you do so with a mere change to a configuration file. You pay for what you use in the cloud. What else do you want? You get why the cloud is such a buzzword now.Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0tag:blogger.com,1999:blog-9141828500783422104.post-49549927126981401902009-11-07T22:45:00.000-08:002009-11-07T23:41:49.277-08:00What is cloud computing ?Cloud computing is a methodology by which resources are consumed dynamically, on demand, over the internet, where the resources can be storage, memory and compute cores, and extend to infrastructure, platform and application.<br /><br />Think of the change we had when web technologies migrated from static to dynamic content; I can compare cloud computing to a revolution of that kind, in that it provides the ability to consume resources dynamically.<br /><br />No wonder Gartner has predicted 'Cloud Computing' as the top strategic technology that most organizations will drive towards during the year 2010 <a href="http://www.gartner.com/it/page.jsp?id=1210613">http://www.gartner.com/it/page.jsp?id=1210613</a><br /><div><br /></div><div>A few factors that drive cloud computing:</div><div><br /></div><div>1. Effective utilization of resources</div><div>2. Capacity on demand</div><div>3. 
Pay as you use model</div><div>4. Green IT</div><div>5. Reduced expense on hardware</div>Aravindhttp://www.blogger.com/profile/06215275043383791502noreply@blogger.com0