Sunday, 4 September 2016

Big Data and Hadoop Introduction

What is Big Data? Is it just a buzzword?

When the volume of data cannot be handled by a single server or machine, it is called big data. It is the collection of large data sets that cannot be processed using traditional computing techniques. Gartner defines big data as follows (the 3Vs definition):
"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, some organizations add a fourth V, "Veracity", to describe it.

  1. Volume:- Enterprise data grows exponentially, and preserving such large data sets is a big challenge. Data sets can grow from terabytes to petabytes and from petabytes to exabytes. This huge amount of data is what Volume refers to in Big Data.
  2. Velocity:- A large amount of data is generated every day. This rapid growth poses challenges during processing: the system has to process the data and return query results as quickly as possible.
  3. Variety:- Many different types of data are being generated; consider social media, where documents, audio, videos, photos and so on are all produced. Handling these various kinds of data is what Variety refers to in Big Data.
  4. Veracity:- It is the quality of the data that has been gathered, which affects the ability to produce an accurate analysis.

How did it all begin?

In 2004, Google published a paper on a process called MapReduce. The MapReduce concept provides a parallel processing model that can process huge amounts of data: the job is split and distributed across parallel nodes, where the pieces are processed in parallel (the Map step); the processed results are then gathered and delivered (the Reduce step). An implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.
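
To make the Map and Reduce steps concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API. The class names WordCountMapper and WordCountReducer are just illustrative choices for this example, not part of any library.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit (word, 1) for every word in a line of input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // one record per word occurrence
        }
    }
}

// Reduce step: the framework groups the records by word; sum the counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // (word, total count)
    }
}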

What is Hadoop?

Apache Hadoop is an open-source software framework for distributed storage and distributed processing. It is built on clusters of computers, mostly commodity hardware, to work on very large data sets. Apache Hadoop includes a distributed file system known as HDFS. HDFS splits the input and stores the data across the different nodes in the cluster, which lets the data be processed in parallel. This parallel processing makes the system very fast and efficient.
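
As a rough illustration of how a client talks to HDFS, the sketch below writes a small file into the distributed file system through the FileSystem API. The NameNode address and the file path are assumptions made for the example, not values from any particular cluster.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this normally comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");   // hypothetical path

        // HDFS splits the file into blocks and replicates them across DataNodes;
        // the client simply writes a stream of bytes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Wrote " + fs.getFileStatus(path).getLen() + " bytes to " + path);
    }
}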

Core Modules of Hadoop

Apache Hadoop framework is composed of the following modules:

  1. Hadoop Common:- The Java libraries and utilities needed by the other Hadoop modules
  2. Hadoop Distributed File System (HDFS):- A distributed file system that stores data on commodity hardware, providing very high aggregate bandwidth across the cluster
  3. Hadoop YARN:- YARN (Yet Another Resource Negotiator) is a resource management platform responsible for managing the resources of a Hadoop cluster
  4. Hadoop MapReduce:- The framework that understands and assigns work to the nodes in the cluster; MapReduce programs are used for large-scale data processing (a driver sketch follows this list)
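
To show how these modules fit together, here is a hedged sketch of a driver that configures a MapReduce job and submits it to the cluster, where YARN allocates the resources and HDFS supplies the input and output. It reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // Map step
        job.setReducerClass(WordCountReducer.class);   // Reduce step
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live in HDFS; the paths are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit to the cluster (YARN schedules the map and reduce tasks) and wait.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a job would typically be launched with something like "hadoop jar wordcount.jar WordCountDriver /input /output", with both paths living in HDFS.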

Advantage of Hadoop


  1. Scalability:- New nodes can be added as needed, without needing to change data formats
  2. Cost effective:- Hadoop brings massively parallel computing to commodity servers
  3. Flexible:- Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources
  4. Fault tolerant:- When you lose a node, the system redirects work to another replica of the data and continues processing without missing a beat (a small replication sketch follows this list)
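
Fault tolerance rests largely on HDFS block replication. As a minimal sketch (the path and replication factor are assumptions, and the file is taken to be the one written in the earlier HDFS example), the snippet below asks HDFS to keep three copies of a file, so losing a node does not lose the data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/hello.txt");   // hypothetical existing file

        // Ask HDFS to keep 3 replicas of each block of this file; if a DataNode fails,
        // the NameNode re-replicates the blocks from the surviving copies.
        fs.setReplication(path, (short) 3);

        short actual = fs.getFileStatus(path).getReplication();
        System.out.println("Replication factor for " + path + " is " + actual);
    }
}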
