Big Data

Saturday, 14 October 2017

MongoDB Architecture Introduction

Introduction

MongoDB is an open-source, document-oriented NoSQL database. It uses JSON (JavaScript Object Notation)-like documents, although the data is stored in the database in BSON (binary JSON) form. In MongoDB, the document is the basic unit of storage. MongoDB is often called schemaless, but that label is misleading: documents are grouped into collections, and schema design remains very important. What MongoDB offers is a dynamic schema, so documents in the same collection do not have to share the same fields.

2007 - Company 10gen began developing MongoDB
2009 - Shifted to an open source development model and began commercial support and services
2013 - 10gen changed its name to MongoDB Inc

Some Important Files in MongoDB

journal - Holds the journal files, which work like redo logs and are used for crash recovery
<database>.ns - The namespace file; it stores metadata
<database>.0, <database>.1 - The data files that store the data

Why use MongoDB? Advantages

It makes development easier
It scales out horizontally with ease
Parallelism can be achieved at the server or hardware level
Both structured and unstructured data can be stored

Nexus Architecture

MongoDB’s design philosophy is focused on combining the critical capabilities of relational databases with the innovations of NoSQL technologies.

Expressive query language & secondary Indexes

Users should be able to access and manipulate their data in sophisticated ways to support both operational and analytical applications. Indexes play a critical role in providing efficient access to data, supported natively by the database rather than maintained in application code.

Strong consistency

Applications should be able to immediately read what has been written to the database. It is much more complex to build applications around an eventually consistent model, imposing significant work on the developer, even for the most sophisticated engineering teams.

Enterprise Management and Integrations

Databases are just one piece of application infrastructure, and need to fit seamlessly into the enterprise IT stack. Organizations need a database that can be secured, monitored, automated, and integrated with their existing technology infrastructure, processes, and staff, including operations teams, DBAs, and data analysts.

Flexible Data Model

NoSQL databases emerged to address the requirements for the data we see dominating modern applications. Whether document, graph, key-value, or wide-column, all of them offer a flexible data model, making it easy to store and combine data of any structure and allow dynamic modification of the schema without downtime or performance impact.

Scalability and Performance

NoSQL databases were all built with a focus on scalability, so they all include some form of sharding or partitioning. This allows the database to scale out on commodity hardware deployed on-premises or in the cloud, enabling almost unlimited growth with higher throughput and lower latency than relational databases.
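
The idea of partitioning data across servers can be illustrated with a minimal sketch of hash-based sharding. This is a generic illustration, not MongoDB's own implementation; the function names `shard_for` and `distribute` and the sample keys are assumptions made for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard that would store them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical keys spread over three shards.
placement = distribute([f"user{i}" for i in range(10)], num_shards=3)
for shard, keys in placement.items():
    print(shard, keys)
```

Because the hash is stable, the same key always lands on the same shard, which is what lets a router send each request to a single server instead of asking all of them.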

Always-On Global Deployments

NoSQL databases are designed for highly available systems that provide a consistent, high quality experience for users all over the world. They are designed to run across many nodes, including replication to automatically synchronize data across servers, racks, and data centers.

Tuesday, 20 September 2016

Install Apache, PHP and Configure PHP Mongo Driver on Linux

This article demonstrates how to install Apache and PHP, and then how to configure the PHP Mongo driver. It covers only the PHP Mongo driver configuration; please refer to the link below for installing MongoDB on Linux.

How to install MongoDB on Linux system

So let's begin with Apache.

Installing Apache

[root@pract1 ~]# yum install httpd
Loaded plugins: refresh-packagekit, security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
.....
Total download size: 910 k
Is this ok [y/N]: y
Downloading Packages:
(1/2): httpd-2.2.15-54.0.1.el6_8.x86_64.rpm | 832 kB 00:00
(2/2): httpd-tools-2.2.15-54.0.1.el6_8.x86_64.rpm | 78 kB 00:00
.....
Updated:
httpd.x86_64 0:2.2.15-54.0.1.el6_8

Dependency Updated:
httpd-tools.x86_64 0:2.2.15-54.0.1.el6_8
.....
Complete!

Installing PHP

[root@pract1 ~]# yum install php php-pear php-devel gcc
Loaded plugins: refresh-packagekit, security
mongodb | 951 B 00:00
ol6_UEK_latest | 1.2 kB 00:00
ol6_latest | 1.4 kB 00:00
Setting up Install Process
Resolving Dependencies
--> Running transaction check
.....
Total download size: 33 M
Is this ok [y/N]: y
Downloading Packages:
(1/17): cpp-4.4.7-17.el6.x86_64.rpm | 3.7 MB 00:03
(2/17): gcc-4.4.7-17.el6.x86_64.rpm | 10 MB 00:09
(3/17): gcc-c++-4.4.7-17.el6.x86_64.rpm | 4.7 MB 00:04
(4/17): gcc-gfortran-4.4.7-17.el6.x86_64.rpm | 4.7 MB 00:04
(5/17): libgcc-4.4.7-17.el6.i686.rpm | 114 kB 00:00
(6/17): libgcc-4.4.7-17.el6.x86_64.rpm | 103 kB 00:00
(7/17): libgfortran-4.4.7-17.el6.x86_64.rpm | 267 kB 00:00
(8/17): libgomp-4.4.7-17.el6.x86_64.rpm | 134 kB 00:00
(9/17): libstdc++-4.4.7-17.el6.x86_64.rpm | 295 kB 00:00
(10/17): libstdc++-devel-4.4.7-17.el6.x86_64.rpm | 1.6 MB 00:01
(11/17): openssl-1.0.1e-48.el6_8.1.x86_64.rpm | 1.5 MB 00:01
(12/17): openssl-devel-1.0.1e-48.el6_8.1.x86_64.rpm | 1.2 MB 00:01
(13/17): php-5.3.3-48.el6_8.x86_64.rpm | 1.1 MB 00:01
(14/17): php-cli-5.3.3-48.el6_8.x86_64.rpm | 2.2 MB 00:01
(15/17): php-common-5.3.3-48.el6_8.x86_64.rpm | 529 kB 00:00
(16/17): php-devel-5.3.3-48.el6_8.x86_64.rpm | 512 kB 00:01
(17/17): php-pear-1.9.4-5.el6.noarch.rpm | 393 kB 00:00
.....
Total 1.0 MB/s | 33 MB 00:33
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
......
Installed:
php.x86_64 0:5.3.3-48.el6_8 php-devel.x86_64 0:5.3.3-48.el6_8 php-pear.noarch 1:1.9.4-5.el6

Dependency Installed:
php-cli.x86_64 0:5.3.3-48.el6_8 php-common.x86_64 0:5.3.3-48.el6_8

Updated:
gcc.x86_64 0:4.4.7-17.el6

Dependency Updated:
cpp.x86_64 0:4.4.7-17.el6 gcc-c++.x86_64 0:4.4.7-17.el6 gcc-gfortran.x86_64 0:4.4.7-17.el6 libgcc.i686 0:4.4.7-17.el6 libgcc.x86_64 0:4.4.7-17.el6
libgfortran.x86_64 0:4.4.7-17.el6 libgomp.x86_64 0:4.4.7-17.el6 libstdc++.x86_64 0:4.4.7-17.el6 libstdc++-devel.x86_64 0:4.4.7-17.el6 openssl.x86_64 0:1.0.1e-48.el6_8.1
openssl-devel.x86_64 0:1.0.1e-48.el6_8.1

Complete!

Configure PHP Mongo driver

[root@pract1 ~]# pecl install mongo

WARNING: "pecl/mongo" is deprecated in favor of "channel:///mongodb"

downloading mongo-1.6.14.tgz ...

Starting to download mongo-1.6.14.tgz (210,095 bytes)

.............................................done: 210,095 bytes

118 source files, building

running: phpize

Configuring for:

PHP Api Version: 20090626

Zend Module Api No: 20090626

Zend Extension Api No: 220090626

Build with Cyrus SASL (MongoDB Enterprise Authentication) support? [no] :

building in /var/tmp/pear-build-rootDwggHq/mongo-1.6.14

running: /var/tmp/mongo/configure --with-mongo-sasl=no

.....

Build complete.

Don't forget to run 'make test'.

......

running: make INSTALL_ROOT="/var/tmp/pear-build-rootDwggHq/install-mongo-1.6.14" install

Installing shared extensions: /var/tmp/pear-build-rootDwggHq/install-mongo-1.6.14/usr/lib64/php/modules/

running: find "/var/tmp/pear-build-rootDwggHq/install-mongo-1.6.14" | xargs ls -dils

521531 4 drwxr-xr-x 3 root root 4096 Sep 19 21:33 /var/tmp/pear-build-rootDwggHq/install-mongo-1.6.14

.....

532128 1784 -rwxr-xr-x 1 root root 1824969 Sep 19 21:33 /var/tmp/pear-build-rootDwggHq/install-mongo-1.6.14/usr/lib64/php/modules/mongo.so

Build process completed successfully

Installing '/usr/lib64/php/modules/mongo.so'

install ok: channel://pecl.php.net/mongo-1.6.14

configuration option "php_ini" is not set to php.ini location

You should add "extension=mongo.so" to php.ini

[root@pract1 ~]#

Add MongoDB Extension and Verify

[root@pract1 ~]# vi /etc/php.ini

Press i

extension=mongo.so

Press esc

:wq

[root@pract1 ~]#

Restart the Apache service

[root@pract1 ~]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[root@pract1 ~]#

Now verify that the MongoDB driver has been configured for PHP

[root@pract1 html]# php -m | grep -i mongo
mongo      <-- the output must be "mongo", which confirms that the PHP MongoDB driver has been configured successfully
[root@pract1 html]#

Another way to verify is to create a PHP file with the content below
[root@pract1 ~]# vi /var/www/html/phpinfo.php
Press i
<?php
phpinfo();
?>
Press esc
:wq

Open a web browser and load the phpinfo.php file using the address format below
http://<ipaddress or hostname>:<port>/phpinfo.php
Ex:- http://192.168.56.101/phpinfo.php
The page shown in the browser should include a mongo module section.

Saturday, 17 September 2016

Installing MongoDB on Linux

This article illustrates the step-by-step procedure for installing MongoDB on Linux. MongoDB is one of the most popular document-based NoSQL databases.

Add MongoDB Repository

Let's log in to the server as the root user and add the MongoDB repository to our system. Go to the repository location

[root@pract1 ~]# cd /etc/yum.repos.d

Create the MongoDB repository file using the vi editor
[root@pract1 yum.repos.d]# vi mongodb.repo
Press i to switch to insert mode, then enter the following

[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1

Press the ESC key, then type
:wq
[root@pract1 yum.repos.d]#
Note:- :wq or :x saves the file and quits the vi editor
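
Before running yum, the repository file can be sanity-checked. The sketch below parses the same stanza with Python's standard `configparser` module; the helper name `check_repo` and the list of required keys are assumptions chosen for illustration, not rules enforced by yum.

```python
import configparser

# The repository stanza from the step above, inlined for the example.
REPO_TEXT = """\
[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1
"""


def check_repo(text, section="mongodb"):
    """Return a list of problems found in a .repo stanza (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        return [f"missing section [{section}]"]
    problems = []
    for key in ("name", "baseurl", "enabled"):  # assumed minimum set of keys
        if not parser.has_option(section, key):
            problems.append(f"missing key: {key}")
    if parser.get(section, "enabled", fallback="0") != "1":
        problems.append("repository is not enabled")
    return problems


print(check_repo(REPO_TEXT) or "repo stanza looks complete")
```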

Begin MongoDB Installation

Now install MongoDB by running the command below

[root@pract1 yum.repos.d]# yum install mongodb-org
Loaded plugins: refresh-packagekit, security
mongodb | 951 B 00:00
mongodb/primary | 45 kB 00:00
mongodb 279/279
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package mongodb-org.x86_64 0:2.6.12-1 will be installed
--> Processing Dependency: mongodb-org-shell = 2.6.12 for package: mongodb-org-2.6.12-1.x86_64
--> Processing Dependency: mongodb-org-server = 2.6.12 for package: mongodb-org-2.6.12-1.x86_64
--> Processing Dependency: mongodb-org-tools = 2.6.12 for package: mongodb-org-2.6.12-1.x86_64
--> Processing Dependency: mongodb-org-mongos = 2.6.12 for package: mongodb-org-2.6.12-1.x86_64
--> Running transaction check
---> Package mongodb-org-mongos.x86_64 0:2.6.12-1 will be installed
---> Package mongodb-org-server.x86_64 0:2.6.12-1 will be installed
---> Package mongodb-org-shell.x86_64 0:2.6.12-1 will be installed
---> Package mongodb-org-tools.x86_64 0:2.6.12-1 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

====================================================================
Package Arch Version Repository Size
====================================================================
Installing:
mongodb-org x86_64 2.6.12-1 mongodb 4.6 k
Installing for dependencies:
mongodb-org-mongos x86_64 2.6.12-1 mongodb 6.9 M
mongodb-org-server x86_64 2.6.12-1 mongodb 9.1 M
mongodb-org-shell x86_64 2.6.12-1 mongodb 4.3 M
mongodb-org-tools x86_64 2.6.12-1 mongodb 90 M

Transaction Summary
====================================================================
Install 5 Package(s)

Total download size: 110 M
Installed size: 279 M
Is this ok [y/N]: y
Downloading Packages:
(1/5): mongodb-org-2.6.12-1.x86_64.rpm | 4.6 kB 00:00
(2/5): mongodb-org-mongos-2.6.12-1.x86_64.rpm | 6.9 MB 00:07
(3/5): mongodb-org-server-2.6.12-1.x86_64.rpm | 9.1 MB 00:16
(4/5): mongodb-org-shell-2.6.12-1.x86_64.rpm | 4.3 MB 00:05
(5/5): mongodb-org-tools-2.6.12-1.x86_64.rpm | 90 MB 01:24
------------------------------------------------------------------------------
Total 968 kB/s | 110 MB 01:56
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : mongodb-org-server-2.6.12-1.x86_64 1/5
Installing : mongodb-org-mongos-2.6.12-1.x86_64 2/5
Installing : mongodb-org-tools-2.6.12-1.x86_64 3/5
Installing : mongodb-org-shell-2.6.12-1.x86_64 4/5
Installing : mongodb-org-2.6.12-1.x86_64 5/5
Verifying : mongodb-org-shell-2.6.12-1.x86_64 1/5
Verifying : mongodb-org-tools-2.6.12-1.x86_64 2/5
Verifying : mongodb-org-mongos-2.6.12-1.x86_64 3/5
Verifying : mongodb-org-server-2.6.12-1.x86_64 4/5
Verifying : mongodb-org-2.6.12-1.x86_64 5/5

Installed:
mongodb-org.x86_64 0:2.6.12-1

Dependency Installed:
mongodb-org-mongos.x86_64 0:2.6.12-1 mongodb-org-server.x86_64 0:2.6.12-1 mongodb-org-shell.x86_64 0:2.6.12-1 mongodb-org-tools.x86_64 0:2.6.12-1

Complete!
This completes the installation

Verify the MongoDB installation

[root@pract1 yum.repos.d]# rpm -ql mongodb-org-server
/etc/init.d/mongod
/etc/mongod.conf
/etc/sysconfig/mongod
/usr/bin/mongod
/usr/share/man/man1/mongod.1
/var/lib/mongo
/var/log/mongodb
/var/log/mongodb/mongod.log
/var/run/mongodb

Start the MongoDB service

Let's check the status of the mongod service. If it is stopped, start it with the command shown below
[root@pract1 yum.repos.d]# service mongod status
mongod is stopped
[root@pract1 yum.repos.d]# service mongod start
Starting mongod: [ OK ]
[root@pract1 yum.repos.d]#

Also check the log for any reported errors. In the log below, everything looks fine
[root@pract1 yum.repos.d]# cat /var/log/mongodb/mongod.log
2016-09-16T18:21:09.609+0530 ***** SERVER RESTARTED *****
2016-09-16T18:21:09.612+0530 [initandlisten] MongoDB starting : pid=2791 port=27017 dbpath=/var/lib/mongo 64-bit host=pract1.localdomain
2016-09-16T18:21:09.612+0530 [initandlisten] db version v2.6.12
2016-09-16T18:21:09.612+0530 [initandlisten] git version: d73c92b1c85703828b55c2916a5dd4ad46535f6a
2016-09-16T18:21:09.612+0530 [initandlisten] build info: Linux build5.ny.cbi.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2016-09-16T18:21:09.612+0530 [initandlisten] allocator: tcmalloc
2016-09-16T18:21:09.612+0530 [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1" }, processManagement: { fork: true, pidFilePath: "/var/run/mongodb/mongod.pid" }, storage: { dbPath: "/var/lib/mongo" }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2016-09-16T18:21:09.637+0530 [initandlisten] journal dir=/var/lib/mongo/journal
2016-09-16T18:21:09.637+0530 [initandlisten] recover : no journal files present, no recovery needed
2016-09-16T18:21:09.785+0530 [initandlisten] allocating new ns file /var/lib/mongo/local.ns, filling with zeroes...
2016-09-16T18:21:10.037+0530 [FileAllocator] allocating new datafile /var/lib/mongo/local.0, filling with zeroes...
2016-09-16T18:21:10.037+0530 [FileAllocator] creating directory /var/lib/mongo/_tmp
2016-09-16T18:21:10.082+0530 [FileAllocator] done allocating datafile /var/lib/mongo/local.0, size: 64MB, took 0.043 secs
2016-09-16T18:21:10.087+0530 [initandlisten] build index on: local.startup_log properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.startup_log" }
2016-09-16T18:21:10.087+0530 [initandlisten] added index to empty collection
2016-09-16T18:21:10.087+0530 [initandlisten] command local.$cmd command: create { create: "startup_log", size: 10485760, capped: true } ntoreturn:1 keyUpdates:0 numYields:0 reslen:37 301ms
2016-09-16T18:21:10.092+0530 [initandlisten] waiting for connections on port 27017
2016-09-16T18:22:09.793+0530 [clientcursormon] mem (MB) res:30 virt:456
2016-09-16T18:22:09.793+0530 [clientcursormon] mapped (incl journal view):160
2016-09-16T18:22:09.793+0530 [clientcursormon] connections:0
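
Reading the log by eye works for a fresh install, but the same check can be scripted. The sketch below looks for the "waiting for connections" line and flags lines containing a few suspicious keywords. The two sample lines are copied from the log above; the keyword list and the function names are assumptions, not an official list of mongod error markers.

```python
import re

# Two lines copied from the log above; a real script would read the log file.
LOG_LINES = [
    "2016-09-16T18:21:09.612+0530 [initandlisten] db version v2.6.12",
    "2016-09-16T18:21:10.092+0530 [initandlisten] waiting for connections on port 27017",
]

LISTEN_RE = re.compile(r"waiting for connections on port (\d+)")


def listening_port(lines):
    """Return the port mongod reports listening on, or None if not found."""
    for line in lines:
        match = LISTEN_RE.search(line)
        if match:
            return int(match.group(1))
    return None


def suspicious_lines(lines, keywords=("error", "exception", "assert")):
    """Return the lines that contain any of the (assumed) warning keywords."""
    return [line for line in lines if any(k in line.lower() for k in keywords)]


print("listening on port:", listening_port(LOG_LINES))
print("suspicious lines:", suspicious_lines(LOG_LINES))
```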

Perform Basic Tasks on MongoDB

In MongoDB we use the mongo shell to connect to the database. Start the mongo shell as shown below

[root@pract1 yum.repos.d]# mongo
MongoDB shell version: 2.6.12
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
http://docs.mongodb.org/
Questions? Try the support group
http://groups.google.com/group/mongodb-user
>
********** By default mongo connects to the test database. So let's switch to a new database called mydb (it is created once data is stored in it)
> use mydb
switched to db mydb
>
********** Now create a collection in mydb
> db.createCollection("mycollection");
{ "ok" : 1 }
********** Get the list of collections in the database
> show collections
mycollection
system.indexes
********** Now insert a document into the collection
> db.mycollection.insert({"Name":"Manjunath","City":"Bangalore"});
WriteResult({ "nInserted" : 1 })
*********** Let's check the documents stored in the collection
> db.mycollection.find();
{ "_id" : ObjectId("57dd17915aaf4c5e06833092"), "Name" : "Manjunath", "City" : "Bangalore" }
********** The command below shows the list of databases
> show dbs
admin (empty)
local 0.078GB
mydb 0.078GB
test 0.078GB
>
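
To make the insert and find steps above more concrete, here is a toy in-memory stand-in for a collection, written in Python. It is only a sketch of the behaviour seen in the shell session (an `_id` is added automatically and find matches on field equality); the class `ToyCollection` is hypothetical and is not the MongoDB driver API, and the second document is invented sample data.

```python
import uuid


class ToyCollection:
    """A tiny in-memory imitation of a document collection."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        """Store a copy of the document, adding an _id if it has none."""
        stored = dict(doc)
        stored.setdefault("_id", uuid.uuid4().hex)
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Return documents whose fields equal every key/value in query."""
        query = query or {}
        return [
            doc for doc in self._docs
            if all(k in doc and doc[k] == v for k, v in query.items())
        ]


mycollection = ToyCollection()
mycollection.insert({"Name": "Manjunath", "City": "Bangalore"})
# Documents in one collection do not need the same fields (invented example).
mycollection.insert({"Name": "SampleUser", "Languages": ["Kannada", "English"]})

print(mycollection.find({"City": "Bangalore"}))
```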

Sunday, 4 September 2016

Big Data and Hadoop Introduction

What is Big Data? Is it just a buzzword?

When a volume of data cannot be handled by a single server or machine, it is called big data. It is a collection of large data sets that cannot be processed using traditional computing techniques. Gartner defines big data as follows (the 3Vs definition):

"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, a new V, "Veracity", is added by some organizations to describe it.

Volume:- Enterprise data grows exponentially, and preserving such large data sets is a big challenge. Data sets can grow from terabytes to petabytes and from petabytes to exabytes. This huge amount of data is what Volume refers to in Big Data.
Velocity:- A large amount of data is generated every day. This rapid growth poses challenges for processing, because large data sets still have to be processed, and query results returned, as quickly as possible.
Variety:- Many types of data are being generated. Consider social media, where different kinds of data such as documents, audio, videos and photos are produced. Handling these different kinds of data is what Variety refers to in Big Data.
Veracity:- It is the quality of the data that has been gathered, which affects how accurate the analysis can be.

How it all began?

Google published a paper in 2004 on a process called MapReduce. The MapReduce concept provides a parallel processing model that can process huge amounts of data. MapReduce splits a query, distributes the pieces across parallel nodes, and processes them in parallel (the Map step). The processed results are then gathered and delivered (the Reduce step). An implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.
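
The Map and Reduce steps can be sketched on a single machine with the classic word-count example. This is a simplified illustration in plain Python, not Hadoop code; the function names and the intermediate `shuffle` step that groups values by key are assumptions made for the sketch.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    pairs = []
    # On a cluster each chunk would be mapped on a different node in parallel;
    # here the chunks are simply processed one after another.
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


print(word_count(["big data", "Big cluster"]))
```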

What is Hadoop?

Apache Hadoop is an open-source software framework for distributed storage and distributed processing. It runs on clusters of computers, mostly commodity hardware, to work on very large data sets. Apache Hadoop includes a distributed file system known as HDFS. HDFS splits the input and stores the data on different nodes in the cluster, which lets the data be processed in parallel and makes the system very fast and efficient.
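
The way a file is split and spread over nodes can be sketched as follows. This is a deliberately simplified model: the tiny block size, the round-robin placement, and the node names are assumptions for illustration, while real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the input into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
layout = place_blocks(len(blocks), nodes=["node1", "node2", "node3"])
for index, holders in layout.items():
    print(index, blocks[index], holders)
```

Keeping more than one copy of each block is what allows processing to continue from another node when one node is lost.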

Core Modules of Hadoop

Apache Hadoop framework is composed of the following modules:

Hadoop Common:- The Java libraries and utilities needed by the other Hadoop modules
Hadoop Distributed File System (HDFS):- A distributed file system that stores data on commodity hardware, providing very high aggregate bandwidth across the cluster
Hadoop YARN:- YARN (Yet Another Resource Negotiator) is a resource management platform responsible for managing cluster resources in a Hadoop cluster
Hadoop MapReduce:- The framework that understands and assigns work to the nodes in a cluster. MapReduce programs are used for large-scale data processing

Advantages of Hadoop

Scalability:- New nodes can be added as needed, without needing to change data formats
Cost effective:- Hadoop brings massively parallel computing to commodity servers
Flexible:- Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources
Fault tolerant:- When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat

Monday, 29 August 2016

What is NoSQL? Categories of NoSQL Databases

Introduction

NoSQL stands for "not only SQL". It refers to non-relational database technology, although some NoSQL databases support the SQL language as well. NoSQL databases are widely used for real-time web applications at companies such as Google, Amazon and Facebook.

Why use NoSQL?

We use NoSQL to gain advantages such as
Simpler design
Easy horizontal scaling of machines
Better control over availability
Cost effective as commodity hardware is used
Better performance than relational database management systems for some workloads

In order to gain these advantages, you often have to compromise on consistency

Understanding CAP Theorem

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)

We can run NoSQL databases on a single server or on multiple commodity servers. They employ a distributed architecture with salient features such as

Commodity servers are used in many NoSQL databases
Commodity servers are put together to run as a single system
Provides redundant storage
Provides geographic distribution
Avoids a single point of failure, i.e. an outage on a single system will not bring the whole system down

Categories of NoSQL

Key value store
Columnar
Document Store
Graph Database

Relational Database:- It is a database model where data is organised in rows and columns, with a unique key identifying each row or tuple. Some of the popular relational databases are Oracle, SQL Server, etc.

Key-Value Store:- The fundamental data model of a key-value store is the associative array (map or dictionary), where data is represented as a collection of key-value pairs. This model can be extended to a discretely ordered model that maintains keys in lexicographic order. The extension is computationally powerful and can efficiently retrieve selective key ranges. Some of the popular databases in this category include Memcached, Redis, etc.
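
The ordered extension of the key-value model can be sketched in a few lines of Python. The class `OrderedKV` is a hypothetical illustration, not the API of Memcached or Redis: it keeps the keys in a sorted list so that a range of keys can be located with binary search instead of scanning every key.

```python
import bisect


class OrderedKV:
    """A key-value store that keeps its keys in lexicographic order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while keeping order
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = OrderedKV()
store.put("user:2", "b")
store.put("user:1", "a")
store.put("order:9", "z")

# ";" is the character right after ":", so this selects every "user:" key.
print(store.range("user:", "user;"))
```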

Column-Oriented Database:- These databases work by creating collections of one or more key/value pairs that match a record. They don't need a pre-structured table to work with data. Records come in the form of one or more columns holding information, and the columns of each record can be different.

Document Store:- These are databases where data is stored in the form of documents, usually in formats such as JSON or BSON. Each document possesses a unique key that identifies it in the database. There are various ways to organize these documents, such as collections, tags, etc.

Graph Database:- These databases are designed to store related data that can be represented in the form of a graph. Consider social networking: person x is married to person y, person x is a cousin of person z, and person z is also a friend of person x. Other examples of graph data are public transport links, road maps and network topologies.
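
The social networking example can be modelled as a small labelled graph. The sketch below is plain Python rather than a real graph database; the class `Graph`, the relation labels, and the choice to treat every relationship as two-way are assumptions made for illustration.

```python
from collections import defaultdict


class Graph:
    """Nodes connected by labelled, two-way relationships."""

    def __init__(self):
        self._edges = defaultdict(list)  # node -> list of (relation, other node)

    def relate(self, a, relation, b):
        """Record the relationship in both directions."""
        self._edges[a].append((relation, b))
        self._edges[b].append((relation, a))

    def relations_of(self, node):
        """Return the sorted (relation, other node) pairs for a node."""
        return sorted(self._edges.get(node, []))

    def neighbours(self, node):
        """Return the set of nodes directly connected to a node."""
        return {other for _, other in self._edges.get(node, [])}


people = Graph()
people.relate("x", "married_to", "y")
people.relate("x", "cousin_of", "z")
people.relate("z", "friend_of", "x")

print(people.relations_of("x"))
print(people.neighbours("z"))
```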