Today, we’re surrounded by data:
People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth.
Every day, we create 2.5 quintillion bytes of data, so much that more than 90% of the data in the world today has been created in the last two years.
• Walmart handles more than 1 million customer transactions every hour.
• Facebook handles more than 40 billion photos from its user base.
Challenges for Cutting-Edge Businesses:
The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft.
They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people.
Existing tools were becoming inadequate to process such large data sets.
What is Big Data:
How big is “Big Data”? Is 30 or 40 terabytes Big Data?
Big Data refers to datasets that grow so large that they become awkward to work with using on-hand database management tools. Today that means terabytes, petabytes, and exabytes. Tomorrow?
According to Wikipedia, “Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time”.
So big that a single dataset may contain anywhere from a few terabytes to many petabytes of data.
Key Characteristics of Big Data:
Volume: This characteristic describes the size of the data relative to the processing capability.
Think of terabytes of data arriving within a few minutes. Overcoming the volume issue requires technologies that store vast amounts of data in a scalable fashion and provide distributed approaches to querying that data.
Velocity: This characteristic describes the frequency at which data is generated, captured, and shared.
Even at 140 characters per tweet, the high frequency of Twitter data ensures large volumes (over 8 TB per day).
Real-time offers in a world of engagement require fast matching and immediate feedback loops, so that promotions align with geolocation data, customer purchase history, and current sentiment.
Variety: Non-traditional data formats exhibit a dizzying rate of change. A proliferation of data types from social, machine, and mobile sources no longer fits into neat, easy-to-consume structures.
Given the exponential growth and huge size of data described above, along with the other Big Data characteristics, we need:
- To store the data in a scalable file system. Hadoop provides a distributed file system for this, referred to as HDFS
- To process the data in parallel, which is what Hadoop MapReduce does
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
Hadoop architecture divided into MapReduce & HDFS layer:
MapReduce is a model in which a job is divided into mappers, which compute in parallel, and reducers, which reduce the output of all map tasks to the final output. It improves performance when there is a huge amount of data to process, because it splits the job into small tasks and executes them in parallel. In Hadoop Gen1, MapReduce has a JobTracker, which acts as the master, and TaskTrackers, which are the workers that actually perform the tasks.
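To make the map and reduce steps concrete, here is a minimal word-count sketch in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in one process, whereas Hadoop would run the map and reduce tasks on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line stands in for one input split handled by one mapper.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data is big", "hadoop processes big data"]
    print(word_count(splits))
```

Each call to map_phase depends only on its own line, which is why the mappers can run in parallel; only the shuffle step needs to see the output of all of them.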
HDFS, on the other hand, is a distributed file system based on a master/slave architecture. It has a NameNode, which keeps the metadata for the stored files, and DataNodes, which actually store the data.
The metadata stored by the NameNode consists of file names, file permissions, block locations, and so on. There is also a Secondary NameNode, which sounds like a backup of the NameNode but is not one; its job is to reduce the burden on the NameNode by periodically merging the edit log into the file system image.
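The split between metadata and data can be pictured with a small Python sketch. This is a toy model, not HDFS code: plan_blocks is a hypothetical helper, the 64 MB block size and three replicas are assumptions chosen for the example (both are configurable in a real cluster), and the round-robin placement is a simplification of the placement policy HDFS actually uses.

```python
import math

BLOCK_SIZE_MB = 64  # assumed block size for this sketch
REPLICATION = 3     # assumed number of copies kept of each block


def plan_blocks(file_name, file_size_mb, data_nodes):
    """Return NameNode-style metadata: block id -> DataNodes holding a replica.

    Only this mapping lives on the NameNode; the block contents
    themselves would be stored on the listed DataNodes.
    """
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    copies = min(REPLICATION, len(data_nodes))
    metadata = {}
    for i in range(num_blocks):
        # Simple round-robin placement, used here only for illustration.
        replicas = [data_nodes[(i + r) % len(data_nodes)] for r in range(copies)]
        metadata[f"{file_name}#blk_{i}"] = replicas
    return metadata


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    for block, holders in plan_blocks("logs.txt", 200, nodes).items():
        print(block, "->", holders)
```

A 200 MB file is split into four blocks here, and losing any single DataNode still leaves two copies of every block on the remaining nodes.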
Please note: the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker all run as services (background daemon processes).
A set of machines running HDFS and MapReduce is known as a Hadoop cluster
– Individual machines are known as nodes
– A cluster can have as few as one node or as many as several thousand
– More nodes = better performance!
A little bit of Hadoop history:
Hadoop is based on work done by Google in the late 1990s/early 2000s
– Specifically, on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
In the traditional data processing model, data is stored in a ‘storage cluster’, copied over to a ‘compute cluster’ for processing, and the results are written back to the storage cluster.
This model, however, doesn’t quite work for Big Data, because copying so much data out to a compute cluster may be too time-consuming or impossible. So what is the answer?
One solution is to process Big Data ‘in place’, that is, in the storage cluster itself. Hadoop brings in this concept of data locality, making effective use of the underlying hardware and network.
This work takes a radical new approach to the problem of distributed computing
– Meets all the requirements we have for reliability and scalability
Core concept: distribute the data as it is initially stored in the system
– Individual nodes can work on data local to those nodes
– No data transfer over the network is required for initial processing
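The data locality idea can be sketched as a tiny scheduler in Python. This is an illustration under stated assumptions, not how the JobTracker is implemented: assign_tasks is a hypothetical function, and the greedy "prefer a node that already holds the block" rule is a simplification chosen for the example.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block's task to a node that already stores the block if possible.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of tasks the node can still accept
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            # Data-local task: no block transfer over the network is needed.
            node = max(local, key=lambda n: slots[n])
            is_local = True
        else:
            # Fall back to any free node; the block must be read remotely.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots left in the cluster")
            node = max(candidates, key=lambda n: slots[n])
            is_local = False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b0": ["dn1", "dn2"], "b1": ["dn1"], "b2": ["dn1"]}
    slots = {"dn1": 2, "dn2": 1, "dn3": 1}
    for block, (node, is_local) in assign_tasks(blocks, slots).items():
        print(block, "->", node, "(local)" if is_local else "(remote read)")
```

Only tasks that cannot be placed next to their data fall back to a remote read, which is the case the core concept above tries to keep rare.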
There are many other projects based around core Hadoop
– Often referred to as the ‘Hadoop Ecosystem’
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.