Big Data and Hadoop are hot technologies right now. Before going into the explanation, we need to clarify what Big Data and Hadoop actually are.
What is Big Data?
Big Data is nothing but a very large volume of data, in both structured and unstructured forms:
- Structured data – databases
- Unstructured data – file systems (logs, documents, media, etc.)
If your application has Big Data, how do you handle it? It is very hard to process that much data on a single server. To solve this, Apache introduced a new technology called Hadoop, which is written in Java.
What is Hadoop?
Hadoop is open-source software for distributed processing of large data sets. It can be used to query a large set of data and get the results faster using a reliable and scalable architecture.
Ref : http://hadoop.apache.org/
How does Hadoop help with Big Data?
- It structures unstructured data for data mining
- It helps to sort and analyze big data
- Its architecture is distributed: both data and processing are spread across multiple servers
Google originally started using this distributed computing model, based on GFS (Google File System) and the MapReduce technique.
Next, let's look at the distributed file system and the MapReduce technique.
Hadoop Distributed File System (HDFS):
HDFS is the storage system used by Hadoop. The following simple architecture explains how HDFS works. The architecture consists of a single NameNode and multiple DataNodes (servers), depending on your application.
When you dump data into HDFS, it is stored as data blocks across the Hadoop cluster. HDFS also creates several replicas of each data block and distributes them across the cluster, so that the data is reliable and can be retrieved faster during processing. The default size of a single data block is 128 MB, and every data block is replicated to multiple DataNodes across the cluster.
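As a rough illustration of the arithmetic above, here is a plain-Python sketch (not the HDFS API) of how a file is split into blocks and how much raw storage the replicas occupy; the block size and replication factor used are the common defaults, not fixed values:

```python
import math

BLOCK_SIZE_MB = 128   # common HDFS default block size
REPLICATION = 3       # common HDFS default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total raw storage in MB including replicas)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    total_storage_mb = file_size_mb * REPLICATION
    return blocks, total_storage_mb

# A 600 MB file becomes 5 blocks (four full 128 MB blocks plus one 88 MB block),
# and with 3 replicas it occupies 1800 MB of raw cluster storage.
print(hdfs_footprint(600))  # (5, 1800)
```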
The NameNode keeps metadata about the file system. When you execute a query from a client, the client first contacts the NameNode to get the file metadata, and then reaches out to the DataNodes to fetch the actual data blocks.
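A minimal sketch of that two-step lookup, with the NameNode's metadata modeled as a plain Python dictionary (the file path, block IDs, and node names here are made up for illustration):

```python
# Hypothetical NameNode metadata:
# file path -> list of (block_id, [DataNodes holding a replica of that block])
namenode_metadata = {
    "/logs/access.log": [
        ("blk_001", ["datanode1", "datanode2", "datanode3"]),
        ("blk_002", ["datanode2", "datanode3", "datanode4"]),
    ]
}

def plan_read(path):
    """Step 1: ask the NameNode for block locations.
    Step 2: pick one replica per block; the client would then
    contact those DataNodes directly for the real data."""
    blocks = namenode_metadata[path]
    return [(block_id, replicas[0]) for block_id, replicas in blocks]

print(plan_read("/logs/access.log"))
# [('blk_001', 'datanode1'), ('blk_002', 'datanode2')]
```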
MapReduce is a framework for processing big data in parallel across a Hadoop cluster. It is a combination of two functions: map and reduce. Every MapReduce program must specify a Mapper and typically a Reducer. The Mapper has a map method that transforms input (key, value) pairs into any number of intermediate (key', value') pairs. The Reducer has a reduce method that transforms each intermediate key and its grouped values (key', [value', ...]) into any number of output (key'', value'') pairs. Here is a diagrammatic representation of the MapReduce process.
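To make the map, shuffle, and reduce phases concrete, here is the classic word-count example simulated in plain Python (no Hadoop needed; a real job would use Hadoop's Java API or a streaming wrapper):

```python
from collections import defaultdict

def mapper(line):
    """Map: turn one input line into intermediate (word, 1) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce: aggregate the grouped values for one key into a final pair."""
    return (key, sum(values))

lines = ["big data with hadoop", "hadoop processes big data"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 2, 'data': 2, 'with': 1, 'hadoop': 2, 'processes': 1}
```

In a real cluster the map calls run on many DataNodes in parallel and the shuffle moves intermediate pairs across the network; this sketch only shows the data flow.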
More about Map/Reduce: https://docs.marklogic.com/guide/mapreduce/hadoop
Don’t forget to share and subscribe for my latest updates.