In olden days, elephants were used to carry heavy loads. But the loads that needed to be transported kept getting heavier at a tremendous pace. Earlier generations were wise enough not to look for bigger elephants to carry the bigger loads; instead, they harnessed multiple elephants together, distributed the load uniformly, and started transporting heavier loads.
By the same analogy, the size of the data that needs to be processed, understood and analyzed is increasing at a tremendous pace day by day. Even high-end servers cannot handle this data effectively, because the configuration of a single server cannot scale up at the pace at which the data grows.
This is where Hadoop comes into the picture. Hadoop can be compared to a group of highly disciplined, well-trained elephants working under a single master. Hadoop is built for processing huge data through a uniform and planned distribution of work among multiple slaves.
Hadoop is a distributed, scalable and portable storage and computing system that supports large-scale data processing and offers better fault tolerance.
Evolution (Year and Event)
- 2002 – Doug Cutting & Mike Cafarella started working on a distributed data management system with the project name ‘Nutch’
- 2004 – MapReduce was added to Nutch
- 2006 – Hadoop spins out of Nutch
- 2008 – Hive was launched to give SQL support for Hadoop
- 2009 – Hadoop was refactored to decouple MapReduce and the Hadoop Distributed File System
- 2010 – Hive, Pig, Avro and HBase sub-projects are added
- 2011 – ZooKeeper added
Hadoop follows a master/slave architecture, where the master decides who should do what, and the slaves do the real work and report back to the master.
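The master/slave idea can be illustrated with a small sketch. This is plain Python and not Hadoop code; the names `split_evenly`, `worker_task` and `run_master` are made up for illustration, and the "real work" is assumed to be a simple sum so that the result is easy to check.

```python
from typing import Dict, List


def split_evenly(records: List[int], num_workers: int) -> List[List[int]]:
    """Master side: deal records out round-robin so chunk sizes differ by at most one."""
    return [records[i::num_workers] for i in range(num_workers)]


def worker_task(worker_id: int, chunk: List[int]) -> Dict[str, int]:
    """Slave side: do the real work (here, a sum) and build a report for the master."""
    return {"worker": worker_id, "processed": len(chunk), "result": sum(chunk)}


def run_master(records: List[int], num_workers: int) -> int:
    """Assign one chunk per worker, collect the reports, and combine the partial results."""
    chunks = split_evenly(records, num_workers)
    reports = [worker_task(i, chunk) for i, chunk in enumerate(chunks)]
    return sum(report["result"] for report in reports)


if __name__ == "__main__":
    # Ten records shared among three workers.
    print(run_master(list(range(1, 11)), num_workers=3))
```

In a real cluster the workers run on separate machines and report over the network; here they are ordinary function calls so the flow of "assign, work, report" stays visible.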
Typical Hadoop cluster:
Applicable Domains:
Hadoop can be used in any functional domain where large amounts of data need to be stored, processed and computed. Here we list the domains where Hadoop is being used and cover the high-level use cases in each domain.
Use of Hadoop in various popular social networking sites
- For optimized data storage
- Workflow solutions
- Customer usage patterns
- Optimal storage and processing of CDRs (call detail records)
- Analyze existing data and provide accurate feedback about users in order to reduce risk
- Analyze the trends of trade
- To store and maintain health records
- To analyze gene sequences
- TRP ratings
- Image archival and restore
Hadoop Common is the base project of Hadoop, and it takes care of communication among the other modules. It is one of the core components.
Hadoop Distributed File System (HDFS)
- HDFS is a distributed, scalable and portable file system
- It has two main components, i.e. the NameNode and the DataNode
- The NameNode typically acts as the master and decides what data needs to be stored and where it should be stored
- The DataNode is the slave, which stores the real data
- If one of the DataNodes goes down, the NameNode instructs other nodes to store copies of the failed node's data. This is how it takes care of replication.
- It does not support concurrent write operations
- It is written in Java
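The replication behaviour described above can be sketched with a toy model. This is plain Python, not the HDFS API: the `NameNode` class, its method names, the "fewest blocks first" placement rule and the replication factor of 3 are all assumptions made for illustration. Real HDFS placement and failure detection are considerably more involved.

```python
from typing import Dict, List, Set


class NameNode:
    """Toy master: tracks which DataNodes hold a copy of each block (metadata only)."""

    def __init__(self, datanodes: List[str], replication: int = 3) -> None:
        self.live: List[str] = list(datanodes)
        self.replication = replication
        self.block_map: Dict[str, Set[str]] = {}

    def _pick_targets(self, exclude: Set[str], count: int) -> List[str]:
        # Assumed policy: prefer live nodes that currently hold the fewest blocks.
        load = {n: sum(n in holders for holders in self.block_map.values())
                for n in self.live}
        candidates = sorted((n for n in self.live if n not in exclude),
                            key=lambda n: (load[n], n))
        return candidates[:count]

    def store_block(self, block_id: str) -> Set[str]:
        """Decide where a new block goes and record its holders."""
        self.block_map[block_id] = set(self._pick_targets(set(), self.replication))
        return self.block_map[block_id]

    def datanode_failed(self, node: str) -> None:
        """Drop the failed node and re-replicate its blocks onto other live nodes."""
        self.live.remove(node)
        for holders in self.block_map.values():
            holders.discard(node)
            missing = self.replication - len(holders)
            holders.update(self._pick_targets(holders, missing))


if __name__ == "__main__":
    nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
    nn.store_block("a")
    nn.store_block("b")
    nn.datanode_failed("dn1")
    for block_id, holders in nn.block_map.items():
        print(block_id, sorted(holders))
```

After `dn1` fails, every block is again held by three live nodes, which is the point of the bullet about replication. Note that the model only moves metadata around; no actual data is copied.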
Map/Reduce
- Map/Reduce is a distributed, scalable computing framework
- It has two components, i.e. the JobTracker and the TaskTracker
- The JobTracker acts as the master and sends commands to the slaves for specific tasks
- The TaskTracker takes care of the real execution of a task and reports back to the JobTracker
- Map/Reduce programs are written in the Java language
- Every program you write should have separate map and reduce methods
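To show what "separate map and reduce methods" means, here is a word-count sketch. It is written in plain Python for brevity and runs in a single process; it does not use Hadoop's Java API, and the function names `map_words`, `shuffle`, `reduce_counts` and `word_count` are illustrative only. The grouping step in the middle is something the framework would normally do between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, standing in for the framework's shuffle phase."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all counts for one word into a single total."""
    return word, sum(counts)


def word_count(lines: List[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big elephants"]))
```

For the two input lines shown, the result is `{'big': 2, 'data': 1, 'elephants': 1}`. Because the map step looks at one line at a time and the reduce step at one word at a time, both can be spread across many machines.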
Hadoop is a highly scalable, distributed storage and computing framework. It is being used in many different domains, and it is well positioned to play a bigger role in big data computing.
Hope you enjoyed reading this article.