Hadoop is the most commonly used integrated big data platform on the market today, but it takes a number of technologies to make all this possible. These include proprietary Hadoop distributions developed by the big players in the Big Data world, yet the majority of commercial products are built on open source projects.
The Hadoop ecosystem is composed of a set of tools developed to function around HDFS and MapReduce, the two core Hadoop components. These tools provide support for analytics-related tasks as well as for the storage and management of data. As new technologies for Hadoop continue to emerge, it is essential to note that some products will be better suited to specific requirements than others.
The aim of this article is to give an overview of the entire suite of technologies collectively constituting the Hadoop ecosystem. These include tools for database and data management, core functionalities, data transfer, security enhancement, analytics, data serialization as well as Hadoop-based cloud computing functions.
Core Hadoop Elements: HDFS and MapReduce
Core Hadoop technologies have a built-in fault tolerance system that enables storage of large datasets. Data in Hadoop is typically stored within the Hadoop Distributed File System (HDFS). Here data files are subdivided into blocks and distributed across the available servers. HDFS is designed to run on clusters of commodity machines and to provide failure resilience, which it does by keeping several copies of each data block on different nodes.
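The block-splitting and replication idea can be sketched in a few lines of plain Python. This is a toy model, not HDFS itself: the block size, node names and round-robin placement are all illustrative assumptions (real HDFS uses 128 MB blocks and rack-aware placement).

```python
import itertools

BLOCK_SIZE = 4            # toy value; real HDFS defaults to 128 MB
REPLICATION = 3           # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster nodes

def place_blocks(data):
    # Split the data into fixed-size blocks, then assign each block
    # to REPLICATION distinct nodes, round-robin style.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    ring = itertools.cycle(range(len(NODES)))
    for b_id, _block in enumerate(blocks):
        start = next(ring)
        placement[b_id] = [NODES[(start + r) % len(NODES)] for r in range(REPLICATION)]
    return blocks, placement

blocks, placement = place_blocks(b"hello hdfs world")
```

Losing any single node still leaves two copies of every block, which is the essence of HDFS's failure resilience.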
MapReduce, on the other hand, is the programming paradigm that enables data processing, and it was the pioneering method used for application development within Hadoop. It is made up of two Java-based components: Mappers, which are responsible for extracting data from HDFS and placing it into 'maps', and Reducers, which are responsible for aggregating the results produced by the Mappers.
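The Mapper/Reducer pattern can be sketched without Hadoop at all. The following is a minimal word-count simulation in plain Python (the classic MapReduce example); the function names and the in-memory "shuffle" step are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) pairs -- analogous to a Hadoop Mapper.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Aggregate all counts for one key -- analogous to a Reducer.
    return word, sum(counts)

def map_reduce(lines):
    # "Shuffle" phase: group the intermediate pairs by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

result = map_reduce(["the quick brown fox", "the lazy dog"])
```

In real Hadoop the shuffle happens across the network between cluster nodes, but the division of labor between Mappers and Reducers is the same.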
Database and Data Management Tools
Data and database management tools in Hadoop mostly use the NoSQL paradigm for data storage and management. They do not use the better-known SQL (Structured Query Language), with its database schemas and other relational database operations. Common databases using the NoSQL paradigm include:
- Document databases such as CouchDB and MongoDB
- Graph databases and graph-processing systems such as Neo4j and Apache Giraph
- Key-value and wide-column databases such as Cassandra
Data Analytics Tools
Analytics tools in Hadoop enable preprocessing operations, including data integration, data cleaning, data reduction and data transformation. Machine learning algorithms such as regression and classification can also be applied to extract insight from data and make business intelligence (BI) easier.
Apache Mahout, for instance, is a set of machine learning algorithms used to perform complex analytics operations, including k-means for data clustering, logistic regression and random forest for data classification.
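Mahout's implementations are JVM-based and built to scale, but the clustering idea behind k-means fits in a few lines of plain Python. This is a toy sketch, not Mahout's API; the points, seed and iteration count are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Toy k-means: assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its cluster.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(points, 2)
```

Mahout applies the same iterate-until-stable loop, but distributes the assignment and averaging steps across a Hadoop cluster.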
Pig is another project designed specifically for Hadoop; it enables data processing while letting developers write far less code than raw MapReduce requires. Pig can perform extract, transform and load (better known in the database world as ETL) operations. It utilizes a procedural strategy for data processing, in contrast with, say, Hive, which relies on declarative, SQL-like queries.
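That procedural style is visible in Pig Latin itself: each statement names an intermediate result that the next statement builds on. The file path and field names below are hypothetical, for illustration only.

```
-- Load tab-separated log records (path and schema are hypothetical)
logs    = LOAD 'weblogs.tsv' USING PigStorage('\t')
          AS (user:chararray, bytes:int);
-- Each statement builds on the previous one: procedural, not declarative
by_user = GROUP logs BY user;
totals  = FOREACH by_user GENERATE group AS user,
                                   SUM(logs.bytes) AS total_bytes;
STORE totals INTO 'totals_by_user';
```

Pig compiles this sequence of steps into MapReduce jobs behind the scenes, whereas Hive would express the same result as a single SQL-like query.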
Data Transfer Tools
Data transfer tools enable moving data across Hadoop clusters and to and from external data sources. Apache Flume, for instance, is used to collect, aggregate and move data into HDFS. It supports both fan-in and fan-out processes as well as multi-hop flows. A configuration file stores the source, sink and channel information.
The Flume project was initially designed to provide a scalable and easy method of data collection, made possible by running agents on source machines. These agents send data updates to collectors, and the collectors aggregate the data into larger chunks to be saved later as HDFS files.
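A minimal example of such a configuration file, wiring one source through one channel into an HDFS sink, might look like the following; the agent and component names (`a1`, `r1`, `c1`, `k1`) and the port and path are placeholders.

```
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# Sink: write the collected events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Fan-in, fan-out and multi-hop flows are built the same way, by declaring additional sources, channels and sinks and wiring them together.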
Sqoop (short for 'SQL to Hadoop') is another tool, designed to transfer data between Hadoop clusters and relational databases, such as Microsoft SQL Server and Oracle, that conventionally run SQL queries.
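Sqoop is driven from the command line. As an illustration only (the host, database, usernames, table names and HDFS paths below are hypothetical), an import and the reverse export might look like:

```
# Pull a relational table into HDFS
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/salesdb \
  --username etl_user \
  --table ORDERS \
  --target-dir /data/orders

# Push processed results from HDFS back into a relational table
sqoop export \
  --connect jdbc:oracle:thin:@dbhost:1521/salesdb \
  --username etl_user \
  --table ORDER_SUMMARY \
  --export-dir /data/order_summary
```

Under the hood, Sqoop generates MapReduce jobs that read from or write to the database over JDBC in parallel.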
Data Security Tools
Giant software development companies have responded to the increasing need for applications and tools to secure Hadoop ecosystems against malicious infiltration and misuse. The main security issues relate to user authentication (identity determination), authorization (privilege determination), encryption (securing data in transit) and auditing (service logging). This need led to the development of tools such as Sentry and Knox.
Another tool, Kerberos, enables authentication procedures within Hadoop clusters, though it was initially intended to be a network authentication tool which offers encrypted tickets for clients and servers.
Cloud Computing Tools
Hadoop cloud computing enables organizations to take advantage of lower startup costs and easier, seamless scalability as they grow. Cloud computing services provide important functions such as remote DBA support, rapid elasticity, resource pooling, on-demand self-service, broad network access and measured service for cloud-based Hadoop clusters.
A concept closely related to cloud computing is virtualization (the creation of virtual computing entities), and both come at a cost: a small reduction in performance. The most popular cloud computing service today is Amazon Web Services (AWS). There is also Serengeti, an open source tool that enables the building of virtual, cloud-based Hadoop clusters.
Data Serialization
Data serialization refers to how data is represented while in transit from one location to another; i.e., its references and representations. This is especially significant for big data requirements that entail the iterative transfer of data between different parts of enterprise systems.
During the different phases of data processing, different applications and hence application programming interfaces (APIs) may be required. With this in mind, there are factors that must be considered before choosing a serialization format: speed of read/write processes by computers, size of the data, ease of use and the data complexity.
JSON's most significant advantage lies in its ability to map directly to the data structures of most programming languages and to keep both schema design and parsing code simple.
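That mapping is easy to see in practice: a serialize/parse round trip moves between native data structures and a plain text representation with no schema machinery at all. The record below is a made-up example.

```python
import json

# A hypothetical record built from native Python structures
record = {"user": "u123", "events": [1, 2, 3], "active": True}

encoded = json.dumps(record)   # serialize: native structures -> JSON text
decoded = json.loads(encoded)  # parse: JSON text -> native structures again
```

The trade-off, per the factors above, is that this human-readable text form is larger and slower to read and write than compact binary serialization formats.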
Jack Dowson is a web developer with a good knowledge of databases. He has written many articles on databases, web design, hosting, Big Data and more.