
Providing an Integrated Environment for Big Data Management with the Hadoop Ecosystem

by Jack Dowson

Hadoop is the most commonly used integrated big data platform on the market today, but it is really a collection of technologies working together. These include proprietary Hadoop distributions developed by the major vendors in the big data market, yet the majority of commercial products are built on open source projects.

The Hadoop ecosystem is a set of tools built around HDFS and MapReduce, the two core Hadoop components. These tools support analytics tasks as well as the storage and management of data. As new technologies for Hadoop continue to appear, it is worth noting that some products are better suited to specific requirements than others.

The aim of this article is to give an overview of the suite of technologies that collectively constitute the Hadoop ecosystem: the core components plus tools for database and data management, data transfer, security, analytics, data serialization and Hadoop-based cloud computing.

Core Hadoop Elements: HDFS and MapReduce

Core Hadoop has built-in fault tolerance that enables the storage of very large datasets. Data in Hadoop is typically stored in the Hadoop Distributed File System (HDFS), where files are subdivided into blocks and distributed across the available servers. HDFS is designed to run on clusters of commodity machines and to be resilient to failure, which it achieves by keeping several copies of each data block on different nodes.
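
As a hedged sketch, a client application typically reads and writes HDFS files through Hadoop's FileSystem Java API; the file path below is a made-up placeholder, and the cluster address is assumed to come from the configuration files on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from the classpath; the
            // NameNode address could also be set explicitly via fs.defaultFS.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits larger files into blocks and
            // replicates each block across several DataNodes behind the scenes.
            Path path = new Path("/user/demo/hello.txt");  // placeholder path
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the same file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }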

MapReduce, in turn, is the programming model that drives data processing and was the original method for developing applications on Hadoop. A MapReduce job consists of two Java-based phases: mappers, which read data from HDFS and emit intermediate key-value pairs ('maps'), and reducers, which aggregate the results produced by the mappers.
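
As a hedged illustration of the mapper/reducer split, the classic word-count job below is written against Hadoop's MapReduce Java API; the input and output directories are taken from the command line and are assumptions, not paths from this article.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: pulls lines out of HDFS splits and emits (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: aggregates the mappers' output into a total count per word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }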

Data and Database Management Tools

Data and database management tools in the Hadoop ecosystem mostly follow the NoSQL paradigm for data storage and management. Rather than the more familiar SQL (Structured Query Language), with its fixed schemas and other relational operations, they store data in non-relational structures. Common NoSQL databases include the following (a brief document-database sketch follows the list):

  • Document databases such as CouchDB and MongoDB
  • Graph databases such as Giraph and Neo4j
  • Wide-column (column-family) databases such as Cassandra
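
As a hedged example of the document-database style, the sketch below stores and retrieves one JSON-like document with the MongoDB Java driver; the connection string, database and collection names are made-up placeholders.

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class MongoSketch {
        public static void main(String[] args) {
            // Connection string, database and collection names are placeholders.
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> events =
                        client.getDatabase("analytics").getCollection("events");

                // Documents are schemaless key-value structures rather than rows.
                events.insertOne(new Document("user", "alice")
                        .append("action", "login")
                        .append("durationMs", 420));

                // Query by a field value instead of an SQL WHERE clause.
                Document first = events.find(new Document("user", "alice")).first();
                System.out.println(first == null ? "not found" : first.toJson());
            }
        }
    }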

Data Analytics Tools

Analytics tools in Hadoop enable preprocessing operations such as data integration, data cleaning, data reduction and data transformation. Machine learning algorithms such as regression and classification can also be applied to extract insight from the data and support business intelligence (BI).

Apache Mahout, for instance, is a library of machine learning algorithms used for complex analytics, including k-means for clustering and logistic regression and random forests for classification.

Pig is another project designed specifically for Hadoop; it simplifies data processing and makes the code shorter and easier to work with. Pig can perform extract, transform and load (ETL) operations. It takes a procedural approach to data processing, in contrast with, say, Hive, which relies on declarative, SQL-like queries.
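
As a hedged illustration of that procedural style, the sketch below embeds a small Pig Latin ETL flow in a Java program through Pig's PigServer API; the input file, field layout and filter condition are illustrative placeholders.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigEtlSketch {
        public static void main(String[] args) throws Exception {
            // Run Pig in local mode; ExecType.MAPREDUCE would submit to a cluster instead.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Load raw records, filter them, then group and aggregate -- a typical ETL flow.
            // The file name and schema below are illustrative placeholders.
            pig.registerQuery("raw = LOAD 'web_logs.txt' USING PigStorage('\\t') "
                    + "AS (user:chararray, url:chararray, bytes:long);");
            pig.registerQuery("big = FILTER raw BY bytes > 1000;");
            pig.registerQuery("by_user = GROUP big BY user;");
            pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(big.bytes);");

            // Store the transformed result back into HDFS (or the local FS in local mode).
            pig.store("totals", "user_totals");
            pig.shutdown();
        }
    }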

Data Transfer Tools

Data transfer tools move data between Hadoop clusters and to and from external data sources. Apache Flume, for instance, is used to collect, aggregate and move data into HDFS. It supports fan-in and fan-out processes as well as multi-hop flows, and a configuration file defines each agent's sources, sinks and channels.

The Flume project was originally designed to provide a simple, scalable way to collect data by running agents on the source machines. These agents send data updates to collectors, which aggregate the data into larger chunks that are later saved as HDFS files.
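
As a hedged sketch of how an application hands events to such an agent, the snippet below uses Flume's Java client SDK to send a single event to an agent's Avro source; the host, port and event body are illustrative placeholders and must match the agent's configuration file.

    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeClientSketch {
        public static void main(String[] args) {
            // Connect to a Flume agent whose Avro source listens on this host/port
            // (placeholder values).
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-host.example.com", 41414);
            try {
                // Build one event and hand it to the agent, which routes it
                // through its channel to the configured sink (e.g. HDFS).
                Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
                client.append(event);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                client.close();
            }
        }
    }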

Sqoop (short for "SQL to Hadoop") is another tool, designed to transfer bulk data between Hadoop clusters and relational databases such as Microsoft SQL Server and Oracle, which conventionally run SQL queries.

Data Security Tools

The major software vendors also responded to the growing need for tools that secure Hadoop deployments against malicious infiltration and misuse. The security concerns fall into user authentication (determining identity), authorization (determining privileges), encryption (protecting data in transit) and auditing (logging service activity), and they led to the development of tools such as Apache Sentry and Apache Knox.

Another tool, Kerberos, provides authentication within Hadoop clusters; it was originally built as a general network authentication protocol that issues encrypted tickets to clients and servers.
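
As a hedged sketch, a Java client typically authenticates to a Kerberized cluster through Hadoop's UserGroupInformation API before touching HDFS; the principal name and keytab path below are made-up placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell the Hadoop client libraries that the cluster expects Kerberos.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Authenticate with a service principal and keytab (placeholder values).
            UserGroupInformation.loginUserFromKeytab(
                    "etl-service@EXAMPLE.COM", "/etc/security/keytabs/etl-service.keytab");

            // Once the Kerberos ticket is obtained, normal HDFS calls work as usual.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Home directory: " + fs.getHomeDirectory());
            fs.close();
        }
    }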

Cloud Computing Tools

Running Hadoop in the cloud lets organizations take advantage of lower startup costs and easier, seamless scaling as they grow. Cloud services bring capabilities such as remote DBA support, rapid elasticity, resource pooling, on-demand self-service, broad network access and measured service to cloud-based Hadoop clusters.

Virtualization (the creation of virtual computing resources) is a concept closely related to cloud computing, and both come at a cost: a small reduction in performance. The most popular cloud platform today is Amazon Web Services (AWS). There is also Serengeti, a virtualization tool that enables the deployment of virtual, cloud-based Hadoop clusters.

Data Serialization

Data serialization refers to how data is represented while in transit from one location to another. This is especially significant for big data workloads that repeatedly move data between different parts of an enterprise system.

During the different phases of data processing, different applications, and hence different application programming interfaces (APIs), may be required. With this in mind, several factors should be weighed when choosing a serialization format: read/write speed, the size of the serialized data, ease of use and the complexity of the data.

Alongside Avro, Thrift, Protobuf (Protocol Buffers) and Parquet, JSON (JavaScript Object Notation) and BSON (binary JSON) are popular formats for data transfer within the Hadoop environment. JSON in particular is often preferred because its format is simple, self-describing and based on key-value pairs.
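
As a hedged sketch of Hadoop-native serialization, the snippet below writes one record to an Avro container file using Avro's generic Java API; the schema, field values and output file name are illustrative assumptions.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteSketch {
        public static void main(String[] args) throws Exception {
            // Define the record layout as an Avro schema (illustrative fields).
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                    + "{\"name\":\"name\",\"type\":\"string\"},"
                    + "{\"name\":\"age\",\"type\":\"int\"}]}");

            // Build one record that conforms to the schema.
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 34);

            // Write it to a compact, self-describing Avro container file.
            try (DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }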

JSON's most significant advantage, however, lies in how naturally it maps to the native data structures of most programming languages, which keeps both schema design and parsing code simple.

Related Reading

Big Data Analytics Architectures, Frameworks, and Tools

An Overview of the NoSQL World

About the Author

Jack Dowson is a web developer with a strong background in databases. He has written many articles on databases, web design, hosting, big data and related topics. You can follow him on Google+.

© Copyright 2015 Auerbach Publications