Big Data Analytics Architectures, Frameworks, and Tools
Wullianallur Raghupathi and Viju Raghupathi
Like big data, the analytics associated with big data is also described by three primary characteristics: volume, velocity, and variety (http://www01.ibm.com/software/data/bigdata/). There is no doubt data will continue to be created and collected, continually leading to incredible volume of data. Second, this data is being accumulated at a rapid pace, and in real time. This is indicative of velocity. Third, gone are the days of data being collected in standard quantitative formats and stored in spreadsheets or relational databases. Increasingly, the data is in multimedia format and unstructured. This is the variety characteristic. Considering volume, velocity, and variety, the analytics techniques have also evolved to accommodate these characteristics to scale up to the complex and sophisticated analytics needed (Russom, 2011; Zikopoulos et al., 2013). Some practitioners and researchers have introduced a fourth characteristic: veracity (Ohlhorst, 2012). The implication of this is data assurance. That is, both the data and the analytics and outcomes are error-free and credible.
Simultaneously, the architectures and platforms, algorithms, methodologies, and tools have also scaled up in granularity and performance to match the demands of big data (Ferguson, 2012; Zikopoulos et al., 2012). For example, big data analytics is executed in distributed processing across several servers (nodes) to utilize the paradigm of parallel computing and a divide and process approach. It is evident that the analytics tools for structured and unstructured big data are very different from the traditional business intelligence (BI) tools. The architectures and tools for big data analytics have to necessarily be of industrial strength. Likewise, the models and techniques such as data mining and statistical approaches, algorithms, visualization techniques, etc., have to be mindful of the characteristics of big data analytics. For example, the National Oceanic and Atmospheric Administration (NOAA) uses big data analytics to assist with climate, ecosystem, and environment, weather forecasting and pattern analysis, and commercial translational applications. NASA engages big data analytics for aeronautical and other types of research (Ohlhorst, 2012). Pharmaceutical companies are using big data analytics for drug discovery, analysis of clinical trial data, side effects and reactions, etc. Banking companies are utilizing big data analytics for investments, loans, customer demographics, etc. Insurance and healthcare provider and media companies are other big data analytics industries.
The 4Vs are a starting point for the discussion about big data analytics. Other issues include the number of architectures and platform, the dominance of the open-source paradigm in the availability of tools, the challenge of developing methodologies, and the need for user-friendly interfaces. While the overall cost of the hardware and software is declining, these issues have to be addressed to harness and maximize the potential of big data analytics. We next delve into the architectures, platforms, and tools.
Architectures, Frameworks, and Tools
The conceptual framework for a big data analytics project is similar to that for a traditional business intelligence or analytics project. The key difference lies in how the processing is executed. In a regular analytics project, the analysis can be performed with a business intelligence tool installed on a stand-alone system such as a desktop or laptop. Since the big data is large by definition, the processing is broken down and executed across multiple nodes. While the concepts of distributed processing are not new and have existed for decades, their use in analyzing very large data sets is relatively new as companies start to tap into their data repositories to gain insight to make informed decisions. Additionally, the availability of open-source platforms such as Hadoop/MapReduce on the cloud has further encouraged the application of big data analytics in various domains. Third, while the algorithms and models are similar, the user interfaces are entirely different at this time. Classical business analytics tools have become very user-friendly and transparent. On the other hand, big data analytics tools are extremely complex, programming intensive, and need the application of a variety of skills. As Figure 1 indicates, a primary component is the data itself. The data can be from internal and external sources, often in multiple formats, residing at multiple locations in numerous legacy and other applications. All this data has to be pooled together for analytics purposes. The data is still in a raw state and needs to be transformed. Here, several options are available. A service-oriented architectural approach combined with web services (middleware) is one possibility. The data continues to be in the same state, and services are used to call, retrieve, and process the data. On the other hand, data warehousing is another approach wherein all the data from the different sources are aggregated and made ready for processing. However, the data is unavailable in real time. Via the steps of extract, transform, and load (ETL), the data from
Figure 1. An applied conceptual architecture of big data analytics.
diverse sources is cleansed and made ready. Depending on whether the data is structured or unstructured, several data formats can be input to the Hadoop/MapReduce platform (Sathi, 2012; Zikopoulos et al., 2013).
In this next stage in the conceptual framework, several decisions are made regarding the data input approach, distributed design, tool selection, and analytics models. Finally, to the far right the four typical applications of big data analytics are shown. These include queries, reports, online analytic processing (OLAP), and data mining. Visualization is an overarching theme across the four applications. A wide variety of techniques and technologies have been developed and adapted to aggregate, manipulate, analyze, and visualize big data. These techniques and technologies draw from several fields, including statistics, computer science, applied mathematics, and economics (Courtney, 2013; HP 2012).
The most significant platform for big data analytics is the open-source distributed data processing platform Hadoop (Apache platform), initially developed for routine functions such as aggregating web search indexes. It belongs to the class NoSQL technologies (others include CouchDB and MongoDB) that have evolved to aggregate data in unique ways. Hadoop has the potential to process extremely large amounts of data by mainly allocating partitioned data sets to numerous servers (nodes), which individually solve different parts of the larger problem and then integrate them back for the final result (Ohlhorst, 2012; Zikopoulos et al., 2012). It can serve in the twin roles of either as a data organizer or as an analytics tool. Hadoop offers a great deal of potential in enabling enterprises to harness the data that was, until now, difficult to manage and analyze. Specifically, Hadoop makes it possible to process extremely large volumes of data with varying structures (or no structure at all). However, Hadoop can be complex to install, configure, and administer, and there is not yet readily available individuals with Hadoop skills. Furthermore, organizations are not ready as well to embrace Hadoop completely.
It is generally accepted that there are two important modules in Hadoop (Borkar et al., 2012; Mone, 2013):
- The Hadoop Distributed File System (HDFS). This facilitates the underlying storage for the Hadoop cluster. When data for the analytics arrives in the cluster, HDFS breaks it into smaller parts and redistributes the parts among the different servers (nodes) engaged in the cluster. Only a small chunk of the entire data set resides on each server/node, and it is conceivable each chunk is duplicated on other servers/nodes.
- MapReduce. Since the Hadoop platform stores the complete data set in small pieces across a connected set of servers/nodes in distributed fashion, the analytics tasks can be distributed across the servers/nodes too. Results from the individual pieces of processing are aggregated or pooled together for an integrated solution. MapReduce provides the interface for the distribution of the subtasks and then the gathering of the outputs (Sathi, 2012; Zikopoulos et al., 2012, 2013). MapReduce is discussed further below.
A major advantage of parallel/distributed processing is graceful degradation or capability to cope with possible failures. Therefore, HDFS and MapReduce are configured to continue to execute in the event of a failure. HDFS, for example, monitors the servers/nodes and storage devices continually. If a problem is detected, it automatically reroutes and restores data onto an alternative server/node. In other words, it is configured and designed to continue processing in light of a failure. In addition, replication adds a level of redundancy and backup. Similarly, when tasks are executed, MapReduce tracks the processing of each server/node. If it detects any anomalies such as reduced speed, going into a hiatus, or reaching a dead end, the task is transferred to another server/node that holds the duplicate data. Overall, the synergy between HDFS and MapReduce in the cloud environment facilitates industrial strength, scalable, reliable, and fault-tolerant support for both the storage and analytics (Zikopoulos et al., 2012, 2013).
In an example, it is reported that Yahoo! is an early user of Hadoop (Ohlhorst, 2012). Its key objective was to gain insight from the large amounts of data stored across the numerous and disparate servers. The integration of the data and the application of big data analytics was mission critical. Hadoop appeared to be the perfect platform for such an endeavor. Presently, Yahoo! is apparently one of the largest users of Hadoop and has deployed it on thousands on servers/nodes. The Yahoo! Hadoop cluster apparently holds huge "log files" of user-clicked data, advertisements, and lists of all Yahoo! published content. From a big data analytics perspective, Hadoop is used for a number of tasks, including correlation and cluster analysis to find patterns in the unstructured data sets.
Some of the more notable Hadoop-related application development-oriented initiatives include Apache Avro (for data serialization), Cassandra and HBase (databases), Chukka (a monitoring system specifically designed with large distributed systems in view), Hive (provides ad hoc Structured Query Language (SQL)-like queries for data aggregation and summarization), Mahout (a machine learning library), Pig (a high-level Hadoop programming language that provides a data flow language and execution framework for parallel computation), Zookeeper (provides coordination services for distributed applications), and others (Zikopoulos et al., 2012, 2013). The key ones are described below.
MapReduce, as discussed above, is a programming framework developed by Google that supports the underlying Hadoop platform to process the big data sets residing on distributed servers (nodes) in order to produce the aggregated results. The primary component of an algorithm would map the broken up tasks (e.g., calculations) to the various locations in the distributed file system and consolidate the individual results (the reduce step) that are computed at the individual nodes of the file system. In summary, the data mining algorithm would perform computations at the server/node level and simultaneously in the overall distributed system to summate the individual outputs (Zikopoulos et al., 2012). It is important to note that the primary Hadoop MapReduce application programming interfaces (APIs) are mainly called from Java. This requires skilled programmers. In addition, advanced skills are indeed needed for development and maintenance.
In order to abstract some of the complexity of the Hadoop programming framework, several application development languages have emerged that run on top of Hadoop. Three popular ones are Pig, Hive, and Jaql. These are briefly described below.
Pig and PigLatin
Pig was originally developed at Yahoo! The Pig programming language is configured to assimilate all types of data (structured/unstructured, etc.). Two key modules are comprised in it: the language itself, called PigLatin, and the runtime version in which the PigLatin code is executed (Borkar et al., 2012). According to Zikopoulos et al. (2012), the initial step in a Pig program is to load the data to be subject to analytics in HDFS. This is followed by a series of manipulations wherein the data is converted into a series of mapper and reducer tasks in the background. Last, the program dumps the data to the screen or stores the outputs at another location. The key advantage of Pig is that it enables the programmers utilizing Hadoop to focus more on the big data analytics and less on developing the mapper and reducer code (Zikopoulos et al., 2012).
While Pig is robust and relatively easy to use, it still has a learning curve. This means the programmer needs to become proficient (Zikopoulos et al., 2012). To address this issue, Facebook has developed a runtime Hadoop support architecture that leverages SQL with the Hadoop platform (Borkar et al., 2012). This architecture is called Hive; it permits SQL programmers to develop Hive Query Language (HQL) statements akin to typical SQL statements. However, HQL is limited in the commands it recognizes. Ultimately, HQL statements are decomposed by the Hive Service into MapRaduce tasks and executed across a Hadoop cluster of servers/nodes (Zikopoulos et al., 2012). Also, since Hive is dependent on Hadoop and MapReduce executions, queries may have lag time in processing up to several minutes. This implies Hive may not be suitable for big data analytics applications that need rapid response times, typical of relational databases. Lastly, Hive is a read-based programming artifact; it is therefore not appropriate for transactions that engage in a large volume of write instructions (Zikopoulos et al., 2012).
Zookeeper is yet another open-source Apache project that allows a centralized infrastructure with various services; this provides for synchronization across a cluster of servers. Zookeeper maintains common objects required in large cluster situations (like a library). Examples of these typical objects include configuration information, hierarchical naming space, and others (Zikopoulos et al., 2012). Big data analytics applications can utilize these services to coordinate parallel processing across big clusters. As described in Zikopoulos et al. (2012), one can visualize a Hadoop cluster with >500 utility services. This necessitates a centralized management of the entire cluster in the context of such things as name services, group services, synchronization services, configuration management, and others. Furthermore, several other open-source projects that utilize Hadoop clusters require these types of cross-cluster services (Zikopoulos et al., 2012). The availability of these in a Zookeeper infrastructure implies that projects can be embedded by Zookeeper without duplicating or requiring constructing all over again. A final note: Interface with Zookeeper happens via Java or C interfaces presently (Zikopoulos et al., 2012).
HBase is a column-oriented database management system that sits on the top of HDFS (Zikopoulos et al., 2012). In contrast to traditional relational database systems, HBase does not support a structured query language such as SQL. The applications in HBase are developed in Java much similar to other MapReduce applications. In addition, HBase does support application development in Avro, REST, or Thrift. HBase is built on concepts similar to how HDFS has a NameNode (master) and slave nodes, and MapReduce comprises JobTracker and TaskTracker slave nodes. A master node manages the cluster in HBase, and regional servers store parts of the table and execute the tasks on the big data (Zikopoulos et al., 2012).
Cassandra, an Apache project, is also a distributed database system (Ohlhorst, 2012). It is designated as a top-level project modeled to handle big data distributed across many utility servers. Also, it provides reliable service with no particular point of failure (http://en.wikipedia.org/wiki/Apache_Cassandra). It is also a NoSQL system. Facebook originally developed it to support its inbox search. The Cassandra database system can store 2 million columns in a single row. Similar to Yahoo!'s needs, Facebook wanted to use the Google BigTable architecture that could provide a column-and-row database structure; this could be distributed across a number of nodes. But BigTable faced a major limitationits use of a master node approach made the entire application depend on one node for all read-write coordinationthe antithesis of parallel processing (Ohlhorst, 2012). Cassandra was built on a distributed architecture named Dynamo, designed by Amazon engineers. Amazon used it to track what its millions of online customers were entering into their shopping carts. Dynamo gave Cassandra an advantage over BigTable; this is due to the fact that Dynamo is not dependent on any one master node. Any node can accept data for the whole system, as well as answer queries. Data is replicated on multiple hosts, creating stability and eliminating the single point of failure (Ohlhorst, 2012).
Many tasks may be tethered together to meet the requirements of a complex analytics application in MapReduce. The open-source project Oozie to an extent streamlines the workflow and coordination among the tasks (Zikopoulos et al., 2012). Its functionality permits programmers to define their own jobs and the relationships between those jobs. It will then automatically schedule the execution of the various jobs once the relationship criteria have been complied with.
Lucene is yet another widely used open-source Apache project predominantly used for text analytics/searches; it is incorporated into several open-source projects. Lucene precedes Hadoop and has been a top-level Apache project since 2005. Its scope includes full text indexing and library search for use within a Java application (Zikopoulos et al., 2012).
Avro, also an Apache project, facilitates data serialization services. The data definition schema is also included in the data file. This makes it possible for an analytics application to access the data in the future since the schema is also stored along with. Versioning and version control are also added features of use in Avro. Schemas for prior data are available, making schema modifications possible (Zikopoulos et al., 2012).
Mahout is yet another Apache project whose goal is to generate free applications of distributed and scalable machine learning algorithms that support big data analytics on the Hadoop platform. Mahout is still an ongoing project, evolving to include additional algorithms (http://en.wikipedia.org/wiki/ Mahout). The core widely used algorithms for classification, clustering, and collaborative filtering are implemented using the map/reduce paradigm.
Streams deliver a robust analytics platform for analyzing data in real time (Zikopoulos et al., 2012). Compared to BigInsights, Streams applies the analytics techniques on data in motion. But like BigInsights, Streams is appropriate not only for structured data but also for nearly all other types of datathe nontraditional semistructured or unstructured data coming from sensors, voice, text, video, financial, and many other high-volume sources (Zikopoulos et al., 2012).
Overall, in summary, there are numerous vendors, including AWS, Cloudera, Hortonworks, and MapR Technologies, among others, who distribute open-source Hadoop platforms (Ohlhorst, 2012). Numerous proprietary options are also available, such as IBM's BigInsights. Further, many of these are cloud versions that make it more widely available. Cassandra, HBase, and MongoDB, as described above, are widely used for the database component.
Big data analytics is transforming the way companies are using sophisticated information technologies to gain insight from their data repositories to make informed decisions. This data-driven approach is unprecedented, as the data collected via the web and social media is escalating by the second. In the future we'll see the rapid, widespread implementation and use of big data analytics across the organization and the industry. In the process, the several challenges highlighted above need to be addressed. As it becomes more mainstream, issues such as guaranteeing privacy, safeguarding security, establishing standards and governance, and continually improving the tools and technologies would garner attention. Big data analytics and applications are at a nascent stage of development, but the rapid advances in platforms and tools can accelerate their maturing process.
Borkar, V.R., Carey, M.J., and C. Li. Big Data Platforms: What's Next? XRDS, 19(1), 44–49, 2012.
Ohlhorst, F. Big Data Analytics: Turning Big Data into Big Money. New York: John Wiley & Sons, 2012.
Sathi, A. Big Data Analytics. MC Press Online LLC, 2012.
Zikopoulos, P.C., deRoos, D., Parasuraman, K., Deutsch, T., Corrigan, D., and J. Giles. Harness the Power of Big DataThe IBM Big Data Platform. New York: McGrawHill, 2013.
Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T., and G. Lapis. Understanding Big Data Analytics for Enterprise Class Hadoop and Streaming Data. New York: McGraw-Hill, 2012.
Read more IT Performance Improvement
Certain names and logos on this page and others may constitute trademarks, servicemarks, or tradenames of
Taylor & Francis LLC. Copyright © 20082014 Taylor & Francis LLC. All rights reserved.