Big Data and Cloud Computing

Faculty Article

Preface

Information and Communication Technology (ICT) is growing tremendously, and many new fields are emerging with its help. Among them, big data and cloud computing are conjoined as one of the most important areas of study, occupying the top position in current research on the Gartner hype cycle. There is a lot to say about the topic, but in this short article I have little scope to discuss all the issues, so I have restricted myself to an overview of big data, its importance, its relationship with the cloud, big data initiatives and some challenges.

Big data and the big data problem

With the remarkable growth of social media, the Internet of Things (IoT) and multimedia, there has been an explosion in the generation of massive volumes of data. For example, around 267 million transactions are made per day in Wal-Mart's 6,000 stores worldwide, more than 3 billion pieces of content are generated on Facebook every day, the Large Synoptic Survey Telescope can record 30 trillion bytes of image data in a single day, 32 petabytes of climate observation data are preserved at the NASA Center for Climate Simulation, and FICO's Falcon credit card fraud detection system manages over 2.1 billion valid accounts around the world. We refer to such massive volumes of data, whether structured, semi-structured or fully unstructured in nature, as big data. But how can we define and characterize big data? It is defined as a collection of very large data sets with such a great diversity of types that they become difficult to process using state-of-the-art or traditional data processing platforms. Big data is characterized by high volume, high velocity, high variety and high veracity, generally known as the 4Vs: volume indicates the size of the data sets, velocity the speed at which data flow in and out, variety the range of data types and sources, and veracity the uncertainty of the data.

Big data is transforming healthcare, science, engineering, finance, business and, eventually, society as a whole, and it is drawing enormous attention from academia, government and industry. We are in the era of data-intensive computation, the focus having shifted from compute-intensive applications. There is an urgent need to analyze this massive amount of data and discover knowledge that benefits society, business and industry. For instance, processing big data to extract informative patterns and knowledge can give the public sector a chance to improve productivity and reach higher levels of efficiency and effectiveness. Europe's public sector could potentially reduce expenditure on administrative activities by 15–20 percent, creating 223 billion to 446 billion in value, or even more. This estimate rests on efficiency gains and on narrowing the gap between actual and potential tax revenue, and such gains could speed up annual productivity growth by up to 0.5 percentage points over the next decade.

However, the major challenge for researchers and practitioners is how to analyze such data, since traditional models, platforms and computing paradigms struggle to cope with it. It is fair to say that big data will revolutionize many fields, including business, scientific research and public administration.

Cloud computing and big data

Cloud computing has become one of the most significant technologies and is well suited to solving the big data problem. Its advantages include virtualized resources, parallel processing, security, and data service integration with scalable data storage. Cloud computing not only delivers applications and services over the Internet; it has also been extended to infrastructure as a service (e.g., Amazon EC2) and platform as a service (such as Google App Engine and Microsoft Azure). Another advantage of the cloud is its storage technology, which offers good extensibility and scalability and thus a practical means of storing big data. Cloud computing also provides the underlying engine through Hadoop, a class of distributed data-processing platforms. The evolution of big data is driven by fast-growing cloud-based applications developed using virtualized technologies. Therefore, cloud computing not only provides facilities for the computation and processing of big data but also serves as a service model.
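As a small illustration of how elastic cloud storage can act as a landing zone for big data, the sketch below uploads a local dataset to Amazon S3 using the AWS SDK for Java. The bucket name, object key, file name and region are hypothetical placeholders, and credentials are assumed to come from the standard AWS credential chain; this is a minimal sketch, not a full ingestion pipeline.

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class DatasetUpload {
    public static void main(String[] args) {
        // Build an S3 client; credentials and configuration are resolved from
        // the default AWS chain (environment variables, ~/.aws, instance profile).
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")   // assumed region, purely illustrative
                .build();

        // Hypothetical bucket, key and local file used only for illustration.
        String bucket = "my-bigdata-landing-zone";
        String key = "raw/sensor-readings.csv";
        File localFile = new File("sensor-readings.csv");

        // A single call places the object in elastic, replicated cloud storage,
        // from where a distributed processing platform can later consume it.
        s3.putObject(bucket, key, localFile);

        System.out.println("Uploaded " + localFile.getName()
                + " to s3://" + bucket + "/" + key);
    }
}
```

Platform-as-a-service offerings such as Google App Engine and Microsoft Azure expose comparable storage and compute primitives through their own SDKs.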

A big data platform

Hadoop is an open-source Apache project, written in Java, that provides a distributed processing platform for large datasets across clusters of commodity hardware. Hadoop has two primary components, namely the Hadoop Distributed File System (HDFS) and the MapReduce programming framework. The two are closely related and are co-deployed to produce a single cluster. HDFS is a distributed file system designed to run on top of the local file systems of the cluster nodes and to store extremely large files suitable for streaming data access. It is highly fault tolerant and can scale up to thousands of machines, each offering local computation and storage. MapReduce, on the other hand, is a simplified programming model for processing large datasets, pioneered by Google for data-intensive applications. The MapReduce model is adopted through the open-source Hadoop implementation, which was popularized by Yahoo. MapReduce allows even an inexperienced programmer to develop parallel programs capable of using the computers in a cloud. MapReduce works in a divide-and-conquer fashion. In a Hadoop cluster there are two kinds of nodes: master nodes and worker nodes. In the Map step, the master node takes a complex problem as input, divides it into smaller sub-problems and distributes them to worker nodes. In the Reduce step, the answers to all the sub-problems are collected at the master node, which combines them to form the solution to the entire problem; a small example is sketched below. Apart from the MapReduce framework, several other open-source Apache projects belong to the Hadoop ecosystem; Hive, HBase, Mahout, Pig, ZooKeeper, Spark and Avro are a few popular names among them.
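To make the Map and Reduce steps concrete, the sketch below is the classic word-count job written against Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. Class and path names are illustrative; this is a minimal sketch of the programming model rather than a production job.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each worker tokenizes its input split and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce step: counts for the same word are gathered and summed.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0]: HDFS input directory, args[1]: HDFS output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a job would typically be packaged into a jar and submitted with hadoop jar wordcount.jar WordCount <input> <output>, with HDFS supplying the input splits to the mapper tasks and storing the reducer output.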

Big data initiatives

Wal-Mart recently collaborated with Hewlett-Packard to establish a data warehouse capable of storing 4 petabytes of data, tracing every purchase record from its point-of-sale terminals to stay competitive. Almost 3 terabytes of data have been collected by the US Library of Congress for public administration. The Obama administration announced the Big Data Research and Development Initiative in 2012 to investigate important problems faced by the government by making use of big data; the initiative comprised 84 different big data programs involving six departments. Similar initiatives have also been launched in Europe.

Big data challenges and research

Solving big data problems in a cloud environment is neither trivial nor straightforward, as there are many challenges, such as service availability (e.g., network links and bandwidth), data confidentiality (security risks), energy (data centres consume huge amounts of power), application parallelization, visualization and so on. Many scientific fields, including astronomy, meteorology, social computing, bioinformatics and computational biology, have already become highly data-driven, as they generate large volumes of data of various types. How to extract knowledge from the data produced by large-scale scientific simulations is certainly a big data problem for which satisfactory answers are still unknown. Developing big data techniques and tools requires a number of disciplines, including statistics, data mining, machine learning, neural networks, social network analysis, signal processing, pattern recognition, optimization methods and visualization approaches. There is a lot of scope for research into applying such techniques in a cloud computing environment, and enormous scope for improving MapReduce or applying it to solve big data problems over clouds. We need to join hands and carry out significant research to make big data and cloud computing truly flourish as an emerging field of computer science.