Big Data and Cloud Computing
In recent years there has been huge demand for storing and processing ever larger volumes of data, especially in sectors such as science, finance and government. Effective processes and methods are required for the maintenance and management of such data.
Big data technologies are responsible for storing this data successfully and processing over it, while the cloud provides a scalable and elastic environment in which big data workloads can run reliably and in a fault-tolerant manner. Big data and big data analytics are considered together in business and science in order to correlate data. Both technologies are therefore of great importance: they can provide competitive advantages in business and ways to aggregate and summarize scientific data.
In this paper we focus on cloud computing and big data systems. We mainly examine the challenges and benefits of these technologies through the case of the organization Facebook, and analyze these tools as a source of business intelligence.
Cloud computing refers to an on-demand model for provisioning computing resources; it is mainly based on distributed and virtualized computing technologies. The architecture of cloud computing comprises:
- Programmable management facilities
- Effective and efficient flexibility and scalability of the system
- Use of shared resources such as memory and hardware
- On-demand provision of services
Cloud computing can be categorized as:
- Software as a service (SaaS): software is provided by a third party and made available on demand, usually over the Internet; examples include spreadsheet tools and web content delivery services (Marr, 2015).
- Platform as a service (PaaS): enables customers to design new applications using provider-supplied platforms such as deployment platforms, configuration management and development tools.
- Infrastructure as a service (IaaS): provides operating systems, hardware and virtual machines that may be controlled through a service API, as sketched in the example after this list.
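To make the IaaS idea concrete, the following is a minimal sketch of controlling infrastructure through a service API. The endpoint, payload fields and response shape are hypothetical, invented purely for illustration; they do not correspond to any particular provider's API.

```python
import requests

# Hypothetical IaaS provisioning call: the URL, token and payload fields
# below are illustrative assumptions, not a real provider's API.
API = "https://cloud.example.com/v1/instances"
payload = {"image": "ubuntu-20.04", "cpus": 2, "memory_gb": 4}

resp = requests.post(API, json=payload,
                     headers={"Authorization": "Bearer <token>"})
resp.raise_for_status()
print(resp.json()["instance_id"])  # identifier of the newly created VM
```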
Big data refers to the large, varied volumes of data created by diverse sources such as human beings and machines. It requires new, innovative technology to scale, host and process the data analytically in order to derive real-time business insights relating to customers, risk, performance, management and productivity.
All the information and data gathered from social media and Internet-enabled devices are part of big data. Big data is characterized through four "Vs":
- Volume: the amount of data created
- Variety: big data comes from many different sources
- Velocity: data is generated very quickly and without stopping
- Veracity: the need to test the quality of data coming from various sources (Hashem et al., 2015)
Cloud computing uses virtualized hardware to provide an elastic, scalable and fault-tolerant environment for storing and processing large volumes of data. The cloud thus gives big data availability, scalability and fault tolerance; the two work hand in hand and are compatible concepts. Today big data is regarded as a valuable business opportunity. Newer companies such as Teradata and Cloudera focus on delivering database or big data as a service (DBaaS, BDaaS), while organizations such as Amazon, Google, General Electric, Microsoft and IBM also provide on-demand methods for their customers to consume big data (Neves et al., 2017).
[Figure: Definitions of big data based on an online survey of 154 global executives, April 2012]
Facebook is the world's largest social networking platform. According to one report it processes about 2.5 billion pieces of content and ingests more than 500 terabytes of data every day, including nearly 2.7 billion Like actions and about 300 million photos per day, and it scans roughly 105 terabytes of data every half hour (Constine, 2012).
Big Data in the Cloud
In 2012 Facebook revealed that it operates the largest Hadoop system, storing over 100 petabytes of data in a single Hadoop cluster.
[Figure: Facts about the Facebook mobile platform, according to sources (2017)]
More than 100 million people use Facebook and more than one billion pages are viewed every day, which results in the accumulation of a massive amount of data at Facebook. One of the biggest challenges Facebook faced was to develop a scalable way to store and process these bytes, since improving the user experience requires evaluating this historical data in full.
Another problem was data architecture: in its early phase Facebook used a centralized architecture in which a single computer system solved all the complex problems. Such centralized systems are ineffective at processing big bundles of data and are very costly.
The types of data were also an important drawback. Facebook generates many different kinds of data, including text, images and videos with various extensions, whereas the centralized data systems were based on structured data, i.e. data stored in fixed formats.
Defining relationships among such a large amount of data was not possible with traditional systems. There were several other problems as well, such as accuracy, confidentiality and scaling.
Hadoop is an open-source framework, based on Java programming, that supports the storage and processing of extremely large data sets in a distributed environment. The Apache Software Foundation sponsors this product under the Apache project (Shvachko et al., 2010).
Facebook used the Hadoop framework on a distributed system for large-scale processing and for the features of the map-reduce paradigm. Hadoop provides the facility of writing map-reduce programs in any language of the developer's choice. Facebook also adopted SQL as a paradigm to operate on and address large piles of data: the data stored in Facebook's Hadoop file system is mostly published as tables, which gives developers the advantage of easily exploring and operating on the required data sets using a small subset of SQL. Facebook operates on these data sets using map-reduce programs or standard query operators (McAfee and Brynjolfsson, 2012).
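As a minimal sketch of the map-reduce paradigm described above, the toy example below counts word occurrences across a handful of log lines in a single process. Real Facebook jobs run distributed across a Hadoop cluster; the sample data and function names here are invented for illustration.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce phase: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

logs = ["user liked photo", "user shared photo"]
pairs = (kv for line in logs for kv in mapper(line))
print(reduce_phase(pairs))  # {'user': 2, 'liked': 1, 'photo': 2, 'shared': 1}
```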
Hive is also open source; it is a petabyte-scale data warehousing framework, built entirely on Hadoop, that was developed by the Facebook Data Infrastructure Team. Hive has been very popular among Facebook users from the start. It is heavily used for summarization jobs, machine learning and business intelligence purposes.
Hive made analysis of large data sets scalable; scalability is a core concern at Facebook, and several engineering and non-engineering teams work on it continuously. Analysts at Facebook use it for ad hoc analysis along with several business intelligence applications. Several products, such as the reporting applications for the Facebook Ad Network and Facebook's Lexicon product, are based entirely on analytics (Fan et al., 2014). Technologies like Hive and Hadoop are responsible for providing the flexible infrastructure needed by these diverse users and applications, and for providing a cost-effective way of scaling with the amount of data generated at Facebook at an ever-increasing rate.
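To illustrate the kind of SQL-style ad hoc query described above, here is a small sketch using PyHive, a third-party Python client for HiveServer2 that the paper itself does not mention; the host, table and column names are invented for illustration.

```python
from pyhive import hive  # third-party client for HiveServer2

# Hypothetical ad hoc analysis: host, table and columns are assumptions.
conn = hive.Connection(host="hive.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("""
    SELECT ad_id, COUNT(*) AS impressions
    FROM ad_impressions
    WHERE dt = '2017-08-16'
    GROUP BY ad_id
    ORDER BY impressions DESC
    LIMIT 10
""")
for ad_id, impressions in cursor.fetchall():
    print(ad_id, impressions)
```

Under the hood Hive compiles such a query into map-reduce jobs, which is what makes a small subset of SQL sufficient for most exploration.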
[Figure: Hive system architecture (Thusoo et al., 2010)]
The figure above illustrates the flow of data from the source systems to the Facebook warehouse. As shown, Facebook has two sources of data: the federated MySQL tier, which comprises all data related to the Facebook site, and the web tier, which holds the log data.
The data from the web tier is placed in a set of clusters named Scribe-Hadoop (scribeh). These clusters consist of Scribe servers designed to run on Hadoop clusters: logs coming from different web servers are aggregated by the Scribe servers and placed in the Hadoop cluster in the form of HDFS files. A trade-off exists between latency and compression, which arises when exploring the possibility of compressing the data in the web tier before it is transferred to the scribeh clusters. Periodically this data is further compressed by copier jobs and placed in the associated Hive-Hadoop clusters.
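The following toy sketch mimics that aggregation step, batching log lines from web servers into a single compressed file. The batch handling, file naming and sample lines are assumptions for illustration; the latency/compression trade-off shows up as the choice of batch size.

```python
import gzip
import time
from pathlib import Path

def flush_batch(lines, out_dir):
    # Write the buffered log lines as one gzip-compressed file; larger
    # batches compress better but increase delivery latency.
    if not lines:
        return
    name = f"weblog-{int(time.time())}.gz"
    with gzip.open(Path(out_dir) / name, "wt") as f:
        f.writelines(line + "\n" for line in lines)
    lines.clear()

buffer = []
for line in ["GET /home 200", "GET /photo 200"]:  # stand-in log stream
    buffer.append(line)
flush_batch(buffer, ".")
```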
As shown above, once the data is stored in these clusters it is available for consumption by downstream processes.
There are two kinds of Hive-Hadoop clusters available:
- Production Hive-Hadoop clusters: perform jobs with very strict delivery deadlines
- Ad hoc Hive-Hadoop clusters: perform jobs that have lower priority
All the replication jobs performed on the data stored in these clusters rely on a log of the Hive commands that were submitted to the production Hive-Hadoop clusters (Antonopoulos and Gillam, 2010).
Finally, the results of these jobs are either kept in the cluster for further analysis in the future or loaded back into the federated MySQL tier for use by Facebook users. A sketch of the command-replay idea behind this replication follows.
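The sketch below illustrates replication by command replay as described above: commands logged on the production cluster are re-executed elsewhere. The log format and the stand-in `print` target are assumptions made for illustration.

```python
import io

def replay_commands(log, execute):
    # Re-run each logged Hive command against another cluster.
    for command in log:
        command = command.strip()
        if command:
            execute(command)  # e.g. submit to the ad hoc Hive cluster

sample_log = io.StringIO("INSERT OVERWRITE TABLE t SELECT * FROM s\n")
replay_commands(sample_log, print)  # 'print' stands in for real submission
```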
Facebook generates a large amount of data every day, alongside the existing historical data kept to support historical analysis. The production cluster is responsible for storing one month's worth of data, and data beyond that period is stored in the ad hoc cluster. Since the data is very large, it is generally compressed, by a factor of 6-7 in most cases. Hadoop also allows users to compress data with codecs of their choice (Jin et al., 2015).
[Figure: Variety of data in Facebook (Minelli, Chambers and Dhiraj, 2012)]
A PAX-based compression scheme [8] has also been introduced alongside the gzip method; by compressing the tables within Hive column-wise rather than row-wise, it achieves 10%-30% better compression than gzip alone. The small experiment below shows why.
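This sketch gzip-compresses the same toy table twice, once laid out row by row and once column by column, to show why a columnar (PAX-style) layout compresses better: values within a column are self-similar, so the compressor finds longer repeats. The data is invented.

```python
import gzip

rows = [("click", 1, "US"), ("click", 1, "US"), ("view", 2, "DE")] * 1000

# Row-major layout: one line per record.
row_major = "\n".join(",".join(map(str, r)) for r in rows).encode()
# Column-major layout: one line per column.
col_major = "\n".join(",".join(str(r[i]) for r in rows)
                      for i in range(3)).encode()

print(len(gzip.compress(row_major)), len(gzip.compress(col_major)))
# The column-major stream typically compresses noticeably smaller.
```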
Facebook keeps 3 copies of each HDFS file in order to protect the data against node failures. More recently, Facebook has been using erasure codes, which reduce the effective replication factor to about 2.2 by storing fewer full copies of the data together with parity blocks that serve as error-correction codes.
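A small calculation shows how erasure coding cuts storage overhead relative to plain replication. The (10, 4) Reed-Solomon layout below is an illustrative parameter choice, not necessarily the one Facebook deployed; mixed schemes that keep some full replicas plus parity land in between, which is how an effective factor such as 2.2 can arise.

```python
def replication_overhead(copies):
    # Plain replication: 3 copies cost 3.0x the raw data size.
    return float(copies)

def erasure_overhead(data_blocks, parity_blocks):
    # Erasure coding: (data + parity) blocks stored per data_blocks of raw data.
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))   # 3.0
print(erasure_overhead(10, 4))   # 1.4
```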
Big data provides both opportunities and challenges to a business. It should be processed and analyzed properly, and from time to time, to extract positive value that drives positive changes or influences business decisions.
Definition: analytics refers to finding the meaningful patterns present in data; for business, analytics is defined as the extensive use of data to derive facts on which business decisions and actions are based (Gandomi and Haider, 2015).
Analytics helps to optimize processes and to aggregate internal data with external data. It helps a firm meet the demands of stakeholders, manage risk, handle very large data sets and enhance the overall performance of the organization by transforming its information into intelligence.
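As a toy illustration of aggregating internal data with an external source, the pandas sketch below joins invented internal sales figures with invented external market sizes; every name and number is an assumption for illustration only.

```python
import pandas as pd

internal = pd.DataFrame({"region": ["EU", "US"], "sales": [120, 200]})
external = pd.DataFrame({"region": ["EU", "US"], "market_size": [1000, 1500]})

# Join internal and external data on the shared key, then derive a metric.
merged = internal.merge(external, on="region")
merged["market_share"] = merged["sales"] / merged["market_size"]
print(merged)
```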
Table: Big data issues, existing solutions, and their advantages and disadvantages (Sivarajah et al., 2017; Yang et al., 2017)

| Issue | Existing solutions | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Security | SLAs and data encryption | Data is well encrypted | Querying encrypted data remains time-consuming |
| Heterogeneity | Big data systems can deal with varied data arriving at different velocities | Most varieties of data are covered | Handling varied data at different velocities is very difficult |
| Privacy | User consent; de-identification | Users are provided with reasonable privacy | Many de-identification mechanisms can be reverse-engineered |
| Data governance | Data governance documents | Define data access policies; specify roles; define the data life cycle | Defining the data life cycle is not easy; enforcing governance policies may have counterproductive effects |
| Disaster recovery | Recovery plans | Define the methods and locations for data recovery | Generally there is only a single place where the data is secured |
| Data uploading | Uploading over the Internet; shipping HDDs to cloud providers | Shipping HDDs to the cloud provider is faster than uploading | Shipped HDDs risk physical damage, while the uploading process is very time-consuming |
| Elasticity | Techniques such as resizing, replication and live migration | Make the system able to accommodate data peaks | Assessment of load variation is mostly manual rather than automated |
Conclusion
As data increases daily, systems supporting big data, and analytic tools in particular, provide a way to store and process these huge volumes of data. The cloud strengthens big data solutions by providing a fault-tolerant and scalable environment. Big data is a powerful approach that improves the decision making of any organization and is a source of business intelligence, but it still faces challenges regarding security mechanisms, the handling of varied data types and the implementation of elasticity, and special efforts must be made to overcome them. In particular:
- Consider all the components of big data, not only its size: factors like veracity and velocity also affect the IT infrastructure, so IT needs to work carefully on these aspects in order to attain its goals and objectives.
- Implement a single big data strategy across the entire organization: a team or an individual should be appointed to identify the problems and challenges preventing big data implementation.
- Accept the fact that big data is not just a passing trend: strategic plans should be made to unlock the entire potential of big data, as it will help the sustained growth of the organization.
References:
Antonopoulos, N. and Gillam, L., 2010. Cloud computing. London: Springer.
Constine, J., 2012. How big is Facebook's data? 2.5 billion pieces of content and 500+ terabytes ingested every day. TechCrunch, 22 August 2012.
Fan, J., Han, F. and Liu, H., 2014. Challenges of big data analysis. National science review, 1(2), pp.293-314.
Gandomi, A. and Haider, M., 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), pp.137-144.
Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A. and Khan, S.U., 2015. The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, pp.98-115.
Jin, X., Wah, B.W., Cheng, X. and Wang, Y., 2015. Significance and challenges of big data research. Big Data Research, 2(2), pp.59-64.
Marr, B., 2015. '7 Amazing companies that really get Big Data', Big Data case study collection, accessed 16 August 2017.
McAfee, A. and Brynjolfsson, E., 2012. Big data: the management revolution. Harvard business review, 90(10), pp.60-68.
Minelli, M., Chambers, M. and Dhiraj, A., 2012. Big data, big analytics: emerging business intelligence and analytic trends for today’s businesses. John Wiley & Sons.
Neves, P.C., Schmerl, B., Bernardino, J. and Cámara, J., Big Data in Cloud Computing: features and issues.
Shvachko, K., Kuang, H., Radia, S. and Chansler, R., 2010, May. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on (pp. 1-10). IEEE.
Sivarajah, U., Kamal, M.M., Irani, Z. and Weerakkody, V., 2017. Critical analysis of Big Data challenges and analytical methods. Journal of Business Research, 70, pp.263-286.
Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sen Sarma, J., Murthy, R. and Liu, H., 2010, June. Data warehousing and analytics infrastructure at Facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 1013-1020). ACM.
Yang, C., Huang, Q., Li, Z., Liu, K. and Hu, F., 2017. Big Data and cloud computing: innovation opportunities and challenges. International Journal of Digital Earth, 10(1), pp.13-53.