Data Warehousing and the BI System
Data warehousing is the practice of collecting and managing data from various sources to provide meaningful information to the business. It is the process of gathering data, transforming it into useful information, and making it available to different users and clients in a timely manner. The data warehouse is the core of the BI system and is built for data analysis and reporting.
For the dataset used here, data was collected from various sources in raw format and converted into useful information for analytical purposes using BI tools. With distributed data processing platforms, the components of the data warehouse architecture improve incrementally, spanning data ingestion, processing with analytical tools, and finally data archiving.
In a data warehouse, data is first processed and transformed, and the transformed data is then loaded so that the meaningful, processed data is accessible through business intelligence tools such as Tableau. In this sector, a data warehouse supports decisions on product promotions, sales, and distribution. Because it stores huge volumes of data, users can analyze different time ranges and trends to predict future possibilities (Vaisman and Zimányi 2014).
Since data must be collected and stored from different states and area codes, data warehousing plays a vital role in organizing the raw data into a proper, understandable format; by storing this large volume of data on a cloud platform, the organization can also save considerably on cost.
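The extract–transform–load flow described above can be sketched in plain Python. This is an illustrative toy, not any particular warehouse's API; the column names, the $0.17/min rate, and the in-memory `warehouse` list are all hypothetical stand-ins.

```python
import csv
import io

# Raw extract: call records as they might arrive from a source system
# (hypothetical columns, for illustration only).
RAW = """state,area_code,total_day_minutes,churn
OH,415,179.8,False
NY,408,210.3,True
OH,510,95.1,False
"""

def extract(text):
    """Read raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Cast types and derive a rough day charge (assumed $0.17/min rate)."""
    out = []
    for r in rows:
        minutes = float(r["total_day_minutes"])
        out.append({
            "state": r["state"],
            "area_code": int(r["area_code"]),
            "total_day_minutes": minutes,
            "day_charge": round(minutes * 0.17, 2),
            "churn": r["churn"] == "True",
        })
    return out

def load(rows, warehouse):
    """'Load' the cleaned rows into an in-memory stand-in for the warehouse."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(RAW)), warehouse)
print(len(warehouse), warehouse[0]["day_charge"])
```

In a real pipeline the load target would be cloud warehouse storage rather than a Python list, but the three stages keep the same shape.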
MongoDB is a non-relational database that stores data as documents in a binary representation of JSON (BSON), while HBase is also a non-relational database but is column-oriented, storing data as key-value pairs.
Some of the crucial features of MongoDB include its document-oriented format with data stored as JSON/BSON, geospatial indexing, and SQL-like querying and indexing. The major features of HBase include its column-oriented format, its foundation on the Hadoop Distributed File System (HDFS), support for secondary indexing, and a flexible JRuby-based shell. Both databases are schema-free.
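The contrast between the two storage models can be illustrated without either database installed: the same customer record as a MongoDB-style nested document versus HBase-style cells keyed by (row key, "column family:qualifier"). The field names below are made up for illustration.

```python
# MongoDB-style: one self-contained, nested document (stored as BSON on disk).
document = {
    "_id": "cust-001",
    "state": "OH",
    "area_code": 415,
    "plans": {"international": False, "voice_mail": True},
}

# HBase-style: a flat map of (row key, "family:qualifier") -> value.
# Every cell is an independent key-value pair under a column family.
cells = {
    ("cust-001", "info:state"): b"OH",
    ("cust-001", "info:area_code"): b"415",
    ("cust-001", "plans:international"): b"false",
    ("cust-001", "plans:voice_mail"): b"true",
}

# Document model: one lookup returns the whole nested record.
print(document["plans"]["voice_mail"])

# Column model: each attribute is fetched by its own cell key.
print(cells[("cust-001", "plans:voice_mail")])
```

The document model keeps a record together for rich queries; the cell model spreads it across independent key-value pairs, which is what lets HBase distribute and scale columns so readily.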
Comparison between MongoDB and HBase
The major difference between the two lies in the query model: although both are NoSQL databases, MongoDB provides filtering, projection, and aggregation functions, whereas HBase offers only key-value access to data. A MongoDB deployment involves three main node roles, namely the primary node, secondary nodes, and a further replication node, while an HBase cluster comprises around ten node types, including data nodes, region servers, and more (Matallah, Belalem, and Bouamrane 2017).
Comparing the two models for our data, HBase is the better choice of non-relational database: it is widely used by large organizations running thousands of operations over their tables, range queries, and random reads and writes, and, crucially for analysis workloads, its structure can scale to trillions of records. The main advantage of HBase is that data can be easily stored and retrieved through Apache Spark, and then analyzed and visualized using programming languages such as Python and Java or BI tools such as Tableau.
Databricks is built on top of Apache Spark and is used by analysts and data scientists to organize work into notebooks, visualize data through dashboards, and build highly reliable, performant data pipelines; the platform delivers productivity at large scale and can produce output files in JSON format. Apache Spark itself is an open-source, general-purpose cluster computing framework. Although Spark has its own cluster management system, it can use Hadoop in two ways, for storing huge volumes of data and for processing it; in practice Spark typically uses Hadoop only for storage.
Key features of Spark include speed, as applications can run up to 100 times faster in memory, and support for multiple languages, including Java, Python, and Scala. Interesting use cases of Spark include machine learning and advanced analytics, visually interactive analysis, and real-time data processing (Zaharia et al. 2016).
The Databricks runtime environments include Spark, Delta Lake, PyTorch, and MLflow, and the Databricks cloud service runs on Azure and AWS. Databricks keeps the system up to date with the latest features for a smoother workflow, although its built-in visualization and analytical output are limited compared with dedicated BI tools.
Databricks and Apache Spark
Spark ships with different high-level built-in libraries that support programming languages such as R and Python. Using these, machine learning algorithms and data visualization techniques can be applied to the dataset seamlessly and reliably. Thanks to its architecture, Spark can process enormous amounts of data, which gives the organization appropriate value from it.
Nowadays data is represented using facts, plots, and figures, and is stored in very large quantities. With recent advances in technology, the use of smart devices is producing huge volumes of data, which has led to a data revolution. Recent studies by scientists and analysts state that, as smart-device usage grows, the amount and frequency of data will grow rapidly in the near future.
Hadoop is an open-source, freely available software platform that can store massive amounts of data and provides the computational power to handle many different tasks and jobs virtually. Its main use is to handle growing big data with ease and to provide solutions using data mining, predictive analytics, and machine learning algorithms. Hadoop can work with structured, unstructured, and semi-structured data (Mehta and Mehta 2016).
The Hadoop ecosystem consists of various components and services covering data ingestion, storage, analysis, and maintenance. The most commonly used core components are the Hadoop Distributed File System (HDFS), YARN, MapReduce, and Hadoop Common (Lydia et al. 2016).
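The MapReduce component named above can be illustrated with a minimal in-process simulation: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The record fields here are hypothetical; a real job would run on YARN over HDFS file splits.

```python
from collections import defaultdict

# Toy input: (state, day_calls) records, standing in for HDFS file splits.
records = [("OH", 100), ("NY", 80), ("OH", 120), ("NY", 90), ("OH", 110)]

def mapper(record):
    """Map phase: emit a (key, value) pair per input record."""
    state, calls = record
    yield state, calls

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: aggregate one key's values (here, a sum)."""
    return key, sum(values)

pairs = [pair for record in records for pair in mapper(record)]
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'OH': 330, 'NY': 170}
```

The real framework distributes each phase across nodes, but the three-stage dataflow is exactly this shape.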
With such technology and advanced features applied to large datasets, any existing database can be suitably mined and important information extracted from it, which is useful for decision making by the authorities. Using Hadoop it is possible to predict churn from many factors; churn is important because it directly affects the service's profitability. Hence, using predictive analytics and machine learning algorithms, it is possible to analyze customer requirements along several dimensions. Major factors influencing churn include area code, voice mail plan, international plan, and more.
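As a concrete illustration of how churn relates to such factors, a churn rate can be computed per value of a factor. A pure-Python sketch over a made-up sample follows (the real analysis would run over the full dataset in Hadoop or Spark):

```python
from collections import defaultdict

# Hypothetical sample: (international_plan, churned) per customer.
customers = [
    ("yes", True), ("yes", True), ("yes", False),
    ("no", False), ("no", False), ("no", True), ("no", False),
]

def churn_rate_by(factor_churn_pairs):
    """Churn rate for each value of a single factor."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for value, did_churn in factor_churn_pairs:
        totals[value] += 1
        churned[value] += did_churn  # bool adds as 0/1
    return {v: churned[v] / totals[v] for v in totals}

rates = churn_rate_by(customers)
print(rates)
```

Comparing rates across factor values (here, international plan holders versus non-holders) is the simplest form of the factor analysis described above.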
Also, using analysis in the Hadoop system, a telecommunication company can discover trends in different packages or plans across the various area codes of different states. The main advantage of using Hadoop components is that all calculation and processing is done at the database end, returning only the final result, which makes further analysis of the data more efficient.
Hadoop and Big Data Analytics
The figure below ranks the features from most important to least important.
Figure 1
Q1.
The number of customers in each area code was found in Tableau by placing the area codes on the rows shelf and the count of customers on the columns shelf.
| Area Code | Count |
|---|---|
| 408 | 838 |
| 415 | 1655 |
| 510 | 840 |

Table 1
Tableau automatically groups the area codes together, and it can be seen that there are three unique area codes: 408, 415, and 510.
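The same count Tableau produces can be checked programmatically. A sketch over a tiny hypothetical sample of the Area Code column (the real dataset yields 838, 1655, and 840):

```python
from collections import Counter

# Hypothetical sample of the Area Code column.
area_codes = [408, 415, 510, 415, 415, 408, 510]

# Counter gives the number of customers per area code,
# mirroring the rows/columns setup in Tableau.
counts = Counter(area_codes)
print(sorted(counts.items()))  # [(408, 2), (415, 3), (510, 2)]
```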
Q2.
| State | Avg Voice Mail Msgs | State | Avg Voice Mail Msgs |
|---|---|---|---|
| AK | 9.058 | MS | 9.446 |
| AL | 7.175 | MT | 9.338 |
| AR | 7.200 | NC | 9.088 |
| AZ | 9.469 | ND | 10.210 |
| CA | 9.000 | NE | 8.557 |
| CO | 8.364 | NH | 8.268 |
| CT | 9.068 | NJ | 9.191 |
| DC | 9.741 | NM | 6.452 |
| DE | 6.967 | NV | 6.803 |
| FL | 9.460 | NY | 6.133 |
| GA | 10.019 | OH | 10.192 |
| HI | 7.868 | OK | 8.656 |
| IA | 7.795 | OR | 6.179 |
| ID | 10.219 | PA | 7.489 |
| IL | 7.379 | RI | 7.323 |
| IN | 8.239 | SC | 8.683 |
| KS | 7.614 | SD | 10.433 |
| KY | 7.000 | TN | 7.038 |
| LA | 7.902 | TX | 6.444 |
| MA | 7.908 | UT | 8.500 |
| MD | 8.157 | VA | 7.039 |
| ME | 7.903 | VT | 6.548 |
| MI | 8.452 | WA | 4.561 |
| MN | 8.262 | WI | 8.282 |
| MO | 7.937 | WV | 8.726 |
|   |   | WY | 6.519 |

Table 2
Table 2 shows the average number of voice mail messages for each state. This is done in Tableau by dragging the states to the rows shelf and the average of the number of voice mail messages to the columns shelf.
Q3.
It is required to find, from the dataset, the customers who churned and also had both an international plan and a voice mail plan. From Tableau, it is found that 36 people in the whole dataset had both plans and also churned.
The calculation is visualized in Tableau by the following stepwise method:
First, Voice Mail Plan is dropped on the columns shelf and the count of International Plan on the rows shelf. This creates a side-by-side bar chart showing the count of customers with an international plan, split by whether or not they have a voice mail plan.
However, only the count of customers who have churned and hold both plans is required.
Hence, Churn is placed on the Filters shelf in Tableau and set to Yes.
From the data visualized it can be seen that, of the people who churned and had a voice mail plan, 36 also had an international plan.
In the visualization, International Plan, which takes the values yes or no, is placed on the Colour mark to aid quick analysis.
For Churn = Yes, the following table is cross-tabbed to Excel.
| International Plan | Voice Mail Plan: no | Voice Mail Plan: yes |
|---|---|---|
| no | 302 | 44 |
| yes | 101 | 36 |

Table 3
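The filter-then-cross-tab step can be reproduced in plain Python. A sketch over hypothetical records follows (on the full dataset this procedure yields the cell values in Table 3, including the 36 churners holding both plans):

```python
from collections import Counter

# Hypothetical records: (churned, international_plan, voice_mail_plan)
customers = [
    (True,  "yes", "yes"),
    (True,  "yes", "no"),
    (True,  "no",  "yes"),
    (True,  "no",  "no"),
    (True,  "yes", "yes"),
    (False, "yes", "yes"),
]

# Filter to Churn = Yes, then cross-tab international vs voice mail plan.
crosstab = Counter(
    (intl, vmail) for churned, intl, vmail in customers if churned
)
print(crosstab[("yes", "yes")])  # churned customers holding both plans
```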
Q4.
| Measure | Average |
|---|---|
| Avg. Account Length | 101.1 |
| Avg. Customer Service Calls | 1.6 |
| Avg. Number Vmail Messages | 8.1 |
| Avg. Number of Records | 1.0 |
| Avg. Total Day Calls | 100.4 |
| Avg. Total Day Charge | 30.6 |
| Avg. Total Day Minutes | 179.8 |
| Avg. Total Eve Calls | 100.1 |
| Avg. Total Eve Charge | 17.1 |
| Avg. Total Eve Minutes | 201.0 |
| Avg. Total Intl Calls | 4.5 |
| Avg. Total Intl Charge | 2.8 |
| Avg. Total Intl Minutes | 10.2 |
| Avg. Total Night Calls | 100.1 |
| Avg. Total Night Charge | 9.0 |
| Avg. Total Night Minutes | 200.9 |

Table 4
It is required to find the average value of the total night minutes.
For this analysis, Measure Names is dropped on the rows shelf and Measure Values on the Marks card, with each measure set to Average. This produces a table giving the average of every measure.
From the table it is found that the average of the total night minutes is 200.9.
Q5.
Same as Q3.
Q6.
The maximum number of voice mail messages needs to be found for three given states: OH, OK, and RI.
This analysis is done in the following stages.
First, State (from Dimensions) is placed on the rows shelf and Number of Voice Mail Messages on the columns shelf. The output shows each state with the sum of its voice mail messages; since we need the maximum number of messages per state, the measure is changed from Sum to Maximum via the field's drop-down menu. We get the following output:
| State | Max | State | Max |
|---|---|---|---|
| AK | 39 | MN | 46 |
| AL | 38 | MO | 48 |
| AR | 49 | MS | 45 |
| AZ | 45 | MT | 36 |
| CA | 39 | NC | 47 |
| CO | 40 | ND | 40 |
| CT | 47 | NE | 41 |
| DC | 39 | NH | 43 |
| DE | 35 | NJ | 43 |
| FL | 51 | NM | 42 |
| GA | 39 | NV | 44 |
| HI | 43 | NY | 37 |
| IA | 43 | OH | 46 |
| ID | 37 | OK | 43 |
| IL | 43 | OR | 50 |
| IN | 46 | PA | 40 |
| KS | 42 | RI | 43 |
| KY | 40 | SC | 46 |
| LA | 39 | SD | 48 |
| MA | 45 | TN | 36 |
| MD | 45 | TX | 40 |
| ME | 42 | UT | 44 |
| MI | 41 | VA | 42 |
|   |   | VT | 44 |
|   |   | WA | 41 |
|   |   | WI | 44 |
|   |   | WV | 50 |
|   |   | WY | 42 |

Table 5
However, the maximum number of messages is needed only for three states. So, the dimension State is dropped on the Filters shelf and only the required states, OH, OK, and RI, are selected to give the following output.
| State | Max |
|---|---|
| OH | 46 |
| OK | 43 |
| RI | 43 |

Table 6
Thus, the maximum numbers of messages for OH, OK, and RI are 46, 43, and 43 respectively.
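The maximum-per-state-with-a-state-filter step translates directly to a group-by-max. A sketch over hypothetical rows follows (on the full dataset this returns 46, 43, and 43 for OH, OK, and RI):

```python
# Hypothetical rows: (state, number_of_voice_mail_messages)
rows = [
    ("OH", 12), ("OH", 46), ("OK", 43), ("OK", 7),
    ("RI", 43), ("RI", 20), ("NY", 37),
]

wanted = {"OH", "OK", "RI"}  # the state filter

# Group-by-max restricted to the filtered states.
maxima = {}
for state, msgs in rows:
    if state in wanted:
        maxima[state] = max(maxima.get(state, 0), msgs)

print(maxima)  # {'OH': 46, 'OK': 43, 'RI': 43}
```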
Q7.
The sum of day calls needs to be found for the customers who reside in states OH and NY and have not churned. The analysis is done in the following way:
First, State (from Dimensions) is dropped on the rows shelf and Total Day Calls on the columns shelf, with its measure set to Sum. The output gives the sum of day calls for all customers; since we need only the customers who have not churned, Churn is placed on the Filters shelf and set to False.
The following output is produced.
| State | Sum of Day Calls | State | Sum of Day Calls |
|---|---|---|---|
| AK | 4684 | MT | 5234 |
| AL | 7156 | NC | 5759 |
| AR | 4484 | ND | 5531 |
| AZ | 5971 | NE | 5448 |
| CA | 2656 | NH | 4611 |
| CO | 5485 | NJ | 4973 |
| CT | 6006 | NM | 5726 |
| DC | 4870 | NV | 5254 |
| DE | 5216 | NY | 6741 |
| FL | 5598 | OH | 6707 |
| GA | 4746 | OK | 5266 |
| HI | 5119 | OR | 6461 |
| IA | 4139 | PA | 3663 |
| ID | 6793 | RI | 5967 |
| IL | 5313 | SC | 4597 |
| IN | 6519 | SD | 4942 |
| KS | 5711 | TN | 4866 |
| KY | 5125 | TX | 5428 |
| LA | 4862 | UT | 6226 |
| MA | 5545 | VA | 7535 |
| MD | 5291 | VT | 6535 |
| ME | 4850 | WA | 5297 |
| MI | 5692 | WI | 6866 |
| MN | 6825 | WV | 9965 |
| MO | 5613 | WY | 6967 |
| MS | 4973 |   |   |

Table 7

But the sum of day calls is required only for states OH and NY, so State is placed on the Filters shelf and only OH and NY are kept.

| State | Sum of Day Calls |
|---|---|
| NY | 6741 |
| OH | 6707 |

Table 8
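Q7's combination of a churn filter, a state filter, and a sum can likewise be sketched in plain Python. The sample rows below are hypothetical (the full data gives 6741 for NY and 6707 for OH):

```python
# Hypothetical rows: (state, churned, total_day_calls)
rows = [
    ("NY", False, 100), ("NY", False, 90), ("NY", True, 50),
    ("OH", False, 120), ("OH", True, 80), ("CA", False, 70),
]

# Keep non-churned customers in NY or OH, then sum day calls per state.
sums = {}
for state, churned, day_calls in rows:
    if not churned and state in {"NY", "OH"}:
        sums[state] = sums.get(state, 0) + day_calls

print(sums)  # {'NY': 190, 'OH': 120}
```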
References:
Benedek, G., Lublóy, Á. and Vastag, G., 2014. The importance of social embeddedness: Churn models at mobile providers. Decision Sciences, 45(1), pp.175-201.
Chen, M., Mao, S. and Liu, Y., 2014. Big data: A survey. Mobile networks and applications, 19(2), pp.171-209.
Deardorff, A., 2016. Tableau (version. 9.1). Journal of the Medical Library Association, 104(2), pp.182-183.
Győrödi, C., Győrödi, R., Pecherle, G. and Olah, A., 2015, June. A comparative study: MongoDB vs. MySQL. In 2015 13th International Conference on Engineering of Modern Electric Systems (EMES) (pp. 1-6). IEEE.
Lydia, E.L. and Swarup, M.B., 2016. Analysis of Big data through Hadoop Ecosystem Components like Flume, MapReduce, Pig and Hive. International Journal of Computer Science Engineering (IJCSE), Vol. 5.
Matallah, H., Belalem, G. and Bouamrane, K., 2017. Experimental comparative study of NoSQL databases: HBASE versus MongoDB by YCSB. Comput. Syst. Sci. Eng, 32(4), pp.307-317.
Mehta, S. and Mehta, V., 2016. Hadoop ecosystem: An introduction. International Journal of Science and Research (IJSR), 5(6), pp.557-562.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S. and Xin, D., 2016. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1), pp.1235-1241
Nair, L., Shetty, S. and Shetty, S., 2016. Interactive visual analytics on Big Data: Tableau vs D3. js. Journal of e-Learning and Knowledge Society, 12(4).
Park, Y., Cafarella, M. and Mozafari, B., 2016, May. Visualization-aware sampling for very large databases. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE) (pp. 755-766). IEEE.
Uzunkaya, C., Ensari, T. and Kavurucu, Y., 2015. Hadoop ecosystem and its analysis on tweets. Procedia-Social and Behavioral Sciences, 195, pp.1890-1897.
Vaisman, A. and Zimányi, E., 2014. Data warehouse systems. Data-Centric Systems and Applications.
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J. and Ghodsi, A., 2016. Apache spark: a unified engine for big data processing. Communications of the ACM, 59(11), pp.56-65.