Data Warehousing and the BI System
Data warehousing is the practice of collecting and managing data from various sources to provide meaningful information to the business. It is the process of gathering data, transforming it into useful information, and making it available to different users and clients in a timely manner. The data warehouse is the core of the BI system and is built for data analysis and reporting.
For the dataset used here, data was collected from various sources in raw format and converted into useful information for analytical purposes using BI tools. With distributed data processing platforms, the components of the data warehouse architecture improve incrementally, spanning data ingestion, processing with analytical tools, and finally data archiving.
In a data warehouse, data is first processed and transformed, and the transformed data is then loaded so that the meaningful, processed data is accessible through business intelligence tools such as Tableau. In this sector, a data warehouse supports decisions on product promotions, sales, and distribution. Because it stores huge volumes of data, users can analyze different time ranges and trends to predict future possibilities (Vaisman and Zimányi 2014).
Since data must be collected and stored from different states and area codes, data warehousing plays a vital role in organizing the raw data into a proper, understandable format; by storing this large volume of data on a cloud platform, the organization can also save considerably on cost.
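The extract–transform–load flow described above can be sketched in plain Python. This is an illustrative toy, not any particular warehouse's API; the column names, the $0.17/min rate, and the in-memory `warehouse` list are all hypothetical stand-ins.

```python
import csv
import io

# Raw extract: call records as they might arrive from a source system
# (hypothetical columns, for illustration only).
RAW = """state,area_code,total_day_minutes,churn
OH,415,179.8,False
NY,408,210.3,True
OH,510,95.1,False
"""

def extract(text):
    """Read raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Cast types and derive a rough day charge (assumed $0.17/min rate)."""
    out = []
    for r in rows:
        minutes = float(r["total_day_minutes"])
        out.append({
            "state": r["state"],
            "area_code": int(r["area_code"]),
            "total_day_minutes": minutes,
            "day_charge": round(minutes * 0.17, 2),
            "churn": r["churn"] == "True",
        })
    return out

def load(rows, warehouse):
    """'Load' the cleaned rows into an in-memory stand-in for the warehouse."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(RAW)), warehouse)
print(len(warehouse), warehouse[0]["day_charge"])
```

In a real pipeline the load target would be cloud warehouse storage rather than a Python list, but the three stages keep the same shape.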
MongoDB is a non-relational database that stores data as documents in a binary representation of JSON (BSON), while HBase is also a non-relational database but is column-oriented, storing data as key-value pairs.
Some of the crucial features of MongoDB include its document-oriented format with data stored as JSON/BSON, geospatial indexing, and SQL-like querying and indexing. The major features of HBase include its column-oriented format, its foundation on the Hadoop Distributed File System (HDFS), support for secondary indexing, and a flexible JRuby-based shell. Both databases are schema-free.
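The contrast between the two storage models can be illustrated without either database installed: the same customer record as a MongoDB-style nested document versus HBase-style cells keyed by (row key, "column family:qualifier"). The field names below are made up for illustration.

```python
# MongoDB-style: one self-contained, nested document (stored as BSON on disk).
document = {
    "_id": "cust-001",
    "state": "OH",
    "area_code": 415,
    "plans": {"international": False, "voice_mail": True},
}

# HBase-style: a flat map of (row key, "family:qualifier") -> value.
# Every cell is an independent key-value pair under a column family.
cells = {
    ("cust-001", "info:state"): b"OH",
    ("cust-001", "info:area_code"): b"415",
    ("cust-001", "plans:international"): b"false",
    ("cust-001", "plans:voice_mail"): b"true",
}

# Document model: one lookup returns the whole nested record.
print(document["plans"]["voice_mail"])

# Column model: each attribute is fetched by its own cell key.
print(cells[("cust-001", "plans:voice_mail")])
```

The document model keeps a record together for rich queries; the cell model spreads it across independent key-value pairs, which is what lets HBase distribute and scale columns so readily.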
Comparison between MongoDB and HBase
The major difference between the two lies in the query model: although both are NoSQL databases, MongoDB provides filtering, projection, and aggregation functions, whereas HBase offers only key-value access to data. A MongoDB deployment involves three main node roles, namely the primary node, secondary nodes, and a further replication node, while an HBase cluster comprises around ten node types, including data nodes, region servers, and more (Matallah, Belalem, and Bouamrane 2017).
Comparing the two models for our data, HBase is the better choice of non-relational database: it is widely used by large organizations running thousands of operations over their tables, range queries, and random reads and writes, and, crucially for analysis workloads, its structure can scale to trillions of records. The main advantage of HBase is that data can be easily stored and retrieved through Apache Spark, and then analyzed and visualized using programming languages such as Python and Java or BI tools such as Tableau.
Databricks is built on top of Apache Spark and is used by analysts and data scientists to organize work into notebooks, visualize data through dashboards, and build highly reliable, performant data pipelines; the platform delivers productivity at large scale and can produce output files in JSON format. Apache Spark itself is an open-source, general-purpose cluster computing framework. Although Spark has its own cluster management system, it can use Hadoop in two ways, for storing huge volumes of data and for processing it; in practice Spark typically uses Hadoop only for storage.
Key features of Spark include speed, as applications can run up to 100 times faster in memory, and support for multiple languages, including Java, Python, and Scala. Interesting use cases of Spark include machine learning and advanced analytics, visually interactive analysis, and real-time data processing (Zaharia et al. 2016).
The Databricks runtime environments include Spark, Delta Lake, PyTorch, and MLflow, and the Databricks cloud service runs on Azure and AWS. Databricks keeps the system up to date with the latest features for a smoother workflow, although its built-in visualization and analytical output are limited compared with dedicated BI tools.
Databricks and Apache Spark
Spark ships with different high-level built-in libraries that support programming languages such as R and Python. Using these, machine learning algorithms and data visualization techniques can be applied to the dataset seamlessly and reliably. Thanks to its architecture, Spark can process enormous amounts of data, which gives the organization appropriate value from it.
Nowadays data is represented using facts, plots, and figures, and is stored in very large quantities. With recent advances in technology, the use of smart devices is producing huge volumes of data, which has led to a data revolution. Recent studies by scientists and analysts state that, as smart-device usage grows, the amount and frequency of data will grow rapidly in the near future.
Hadoop is an open-source, freely available software platform that can store massive amounts of data and provides the computational power to handle many different tasks and jobs virtually. Its main use is to handle growing big data with ease and to provide solutions using data mining, predictive analytics, and machine learning algorithms. Hadoop can work with structured, unstructured, and semi-structured data (Mehta and Mehta 2016).
The Hadoop ecosystem consists of various components and services covering data ingestion, storage, analysis, and maintenance. The most commonly used core components are the Hadoop Distributed File System (HDFS), YARN, MapReduce, and Hadoop Common (Lydia et al. 2016).
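The MapReduce component named above can be illustrated with a minimal in-process simulation: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The record fields here are hypothetical; a real job would run on YARN over HDFS file splits.

```python
from collections import defaultdict

# Toy input: (state, day_calls) records, standing in for HDFS file splits.
records = [("OH", 100), ("NY", 80), ("OH", 120), ("NY", 90), ("OH", 110)]

def mapper(record):
    """Map phase: emit a (key, value) pair per input record."""
    state, calls = record
    yield state, calls

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: aggregate one key's values (here, a sum)."""
    return key, sum(values)

pairs = [pair for record in records for pair in mapper(record)]
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'OH': 330, 'NY': 170}
```

The real framework distributes each phase across nodes, but the three-stage dataflow is exactly this shape.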
With such technology and advanced features applied to large datasets, any existing database can be suitably mined and important information extracted from it, which is useful for decision making by the authorities. Using Hadoop it is possible to predict churn from many factors; churn is important because it directly affects the service's profitability. Hence, using predictive analytics and machine learning algorithms, it is possible to analyze customer requirements along several dimensions. Major factors influencing churn include area code, voice mail plan, international plan, and more.
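As a concrete illustration of how churn relates to such factors, a churn rate can be computed per value of a factor. A pure-Python sketch over a made-up sample follows (the real analysis would run over the full dataset in Hadoop or Spark):

```python
from collections import defaultdict

# Hypothetical sample: (international_plan, churned) per customer.
customers = [
    ("yes", True), ("yes", True), ("yes", False),
    ("no", False), ("no", False), ("no", True), ("no", False),
]

def churn_rate_by(factor_churn_pairs):
    """Churn rate for each value of a single factor."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for value, did_churn in factor_churn_pairs:
        totals[value] += 1
        churned[value] += did_churn  # bool adds as 0/1
    return {v: churned[v] / totals[v] for v in totals}

rates = churn_rate_by(customers)
print(rates)
```

Comparing rates across factor values (here, international plan holders versus non-holders) is the simplest form of the factor analysis described above.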
Also, using analysis in the Hadoop system, a telecommunication company can discover trends in different packages or plans across the various area codes of different states. The main advantage of using Hadoop components is that all calculation and processing is done at the database end, returning only the final result, which makes further analysis of the data more efficient.
Hadoop and Big Data Analytics
The figure below ranks the features from most important to least important.
Figure 1
Q1.
The number of customers in each area code was found in Tableau by placing the area codes on the rows shelf and the count of customers on the columns shelf.
| Area Code | Count |
|---|---|
| 408 | 838 |
| 415 | 1655 |
| 510 | 840 |

Table 1
Tableau automatically groups the area codes together, and it can be seen that there are three unique area codes: 408, 415, and 510.
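The same count Tableau produces can be checked programmatically. A sketch over a tiny hypothetical sample of the Area Code column (the real dataset yields 838, 1655, and 840):

```python
from collections import Counter

# Hypothetical sample of the Area Code column.
area_codes = [408, 415, 510, 415, 415, 408, 510]

# Counter gives the number of customers per area code,
# mirroring the rows/columns setup in Tableau.
counts = Counter(area_codes)
print(sorted(counts.items()))  # [(408, 2), (415, 3), (510, 2)]
```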
Q2.
| State | Avg Voice Mail Msgs | State | Avg Voice Mail Msgs |
|---|---|---|---|
| AK | 9.058 | MS | 9.446 |
| AL | 7.175 | MT | 9.338 |
| AR | 7.200 | NC | 9.088 |
| AZ | 9.469 | ND | 10.210 |
| CA | 9.000 | NE | 8.557 |
| CO | 8.364 | NH | 8.268 |
| CT | 9.068 | NJ | 9.191 |
| DC | 9.741 | NM | 6.452 |
| DE | 6.967 | NV | 6.803 |
| FL | 9.460 | NY | 6.133 |
| GA | 10.019 | OH | 10.192 |
| HI | 7.868 | OK | 8.656 |
| IA | 7.795 | OR | 6.179 |
| ID | 10.219 | PA | 7.489 |
| IL | 7.379 | RI | 7.323 |
| IN | 8.239 | SC | 8.683 |
| KS | 7.614 | SD | 10.433 |
| KY | 7.000 | TN | 7.038 |
| LA | 7.902 | TX | 6.444 |
| MA | 7.908 | UT | 8.500 |
| MD | 8.157 | VA | 7.039 |
| ME | 7.903 | VT | 6.548 |
| MI | 8.452 | WA | 4.561 |
| MN | 8.262 | WI | 8.282 |
| MO | 7.937 | WV | 8.726 |
|   |   | WY | 6.519 |

Table 2
Table 2 shows the average number of voice mail messages for each state. This is done in Tableau by dragging the states to the rows shelf and the average of the number of voice mail messages to the columns shelf.
Q3.
It is required to find, from the dataset, the customers who churned and also had both an international plan and a voice mail plan. From Tableau, it is found that 36 people in the whole dataset had both plans and also churned.
The calculation is visualized in Tableau by the following stepwise method:
First, Voice Mail Plan is dropped on the columns shelf and the count of International Plan on the rows shelf. This creates a side-by-side bar chart showing the count of customers with an international plan, split by whether or not they have a voice mail plan.
However, only the count of customers who have churned and hold both plans is required.
Hence, Churn is placed on the Filters shelf in Tableau and set to Yes.
From the data visualized it can be seen that, of the people who churned and had a voice mail plan, 36 also had an international plan.
In the visualization, International Plan, which takes the values yes or no, is placed on the Colour mark to aid quick analysis.
For Churn = Yes, the following table is cross-tabbed to Excel.
| International Plan | Voice Mail Plan: no | Voice Mail Plan: yes |
|---|---|---|
| no | 302 | 44 |
| yes | 101 | 36 |

Table 3
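The filter-then-cross-tab step can be reproduced in plain Python. A sketch over hypothetical records follows (on the full dataset this procedure yields the cell values in Table 3, including the 36 churners holding both plans):

```python
from collections import Counter

# Hypothetical records: (churned, international_plan, voice_mail_plan)
customers = [
    (True,  "yes", "yes"),
    (True,  "yes", "no"),
    (True,  "no",  "yes"),
    (True,  "no",  "no"),
    (True,  "yes", "yes"),
    (False, "yes", "yes"),
]

# Filter to Churn = Yes, then cross-tab international vs voice mail plan.
crosstab = Counter(
    (intl, vmail) for churned, intl, vmail in customers if churned
)
print(crosstab[("yes", "yes")])  # churned customers holding both plans
```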
Q4.
| Measure | Average |
|---|---|
| Avg. Account Length | 101.1 |
| Avg. Customer Service Calls | 1.6 |
| Avg. Number Vmail Messages | 8.1 |
| Avg. Number of Records | 1.0 |
| Avg. Total Day Calls | 100.4 |
| Avg. Total Day Charge | 30.6 |
| Avg. Total Day Minutes | 179.8 |
| Avg. Total Eve Calls | 100.1 |
| Avg. Total Eve Charge | 17.1 |
| Avg. Total Eve Minutes | 201.0 |
| Avg. Total Intl Calls | 4.5 |
| Avg. Total Intl Charge | 2.8 |
| Avg. Total Intl Minutes | 10.2 |
| Avg. Total Night Calls | 100.1 |
| Avg. Total Night Charge | 9.0 |
| Avg. Total Night Minutes | 200.9 |

Table 4
It is required to find the average value of the total night minutes.
For this analysis, Measure Names is dropped on the rows shelf and Measure Values on the Marks card, with each measure set to Average. This produces a table giving the average of every measure.
From the table it is found that the average of the total night minutes is 200.9.
Q5.
Same as Q3.
Q6.
The maximum number of voice mail messages needs to be found for three given states: OH, OK, and RI.
This analysis is done in the following stages.
First, State (from Dimensions) is placed on the rows shelf and Number of Voice Mail Messages on the columns shelf. The output shows each state with the sum of its voice mail messages; since we need the maximum number of messages per state, the measure is changed from Sum to Maximum via the field's drop-down menu. We get the following output:
| State | Max | State | Max |
|---|---|---|---|
| AK | 39 | MN | 46 |
| AL | 38 | MO | 48 |
| AR | 49 | MS | 45 |
| AZ | 45 | MT | 36 |
| CA | 39 | NC | 47 |
| CO | 40 | ND | 40 |
| CT | 47 | NE | 41 |
| DC | 39 | NH | 43 |
| DE | 35 | NJ | 43 |
| FL | 51 | NM | 42 |
| GA | 39 | NV | 44 |
| HI | 43 | NY | 37 |
| IA | 43 | OH | 46 |
| ID | 37 | OK | 43 |
| IL | 43 | OR | 50 |
| IN | 46 | PA | 40 |
| KS | 42 | RI | 43 |
| KY | 40 | SC | 46 |
| LA | 39 | SD | 48 |
| MA | 45 | TN | 36 |
| MD | 45 | TX | 40 |
| ME | 42 | UT | 44 |
| MI | 41 | VA | 42 |
|   |   | VT | 44 |
|   |   | WA | 41 |
|   |   | WI | 44 |
|   |   | WV | 50 |
|   |   | WY | 42 |

Table 5
However, the maximum number of messages is needed only for three states. So, the dimension State is dropped on the Filters shelf and only the required states, OH, OK, and RI, are selected to give the following output.
| State | Max |
|---|---|
| OH | 46 |
| OK | 43 |
| RI | 43 |

Table 6
Thus, the maximum numbers of messages for OH, OK, and RI are 46, 43, and 43 respectively.
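The maximum-per-state-with-a-state-filter step translates directly to a group-by-max. A sketch over hypothetical rows follows (on the full dataset this returns 46, 43, and 43 for OH, OK, and RI):

```python
# Hypothetical rows: (state, number_of_voice_mail_messages)
rows = [
    ("OH", 12), ("OH", 46), ("OK", 43), ("OK", 7),
    ("RI", 43), ("RI", 20), ("NY", 37),
]

wanted = {"OH", "OK", "RI"}  # the state filter

# Group-by-max restricted to the filtered states.
maxima = {}
for state, msgs in rows:
    if state in wanted:
        maxima[state] = max(maxima.get(state, 0), msgs)

print(maxima)  # {'OH': 46, 'OK': 43, 'RI': 43}
```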
Q7.
The sum of day calls needs to be found for the customers who reside in states OH and NY and have not churned. The analysis is done in the following way:
First, State (from Dimensions) is dropped on the rows shelf and Total Day Calls on the columns shelf, with its measure set to Sum. The output gives the sum of day calls for all customers; since we need only the customers who have not churned, Churn is placed on the Filters shelf and set to False.
The following output is produced.
| State | Sum of Day Calls | State | Sum of Day Calls |
|---|---|---|---|
| AK | 4684 | MT | 5234 |
| AL | 7156 | NC | 5759 |
| AR | 4484 | ND | 5531 |
| AZ | 5971 | NE | 5448 |
| CA | 2656 | NH | 4611 |
| CO | 5485 | NJ | 4973 |
| CT | 6006 | NM | 5726 |
| DC | 4870 | NV | 5254 |
| DE | 5216 | NY | 6741 |
| FL | 5598 | OH | 6707 |
| GA | 4746 | OK | 5266 |
| HI | 5119 | OR | 6461 |
| IA | 4139 | PA | 3663 |
| ID | 6793 | RI | 5967 |
| IL | 5313 | SC | 4597 |
| IN | 6519 | SD | 4942 |
| KS | 5711 | TN | 4866 |
| KY | 5125 | TX | 5428 |
| LA | 4862 | UT | 6226 |
| MA | 5545 | VA | 7535 |
| MD | 5291 | VT | 6535 |
| ME | 4850 | WA | 5297 |
| MI | 5692 | WI | 6866 |
| MN | 6825 | WV | 9965 |
| MO | 5613 | WY | 6967 |
| MS | 4973 |   |   |

Table 7

But the sum of day calls is required only for states OH and NY, so State is placed on the Filters shelf and only OH and NY are kept.

| State | Sum of Day Calls |
|---|---|
| NY | 6741 |
| OH | 6707 |

Table 8
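Q7's combination of a churn filter, a state filter, and a sum can likewise be sketched in plain Python. The sample rows below are hypothetical (the full data gives 6741 for NY and 6707 for OH):

```python
# Hypothetical rows: (state, churned, total_day_calls)
rows = [
    ("NY", False, 100), ("NY", False, 90), ("NY", True, 50),
    ("OH", False, 120), ("OH", True, 80), ("CA", False, 70),
]

# Keep non-churned customers in NY or OH, then sum day calls per state.
sums = {}
for state, churned, day_calls in rows:
    if not churned and state in {"NY", "OH"}:
        sums[state] = sums.get(state, 0) + day_calls

print(sums)  # {'NY': 190, 'OH': 120}
```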
References:
Benedek, G., Lublóy, Á. and Vastag, G., 2014. The importance of social embeddedness: Churn models at mobile providers. Decision Sciences, 45(1), pp.175-201.
Chen, M., Mao, S. and Liu, Y., 2014. Big data: A survey. Mobile networks and applications, 19(2), pp.171-209.
Deardorff, A., 2016. Tableau (version. 9.1). Journal of the Medical Library Association, 104(2), pp.182-183.
Győrödi, C., Győrödi, R., Pecherle, G. and Olah, A., 2015, June. A comparative study: MongoDB vs. MySQL. In 2015 13th International Conference on Engineering of Modern Electric Systems (EMES) (pp. 1-6). IEEE.
Lydia, E.L. and Swarup, M.B., 2016. Analysis of Big data through Hadoop Ecosystem Components like Flume, MapReduce, Pig and Hive. International Journal of Computer Science Engineering (IJCSE), Vol. 5.
Matallah, H., Belalem, G. and Bouamrane, K., 2017. Experimental comparative study of NoSQL databases: HBASE versus MongoDB by YCSB. Comput. Syst. Sci. Eng, 32(4), pp.307-317.
Mehta, S. and Mehta, V., 2016. Hadoop ecosystem: An introduction. International Journal of Science and Research (IJSR), 5(6), pp.557-562.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S. and Xin, D., 2016. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1), pp.1235-1241
Nair, L., Shetty, S. and Shetty, S., 2016. Interactive visual analytics on Big Data: Tableau vs D3. js. Journal of e-Learning and Knowledge Society, 12(4).
Park, Y., Cafarella, M. and Mozafari, B., 2016, May. Visualization-aware sampling for very large databases. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE) (pp. 755-766). IEEE.
Uzunkaya, C., Ensari, T. and Kavurucu, Y., 2015. Hadoop ecosystem and its analysis on tweets. Procedia-Social and Behavioral Sciences, 195, pp.1890-1897.
Vaisman, A. and Zimányi, E., 2014. Data warehouse systems. Data-Centric Systems and Applications.
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J. and Ghodsi, A., 2016. Apache spark: a unified engine for big data processing. Communications of the ACM, 59(11), pp.56-65.