Data Management Architecture
A data warehouse is a relational database designed for query and analysis rather than for transaction processing (Agarwal and Dhar, 2014). It usually contains historical data derived from transaction data, though it can incorporate data from other sources as well. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users (Szewrański et al., 2017).
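The ETL stage just described can be illustrated with a small sketch. The table names, columns, and cleansing rules below are purely hypothetical, and an in-memory SQLite database stands in for both the source system and the warehouse:

```python
# Minimal ETL sketch: extract rows from a transactional source, transform
# them, and load them into a warehouse fact table. All names are invented.
import sqlite3

def extract(conn):
    # Pull raw order rows from the transactional (OLTP) source.
    return conn.execute("SELECT id, amount, region FROM orders").fetchall()

def transform(rows):
    # Cleanse and conform: drop rows with missing amounts, normalize regions.
    return [(rid, amt, region.strip().upper())
            for rid, amt, region in rows if amt is not None]

def load(conn, rows):
    # Append the conformed rows into the warehouse fact table.
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

# Demo: one in-memory database stands in for both systems.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, 9.5, " east "), (2, None, "west"), (3, 4.0, "west")])
db.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL, region TEXT)")
load(db, transform(extract(db)))
loaded = db.execute("SELECT id, region FROM fact_orders ORDER BY id").fetchall()
```

In a real deployment the extract and load steps would hit different systems, which is precisely the isolation of analytical from operational workloads discussed above.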
In order to discover trends in the business, analysts need large amounts of historical data. This is in marked contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive (Symons et al., 2017). A data warehouse's focus on change over time is what is meant by the term time variance.
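Time variance can be illustrated with a minimal sketch: instead of overwriting a record, the warehouse keeps dated snapshots, so an analyst can ask what a value was at any point in time. The customer, attribute, and dates below are invented for illustration:

```python
# Time-variance sketch: the warehouse retains dated snapshots rather than
# overwriting, so change over time remains queryable. Values are invented.
snapshots = [
    {"customer": "Acme", "segment": "SMB",        "as_of": "2023-01-01"},
    {"customer": "Acme", "segment": "Enterprise", "as_of": "2024-01-01"},
]

def segment_as_of(history, customer, date):
    # Return the latest snapshot on or before the requested date.
    rows = [r for r in history
            if r["customer"] == customer and r["as_of"] <= date]
    return max(rows, key=lambda r: r["as_of"])["segment"] if rows else None

past = segment_as_of(snapshots, "Acme", "2023-06-30")
now = segment_as_of(snapshots, "Acme", "2024-06-30")
```

An OLTP system would typically hold only the current row; the warehouse's snapshot history is what makes the first query answerable at all.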
A data lake is a collection of storage instances of various data assets, additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format (Ryan, 2015). The purpose of a data lake is to present an unrefined view of the data to only the most highly skilled analysts, enabling them to explore data refinement and analysis techniques free of any of the system-of-record compromises that may exist in a traditional analytical data store (for example, a data mart or data warehouse).
As noted by Michael and Miller (2013), a data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. An "enterprise data lake" (EDL) is essentially a data lake for enterprise-wide data storage and sharing.
Data Mart
A data mart is a repository of data designed to serve a particular community of knowledge workers.
The distinction between a data warehouse and a data mart can be confusing because the two terms are sometimes used, incorrectly, as synonyms. A data warehouse is a central repository for all of the organization's data (Kitchin, 2014). The goal of a data mart, by contrast, is to meet the particular demands of a specific group of users within the organization, such as human resource management (HRM). Generally, an organization's data marts are subsets of the organization's data warehouse.
Data warehouses (DW) are centralized data repositories that integrate data from various transactional, legacy, or external systems, applications, and sources (Kimball and Ross, 2013). The data warehouse provides an environment separate from the operational systems and is designed entirely for decision support, analytical reporting, ad hoc queries, and data mining. This isolation and optimization enables queries to be performed without any impact on the systems that support the business's primary transactions, i.e. the transactional and operational systems (Khanapi et al., 2015).
Discussion of Concepts
Fundamentally, a data warehouse solves the ongoing problem of pulling data out of transactional systems quickly and efficiently and converting that data into actionable information. In addition, the data warehouse allows large and complex queries to be processed in a highly efficient manner (Jaber et al., 2015). Upon successful implementation of a data warehouse or data mart, a business will realize a variety of improvements and positive gains.
For example, to learn about an organization's sales data, one can design a data warehouse that focuses on sales. Using this data warehouse, the organization can answer questions like "Who was our best customer for this item last year?"
Another example could be: "Find the total sales for all customers last month/year, or retrieve any particular month's sales." Additionally, a business transaction can be broken down into facts, such as the number of products ordered and the price paid for them, and into dimensions, such as order date, customer name, product number, and so on.
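The split into facts and dimensions described above can be sketched as a miniature star schema. The tables and values below are invented, and pandas stands in for the warehouse query engine; the join-and-aggregate at the end answers a "best customer" question of the kind mentioned earlier:

```python
# Star-schema sketch: a fact table of measures (quantity, price paid)
# keyed to dimension tables (customer, product). All data are invented.
import pandas as pd

dim_customer = pd.DataFrame({"customer_id": [1, 2],
                             "customer_name": ["Alice", "Bob"]})
dim_product = pd.DataFrame({"product_id": [10, 20],
                            "product_name": ["Widget", "Gadget"]})
fact_sales = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06"],
    "customer_id": [1, 2, 1],
    "product_id": [10, 10, 20],
    "quantity": [3, 1, 2],
    "price_paid": [30.0, 10.0, 50.0],
})

# A typical analytical query: total sales per customer name.
report = (fact_sales.merge(dim_customer, on="customer_id")
                    .groupby("customer_name")["price_paid"].sum())
best_customer = report.idxmax()  # customer with the highest total sales
```

The facts carry the numbers being aggregated; the dimensions supply the human-readable context that the question is phrased in.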
The primary advantage of a data lake is the centralization of disparate content sources. Once gathered (out of their "data silos"), these sources can be combined and processed using big data, search, and analytics techniques that would otherwise have been impossible. The disparate content sources will often contain proprietary and sensitive information, which requires implementation of appropriate security measures in the data lake (Dewan et al., 2014).
A further data lake benefit is that security measures in the data lake can be assigned in a way that grants access to certain information to users of the data lake who do not have access to the original content source. These users are entitled to the information, yet are unable to access it at its source for one reason or another.
Some users may not need to work with the data in the original content source at all, but only consume the data resulting from processes built on those sources. There may be a licensing limit on the original content source that prevents some users from obtaining their own credentials (Beers et al., 2016). In other cases, the original content source has been locked down, is obsolete, or will soon be decommissioned, yet its content remains valuable to users of the data lake.
Since data marts are optimized to look at data in a particular way, the design process tends to start with an analysis of user needs. In contrast, a data warehouse's design process tends to start with an analysis of what data already exists and how it can be collected and managed so that it can be used later (Bates et al., 2014). A data warehouse tends to be a strategic but somewhat open-ended concept; a data mart tends to be tactical and aimed at meeting an immediate need.
This section of the study explores the usefulness of a data mining tool, specifically the RapidMiner software. In order to understand the application of the tool, a data set named white-wines.csv has been considered. The data set consists of 12 different properties of wines together with each wine's quality rank. The analyst has performed exploratory data analysis as well as regression analysis to identify the top 5 properties of wine that influence the quality ranking given by the quality tester, and to understand how those top 5 aspects influence the quality.
Data Warehouse
The first step in understanding a given data set with the RapidMiner tool is to carry out exploratory data analysis. In RapidMiner, every analysis is done in two phases: first a process is designed, and then the analyst runs the process to obtain the result. The following figure shows the process designed for exploratory data analysis.
The above table shows the nature of each of the aspects that define the quality of a wine, as well as the quality rank. For example, for the first aspect, fixed acidity, it can be seen that the data set contains no missing data. This aspect represents the tartaric acid content of a wine in grams per litre. The figure shows that the minimum volume of tartaric acid in any wine sample is 3.8 grams and the maximum is 14.2 grams, and that on average a sample contains 6.855 grams. All the other variables can be read from the figure in the same way. While the figure above describes the nature of the data set, to understand whether each of these aspects matters for the quality ranking, the analyst designed a scatter diagram for each of the 12 aspects. The scatter diagrams are shown below:
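As an aside, the per-attribute statistics just described (minimum, maximum, mean, and missing counts) can also be computed outside RapidMiner. The following is a minimal pandas sketch using made-up numbers rather than the real white-wines.csv values:

```python
# Exploratory-statistics sketch: min, max, mean, and missing count per
# attribute. The sample values are invented, not taken from white-wines.csv.
import pandas as pd

wines = pd.DataFrame({
    "fixed acidity": [6.3, 7.2, 8.1, 5.9],
    "alcohol":       [9.5, 10.1, 11.3, 12.0],
    "quality":       [5, 6, 6, 7],
})

# One row per attribute, mirroring RapidMiner's statistics view.
stats = pd.DataFrame({
    "min": wines.min(),
    "max": wines.max(),
    "mean": wines.mean(),
    "missing": wines.isna().sum(),
})
```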
In all of the scatter plots above, quality is measured on the x-axis and the other variables on the y-axis (Jaber et al., 2015). The color coding from blue to red indicates minimum rank to maximum rank with reference to a particular aspect. From these scatter plots, an aspect can be regarded as a key factor if its scatter plot contains more red points and its spread is larger. On this basis, alcohol, pH, sulphates, total sulfur dioxide, and volatile acidity may be the top 5 key aspects that define the quality rank of a wine.
However, the scatter plots give only an indication. To identify the 5 key aspects, further analysis is needed, so the analyst performed correlation analysis. The following figure shows the process established to perform this correlation analysis.
The following figure shows the correlation matrix table, which indicates how each of the 12 aspects is associated with the quality ranking.
The correlation value lies between -1 and 1 (Kitchin, 2014). A correlation value of 1 means the factors are perfectly positively correlated; -1 represents the opposite. From the above table it can be concluded that alcohol, pH, sulphates, density, and chloride are the top 5 key aspects. Of these, the first three are positively associated with the quality ranking and the last two are negatively correlated.
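The same screening step, ranking attributes by the absolute value of their correlation with quality, can be sketched outside RapidMiner. The data below are synthetic stand-ins for white-wines.csv, so only the mechanics, not the actual correlation values, carry over:

```python
# Attribute-screening sketch: rank attributes by |correlation with quality|.
# Synthetic data; the true white-wines.csv correlations are not reproduced.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
quality = rng.normal(6, 1, n)
wines = pd.DataFrame({
    "alcohol":   quality * 0.8 + rng.normal(0, 0.5, n),   # strongly positive
    "chlorides": -quality * 0.6 + rng.normal(0, 0.5, n),  # clearly negative
    "pH":        rng.normal(3.2, 0.1, n),                 # unrelated noise
    "quality":   quality,
})

corr = wines.corr()["quality"].drop("quality")
# Rank by magnitude: a strong negative correlation matters as much as a
# strong positive one, which is why density and chloride still make the top 5.
top = corr.abs().sort_values(ascending=False).index.tolist()
```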
The preceding section identified which 5 key aspects need to be considered in order to judge the quality of a wine. This section is designed to understand how those 5 key aspects relate to the quality rank of a wine. Here, the researcher performed linear regression analysis with the RapidMiner data mining tool.
Data Lake
The process designed for this analysis is shown below:
In the above process, the analyst first incorporates an operator named "Set Role". This is used to define the quality variable as the label, or target, variable; in regression analysis this is often termed the dependent variable (Agarwal and Dhar, 2014). The second operator, "Select Attributes", is used to remove all attributes other than the chosen 5 aspects. Finally, the Linear Regression operator is used to perform the analysis. The outcome table is shown below:
The first column of the above table shows the parameter details, the second column shows the coefficient value of each chosen parameter as well as the intercept value, and the p-value column shows the significance level of each of the 5 chosen parameters.
In case of regression analysis, the regression equation can be defined as:
Quality = intercept + a*alcohol + b*sulphates + c*pH + d*density + e*chloride
The above table shows that every p-value is below the 0.05 significance level; hence all of these parameters are significant. The coefficient of alcohol is 0.335, which means that for a one-unit increase in alcohol level, the predicted wine quality increases by 0.335 units. Thus, except for chloride, a one-unit increase in any of these factors raises the wine quality level, whereas a one-unit increase in chloride lowers the wine quality by 2.238 units.
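The fit and the coefficient reading can be sketched with ordinary least squares in NumPy. The data are synthetic, generated from an assumed relationship that reuses the 0.335 and -2.238 figures purely for illustration, so the recovered coefficients demonstrate the interpretation rather than reproduce the report's table:

```python
# OLS regression sketch. Synthetic data generated from an assumed
# relationship; only the mechanics and the coefficient reading carry over.
import numpy as np

rng = np.random.default_rng(1)
n = 300
alcohol = rng.normal(10.5, 1.2, n)
chlorides = rng.normal(0.045, 0.01, n)
# Assumed generating relationship (for the demo only):
quality = 2.0 + 0.335 * alcohol - 2.238 * chlorides + rng.normal(0, 0.1, n)

# Design matrix with an intercept column, matching
# Quality = intercept + a*alcohol + ... in the text.
X = np.column_stack([np.ones(n), alcohol, chlorides])
coef, *_ = np.linalg.lstsq(X, quality, rcond=None)
intercept, a_alcohol, e_chloride = coef

# Reading a coefficient: raising alcohol by one unit, holding the other
# predictors fixed, shifts the predicted quality by exactly a_alcohol.
delta = (intercept + a_alcohol * 11.0 + e_chloride * 0.045) - \
        (intercept + a_alcohol * 10.0 + e_chloride * 0.045)
```

Note that coefficients describe additive unit changes in the prediction, not multiplicative "times" changes.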
In this section, the analyst used the Tableau Desktop tool to visualize the given data set, whistler-daily-snowfall.csv. This data set was prepared from the historical daily snowfall records at Whistler, BC, Canada over the period July 1, 1972 to December 31, 2009.
In Tableau, prior to designing the dashboard shown below, the analyst designed an individual sheet for each of the variables in the data set:
The dashboard is designed around the average value of each of the parameters in the data set. Looking at the temperature variables, between 1985 and 1995 the average minimum temperature, average maximum temperature, and average mean temperature fell significantly. Before 1985 and after 1995, however, the average temperatures increased.
Similarly, looking at the total snow volume and the snow-on-ground volume, the volume of snow has been declining from 2002 onward.
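The aggregation behind such a dashboard, collapsing daily records into yearly averages, can be sketched in pandas. The rows below are invented stand-ins for whistler-daily-snowfall.csv, and the column names are assumptions:

```python
# Dashboard-aggregation sketch: yearly averages from daily records.
# Rows and column names are invented stand-ins for the real data set.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["1985-01-10", "1985-02-11",
                            "1990-01-10", "1990-02-11"]),
    "min_temp": [-12.0, -8.0, -15.0, -13.0],
    "total_snow": [30.0, 10.0, 25.0, 35.0],
})

# Group the daily rows by calendar year and average each numeric measure,
# which is what plotting "average per year" in Tableau effectively does.
yearly = daily.groupby(daily["date"].dt.year).mean(numeric_only=True)
```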
References
Agarwal, R. and Dhar, V., 2014. Editorial—Big data, data science, and analytics: The opportunity and challenge for IS research.
Bates, D.W., Saria, S., Ohno-Machado, L., Shah, A. and Escobar, G., 2014. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7), pp.1123-1131.
Beers, A.C., Eldridge, M.W., Hanrahan, P.M. and Taylor, J.E., Tableau Software, Inc., 2016. Systems and methods for generating models of a dataset for a data visualization. U.S. Patent 9,292,628.
Dewan, S., Aggarwal, Y. and Tanwar, S., 2014. Review on Data Warehouse, Data Mining and OLAP Technology: As Prerequisite aspect of business decision-making activity.
Inmon, W.H. and Linstedt, D., 2014. Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault. Morgan Kaufmann.
Jaber, M.M., Ghani, M.K.A., Suryana, N., Mohammed, M.A. and Abbas, T., 2015. Flexible Data Warehouse Parameters: Toward Building an Integrated Architecture. International Journal of Computer Theory and Engineering, 7(5), p.349.
Khanapi, M., Ghani, A., Mustafa Musa, J. and Suryana, N., 2015. Telemedicine supported by data warehouse architecture. ARPN Journal of Engineering and Applied Sciences, 10.
Kimball, R. and Ross, M., 2013. The data warehouse toolkit: The definitive guide to dimensional modeling. John Wiley & Sons.
Kitchin, R., 2014. Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1(1), p.2053951714528481.
Michael, K. and Miller, K.W., 2013. Big data: New opportunities and new challenges [guest editors’ introduction]. Computer, 46(6), pp.22-24.
Ryan, J., 2015, September. Communicating research via data visualization. In National Data Integrity Conference 2015. Colorado State University Libraries.
Symons, D., Konczewski, A., Johnston, L.D., Frensko, B. and Kraemer, K., 2017. Enriching Student Learning with Data Visualization.
Szewrański, S., Kazak, J., Sylla, M. and Świąder, M., 2017. Spatial Data Analysis with the Use of ArcGIS and Tableau Systems. In The Rise of Big Spatial Data (pp. 337-349). Springer International Publishing.