What is data analysis?
Slide 1:
One of the most important questions addressed in this presentation is: what is data analysis? Data analysis is crucial in any business firm. It is the process of examining a given data set in order to draw the most appropriate conclusions about the key question one wants to answer. It is usually a continuous process, in which data are collected and analysed while still under scrutiny, because the research tries to identify the patterns present in the entire body of collected data.
Tools such as MS Excel, Python, and RStudio are used in data analysis projects of every type.
In the current project, Python has been selected as the data analysis tool. Python has become the tool many researchers prefer when analysing a data set, because it is flexible and the language is easy for analysts to understand.
An ARIMA model will be used to forecast a given set of data.
Slide 2:
An ARIMA model normally entails three terms: p, d, and q. Each of these terms has a meaning and is essential to the model:
p – the order of the autoregressive (AR) term.
d – the number of differencing steps needed to make the time series stationary; for an already stationary time series, d = 0.
q – the order of the moving average (MA) term, i.e. the number of lagged forecast errors that enter the ARIMA model.
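As a concrete illustration of how these three terms are supplied in practice, here is a minimal sketch of fitting an ARIMA(p, d, q) model with the statsmodels library. The synthetic series and the order (1, 1, 1) are assumptions for illustration only, not the project's data or final model.

```python
# Minimal sketch: fitting an ARIMA(p, d, q) model with statsmodels.
# The series below is synthetic stand-in data, not the project's data set.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
sales = pd.Series(100 + np.cumsum(rng.normal(1, 5, 120)))  # trending monthly series

# order=(p, d, q): 1 AR lag, 1 difference, 1 lagged-error (MA) term.
fitted = ARIMA(sales, order=(1, 1, 1)).fit()
print(fitted.summary())
```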
Slide 3:
1) To investigate the concepts of time series and forecasting. The project will show how time series analysis and forecasting are intertwined. Forecasting uses past information about the matter under study to predict its future outcome, and time series forecasting draws on historical trends, cyclical analysis, and seasonality. These concepts will be applied to data from the Light IT consultancy firm to determine whether the ARIMA model will be effective for forecasting in the business sector.
2) To perform the requirement gathering needed for system development. The project will ensure adequate requirement gathering for the effective development of the ARIMA model using Python 3. This phase is critical: for the model to work, a system has to be developed that ensures the time series is stationary, and it is in this phase that the most important requirements of the project are discovered and the team shows it understands what is needed. The project therefore aims to do the requirement gathering properly, to avoid issues that would delay the project or cause it to fail.
3) To design and model the proposed ARIMA system, focusing on business customer data from the Light IT consultancy firm. The project aims to apply the ARIMA model in the business sector, so a well-designed system will ensure the model works effectively, the analysis is done with precision, and the results are interpreted appropriately, allowing entrepreneurs to make sound decisions for their business. The firm's data will be analysed in Python using an ARIMA model for time series analysis and forecasting.
4) To code the system using the Python programming language. Python allows one to work quickly and integrate systems effectively, which is why the project uses Python: to finish the analysis faster and obtain reliable results. Python also encourages clear, logical code.
5) To test and validate the ARIMA system code against the requirements, using business customer data from the Light IT firm. In this part, the project analyses the data with the proposed model so that clear interpretations can be made. The aim is to determine whether the model is effective for data analysis and to show that it can be used in the business sector to predict future outcomes through time series forecasting.
Slide 4:
The challenges involved in the project are:
- Hardware crashes. Hardware such as computers can crash due to internal drive malfunctions, which can lead to loss of data or important project information. Other hardware risks include external storage devices that might be lost or infected with malicious software.
- Power shortages. The data analysis software and machines depend on electricity, and whenever there is a power outage these processes cannot continue. A constant power supply is therefore necessary for this project.
- Data loss. This problem can arise from poor storage of the data set: the data might be lost because it was stored on a device infected by viruses, or because the storage device was stolen. No data has been lost in this project so far, and several backup plans have been put in place to keep it that way.
- Poor management of change. Change is inevitable in any project, and when team members are slow to accept it, helpful new ideas are not generated. This problem was experienced at one point in the project, when a difference in ideology caused friction; it was resolved, but to prevent a recurrence, communication among team members has been improved.
Slide 5:
This is the general equation for an ARIMA(p, d, q) model, written here for the (differenced) series:
Xt = β0 + β1Xt-1 + … + βpXt-p + εt + θ1εt-1 + … + θqεt-q
where:
β0 is a constant.
β1, …, βp are the autoregressive parameters to be estimated.
Xt is the observed value of the time series at time t (month t = 1, 2, 3, …), and Xt-1, …, Xt-p are the observed values at times t-1 up to t-p.
θ1, …, θq are the moving average parameters to be estimated, applied to the lagged errors εt-1, …, εt-q.
εt is the error term, assumed to have mean 0 and constant variance σ².
Slide 6:
Before proceeding with an ARIMA model, it is necessary to check the stationarity of the data involved in the study. Certain tests help with this, such as the Augmented Dickey-Fuller (ADF) test and the Phillips-Perron unit root test. The null hypothesis of both tests is the same: the data are not stationary, i.e. a unit root is present. The null hypothesis is rejected when the obtained p-value is less than 0.05.
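A brief sketch of running the ADF test with statsmodels is shown below; the series is a synthetic stand-in. (The Phillips-Perron test is not part of statsmodels; the third-party arch package provides one.)

```python
# Hedged sketch: Augmented Dickey-Fuller stationarity check with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
sales = pd.Series(100 + np.cumsum(rng.normal(1, 5, 120)))  # synthetic trending series

adf_stat, p_value, *_ = adfuller(sales)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: no unit root, so the series appears stationary.")
else:
    print("Fail to reject H0: a unit root is present, so the series is non-stationary.")
```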
Slide 7:
From the graph it can be seen that the series is not stationary, because the mean, variance, and covariance are not constant over the period. The differing means and variances are visible in the distribution of the peaks and troughs on the graph; they are not the same because the data contain a trend, with sales increasing over time. Because of this, time series analysis cannot be performed until the series is made stationary.
Slide 8:
From the ACF results it can be seen that the autocorrelations decay quickly from lag zero, which is a sign of stationary data. From the new graph of the differenced series we can also see that there is no trend, meaning that the mean and variance are stable over time and the data are now stationary. The data are now ready for the analysis needed to find the best ARIMA model for this series; a differencing and ACF sketch follows.
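The sketch below illustrates, on synthetic stand-in data, the differencing step and the ACF inspection described above; the variable names and the single difference are assumptions for illustration.

```python
# Illustrative sketch: difference a trending series and inspect its ACF.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(1)
sales = pd.Series(100 + np.cumsum(rng.normal(1, 5, 120)))  # synthetic trending series

differenced = sales.diff().dropna()  # first difference removes the trend
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
axes[0].plot(differenced)
axes[0].set_title("Differenced series (no visible trend)")
plot_acf(differenced, ax=axes[1])    # lags should decay quickly if stationary
plt.tight_layout()
plt.show()
```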
Slide 9:
- The best ARIMA model was found to be of order (9, 2, 0). This is the model recommended for implementation in Python 3 in the business sector, and the Light IT firm can also use it for any other analysis it wishes to carry out (a brief fitting sketch follows this list). This does not mean no other model could serve: other models can be used, depending entirely on a person's preference.
- The research showed that using an ARIMA model in Python 3 is a highly effective form of data analysis. When the data from the Light IT consultancy firm were tested with the model, the results were pleasing: the forecast was obtained in the most straightforward way possible. The analysis revealed that the data were not stationary, so differencing had to be applied to make them stationary. This suggests that other firms in the business sector could use the same model for their data analysis and keep their businesses on track.
- With the advancement of technology, software now works with code rather than with long structural sentences that take a great deal of time to write. This research showed that coding languages such as those used with Python may look difficult to handle but are among the best to use, which implies that any company should consider having an expert who is good at coding, to enable fast manipulation of the company's data.
- From the extensive discussion conducted in this study, it was found that ARIMA is the time series model that many practitioners use when conducting time series research; the other models are good, but not as good as ARIMA for this purpose. The model uses lagged values and moving average terms to smooth the time series under study, works on the assumption that the future depends on past incidences, and is broadly used in statistical and technical analysis to forecast data.
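The following is a minimal sketch of fitting the reported ARIMA(9, 2, 0) order and producing a forecast, as referenced in the first point above. Only the order comes from the slides; the series and the 12-step horizon are assumptions for illustration.

```python
# Sketch: fit the reported ARIMA(9, 2, 0) order and forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
sales = pd.Series(100 + np.cumsum(rng.normal(1, 5, 120)))  # synthetic stand-in data

fitted = ARIMA(sales, order=(9, 2, 0)).fit()  # p=9, d=2, q=0 per the slides
print(fitted.forecast(steps=12))              # 12 periods ahead (assumed horizon)
```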
Slide 10:
A few recommendations can be made based on the results:
- Companies should adopt software to boost their businesses; the world is advancing, and the analogue method of data storage and analysis is almost forgotten. Good software captures details in a firm that cannot be handled manually, or that would take far longer than expected if done manually. Tools such as Python not only perform the analysis but also store the results for future reference.
- The business sector should use coding languages to enable faster manipulation of data; since computers respond to precise instructions, code is an effective way to tell the computer what you need from it. This makes machine languages a very important asset for all businesses.
- It is wise for all companies to store records of their sales for future reference; a time may come when a company wants to know how it has been faring in general since it started. Stored data makes this possible, because without data a company cannot know whether it is doing well or where it needs to make adjustments.
Slide 1:
Regression analysis is the most widely used technique for fitting models to data. When a regression model is fit using ordinary least squares, we get a few statistics to describe a large set of data. These statistics can be highly influenced by a small set of points that differ from the bulk of the data. These points could be y-type outliers (vertical outliers) that do not follow the general model of the data, or x-type outliers (leverage points) that are systematically different from the rest of the explanatory data. We can also have points that are both leverage points and vertical outliers, sometimes referred to as bad leverage points. Collectively, we call any points of these kinds outliers.
There are several techniques for estimating the parameters in regression; one of them is ordinary least squares (OLS). Estimating parameters with OLS requires certain prescribed assumptions to be satisfied, with the errors typically independent and normally distributed with mean 0 and variance σ².
Autocorrelation: measures how a lagged version of a variable's values relates to the original version when time series data are considered.
Heteroscedasticity: refers to situations where the variance of the residuals is unequal over the range of measured values. When running a regression analysis, heteroscedasticity results in an unequal scatter of the residuals.
Multicollinearity: a statistical condition in which several independent variables in a model are correlated. Two variables are considered perfectly collinear if their correlation coefficient is +/- 1.0. Multicollinearity among independent variables results in less reliable statistical inferences.
Normality: the data under study must follow a normal distribution.
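As a hedged sketch of how these four assumptions can be checked on a fitted OLS model, the snippet below uses standard statsmodels and scipy diagnostics; the synthetic data and variable names are assumptions for illustration.

```python
# Sketch: checking autocorrelation, heteroscedasticity, multicollinearity,
# and normality on a fitted OLS model. The data here are synthetic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))       # constant + 2 regressors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)

model = sm.OLS(y, X).fit()
print("Durbin-Watson (autocorrelation):", durbin_watson(model.resid))
print("Breusch-Pagan p-value (heteroscedasticity):", het_breuschpagan(model.resid, X)[1])
print("VIFs (multicollinearity):",
      [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
print("Shapiro-Wilk p-value (normality of residuals):", stats.shapiro(model.resid).pvalue)
```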
Slide 2:
The objective deliverables under consideration are:
To define data analytics from the business and data-usage perspectives: defining exactly what data analytics is gives a broader perspective on the topic of discussion; in this case, data analytics is described as the analysis of data to reach a substantive conclusion on a given topic of study (Mehta and Pandit, 2018). This will be achieved through research and reading of materials to gain a precise understanding of data analytics in data processing and in the business decision-making process.
To justify the need for data analytics in the business decision-making process: this objective is tailored to identify the key gaps in conventional data processing, presenting analytics as a solution by comparing traditional data processing and its efficiency with data analytics and its efficiency in the business decision-making process.
To identify regression data analytics techniques and algorithms: identifying the diverse techniques used in regression analytics is key to tailoring the research toward the overall goal specified in the topic of study.
To demonstrate the benefits of regression algorithms in predictive analytics: once regression analytics has been demonstrated, the benefits drawn from it will be identified as the specific outcomes of this objective, through a hands-on demonstration of one or two regression algorithms used in data analytics.
Slide 3:
Regression analysis is one of the most important and commonly used statistical tools for examining the relationship between a dependent variable and one or more independent variables, with wide applications in finance, economics, medicine, and psychology. A regression model is generally written as
Y = Xβ + ε    (1)
where Y is the vector of the dependent variable, ε is the true residual vector, and X is the n × p design matrix. With β̂ denoting the estimator of β, the fitted residuals in (2) are given by
e = Y − Xβ̂    (2)
Regression analysis ordinarily uses the least-squares method for estimating the model parameters, under certain assumptions that must be fulfilled, such as normality of the errors with zero mean and constant variance, i.e. ε ∼ N(0, σ²).
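The following is a short worked sketch of equations (1) and (2), computing the OLS estimator and the fitted residuals directly in matrix form; the data are synthetic assumptions for illustration.

```python
# Worked sketch of equations (1) and (2): OLS estimate and fitted residuals.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # n x p design matrix
beta_true = np.array([2.0, 1.5, -0.5])
Y = X @ beta_true + rng.normal(size=n)          # Y = X beta + eps, eq. (1)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # OLS estimator (X'X)^(-1) X'Y
residuals = Y - X @ beta_hat                    # e = Y - X beta_hat, eq. (2)
print("beta_hat:", beta_hat)
print("residual sum of squares:", residuals @ residuals)
```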
Slide 4:
Outliers are inconsistent observations that deviate markedly from the majority of the observations in the data; they need proper handling, as they pose a genuine threat to the regression model and its estimated coefficients and can therefore give misleading results (Werner, 2019). Two kinds of outliers can occur in a regression dataset: observations with extreme values in the response are referred to as vertical outliers, while observations with extreme values in the explanatory variables are called leverage points.
Robust regression is a procedure for overcoming the problem of outliers and influential observations in data and limiting their influence on the regression results; not all so-called robust regression techniques possess this resistance in full. The basic goal of robust estimation is to provide reliable estimates and inferences for the unknown parameters while keeping outliers at bay. A robust method substitutes for OLS's sum of squared residuals some other function that is less affected by unusual observations. These methods first fit a regression to the data and then identify outliers as observations with large residuals. Efficiency, breakdown point, and bounded influence are three desirable features of robust methods. The breakdown point is the smallest fraction of contaminated observations that an estimator can encounter before producing an arbitrarily wrong result.
Least trimmed squares (LTS) is a highly robust and reasonably efficient estimator among the robust estimators available in the literature; it is obtained by minimizing the trimmed sum of squared residuals. The LTS estimator is a modified version of the LS estimator that fits the majority of the data while ignoring the extreme observations in the ordered data.
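To make the trimming idea concrete, here is a minimal toy sketch of LTS via concentration steps (fit, keep the h observations with the smallest squared residuals, refit). It is an illustrative sketch under assumed synthetic data, not the FAST-LTS algorithm used in statistical packages.

```python
# Toy LTS sketch: concentration steps from one random start.
import numpy as np

def lts_fit(X, y, h, n_iter=20, seed=0):
    """Crude LTS: iterate OLS on the h points with the smallest squared residuals."""
    rng = np.random.default_rng(seed)
    subset = rng.choice(len(y), size=h, replace=False)  # random initial subset
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        sq_resid = (y - X @ beta) ** 2
        subset = np.argsort(sq_resid)[:h]               # keep h smallest residuals
    return beta

# Synthetic contaminated data (assumed for illustration).
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(size=n)
y[:10] += 50                       # gross vertical outliers in 10% of responses

print("LTS estimate:", lts_fit(X, y, h=int(0.75 * n)))  # trims the worst 25%
```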
Slide 5:
This is the result of the residual normality test, carried out using the Kolmogorov-Smirnov test. The p-values are very low; the null hypothesis is therefore rejected, and it can be inferred that the residuals of the classical linear regression models are not normally distributed. Residuals that are not normally distributed can be caused by outliers in the data.
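A sketch of this residual normality check with scipy's Kolmogorov-Smirnov test is shown below; the residuals are synthetic stand-ins, standardized before testing against N(0, 1).

```python
# Sketch: Kolmogorov-Smirnov normality test on (synthetic) regression residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
resid = rng.standard_t(df=3, size=200)           # heavy-tailed stand-in residuals

z = (resid - resid.mean()) / resid.std(ddof=1)   # standardize first
stat, p_value = stats.kstest(z, "norm")          # compare with N(0, 1)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.4f}")
# A small p-value rejects normality, matching the slide's conclusion.
```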
Slide 6:
Outlier detection was done using TRES detection, and this result was obtained. In light of the results in the table, all of the malnutrition data from 2012-2017 contain outliers.
Slide 7:
These figures show the standardized residuals versus fitted values and the robust distances for India, respectively. A thorough investigation reveals that both population growth and foreign direct investment (FDI) inflows play a significant role in Pakistan's economic development. Nonetheless, FDI inflows are inextricably linked to population growth, and economic growth is inextricably linked to population growth, even though the influence of gross investment is minor.
Slide 8:
The current study examined the impact of FDI inflows, yearly population growth, and gross investment on Pakistan's and India's GDP per capita using least squares (LS) and high-breakdown robust least trimmed squares (LTS) regression techniques. FDI has a small but positive impact on both Pakistan's and India's economic growth within the LS framework; however, once the LTS approach is applied, FDI becomes a decisive and vital part of Pakistan's financial development model. Nonetheless, after the removal of 5 and 2 outliers from the Pakistan and India data, respectively, FDI has a negligible impact on India's economy. Population growth contributes to GDP per capita in the two economies in the same way: both methods reveal that rapid population growth adversely affects their economic development and is therefore a significant issue for both economies, requiring prompt attention.