Aim and Objective
The aim is to show the factors that lead to heart disease. The main objective of this project is to identify the key indicators of heart disease by creating data visualizations with Tableau and Python. The two tools are then compared on the basis of the visualizations created.
Research Questions
The research questions are:
- Does drinking alcohol have more impact on heart disease than smoking?
- Does the age factor increase heart disease?
- Are kidney disease, skin cancer, and difficulty walking associated with heart disease?
- Does the stroke factor impact heart disease?
- Does the general health factor impact heart disease?
- Does the diabetic factor impact heart disease?
- What is the age range for heart disease?
- Does gender have an impact on heart disease?
- How much do the smoking, alcohol drinking, and stroke factors contribute to an increase in heart disease?
- How much do the diabetic, physical activity, and general health factors contribute to an increase in heart disease?
- How much do the asthma, kidney disease, and skin cancer factors contribute to an increase in heart disease?
Machine Learning (ML) is a field of computational science that deals with analyzing and interpreting structures and patterns in large volumes of data. It makes it possible to infer insightful patterns from datasets to support business decision making, with or without human intervention. ML frameworks are libraries or tools that let developers build ML models and applications easily, without having to implement the core algorithms themselves. They also facilitate the end-to-end pipeline for ML development.
An ML framework simplifies the use of ML algorithms, and different ML frameworks exist for different purposes. Among the most popular machine learning frameworks are the following:
- PyTorch
- TensorFlow
- Scikit-learn
However, a framework should be selected based on the work to be performed. The main difference between these frameworks is whether their focus is on mathematics and statistical modeling (i.e., machine learning) or on neural network training (i.e., deep learning).
ML frameworks are classified as follows:
- Due to their similarity, PyTorch and TensorFlow are direct competitors. Both offer rich linear algebra tools and are capable of running regression analysis.
- Although R programmers will find Scikit-learn familiar, it is not built to run across a cluster.
- Spark ML is built to run on a cluster.
2020 Annual CDC Survey Data of 400k Adults Related to Their Health Status
Topic covered by the dataset
According to the Centers for Disease Control and Prevention (CDC), heart disease is the leading cause of death in the US for people of most races, including American Indians, African Americans, white people, and Alaska Natives. Nearly half of Americans (47%) have at least one of the three main risk factors for heart disease: high cholesterol, high blood pressure, and smoking. Other key indicators are obesity (high BMI), diabetic status, insufficient physical activity, and excessive alcohol consumption. In healthcare, identifying and preventing the factors that have the greatest effect on heart disease is essential. Advances in computing allow machine learning methods to detect patterns in data that can predict a patient's condition (Acharya and Chellappan, 2018).
Source of the Dataset and Carried out Dataset Treatments
The selected dataset comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to collect data on the health status of US residents. The dataset contains various factors connected to heart disease, which is the main reason for choosing it. Only the most relevant variables are used, and the dataset has been cleaned to make it easier to use in this project.
Originally, this dataset comprised approximately 300 variables, which were reduced to 20. The “Heart Disease” variable is treated as binary: “Yes” means the respondent had heart disease and “No” means the respondent did not (Taieb, 2018).
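The binary treatment of the target variable can be illustrated with a minimal sketch. The column name `HeartDisease`, the sample rows, and the helper `encode_binary` are assumptions made for illustration and are not taken from the project code.

```python
# Hypothetical sample rows; only the "Yes"/"No" labels come from the text above.
rows = [
    {"HeartDisease": "Yes", "Smoking": "Yes"},
    {"HeartDisease": "No", "Smoking": "No"},
    {"HeartDisease": "No", "Smoking": "Yes"},
]


def encode_binary(value):
    """Map a "Yes"/"No" survey answer to 1/0 and reject anything else."""
    mapping = {"Yes": 1, "No": 0}
    if value not in mapping:
        raise ValueError(f"unexpected label: {value!r}")
    return mapping[value]


labels = [encode_binary(row["HeartDisease"]) for row in rows]
print(labels)  # [1, 0, 0]
```

Rejecting unexpected labels, rather than silently mapping them to 0, makes leftover values from the raw survey visible during cleaning.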
Tableau Data Visualization
In this part, visualize the heart disease dataset with Tableau. The Tableau implementation is demonstrated below (Babcock and Kumar, 2017):
- Firstly, visualize the alcohol drinking and smoking factors for heart disease. See figure 1, which shows that alcohol drinking does not increase heart disease when compared with smoking. Therefore, smoking has the stronger effect on heart disease.
- Secondly, visualize age against heart disease. See figure 2, which shows whether the age factor increases heart disease: people above 60 are more likely to have heart disease than people below 60.
- Thirdly, visualize the kidney disease, skin cancer, and difficulty walking factors for heart disease. See figure 3, which shows how kidney disease, skin cancer, and difficulty walking are associated with heart disease. Skin cancer increases heart disease more than difficulty walking and kidney disease do. It is therefore concluded that difficulty walking and kidney disease do not affect heart disease (Chopra, England and Alaudeen, 2019).
- Fourthly, visualize the stroke factor for heart disease. See figure 3, which shows that stroke does not increase heart disease in terms of raw counts: the largest group in the stroke chart contains 307,726 respondents (Stirrup, Nandeshwar, Ohmann and Floyd, 2016).
- Fifthly, visualize the general health factor against heart disease. See figure 3, which shows general health, where the very good category is the most affected by heart disease when compared with the excellent, fair, good, and poor categories. The count observed for very good general health is 113,858.
- Sixthly, visualize the diabetic factor against heart disease. See figure 3, which shows that non-diabetic people form the largest group affected by heart disease, because the count of “No” is 269,653 (Crickard, 2020).
- Finally, it is concluded that the smoking and stroke factors increase heart disease more than the other factors.
Python Data Visualization
Next, visualize the heart disease dataset with Python. The Python implementation is demonstrated below:
- Firstly, visualize heart disease across age ranges. See figure 8, which shows that heart disease is most common in the age category of 80 or older.
- Secondly, visualize heart disease by gender. See figure 9, which shows that males have a higher number of heart disease cases than females (Kirk, Timms, Rininsland and Teller, 2016).
- Thirdly, visualize how much the smoking, alcohol drinking, and stroke factors contribute to an increase in heart disease. See figure 10: in the smoking factor, 58.59% of people have heart disease and 41.41% do not. In the alcohol drinking category, 4.17% of people have heart disease and 95.83% do not. In the stroke category, 16.03% of people have heart disease and 83.97% do not. Therefore, the smoking and stroke factors lead to an increase in heart disease (Miller, 2017) (Indrakumari, Poongodi and Ranjan Jena, 2020).
- Fourthly, visualize how much the diabetic, physical activity, and general health factors contribute to an increase in heart disease. See figure 11: in the diabetic factor, 33.12% of people have heart disease and 66.88% do not. In physical activity, 63.89% of people have heart disease and 36.11% do not. In general health, the good health category accounts for 34.92% of heart disease, the very good category for 19.66%, and the poor category for 14.06%. Therefore, the physical activity and diabetic factors lead to an increase in heart disease (Milovanović, 2021).
- Fifthly, visualize how much the asthma, kidney disease, and skin cancer factors contribute to an increase in heart disease. See figure 12: in the asthma category, 18.02% of people have heart disease and 81.98% do not. In the kidney disease category, 12.62% of people have heart disease and 87.38% do not. In the skin cancer category, 18.19% of people have heart disease and 81.81% do not (Xu, Xu and Yang, 2017). These three factors are therefore associated with a smaller number of people with heart disease, so they are not increasing heart disease (Stirrup, 2019) (Indrakumari, Shukla and Sehgal, 2021).
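Percentage breakdowns of this kind depend on the denominator that is chosen. The sketch below, written with the standard library only, shows two common readings on made-up rows: the share of each answer among people who have heart disease, and the rate of heart disease within each answer group. The column names, the sample data, and both helper functions are illustrative assumptions; the text above does not state which reading the figures use.

```python
from collections import Counter

# Made-up survey rows for illustration only.
rows = [
    {"HeartDisease": "Yes", "Smoking": "Yes"},
    {"HeartDisease": "Yes", "Smoking": "Yes"},
    {"HeartDisease": "Yes", "Smoking": "Yes"},
    {"HeartDisease": "Yes", "Smoking": "No"},
    {"HeartDisease": "No", "Smoking": "Yes"},
    {"HeartDisease": "No", "Smoking": "Yes"},
    {"HeartDisease": "No", "Smoking": "No"},
]


def share_among_cases(rows, factor, target="HeartDisease", positive="Yes"):
    """Among rows with the disease, percentage giving each answer to `factor`."""
    answers = Counter(row[factor] for row in rows if row[target] == positive)
    total = sum(answers.values())
    return {a: round(100 * n / total, 2) for a, n in answers.items()}


def rate_within_groups(rows, factor, target="HeartDisease", positive="Yes"):
    """For each answer to `factor`, percentage of that group with the disease."""
    totals = Counter(row[factor] for row in rows)
    cases = Counter(row[factor] for row in rows if row[target] == positive)
    return {a: round(100 * cases[a] / totals[a], 2) for a in totals}


print(share_among_cases(rows, "Smoking"))   # {'Yes': 75.0, 'No': 25.0}
print(rate_within_groups(rows, "Smoking"))  # {'Yes': 60.0, 'No': 50.0}
```

The two results differ on the same data, so a chart caption should state which of the two quantities it reports.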
Machine Learning Analytical Model
Now, create an analytical model to predict heart disease with Python. The selected machine learning models are Naïve Bayes and Stochastic Gradient Descent (SGD). Once both models have predicted heart disease, compare their performance and evaluate which one is the better model for predicting heart disease. To do this, follow the steps below:
- Before creating a model, prepare the data for prediction.
- Next, preprocess the data for model prediction, i.e., split the data into training data and test data, and apply column transformation.
- Afterwards, to build a model, first import the required packages and functions.
- Then, resample the data to increase the number of positive examples.
The machine learning model can now be built to predict heart disease.
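The preparation steps listed above can be sketched without any third-party packages. This is a simplified illustration under stated assumptions, not the Appendix code: the helper names `one_hot`, `train_test_split`, and `oversample_positives`, the field names, and the synthetic rows are all hypothetical, and random duplication is only one of several possible resampling strategies.

```python
import random


def one_hot(value, categories):
    """Column transformation: encode one categorical value as 0/1 indicators."""
    return [1 if value == category else 0 for category in categories]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def oversample_positives(train_rows, label_key="label", seed=0):
    """Randomly duplicate positive rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in train_rows if r[label_key] == 1]
    negatives = [r for r in train_rows if r[label_key] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(train_rows)
    missing = len(negatives) - len(positives)
    extra = [rng.choice(positives) for _ in range(missing)]
    return negatives + positives + extra


# Synthetic, imbalanced data: 10 positive rows out of 100.
rows = [
    {"gen_health": "Good" if i % 2 else "Poor", "label": 1 if i % 10 == 0 else 0}
    for i in range(100)
]
train, test = train_test_split(rows)
balanced_train = oversample_positives(train)  # the test split is left untouched
print(len(train), len(test))  # 80 20
print(one_hot("Good", ["Poor", "Fair", "Good"]))  # [0, 0, 1]
```

Resampling is applied only after the split, so duplicated positive rows cannot leak into the test data and inflate the measured scores.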
Naïve Bayes Model
Here, create a Naïve Bayes model, which is a basic but effective probabilistic classification model in machine learning. The model is based on Bayes' theorem. Naïve Bayes classifiers fall into the following categories:
- Gaussian Naive Bayes
- Bernoulli Naive Bayes
- Multinomial Naive Bayes
In our case, a Gaussian Naïve Bayes model is built to predict heart disease, as shown in the Appendix. This classifier is used when the predictors take continuous values that are expected to follow a Gaussian distribution. As per the result, this classifier predicts heart disease with 72% accuracy and an F1 score of 0.32, which gives a good prediction for this analysis.
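To make the Gaussian assumption concrete, the following is a from-scratch sketch of a Gaussian Naïve Bayes classifier using only the standard library. It is not the Appendix implementation; the class name, the two numeric features (imagined here as BMI and sleep hours), and the toy data are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Per class, model every feature as an independent Gaussian."""

    def fit(self, X, y):
        grouped = defaultdict(list)
        for features, label in zip(X, y):
            grouped[label].append(features)
        self.priors = {}
        self.stats = {}
        for label, rows in grouped.items():
            self.priors[label] = len(rows) / len(y)
            per_feature = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # Small constant keeps the variance strictly positive.
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
                per_feature.append((mean, var))
            self.stats[label] = per_feature
        return self

    def _log_posterior(self, features, label):
        # log P(label) + sum of log Gaussian densities (up to a shared constant)
        total = math.log(self.priors[label])
        for value, (mean, var) in zip(features, self.stats[label]):
            total += -0.5 * math.log(2 * math.pi * var)
            total += -((value - mean) ** 2) / (2 * var)
        return total

    def predict(self, X):
        return [
            max(self.priors, key=lambda label: self._log_posterior(row, label))
            for row in X
        ]


# Toy data: [bmi, sleep_hours]; labels 1 = heart disease, 0 = none (made up).
X = [[22.0, 7.5], [24.0, 8.0], [23.0, 7.0], [35.0, 5.0], [38.0, 4.5], [36.0, 5.5]]
y = [0, 0, 0, 1, 1, 1]
model = GaussianNaiveBayes().fit(X, y)
predictions = model.predict([[23.5, 7.5], [37.0, 5.0]])
print(predictions)  # [0, 1]
```

Working in log space avoids numerical underflow when many feature likelihoods are multiplied together.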
Stochastic Gradient Descent (SGD) Model
A Stochastic Gradient Descent (SGD) model is built next. SGD is a simple and highly efficient approach for fitting linear classifiers and regressors under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression. SGD has recently received a lot of attention in the context of large-scale learning. It has been applied successfully to large-scale and sparse machine learning problems, which are often encountered in natural language processing and text classification.
In our case, heart disease is predicted with an SGD classifier, as shown in the Appendix. As per the result, this classifier predicts heart disease with 86% accuracy and an F1 score of 0.39, which is an effective result for this prediction.
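The idea behind SGD can be sketched as logistic regression trained one example at a time. This standard-library sketch is an illustration only and is not the Appendix classifier; the function names, the learning rate, the epoch count, and the toy data are assumptions, and the choice of log loss is made here for simplicity.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_logistic_regression(X, y, learning_rate=0.1, epochs=200, seed=0):
    """Fit weights and bias by updating on one shuffled example at a time."""
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for i in indices:
            z = bias + sum(w * x for w, x in zip(weights, X[i]))
            error = sigmoid(z) - y[i]  # gradient of the log loss w.r.t. z
            weights = [w - learning_rate * error * x for w, x in zip(weights, X[i])]
            bias -= learning_rate * error
    return weights, bias


def predict(X, weights, bias):
    """Predict 1 when the linear score is non-negative, else 0."""
    return [
        1 if bias + sum(w * x for w, x in zip(weights, row)) >= 0 else 0
        for row in X
    ]


# Toy, already standardized features (made up); 1 = heart disease, 0 = none.
X = [[-1.0, 0.5], [-0.8, 1.0], [-1.2, 0.2], [1.0, -0.5], [0.9, -1.0], [1.1, -0.3]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = sgd_logistic_regression(X, y)
print(predict(X, weights, bias))  # [0, 0, 0, 1, 1, 1]
```

Each update touches a single example, which is why the method scales to large datasets; in practice the features should be standardized first, as the toy data here already is.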
Finally, the two models are compared: the SGD classifier has a higher accuracy and F1 score than the Naïve Bayes model. Therefore, the SGD model is the better model for predicting heart disease, because it gives a more effective result than the Naïve Bayes model.
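Because positive examples are the minority class in this kind of data, accuracy and F1 score can tell different stories, which is why both are reported. The sketch below computes both metrics from scratch on made-up labels; the helper names and the example predictions are assumptions for illustration and do not reproduce the reported results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 10 positives out of 100.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100
flags_some = [1] * 6 + [0] * 4 + [1] * 14 + [0] * 76

print(accuracy(y_true, always_negative), f1_score(y_true, always_negative))  # 0.9 0.0
print(round(accuracy(y_true, flags_some), 2), round(f1_score(y_true, flags_some), 2))  # 0.82 0.4
```

A model that never predicts heart disease reaches 90% accuracy here with an F1 score of 0, so the F1 score is the more informative figure when comparing classifiers on imbalanced data.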
Analytical Framework
The key difference between Python and Tableau is that Python relies on hand-written code, whereas Tableau offers a no-code environment. Compared with Python, Tableau provides greater flexibility through its interactive data visualization. The two environments also differ in terms of learning, usage, data handling, data integration, mobility, and so on, as described below.
- Usage
Python can be used to write software programs that solve computing problems.
Tableau supports interpreting information and developing effective, meaningful business insights.
- Data Handling
Python can easily handle data streaming: the required packages can be imported, and various data types can be loaded and processed with those packages or libraries.
Tableau can quickly consume multiple file types. It can also connect to various database types through its pre-built connections, and it can load data of different types and produce visualizations.
- Visualization
Python can also be used for data analytics and data visualization. Data visualization in Python is possible with open-source libraries such as Seaborn, Matplotlib, and so on.
Tableau is a BI (Business Intelligence) product for producing interactive, user-friendly data visualizations.
- Integrations
Python is open source, so it can be used freely and distributed for commercial purposes.
Tableau allows integration with common databases for importing data.
- Easy to Learn
Python is one of the easiest languages to learn, as its syntax resembles the English language.
However, using Tableau requires very little programming skill.
- Mobility
Python can be utilized on AIX, Windows, iOS, IBM I, OS/390, MS, Solaris, HP-UX, z/OS, and Linux.
The Tableau platform is available on all types of devices, such as smartphones, tablets, and laptops. Additionally, it can be accessed via the internet.
- Analysis
Data analysis with Python gives strong results and underlines the importance and necessity of data transformation and data cleaning.
Tableau is also an excellent tool for data analysis; however, it is less efficient for complicated and intricate processes. For data transformation and data cleaning, Tableau offers limited scope.
Conclusions and Recommendations
It is determined that older people, i.e., those above 80 years, have heart disease. Males are more prone to heart disease. If an individual has had a stroke and also smokes, this can lead to heart disease. Further, the physical activity and diabetic factors are seen to increase heart disease. Finally, it is observed that the asthma, kidney disease, and skin cancer factors did not increase heart disease.
It is recommended to avoid smoking and take measures to control diabetes. Early detection of heart disease is necessary to reduce the death rate of patients (Poongodi, Indrakumari, Janarthanan and Suresh, 2021). Regular physical activity is required to decrease the risk of heart disease (Preventing Heart Disease, 2022).
In this report, heart disease is predicted using an SGD classifier, which achieves 86% accuracy and an F1 score of 0.39, making it an effective prediction result. From the model comparison, it is determined that the SGD model is the better model for predicting heart disease.
References
Acharya, S. and Chellappan, S., 2018. Pro Tableau.
Babcock, J. and Kumar, A., 2017. Python. Birmingham: Packt Publishing.
Chopra, R., England, A. and Alaudeen, M., 2019. Data Science with Python. Birmingham: Packt Publishing, Limited.
Crickard, P., 2020. Data Engineering with Python. [S.l.]: Packt Publishing.
Indrakumari, R., Poongodi, T. and Ranjan Jena, S., 2020. Heart Disease Prediction using Exploratory Data Analysis. International Conference on Smart Sustainable Intelligent Computing and Applications under ICITETM2020, (173), pp.130–139.
Indrakumari, R., Shukla, P. and Sehgal, A., 2021. Heart Disease Prediction Using Tableau. 1st ed. CRC Press, p.16.
Kirk, A., Timms, S., Rininsland, Æ. and Teller, S., 2016. Data Visualization. Birmingham: Packt Publishing.
Miller, J., 2017. Big Data Visualization. Birmingham: Packt Publishing.
Milovanović, I., 2021. Python data visualization cookbook.
Poongodi, T., Indrakumari, R., Janarthanan, S. and Suresh, P., 2021. A Systematic Framework for Heart Disease Prediction Using Big Data Analytics. Springer International Publishing.
Stirrup, J., 2019. Tableau Dashboard Cookbook.
Stirrup, J., Nandeshwar, A., Ohmann, A. and Floyd, M., 2016. Tableau. Birmingham, UK: Packt Publishing.
Taieb, D., 2018. Data Analysis with Python. Birmingham: Packt Publishing Ltd.
The Nutrition Source. 2022. Preventing Heart Disease. [online] Available at: <https://www.hsph.harvard.edu/nutritionsource/disease-prevention/cardiovascular-disease/preventing-cvd/> [Accessed 24 March 2022].
Xu, M., Xu, J. and Yang, X., 2017. Asthma and risk of cardiovascular disease or all-cause mortality: a meta-analysis. Ann Saudi Med, 37(2), pp.99–105.