Hypothesis
Cancer is among the health conditions that have gained prominence over the last two to three decades. It has been established that there is no definite cure for this condition. This has led to more scientific research aimed at developing medication or control procedures for it. In this paper, breast cancer data obtained from kaggle.com are explored, and inferential and regression techniques are employed to determine whether a case is benign or malignant (Kaggle.com, 2016).
- H0: There is no difference in average texture between benign and malignant breast cancer.
HA: The average texture of malignant breast cancer is greater than that of benign breast cancer.
- H0: There is no difference in average compactness between benign and malignant breast cancer.
HA: The average compactness of malignant breast cancer is greater than that of benign breast cancer.
The descriptive statistics are provided in the form of tables and plots. The plots were produced with the ggplot2 package in R Commander. The information provided is for the training dataset, which was separated from the main dataset so that a testing dataset remains available for model prediction and diagnostics.
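The separation into training and testing data can be sketched as follows. This is a minimal illustration, not the split used in the analysis: the data frame `cancer` is a synthetic stand-in, and the 70/30 ratio and the seed are assumptions, since the paper does not state them.

```r
# Synthetic stand-in for the breast cancer data (NOT the real Kaggle file).
set.seed(42)
n <- 200
cancer <- data.frame(
  diagnosis    = factor(sample(c("B", "M"), n, replace = TRUE)),
  texture_mean = rnorm(n, mean = 19, sd = 4)
)

# Randomly assign 70% of the rows to training; the ratio is an assumption.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- cancer[train_idx, ]
test  <- cancer[-train_idx, ]

# Every row ends up in exactly one of the two sets.
nrow(train) + nrow(test) == n
```

Fixing the seed makes the split reproducible, so the same rows are used for fitting and for evaluation on every run.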
Table 1: Descriptive statistics
## vars n mean sd min max range |
Table 1 above shows the means, standard deviations, minimum and maximum values of the continuous variables in the breast cancer [train] dataset. All of these variables are candidate predictors of whether a breast cancer case is benign or malignant.
Table 2: Descriptive Statistics by diagnosis
## Descriptive statistics by group |
Table 2 above shows the comparative descriptive statistics by diagnosis. Comparing the group means helps identify the variables with the greatest differences, which are potential predictors in a logistic regression for the probability that a breast cancer case is benign or malignant. For instance, the area worst variable differs greatly between cases diagnosed as benign and those diagnosed as malignant.
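A comparison of this kind can be sketched in base R. The data frame `train` below is synthetic, the group sizes and distribution parameters are arbitrary, and the helper `summarise` is a hypothetical name introduced for illustration. Only the column name `area_worst` follows the variable discussed above.

```r
# Synthetic stand-in for the training data; the values are arbitrary
# and are NOT taken from the Kaggle dataset.
set.seed(1)
train <- data.frame(
  diagnosis  = factor(rep(c("B", "M"), times = c(60, 40))),
  area_worst = c(rnorm(60, mean = 550, sd = 150),
                 rnorm(40, mean = 1400, sd = 500))
)

# Summary statistics computed separately for each diagnosis group.
summarise <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}
group_stats <- aggregate(area_worst ~ diagnosis, data = train, FUN = summarise)
print(group_stats)

# Difference in group means (malignant minus benign).
group_means <- tapply(train$area_worst, train$diagnosis, mean)
mean_diff <- unname(group_means["M"] - group_means["B"])
print(mean_diff)
```

A large difference in means relative to the group standard deviations marks a variable as worth trying in the model.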
Table 3: Diagnosis by proportion
Diagnosis | Benign | Malignant
Proportion | 56.75% | 43.25%
In the training data, 56.75% of the breast cancer cases were diagnosed as benign and 43.25% as malignant.
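Proportions like these can be obtained from a frequency table. The short vector below is a made-up example used only to show the calls; it does not reproduce the figures in Table 3.

```r
# Toy diagnosis vector (illustrative only): 5 benign, 3 malignant.
diagnosis <- factor(c("B", "M", "B", "B", "M", "B", "M", "B"))

# Count each class, then convert the counts to proportions.
props <- prop.table(table(diagnosis))

# Express the proportions as percentages with two decimals.
print(round(100 * props, 2))
```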
Based on the mean differences observed in the descriptive statistics, the variables with the greatest differences are represented graphically in this section.
Figure 1: Histogram of Radius, Texture, Perimeter, Area, Smoothness and Compactness means faceted by Diagnosis
The distributions of the variables displayed above seem to vary between cases diagnosed as benign and those diagnosed as malignant. For instance, the perimeter mean values for the benign cases are more tightly distributed than those for the malignant cases. In general, we can state that malignant cases have higher variance in the radius, perimeter, area and compactness means than benign cases.
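A faceted histogram of this kind can be sketched with ggplot2. The data below are synthetic and the bin count is an assumption. The snippet shows only the plotting calls, not the actual figure.

```r
library(ggplot2)

# Synthetic stand-in for the training data; the values are arbitrary.
set.seed(11)
train <- data.frame(
  diagnosis      = factor(rep(c("Benign", "Malignant"), times = c(60, 40))),
  perimeter_mean = c(rnorm(60, mean = 78, sd = 12),
                     rnorm(40, mean = 115, sd = 22))
)

# One histogram panel per diagnosis group, sharing the same x axis.
p <- ggplot(train, aes(x = perimeter_mean)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ diagnosis) +
  labs(x = "Perimeter mean", y = "Count")
print(p)
```

Keeping the x axis shared across panels makes differences in location and spread between the two groups directly comparable.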
Descriptive Statistics
Figure 2: Histograms of Concavity mean, Concave Points mean, Radius Se, Perimeter Se, Area Se and Radius worst, Faceted by Diagnosis
The concavity mean, concave points mean and radius worst show a greater difference in average values between the benign and malignant groups. The other variables (radius, perimeter and area standard errors) have approximately similar distributions in both groups, although their variances differ.
Figure 3: Histogram of worst values of Perimeter, Area, Compactness and Concave Points, Faceted by Diagnosis
The averages and distributions of the worst values of perimeter, concave points, compactness and area differ between the benign and malignant groups.
Malignant texture mean is greater than benign texture mean
Table 4: First hypothesis output
## Welch Two Sample t-test |
The p-value is less than the significance level, so we reject the null hypothesis and conclude that malignant breast cancer has a greater texture mean than benign breast cancer.
Table 5: Second hypothesis output
## Welch Two Sample t-test |
The p-value is less than 0.05, so we reject the null hypothesis and conclude that the average compactness is higher for malignant than for benign breast cancer (Glover and Mitchell, 2008).
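Both hypotheses are one-sided comparisons of two group means, which can be sketched with a Welch two-sample t-test in R. The data here are synthetic and chosen arbitrarily, so the printed result does not correspond to Tables 4 or 5. The 0.05 level follows the text above.

```r
# Synthetic stand-in for the training data; the values are arbitrary.
set.seed(7)
train <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(60, 40))),
  texture_mean = c(rnorm(60, mean = 18, sd = 4),
                   rnorm(40, mean = 22, sd = 4))
)

malignant <- train$texture_mean[train$diagnosis == "M"]
benign    <- train$texture_mean[train$diagnosis == "B"]

# HA: mean(malignant) > mean(benign). var.equal = FALSE gives the Welch
# test, which does not assume equal variances in the two groups.
res <- t.test(malignant, benign, alternative = "greater", var.equal = FALSE)
print(res)

# Reject H0 when the p-value falls below the 0.05 significance level.
reject_h0 <- res$p.value < 0.05
print(reject_h0)
```

Passing the two groups as separate vectors makes the direction of the alternative explicit, rather than relying on the ordering of the factor levels.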
The variables used in the final model are radius mean, concave points worst and area worst. All three independent variables are statistically significant, with p-values less than 0.05 (Prabhakaran, 2016). The model is shown in the table below.
Table 6: Logistic regression model output
model5 <- glm(diagnosis ~ radius_mean + concave.points_worst + area_worst, family = binomial(link = "logit")) |
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred |
summary(model5) |
## |
The regression model developed from the above analysis is as shown below (Menard, 2008).
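The fitted coefficient values are not reproduced here. In general form, with symbolic coefficients and assuming malignant is coded as the event being modelled, a logistic regression on these three predictors can be written as:

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0
  + \beta_1\,\text{radius\_mean}
  + \beta_2\,\text{concave.points\_worst}
  + \beta_3\,\text{area\_worst},
\qquad p = P(\text{diagnosis} = \text{malignant})
```

Here each coefficient gives the change in the log-odds of a malignant diagnosis for a one-unit increase in the corresponding predictor, holding the others fixed.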
Figure 4: ROC curve
The ROC curve in Figure 4 shows that the logistic model discriminates well between malignant and benign breast cancer. The area under the curve is large, indicating a high true positive rate at low false positive rates (Menard, 2008).
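The construction of an ROC curve and its area can be sketched in base R. The example uses a synthetic single-predictor model evaluated on the data it was fitted to, which is a simplification. The variable names and the simulated effect size are assumptions, and the result does not reproduce Figure 4.

```r
# Synthetic data: one predictor and a binary outcome (1 = malignant here).
set.seed(3)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))

fit   <- glm(y ~ x, family = binomial(link = "logit"))
p_hat <- fitted(fit)  # predicted probabilities of the event

# True and false positive rates at every distinct threshold.
thresholds <- sort(unique(c(0, p_hat, 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(p_hat[y == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(p_hat[y == 0] >= t))

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Cross-check with the rank-based (Mann-Whitney) form of the AUC.
r  <- rank(p_hat)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc_rank <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a model with no discrimination
print(c(auc = auc, auc_rank = auc_rank))
```

An area close to 1 means the model ranks malignant cases above benign ones almost always, while 0.5 corresponds to the dashed reference line.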
Conclusion
We conclude that malignant breast cancer has higher average compactness and texture than benign breast cancer. This indicates that these measurements can help guide breast cancer diagnosis. The final logistic regression fit included radius mean, worst concave points and worst area as predictors of malignant breast cancer. The logistic regression equation below was determined to be the best fit for malignant breast cancer prediction.
Further exploration of the data could improve the model. This may include developing dummy variables to improve the power of new models. The exploration would focus on understanding how the distributions of the predictor variables relate to the prediction of malignant breast cancer.
References
Glover, T. and Mitchell, K. (2008). An introduction to biostatistics. Long Grove, Ill.: Waveland Press.
Kaggle.com (2016). Breast Cancer Wisconsin (Diagnostic) Data Set | Kaggle. [online] Kaggle.com. Available at: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data [Accessed 18 Oct. 2017].
Menard, S. (2008). Applied logistic regression analysis. Thousand Oaks, Calif. [u.a.]: Sage.
Prabhakaran, S. (2016). Logistic Regression With R. [online] R-statistics.co. Available at: https://r-statistics.co/Logistic-Regression-With-R.html [Accessed 18 Oct. 2017].