Breast Cancer Data Analysis Report

Hypothesis

Cancer is among the health conditions that have come to prominence over the last two to three decades. It has been established that there is no definite cure for this condition. This has led to more scientific research aimed at developing medication or control procedures for the condition. In this paper, breast cancer data obtained from kaggle.com is explored, and inferential and regression techniques are employed to determine whether a case is benign or malignant (Kaggle.com, 2016).

H₀: There is no difference in average texture between benign and malignant breast cancer.

H_A: The average texture of malignant breast cancer is greater than that of benign breast cancer.

H₀: There is no difference in average compactness between benign and malignant breast cancer.

H_A: The average compactness of malignant breast cancer is greater than that of benign breast cancer.

The descriptive statistics are provided in the form of tables and plots. The plots were developed using the ggplot2 package in R Commander. The information provided is for the training dataset, which was separated from the main dataset so that a testing dataset remains available for model prediction and diagnostics.
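The split itself is not shown in this report. A minimal sketch of one way to hold out a test set, assuming a simple random split; the data frame `cancer` is a hypothetical stand-in with 569 rows (the size of the Kaggle file), and the seed and object names are illustrative assumptions rather than the values actually used:

```r
# Hypothetical stand-in for the Kaggle data: 569 rows, one predictor.
set.seed(2017)  # arbitrary seed, chosen only for reproducibility
cancer <- data.frame(
  id = seq_len(569),
  radius_mean = rnorm(569, mean = 14, sd = 3.5)
)

# Draw 400 row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(nrow(cancer)), size = 400)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Row counts of the two subsets.
c(train = nrow(train), test = nrow(test))
```

Keeping the test rows out of every descriptive table and model fit is what allows them to serve as an honest check on the final model.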

Table 1: Descriptive statistics

##                         vars   n   mean     sd    min     max   range
## radius_mean                1 400  14.32   3.58   6.98   28.11   21.13
## texture_mean               2 400  18.95   4.12   9.71   39.28   29.57
## perimeter_mean             3 400  93.32  24.65  43.79  188.50  144.71
## area_mean                  4 400 673.22 357.28 143.50 2499.00 2355.50
## smoothness_mean            5 400   0.10   0.01   0.06    0.14    0.08
## compactness_mean           6 400   0.11   0.05   0.02    0.35    0.33
## concavity_mean             7 400   0.09   0.08   0.00    0.43    0.43
## concave.points_mean        8 400   0.05   0.04   0.00    0.20    0.20
## symmetry_mean              9 400   0.18   0.03   0.12    0.30    0.19
## fractal_dimension_mean    10 400   0.06   0.01   0.05    0.10    0.05
## radius_se                 11 400   0.42   0.28   0.11    2.87    2.76
## texture_se                12 400   1.20   0.54   0.36    4.88    4.52
## perimeter_se              13 400   2.97   2.04   0.76   21.98   21.22
## area_se                   14 400  42.28  43.62   7.23  525.60  518.37
## smoothness_se             15 400   0.01   0.00   0.00    0.03    0.03
## compactness_se            16 400   0.03   0.02   0.00    0.14    0.13
## concavity_se              17 400   0.03   0.03   0.00    0.40    0.40
## concave.points_se         18 400   0.01   0.01   0.00    0.05    0.05
## symmetry_se               19 400   0.02   0.01   0.01    0.08    0.07
## fractal_dimension_se      20 400   0.00   0.00   0.00    0.03    0.03
## radius_worst              21 400  16.60   4.96   7.93   33.13   25.20
## texture_worst             22 400  25.33   6.12  12.02   49.54   37.52
## perimeter_worst           23 400 109.43  34.40  50.41  229.30  178.89
## area_worst                24 400 917.18 583.20 185.20 3432.00 3246.80
## smoothness_worst          25 400   0.13   0.02   0.07    0.22    0.15
## compactness_worst         26 400   0.26   0.17   0.03    1.06    1.03
## concavity_worst           27 400   0.28   0.21   0.00    1.25    1.25
## concave.points_worst      28 400   0.12   0.07   0.00    0.29    0.29
## symmetry_worst            29 400   0.30   0.07   0.16    0.66    0.51
## fractal_dimension_worst   30 400   0.08   0.02   0.06    0.21    0.15

Table 1 above shows the mean, standard deviation, minimum, maximum and range of the continuous variables in the breast cancer training dataset. All of these variables are candidate predictors of whether a breast cancer case is benign or malignant.

Table 2: Descriptive Statistics by diagnosis

## Descriptive statistics by group
## group: Benign
##                         vars   n   mean     sd    min     max  range    se
## radius_mean                1 227  12.07   1.73   6.98   16.84   9.86  0.11
## texture_mean               2 227  17.12   3.35   9.71   33.81  24.10  0.22
## perimeter_mean             3 227  77.54  11.40  43.79  108.40  64.61  0.76
## area_mean                  4 227 456.77 128.52 143.50  880.20 736.70  8.53
## smoothness_mean            5 227   0.09   0.01   0.06    0.13   0.07  0.00
## compactness_mean           6 227   0.08   0.03   0.02    0.22   0.20  0.00
## concavity_mean             7 227   0.05   0.05   0.00    0.41   0.41  0.00
## concave.points_mean        8 227   0.03   0.02   0.00    0.09   0.09  0.00
## symmetry_mean              9 227   0.18   0.03   0.12    0.27   0.16  0.00
## fractal_dimension_mean    10 227   0.06   0.01   0.05    0.09   0.04  0.00
## radius_se                 11 227   0.29   0.12   0.11    0.88   0.77  0.01
## texture_se                12 227   1.19   0.57   0.36    4.88   4.52  0.04
## perimeter_se              13 227   1.99   0.76   0.76    5.12   4.36  0.05
## area_se                   14 227  21.02   9.05   7.23   77.11  69.88  0.60
## smoothness_se             15 227   0.01   0.00   0.00    0.02   0.02  0.00
## compactness_se            16 227   0.02   0.02   0.00    0.11   0.10  0.00
## concavity_se              17 227   0.03   0.04   0.00    0.40   0.40  0.00
## concave.points_se         18 227   0.01   0.01   0.00    0.05   0.05  0.00
## symmetry_se               19 227   0.02   0.01   0.01    0.06   0.05  0.00
## fractal_dimension_se      20 227   0.00   0.00   0.00    0.03   0.03  0.00
## radius_worst              21 227  13.26   1.92   7.93   18.22  10.29  0.13
## texture_worst             22 227  22.41   4.81  12.02   41.78  29.76  0.32
## perimeter_worst           23 227  86.08  13.08  50.41  120.30  69.89  0.87
## area_worst                24 227 548.45 155.98 185.20 1032.00 846.80 10.35
## smoothness_worst          25 227   0.12   0.02   0.07    0.17   0.10  0.00
## compactness_worst         26 227   0.18   0.09   0.03    0.58   0.56  0.01
## concavity_worst           27 227   0.16   0.15   0.00    1.25   1.25  0.01
## concave.points_worst      28 227   0.07   0.04   0.00    0.18   0.18  0.00
## symmetry_worst            29 227   0.27   0.04   0.17    0.42   0.26  0.00
## fractal_dimension_worst   30 227   0.08   0.01   0.06    0.15   0.09  0.00
## ------------------------------------------------------------
## group: Malignant
##                         vars   n    mean     sd    min     max   range    se
## radius_mean                1 173   17.27   3.22  10.95   28.11   17.16  0.24
## texture_mean               2 173   21.36   3.80  10.38   39.28   28.90  0.29
## perimeter_mean             3 173  114.03  21.88  71.90  188.50  116.60  1.66
## area_mean                  4 173  957.23 362.54 361.60 2499.00 2137.40 27.56
## smoothness_mean            5 173    0.10   0.01   0.07    0.14    0.07  0.00
## compactness_mean           6 173    0.14   0.05   0.05    0.35    0.29  0.00
## concavity_mean             7 173    0.16   0.07   0.02    0.43    0.40  0.01
## concave.points_mean        8 173    0.09   0.03   0.02    0.20    0.18  0.00
## symmetry_mean              9 173    0.19   0.03   0.13    0.30    0.17  0.00
## fractal_dimension_mean    10 173    0.06   0.01   0.05    0.10    0.05  0.00
## radius_se                 11 173    0.60   0.32   0.19    2.87    2.68  0.02
## texture_se                12 173    1.21   0.50   0.36    3.57    3.21  0.04
## perimeter_se              13 173    4.26   2.45   1.33   21.98   20.65  0.19
## area_se                   14 173   70.17  54.10  13.99  525.60  511.61  4.11
## smoothness_se             15 173    0.01   0.00   0.00    0.03    0.03  0.00
## compactness_se            16 173    0.03   0.02   0.01    0.14    0.13  0.00
## concavity_se              17 173    0.04   0.02   0.01    0.14    0.13  0.00
## concave.points_se         18 173    0.01   0.01   0.01    0.04    0.04  0.00
## symmetry_se               19 173    0.02   0.01   0.01    0.08    0.07  0.00
## fractal_dimension_se      20 173    0.00   0.00   0.00    0.01    0.01  0.00
## radius_worst              21 173   20.98   4.28  12.84   33.13   20.29  0.33
## texture_worst             22 173   29.15   5.52  16.67   49.54   32.87  0.42
## perimeter_worst           23 173  140.09  29.24  85.10  229.30  144.20  2.22
## area_worst                24 173 1401.00 584.94 508.10 3432.00 2923.90 44.47
## smoothness_worst          25 173    0.15   0.02   0.09    0.22    0.13  0.00
## compactness_worst         26 173    0.38   0.17   0.05    1.06    1.01  0.01
## concavity_worst           27 173    0.44   0.17   0.02    1.10    1.08  0.01
## concave.points_worst      28 173    0.18   0.05   0.03    0.29    0.26  0.00
## symmetry_worst            29 173    0.33   0.08   0.16    0.66    0.51  0.01
## fractal_dimension_worst   30 173    0.09   0.02   0.06    0.21    0.15  0.00

Table 2 above shows the comparative descriptive statistics by diagnosis. Comparing the group means helps to identify the variables with the greatest differences, which are potential predictors in a logistic regression for the probability that a breast cancer case is malignant rather than benign. For instance, the mean of the area worst variable differs greatly between cases diagnosed as benign (548.45) and those determined to be malignant (1401.00).
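Raw differences in means depend on each variable's units, so large-valued variables such as area worst dominate a simple comparison. One option, which is an assumption here and not a step taken in this report, is to divide each difference by a combined standard deviation. The sketch below uses the rounded values printed in Table 2 for four variables:

```r
# Means and standard deviations copied from Table 2 (rounded as printed).
vars <- c("radius_mean", "texture_se", "area_worst", "concave.points_worst")
benign <- data.frame(mean = c(12.07, 1.19, 548.45, 0.07),
                     sd = c(1.73, 0.57, 155.98, 0.04), row.names = vars)
malignant <- data.frame(mean = c(17.27, 1.21, 1401.00, 0.18),
                        sd = c(3.22, 0.50, 584.94, 0.05), row.names = vars)

# Standardized difference: gap in means divided by the square root of
# the average of the two group variances.
combined_sd <- sqrt((benign$sd^2 + malignant$sd^2) / 2)
std_diff <- setNames((malignant$mean - benign$mean) / combined_sd, vars)

# Largest standardized gaps first.
round(sort(std_diff, decreasing = TRUE), 2)
```

On this scale texture_se shows almost no separation between the groups, while radius_mean, area_worst and concave.points_worst all differ by roughly two standard deviations.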

Table 3: Diagnosis by proportion

Diagnosis	Benign	Malignant
Proportion	56.75%	43.25%

In the training data, 56.75% of the diagnosed breast cancer cases were benign and 43.25% were malignant.
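These proportions follow directly from the group sizes in Table 2 (227 benign and 173 malignant cases out of 400). A short sketch of the arithmetic; with the actual data, `prop.table(table(train$diagnosis))` would presumably give the same result:

```r
# Group counts taken from the n column of Table 2.
counts <- c(Benign = 227, Malignant = 173)

# Convert counts to percentages of the training set.
proportions <- round(100 * prop.table(counts), 2)
proportions
```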

Based on the mean differences observed in the descriptive statistics, the variables with the greatest differences are represented graphically in this section.

Figure 1: Histogram of Radius, Texture, Perimeter, Area, Smoothness and Compactness means faceted by Diagnosis

The distributions of the variables displayed above appear to differ between benign and malignant cases. For instance, perimeter mean values for the benign cases are more tightly clustered than those for malignant cases. In general, we can state that malignant cases have higher variance in radius, perimeter, area and compactness mean than benign cases.
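The plotting code is not included in this report. As a rough sketch of how histograms faceted by diagnosis can be built with ggplot2, the example below uses simulated values whose means and standard deviations loosely follow Table 2; the data frame names and the bin count are assumptions, and the ggplot2 package must be installed:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for two predictors; not the real training data.
toy <- data.frame(
  diagnosis = rep(c("Benign", "Malignant"), times = c(227, 173)),
  radius_mean = c(rnorm(227, 12.07, 1.73), rnorm(173, 17.27, 3.22)),
  texture_mean = c(rnorm(227, 17.12, 3.35), rnorm(173, 21.36, 3.80))
)

# Reshape to long form: stack() returns columns `values` and `ind`.
long <- data.frame(
  diagnosis = rep(toy$diagnosis, times = 2),
  stack(toy[c("radius_mean", "texture_mean")])
)

# One histogram panel per diagnosis (rows) and variable (columns).
p <- ggplot(long, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_grid(diagnosis ~ ind, scales = "free_x")
```

Printing `p` draws the grid; placing the two diagnoses in separate rows makes differences in spread and location easy to compare by eye.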

Descriptive Statistics

Figure 2: Histograms of Concavity mean, Concave Points mean, Radius Se, Perimeter Se, Area Se and Radius worst, Faceted by Diagnosis

The means of concavity, concave points and radius worst show a greater difference in average values between the benign and malignant groups. The other variables (the radius, perimeter and area standard errors) have approximately similar distribution shapes in both groups, although their variances differ.

Figure 3: Histogram of worst values of Perimeter, Area, Compactness and Concave Points, Faceted by Diagnosis

The averages and distributions of the worst values of perimeter, concave points, compactness and area differ between the benign and malignant groups.

Malignant texture mean is greater than benign texture mean

Table 4: First hypothesis output

## Welch Two Sample t-test
##
## data: texture_mean[train$diagnosis == "Malignant"] and texture_mean[train$diagnosis == "Benign"]
## t = 11.651, df = 344.36, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 3.642672 Inf
## sample estimates:
## mean of x mean of y
## 21.36098 17.11762

The p-value (< 2.2e-16) is below the 0.05 significance level, so we reject H₀ and conclude that malignant breast cancer has a greater mean texture than benign breast cancer.
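The Welch statistic in Table 4 can be approximately reproduced from summary statistics alone. The sketch below takes the group means from Table 4 and the standard deviations and group sizes from Table 2; because the standard deviations are rounded to two decimals, the results only approximate the printed output:

```r
# Summary statistics for texture_mean (means: Table 4; sd and n: Table 2).
m_mal <- 21.36098
s_mal <- 3.80
n_mal <- 173
m_ben <- 17.11762
s_ben <- 3.35
n_ben <- 227

# Per-group variance of the sample mean.
v_mal <- s_mal^2 / n_mal
v_ben <- s_ben^2 / n_ben

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
se <- sqrt(v_mal + v_ben)
t_stat <- (m_mal - m_ben) / se
df <- (v_mal + v_ben)^2 / (v_mal^2 / (n_mal - 1) + v_ben^2 / (n_ben - 1))

# One-sided p-value and lower 95% confidence bound for the difference.
p_value <- pt(t_stat, df, lower.tail = FALSE)
ci_lower <- (m_mal - m_ben) - qt(0.95, df) * se

round(c(t = t_stat, df = df, ci_lower = ci_lower), 3)
```

The values land close to the reported t = 11.651, df = 344.36 and lower bound 3.642672, which is a useful consistency check between Tables 2 and 4.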

Table 5: Second hypothesis output

## Welch Two Sample t-test
##
## data: compactness_mean[train$diagnosis == "Malignant"] and compactness_mean[train$diagnosis == "Benign"]
## t = 14.068, df = 265.29, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.05868472 Inf
## sample estimates:
## mean of x mean of y
## 0.14456393 0.07807859

The p-value is less than 0.05, so we reject H₀ and conclude that the average compactness of malignant breast cancer is higher than that of benign breast cancer (Glover and Mitchell, 2008).
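The call that produced Tables 4 and 5 is not shown. Judging from the printed `data:` line and alternative hypothesis, it was probably a one-sided `t.test()` with unequal variances. A sketch on simulated values (the data frame `sim` and its parameters are assumptions that loosely follow Table 2, not the real data):

```r
set.seed(42)
# Simulated stand-in for compactness_mean in the two diagnosis groups.
sim <- data.frame(
  diagnosis = rep(c("Malignant", "Benign"), times = c(173, 227)),
  compactness_mean = c(rnorm(173, 0.145, 0.05), rnorm(227, 0.078, 0.03))
)

# One-sided Welch test: is the malignant mean greater than the benign mean?
# var.equal defaults to FALSE, which gives the Welch version.
res <- t.test(
  sim$compactness_mean[sim$diagnosis == "Malignant"],
  sim$compactness_mean[sim$diagnosis == "Benign"],
  alternative = "greater"
)

res$p.value < 0.05
```

With `alternative = "greater"` the confidence interval is one-sided, which is why the upper limit in the printed output is `Inf`.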

The final model uses three variables: radius mean, concave points worst and area worst. All three independent variables are statistically significant, with p-values below 0.05 (Prabhakaran, 2016). The model is shown in the table below.

Table 6: Logistic regression model output

model5 <- glm(diagnosis ~ radius_mean + concave.points_worst + area_worst, family = binomial(link = "logit"))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model5)

##
## Call:
## glm(formula = diagnosis ~ radius_mean + concave.points_worst +
##     area_worst, family = binomial(link = "logit"))
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.8231  -0.1385  -0.0362   0.0088   3.7624
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -4.601408   3.531814  -1.303  0.19263
## radius_mean          -1.176314   0.430713  -2.731  0.00631 **
## concave.points_worst 40.853801   8.030466   5.087 3.63e-07 ***
## area_worst            0.019996   0.004364   4.582 4.60e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 547.205 on 399 degrees of freedom
## Residual deviance: 89.336 on 396 degrees of freedom
## AIC: 97.336
##
## Number of Fisher Scoring iterations: 9

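The regression model obtained from the above analysis, using the coefficient estimates in Table 6, is shown below (Menard, 2008), where p is the estimated probability that a case is malignant:

log(p / (1 - p)) = -4.601408 - 1.176314 × radius_mean + 40.853801 × concave.points_worst + 0.019996 × area_worst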
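The coefficient estimates in Table 6 can be turned into a predicted probability by applying the inverse logit to the linear predictor. The sketch below assumes that the modelled outcome is "Malignant" (R models the second factor level by default) and, purely for illustration, plugs in the group means from Table 2 as two hypothetical cases; the function name is an assumption:

```r
# Coefficient estimates copied from Table 6.
coefs <- c(intercept = -4.601408, radius_mean = -1.176314,
           concave.points_worst = 40.853801, area_worst = 0.019996)

# Hypothetical helper: probability of malignancy for one case.
predict_malignant <- function(radius_mean, concave.points_worst, area_worst) {
  eta <- coefs[["intercept"]] +
    coefs[["radius_mean"]] * radius_mean +
    coefs[["concave.points_worst"]] * concave.points_worst +
    coefs[["area_worst"]] * area_worst
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

# Illustrative inputs: benign and malignant group means from Table 2.
p_benign_profile <- predict_malignant(12.07, 0.07, 548.45)
p_malignant_profile <- predict_malignant(17.27, 0.18, 1401.00)

round(c(p_benign_profile, p_malignant_profile), 4)
```

A case at the benign group means receives a probability close to 0 and a case at the malignant group means a probability close to 1, which matches the strong separation suggested by the residual deviance.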

Figure 4: ROC curve

The ROC curve in Figure 4 shows that the logistic model used in predicting malignant breast cancer has high discriminative power. The area under the curve is large, which indicates that the model achieves a high true positive rate while keeping the false positive rate low (Menard, 2008).
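The area under the ROC curve can be computed without a plotting package through its rank-based (Mann-Whitney) form: it equals the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case. A minimal sketch on a four-case toy example; the function name and the toy scores are assumptions, and no AUC value for the actual model is implied:

```r
# Rank-based AUC: share of positive/negative pairs ranked correctly.
auc_rank <- function(score, is_positive) {
  r <- rank(score)  # average ranks, so ties count as half
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predicted probabilities and true labels (TRUE = malignant).
toy_score <- c(0.10, 0.40, 0.35, 0.80)
toy_truth <- c(FALSE, FALSE, TRUE, TRUE)

auc_rank(toy_score, toy_truth)  # 3 of 4 pairs ordered correctly: 0.75
```

With the real model, the scores would presumably come from `predict(model5, newdata = test, type = "response")` evaluated on the held-out test set.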

Conclusion

We conclude that the compactness and texture of malignant breast cancer have higher average values than those of benign breast cancer. This indicates that the process of breast cancer diagnosis can be guided by these observations. The final logistic regression fit used radius mean, worst concave points and worst area to predict malignant breast cancer. The logistic regression fitted in Table 6 was determined to be the best fit for malignant breast cancer prediction.

Further exploration of the data can be conducted to improve the model. This may include developing dummy variables to improve the power of new models. The exploration will be directed at understanding how the distributions of the predictor variables relate to the prediction of malignant breast cancer.

References

Glover, T. and Mitchell, K. (2008). An introduction to biostatistics. Long Grove, Ill.: Waveland Press.

Kaggle.com (2016). Breast Cancer Wisconsin (Diagnostic) Data Set | Kaggle. [online] Kaggle.com. Available at: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data [Accessed 18 Oct. 2017].

Menard, S. (2008). Applied logistic regression analysis. Thousand Oaks, Calif. [u.a.]: Sage.

Prabhakaran, S. (2016). Logistic Regression With R. [online] R-statistics.co. Available at: https://r-statistics.co/Logistic-Regression-With-R.html [Accessed 18 Oct. 2017].
