Data Description
Since the inception of the art of business, risk has always been present, making it an inherent part of the business as a whole and of its operations component. This ranges from the probability of losses to decisions about which strategy to adopt, placing the business executive under constant pressure to devise solutions that not only fit the business problem at hand but also enable the business to predict the probability of such a problem occurring in both the near and the far future. As such, the business is set to meet its responsibility to its various stakeholders, who include:
- Shareholders
- Customers
- Government
- Business partners
- Employees
- Community
The business world is prone to uncertainty and therefore prediction is crucial to making sustainable decisions that aid in propelling the business further and cementing the confidence of stakeholders in the ability of the business to sustain itself and be equally profitable.
In this project, we will explore the groups or types of people who were likely to survive the Titanic accident and establish whether altruism (placing other humans' welfare before your own) played a part in who survived.
In order to project the probable cost of a life insurance policy for passengers on a cruise ship, an insurance company may choose to examine the kind of passengers likely to survive a catastrophic incident such as that of the "Titanic", a ship accident that killed approximately 1,500 of the 2,225 people aboard.
According to Elinder and Erixson (2012), the common belief that women and children have a higher survival rate in maritime accidents stands to be disproved. In their research, they establish that captains and men have comparatively higher chances of survival than women and children. Additionally, they argue that captains actually have a higher chance of surviving than passengers.
In this paper, using historical data on the maritime tragedy of the sinking of the Titanic, we will explore and investigate the truth of the statements: “Women and children have a higher survival chance than men” and “Captains and ship crew have a lower survival rate than the passengers.”
Data used for this project is obtained from historical data on the sinking of the Titanic. It is sampled into two sets (a loading sketch follows this list):
- Train data: used for training the prediction model
- Test data: used to test the built model
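The sketch below shows how the two files might be read in. The file names train.csv and test.csv are assumptions; the object names newdat and nwtstdat match the objects used in the code later on.
#loading the sampled data (file names are assumed)
newdat <- read.csv("train.csv") #train data
nwtstdat <- read.csv("test.csv") #test data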
The data has 12 variables with 276 entries. The variables include:
- Passenger ID
- Passenger traveling class
- Name
- Sex
- Age
- Number of siblings/spouses aboard
- Number of parents/ children aboard
- Ticket number
- Passenger fare
- Cabin number
- Port of embarking
The data has 177 missing values.
Vidhya (2017) suggests that the process of data cleaning involves the following steps (illustrated in the sketch after this list):
- Exploratory analysis, which also involves filtration of the data on a given condition
- Data visualization
- Checking for errors, which involves:
- Selecting: picking the variables of interest
- Imputation/dealing with errors: treating missing and duplicate data entries
- Summarising: conducting the data analysis
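As a rough illustration of these steps, and assuming the dplyr package and the variable names listed earlier, the filtration, selection and summarising stages might look like the following sketch (not the exact pipeline used in this project):
#illustrative sketch of the cleaning steps listed above
library(dplyr)
newdat %>%
  filter(!is.na(Age)) %>% #filtration on a given condition
  select(Survived, Sex, Age, Pclass) %>% #selecting variables of interest
  summarise(mean_age = mean(Age), survival_rate = mean(Survived)) #summarising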
We visualize the survival data in three ways in order to determine the distribution of the variable of interest in the sample data. These include:
- Histogram of survived passengers
- Scatter plot of Survived against Age
- Scatter plot of Survived against Sex
#displaying survival data in histogram
hist(newdat$Survived, main = "Histogram of Survived passengers", xlab = "survival", border = "grey", col = "blue", xlim = c(0,1), las = 1, breaks = 5)
#ggplot for survival against age
library(ggplot2)
ggplot(data = newdat, aes(y = Survived, x = Age)) + geom_point()
#scatter plot for Survived against sex
ggplot(data = newdat, aes(y = Survived, x = Sex)) + geom_point()
Data Cleaning and Preprocessing
The test data is filtered to include entries from passenger ID 893 to 1166, a total of 275 entries of the sample data, while the train data was filtered to include 15 sample entries.
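A sketch of how such filtering might be done with the dplyr package is shown below; nwtstdat_sub and newdat_sample are hypothetical names used only for illustration.
#filtering the test sample by passenger ID (sketch; hypothetical object names)
library(dplyr)
nwtstdat_sub <- filter(nwtstdat, PassengerId >= 893, PassengerId <= 1166)
#taking 15 sample entries from the train data (sketch)
newdat_sample <- head(newdat, 15)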
#checking if any missing values in train data
any(is.na(newdat))
#checking if any missing values in test data
any(is.na(nwtstdat))
#checking count of missing values in train data
sum(is.na(newdat))
#checking count of missing values in test data
sum(is.na(nwtstdat))
#checking missing-data pattern in train data
library(mice)
md.pattern(newdat)
(md.pattern output: missing-value pattern across PassengerId, Survived, Pclass, Name, Sex, SibSp, Parch, Ticket, Fare, Cabin, Embarked and Age)
#function to determine the percentage of missing data per variable
datmissing <- function(x){sum(is.na(x))/length(x)*100}
apply(newdat, 2, datmissing)
#checking missing-data pattern in test data
library(mice)
md.pattern(nwtstdat)
From the summary, we note that 177 values are missing in the Age variable of the train data, accounting for 19.86532% of the entries. Given that this is above the 5% threshold, we might consider dropping the variable. However, our business problem seeks to explore age, i.e. children's and adults' survival chances, hence we retain the variable "Age". The test data, on the other hand, has 87 missing values: one in the Fare variable and 86 in the Age variable.
In order to handle the missing values in the Age variable, we replace them with zero:
#dealing with missing values for train data-set
newdat[is.na(newdat)] <- 0
#checking if still there are missing values
any(is.na(newdat))
FALSE
#dealing with missing values for test data-set
nwtstdat[is.na(nwtstdat)] <- 0
#checking if still there are missing values
any(is.na(nwtstdat))
#adding survival column to test data-set
library(dplyr)
nwtstdat <- bind_cols(nwtstdat, newdat[1:418, 2])
In order to work with data relevant to our business problem, we selected six variables from the sample data:
- Passenger ID
- Survival
- Sex
- Age
- Cabin
- Passenger class
Passenger ID - identifies every unique entry of our data-set, hence important for sub-setting and analysis
Survived - indicates the survival status of each passenger
Sex - identifies the sex of each passenger
Age - gives the specific age of each passenger on board
Cabin - allocates every passenger to a ship station, i.e. their place of abode while cruising
Passenger class - identifies the travelling class of each passenger
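A sketch of this selection using the dplyr package is shown below; newdat_sel and nwtstdat_sel are hypothetical names for the reduced data frames, used only for illustration.
#selecting the six variables of interest (sketch; hypothetical object names)
library(dplyr)
newdat_sel <- select(newdat, PassengerId, Survived, Sex, Age, Cabin, Pclass)
nwtstdat_sel <- select(nwtstdat, PassengerId, Survived, Sex, Age, Cabin, Pclass)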
Data splitting is already done, i.e. the data is provided as a training data-set and a test data-set. Additionally, in our re-sampling we used the caret package.
Exploratory Data Analysis
We will build a model using train data and test for its suitability and accuracy using test data.
#checking the actual survival values that predictions will be compared against
truecount <- nwtstdat$Survived
summary(truecount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.3838 1.0000 1.0000
library(caret)
out_cam_nem <- "Survived"
prdictas <- names(newdat)[!names(newdat) %in% out_cam_nem]
survival_chance_tst <- rfe(nwtstdat[, prdictas], nwtstdat[, out_cam_nem])
linear_model1 <- train(Survived ~ ., data = newdat, method = "lm")
#Obtaining overall R-squared statistic
summary(linear_model1$finalModel)$r.squared
linear_model1
Linear Regression
891 samples
11 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 891, 891, 891, 891, 891, 891, …
Resampling results:
RMSE Rsquared MAE
0.6700129 0.07156671 0.4350878
Tuning parameter ‘intercept’ was held constant at a value of TRUE
The in-sample R-squared statistic from the final model is 1, which is unrealistically optimistic, while the more realistic resampled R-squared statistic is 0.07156671; the latter better reflects how the model will perform on new data.
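A sketch of how the trained model's predictions could be checked against the actual test-set outcomes is given below; predict and postResample are standard caret functions, and in practice any factor levels that appear only in the test set would need to be handled before predicting.
#comparing predictions from the caret model against actual test outcomes (sketch)
predictions <- predict(linear_model1, newdata = nwtstdat)
postResample(pred = predictions, obs = nwtstdat$Survived) #returns RMSE, Rsquared and MAE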
According to an article published by Zevross (2017), the caret package does the following:
- Samples the data randomly with replacement
- Creates a model for the sampled data
- Conducts out-of-bag observations and calculates the R-squared statistic
- Runs 25 models and averages the R-squared statistic
- Caret runs a final model, storing it as finalModel
Therefore, given the low resampled R-squared for our training data, we consider the model comparatively suitable for new data.
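For reference, the bootstrap resampling that caret applies by default can also be written out explicitly with trainControl; the sketch below is equivalent to the train() call used above, since 25 bootstrap repetitions is already the default.
#making caret's default bootstrap resampling explicit (sketch)
library(caret)
boot_ctrl <- trainControl(method = "boot", number = 25)
linear_model1 <- train(Survived ~ ., data = newdat, method = "lm", trControl = boot_ctrl)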
#building second model: linear regression on the train data
reg_model_fit <- lm(Survived ~ Sex + PassengerId + Age + Pclass, data = newdat)
summary(reg_model_fit)
Call:
lm(formula = Survived ~ Sex + PassengerId + Age + Pclass, data = newdat)
Residuals:
Min 1Q Median 3Q Max
-0.99540 -0.22791 -0.07716 0.23243 0.97100
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.172e+00 5.578e-02 21.003 < 2e-16 ***
Sexmale -5.106e-01 2.741e-02 -18.630 < 2e-16 ***
PassengerId 1.736e-05 5.035e-05 0.345 0.73038
Age -2.404e-03 7.908e-04 -3.039 0.00244 **
Pclass -1.766e-01 1.679e-02 -10.519 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3858 on 886 degrees of freedom
Multiple R-squared: 0.3743, Adjusted R-squared: 0.3714
F-statistic: 132.5 on 4 and 886 DF, p-value: < 2.2e-16
#building third model: linear regression on the test data
reg_model_fit2 <- lm(Survived ~ Sex + PassengerId + Age + Pclass, data = nwtstdat)
summary(reg_model_fit2)
Call:
lm(formula = Survived ~ Sex + PassengerId + Age + Pclass, data = nwtstdat)
Residuals:
Min 1Q Median 3Q Max
-0.4879 -0.3955 -0.3461 0.5810 0.7363
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0047832 0.2468958 0.019 0.985
Sexmale 0.0484419 0.0499939 0.969 0.333
PassengerId 0.0002514 0.0001984 1.267 0.206
Age 0.0004798 0.0015859 0.303 0.762
Pclass 0.0292105 0.0333070 0.877 0.381
Residual standard error: 0.4886 on 413 degrees of freedom
Multiple R-squared: 0.008437, Adjusted R-squared: -0.001166
F-statistic: 0.8785 on 4 and 413 DF, p-value: 0.4767
#Showing prediction of survival chances given Age and Sex
library(scatterplot3d)
sex_num <- as.numeric(factor(newdat$Sex)) #encode sex numerically for plotting
regfit <- scatterplot3d(sex_num, newdat$Age, newdat$Survived, angle = 60, color = "dodgerblue", pch = 1, xlab = "Sex (male/female)", ylab = "Age (years)", zlab = "Survival (rate)")
#overlaying our observed data
regfit$points3d(sex_num, newdat$Age, newdat$Survived, pch = 16)
From the second linear regression model we find the R-squared statistic to be 0.3714, which is relatively high compared to that of the first model (0.07156671) and that of the third predictive model (0.008437). Therefore, when considering the best model, we drop the second due to its supposed inability to successfully predict new data.
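A short sketch of how the three R-squared statistics quoted above can be pulled out side by side from the fitted model objects:
#collecting the R-squared statistics of the three models (sketch)
c(model1 = linear_model1$results$Rsquared[1], #resampled R-squared from caret
  model2 = summary(reg_model_fit)$r.squared, #lm on the train data
  model3 = summary(reg_model_fit2)$r.squared) #lm on the test data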
Checking for correlation between the variables, we realized that only the passenger class had a strong correlation with survival.
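A sketch of such a correlation check on the numeric variables is shown below; the choice of columns is an assumption, and cor() only accepts numeric input, so a variable like Sex would have to be encoded numerically before it could be included.
#checking correlation of the numeric variables with survival (sketch)
cor(newdat[, c("Survived", "Pclass", "Age", "Fare")])[, "Survived"]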
Determining the factors that observably influenced survival
(Figures: distribution of survivors by variable; male and female survivors; total survivors by variable; total survivors from each ship cabin; survivors according to age and sex; scatter-plot of the male and female survivors' distribution; survival of females in the data sample.)
Therefore, from the graphs displayed above we can draw the following conclusions:
- More males survived the Titanic maritime accident than females
- Most passengers in the first cabin survived compared to those in lower cabins, with the last cabin having the fewest survivors
- Fewer children than adults survived
- Those who had embarked had higher chances of survival than those who were still on the cruise ship
These four conclusions help us answer our initial business questions, i.e.:
- Women and children have lower chances of survival in case of a maritime accident
- Captains and crew members do not, relatively speaking, give passengers priority during sea accidents, as established from the survival rate of the first and second cabins, from where the captain controls the ship
- Passengers in better travelling cabins, i.e. the first and second, have higher chances of survival compared to those in the rest of the cabins.
Therefore, in conclusion, there are a number of factors that may determine survival in case of a sea accident, and measures that increase the chances of survival should be put in place as risk prevention. These may include an adequate number of lifeboats and adequate surveillance and reconnaissance of the ship's path before any given voyage.
References
Elinder, M., & Erixson, O. (2012). Gender, social norms, and survival in maritime disasters. Proceedings of the National Academy of Sciences, 109(33), 13220-13224.
Intellectus Statistics. (2018). Linear regression. Retrieved from: https://www.intellectusstatistics.com/data-analysis-plan-templates/multinomial-logistic-regression/
Kaushik, S. (2016). Practical guide to implement machine learning. Analytics Vidhya. Retrieved from: https://www.analyticsvidhya.com/blog/2016/12/practical-guide-to-implement-machine-learning-with-caret-package-in-r-with-practice-problem/. Accessed on 18th May 2018.
Laguna, M., & Marklund, J. (2005). Business process modeling, simulation, and design. Upper Saddle River, NJ: Pearson/Prentice Hall.
Ross, Z. (2017). Predictive modeling and machine learning in R with the caret package. Technical Tidbits From Spatial Analysis & Data Science. Retrieved from: https://www.zevross.com/blog/2017/09/19/predictive-modeling-and-machine-learning-in-r-with-the-caret-package/
Rouse, A. Strategic decision making process: Models and theories. Journal on Business Model Development, 35, 33-34. Retrieved from: https://www.researchgate.net/publication/312187946_Strategic_Decision_Making_Process_Models_and_Theories