Data Description
Since the inception of the art of business, risk has always been present, making it an inherent part of the business as a whole and of its operations component. This ranges from the probability of losses to decisions about which strategy to adopt, placing the business executive under constant pressure to devise solutions that not only fit the business problem at hand but also enable the business to predict the probability of such a problem occurring in both the near and the far future. As such, the business is set to meet its responsibility to its various stakeholders, who include:
- Shareholders
- Customers
- Government
- Business partners
- Employees
- Community
The business world is prone to uncertainty and therefore prediction is crucial to making sustainable decisions that aid in propelling the business further and cementing the confidence of stakeholders in the ability of the business to sustain itself and be equally profitable.
In this project, we will explore the groups or types of people who were likely to survive the Titanic accident and establish whether altruism (placing other humans' welfare before your own) played a part in who survived.
In order to project the probable cost of a life insurance policy for passengers on a cruise ship, an insurance company may choose to examine the kind of passengers likely to survive a catastrophic incident such as that of the "Titanic", a ship accident that killed approximately 1,500 of the 2,225 people aboard.
According to Elinder and Erixson (2012), the common belief that women and children have a higher survival rate in maritime accidents stands to be disproved. In their research, they establish that captains and men have comparatively higher chances of survival than women and children. Additionally, they argue that captains actually have a higher chance of surviving than passengers.
In this paper, using historical data on the maritime tragedy of the sinking of the Titanic, we will explore and investigate the truth of the statements: “Women and children have a higher survival chance than men” and “Captains and ship crew have a lower survival rate than the passengers.”
Data used for this project is obtained from historical data on the sinking of the Titanic. It is sampled into two sets (a loading sketch follows this list):
- Train data: used for training the prediction model
- Test data: used to test the built model
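The sketch below shows how the two files might be read in. The file names train.csv and test.csv are assumptions; the object names newdat and nwtstdat match the objects used in the code later on.
#loading the sampled data (file names are assumed)
newdat <- read.csv("train.csv") #train data
nwtstdat <- read.csv("test.csv") #test data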
The data has 12 variables with 276 entries. The variables include:
- Passenger ID
- Passenger traveling class
- Name
- Sex
- Age
- Number of siblings/spouses aboard
- Number of parents/ children aboard
- Ticket number
- Passenger fare
- Cabin number
- Port of embarking
The data has 177 missing values.
Vidhya (2017) suggests that the process of data cleaning involves the following steps (illustrated in the sketch after this list):
- Exploratory analysis, which also involves filtration of the data on a given condition
- Data visualization
- Checking for errors, which involves:
- Selecting: picking the variables of interest
- Imputation/dealing with errors: treating missing and duplicate data entries
- Summarising: conducting the data analysis
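As a rough illustration of these steps, and assuming the dplyr package and the variable names listed earlier, the filtration, selection and summarising stages might look like the following sketch (not the exact pipeline used in this project):
#illustrative sketch of the cleaning steps listed above
library(dplyr)
newdat %>%
  filter(!is.na(Age)) %>% #filtration on a given condition
  select(Survived, Sex, Age, Pclass) %>% #selecting variables of interest
  summarise(mean_age = mean(Age), survival_rate = mean(Survived)) #summarising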
We visualize the survival data in three ways in order to determine the distribution of the variable of interest in the sample data. These include:
- Histogram of survived passengers
- Scatter plot of Survived against Age
- Scatter plot of Survived against Sex
#displaying survival data in histogram
hist(newdat$Survived, main = "Histogram of Survived passengers", xlab = "survival", border = "grey", col = "blue", xlim = c(0,1), las = 1, breaks = 5)
#ggplot for survival against age
library(ggplot2)
ggplot(data = newdat, aes(y = Survived, x = Age)) + geom_point()
#scatter plot for Survived against sex
ggplot(data = newdat, aes(y = Survived, x = Sex)) + geom_point()
Data Cleaning and Preprocessing
The test data is filtered to include entries from passenger ID 893 to 1166, a total of 275 entries of the sample data, while the train data was filtered to include 15 sample entries.
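A sketch of how such filtering might be done with the dplyr package is shown below; nwtstdat_sub and newdat_sample are hypothetical names used only for illustration.
#filtering the test sample by passenger ID (sketch; hypothetical object names)
library(dplyr)
nwtstdat_sub <- filter(nwtstdat, PassengerId >= 893, PassengerId <= 1166)
#taking 15 sample entries from the train data (sketch)
newdat_sample <- head(newdat, 15)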
#checking if any missing values in train data
any(is.na(newdat))
#checking if any missing values in test data
any(is.na(nwtstdat))
#checking count of missing values in train data
sum(is.na(newdat))
#checking count of missing values in test data
sum(is.na(nwtstdat))
#checking missing-data pattern in train data
library(mice)
md.pattern(newdat)
(md.pattern output: missing-value pattern across PassengerId, Survived, Pclass, Name, Sex, SibSp, Parch, Ticket, Fare, Cabin, Embarked and Age)
#function to determine the percentage of missing data per variable
datmissing <- function(x){sum(is.na(x))/length(x)*100}
apply(newdat, 2, datmissing)
#checking missing-data pattern in test data
library(mice)
md.pattern(nwtstdat)
From the summary, we note that 177 values are missing in the Age variable of the train data, accounting for 19.86532% of the entries. Given that this is above the 5% threshold, we might consider dropping the variable. However, our business problem seeks to explore age, i.e. children's and adults' survival chances, hence we retain the variable "Age". The test data, on the other hand, has 87 missing values: one in the Fare variable and 86 in the Age variable.
In order to handle the missing values in the Age variable, we replace them with zero:
#dealing with missing values for train data-set
newdat[is.na(newdat)] <- 0
#checking if still there are missing values
any(is.na(newdat))
FALSE
#dealing with missing values for test data-set
nwtstdat[is.na(nwtstdat)] <- 0
#checking if still there are missing values
any(is.na(nwtstdat))
#adding survival column to test data-set
library(dplyr)
nwtstdat <- bind_cols(nwtstdat, newdat[1:418, 2])
In order to work with data relevant to our business problem, we selected six variables from the sample data:
- Passenger ID
- Survival
- Sex
- Age
- Cabin
- Passenger class
Passenger ID - identifies every unique entry of our data-set, hence important for sub-setting and analysis
Survived - indicates the survival status of each passenger
Sex - identifies the sex of each passenger
Age - gives the specific age of each passenger on board
Cabin - allocates every passenger to a ship station, i.e. their place of abode while cruising
Passenger class - identifies the travelling class of each passenger
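A sketch of this selection using the dplyr package is shown below; newdat_sel and nwtstdat_sel are hypothetical names for the reduced data frames, used only for illustration.
#selecting the six variables of interest (sketch; hypothetical object names)
library(dplyr)
newdat_sel <- select(newdat, PassengerId, Survived, Sex, Age, Cabin, Pclass)
nwtstdat_sel <- select(nwtstdat, PassengerId, Survived, Sex, Age, Cabin, Pclass)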
Data splitting is already done, i.e. the data is provided as a training data-set and a test data-set. Additionally, in our re-sampling we used the caret package.
Exploratory Data Analysis
We will build a model using train data and test for its suitability and accuracy using test data.
#checking the actual survival values that predictions will be compared against
truecount <- nwtstdat$Survived
summary(truecount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.3838 1.0000 1.0000
library(caret)
out_cam_nem <- "Survived"
prdictas <- names(newdat)[!names(newdat) %in% out_cam_nem]
survival_chance_tst <- rfe(nwtstdat[, prdictas], nwtstdat[, out_cam_nem])
linear_model1 <- train(Survived ~ ., data = newdat, method = "lm")
#Obtaining overall R-squared statistic
summary(linear_model1$finalModel)$r.squared
linear_model1
Linear Regression
891 samples
11 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 891, 891, 891, 891, 891, 891, …
Resampling results:
RMSE Rsquared MAE
0.6700129 0.07156671 0.4350878
Tuning parameter ‘intercept’ was held constant at a value of TRUE
The in-sample R-squared statistic from the final model is 1, which is unrealistically optimistic, while the more realistic resampled R-squared statistic is 0.07156671; the latter better reflects how the model will perform on new data.
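A sketch of how the trained model's predictions could be checked against the actual test-set outcomes is given below; predict and postResample are standard caret functions, and in practice any factor levels that appear only in the test set would need to be handled before predicting.
#comparing predictions from the caret model against actual test outcomes (sketch)
predictions <- predict(linear_model1, newdata = nwtstdat)
postResample(pred = predictions, obs = nwtstdat$Survived) #returns RMSE, Rsquared and MAE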
According to an article published by Zevross (2017), the caret package does the following:
- Samples the data randomly with replacement
- Creates a model for the sampled data
- Conducts out-of-bag observations and calculates the R-squared statistic
- Runs 25 models and averages the R-squared statistic
- Caret runs a final model, storing it as finalModel
Therefore, given the low resampled R-squared for our training data, we consider the model comparatively suitable for new data.
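For reference, the bootstrap resampling that caret applies by default can also be written out explicitly with trainControl; the sketch below is equivalent to the train() call used above, since 25 bootstrap repetitions is already the default.
#making caret's default bootstrap resampling explicit (sketch)
library(caret)
boot_ctrl <- trainControl(method = "boot", number = 25)
linear_model1 <- train(Survived ~ ., data = newdat, method = "lm", trControl = boot_ctrl)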
#building second model: linear regression on the train data
reg_model_fit <- lm(Survived ~ Sex + PassengerId + Age + Pclass, data = newdat)
summary(reg_model_fit)
Call:
lm(formula = Survived ~ Sex + PassengerId + Age + Pclass, data = newdat)
Residuals:
Min 1Q Median 3Q Max
-0.99540 -0.22791 -0.07716 0.23243 0.97100
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.172e+00 5.578e-02 21.003 < 2e-16 ***
Sexmale -5.106e-01 2.741e-02 -18.630 < 2e-16 ***
PassengerId 1.736e-05 5.035e-05 0.345 0.73038
Age -2.404e-03 7.908e-04 -3.039 0.00244 **
Pclass -1.766e-01 1.679e-02 -10.519 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3858 on 886 degrees of freedom
Multiple R-squared: 0.3743, Adjusted R-squared: 0.3714
F-statistic: 132.5 on 4 and 886 DF, p-value: < 2.2e-16
#building third model: linear regression on the test data
reg_model_fit2 <- lm(Survived ~ Sex + PassengerId + Age + Pclass, data = nwtstdat)
summary(reg_model_fit2)
Call:
lm(formula = Survived ~ Sex + PassengerId + Age + Pclass, data = nwtstdat)
Residuals:
Min 1Q Median 3Q Max
-0.4879 -0.3955 -0.3461 0.5810 0.7363
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0047832 0.2468958 0.019 0.985
Sexmale 0.0484419 0.0499939 0.969 0.333
PassengerId 0.0002514 0.0001984 1.267 0.206
Age 0.0004798 0.0015859 0.303 0.762
Pclass 0.0292105 0.0333070 0.877 0.381
Residual standard error: 0.4886 on 413 degrees of freedom
Multiple R-squared: 0.008437, Adjusted R-squared: -0.001166
F-statistic: 0.8785 on 4 and 413 DF, p-value: 0.4767
#Showing prediction of survival chances given Age and Sex
library(scatterplot3d)
sex_num <- as.numeric(factor(newdat$Sex)) #encode sex numerically for plotting
regfit <- scatterplot3d(sex_num, newdat$Age, newdat$Survived, angle = 60, color = "dodgerblue", pch = 1, xlab = "Sex (male/female)", ylab = "Age (years)", zlab = "Survival (rate)")
#overlaying our observed data
regfit$points3d(sex_num, newdat$Age, newdat$Survived, pch = 16)
From the second linear regression model we find the R-squared statistic to be 0.3714, which is relatively high compared to that of the first model (0.07156671) and that of the third predictive model (0.008437). Therefore, when considering the best model, we drop the second due to its supposed inability to successfully predict new data.
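A short sketch of how the three R-squared statistics quoted above can be pulled out side by side from the fitted model objects:
#collecting the R-squared statistics of the three models (sketch)
c(model1 = linear_model1$results$Rsquared[1], #resampled R-squared from caret
  model2 = summary(reg_model_fit)$r.squared, #lm on the train data
  model3 = summary(reg_model_fit2)$r.squared) #lm on the test data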
Checking for correlation between the variables, we realized that only the passenger class had a strong correlation with survival.
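A sketch of such a correlation check on the numeric variables is shown below; the choice of columns is an assumption, and cor() only accepts numeric input, so a variable like Sex would have to be encoded numerically before it could be included.
#checking correlation of the numeric variables with survival (sketch)
cor(newdat[, c("Survived", "Pclass", "Age", "Fare")])[, "Survived"]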
Determining the factors that observably influenced survival
(Figures: distribution of survivors by variable; male and female survivors; total survivors by variable; total survivors from each ship cabin; survivors according to age and sex; scatter-plot of the male and female survivors' distribution; survival of females in the data sample.)
Therefore, from the graphs displayed above we can draw the following conclusions:
- More males survived the Titanic maritime accident than females
- Most passengers in the first cabin survived compared to those in lower cabins, with the last cabin having the fewest survivors
- Fewer children than adults survived
- Those who had embarked had higher chances of survival than those who were still on the cruise ship
These four conclusions help us answer our initial business questions, i.e.:
- Women and children have lower chances of survival in case of a maritime accident
- Captains and crew members do not, relatively speaking, give passengers priority during sea accidents, as established from the survival rate of the first and second cabins, from where the captain controls the ship
- Passengers in better travelling cabins, i.e. the first and second, have higher chances of survival compared to those in the rest of the cabins.
Therefore, in conclusion, there are a number of factors that may determine survival in case of a sea accident, and measures that increase the chances of survival should be put in place as risk prevention. These may include an adequate number of lifeboats and adequate surveillance and reconnaissance of the ship's path before any given voyage.
References
Elinder, M., & Erixson, O. (2012). Gender, social norms, and survival in maritime disasters. Proceedings of the National Academy of Sciences, 109(33), 13220-13224.
Intellectus Statistics. (2018). Linear regression. Retrieved from: https://www.intellectusstatistics.com/data-analysis-plan-templates/multinomial-logistic-regression/
Kaushik, S. (2016). Practical guide to implement machine learning. Analytics Vidhya. Retrieved from: https://www.analyticsvidhya.com/blog/2016/12/practical-guide-to-implement-machine-learning-with-caret-package-in-r-with-practice-problem/. Accessed on 18th May 2018.
Laguna, M., & Marklund, J. (2005). Business process modeling, simulation, and design. Upper Saddle River, NJ: Pearson/Prentice Hall.
Ross, Z. (2017). Predictive modeling and machine learning in R with the caret package. Technical Tidbits From Spatial Analysis & Data Science. Retrieved from: https://www.zevross.com/blog/2017/09/19/predictive-modeling-and-machine-learning-in-r-with-the-caret-package/
Rouse, A. Strategic decision making process: Models and theories. Journal on Business Model Development, 35, 33-34. Retrieved from: https://www.researchgate.net/publication/312187946_Strategic_Decision_Making_Process_Models_and_Theories