---
title: "Lab 8"
author: "Allen"
date: "November 30, 2017"
output: pdf_document
---

```{r setup, include=FALSE}
```

## R Markdown

```{r}
library(readr)
titanic_data <- read_csv("C:/Users/allecm14/Downloads/titanic data.csv")
# View() opens an interactive viewer, so only call it outside of knitting
if (interactive()) View(titanic_data)
```

## Question 1:

From looking at the variables, I would suppose that the variable name and the variable passenger should not be in the model, because each row represents one individual person and these variables only identify that person. I would also take out ticket number, as the number of the ticket should not intuitively affect whether they survived. I removed the variable embarked as well, as the port of embarkation should not affect our response variable: it doesn't matter at which specific port they boarded the ship.

## Question 2:

```{r, echo=FALSE}
emplogit <- function(x, y, binsize = NULL, ci = FALSE, probit = FALSE,
                     prob = FALSE, main = NULL, xlab = "", ylab = "",
                     lowess.in = FALSE){
  # x         vector with values of the independent variable
  # y         vector of binary responses
  # binsize   integer value specifying bin size (optional)
  # ci        logical value indicating whether to plot approximate
  #           confidence intervals (not supported as of 02/08/2015)
  # probit    logical value indicating whether to plot probits instead
  #           of logits
  # prob      logical value indicating whether to plot probabilities
  #           without transforming
  #
  # the rest are the familiar plotting options

  if (length(x) != length(y))
    stop("x and y lengths differ")
  if (any(y < 0 | y > 1))
    stop("y not between 0 and 1")
  if (length(x) < 100 & is.null(binsize))
    stop("Less than 100 observations: specify binsize manually")

  if (is.null(binsize)) binsize = min(round(length(x)/10), 50)

  if (probit){
    link = qnorm
    if (is.null(main)) main = "Empirical probits"
  } else {
    link = function(x) log(x/(1-x))
    if (is.null(main)) main = "Empirical logits"
  }

  # sort both vectors by x so that bins contain neighbouring x values
  sort = order(x)
  x = x[sort]
  y = y[sort]

  # a and b hold the first and last index of each bin
  a = seq(1, length(x), by=binsize)
  b = c(a[-1] - 1, length(x))

  # note: this overwrites the `prob` argument, which is otherwise unused
  prob = xmean = ns = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
    ns[i] = b[i] - a[i] + 1 # for CI
  }

  # bins with proportion 0 or 1 have infinite logits, so clamp them to the
  # most extreme non-degenerate values and leave them out of the fits
  extreme = (prob == 1 | prob == 0)
  prob[prob == 0] = min(prob[!extreme])
  prob[prob == 1] = max(prob[!extreme])

  g = link(prob) # logits (or probits if probit == TRUE)

  linear.fit = lm(g[!extreme] ~ xmean[!extreme])
  b0 = linear.fit$coef[1]
  b1 = linear.fit$coef[2]

  loess.fit = loess(g[!extreme] ~ xmean[!extreme])

  plot(xmean, g, main=main, xlab=xlab, ylab=ylab)
  abline(b0,b1)
  if(lowess.in ==TRUE){
    lines(loess.fit$x, loess.fit$fitted, lwd=2, lty=2)
  }
}
```

```{r}
titanic_data$Pclass <- as.factor(titanic_data$Pclass)
```

If we have two categorical variables, we would use a two-way table.

## Question 3:

If we have one categorical variable and one quantitative variable, we would use a boxplot to look at the data.

## Question 4:

```{r}
prop.table(table(titanic_data$Survived,titanic_data$Pclass),1)
```

```{r}
prop.table(table(titanic_data$Survived,titanic_data$Pclass),2)
```

```{r}
table(titanic_data$Survived, titanic_data$Pclass)
```

```{r}
table(titanic_data$Survived, titanic_data$Sex)
```

```{r}
plot(titanic_data$Survived~titanic_data$Age, xlab="Age", ylab="Survived")
```

```{r}
plot(titanic_data$Survived~titanic_data$SibSp, xlab="Siblings", ylab="Survived")
```

Based on this plot, no one with more than four siblings survived, so it seems reasonable to suppose that the more siblings one had, the less likely they were to survive.
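
A scatterplot of a 0/1 response stacks many points on top of each other, so the pattern above could be checked with a table of group sizes and survival rates. The sketch below is only an illustration and not part of the original lab: the helper name `survival_by_group` is hypothetical, and it assumes `titanic_data` has been loaded as above with `Survived` coded 0/1.

```{r}
# Hypothetical helper (an assumption, not from the lab handout):
# number of passengers and proportion surviving for each value of `group`.
survival_by_group <- function(survived, group) {
  # drop rows where either value is missing
  keep <- !is.na(survived) & !is.na(group)
  survived <- survived[keep]
  group <- group[keep]

  counts <- table(group)                  # passengers per group
  rates <- tapply(survived, group, mean)  # mean of 0/1 = survival rate

  data.frame(
    group = names(counts),
    n = as.vector(counts),
    rate = as.vector(rates),
    stringsAsFactors = FALSE
  )
}

# Only runs if the data set from the earlier chunk is available
if (exists("titanic_data")) {
  print(survival_by_group(titanic_data$Survived, titanic_data$SibSp))
}
```

Looking at `n` next to `rate` helps show whether the groups with many siblings are large enough to support the conclusion, since a rate of zero in a group with only a handful of passengers is weaker evidence than the plot alone suggests.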

