This is the third part of a series surrounding survival analysis. For part 1, click here. For part 2, click here.

This particular post covers the final stage of fitting a Cox proportional hazards model: residual checking and model validation.

Solutions can be found for these exercises here.

**Exercise 1**

Load the survival and survminer libraries. Build our previously derived final Cox model.

**Exercise 2**

Calculate Martingale residuals and build a dataframe of these residuals with a second index column.
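A minimal sketch of Exercises 1–3, assuming the final model from part 2 uses sex, ph.karno, and wt.loss (swap in your own derived terms if they differ):

```r
library(survival)

# Final Cox model assumed from part 2; adjust the formula to your own model
fit <- coxph(Surv(time, status) ~ sex + ph.karno + wt.loss, data = lung)

# Martingale residuals with an index column for plotting
mart <- residuals(fit, type = "martingale")
res_df <- data.frame(index = seq_along(mart), martingale = mart)

plot(res_df$index, res_df$martingale,
     xlab = "Index", ylab = "Martingale residual")
```

The deviance residuals for Exercise 4 come from the same `residuals()` call with `type = "deviance"`.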

**Exercise 3**

Plot your Martingale residuals now.

**Exercise 4**

In a similar way to above, calculate deviance residuals, incorporate them into your residual data frame, and plot.

**Exercise 5**

Calculate the linear predictors of your final model and plot against your deviance residuals.

You may need to use ggrepel here to handle label positioning.

**Exercise 6**

Build a new dataframe containing dfbeta values for each variable within your final model.

**Exercise 7**

Using a for loop, plot each of these dfbeta residuals.

**Exercise 8**

With graphical residual plots created, formally test that the PH assumption has been met for your final model.
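One way to sketch the formal test and its graphical companion, again assuming the sex + ph.karno + wt.loss model:

```r
library(survival)

fit <- coxph(Surv(time, status) ~ sex + ph.karno + wt.loss, data = lung)

# Formal test of the proportional hazards assumption, per term and globally
zp <- cox.zph(fit)
print(zp)   # p-values above 0.05 support the PH assumption

# Schoenfeld residual plots to confirm the test graphically (Exercise 9)
plot(zp)
```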

**Exercise 9**

Plot this now to graphically confirm your test.

**Exercise 10**

With all residuals and model checking complete, visualise your final survival curve generated by your model.


This tutorial concerns itself with MLE calculations and bootstrapping.

Answers to the exercises are available here.

**Exercise 1**

Set a seed to 123 and create the following dataframe:

```r
lifespans = data.frame(index = 1:200, lifespans = rgamma(200, shape = 2, rate = 1.2)*50)
```

It is thought that this bacterial lifespan data may follow a gamma distribution. Confirm this assumption through an exploratory plot.
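One way to sketch the exploratory check is to overlay the hypothesised gamma density on a histogram; note that multiplying a gamma draw by 50 divides its rate by 50:

```r
set.seed(123)
lifespans <- data.frame(index = 1:200,
                        lifespans = rgamma(200, shape = 2, rate = 1.2) * 50)

# Histogram on the density scale with the hypothesised gamma density overlaid
hist(lifespans$lifespans, freq = FALSE, breaks = 20,
     main = "Bacterial lifespans", xlab = "Lifespan")
curve(dgamma(x, shape = 2, rate = 1.2 / 50), add = TRUE, col = "red", lwd = 2)
```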

**Exercise 2**

Using an appropriate function from the MASS library, evaluate the log-likelihood for this dataset.

**Exercise 3**

From this evaluation, also obtain MLEs for the shape and rate parameters.
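Exercises 2–4 can be sketched with `fitdistr()` from MASS; the data are regenerated here so the block stands alone:

```r
library(MASS)

set.seed(123)
x <- rgamma(200, shape = 2, rate = 1.2) * 50

# Maximum-likelihood gamma fit; fitdistr chooses its own starting values
# for the gamma family
fit <- fitdistr(x, densfun = "gamma")

fit$estimate   # MLEs for shape and rate (Exercise 3)
fit$sd         # standard errors (Exercise 4)
fit$loglik     # log-likelihood of the fitted model (Exercise 2)
```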

**Exercise 4**

Obtain the standard errors for the shape parameter.

**Exercise 5**

Use a Wald test to calculate a test statistic to determine whether the shape MLE differs from 2 at the 5% level.

**Exercise 6**

Generate an exact p-value for this test.
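A hand-rolled Wald test for Exercises 5 and 6; the fit is rebuilt here so the block is self-contained:

```r
library(MASS)

set.seed(123)
x <- rgamma(200, shape = 2, rate = 1.2) * 50
fit <- fitdistr(x, densfun = "gamma")

# Wald statistic for H0: shape = 2, using the MLE and its standard error
z <- (fit$estimate[["shape"]] - 2) / fit$sd[["shape"]]
z                    # compare |z| against qnorm(0.975) at the 5% level
2 * pnorm(-abs(z))   # Exercise 6: exact two-sided p-value
</code>
```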

**Exercise 7**

The goal parameter of our bootstrap is the variance. From the new dataset below, calculate the variance.

```r
durations = c(35.8, 33.4, 34.9, 17.9, 35.6, 10.3, 14.9, 28.3, 39.2, 25.4, 23.4, 7.1, 38.9, 9.2, 8.1)
```

**Exercise 8**

Create a single bootstrapped sample and calculate the variance from this.

**Exercise 9**

Turn your solution from step 8 into a for loop, generating 100 bootstrapped samples and test statistics.
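A minimal sketch of Exercises 8–10; the percentile method used here is one common way to form the bootstrap interval:

```r
durations <- c(35.8, 33.4, 34.9, 17.9, 35.6, 10.3, 14.9, 28.3,
               39.2, 25.4, 23.4, 7.1, 38.9, 9.2, 8.1)

set.seed(123)
n_boot <- 100
boot_vars <- numeric(n_boot)

for (i in seq_len(n_boot)) {
  resample <- sample(durations, replace = TRUE)   # resample with replacement
  boot_vars[i] <- var(resample)                   # test statistic per sample
}

# Exercise 10: 95% percentile confidence interval for the variance
quantile(boot_vars, probs = c(0.025, 0.975))
```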

**Exercise 10**

Calculate a 95% confidence interval for your bootstrapped test statistic.


The skill in conducting this sort of work is being able to identify an appropriate distribution on which to model the question and test accordingly. Conveniently, within R, the syntax for conducting the test and drawing distributional samples is very uniform. The aim of this tutorial is to expose users less familiar with the statistical theory to some of the more common distributions and their applications.

Answers to the exercises are available here.

**Exercise 1**

Set a seed to 123 and plot a histogram of 1000 draws from a normal distribution with mean 10, standard deviation 2.

**Exercise 2**

Using a QQ plot, assess the normality of your previously simulated draws.

**Exercise 3**

Using a t-test, test for a difference in means between your normal draws and 1000 samples from the Student’s t-distribution with 10 degrees of freedom and a delta value of 9. Report on your p-value and its significance at the 5% level.
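A sketch of Exercises 3 and 4, interpreting the delta value of 9 as the noncentrality parameter of the t-distribution:

```r
set.seed(123)
norm_draws <- rnorm(1000, mean = 10, sd = 2)
t_draws <- rt(1000, df = 10, ncp = 9)   # delta taken as the noncentrality parameter

# Two-sided two-sample t-test for a difference in means
t.test(norm_draws, t_draws)

# Exercise 4: one-sided alternative (normal mean greater than t mean)
t.test(norm_draws, t_draws, alternative = "greater")
```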

**Exercise 4**

Rewrite your t-test, testing now if your normal samples have a greater mean than your samples from the Student’s t-distribution. Report on your new p-value.

**Exercise 5**

Putting these skills together now, calculate a two-sided t-test of equal means from two normal distributions: the first with mean 1 and standard deviation 0.5, the second with mean 0.9 and standard deviation 1. Hint: a function may become useful here.

**Exercise 6**

Replicate this t-test 1000 times and, using a QQ plot, test whether the resulting p-values follow a standard uniform distribution.

**Exercise 7**

Moving onto other distributions now. The probability of rolling a score greater than 3 on a loaded die is 0.75. What is the probability of rolling exactly 5 scores greater than 3 from 10 rolls of the die?

**Exercise 8**

Building on this, what is the probability of rolling a score greater than 3, less than 5 times, from 10 rolls?
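Exercises 7 and 8 map directly onto the binomial mass and distribution functions:

```r
# Exercise 7: P(exactly 5 scores greater than 3 in 10 rolls), p = 0.75
dbinom(5, size = 10, prob = 0.75)   # ≈ 0.0584

# Exercise 8: P(fewer than 5 such scores) = P(X <= 4)
pbinom(4, size = 10, prob = 0.75)
```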

**Exercise 9**

On average, a cashier serves 50 people per hour in their shop. What is the probability that they serve 60 people or more during one hour?

**Exercise 10**

For the same cashier, each transaction takes 50 seconds on average. What is the probability of a transaction being completed in less than 30 seconds?
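Exercises 9 and 10 translate to the Poisson and exponential distribution functions:

```r
# Exercise 9: arrivals ~ Poisson with lambda = 50 per hour
ppois(59, lambda = 50, lower.tail = FALSE)   # P(X >= 60)

# Exercise 10: service time ~ exponential with mean 50 seconds (rate = 1/50)
pexp(30, rate = 1/50)                        # P(T < 30) ≈ 0.451
```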

A linear model is an explanation of how a continuous response variable behaves, dependent on a set of covariates or explanatory variables. Whilst often insufficient to explain complex problems, linear models do exercise underlying skills, such as variable selection and diagnostic examination, and are therefore a worthwhile introduction to statistical regression techniques.

In this tutorial, we’ll be creating a couple of linear models and comparing the performance of them on the Boston Housing dataset. This tutorial will require caret and mlbench to be installed and you may find ggplot2 and dplyr useful too, though these are not essential.

Solutions to these exercises can be found here.

**Exercise 1**

Load the Boston Housing dataset from the mlbench library and inspect the different types of variables present.

**Exercise 2**

Explore and visualize the distribution of our target variable.

**Exercise 3**

Explore and visualize any potential correlations between medv and the variables crim, rm, age, rad, tax and lstat.

**Exercise 4**

Set a seed of 123 and split your data into a train and test set using a 75/25 split. You may find the caret library helpful here.
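A sketch of the split using caret's `createDataPartition()`, which stratifies on the target variable:

```r
library(caret)
library(mlbench)

data(BostonHousing)

set.seed(123)
# 75/25 split, stratified on the target variable medv
idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[idx, ]
test  <- BostonHousing[-idx, ]

dim(train); dim(test)
```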

**Exercise 5**

We have seen that crim, rm, tax, and lstat could be good predictors of medv. To get the ball rolling, let us fit a linear model for these terms.

**Exercise 6**

Obtain an r-squared value for your model and examine the diagnostic plots found by plotting your linear model.
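A minimal sketch of Exercises 5 and 6; it fits on the full data for brevity, whereas you should use your training split:

```r
library(mlbench)

data(BostonHousing)

# Linear model for the candidate predictors
fit <- lm(medv ~ crim + rm + tax + lstat, data = BostonHousing)

summary(fit)$r.squared   # r-squared (Exercise 6)

# Residual, QQ, scale-location and leverage diagnostic plots
par(mfrow = c(2, 2))
plot(fit)
```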

**Exercise 7**

We can see a few problems with our model immediately, with observations such as 381 exhibiting high leverage, a poor QQ plot in the tails, and a relatively low r-squared value.

Let us try another model, this time transforming MEDV due to the positive skewness it exhibited.

**Exercise 8**

Examine the diagnostics for the model. What do you conclude? Is this an improvement on the first model?

One assumption of a linear model is that the mean of the residuals is zero. You could try and test this.

**Exercise 9**

Create a data frame of your predicted values and the original values.

**Exercise 10**

Plot this to visualize the performance of your model.

This is the last exercise set in the series on H2O’s machine learning algorithms. Please do the sets in sequence. This set requires some additional data; links are provided, so please download the files when needed.

Answers to the exercises are available here. Please check the documentation before starting this exercise set.

For other parts of this series please follow the tag: h2o

**Exercise 1**

Load the bank data from the previous exercise. You can see that the count of class 0 in the response variable is much higher than that of class 1. H2O provides a balance_classes parameter in its classification algorithms. Create a naive Bayes model with balance_classes = TRUE and check its performance by comparing it with the base naive Bayes classifier. Does it improve performance?
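A hedged sketch of the comparison; `x_vars`, `train`, and `valid` are placeholders for the predictor names and H2O frames built in the previous exercise set:

```r
library(h2o)
h2o.init()

# 'x_vars', 'train' and 'valid' are assumed from the previous exercise set,
# with the response in column "y"
nb_base <- h2o.naiveBayes(x = x_vars, y = "y", training_frame = train)
nb_bal  <- h2o.naiveBayes(x = x_vars, y = "y", training_frame = train,
                          balance_classes = TRUE)

# Compare performance on the validation frame
h2o.auc(h2o.performance(nb_base, valid))
h2o.auc(h2o.performance(nb_bal, valid))
```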

**Exercise 2**

You can see that this does not improve the score, but ideally it should. What went wrong? Although we have instructed the model to use balance_classes, we have not specified the sampling ratios for the “No” and “Yes” classes, so H2O did not sample them properly. We can specify these with the class_sampling_factors parameter. Try undersampling the “No” class, and also try oversampling it. Do you see a change in the error rate?

**Exercise 3**

Create a grid for gbm with the hyperparameters nbins_cats and learn_rate. The distribution should be Bernoulli.

**Exercise 4**

Find the best model and check its performance in the test set.

**Exercise 5**

Next, we will create a random forest classifier. Create a base random forest classifier and check the AUC score on the validation and test data sets.

**Exercise 6**

Create a grid search for a random forest classifier. Check the documentation or the first series of H2O ml algorithms for parameters that can be used in the grid search.

As always, find the best model and check the performance.

**Exercise 7**

Create an ensemble model for the bank data where the base models are a random forest model and a gbm model. Create the ensembles and check the performance.

**Exercise 8**

For the next exercise set, we will use prostate data. Download it from here.

Create a K-means model and find the clusters.

**Exercise 9**

In the K-means model, try setting the init parameter to “Furthest” and “PlusPlus”, respectively. Find the standardized centers from the model for the explanatory variables.

**Exercise 10**

Next, you will see why principal components are used in ML. Download the data from here. Load the data set and check the first two principal components. The summary should give you an intuitive idea that the first two PCs explain 98% of the variance.


We previously discussed power analysis in a tutorial. R has several tools available to help us perform a power analysis.

Answers to the exercises are available here.

**Exercise 1**

Install and then load the *pwr* package. This package contains functions for Power Analysis.

**Exercise 2**

Use `pwr.t.test()` to calculate the minimum number of samples required for an effect size (*d*) of 0.3, a significance level of 0.05, and a power of 0.8 for a two-sample test.

**Exercise 3**

Again, use `pwr.t.test()`, but now calculate the effect size that could be detected. Use 50 samples, a significance level of 0.05, and power of 0.8.

**Exercise 4**

Again use `pwr.t.test()`, but now calculate the power given 50 samples, an effect size (*d*) of 0.3, and a significance level of 0.05.

**Exercise 5**

Examine the effect of sample size on statistical power. Use `pwr.t.test()` to calculate the statistical power of different numbers of samples. Then, plot statistical power versus sample size. Use an effect size of 0.3 and a significance level of 0.05.
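The power curve can be sketched as follows:

```r
library(pwr)

# Power of a two-sample t-test as a function of per-group sample size,
# with d = 0.3 and alpha = 0.05
ns <- seq(10, 400, by = 10)
powers <- sapply(ns, function(n)
  pwr.t.test(n = n, d = 0.3, sig.level = 0.05, type = "two.sample")$power)

plot(ns, powers, type = "l",
     xlab = "Sample size per group", ylab = "Statistical power")
abline(h = 0.8, lty = 2)   # conventional 0.8 power threshold
```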

**Exercise 6**

Similar to the previous problem, plot statistical power versus the effect size. Use 100 samples and a significance threshold of 0.05.

**Exercise 7**

The *pwr* package also includes other commands, including one for ANOVAs, `pwr.anova.test()`. Use this command to calculate the power for an experiment with 6 groups, a sample size of 50 per group, a significance level of 0.05, and an effect size of 0.2.

**Exercise 8**

Load the PlantGrowth data that is built into R. This data consists of plant weights for a control and two different treatment groups. There are 10 replicates per group, for a total of 30 samples. Calculate the effect size (Cohen’s *f*) of this data using the `anova()` command and the formula:

f = √( Σ pᵢ (μᵢ − μ)² / σ² )

Where *p _{i}* is the fraction of the total samples in group *i*, *μ _{i}* is the mean of group *i*, *μ* is the grand mean, and *σ²* is the within-group (error) variance.

**Exercise 9**

Calculate the statistical power obtained using an ANOVA for the PlantGrowth data. Use the effect size calculated in the last problem, along with the number of samples, number of groups, and a significance threshold of 0.05.

**Exercise 10**

Explore other commands found in the *pwr* package.

The second part of this series focuses on more complex and insightful methods through the semi-parametric Cox Proportional Hazards model. Through a Cox Proportional Hazards model, it is possible to model covariates in a semi-parametric fashion. The advantage of this modeling strategy is that it makes modeling the survival times possible, without knowing or specifying the underlying distribution.

Solutions to these exercises can be found here. It should be noted that these solutions are quite verbose with the intention of breaking down each stage in the modeling process. In real life, you will probably want to use lapply and functions to speed up the variable selection stage.

**Exercise 1**

Before any modeling can commence, let us just test a few variables to get a feel for their effects on survival times. Create survival objects for sex, ph.karno, and wt.loss. Hint: You’ll need to group wt.loss.

**Exercise 2**

Plot these using survminer to look for differences in the groups’ survival curves.

**Exercise 3**

To check that these three variables adhere to the proportional hazards assumption, we must plot a log-cumulative hazard plot for each of them. Within each plot, we are looking for “roughly” parallel lines between each group. If this criterion is met, then the variables adhere to the PH assumption and can then be modeled through a Cox PH model.

**Exercise 4**

With no huge departures from parallelism, we can begin model building. First, fit a model with just time and status. This will be known as the null model.

**Exercise 5**

Calculate the -2*log-likelihood for the null model. This will be our baseline comparison value when fitting terms.

**Exercise 6**

Now fit a model with sex, ph.karno, and wt.loss individually and obtain -2*log-likelihood values for these models.

**Exercise 7**

Conduct a Chi-Squared test now on the change in -2*log-likelihood values for each of the single term models, ensuring you have the correct degrees of freedom specified.
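For example, the sex term can be tested against the null model like this (Exercises 4–7 condensed into one sketch):

```r
library(survival)

null_fit <- coxph(Surv(time, status) ~ 1, data = lung)    # null model
sex_fit  <- coxph(Surv(time, status) ~ sex, data = lung)  # single-term model

# Change in -2*log-likelihood between the null and single-term model
dev_change <- -2 * (null_fit$loglik - sex_fit$loglik[2])

# One extra parameter fitted, so one degree of freedom
pchisq(dev_change, df = 1, lower.tail = FALSE)
```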

**Exercise 8**

Every term being significant indicates that each is needed in the model, as it introduces a significant amount of information. The next step is to fit the terms in the presence of each other. To do this, fit a saturated model and remove each term in turn to test for a significant change in -2*log-likelihood values.

**Exercise 9**

With every term being significant, one last thing to check is for any interactions. Feel free to try a few and test for a significant change in deviance compared to the saturated model.

**Exercise 10**

Write out your final Cox Proportional Hazards model.

In this exercise set, we will continue our journey with H2O’s machine learning algorithms. We will also explore Gradient Boosted Machines and classifiers such as naive Bayes. In the next part of the series, we will conclude the machine learning journey with H2O.

Answers to the exercises are available here. Please check the documentation before starting this exercise set.

**Exercise 1**

We are working with the same data as before. Please load the energy efficiency data set and initialize an H2O cluster. Create the explanatory variables and response variable, as in the previous exercise set. Then create H2O’s default gradient boosted machine model.

**Exercise 2**

Check the variable importance and plot the importance metric using H2O.

**Exercise 3**

Find the model performance in the test data.

**Exercise 4**

Create a grid of models. The hyperparameter should be max_depth in gbm. A max_depth of 10 should be enough for most data sets, so you can create a grid from 1 to 10.

Play with the gbm parameters to get the intuition behind them.

**Exercise 5**

Sort the grid by the error metric and compare the values with the base line model.

**Exercise 6**

Find the best model from the grid and check the performance on the test data.

Now we will look at how classification works in GBM with the help of UCI’s bank marketing data set. Download the data set from the UCI Machine Learning Repository.

**Exercise 7**

As always, create the train, test, and validation frames, using 60% of the data for training, 20% for validation, and 20% for test. Create a 5-fold cross-validated naive Bayes classifier.

**Exercise 8**

Find the AUC score on the test and validation data from the naive Bayes model.

**Exercise 9**

Create a default gbm model with the training data set. Find the variable importance via `h2o.varimp()`. Try adding more variables to the default gbm model and check that you have not left out any important variables.

**Exercise 10**

Check the performance of the default gbm model with the AUC score on both the validation and test data.


Statistical power is the probability of detecting a significant trend, given that a significant trend actually exists. For example, a statistical power of 0.8 means there is a 0.8 probability of detecting a significant trend. Equivalently, a statistical power of 0.8 implies there is a 0.2 probability that you fail to detect a significant trend; in other words, a false negative. Statistical power is intricately linked to three other quantities: statistical significance, effect size, and sample size. The threshold for statistical significance, or α, is the probability of a false positive. Therefore, an α of 0.05 means there is a 0.05 probability that an experiment shows a significant trend, even though no real trend exists. Effect size is the strength of the trend. And lastly, the sample size is the number of samples in an experiment.

Thankfully for us, the *pwr* package in R has a number of tools to conduct a power analysis for common experimental designs.

```r
# install.packages('pwr', dependencies = TRUE)
require(pwr)
```

Suppose we are planning an experiment where we have both a control and experimental group. We determined that a t-test is an appropriate test to compare differences between the two groups. We want to determine how many samples will be necessary to be confident in our results. We use the `pwr.t.test()` command in R. This command has four important arguments: sample size (*n*), effect size (*d*), significance level (*sig.level*), and statistical power (*power*). We have to specify three of the four arguments to obtain the fourth. For example, let us imagine that we want a significance level of 0.05 and a statistical power of 0.8. In addition, we anticipate an effect size (*d*) of 0.3. These values will be dependent on your particular experiment.

```r
pwr.t.test(n = , d = 0.3, sig.level = 0.05, power = 0.8, type = "two.sample")
```

```
## 
##      Two-sample t test power calculation 
## 
##               n = 175.3847
##               d = 0.3
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
```

We see that for our specified experimental setup, we need a sample size (*n*) of at least 176 per group (rounding 175.38 up).

We could also determine the statistical power given a set sample size of 90.

```r
pwr.t.test(n = 90, d = 0.3, sig.level = 0.05, power = , type = "two.sample")
```

```
## 
##      Two-sample t test power calculation 
## 
##               n = 90
##               d = 0.3
##       sig.level = 0.05
##           power = 0.5166416
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
```

We see that with a sample size of 90 and effect size of 0.3, we only have a statistical power of 0.51. Therefore, given there is a significant difference between our two groups, we have about a coin flip’s probability of detecting that difference. This is not a well-designed experiment.

The *pwr* package also allows power calculations for ANOVAs, chi-squared tests, general linear models, and differences of proportions.


Using R’s survival library, it is possible to conduct very in-depth survival analyses with a huge amount of flexibility and scope of analysis. The only downside to conducting this analysis in R is that the graphics can look very basic, which, whilst fine for a journal article, does not lend itself too well to presentations and posters. Step in the brilliant survminer package, which combines the excellent analytical scope of R with the beautiful graphics of ggplot2.

In this tutorial, we will use both the survival library and the survminer library to produce Kaplan-Meier plots and analyze log-rank tests. Solutions are available here.

**Exercise 1**

Load the lung data set from the survival library and re-factor the status column as a factor.

**Exercise 2**

Calculate the percentage of censored observations.
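In the lung data, a status value of 1 marks a censored observation, so the percentage follows directly:

```r
library(survival)

# status: 1 = censored, 2 = dead in the lung data set
mean(lung$status == 1) * 100   # percentage of censored observations
```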

**Exercise 3**

Create a basic survival object exploring the occurrence of events.

**Exercise 4**

Print this object and plot it to graphically investigate this.

**Exercise 5**

Now install and load the survminer library and plot your survival object as a ggplot2 graphic.

**Exercise 6**

Create a new survival object, stratifying the survival times now by gender.

**Exercise 7**

Plot this new gender-stratified survival object and comment on your observations.

**Exercise 8**

Form a set of hypotheses to formally test survival times between males and females.

**Exercise 9**

Compute a log-rank test and report on the p-value calculated, in terms of the previously formed hypothesis.
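The log-rank test itself is a one-liner with `survdiff()`:

```r
library(survival)

# Log-rank test of equal survival between males and females
survdiff(Surv(time, status) ~ sex, data = lung)
```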

**Exercise 10**

Investigate how the daily standard of one’s life affects their survival times. How many patients scored a 3 and what could be done with those individuals scoring 3?