Beneath and Beyond the Cox Model
Posted on September 5, 2022 by R Views in R bloggers | 0 Comments
[This article was first published on R Views , and kindly contributed to R-bloggers ]. (You can report issue about the content on this page here )
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Share Tweet
The Cox Proportional Hazards model has so dominated survival analysis over the past forty years that I imagine quite a few people who regularly analyze survival data might assume that the Cox model, along with the Kaplan-Meier estimator and a few standard parametric models, encompass just about everything there is to say about the subject. It would not be surprising if this were true because it is certainly the case that these tools have dominated the teaching of survival analysis. Very few introductory textbooks look beyond the Cox Model and the handful of parametric models built around Gompertz, Weibull and logistic functions. But why do Cox models work so well? What is the underlying theory? How do all the pieces of the standard survival tool kit fit together?
As it turns out, Kaplan-Meier estimators, the Cox Proportional Hazards model, Aalen-Johansen estimators, parametric models, multistate models, competing risk models, frailty models and almost every other survival analysis technique implemented in the vast array of R packages comprising the CRAN Survival Task View , are supported by an elegant mathematical theory that formulates time-to-event analyses as stochastic counting models. The theory is about thirty years old. It was initiated by Odd Aalen in his 1975 Berkeley PhD dissertation, developed over the following twenty years largely by Scandinavian statisticians and their collaborators, and set down in more or less complete form in two complementary textbooks [5 and 9] by 1993. Unfortunately, because of its dependency on measure theory, martingales, stochastic integrals and other notions from advanced mathematics, it does not appear that the counting process theory of survival analysis has filtered down in a form that is readily accessible by practitioners.
In this rest of this post, I would like to suggest a path for getting a working knowledge of this theory by introducing two very readable papers, which taken together, provide an excellent overview of the relationship of counting processes to some familiar aspects of survival analysis.
The first paper, The History of applications of martingales in survival analysis by Aalen, Andersen, Borgan, Gill, and Keiding [4] is a beautiful historical exposition of the counting process theory by master statisticians who developed a good bit of the theory themselves. Read through this paper in an hour of so and you will have an overview of the theory, see elementary explanations for some of the mathematics involved, and gain a working idea of how the major pieces of the theory fit together how they came together.
The second paper, Who needs the Cox model anyway? [7], is actually a teaching note put together by Bendix Carstensen. It is a lesson with an attitude and the R code to back it up. Carstensen demonstrates the equivalence of the Cox model to a particular Poisson regression model. Working through this note is like seeing a magic trick and then learning how it works.
The following reproduces a portion of Carstensen’s note. I provide some commentary and fill in a few elementary details in the hope that I can persuade you that it is worth the trouble to spend some time with it yourself.
Carstensen use the North Central Cancer Treatment Group lung cancer survival data set which is included in the survival package for his examples.
## # A tibble: 228 × 10 ## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss ##
## 1 3 306 2 74 1 1 90 100 1175 NA ## 2 3 455 2 68 1 0 90 90 1225 15 ## 3 3 1010 1 56 1 0 90 90 NA 15 ## 4 5 210 2 57 1 1 90 60 1150 11 ## 5 1 883 2 60 1 0 100 90 NA 0 ## 6 12 1022 1 74 1 1 50 80 513 0 ## 7 7 310 2 68 2 2 70 60 384 10 ## 8 11 361 2 71 2 2 60 80 538 1 ## 9 1 218 2 53 1 1 70 80 825 16 ## 10 7 166 2 61 1 2 70 70 271 34 ## # … with 218 more rows
It may not be obvious at first because there is no subject ID column, but this data frame contains one row for each of 228 subjects. The first column is an institution code. time is the time of death or censoring. status is the censoring indicator. The remaining columns are covariates. I select time, status, sex and age, drop the others from the our working data frame, and then replicate Carstensen’s preprocessing in a tidy way. The second line of mutate() adds a small number to each event time to avoid ties.
set.seed(1952) lung % select(time, status, age, sex) %>% mutate(sex = factor(sex,labels=c("M","F")), time = time + round(runif(nrow(lung),-3,3),2))
To get a feel for the data we fit a Kaplan-Meier Curve stratified by sex.
surv.obj