Generating clustered data with marginal correlations
Posted on November 21, 2022 by Keith Goldfeld in R bloggers | 0 Comments
[This article was first published on ouR data generation, and kindly contributed to R-bloggers.]
A student is working on a project to derive an analytic solution to the problem of sample size determination in the context of cluster randomized trials and repeated individual-level measurement (something I’ve thought a little bit about before). Though the goal is an analytic solution, we do want confirmation with simulation. So, I was a little disheartened to discover that the routines I’d developed in simstudy for this were not quite up to the task. I’ve had to fix that quickly, and the updates are available in the development version of simstudy, which can be downloaded using devtools::install_github("kgoldfeld/simstudy"). While some of the changes are under the hood, I have added a new function, genBlockMat, which I’ll describe here.
Correlation in cluster randomized trials
The fundamental issue with cluster randomized trials is that outcomes for a group of patients in a specific cluster are possibly correlated; the degree to which this is true affects how much we “learn” from each individual. The more highly correlated individuals are, the less information we actually have. (In the extreme case of perfect correlation, we really only have a sample of one from each cluster.)
When generating data and modeling associations, the structure of the correlation needs to reflect the context of the study design. The specific structure can depend on whether outcomes generally vary over time (so that patient outcomes within a cluster closer temporally might be more highly correlated than outcomes collected from patients far apart in time) and whether measurements are collected for the same individuals over time (you might expect the measurements of the same individual to be more highly correlated than measurements of two different individuals).
There are at least two ways to go about simulating correlated data from a cluster randomized trial. The first is to use a random effect to induce correlation. For example, a simple data generating process for a binary outcome with a treatment indicator and one covariate would start with a formulation like this:
\[ P(Y_{ij} = 1) = \pi_{ij}, \ \ \ Y_{ij} \in \{0,1\}\] \[ \log \left( \frac{\pi_{ij}}{1-\pi_{ij}} \right) = \beta_0 + \beta_1 A_j + \beta_2 X_i + b_j \]
where \(Y_{ij}\) is the outcome for individual \(i\) in cluster \(j\). (\(A\) is a treatment indicator and \(X\) is a covariate.) The key here is \(b_j\), a cluster-level effect that is typically assumed to have a normal distribution \(N(0, \sigma_b^2)\). In a simulation, we would use specific parameter values to generate a probability \(\pi_{ij}\) for each individual; the \(\pi_{ij}\)’s within a cluster would be correlated by the presence of the shared cluster effect \(b_j\). It follows that the \(Y_{ij}\)’s would also be correlated within cluster \(j\). We can call this the conditional data generation process, and we could use a mixed-effects regression model to recover the parameters. But we won’t do this here.
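For reference, the conditional approach might be sketched in simstudy like this; the coefficients, the random-effect variance, the number of clusters, and the cluster sizes below are all arbitrary choices for illustration:

```r
library(simstudy)

# cluster-level definitions: random effect b_j, treatment assignment, cluster size
def <- defData(varname = "b", formula = 0, variance = 0.5, id = "site")  # b_j ~ N(0, 0.5)
def <- defData(def, varname = "A", formula = "1;1", dist = "trtAssign")  # balanced assignment
def <- defData(def, varname = "nInd", formula = 9, dist = "nonrandom")   # 9 individuals/cluster

# individual-level definitions: covariate X and binary outcome on the logit scale,
# with the shared cluster effect b inducing within-cluster correlation
defI <- defDataAdd(varname = "X", formula = 0, variance = 1)
defI <- defDataAdd(defI, varname = "y",
  formula = "-1 + 0.8 * A + 0.4 * X + b", dist = "binary", link = "logit")

set.seed(1234)
ds <- genData(20, def)                       # 20 clusters
dd <- genCluster(ds, "site", "nInd", "id")   # individuals nested within clusters
dd <- addColumns(defI, dd)
```

A mixed-effects model fit to `dd` would target the conditional parameters used above.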
Instead, we can dispose of \(b_j\), like this:
\[ \log \left( \frac{\pi_{ij}}{1-\pi_{ij}} \right) = \beta_0 + \beta_1 A_j + \beta_2 X_i \]
As before, we would generate the \(\pi_{ij}\)’s, but the probabilities are now uncorrelated (except, of course, for the correlation due to randomization assignment, but that operates across clusters). The within-cluster correlation is introduced directly into the \(Y_{ij}\)’s using a multivariate data generation process. If we were in the realm of normally distributed outcomes, we would use a multivariate normal data generating process \(MVN(\mathbf{\mu}, \Sigma)\), where \(\Sigma\) is a covariance matrix. (This could be done in simstudy using genCorData or addCorData.) In this case, with a binary outcome, we need an analogous approach, which is implemented in the simstudy functions genCorGen and addCorGen. To recover the parameters used to generate these data, a generalized estimating equations (GEE) model would be used; rather than being conditional, the parameter estimates from this model will be marginal, just as the data generation process was.
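As a concrete illustration of the marginal approach (the probability and correlation values here are arbitrary), correlated binary outcomes with a simple exchangeable within-cluster correlation can be generated directly with genCorGen:

```r
library(simstudy)

# marginal data generation: binary outcomes with marginal probability 0.35 and
# an exchangeable ("cs") within-cluster correlation of 0.3 -- values are illustrative
set.seed(1234)
dd <- genCorGen(n = 20, nvars = 9, params1 = 0.35, dist = "binary",
                rho = 0.3, corstr = "cs", wide = FALSE)
```

Each of the 20 clusters contributes 9 correlated binary observations in long format; a GEE model with an exchangeable working correlation would recover the marginal probability and the correlation.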
Generating data – multiple time periods, single individual measurement
OK – that is a bit more background than I intended (though probably not enough). Now onto the new function and simulations.
In the first example here, the outcomes are measured at three different periods, but an individual in a cluster is measured only once. In other words, the time periods include different sets of individuals.
If we have 3 time periods and 3 individuals in each time period, the within-cluster correlation between two individuals in the same time period is \(\alpha_1\), the correlation between individuals in adjacent time periods (periods 1 & 2 and periods 2 & 3) is \(\alpha_2\), and the correlation between individuals in time periods 1 and 3 is \(\alpha_3\). The correlation structure for a cluster can be represented like this, with each period shown in \(3 \times 3\) sub-blocks:
\[ \mathbf{R} = \left( \begin{matrix} 1 & \alpha_1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_2 & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & 1 & \alpha_1 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & \alpha_1 & 1 \end{matrix} \right ) \]
The overall correlation matrix for the full data set (assuming 5 clusters) is represented by block matrix \(\textbf{B}\) with
\[ \mathbf{B} = \left( \begin{matrix} \mathbf{R} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{R} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{R} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{R} \\ \end{matrix} \right ) \]
where \(\mathbf{0}\) is a \(9 \times 9\) matrix of \(0\)’s.
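The structure of \(\mathbf{R}\) and \(\mathbf{B}\) can also be constructed directly in base R, since the correlation between two individuals depends only on the gap between their time periods; this serves as a useful check:

```r
# build the 9 x 9 block R: correlation is alpha_1 within a period,
# alpha_2 for adjacent periods, alpha_3 for periods two apart
alpha <- c(0.3, 0.2, 0.1)
nPeriods <- 3
nInds <- 3

period <- rep(1:nPeriods, each = nInds)   # period membership of each individual
R <- outer(period, period, function(i, j) alpha[abs(i - j) + 1])
diag(R) <- 1                              # correlation of an individual with itself

# full correlation matrix for 5 clusters is block diagonal
B <- kronecker(diag(5), R)
```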
The new function genBlockMat enables us to generate the \(\mathbf{R}\) blocks (though it currently requires that the number of individuals per period per cluster is constant; I will relax that requirement in the future). Here are a couple of examples. In the first, we fix \(\alpha_1 = 0.3\), \(\alpha_2 = 0.2\), and \(\alpha_3 = 0.1\):
```r
library(simstudy)
library(data.table)

# generate the correlation block R with alpha_1 = 0.3, alpha_2 = 0.2, alpha_3 = 0.1
# (argument names here are assumed from the development version of simstudy)
R <- genBlockMat(rho = c(0.3, 0.2, 0.1), nInds = 3, nPeriods = 3)
R
```
