Increasingly, academic scholars, data scientists, and quantitative researchers are turning to GitHub for collaboration and to share data, code, and results. GitHub allows people to host public and private “repositories” that allow for the easy communication of research procedures and results. Underlying the GitHub architecture is the version control system, git, which provides further benefits to researchers.
One of the things we hope to do at Code Horizons is help steer you toward the best tools to meet your needs. Here’s a guest post by Noah Greifer, a postdoctoral fellow at Johns Hopkins and the developer of WeightIt and cobalt. Noah has just completed a massive overhaul of the workhorse MatchIt package for matching in R. This post will introduce you to some of its exciting new features. You can learn how to use all of these packages in Matching and Weighting for Causal Inference with Using R, offered by Statistical Horizons.
In two earlier posts on this blog (here and here), my colleague Paul von Hippel made a strong case for using OLS linear regression instead of logistic regression for binary dependent variables. When you do that, you are implicitly estimating what’s known as a linear probability model (LPM), which says that the probability of some event is a linear function of a set of predictors.
Von Hippel had four major arguments:
 OLS regression is much faster than logistic regression, and that can be a big advantage in many applications, particularly when the data are big, the model is complex, and the model needs to be varied or updated frequently.
 People understand changes in probabilities much better than they understand odds ratios.
 Within the range of .20 to .80 for the predicted probabilities, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, OLS regression may do well if the range is narrow.
 It’s not uncommon for logistic regression to break down completely because of “quasicomplete separation” in which maximum likelihood estimates simply don’t exist. This never happens with the LPM.
Although I agreed with all of von Hippel’s points, I responded with two more blog posts (here and here) in which I gave several reasons why logistic regression was still preferable to the LPM. One of them was that LPMs frequently yield predicted probabilities that are greater than 1 or less than 0, and that obviously can’t be right.
For applications that focus on the regression coefficients, outofbounds predictions may not be a big problem. But in many other settings, valid predicted probabilities are essential. These include inverse probability weighting, propensity score methods, and expected loss calculations. Many authors have pointed to invalid predicted probabilities as the principal disadvantage of LPMs (e.g., Westin 1973, Long 1997, Hellevik 2007, Wooldridge 2010, Greene 2017).
Well, I’m happy to tell you that there’s an easy and effective fix for that problem. And it’s been right under our noses ever since Fisher (1936) identified the key elements of a solution. Here’s a quick summary of how the method works:
 The coefficient estimates from OLS with a binary outcome can be transformed into maximum likelihood estimates of the parameters of a “linear discriminant model”.
 The linear discriminant model (LDM) implies a logistic regression model for the dependence of the outcome on the predictors.
 To get valid predicted probabilities, just plug the values of the predictors into that logistic regression model using the transformed parameter estimates.
In the remainder of this post, I’ll flesh out the details and then give two examples.
What is the linear discriminant model?
The linear discriminant model was first proposed by Fisher as a method for classifying units into one or the other of two categories, based on a linear function of the predictor variables. Let x be the vector of predictors, and let y have values of 1 or 0. The LDM assumes that, within each category of y, x has a multivariate normal distribution, with different mean vectors for each category, u_{1} and u_{0}, but a covariance matrix S that is the same for both categories.
That’s it. That’s the entire model. There are two remarkable facts about this model:
Fact 1. The model specifies the conditional distribution x given the value of y. However, using Bayes’ theorem, it can be reexpressed to give the conditional probability of each value of y, given a value of x. What’s remarkable is that this yields a logistic regression model. Specifically,
log[P(y=1x)/Pr(y=0x)] = a + b’x,
where a and b are functions of the two mean vectors and the covariance matrix.
Fact 2. The other remarkable fact is that maximum likelihood estimates of a and b can be obtained (indirectly) by ordinary least squares regression of y on x. This was originally noted by Fisher but put into more usable form by Haggstrom (1983). To estimate b, the OLS slope coefficients must each be multiplied by K = N / RSS where N is the sample size and RSS is the residual sum of squares. K is typically substantially larger than 1.
The intercept a is obtained as follows. Let m be the sample mean of y and let c be the intercept in the OLS regression. Then, the ML estimate of a is
log[m/(1m)] + K(c.5) + .5[1/m – 1/(1m)]
Getting predicted probabilities
These two facts suggest the following procedure for getting predicted probabilities from a linear probability model.
 Estimate the LPM by OLS.
 Transform the parameters as described in Fact 2.
 Generate predicted probabilities by using the logistic equation in Fact 1.
I call this the LDM method. To make it easy, I have written software tools for three different statistical packages. All tools are named predict_ldm:
A SAS macro available here.
A Stata ado file available here (coauthored with Richard Williams).
An R function shown below in Appendix 3 (coauthored with Stephen Vaisey).
Evaluating the results
The LDM method will absolutely give you predicted probabilities that are always within the (0,1) interval. Aside from that, are they any good? In a moment, I’ll present two examples that suggest that they are very good indeed. However, a major reason for concern is that the LDM assumes multivariate normality of the predictors, but there are very few applications in which all the predictors will meet that assumption, or even come close.
Several studies have suggested that the LDM is pretty robust to violations of the multivariate normality assumption (Press and Wilson 1978, Wilensky and Rossiter 1978, Chatla and Shmueli 2017), but none of those investigations was very systematic. Moreover, they focused on coefficients and test statistics, not on predicted probabilities.
I have applied the method to many data sets, and all but two have shown excellent performance of the LDM method. How did I evaluate the method? I first fit the linear model and applied the LDM method to get predicted probabilities. Then I fit a logistic model using the standard iterative ML method, which makes no assumptions about the distribution of predictors. I then compared the predicted probabilities from LDM and conventional logistic regression in several ways.
I figured that conventional logit should be the gold standard. We can’t expect the LDM method to do any better than conventional logit because both rely on the same underlying model for y, except that LDM makes additional assumptions about the predictor variables.
Example 1. Married women’s labor force participation
For this example, I used a wellknown data set on labor force participation of 753 married women (Mroz 1987). The binary dependent variable has a value of 1 if the woman was currently in the labor force (478 women), otherwise 0. Independent variables are number of children under the age of six, age, education (in years), and years of labor force experience, as well as the square of experience. In the Appendices, you’ll find code for this example (and the next) using Stata, SAS and R. The output shown here was produced by Stata.
The table below gives summary statistics for the three different methods for getting predicted values. yhat contains predicted values from the linear model, yhat_ldm contains predicted values based on the LDM method, and yhat_logit contains predicted values from a conventional logistic regression.
Variable  Obs Mean Std. Dev. Min Max + yhat  753 .5683931 .2517531 .2782369 1.118993 yhat_ldm  753 .5745898 .2605278 .0136687 .9676127 yhat_logit  753 .5683931 .2548012 .0145444 .9651493
The linear predicted values have a maximum of 1.12 and a minimum of .28. In fact, 12 of the linear predicted values were greater than 1, and 14 were less than 0 (not shown in the table). By necessity, the predicted values for both LDM and logit were bounded by (0,1).
Next we’ll look at the correlations among the three sets of predicted values:
 yhat yhat_ldm yhat_logit + yhat  1.0000 yhat_ldm  0.9870 1.0000 yhat_logit  0.9880 0.9994 1.0000
Although the linear predictions are highly correlated with the logistic predictions, the correlation between the LDM predictions and the logistic predictions is even higher. Here’s the scatter plot for linear vs. logistic:
Notice that as the logistic predictions get close to 1, the linear predictions get larger than the logistic predictions. And, symmetrically, as the logistic predictions get close to 0, the linear predictions get smaller than the logistic predictions. This should alert us to the possibility that, even if none of the linear predicted values is outside the (0, 1) interval, the values at the two extremes may be biased in predictable ways.
By contrast, the plot below for LDM vs. logistic is very well calibrated.
For this example, I’d say that very little would be lost in using the LDM predicted values instead of predicted values from conventional logistic regression.
Example 2. AllCause mortality among lung cancer patients
After deleting cases with missing data, there were 1,029 patients in the data set. Death is the dependent variable, and 764 of the patients died. What’s different about this example is that all the predictors are categorical and, therefore, have a joint distribution that is nothing like a multivariate normal. The predictors are surgery (1 or 0), a randomized treatment (1 or 0), hospital (Mayo Clinic, Johns Hopkins, or Sloan Kettering), and cancer stage at diagnosis (1, 2, or 3). Both of the 3category variables are represented by two dummy (indicator) variables. The fitted models included main effects of these variables but no interactions.
Again, I ran two regressions, one linear and one logistic. The linear predictions were transformed by the LDM method. As the next table shows, there were definitely cases with linear predicted values greater than 1. In fact, there were 96 such cases.
Variable  Obs Mean Std. Dev. Min Max + yhat  1,029 .7424684 .2206033 .4116536 1.01567 yhat_ldm  1,029 .7299414 .251858 .303894 .9679288 yhat_logit  1,029 .7424684 .2247691 .3621077 .9674457
Here are the correlations among the three versions of the predicted values.
 yhat yhat_ldm yhat_logit + yhat  1.0000 yhat_ldm  0.9754 1.0000 yhat_logit  0.9815 0.9908 1.0000
Again, the LDM predictions do a better job than the linear predictions of matching the conventional logistic predictions.
Conclusion
The LDM method is a simple and effective way to transform predicted values from a linear probability model so that they do not fall outside the (0,1) interval. Although the method relies on strong assumptions about the distribution of the predictor variables, it nevertheless performed very well in 13 out of 15 data sets and models that I have examined, none of which had predictors that came close to multivariate normality.
What about the two exceptions? Well, I’m still studying them to get a better understanding of what went wrong, but I can tell you this: In one of them, the model included a 2way interaction that was highly significant in the linear model but not significant in the conventional logistic model. In the other, there were two predictors that were highly collinear. I am also designing a simulation study to explore the conditions under which the LDM method does or does not work well.
Is this a perfect method? Certainly not. I still prefer to use conventional logistic regression wherever computationally practical. But even with today’s enormous computing power, there are still many applications where linear models are quite attractive. And if you’re going to use an LPM and you want valid predicted probabilities, the LDM method seems like a pretty good way to get them.
References
Chatla, S. B., & Shmueli, G. (2017). An extensive examination of regression models with a binary outcome variable. Journal of the Association for Information Systems, 18(4), 1.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179188.
Greene, W.H. (2017) Econometric Analysis. Pearson.
Haggstrom, G. W. (1983). Logistic regression and discriminant analysis by ordinary least squares. Journal of Business & Economic Statistics, 1(3), 229238.
Hellevik, Ottar (2009): Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity 43.1 5974.
Long, J. S. (1997) Regression models for categorical and limited dependent variables. Sage Thousand Oaks.
Mroz, T.A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55: 765799.
Press, S. J., & Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, 73(364), 699705.
Westin, R. B. (1974). Predictions from binary choice models. Journal of Econometrics, 2(1), 116.
Wilensky, G. R., & Rossiter, L. F. (1978). OLS and logit estimation in a physician location study. In Proceedings of the Social Statistics Section, American Statistical Association (pp. 260265).
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT press.
Appendix 1: Stata code for the examples
Note: This program uses the predict_ldm command, which may be downloaded here.
/*Example 1. 753 married women. DV=1 if in the labor force, else 0.*/
use "https://statisticalhorizons.com/wpcontent/uploads/MROZ.dta",clear
reg inlf kidslt6 age educ exper expersq
predict_ldm
logit inlf kidslt6 age educ exper expersq
predict yhat_logit
sum yhat yhat_ldm yhat_logit
corr yhat yhat_ldm yhat_logit
scatter yhat yhat_logit, xtitle(Logistic Predicted Probabilities) ///
ytitle(Fitted Values from Linear Model)
scatter yhat_ldm yhat_logit, xtitle(Logistic Predicted Probabilities) ///
ytitle(LDM Predicted Probabilities)
/*Example 2. 1029 lung cancer patients. DV is 1 if patient died, else 0. */
use "https://statisticalhorizons.com/wpcontent/uploads/lung.dta",clear
gen dead=surv!=0
reg dead operated treat i.instit i.stage
predict_ldm
logit dead operated treat i.instit i.stage
predict yhat_logit
sum yhat yhat_ldm yhat_logit
corr yhat yhat_ldm yhat_logit
scatter yhat yhat_logit
scatter yhat_ldm yhat_logit
Appendix 2: SAS code for the examples
Note: This program uses the predict_ldm macro, which can be downloaded here. The data sets may be dowloaded here and here.
/*Example 1. 753 married women. DV=1 if in the labor force, else 0.*/
%predict_ldm (data=my.mroz, y=inlf, x=kidslt6 age educ exper expersq);
proc logistic data=predict_ldm desc;
model inlf=kidslt6 age educ exper expersq;
output out=c pred=yhat_logit;
run;
proc corr data=c;
var yhat yhat_ldm yhat_logit;
run;
proc gplot;
plot yhat*yhat_logit;
plot yhat_ldm*yhat_logit;
run;
/*Example 2. 1029 lung cancer patients. DV is 1 if patient died, else 0. */
data lung;
set my.lung;
dead=surv ne 0;
hopkins=(instit=2);
mayo=(instit=1);
if stage=. then delete;
stage3 = (stage=3);
stage2 = (stage=2);
run;
%predict_ldm(data=lung, y=dead,
x=operated treat hopkins mayo stage3 stage2);
proc logistic data=predict_ldm desc;
model dead=operated treat hopkins mayo stage3 stage2;
output out=c pred=yhat_logit;
run;
proc corr data=c;
var yhat yhat_ldm yhat_logit;
run;
proc gplot data=c;
plot yhat*yhat_logit;
plot yhat_ldm*yhat_logit;
run;
Appendix 3: R code for the examples
# Function to convert linear predictions to the probability scale: predict_ldm < function(fit) { c < fit$coefficients[1] k < nrow( fit$model ) / deviance( fit ) m < mean( fit$model[,1] ) a < log(m/(1m) ) + k*( c.5 ) + .5*(1/m  1/(1m) ) lin_preds < predict(fit) my_preds_logit < k*(lin_predsc) + a my_preds < 1/(1 + exp(my_preds_logit)) names(my_preds) < "ldm_preds" return(my_preds) } # Example 1 library(haven) library(psych) mroz<read_dta("https://statisticalhorizons.com/wpcontent/uploads/MROZ.dta") mroz_fit < lm(inlf ~ kidslt6 + age + educ + exper + expersq, data = mroz) summary(mroz_fit) lin_preds < predict(mroz_fit) ldm_preds <predict_ldm(mroz_fit) logit_fit < glm(inlf ~ kidslt6 + age + educ + exper + expersq, family = binomial, data = mroz) summary(logit_fit) logit_preds < predict(logit_fit, type = "response") three_preds < cbind(lin_preds, ldm_preds, logit_preds) describe(three_preds, fast=T) cor(three_preds) plot(logit_preds,lin_preds) plot(logit_preds, ldm_preds) # Example 2 lung<read_dta("https://statisticalhorizons.com/wpcontent/uploads/lung.dta") lung$dead<(lung$surv!=0) lung$i.stage < as.factor(lung$stage) lung$i.instit < as.factor(lung$instit) lung_fit < lm(dead ~ treat+operated+i.stage+i.instit, data = lung) summary(lung_fit) lin_preds < predict(lung_fit) ldm_preds <predict_ldm(lung_fit) logit_fit < glm(dead ~ treat+operated+i.stage+i.instit, family=binomial, data = lung) summary(logit_fit) logit_preds < predict(logit_fit, type = "response") three_preds < cbind(lin_preds, ldm_preds, logit_preds) describe(three_preds, fast=T) cor(three_preds) plot(logit_preds,lin_preds) plot(logit_preds, ldm_preds)
We are thrilled to introduce a new addition to the Statistical Horizons family — Code Horizons. This new initiative emerges from a simple realization. When we do our research, most of the things we actually do are not things we learned in graduate school. These include writing scripts to parse textual data, creating effective visualizations, making reproducible reports so that others can use our code and replicate our results, creating HTML5 slideshows and websites, and learning new languages like R or Python. More than ever, proficiency at “coding” (broadly defined) has become an essential element of research productivity in science and industry.
Personally, these are skills that I have had to learn largely on my own. Unfortunately, this usually involved a ton of painful trial and error, false starts, and dead ends. Wouldn’t it be great if that pain could be reduced, or even eliminated?
Our goal at Code Horizons is to help you get up to speed quickly on these vital skills so you can keep your focus where it belongs — doing your research.
We are just getting started, and we plan to offer many courses in the coming months and years. For now, we will be launching with two new courses — Python for Data Analysis and Workflow of Data Analysis (featuring both R and Stata). We will also continue to offer such essential courses as Data Visualization, Text as Data, and Statistics with R. Our aim is to help you learn the skills you need in an efficient and affordable way in order to rapidly increase your productivity.
As Director of Code Horizons, I am really excited about these new courses, and I will be attending them myself. Watch this space for more developments in the coming months!
When using multiple imputation, you may wonder how many imputations you need. A simple answer is that more imputations are better. As you add more imputations, your estimates get more precise, meaning they have smaller standard errors (SEs). And your estimates get more replicable, meaning they would not change too much if you imputed the data again.
There are limits, though. No matter how many imputations you use, multiple imputation estimates can never be more precise or replicable than maximum likelihood estimates. And beyond a certain number of imputations, any improvement in precision and replicability becomes negligible.
So how many imputations are enough? An old rule of thumb was that 3 to 10 imputations typically suffice (Rubin 1987). But that advice only ensured the precision and replicability of point estimates. When the number of imputations is small, it is not uncommon to have point estimates that replicate well but SE estimates that do not.
In a recent paper (von Hippel 2018), for example, I estimated the mean body mass index (BMI) of first graders, from a sample where three quarters of BMI measurements were missing. With 5 imputations, the point estimate was 16.642 the first time I imputed the data, and 16.659 the second time. That’s good replicability; despite three quarters of the values being imputed, the two point estimates differed by just 0.1%.
But the SE estimate didn’t replicate as well; it was .023 the first time I imputed the data, and .026 the second time—a difference of 13%. Naturally, if the SE estimate isn’t replicable, related quantities like confidence intervals, t statistics, and p values won’t be replicable, either.
So you often need more imputations to get replicable SE estimates. But how many more? Read on.
A New Formula
I recently published a new formula (von Hippel 2018) that estimates how many imputations M you need for replicable SE estimates. The number of imputations is approximately
This formula depends on two quantities, FMI and CV(SE).
 FMI is the fraction of missing information. The FMI is not the fraction of values that are missing; it is the fraction by which the squared SE would shrink if the data were complete. You don’t need to figure out the FMI. Standard MI software gives you an estimate.
 CV() is a coefficient of variation, which you can think of as roughly the percentage by which you’d be willing to see the SE estimate change if the data were imputed again.
For example, if you have FMI=30 percent missing information, and you would accept the SE estimate changing by 10 percent if you imputed the data again, then you’ll only need M=5 or 6 imputations. But if you’d only accept the SE changing by 5%, then you’ll need M=19 imputations. (Naturally, the same formulas work if FMI and CV(SE) are expressed as proportions, .3 and .1, rather than percentages of 30 and 10.)
Notice that the number of imputations increases quadratically, with the square of FMI. This quadratic rule is better than an older rule M = 100 FMI, according to which the number of imputations should increase linearly with FMI (Bodner 2008; White et al. 2011).
Here’s a graph, adapted from von Hippel (2018), that fits the linear and quadratic rules to a simulation carried out by Bodner (2008), showing how many imputations are needed to achieve a goal similar to CV(SE)=.05. With that goal, the quadratic rule simplifies to M=1+200 FMI^{2}, and the linear rule, as usual, is M=100 FMI.
Clearly the quadratic rule fits the simulation better than the linear rule. The rules agree when FMI=0.5, but when FMI is larger, the linear rule underestimates the number of imputations needed, and when FMI is smaller, the linear rule overestimates the number of imputations needed. For example, with 20% missing information, the linear rule says that you need 20 imputations, but the quadratic rule says you can make do with 9.
When the fraction of missing information gets above 70%, both rules underestimate the number of imputations needed for stable tbased confidence intervals. I suspect that happens because the degrees of freedom in the t statistic becomes unstable (von Hippel 2018). I am looking at that issue separately.
A Two Step Recipe
A limitation of the quadratic rule is that FMI is not known in advance. FMI has to be estimated, typically by multiple imputation. And the estimate of FMI itself can be unreliable unless the number of imputations is large (Harel 2007).
So it’s a circular problem. You need an estimate of FMI to decide how many imputations you need. But you can’t get an estimate of FMI until you impute the data. For that reason, I recommend a twostep recipe (von Hippel, 2018):
 First, carry out a pilot analysis. Impute the data using a convenient number of imputations. (20 imputations is a reasonable default, if it doesn’t take too long.) Estimate the FMI by analyzing the imputed data.
 Next, plug the estimated FMI into the formula above to figure out how many imputations you need to achieve a certain value of CV(SE). If you need more imputations than you had in the pilot, then add those imputations and analyze the data again.
There are two small wrinkles:
First, when you plug an estimate of FMI into the formula, you shouldn’t use a point estimate. Instead, you should use the upper bound of a 95% confidence interval for FMI. That way you take only a 2.5% risk of having too few imputations in your final analysis.
Second, in your analysis you may estimate several parameters, as in a multiple regression. In that case, you have to decide which SEs you want to be replicable. If you don’t have a strong opinion, the simplest thing is to focus on the SE of the parameter with the largest FMI.
Software
The twostep recipe has been implemented in three popular data analysis packages.
 In Stata, you can install my command how_many_imputations by typing ssc install how_many_. When there are multiple parameters, it uses the highest FMI.
 In SAS, you can use my macro %mi_combine, which is available in this Google Drive folder.
 In R, you can use Josh Errickson’s howManyImputations package, which is available via GitHub, here.
Alternatives
The twostage procedure isn’t the only good way to find a suitable number of imputations. An alternative is to keep adding imputed datasets until the estimates converge, or change very little as new imputed datasets are added. This approach can be applied to any quantity you want to estimate—a point estimate, a standard error, a confidence interval, a p value. This approach, called iterative multiple imputation (Nassiri et al. 2018), has been implemented in the R package imi, which is available via GitHub, here.
Stata’s mi estimate command uses a jackknife procedure to estimate how much your standard error estimates and p values would change if the data were imputed again with the same number of imputations (Royston, Carlin, and White 2009). These jackknife estimates can give you some idea whether you need more imputations, but they can’t directly tell you how many imputations to add.
(An earlier, shorter version of this post appeared on missingdata.org in January 2018)
Key reference
von Hippel, Paul T. (2018). “How many imputations do you need? A twostage calculation using a quadratic rule.” Sociological Methods and Research, published online, behind a paywall. A free prepublication version is available as an arXiv eprint.
See also
Bodner, T. E. (2008). What Improves with Increased Missing Data Imputations? Structural Equation Modeling, 15(4), 651–675. https://doi.org/10.1080/10705510802339072
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. Prevention Science, 8(3), 206–213. https://doi.org/10.1007/s1112100700709
Harel, O. (2007) Inferences on missing information under multiple imputation and twostage multiple imputation. Statistical Methodology, 4(1), 7589.
Nassiri, Vahid, Geert Molenberghs, Geert Verbeke, and João BarbosaBreda. (2019). “Iterative Multiple Imputation: A Framework to Determine the Number of Imputed Datasets.” The American Statistician, 17 pages online ahead of print. https://doi.org/10.1080/00031305.2018.1543615
Royston, Patrick, John B. Carlin, and Ian R. White. (2009). “Multiple Imputation of Missing Values: New Features for Mim.” The Stata Journal 9(2):252–64.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067
When R first came out, around the year 2000, I was really excited. Here was a powerful, programmable statistical package that was free to anyone. I thought “This could revolutionize data analysis.” But when I gave it a test run, I quickly got discouraged. All the routine data management tasks seemed much harder in R than in SAS or Stata, my goto statistical packages. And many of my favorite analytical tools (like Cox regression) that were readily available in commercial packages seemed to be missing from R. Plus, everything was formulated in terms of vectors and matrices, and I knew that wouldn’t fly with the sociology graduate students I was teaching. So I put R back on the shelf and let others play with this shiny new toy.
Fast forward 20 years and things have changed dramatically. It now seems that no matter what you want to do in statistics, there’s an R package for that. Slick interface apps like RStudio have streamlined most of the routine tasks of managing R itself. And metapackages like the tidyverse have made data management and manipulation much easier and more straightforward.
Along with those changes, the community of R users has grown enormously. It’s hard to determine the exact number of R users, but recent estimates (by Oracle and by Revolution Analytics) put the worldwide user base at about 2 million. In a regularly updated blog post, Robert Muenchen has been tracking the popularity of data science software for several years, using a variety of metrics. One graph that I found particularly interesting was this one, which shows the number of mentions of various statistical packages in scholarly articles, based on a search of Google Scholar:
The most striking feature of this graph is the overall dominance of SPSS. However, SPSS peaked in 2009 and has since suffered a massive decline. SAS has also declined greatly since 2010, and it is now below R (the orange triangles) which has been steadily increasing in scholarly mentions since its inception.
The growth in R has two important consequences:
 It’s now much easier to get answers to any kind of R question, either by a quick web search or by just asking someone who works nearby.
 You can readily communicate your program code to hundreds of thousands of other researchers, without worrying about whether they have the same software installed.
It’s this second point that I want to stress here. I’ve come to the conclusion that R should be every data analyst’s second language. To me, it’s like English. No matter what your native language, it’s extremely useful to have English as a second language so that you can communicate easily with others who don’t share your native tongue. And so it is with R.
I’ve certainly found that to be true in my own work. Although I’m far from being a skilled R programmer, I’ve recently developed enough familiarity and competence with R that I can translate nearly all of my teaching examples and exercises into R. Currently, I try to provide R code for all the seminars I teach for Statistical Horizons. My two most recent publications have included R code. And I can now do a much better job of reading and reviewing papers that include R code.
R also enables me to do some things that I can’t do in SAS or Stata. For example, R can do multiple imputation for clustered data using either the jomo package or the mice package. And for linear structural equation modeling, I prefer the lavaan package over what’s in SAS or Stata (but Mplus is still better).
More and more people who come to Statistical Horizons seminars prefer to work in R, and they really appreciate having R code. Increasingly, our new instructors also prefer to teach their courses using R. Of course, that’s mainly a cohort phenomenon. Statistics departments overwhelmingly prefer to teach with R, at both the undergraduate and graduate levels, so there are thousands of newlyminted R users every year.
Don’t get me wrong, there are still lots of things I don’t like about R. There are so many usercontributed R packages that it’s hard to figure out which one will do what you want. And because there will often be many packages that will do you want, you then have the problem of figuring out which one is best. Documentation for many packages is sparse or poorly written. Even more bothersome is that some packages conflict with other packages because they have functions (or other elements) that share the same names. That can cause serious headaches.
Despite those problems, I am confident that R has a bright future. There are even newer competitors (like Python) that are favored in the data science/machine learning community. But R has reached such a critical mass that it will be very hard to stop.
How do you go about learning R? Well, there are lots of good materials on the web. But I think the best way is to take our seminar Statistics with R taught by Professor Andrew Miles. It’s designed for people who are reasonably proficient in statistics and already familiar with another package (like SPSS, SAS or Stata), but who just want to learn how to do those same things in R. In two days, you’ll get comfortable enough with R to do most of the statistical tasks that you are already doing in other packages. Andrew’s course also has lots of handson exercises, something that’s essential for deep learning of any new tool. You can get more information about his course here.
Standard methods for the analysis of panel data depend on an assumption of directional symmetry that most researchers don’t even think about. Specifically, these methods assume that if a oneunit increase in variable X produces a change of B units in variable Y, then a oneunit decrease in X will result in a change of –B units in Y.
Does that make sense? Probably not for most applications. For example, is it plausible that the increase in happiness when a person gets married is exactly matched by the decrease in happiness when a person gets divorced? Or that a $10K increase in income has the same effect on savings as a $10K decrease (in the opposite direction).
In this post, I’m going to show you how to relax that assumption. I’ll do it for the simplest situation where there are only two time points. But I’ve also written a more detailed paper covering the multiperiod situation.
Here’s the example with twoperiod data. The data set has 581 children who were studied in 1990 and 1992 as part of the National Longitudinal Survey of Youth. I’ll use three variables that were measured at each of the two time points:
anti antisocial behavior, measured with a scale from 0 to 6.
self selfesteem, measured with a scale ranging from 6 to 24.
pov poverty status of family, coded 1 for family in poverty, otherwise 0.
You can download the data here.
The goal is to estimate the causal effects of self and pov on anti. I’ll focus on fixed effects methods (Allison 2005, 2009) because they are ideal for studying the effects of increases or decreases over time. They also have the remarkable ability to control for all timeinvariant confounders.
For twoperiod data, there are several equivalent ways to estimate a fixed effects model. The difference score method is the one that’s the most straightforward for allowing directional asymmetry. It works like this: for each variable, subtract the time 1 value from the time 2 value to create a difference score. Then, just estimate an ordinary linear regression with the difference scores.
Here is Stata code for a standard symmetrical model:
use nlsy.dta, clear generate antidiff=anti92anti90 generate selfdiff=self92self90 generate povdiff=pov92pov90 regress antidiff selfdiff povdiff
You’ll find equivalent SAS code at the end of this post.
And here are the results:
 antidiff  Coef. Std. Err. t P>t [95% Conf. Interval] + selfdiff  .0391292 .0136396 2.87 0.004 .0659185 .01234 povdiff  .1969039 .1326352 1.48 0.138 .0636018 .4574096 _cons  .0403031 .0533833 0.75 0.451 .0645458 .1451521 
Selfesteem has a highly significant negative effect on antisocial behavior. Specifically, for each 1unit increase in selfesteem, antisocial behavior goes down by .039 units. But that also means that for each 1unit decrease in selfesteem, antisocial behavior goes up by .039 units. Poverty has a positive (but nonsignificant) effect on selfesteem. Children who move into poverty have an estimated increase in antisocial behavior of .112. But children who move out of poverty have an estimated decrease antisocial behavior of .112.
How can we relax the constraint that these effects have to be the same in both directions? York and Light (2017) showed the way. What’s needed is to decompose the difference score for each predictor variable into its positive and negative components. Specifically, if D is a difference score for variable X, create a new variable D^{+} which equals D if D is greater than 0, otherwise 0. And create a second variable D^{–} which equals –D if D is less than 0, otherwise 0.
Here’s how to create these variables in Stata:
generate selfpos=selfdiff*(selfdiff>0) generate selfneg=selfdiff*(selfdiff<0) generate povpos=povdiff*(povdiff>0) generate povneg=povdiff*(povdiff<0)
The inequalities in parentheses are logical expressions that have a value of 1 if the inequality is true and 0 if the inequality is false.
Now just regress antidiff on all four of these variables,
regress antidiff selfpos selfneg povpos povneg
which produces the following table:
 antidiff  Coef. Std. Err. t P>t [95% Conf. Interval] + selfpos  .0048386 .0251504 0.19 0.848 .0542362 .0445589 selfneg  .0743077 .025658 2.90 0.004 .023913 .1247024 povpos  .2502064 .2003789 1.25 0.212 .143356 .6437688 povneg  .126328 .1923669 0.66 0.512 .5041541 .2514981 _cons  .0749517 .086383 0.87 0.386 .2446157 .0947123

What does this tell us? A 1unit increase in selfesteem lowers antisocial behavior by .005 units (an effect that is far from statistically significant). A 1unit decrease in selfesteem increases antisocial behavior by .074 units (highly significant). So it looks like decreases in selfesteem have a big effect, while increases have little impact. Note that the original estimate of .039 was about midway between these two estimates.
Are the two effects significantly different? We can test that with the command
test selfpos=selfneg
which yields a pvalue of .10, not quite statistically significant.
Neither of the two poverty coefficients is statistically significant, although they are in the expected direction: moving into poverty increases antisocial behavior while moving out of poverty reduces it, but by about half the magnitude. These two effects are definitely not significantly different.
So that’s basically it for twoperiod data. When there are three or more periods, you have to create multiple records for each individual. Each record contains difference scores for adjacent periods. When estimating the regression model, you need to allow for negative correlations between adjacent records by using generalized least squares.
I discuss all these options in an article that can be found here. That paper also presents a data generating model that justifies the asymmetric first difference method. The data generating model can be extended to allow for the estimation of asymmetric logistic regression models, which can’t be estimated with difference scores.
If you want to learn more about fixed effects methods, see my two books on this topic: Fixed Effects Regression Models and Fixed Effects Regression Methods for Longitudinal Data Using SAS. Or take one of my seminars on longitudinal data analysis.
References:
Allison, Paul D. Fixed effects regression models. Vol. 160. SAGE publications, 2009.
Allison, Paul D. Fixed effects regression methods for longitudinal data using SAS. SAS Institute, 2005.
York, Richard, and Ryan Light. “Directional asymmetry in sociological analyses.” Socius 3 (2017): 113.
SAS Program /* The NLSY data set can be downloaded at statisticalhorizons.com/resources/datasets */ data nlsydiff; set my.nlsy; antidiff=anti2anti1; selfdiff=self2self1; povdiff=pov2pov1; proc reg data=nlsydiff; model antidiff=selfdiff povdiff; run; data nlsydiff; set nlsydiff; selfpos=selfdiff*(selfdiff>0); selfneg=selfdiff*(selfdiff<0); povpos=povdiff*(povdiff>0); povneg=povdiff*(povdiff<0); proc reg data=nlsydiff; model antidiff=selfpos selfneg povpos povneg; test selfpos=selfneg; test povpos=povneg; run;
When I teach courses on structural equation modeling (SEM), I tell my students that any model with instrumental variables can be estimated in the SEM framework. Then I present a classic example of simultaneous causation in which X affects Y, and Y also affects X. Models like this can be estimated if each of the two variables also has an instrumental variable—a variable that affects it but not the other variable. Specification of the model is fairly straightforward in any SEM package.
However, there are lots of other uses of instrumental variables. My claim that they can all be estimated with SEM is more speculative than I usually care to admit. A couple weeks ago I got a question about instrumental variables in SEM that really had me stumped. What seemed like the right way to do it was giving the wrong answer—at least not the answer produced by all the standard econometric methods, like twostage least squares (2SLS) or the generalized method of moments (GMM). When I finally hit on the solution, I figured that others might benefit from what I had learned. Hence, this blog post.
The question came from Steven Utke, an assistant professor of accounting at the University of Connecticut. Steve was trying to replicate a famous study by David Card (1995) that attempted to estimate the causal effect of education on wages by using proximity to college as an instrumental variable. Steve was able to replicate Card’s 2SLS analysis, but he couldn’t get an SEM model to produce similar results.
For didactic purposes, I’m going to greatly simplify Card’s analysis by excluding a lot of variables that are not essential for the example. I’ll use Stata for the analysis, but equivalent SAS code can be found in Addendum 2.
Let’s begin with a bivariate regression of the log of hourly wages on years of education, for a sample of 3,010 young men in 1976.
. use "https://dl.dropboxusercontent.com/s/bu8ze4db4bywwin/card", clear . gen lwage = log(wage) . regress lwage educ  lwage  Coef. Std. Err. t P>t [95% Conf. Interval] + educ  .0520942 .0028697 18.15 0.000 .0464674 .057721 _cons  5.570882 .0388295 143.47 0.000 5.494747 5.647017 
Education has a highly significant coefficient, whose magnitude can be interpreted as follows: each additional year of schooling is associated with about a 5% increase in wages.
Of course, the problem with treating this as a causal effect is that there are likely to be many omitted variables that affect both education and wages. We could control for those variables by measuring them and including them in the model (and Card did that for many variables). But there’s no way we can control for all the possible confounding variables, especially because some variables are difficult to measure (e.g., ability). We must therefore conclude that education is correlated with the error term in the regression (a form of endogeneity), and that our regression coefficient is therefore biased to an unknown degree.
Card proposed to solve this problem by introducing proximity to college as an instrumental variable. Specifically, nearc4 was a dummy (indicator) variable for whether or not the person was raised in a local labor market that included a fouryear college. As with any instrumental variable, the crucial assumption was that nearc4 would affect years of schooling but would NOT directly affect lwage. Although there may be reasons to doubt that assumption (some discussed by Card), they are not relevant to our discussion here.
Using the ivregress command in Stata, I estimated the instrumental variable model by 2SLS. In case you’re not familiar with that venerable method, it amounts to this: (a) do an OLS regression of educ on nearc4, (b) calculate predicted values from that regression, and (c) regress lwage on those predicted values. The only tricky part is getting the standard errors right. Here are the results:
. ivregress 2sls lwage (educ=nearc4) Instrumental variables (2SLS)  lwage  Coef. Std. Err. z P>z [95% Conf. Interval] + educ  .1880626 .0262826 7.16 0.000 .1365497 .2395756 _cons  3.767472 .3487458 10.80 0.000 3.083942 4.451001  Instrumented: educ Instruments: nearc4
Remarkably, the coefficient for education in this regression is more than three times as large as in the bivariate regression and is still highly significant. According to this estimate, each additional year of schooling yields a 21% increase in wages (calculated as 100(exp(.188)1), a very large effect by any reasonable standard. (Card got a coefficient of around .14 with other predictors in the model).
Now let’s try to do this with a structural equation model, using Stata’s sem command. As with all SEM software, the default is to do maximum likelihood estimation under the assumption of multivariate normality. The basic idea is to specify a model in which nearc4 affects educ, and educ affects lwage. But there can be no direct effect of nearc4 on lwage. Here’s a path diagram for the model:
Here’s the Stata code and the results:
. sem (lwage < educ) (educ < nearc4)   Coef. Std. Err. z P>z [95% Conf. Interval] + Structural  lwage  educ  .0520942 .0028688 18.16 0.000 .0464716 .0577169 _cons  5.570882 .0388166 143.52 0.000 5.494803 5.646961 + educ  nearc4  .829019 .1036643 8.00 0.000 .6258406 1.032197 _cons  12.69801 .0856132 148.32 0.000 12.53022 12.86581
This can’t be right, however, because the coefficient for educ is identical to what we got in our initial bivariate regression, with an almost identical standard error. Clearly, the inclusion of the regression of educ on nearc4 has accomplished nothing. Can this problem be fixed, and if so, how?
I struggled with this for about an hour before it hit me. The solution is to build into the model the very thing that we are trying to correct for with the instrumental variable. If there are omitted variables that are affecting both educ and lwage, they should produce a correlation between the error term for educ and the error term for lwage. That correlation needs to be included in the SEM model. Here’s a path diagram for the new model:
In Stata, it’s done like this:
. sem (lwage <educ) (educ<nearc4), cov(e.educ*e.lwage)
The cov option allows a covariance (and therefore a correlation) between the two error terms. Here are the new results:
  Coef. Std. Err. z P>z [95% Conf. Interval] + Structural  lwage  educ  .1880626 .0262826 7.16 0.000 .1365497 .2395756 _cons  3.767472 .3487458 10.80 0.000 3.083942 4.451001 + educ  nearc4  .829019 .1036643 8.00 0.000 .6258406 1.032197 _cons  12.69801 .0856132 148.32 0.000 12.53022 12.86581 +
Now the coefficient and standard error for educ are identical to what we saw earlier with 2SLS. Problem solved! (SEM won’t always give identical results to 2SLS, as I explain in Addendum 1).
Interestingly, the estimated correlation between the error terms (not shown in the table) was .66. That means that the collective impact of omitted variables is to affect educ and lwage in opposite directions. Whatever raises lwage lowers educ, and vice versa.
It’s hard to imagine what variables would behave like this. And that difficulty raises questions about the plausibility of the model. (Card suggested that measurement error in education could produce a negative correlation, but the degree of error would have to be unreasonably large to produce the result we just saw.) In any case, that’s not really our concern here. It does point out, however, that one of the advantages of the SEM approach is that you get an estimate of the error correlation—something you typically don’t see with 2SLS.
What’s the lesson here? If you want to use SEM to estimate an instrumental variable model, it’s essential to think carefully about why you need instruments in the first place. You should make sure that the specified model reflects all your beliefs about the causal mechanism. In particular, if you suspect that error terms are correlated with observed variables, those correlations must be built into the model.
If you’d like to learn more about structural equation modeling, check out my SEM short courses, which come in remote and ondemand versions. You can find links to past and future offerings here. You might also be interested in Felix Elwert’s course on Instrumental Variables. His past and future offerings can be found here.
Reference:
Card, David (1995) “Using geographic variation in college proximity to estimate the return to schooling.” Aspects of Labour Economics: Essays in Honour of John Vanderkamp. University of Toronto Press.
Addendum 1. JustIdentified Models
The reason 2SLS and SEM produce identical coefficient estimates for this example is that the model is justidentified. That is, the model imposes no restrictions on the variances and covariances of the variables. Other instrumental variable methods like limited information maximum likelihood or GMM also yield identical results in the justidentified situation. For models that are overidentified (for example, if there are two instrumental variables instead of just one), these methods will all yield somewhat different results.
Addendum 2. SAS code
Go to http://davidcard.berkeley.edu/data_sets.html to get the data set as a text file along with a sas program for reading in the data. Here is the code for the analysis:
proc reg data=card; model lwage76 = ed76; proc syslin 2sls data=card; endogenous ed76; instruments nearc4; model lwage76 = ed76; proc calis data=card; path lwage76 < ed76, ed76 < nearc4; proc calis data=card; path lwage76 < ed76, ed76 < nearc4, ed76 <> lwage76; *This specifies a correlation between the error terms; run;
Competing risks are common in the analysis of event time data. The classic example is death, with distinctions among different kinds of death: if you die of a heart attack, you can’t then die of cancer or suicide. But examples also abound in other fields. A marriage can end either by divorce or by the death of a spouse, but not both. If a loan is terminated by prepayment, it is no longer at risk of a default.
In this post, I argue that one of the more popular methods for regression analysis of competing risks—the analysis of subdistribution hazards, introduced by Fine and Gray (1999)—is not valid for causal inference. In fact, it can produce completely misleading results. Instead, I recommend the analysis of causespecific hazards, a longstanding and easily implemented method.
Let’s review that classic method first. The estimation of causespecific hazard functions (Kalbfleish and Prentice 2002) can be accomplished with standard methods for single kinds of events. You simply treat all competing events as though the individual were right censored at the time the competing event occurs. For example, if you want to study the effect of obesity on the risk of death due to heart disease, just estimate a Cox proportional hazards model in which all causes of death other than heart disease are treated as censoring. Or if you want to estimate the effect income on divorce, estimate a Cox model in which spousal death is treated as censoring.
By contrast, the method of Fine and Gray (1999) does not treat competing events in the same way as censored observations. Based on cumulative incidence functions, their method estimates a proportional hazards model for something they call the subdistribution hazard.
The definition of the subdistribution hazard is similar to that for a causespecific hazard, with one key difference: the causespecific hazard removes an individual from the risk set when any type of event occurs; the subdistribution hazard removes an individual from the risk set when an event of the focal type occurs or when the individual is truly censored. However, when a competing event occurs, the individual remains in the risk set. Fine and Gray acknowledge that this is “unnatural” because, in fact, those who experience competing events are no longer actually at risk of the focal event. But it’s necessary in order to get a model that correctly predicts cumulative incidence functions. (More on that later).
According to Google Scholar, the Fine and Gray paper has been cited more than 5,000 times. It is now widely available in most major software packages, including Stata (with the stcrreg command), SAS (with the EVENTCODE option in PROC PHREG) and R (with the ‘cmprsk’ package). In some fields, it has become the standard method for analyzing competing risks. In the minds of many researchers, it is the only proper way to analyze competing risks.
But there’s one big problem: the subdistribution method doesn’t isolate distinct causal effects on the competing risks. In fact, it confounds them in predictable and alarming ways. Specifically, any variable that increases the causespecific risk of event A will appear to decrease the subdistribution hazard for event B. Why? Because whenever a type A event occurs, it eliminates the possibility that a type B event will happen.
Here’s a simple simulation that demonstrates this phenomenon. I generated 10,000 observations on each of two event types, labeled A and B. Event times for type A were generated by a Weibull regression model (a parametric version of the proportional hazards model). The only predictor was variable X, which had a standard normal distribution.
Event times for type B were generated by an identical Weibull regression model in which the only predictor was variable Z, also standard normal. X and Z had a correlation of .50.
If event A occurred before event B, the event time for B was not observed. Similarly, if B occurred before A, the event time for A was not observed. Any event times greater than 8 were treated as right censored. SAS code for generating the data and performing the analysis is appended below.
In the resulting data set, there were 4350 type A events, 4277 type B events, and 1391 truly censored observations (neither A nor B was observed). Given the data generating process, any attempt to estimate causal parameters should find no effect of X on B and no effect of Z on A.
In Table 1 below, I show the results from applying the two different methods: causespecific Cox regression models with competing events treated as censoring; and subdistribution proportional hazards models. For type A, the causespecific estimate of .494 for the effect of X is close to the true value of .500 and highly significant. The coefficient of .008 for Z is close to the true value of 0 and far from statistically significant. Overall, pretty good performance.
The subdistribution estimates for type A, on the other hand, are clearly unsatisfactory. The coefficient for X appears to be biased downward by about 10%. The coefficient for Z (.216) is far from the true value of 0, and highly significant. Thus, the subdistribution estimates would lead one to conclude that an increase in Z reduced the risk of event A. What it actually did is reduce the likelihood that type A events would be observed because it increased the risk of event B.
The results for event B in the lower panel of Table 1 are the mirror image of those for event A. For both X and Z, the causespecific estimates are close to the true values. The subdistribution estimates are biased, and would lead to incorrect causal inferences.
Table 1. Results from Two Methods for Estimating Competing Risks Models, NonInformative Censoring.
CauseSpecific Hazards 
Subdistribution Hazards 

Type A 
Estimate 
S.E. 
p 
Estimate 
S.E. 
p 
x 
.490 
.018 
<.0001 
.420 
.018 
<.0001 
z 
.009 
.018 
.601 
.214 
.017 
<.0001 
Type B 






x 
.021 
.018 
.248 
.187 
.017 
<.0001 
z 
.488 
.018 
<.0001 
.433 
.018 
<.001 
So why would anyone seriously consider the subdistribution method? Well, there’s one big problem with the causespecific hazards approach. Virtually all methods based on causespecific hazards implicitly assume that censoring is noninformative. Roughly speaking, that means that if an individual is censored at a particular point in time, that fact tells you nothing about the individual’s risk of the event.
If the censoring times are determined by the study design (as when all event times beyond a certain calendar time are censored), that’s not usually an issue. But if censoring times are not under the control of the investigator, the censoring may be informative. And that can lead to bias.
If competing risks are treated as a form of censoring, then we need to be concerned about whether that censoring is informative or noninformative. How might competing events be informative? If a spouse dies, does that tell us anything about the risk that the couple would have divorced? Maybe, but probably not. Does the fact that a person dies of heart disease tell us anything about that person’s risk of dying of cancer? Maybe so. That would definitely be the case if cancer and heart disease had common risk factors, and those factors were not included in the regression model.
Unfortunately, there’s no way to test the assumption that censoring is noninformative (Tsiatis 1975). And even if there were, there’s no good method available for estimating causal effects when censoring is informative.
By contrast, the subdistribution hazard method does not explicitly assume that competing risks are noninformative. And that has been one of its major attractions. However, as I now show, the subdistribution method does no better (and actually somewhat worse) than the causespecific method when competing events are informative.
In the previous simulation, the assumption of noninformative censoring was satisfied by the data generating process. To model informative censoring, I added an unobserved common risk factor to the regression equations for the two event times. This was simply a normally distributed random variable with a standard deviation of 2. This “random intercept” induced a correlation of .28 between the uncensored event times for type A and type B. I then reestimated the models, with results shown in Table 2 below.
The message of Table 2 is this: Yes, informative censoring leads to causespecific estimates that are biased. And unlike in the previous table, the causespecific estimates might lead one to conclude, incorrectly, that Z affects the risk for event A and that X affects the risk for event B. But the subdistribution estimates are also biased. And, with one minor exception, the biases are worse for the subdistribution method than for the causespecific method.
Table 2. Results from Two Methods for Estimating Competing Risks Models, Informative Censoring.

CauseSpecific Hazards 
Subdistribution Hazards 

Type A 
Estimate 
S.E. 
p 
Estimate 
S.E. 
p 
x 
.347 
.019 
<.0001 
.349 
.019 
<.0001 
z 
.067 
.019 
.0005 
.185 
.019 
<.0001 
Type B 






x 
.102 
.019 
<.0001 
.212 
.019 
<.0001 
z 
.372 
.019 
<.0001 
.368 
.019 
<.0001 
To repeat my earlier question: Why would anyone ever seriously consider using the subdistribution method? To be fair to Fine and Gray, they never claimed that subdistribution regression would accurately estimate causal parameters. Instead, they introduced the method as a way to model the impact of covariates on the cumulative incidence functions. These functions are often preferable to KaplanMeier estimates of the survivor function because they do a better job of describing the empirical distribution of events rather than some hypothetical distribution that would apply only in the absence of competing risks.
Cumulative incidence functions are particularly useful for prediction. Suppose you have a cohort of newly diagnosed cancer patients. Based on the experience of earlier patients, you want to predict in five years what percentage will have died of cancer, what percentage will have died of other causes, and what percentage will still be alive. Cumulative incidence functions will give you that information. KaplanMeier survivor functions will not.
The Fine and Gray method provides a way to introduce covariate information into those predictions, potentially making them more accurate for individual patients. It’s important to note, however, that one can also calculate cumulative incidence functions based on causespecific hazard functions. Most commercial packages for Cox regression don’t have that capability, but there are downloadable SAS macros that will accomplish the task.
In sum, the subdistribution method may be useful for generating predicted probabilities that individuals will be in particular states at particular times. But it is not useful for estimating causal effects of covariates on the risks that different kinds of events will occur. For that task, the analysis of causespecific hazards is the way to go. Unfortunately, both methods are vulnerable to competing events that are informative for each other. The only effective way to deal with that problem is to estimate causespecific hazard models that include common risk factors as covariates.
SAS Code for Simulation
data finegraytest; do i=1 to 10000; *Generate a common risk factor. To include it in the model, change 0 to 1; comrisk=0*rannor(0); *Generate x and z, bivariate standard normal with r=.5; x=rannor(0); z=.5*x + sqrt(1.25)*rannor(0); *Generate w with Weibull distribution depending on x; logw=2 + .75*x + 1.5*log(ranexp(0)) + 2*comrisk; w=exp(logw); *Generate y with Weibull distribution depending on z; logy=2 + .75*z + 1.5*log(ranexp(0)) + 2*comrisk; y=exp(logy); *Allow events to censor each other; if y>w then do; type=1; t=w; end; else if w>y then do; type=2; t=y; end; *Censor all event times at 8; if t>8 then do; type=0; t=8; end; output; end; run; proc freq; table type; run; /*Estimate causespecific hazard regressions */ *Model for type1 event; proc phreg data=finegraytest; model t*type(0 2)=x z; run; *Model for type2 event; proc phreg data=finegraytest; model t*type(0 1)=x z; run; /*Estimate subdistributions hazard regressions */ *Model for type1 event; proc phreg data=finegraytest; model t*type(0)=x z / eventcode=1; run; *Model for type2 event; proc phreg data=finegraytest; model t*type(0)=x z / eventcode=2; run;
In my courses and books on longitudinal data analysis, I spend a lot of time talking about the betweenwithin model for fixed effects. I used to call it the hybrid model, but others have convinced me that “betweenwithin” provides a more meaningful description.
Last week my longtime collaborator, Paula England, asked me a question about the betweenwithin model that stumped me at first. When I finally figured out the answer, I realized that it could potentially be important to anyone who uses the betweenwithin method. Hence, this post.
Here is Paula’s project, coauthored with Eman Abdelhadi. They have a large, crosssectional sample of adult women from approximately 50 countries. They are estimating a logistic regression model for a dichotomous dependent variable: whether or not a woman is employed. One of the independent variables is also dichotomous: whether or not a woman is a Muslim (coded 1 or 0). In order to control for all betweencountry differences, they estimate a betweenwithin model with the following characteristics:
 a randomintercepts model with countries as clusters.
 a “within” predictor that is the Muslim indicator (dummy variable) minus the country mean of that variable.
 a “between” predictor that is the country mean, i.e., the proportion Muslim.
 other control variables at both the person level and the country level.
The coefficient for the within predictor is very close to what you would get from a classic fixed effects estimator, as estimated by conditional logistic regression. (My 2014 post discusses why they are not identical). So it represents the effect of being Muslim on employment, controlling for all countrylevel variables, both observed and unobserved.
Here is the question. Can you interpret the coefficient for the between predictor (proportion Muslim) as the effect of living in a country that is more or less Muslim, regardless of whether a person is personally a Muslim? In other words, can you interpret the between coefficient as a contextual effect?
Surprisingly, the answer is no. By construction, the within variable is uncorrelated with the between variable. For that reason, the coefficient for proportion Muslim does not actually control for the personlevel effect of being Muslim.
The reason I never thought about this question before is that I primarily use the betweenwithin model for longitudinal data, with repeated measurements clustered within persons. In that setting, the coefficients for the between variables are usually uninteresting or misleading. But when applied to other kinds of clustering, the estimation and interpretation of “between” effects can be quite important.
Fortunately, there’s a simple solution to the problem: Estimate the model using the original Muslim indicator rather than the deviation from its countrylevel mean. This model is mathematically equivalent to the original betweenwithin model—the coefficients for the Muslim indicator and for all the control variables will be exactly the same. But the coefficient for the proportion Muslim will change, and in a predictable way. Here is the algebra:
In these equations p_{ij} is the probability of employment for person i in country j, x_{ij} is the Muslim indicator for person i in country j, and xbarj is the mean of the Muslim indicator in country j.
The first equation is the usual formulation of the betweenwithin model. The second equation is the algebra that gets us to the third equation. That equation is the new formulation of the model with x_{ij} and xbarj as predictors. Notice that the coefficient of xbarj in the third equation can be found by subtracting the coefficient for the within effect from the coefficient for the between effect in the first equation.
So you don’t actually have to reestimate the model to get the contextual effect. You can just take the difference in the coefficients in the standard betweenwithin model. Furthermore, testing whether those coefficients are the same or different (which is usually done to test fixed vs. random effects) is equivalent to testing for the existence of a contextual effect.
Keep in mind that although the coefficient for xbarj in the third equation can be interpreted as a contextual effect, it does not control for any unobservables. So it could be biased by the omission of countrylevel variables or personlevel variables that affect employment and are correlated with proportion Muslim. Note, also, that the fact that the personlevel variable is dichotomous in this example is completely incidental. The same logic would apply if x_{ij} were continuous.