To describe how a prognostic or diagnostic model can be developed and tested as thoroughly as possible.
The outcome measure of the model is clearly described;
The use of a prognostic model (instead of e.g. a causal model) clearly follows from the research question;
The sample size is large enough to build a valid model.
Clearly describe the (type of) outcome measure;
Describe how dropouts and missing values were handled;
Give a correlation table with all potential predictors;
Describe in detail how the prognostic model is built;
Describe how the quality and validity of the model are assessed.
To consult an expert in case there is doubt whether a prognostic model is the appropriate model.
To conduct a literature review or consult experts to select the appropriate potential predictors for the model;
To clearly state the outcome measure of the model;
To define the number of predictors and to calculate the required sample size;
To assess linearity of all potential predictors and to create a correlation table of these predictors;
To describe how dropouts and missing values are handled;
To use a backward selection method for constructing the model and to describe the cut-off points used for selection;
To assess the quality and validity of the model.
To advise the executing researcher to use this guideline and, if necessary, to consult an expert on the type of model and the potential predictors to be used;
The aim of a prognostic model is to estimate (predict) the probability of a particular outcome as optimally as possible, and not just to explore the causality of the association between a specific factor and the outcome (explanatory). The way in which a prognostic model is developed differs therefore from the method for building an explanatory model. For an explanatory (causal) model there is normally a single central determinant and correction for confounding. When building a prognostic model the focus is on the search for a combination of factors which are as strongly as possible related to the outcome.
Prognostic models are often developed for clinical practice, where the risk of disease development or disease outcome (e.g. recovery from a specific disease) can be calculated for individuals by combining information across patients. The model can then be presented in the form of a clinical prediction rule (1). It is often preferable that the variables in the model are easily determined in practice, to ensure that the prognostic model is applicable in (clinical) practice. Prospective research designs are used for prognostic models; cross-sectional designs are used for diagnostic models. Diagnostic models are developed with the same procedures as described in these guidelines for prognostic models; the only difference is that the aim of a diagnostic model is to detect the presence of a disease.
All aspects of developing and validating a prediction model have been described since 2015 in the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement (1), which sets out in detail the steps to be taken to develop a prediction or diagnostic model. This guideline discusses the most important criteria.
Choice of predictors
Prognostic models can be developed using a broad variety of biological, psychological and social predictors. The correct predictors need to be carefully selected. It is advisable to include all predictors which have been shown to be strongly associated with the outcome in previous research, or those which can be expected to show an association on the basis of conceptual or theoretical models. A proper systematic literature review and expert advice is important in this step. When the practical applicability of the prognostic model is important, it is preferable to use predictors that can quickly and simply be determined (e.g. no complex or invasive tests and no extensive questionnaires).
Defining the outcome measure
The outcome is central to the prognostic model and needs to be carefully selected. Think carefully about the nature of the outcome (which concept), the method for determining the outcome (which measurement instrument, by whom) and the length of follow-up (which measurement time points). The outcome of a prognostic model is often dichotomous (e.g. ill or not ill), but it may also be a continuous outcome (for instance, the severity of functional limitations), or the time until a certain event occurs (“time to event”, for instance, the time until work is resumed or time until death). When defining a dichotomous outcome, occasionally a cut-off point is chosen on a continuous scale. Bear in mind that this leads to a loss of information and therefore this should only be considered in the case of strong arguments. If dichotomized, a cut-off needs to be carefully selected, preferably based on substantive arguments and the use of a conceptual or theoretical model. For instance, at what point do we define whether or not there is a case of depression?
Choice of model
The choice of the statistical model used to create the prognostic model depends on the definition of the outcome measure. A logistic regression model should be chosen for a dichotomous outcome, a Cox regression model for a “time to event” outcome, and a linear regression model for a continuous outcome measure. There are various other options, but these will not be discussed in this guideline.
Sample size and number of predictors
The precision of the estimates in the prognostic model depends strongly on the size of the study population. There are different ways of performing sample size calculations to determine the minimal size of the study population; this, in particular, determines the number of variables that can be included in the regression model. A rule of thumb is that for a continuous outcome measure (linear regression) you need at least 10 – 15 participants per variable in the model. For a dichotomous outcome (logistic regression) you need at least 10 – 15 “events” or “non-events” per variable, whichever of the two is the smaller group (2,3). Events and non-events refer to whether or not the outcome occurs, for instance disease/no disease. The logistic regression rule also applies to Cox regression models. For external validation of a prognostic model, the validation cohort (the cohort used to externally test the model) also needs a sufficient number of participants; the 10 – 15 participants rule applies here as well.
The regression models discussed in this guideline presuppose a linear relationship between predictor and outcome. However, this relationship is often non-linear. An example is the relationship between alcohol consumption and the risk of developing a cardiac infarction, which is U-shaped. One therefore needs to consider, for all potential predictors (with the exception of nominal or dichotomous variables; nominal variables should always be included as dummy variables), whether the relationship with the outcome measure is indeed linear. A balance must be sought between a data-driven search for non-linearity that is idiosyncratic to the sample and non-linearity that genuinely applies to the population. Most importantly, it is not the exact form of the relationship that matters, but the gain in predictive performance. Various options exist for investigating non-linearity, such as fractional polynomials (4) and spline functions.
Spline functions can be used to further explore the relationship between a predictor and the outcome (spline functions are mathematical functions used to analyse this relationship carefully when it is non-linear). Spline functions do not assume a linear relationship between a predictor and the outcome measure, but follow the pattern of the data in more detail. If there is a non-linear relationship between the predictor and the outcome, it can be included as a function in the regression model. The advantage is that this reduces the power of the regression model far less than categorizing the variable and including it as dummy variables, which is often done when a relationship is non-linear.
Correlation between predictors
A strong correlation between two variables will affect the selection of both predictors. It is therefore sensible to generate a correlation table including all potential predictors. When variables are strongly correlated (e.g. r > 0.70), it is sensible to choose which of the variables you are going to use in building the model, or to combine them into a single variable. For instance, you could choose the variable most strongly associated with the outcome measure, or the one that is easiest to measure. N.B.: strongly correlated variables are not a problem in a single, fixed model; problems arise when “forward” or “backward” selection takes place in combination with strongly correlated (independent) variables.
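A correlation table of candidate predictors can be sketched in plain Python; the predictor names and values below are hypothetical:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical candidate predictors measured in five participants.
predictors = {
    "age":  [55, 60, 65, 70, 75],
    "bmi":  [24, 26, 27, 29, 31],
    "pain": [3, 1, 4, 2, 5],
}

# Print each pairwise correlation and flag strongly correlated pairs.
names = list(predictors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(predictors[a], predictors[b])
        flag = "  <- consider keeping only one" if abs(r) > 0.70 else ""
        print(f"r({a}, {b}) = {r:+.2f}{flag}")
```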
Handling missing values
There are dropouts and missing values in virtually every cohort study. Dropouts are participants who no longer take part in follow-up assessments and whose outcome measures are therefore missing. The number of dropouts and the reasons for dropping out need to be described. If possible, the personal characteristics of the dropouts should also be described and compared with those of participants who did take part in the follow-up assessments, in order to investigate whether dropout was selective. In addition to dropouts there are often also (incidental) missing values, where the results of one or more predictors are missing for some of the participants.
There are various strategies for dealing with missing values. One is to use only the data from participants with a complete dataset (“complete case analysis”). This method is only valid when the data are Missing Completely At Random (MCAR), which is seldom the case. More often the data are Missing At Random (MAR); in that case an advanced and recommended method such as multiple imputation can be used. Make sure that the numbers of dropouts and missing values are always described in your study. For detailed information on techniques to evaluate and handle missing data we refer to the guideline “handling missing data”.
3. Developing the model
Preselecting predictors and building the model
Once a set of predictors has been selected, the next step is to create the prognostic model. In this process it is important to distinguish between relevant and less relevant predictors, striving for a final model with as few predictors as possible that still gives reliable predictions. The following techniques can be used for developing a prognostic model.
A. Univariate and Stepwise regression analysis
First, the relationship between each individual predictor and the outcome measure is investigated in a model that includes only that predictor and the outcome measure (univariate analysis). The relationship is evaluated against a specific p-value; 0.05 or 0.157 are often used, or a lower value. It is not advised to preselect predictors based on univariate statistical significance alone. It is better to make the first selection of predictors based on previous research or expert opinion and not to rely too heavily on statistical pre-selection.
Building the model
The options are to use a forward or backward selection method, or a combination of the two (stepwise regression). Forward and backward selection methods can be used in order to select the predictors for the model step-by-step. In the forward selection method you add variables to the model, whereas in a backward selection method you remove variables from the model. The backward selection method is preferred, as it leads to fewer errors in the estimates for the predictors and in selecting the most relevant predictors. For these reasons this method is discussed in more detail here.
In backward selection all the selected variables are first entered into the model at the same time. Subsequently, the variable with the highest p-value (i.e. the variable contributing the least) is removed on the basis of the Wald test (which gives the significance level of a predictor), and the model is re-run. This step is repeated until no variables remain with a p-value larger than the chosen cut-off, e.g. 0.05 or 0.157 (1).
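The backward selection loop itself can be sketched in Python. Note that refit() below is a stub returning invented p-values; in a real analysis each step refits the regression model and reads the Wald-test p-values from its output:

```python
# Sketch of manual backward selection with a hypothetical refit() stub.
P_CUTOFF = 0.157  # one of the cut-off values mentioned in the text

def refit(variables):
    """Stand-in for refitting the model; these p-values are invented."""
    fake_p = {"age": 0.01, "bmi": 0.03, "smoking": 0.20, "income": 0.45}
    return {v: fake_p[v] for v in variables}

def backward_select(variables):
    variables = list(variables)
    while variables:
        p_values = refit(variables)
        worst, p = max(p_values.items(), key=lambda kv: kv[1])
        if p < P_CUTOFF:
            break                   # every remaining variable contributes
        variables.remove(worst)     # drop the least informative variable
    return variables

print(backward_select(["age", "bmi", "smoking", "income"]))
# -> ['age', 'bmi']
```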
Following this procedure, it may sometimes be informative to add specific variables that did not end up in the final model (but were perhaps expected to fit in it), to assess whether they make a significant contribution to the final model. This is occasionally the case. It may also be interesting to interchange variables on the basis of the correlation between them (e.g. substituting variables that are easier to measure), to assess whether this yields an equivalent but more easily applicable model.
B. Least absolute shrinkage and selection operator (Lasso)
The Lasso is an advanced technique for the selection of variables. The Lasso can shrink regression coefficients to exactly zero; shrinking a coefficient to zero amounts to not selecting that variable in a multivariable analysis. The Lasso thus combines shrinkage with variable selection and does not need a separate shrinkage step. Furthermore, with the Lasso the number of potential prognostic variables to select from can be much larger than with “normal” backward selection. The method is promising but has not yet been applied much in epidemiological studies.
4. The performance of the prognostic model
Once you have developed a prognostic model, it is also important to investigate how well the model works, that is to say, how well does the model predict outcomes? The section below describes which techniques, depending on the choice of model, can be used to test how well your prognostic model works (1):
Linear regression
The percentage of variance explained (R2): This indicates the percentage of the total variance of the outcome measure that is explained by the predictors in the prognostic model.
Logistic and Cox regression
Calibration: Calibration assesses how well the observed probability of the outcome agrees with the probability predicted by the model. This can also be presented graphically in a calibration plot, in which groups of predicted probabilities of the outcome are plotted against the corresponding observed proportions (ten groups, i.e. deciles of predicted risk, are often used). You can then assess the extent to which these groups lie along the perfect calibration line, which forms a 45-degree angle with the horizontal axis. The Hosmer-Lemeshow test can also be used to investigate how well the predicted probabilities agree with the observed probabilities; this test should not be statistically significant (null hypothesis: there is no difference between predicted and observed values).
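The grouping behind a calibration plot can be sketched as follows; the predicted probabilities and outcomes below are invented, and only four groups are used to keep the example small (ten are more common):

```python
# Sketch: sort subjects by predicted probability, split them into groups,
# and compare mean predicted probability with the observed event fraction.

def calibration_table(pred, obs, n_groups=4):
    pairs = sorted(zip(pred, obs))
    size = len(pairs) // n_groups
    rows = []
    for i in range(n_groups):
        group = pairs[i * size:(i + 1) * size]
        mean_pred = sum(p for p, _ in group) / len(group)
        mean_obs = sum(o for _, o in group) / len(group)
        rows.append((round(mean_pred, 2), round(mean_obs, 2)))
    return rows

pred = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # model predictions
obs  = [0,   0,   0,   1,   0,   1,   1,   1]     # observed outcomes
for mean_pred, mean_obs in calibration_table(pred, obs):
    print(f"predicted {mean_pred:.2f} vs observed {mean_obs:.2f}")
```

A well-calibrated model yields rows in which the two columns are close, i.e. points near the 45-degree line of the calibration plot.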
Discrimination: This indicates how well the model discriminates between people with and without the outcome. If there are few predictors in the model, many people will fall into the same group of predicted probabilities and the model will not discriminate well between groups; with more predictors, fewer people fall into the same group and the model has better discriminatory power. An ROC curve can be generated for the predicted probabilities to determine the level of discrimination. The Area Under the Curve (AUC) of the ROC curve is a measure of the discriminatory power of the model, that is, how well the model distinguishes between people with and without the outcome based on the predicted probabilities (3). An AUC of 0.5 indicates that the model discriminates no better than tossing a coin; an AUC of 1.0 indicates perfect discrimination. It is also advisable to report measures such as the sensitivity and specificity of the model at chosen probability thresholds; this facilitates the translation of the model into clinical practice.
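The AUC can be computed directly from its rank interpretation: it equals the probability that a randomly chosen person with the outcome receives a higher predicted probability than a randomly chosen person without it. A small sketch with invented data:

```python
# Sketch: AUC via the Mann-Whitney interpretation of the ROC curve.

def auc(pred, outcome):
    pos = [p for p, y in zip(pred, outcome) if y == 1]
    neg = [p for p, y in zip(pred, outcome) if y == 0]
    # Count predicted-probability "wins" of cases over non-cases
    # (ties count as half a win).
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

pred    = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # invented predictions
outcome = [1,   1,   0,   1,   0,   1,   0,   0]    # invented outcomes
print(f"AUC = {auc(pred, outcome):.2f}")  # AUC = 0.81
```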
Reclassification tables: This is a novel method to evaluate the performance of a prediction model and can be seen as a refinement of the discrimination obtained by the ROC curve (5). The method is especially useful for detecting an improvement in discrimination when a new variable is added to an existing prediction model. It makes use of the reassignment of subjects with and without the outcome to their corresponding risk categories. When a new variable is added to the model and prediction improves, subjects with the outcome are reassigned to higher risk categories: improved reclassification. When subjects with the outcome are reassigned to lower risk categories, reclassification worsens. For subjects without the outcome it works in the opposite direction. The Net Reclassification Improvement (NRI) and Integrated Discrimination Improvement (IDI) can be used to test the significance of reclassification and to create confidence intervals.
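The NRI calculation can be sketched in a few lines; the risk categories and reassignments below are invented (categories are coded as ordered integers, 0 = lowest risk):

```python
# Sketch of the Net Reclassification Improvement (NRI): upward moves are
# good for subjects with the outcome, downward moves are good for subjects
# without it.

def nri(old_cat, new_cat, outcome):
    events     = [(o, n) for o, n, y in zip(old_cat, new_cat, outcome) if y == 1]
    non_events = [(o, n) for o, n, y in zip(old_cat, new_cat, outcome) if y == 0]
    up_e    = sum(n > o for o, n in events) / len(events)
    down_e  = sum(n < o for o, n in events) / len(events)
    up_ne   = sum(n > o for o, n in non_events) / len(non_events)
    down_ne = sum(n < o for o, n in non_events) / len(non_events)
    return (up_e - down_e) + (down_ne - up_ne)

old_cat = [0, 1, 1, 2, 0, 1, 2, 2]   # risk category under the old model
new_cat = [1, 2, 1, 2, 0, 0, 1, 2]   # risk category after adding a variable
outcome = [1, 1, 1, 1, 0, 0, 0, 0]
print(f"NRI = {nri(old_cat, new_cat, outcome):+.2f}")  # NRI = +1.00
```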
Decision Curve Analysis (DCA): Decision curve analysis is a method to evaluate the net benefit (NB) of a prediction model across the range of clinician and patient preferences for accepting the risk of under- or overtreatment (6). The decision to treat depends on the benefits (effectiveness) and harms (complications, costs) of the treatment. A higher NB means that the model is more clinically useful, as indicated by the higher number of true-positive (TP) patients identified. Furthermore, the NB of different prediction models can easily be compared.
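At a single threshold probability, the net benefit is the fraction of true positives minus the fraction of false positives weighted by the odds of the threshold; a decision curve plots this over a range of thresholds. A sketch with invented data:

```python
# Sketch of net benefit (NB) at one threshold probability, the quantity
# plotted on the vertical axis of a decision curve.

def net_benefit(pred, outcome, threshold):
    n = len(pred)
    tp = sum(1 for p, y in zip(pred, outcome) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(pred, outcome) if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))

pred    = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1]  # invented predictions
outcome = [1,   1,   1,   0,   1,   0,   0,    0]    # invented outcomes
print(f"NB at threshold 0.5: {net_benefit(pred, outcome, 0.5):.3f}")
```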
5. Creating a prediction rule
For logistic and Cox regression models the regression coefficients can be used to calculate the outcome (predicted probabilities), based on individual patient characteristics (values of the determinants). The regression coefficients can be transformed into risk scores in order to facilitate use of the prediction rule in practice. A frequently used method is to divide the regression coefficients by the lowest value or to multiply the coefficients by a constant, for instance 10. A score card containing these scores can then be generated, allowing the probability of the outcome to be easily calculated for a given individual. This is easy to use in practice. Refer to the article by Kuijpers et al. (2006) for an example (5). Another option is to implement the rule as a mathematical algorithm on a website.
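The division-by-the-smallest-coefficient method can be sketched as follows; the coefficients and the patient below are invented for illustration:

```python
# Sketch: turn regression coefficients into integer risk scores by dividing
# each coefficient by the smallest one and rounding (coefficients invented).

coefficients = {"age_per_decade": 0.2, "smoking": 0.6, "diabetes": 0.8}

smallest = min(abs(b) for b in coefficients.values())
scores = {name: round(b / smallest) for name, b in coefficients.items()}
print(scores)  # {'age_per_decade': 1, 'smoking': 3, 'diabetes': 4}

# Total score for one hypothetical patient: a 70-year-old smoker
# without diabetes.
patient = {"age_per_decade": 7, "smoking": 1, "diabetes": 0}
total = sum(scores[k] * v for k, v in patient.items())
print(total)  # 10
```

The total score can then be looked up on a score card that maps scores to predicted probabilities.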
6. Validating the model
This is perhaps the most important part of developing a prediction rule. Prediction models commonly perform better in the dataset used to develop the model than in new datasets (subjects). This means that the model’s regression coefficients and performance measures are too optimistic and have to be adapted to new situations (1, 6). One way to adapt a prediction model is to shrink (i.e. make smaller) the regression coefficients before the model is applied to new subjects. Internal and external validation are used to estimate the amount of optimism. In other words, validating the model explores how well predictions generated by the prognostic model agree with predictions for future patients or comparable patients who were not part of the study population. The validity of a prediction rule can be determined in a number of ways, discussed briefly below. A good reference for a more comprehensive overview is Vergouwe et al. (7).
A distinction is made between internal and external validity when validating a prediction rule.
For internal validity the model is developed and validated using exactly the same dataset of patients. Techniques that can be used to determine internal validity include data-splitting (the dataset is split in two at random), cross-validation (the dataset is split into more than two parts at random) and bootstrapping (a type of simulation technique) (8). The last method is recommended, as it makes efficient use of all the data.
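The bootstrap approach can be sketched as a resampling loop that fits the model in each bootstrap sample and compares its performance there with its performance in the original data; the average gap estimates the optimism. In the sketch below fit() and performance() are deliberately trivial stand-ins (the 'model' is just a mean, 'performance' a negated squared error), so the loop structure is the focus rather than the statistics:

```python
import random

def fit(data):
    """Stand-in for model fitting: the sample mean."""
    return sum(data) / len(data)

def performance(model, data):
    """Stand-in performance measure (higher is better)."""
    return -sum((x - model) ** 2 for x in data) / len(data)

def bootstrap_optimism(data, n_boot=500, seed=1):
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        boot = rng.choices(data, k=len(data))   # resample with replacement
        model = fit(boot)                       # "develop" the model
        gaps.append(performance(model, boot)    # apparent performance
                    - performance(model, data)) # performance in original data
    return sum(gaps) / n_boot

data = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 1.1, 1.3]  # invented outcome values
print(f"estimated optimism: {bootstrap_optimism(data):+.4f}")
```

The optimism estimate is then subtracted from the apparent performance to obtain an internally validated performance measure.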
For external validity a model is developed in a cohort of patients and the validity is determined using another cohort of comparable patients.
The previously described measures, such as variance explained (R2), calibration and discrimination, are used to determine validity.
Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Ann Intern Med. 2015 Jan 6;162(1):W1-W73.
Steyerberg EW. Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating. Springer, 2009.
Royston P, Sauerbrei W. Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. Wiley, 2008.
Harrell FE Jr. Regression Modeling Strategies. With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2nd ed. Springer, 2015.
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-38.
Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006;26(6):565-574.
Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Validity of prognostic models: when is a model clinically useful? Semin Urol Oncol 2002;20:96-107.
Heymans MW, van Buuren S, Knol DL, van Mechelen W, de Vet HC. Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med Res Methodol. 2007;7:33.
APH expert on prognostic and diagnostic models:
Martijn Heymans: email@example.com.
If you would like more information about developing and/or validating prediction rules, you could follow the course “Clinical Prediction models”. For more information see:
Is the use of a prognostic model appropriate given the research question?
Was the selection of the predictors based on a literature search or advice from experts?
Has the (type of) outcome measure been clearly defined?
Have dropouts and missing values been described and have the potential consequences of these dropouts and missing values been discussed in the research report (have missing values been dealt with in a sensible way, e.g. multiple imputation)?
Is the sample size of the study population sufficient?
Has linearity been assessed for all potential predictors?
Has a correlation table been created of all potential predictors?
Has a (manual) backward selection been used for building the model?
Has the model quality been assessed? If possible, have calibration and discrimination been assessed?
Was the prediction model validated?
V2.0: 3 July 2015: Revision format
V1.1: 1 Mar 2011: Several textual changes and additions; replacement of bootstrapping by the Lasso technique; addition of reclassification tables; more emphasis on validating the model; update of references.
V1.0: 1 Jan 2010: English translation.