Regularization approaches in clinical biostatistics: A review of methods and their applications

A range of regularization approaches have been proposed in the data sciences to overcome overfitting, to exploit sparsity or to improve prediction. Using a broad definition of regularization, namely controlling model complexity by adding information in order to solve ill-posed problems or to prevent overfitting, we review a range of approaches within this framework including penalization, early stopping, ensembling and model averaging. Aspects of their practical implementation are discussed, including available R-packages, and examples are provided. To assess the extent to which these approaches are used in medicine, we conducted a review of three general medical journals. It revealed that regularization approaches are rarely applied in practical clinical applications, with the exception of random effects models. Hence, we suggest a more frequent use of regularization approaches in medical research. In situations where other approaches also work well, the only downside of the regularization approaches is the increased complexity of the analyses, which can pose challenges in terms of computational resources and the expertise of the data analyst. In our view, both can and should be overcome by investments in appropriate computing facilities and educational resources.

The concept of regularization has a long history in both mathematics and statistics. Many of the early approaches are well understood by now. In terms of adding information, the most prominent origin is Bayes' idea of adding prior information to likelihood-based inference. 15,16 The Bayesian model formulation with its prior and likelihood components per se allows for regularization. The grade of regularization then depends on the informativeness of the prior. While noninformative prior choices do not lead to regularization, vague, weakly informative and informative prior choices impose different levels of regularization. Tikhonov 17 was among the first to use regularization to solve an ill-posed problem. From a statistical perspective, Hoerl 18 provides a ridge regression formulation of Tikhonov's idea, and Foster 19 interprets this method as a Wiener-Kolmogorov or kriging filter. Tikhonov's regularized solution can also be interpreted as a Bayes solution, see, e.g., Vogel 20 or Wolpert and Ickstadt. 21 Formally, adding information, e.g., in terms of a prior distribution, to a statistical inference problem is best described in a decision theoretic framework; see, e.g., Wald 22 or Lehmann 23 for the foundations of decision theory and Berger 24 for a detailed overview. One of the first regularization ideas to avoid overfitting in a statistical analysis is the stepwise procedure of early stopping. Its origin lies in the theory of sequential testing and goes back to Wald. 25 Nowadays, early stopping is employed in many statistical learning approaches.
Although there is a growing literature on regularization with a wealth of techniques being available to overcome the problems outlined above, it is currently largely unknown to what extent these methods are actually used in clinical medicine and what type of problems are addressed by their use. To shed some light on these questions we systematically reviewed recent volumes of three journals publishing in general medicine, namely the Journal of the American Medical Association (JAMA), the New England Journal of Medicine (NEJM) and the British Medical Journal (BMJ).
The remainder of this paper is organized as follows. In Section 2, an overview of regularization approaches is provided, starting with a brief history of regularization and in particular, covering aspects such as penalization, early stopping, ensembling and model averaging. In Section 3, a review of articles in medical journals that summarizes the current state of applications of regularization in clinical medicine is reported. Some examples are presented in Section 4 before making some closing remarks in Section 5.

Regularization approaches
In this section, we will describe a variety of regularization approaches. In particular, we will formulate specific goals as well as suitable statistical models and procedures to achieve them. The types of regularization approaches comprise penalization and including external and/or historical data (Section 2.1), early stopping (Section 2.2), ensembling (Section 2.3), and further ideas like injecting noise (Section 2.4). Table 1 summarizes all of these regularization types, their goals and the corresponding statistical methods. This section concludes with a comparison of the different approaches (Section 2.5), some practical remarks on regularization (Section 2.6) and corresponding software (Section 2.7).

Penalization
Penalization approaches make the trade-off between model fit and model complexity explicit by combining (a) a (lack of) fit criterion representing the ability of a model to fit the given data with (b) a penalty that measures the model complexity. In the following, we will introduce this idea in more detail for parametric models characterized by a parameter vector θ, but the ideas immediately generalize to semi- and non-parametric models. The observed data will be denoted as y and we will illustrate penalization along regression-type models where y represents a vector of observed response values while θ comprises the regression coefficients.
In a frequentist and loss-based formulation, penalized regularization approaches take the form

ρ(y; θ) + pen(θ), (1)

with some appropriately chosen loss function ρ(·; ·) and non-negative penalty term pen(·). The loss function can be the negative log-likelihood of, e.g., a generalized linear model (GLM), but other loss functions such as the L1 or L2 loss or robust versions such as Huber's loss 26 are also conceivable. The penalty term is chosen to reflect the complexity of the model (as characterized by the parameter vector θ) or to enforce desirable properties of the estimate. Popular examples include:
• The L2 penalty leading to ridge regression, where pen(θ) = λθ⊤θ = λ Σ_j θ_j² with non-negative penalty parameter λ ≥ 0, which enforces shrinkage and adds stability to the estimation. 1
• The L1 penalty for simultaneous shrinkage and selection leading to the least absolute shrinkage and selection operator (LASSO) with pen(θ) = λ Σ_j |θ_j|. 2,[27][28][29] Again the penalty enforces shrinkage and adds stability, but it also enables variable selection.
• The L0 penalty pen(θ) = λ Σ_j 1(θ_j ≠ 0) with the indicator function 1(·), implying a penalty on the number of non-zero coefficients. 29
In all these examples, the penalty includes a penalty parameter λ ≥ 0 that governs the effective trade-off between the loss and the penalty term in (1). If λ → 0, the penalty loses its importance such that the resulting estimate θ̂ minimizes the underlying loss irrespective of the chosen penalty (leading, e.g., to the maximum likelihood estimate in case of a loss function representing the negative log-likelihood), while for λ > 0 the estimate minimizes the loss subject to a constraint imposed by the penalty. In the examples above, all penalties lead to an empty model with all parameters estimated equal to zero when λ approaches infinity. However, the paths along which the coefficients approach zero are very distinct and depend on the underlying geometry of the penalty term. In the case of an orthonormal design, ridge regression induces a proportional shrinkage of all coefficients, and therefore θ̂ = 0 is only achieved as a limiting case. For the LASSO, an orthonormal design, in contrast, leads to a linear decay to zero such that coefficients are set exactly to zero already for finite values of the penalty parameter λ. Figure 1 represents the coefficient paths for the LASSO applied to the prostate data set discussed in Section 2.7. Displayed are the paths for growing log(λ), illustrating the shrinkage towards zero, which results in variable selection. One peculiarity here is that, due to the non-orthogonal design of the covariates, increasing the penalty parameter may initially lead to increasing effect sizes. Still, when the penalty parameter is increased further, all estimates eventually approach the limiting value of zero. Various extensions and alternatives to the three penalties introduced above have been suggested to achieve other forms of penalization or to enforce other forms of the coefficient paths.
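The contrast between ridge shrinkage and the LASSO's exact zeros can be sketched numerically. The following is a minimal Python illustration using scikit-learn on simulated data (the paper's examples use the R packages glmnet and penalized instead); the data and variable names are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5])  # sparse true coefficients
y = X @ beta + rng.standard_normal(100)

for lam in [0.01, 0.5, 10.0]:
    lasso = Lasso(alpha=lam).fit(X, y)
    ridge = Ridge(alpha=lam).fit(X, y)
    # The LASSO sets coefficients exactly to zero once lambda is large
    # enough; ridge only shrinks them towards zero.
    print(lam, np.sum(lasso.coef_ == 0.0), np.sum(ridge.coef_ == 0.0))
```

For a sufficiently large penalty, the LASSO fit is empty, whereas the ridge coefficients, although heavily shrunken, remain non-zero.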
For example, the penalty may leave certain parameter configurations unpenalized, such that even for λ → ∞ there will be free parameters to estimate. As the simplest case, a number of parameters (such as the intercept and parameters relating to covariates that are deemed important a priori) may be left out of the penalization term. More complex models may also comprise multiple penalty parameters, for example when additively combining penalties such as in the elastic net 30 with pen(θ) = λ_1 Σ_j θ_j² + λ_2 Σ_j |θ_j|.
Penalized forms of regularized estimation enjoy a close link to Bayesian inference where, according to Bayes' theorem, the posterior p(θ|y) can be determined as

p(θ|y) ∝ p(y|θ) p(θ), (2)

i.e. the posterior is proportional to the likelihood p(y|θ) times the prior p(θ). Taking the logarithm illuminates that (using ∝ to denote equality up to additive constants)

log(p(θ|y)) ∝ log(p(y|θ)) + log(p(θ)),

such that maximizing the posterior is equivalent to a penalized estimate combining the log-likelihood with a penalty induced by the log-prior. This can be interpreted in two ways: On the one hand, regularized maximum likelihood estimates can also be understood as posterior mode estimates. On the other hand, the prior distribution in Bayesian inference determines a corresponding form of regularization with the log-prior inducing the penalty term. As such, regularization may also be interpreted as a way of including prior or expert knowledge in model estimation.
Concerning the examples introduced above, ridge regression corresponds to an i.i.d. zero-mean Gaussian prior for the regression coefficients, while the LASSO has its equivalent in i.i.d. zero mean Laplace priors. 31 In between frequentist, loss-based regularization and Bayesian regularization are models with random effects where some of the regression coefficients are assigned a random effects distribution that can formally also be interpreted as a Bayesian prior. Similarly, random effects estimates are often interpreted as shrinkage estimates where the random effects distribution enables estimation of a potentially large number of effects, shrunken towards zero. As a consequence, various types of models involving random effects, e.g., hierarchical mixed-effects models or spatial regression models involving spatially correlated stochastic processes, can also be seen as regularized regression where the specific form of regularization depends on the distributional assumption for the random effects. For example, most spatial regression models implement spatial dependence such that spatial effects tend to be similar when the corresponding locations are close to each other (corresponding to Tobler's famous first law of geography stating that 'everything is related to everything else, but near things are more related than distant things' 32 ). In this case, the penalty implied by the distribution of the stochastic process penalizes large differences between spatial effects at close locations.
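The posterior-mode interpretation of ridge regression can be verified directly: with a Gaussian likelihood and i.i.d. zero-mean Gaussian priors on the coefficients, the posterior mode has the closed form (XᵀX + λI)⁻¹Xᵀy, which coincides with the ridge estimate. A minimal Python check on simulated data (scikit-learn stands in here for the R implementations cited elsewhere in the text):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(50)

lam = 2.0
# Posterior mode under y | theta ~ N(X theta, s^2 I) with i.i.d. N(0, t^2)
# priors and lam = s^2 / t^2: solve (X'X + lam I) theta = X'y
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
theta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(theta_map, theta_ridge))
```

The two estimates coincide up to numerical precision, illustrating the prior-as-penalty correspondence discussed above.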
Finally, applying the penalty not directly to the parameter vector but to functions thereof allows one to enforce other types of regularization behaviour. Furthermore, considering the penalty not on the original covariates but on transformations or basis function expansions thereof contributes further flexibility. Some areas that have attracted particular interest in the last decade include:
• Fusion penalties, where the goal is to fuse certain effects together, for example when considering the effects of features that can be ordered in some meaningful way. Effects of ordinal categorical covariates are just one particular example of this. One of the early suggestions is the 'fused LASSO' 33 that penalizes the L1 norm of both the coefficients and their successive differences, but several extensions have been suggested in the literature since then. [34][35][36][37][38][39] A nice and extensive overview on the topic 'Regularized regression for categorical data', for both categorical predictors and responses, can be found in Tutz and Gertheiss. 40 Also in the Bayesian framework, fusion of effects has been investigated by several researchers, see, e.g., Pauger et al. 41 or Malsiner-Walli et al. 42 A nice discussion on Bayesian regularization and effect smoothing for categorical predictors can be found in Wagner and Pauger. 43
• Semiparametric function estimation with smoothness priors, where a flexible effect f(x) of a covariate x of interest shall be estimated. One option is to work with function spaces and associated norms such as the functional L2 penalty pen(f) = λ ∫ (f″(x))² dx, i.e., the integrated squared second derivative, which penalizes the curvature of the function. This is the basis for the famous special case of smoothing splines.
When approximating the effect of interest in terms of a basis expansion such that f(x) = Σ_j β_j B_j(x) with appropriate basis functions B_j(x), penalties can again be constructed for the basis coefficients β_j, with penalized splines 44,45 as one of the most prominent examples. One can then also design penalties that enforce not only smoothness but other properties such as monotonicity, convexity/concavity or constant limiting behaviour. 46,47
• Structured additive regression models that consider regression predictors that are an additive combination of various types of effects f_j(ν_j) based on covariate vectors of different types and associated with quadratic penalties to enforce desirable properties of the individual effects. For example, structured additive regression comprises nonlinear effects of continuous covariates, varying coefficient terms, interaction surfaces, random effects and spatial effects as special cases; see for example Fahrmeir et al. 48 and Fahrmeir and Kneib 49 for an in-depth discussion.
• Single index models that extend generalized linear and additive models by also estimating the link function that maps the regression predictor to the conditional expectation of the response variable in a data-driven way. When a flexible, nonparametric approach is taken for the link function, regularization is also required for this part of the model. 50 With a linear predictor, single index models provide a combination of nonlinear and linear modelling techniques.
We close the discussion of penalization approaches by highlighting that the penalty pen (·) from (1) typically involves one or multiple hyperparameters that determine the impact of the penalty on the fit. Determining the optimal hyperparameter(s) from the data allows for a data-driven amount of regularization and is a central problem for turning penalty-based regularization into practice. Cross-validation (CV) is one prominent example, but for specific classes of models, more specific approaches such as (restricted) maximum likelihood for determining random effects variances or smoothing parameters in structured additive regression are also conceivable, see for example Chs. 7 and 9 in Fahrmeir et al. 51 Figure 2 shows the CV curves for the LASSO applied to the prostate cancer data, see Section 2.7 for details. In a Bayesian approach, suitable hyperpriors can be assigned to the hyperparameters, making them part of the Bayesian inferential scheme. In R, 52 penalization approaches such as LASSO or ridge regression are implemented in the packages glmnet 53 and penalized. 54
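A data-driven choice of the penalty parameter via CV can be sketched as follows. This is a hypothetical Python example using scikit-learn's LassoCV on simulated data, analogous in spirit to cv.glmnet from the R package glmnet mentioned above:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.5, 1.0]   # only the first three covariates matter
y = X @ beta + rng.standard_normal(200)

# 5-fold CV over an automatically generated grid of penalty values
cv_fit = LassoCV(cv=5).fit(X, y)
print(cv_fit.alpha_)                  # data-driven penalty parameter
print(np.flatnonzero(cv_fit.coef_))   # indices of selected covariates
```

With informative covariates and a moderate sample size, the CV-selected model retains the truly relevant covariates while shrinking or removing the noise covariates.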

Early stopping
Many statistical and machine learning approaches build up a (potentially complex) model by iteratively refining a simple model towards the most complex case allowed by the model specification. One way of inducing regularization in such cases is to stop the fitting process before the most complex model is achieved, i.e. to identify the best trade-off between model simplicity (models close to the initial model) and fit to the data (models close to the final, most complex model) by early stopping. In fact, the penalization approaches discussed in the previous section can also be cast into this framework when considering the complete path of coefficients produced by varying the penalty parameter from infinity (simplest model determined by minimizing the penalty) to zero (complex model fit without the penalty). Early stopping then means that we are not using the most complex model without a penalty, but 'stop' at an optimal value for the penalty parameter determined, for example, again via CV techniques.
Boosting approaches are another example where regularization can be achieved by early stopping. The concept of boosting emerged from the field of machine learning 55 and was later adapted to estimate predictors for statistical models. 56,57 The main advantages of statistical boosting algorithms are their flexibility for high-dimensional data and their ability to incorporate variable selection in the fitting process. 3 Furthermore, due to the modular nature of these algorithms, they are relatively easy to extend to new regression settings. 58 In general, boosting algorithms can also be described as gradient descent approaches in function space, 4 where the algorithm iteratively fits simple (e.g. linear) regression models (so-called base-learning procedures), not to the actual observations but to the negative gradient (first derivative) of the loss function evaluated at the previous iteration. In this way, boosting iteratively improves the fit of the model by re-directing attention to those observations that have not yet been explained well and therefore still have large gradients. For a large number of iterations, the model will finally approach the minimizer of the loss function employed in the model specification. As a consequence, the number of boosting iterations is the main (and typically only) tuning parameter, and early stopping yields a regularized estimate determined by the starting values of the algorithm and the base-learners employed to generate the way towards the most complex model. In R, boosting is, e.g., implemented in the packages mboost, 59 gbm 60 and xgboost. 61
Pruning of classification and regression trees (CARTs) also fits into the range of early stopping procedures. Trees iteratively split the available data into subsets that are homogeneous with respect to some impurity measure within the subsets but maximize heterogeneity between the subsets.
Taking this to the extreme, each observation would finally form its own subset, but usual tree implementations require a certain minimal number of observations in the final subsets, a strategy which already provides some (limited) protection against overfitting. However, the resulting trees are usually still too complex such that an additional pruning step is applied to remove superfluous splits. 62 Consider the algorithmic generation of the full tree: This is an iterative procedure starting from a single set comprising all observations. From there, it moves over a simple stump with just two splits and finally becomes a fine-grained tree with many subsets. Early stopping now means that we determine the optimal number of splits based on some measure for generalizability such as CV. Software packages implementing these procedures are discussed in the context of random forests below.
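The idea of stopping the fitting process once held-out performance no longer improves can be sketched as follows. This is a hypothetical Python example using scikit-learn's gradient boosting (rather than the R packages mboost/gbm mentioned above) on simulated data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.standard_normal((400, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.standard_normal(400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# n_iter_no_change activates validation-based early stopping: boosting
# halts once the held-out loss stops improving for 10 iterations.
gb = GradientBoostingRegressor(
    n_estimators=1000, learning_rate=0.1,
    n_iter_no_change=10, validation_fraction=0.2,
    random_state=0,
).fit(X_tr, y_tr)
print(gb.n_estimators_)   # iterations actually used, well below 1000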
As a final example, consider the learning rate in deep neural networks. Deep neural networks are usually trained with stochastic gradient descent optimization that updates the weights in the network. Due to the large number of weights involved in deep networks and the corresponding model flexibility, full optimization of the model would usually lead to a perfect fit, implying over-fitting and low generalizability. As a consequence, a decaying learning rate is usually implemented such that the maximum possible change is reduced as the model fit progresses. In effect, this means that after a certain number of iterations (dictated by the exact implementation of the decay of the learning rate), there will be no change in the weights of the network anymore, which also implements early stopping. Often an exponential decay is employed such that a single scalar value determines the learning rate, and this parameter has to be chosen to achieve an optimal compromise between long training processes (small learning rate) and unstable/over-fitting results (large learning rate). See Goodfellow et al. 63 for details. Neural networks are implemented, among others, in the R-packages neuralnet 64 and deepnet. 65 Moreover, there exist interfaces to Python deep learning implementations such as keras. 66
Figure 2. Cross-validation error curve for the LASSO applied to the prostate cancer data from Section 2.7. Two special values of λ are highlighted through vertical dotted lines: λ_min gives the smallest mean cross-validated error (left), while λ_1se is the value of λ that gives the most regularized model such that the cross-validated error is within one standard error of the minimum (right). The numbers on top of the plot denote the number of non-zero coefficients entering the model at the respective penalty strength.
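The decaying learning-rate schedule described above can be sketched in a few lines; the initial rate lr0 and the decay constant k below are illustrative choices, not values from the text:

```python
import numpy as np

# Exponential learning-rate decay: lr_t = lr0 * exp(-k * t)
lr0, k = 0.1, 0.05
lrs = [lr0 * np.exp(-k * t) for t in range(100)]

# The step size shrinks towards zero, so after enough iterations the
# weight updates become negligible -- an implicit form of early stopping.
print(lrs[0], lrs[-1])
```

The compromise described in the text corresponds to choosing k: small k keeps the rate large for many iterations (long training, risk of over-fitting), while large k freezes the weights quickly.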

Ensembling and model averaging
While the previous two approaches build regularization directly into a specific model, we now turn to regularization by combining a variety of models with the aim of achieving improved model performance. For the sake of illustration, consider a model with a good ability to fit the given data but large variability such that the model does not generalize well to new data. If multiple variants of such a model are available, the variability can be reduced by building an ensemble of the models or by averaging over predictions or other quantities derived from the models.
We illustrate the idea of ensembling with one of the most frequently used ensemble techniques: random forest, originally proposed by Breiman. 6 A random forest is an aggregation of a (typically large) number of classification or regression trees (which we already considered in the previous section). CARTs repeatedly partition the predictor space using binary splits of the covariate domain. The goal of the partitioning process is to find partitions such that the respective response values are very homogeneous within a partition but very heterogeneous between partitions (measured via criteria such as the mean squared error of prediction or the classification rate). CARTs can principally be used both for metric (regression trees) and for nominal/ordinal responses (classification trees). To obtain the prediction for a new observation, all response values within a partition are aggregated either by averaging (in regression trees) or simply by counting and using majority vote (in classification trees).
In the previous section, we already discussed pruning trees to avoid overfitting arising from trees with a large number of splits. However, even such pruned regression trees usually suffer from large estimation uncertainty, i.e. large variance, since small changes in the data can induce large differences in the resulting tree. To overcome this, random forests aggregate a large number B (e.g., B = 5000) of trees grown independently from each other. The combination of many trees has the advantage that the resulting predictions inherit the property of unbiasedness from the single trees while reducing the variance of the predictions. To get a final prediction, predictions of single trees are aggregated (i.e. we form an ensemble of trees), in the case of regression trees simply by averaging over all the predictions from the single trees. In order to achieve the goal that the aggregation of trees is less variable than a single tree, it is important to reduce the dependencies between the trees that are aggregated in a forest. Typically, two randomization steps are applied to achieve this goal. First, the trees are not applied to the original sample but to bootstrap samples or random subsamples of the data. Second, at each node, a (random) subset of the predictor variables is drawn which is used to find the best split. These randomization steps de-correlate the single trees and help to lower the variance of a random forest compared to single trees. The size of the random subset of predictors at each node is a tuning parameter, which can be chosen, e.g., by CV. According to Probst and Boulesteix, 67 the number of trees B does not have to be tuned as long as it is chosen sufficiently large.
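The variance reduction achieved by aggregating de-correlated trees can be illustrated with a small simulation. This is a hedged Python sketch with scikit-learn (the text discusses the R packages ranger and party instead); max_features plays the role of the mtry tuning parameter:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = rng.standard_normal((600, 8))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.standard_normal(600) * 0.5
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single fully grown tree: low bias, but high variance
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# A forest of 500 trees, each split considering a random subset
# of 3 predictors (the mtry analogue)
forest = RandomForestRegressor(n_estimators=500, max_features=3,
                               random_state=0).fit(X_tr, y_tr)

mse_tree = mean_squared_error(y_te, tree.predict(X_te))
mse_forest = mean_squared_error(y_te, forest.predict(X_te))
print(mse_tree, mse_forest)
```

On held-out data the ensemble clearly outperforms the single tree, reflecting the variance reduction discussed above.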
In R, two slightly different variants of regression forests are available. First, the classical random forest algorithm proposed by Breiman 6 is implemented in the R-package ranger. 68 The second variant is implemented in the function cforest from the party package. 69 Here, the single trees are constructed following the principle of conditional inference trees as proposed in Hothorn et al. 70 The main advantage of these conditional inference trees is that they avoid selection bias in cases where the covariates have different scales, e.g., numerical vs. categorical with many categories (see, for example, Strobl et al. 71,72 for details). Conditional forests share the feature of conditional inference trees avoiding biased variable selection. The CV of the tuning parameter mtry can be done using the machine learning framework provided by the R-package mlr3. 73
Model averaging is a second way of combining models with the aim of improving upon the individual model performance. For illustration, let us consider a regression scenario where, from a set of p available covariates, we build all 2^p potential models. Instead of choosing one best model, e.g. based on some model choice criterion, we aim at combining the evidence from all models, for example when forming predicted values. However, naively averaging over all models neglects differences in the ability of these models to explain the data as well as potential dependencies between the covariates. To overcome this, we weight the models according to some model fit criterion, for example the AIC. In this way, models that do not fit the data well obtain small weights and, vice versa, models with a good fit obtain large weights. If there is one single model that fits much better than all the others, the results from model averaging will be close to those from this model. However, in most cases, it is much more likely that there are multiple models with a similar fit that maybe only differ in small details.
In such cases, all these models will contribute to the model-averaged prediction.
Model averaging can not only be used for forming predictions, but also for statistical inference on other properties shared by all models such as the regression coefficients or the variance. Intuitively, averaging over models again reduces the variance or the uncertainty associated with a single model. A thorough introduction to model averaging can be found in Claeskens and Hjort. 7 The R-packages BMA 74 and BMS 75 implement Bayesian model averaging, whereas the model.avg-function of the MuMIn-package 76 performs model averaging based on information criteria.
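The AIC-weighting scheme described above can be sketched in a few lines of Python; the data, the all-subsets candidate set and the Gaussian AIC formula below are purely illustrative (the text points to the R packages BMA, BMS and MuMIn for production use):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n = 150
X = rng.standard_normal((n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)  # X[:, 2] is noise

models, aics, preds = [], [], []
for k in range(4):                       # all 2^3 = 8 subsets of covariates
    for subset in combinations(range(3), k):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        coef = np.linalg.lstsq(Z, y, rcond=None)[0]
        rss = np.sum((y - Z @ coef) ** 2)
        aics.append(n * np.log(rss / n) + 2 * Z.shape[1])  # Gaussian AIC up to a constant
        preds.append(Z @ coef)
        models.append(subset)

delta = np.array(aics) - np.min(aics)
weights = np.exp(-delta / 2.0)
weights /= weights.sum()                 # Akaike weights
y_avg = weights @ np.array(preds)        # model-averaged fitted values
print(models[int(np.argmax(weights))], round(float(weights.max()), 3))
```

Models omitting a truly relevant covariate receive essentially zero weight, while the well-fitting models share the weight and jointly determine the averaged prediction.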
Other approaches that fit into the realm of model ensembling and model averaging include bagging 5 and boosting (as discussed in the previous section), which iteratively combine weak learners. In general, we can distinguish between parallel modes of aggregation (where various models are fitted independently of each other and then combined together) and sequential ensembles (where the models are iteratively improved and then combined). Boosting is an example of the latter while random forests and model averaging are examples of the former.

Other regularization approaches
Of course, the list of regularization approaches discussed so far is by no means exhaustive. Various other takes on regularization exist, focusing on different aspects of the model-fitting process. One example is injecting noise, where some kind of distortion is introduced in the model fitting process. Random forests with their two steps of randomization (bootstrapping observations and considering only random subsets of covariates for splits) can be cast into this class as well and, hence, are a prominent example. In other cases, random probing, i.e. the introduction of simulated additional covariates that are, by definition, independent of the response of interest, can be used to better distinguish informative and noninformative covariates in model selection procedures such as boosting. 77 Out-of-sample evaluation strategies can also be considered as an implicit mode of regularization, where the ability of the model to generalize well beyond the observed data is explicitly determined based on hold-out datasets. A further example is drop-out in neural networks, where some of the neurons in a layer are randomly shut down to avoid overfitting due to co-adaptation. 78 Many of the approaches discussed above are summarized in the R-package mlr3. 73 In the supplementary material to this paper, we provide an example implementation of different regularization approaches using a data set on prostate cancer. 79 Further details are discussed in Section 2.7 below.
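The drop-out idea can be sketched directly on a vector of activations. This is a minimal numpy illustration of one common ("inverted") formulation, in which the surviving units are rescaled so that the expected activation is unchanged; exact details vary between implementations:

```python
import numpy as np

rng = np.random.default_rng(6)
p_drop = 0.5                          # illustrative drop-out probability
activations = rng.standard_normal(1000)

# Randomly zero out a subset of units and rescale the survivors
mask = rng.random(1000) >= p_drop
dropped = activations * mask / (1.0 - p_drop)
print(mask.mean())                    # roughly half of the units survive
```

At training time a fresh mask is drawn in every iteration, which prevents individual neurons from co-adapting to each other.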

Comparison of the different approaches
In this section, we briefly discuss pros and cons of the various regularization types and specific statistical approaches to provide guidance for applied users.
• Flexibility. While many of the statistical regularization approaches are tailored towards specific model classes and ways of modelling the data, penalization approaches (and partly also boosting approaches) are very flexible when it comes to implementing various types of constraints (e.g. complexity, sparsity, smoothness, etc.). This often implies advantages with respect to direct interpretability.
• Interpretability. While some methods basically result in black-box models, penalization and some of the early stopping methods (in particular, gradient- or likelihood-based boosting) can be used to estimate models that resemble unregularized versions when it comes to interpretation. This allows for an easier transition to practice. Nevertheless, some caution is still required since regularization results in biased estimates, and significance statements are often no longer directly available.
• General loss functions. Some of the methods, in particular penalization, boosting, random forests, and deep learning, allow the use of general loss functions and not only least squares or log-likelihoods resulting from probabilistic models. This can increase the robustness of the approaches.
• Over-specified models. Some methods are specifically tailored towards over-specified models, i.e. they can fit models with more covariates than observations or with a large number of parameters resulting, e.g., from basis function expansions or random effects. This includes penalization and Bayesian hierarchical approaches, but also boosting and deep learning.
• Non-additive model specifications. Sacrificing interpretation for the sake of better prediction, some approaches such as deep learning, bagging, and random forests allow for flexible covariate-response relations that circumvent the restrictions of additive model specifications.
• Multi-model inference. The ensembling approaches combine evidence from multiple models rather than focusing on one single 'best' model as most of the other approaches do.

Practical aspects with relevance to regularization
In any of the regularization approaches discussed before, there are a number of practical aspects that deserve particular attention when applying them in statistical analyses:
• Interactions. The inclusion of interaction effects in addition to main effects considerably increases the potential size of a statistical model. From this perspective, regularization is particularly interesting here since a large number of candidate effects (including interactions) can be generated, which are afterwards controlled via a suitable regularization approach. While most regularization approaches require the user to pre-define whether and which interactions shall be included, some automatically include potential interactions. For example, random forests implicitly implement interactions due to the recursive application of covariate splits. Note also that especially for regularization approaches enabling variable selection, e.g. the LASSO or componentwise boosting, particular care needs to be taken to account for certain hierarchical structures, such as 'interaction effects only included if both main effects are included'.
• Transformations of covariates. Transforming covariates, for example to account for specific types of nonlinear effects, is very common in regression analyses. As with interactions, transformations usually have to be pre-defined in regularized approaches. More precisely, regularization usually operates on a pre-specified model class such as linear models, GLMs or generalized additive models (GAMs), and influences the specific version of the model estimated from the data but not the model class itself. Only when models are nested, as for example GLMs within GAMs, may regularization in fact reduce the more complex version to the simpler one.
• Standardization. Standardization is a specific type of transformation that can be useful or even necessary in regularization approaches. For example, penalization approaches such as the LASSO and ridge regression rely critically on all regression coefficients being comparable to each other in absolute terms; standardizing all covariates is necessary to achieve this. Many software packages automatically perform the standardization step internally and report back-transformed estimates afterwards, but it is important to check the exact implementation to ensure that the estimated model is interpreted correctly.
• Covariate scales. Similarly, covariates with different scales (e.g. continuous vs. categorical) can be problematic in regularization. For example, simple versions of random forests can be shown to have an intrinsic preference for selecting categorical covariates with more categories for the next split. While unbiased selection criteria have been suggested, the scale of covariates remains an important property to consider in regularized approaches in general.
• Combining linear and nonlinear effects. In principle, several approaches for variable selection via penalization exist for both purely linear and purely nonlinear effects. However, for the combination of both in so-called semiparametric models, suitable penalization is more intricate, and only a few works in this regard have been developed in the frequentist penalized likelihood framework. 80 More work has been conducted based on the inherent model selection property of boosting [81][82][83] and, in the Bayesian framework, based on variable and effect selection priors. 84,85
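To make the standardization point above concrete, the following pure-Python sketch (with made-up numbers, not the prostate data) fits a one-covariate ridge estimate in closed form and shows that the amount of shrinkage depends on the measurement unit unless the covariate is standardized first:

```python
import statistics

def standardize(xs):
    """Centre to mean 0 and scale to (population) standard deviation 1."""
    m, s = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - m) / s for x in xs]

def ridge_1d(x, y, lam):
    # Closed-form ridge estimate for a single centred covariate:
    # beta = sum(x*y) / (sum(x^2) + lambda)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

y = [1.0, 2.0, 3.0, 4.0]
x_metres = [1.0, 2.0, 3.0, 4.0]
x_millimetres = [v * 1000 for v in x_metres]  # same covariate, different unit

yc = [v - statistics.mean(y) for v in y]
b_m = ridge_1d([v - 2.5 for v in x_metres], yc, lam=1.0)
b_mm = ridge_1d([v - 2500.0 for v in x_millimetres], yc, lam=1.0)
# b_m is shrunk noticeably, b_mm hardly at all: the penalty depends on the unit.

# After standardization, the coefficient (and its shrinkage) is unit-invariant:
b_std_m = ridge_1d(standardize(x_metres), standardize(y), lam=1.0)
b_std_mm = ridge_1d(standardize(x_millimetres), standardize(y), lam=1.0)
```

The same effect is why LASSO and ridge implementations such as glmnet standardize covariates internally by default.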

R implementation of different regularization approaches
An example implementation of different regularization approaches can be found in the supplemental material to this paper.
Here, we used a Kaggle data set on prostate cancer for illustration purposes. 79 The data contain information on the tumours of 100 patients (radius, texture, perimeter, etc.) as well as their diagnosis (binary outcome). We demonstrate the implementation of six different regularization approaches, namely a classification tree (CART), a random forest, subset selection, ridge regression, the LASSO, and the elastic net, and compare them to standard logistic regression by means of the area under the curve (AUC) and the mean classification error (MCE). Hyperparameters are chosen based on 10-fold CV, and results are averaged over ten repetitions. In this example, standard logistic regression is outperformed by the regularization methods: the penalization approaches (LASSO, ridge and elastic net) perform better than logistic regression in terms of the AUC, while ridge regression, the elastic net, CART, and subset selection have a smaller MCE than logistic regression.
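The tuning scheme used above can be sketched in a few lines. The following illustration (hypothetical data and a toy threshold classifier, not the R code from our supplement) shows how a hyperparameter is chosen by minimizing the cross-validated MCE:

```python
import random

def kfold_indices(n, k, seed=1):
    # random partition of 0..n-1 into k folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mce(x, y, fit, k=10):
    """Mean classification error estimated by k-fold cross-validation."""
    folds = kfold_indices(len(y), k)
    errors = []
    for fold in folds:
        train = [i for i in range(len(y)) if i not in fold]
        model = fit([x[i] for i in train], [y[i] for i in train])
        errors.append(sum(model(x[i]) != y[i] for i in fold) / len(fold))
    return sum(errors) / len(errors)

def make_fit(threshold):
    # toy "learner": ignores the training data, classifies by a fixed cut-off
    return lambda xs, ys: (lambda v: int(v > threshold))

x = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.15, 0.75, 0.25, 0.65]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
# choose the "hyperparameter" (here: the cut-off) with the smallest CV error
best = min([0.2, 0.4, 0.6], key=lambda t: cv_mce(x, y, make_fit(t), k=5))
```

In the actual analysis, `fit` would be a genuine learner (e.g. a penalized logistic regression for a grid of penalty values), but the fold construction and the error averaging work exactly as in this sketch.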

The state of regularization applications in medicine
We performed a literature review in three top medical journals to investigate how much regularization is used in published medical research. To this aim, we reviewed all issues published between January and September 2020 in the Journal of the American Medical Association (JAMA) as well as in the New England Journal of Medicine (NEJM) and the British Medical Journal (BMJ). These journals were chosen since they rank among the general medical journals with the highest impact factors in the world. We identified and reviewed all original research articles, resulting in 383 articles; see the PRISMA flow chart in the supplement.

Overview of used regularization methods
After the exclusion of three updates of a living systematic review, 380 articles remained. For each of these, the statistical methods section was screened for applications of regularization. Thereby, we used the definition and the examples of regularization approaches described in Section 2. The exact results of our search were collected in an Excel spreadsheet, which we provide as a supplement. It contains the following general information for each paper: the digital object identifier, the journal in which it was published (JAMA, NEJM or BMJ), the name of the first author, and the title. As statistically relevant variables it also includes the studies' sample sizes, a dichotomous variable describing whether regularization was used and, if so, another one indicating the exact form of regularization as described in Section 2. Moreover, we extracted the type of software used for the analyses. Our main findings regarding the use of regularization are summarized in Table 2.
In the supplement, we provide an additional table summarizing the study characteristics. The two striking key messages are as follows: 1. The majority of studies did not use regularization techniques at all. 2. If regularization was used, it was mainly by means of random effects.
In fact, out of the 128 studies that applied at least one regularization method, 104 used random effects models. Other techniques used cover Bayesian (16 out of 380) and penalization methods (6) as well as smoothing (3), a priori knowledge (3), CV (2) and boosting (2). Random forests and subset selection were each used only once. The numbers also suggest differences in journal openness regarding regularization: while 39% (SE 0.048) of the JAMA articles and 40% (SE 0.045) of the BMJ articles applied regularization, only 26% (SE 0.034) of the NEJM articles did.

Discussion of specific examples
For each regularization method other than random effects modelling listed in Table 2, we briefly discuss its concrete usage in the reviewed papers. In the remainder of this section, we refrain from citing these papers in the references, since they serve as examples rather than literature citations. Complete information can be found in the online supplement.
Table 2. Number of regularization applications found in our literature review. As some regularization methods were occasionally used in combination, multiple enumerations are possible. Row-wise percentages are rounded to integers.
This concludes the overview of the regularization methods applied in real studies published in the 383 reviewed research papers. In our opinion, regularization would have been beneficial for several other studies as well, since such methods increase flexibility and can combine evidence from multiple models or sources, as mentioned in Section 2. We illustrate this in the next section by explaining the concrete approach and the obtained results of three different regularization applications from the literature.

Examples
To demonstrate the versatility of using regularization methods and their potential positive effects, we discuss selected biostatistical examples from the literature.

Variable selection and shrinkage methods for linear regression
An example well-known in the statistical learning community refers to the prostate cancer data set analysed in Chapter 3.4 of Hastie et al. 87 with shrinkage methods for linear regression. The data originate from a study by Stamey et al. 88 who examined the level of prostate-specific antigen (PSA) in 97 prostate cancer patients, before receiving radical prostatectomy. PSA is a well-known biomarker in prostate cancer and the correlation of log PSA (lpsa) to eight clinical variables was analysed, including log cancer volume (lca), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).
A natural approach to model the relationship between lpsa and the eight predictors in a multivariable approach is standard linear regression without any regularization. At first glance, variable selection does not seem urgently required in this case (only p = 8 variables and n = 97 observations). However, Hastie et al. 87 applied various variable selection and shrinkage methods to the regression problem. First, the data were split into a training set and a test set. Then, the models were fitted on the training set, using CV for potential hyperparameter tuning. Finally, test errors were computed on the test set that was not touched during model fitting, and their standard errors were estimated on the left-out sets in the CV.
In best subset selection, all combinations of variables were considered. Ridge regression and LASSO regression were used as shrinkage methods, and principal component regression and partial least squares as methods with decorrelated linear combinations of the original variables. The results are presented in Table 3, cf. Table 3.3 in Hastie et al. 87 The test error was highest for standard linear regression (least squares) and for partial least squares and considerably lower for all other approaches. Further, best subset and LASSO regression selected only two and four variables out of the original eight, respectively. This demonstrates the potential benefit from regularization, here in the form of penalized regression, even in this fairly simple situation, due to the considerable correlation between the original variables.
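The qualitative difference between ridge and the LASSO seen here has a simple explanation in the special case of an orthonormal design, where both estimators act coordinate-wise on the least-squares solution: ridge shrinks every coefficient proportionally, while the LASSO applies soft-thresholding and can set coefficients exactly to zero, which is what drives its variable selection. A minimal sketch with hypothetical least-squares coefficients:

```python
def ridge_shrink(beta_ls, lam):
    # ridge on an orthonormal design: proportional shrinkage of each coefficient
    return [b / (1.0 + lam) for b in beta_ls]

def soft_threshold(b, lam):
    # LASSO on an orthonormal design: S(b, lam) = sign(b) * max(|b| - lam, 0)
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

beta_ls = [2.0, -0.3, 0.05]  # hypothetical least-squares estimates
beta_ridge = ridge_shrink(beta_ls, lam=0.5)             # all shrunk, none zero
beta_lasso = [soft_threshold(b, 0.5) for b in beta_ls]  # small ones become 0
```

This mirrors the table: the LASSO reports a sparse model (here, one non-zero coefficient), while ridge keeps all predictors with attenuated effects.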

Using additional a priori information for evidence synthesis
Borrowing information, for example from an observational study, to support a small-scale randomized trial can be achieved by deriving a shrinkage estimate within a Bayesian random effects meta-analysis. 11 The approach first analyses the observational data using the shrinkage estimator in a hierarchical model and subsequently uses the derived posterior distribution to inform the analysis of the RCT. The efficiency gain of this approach was exemplified by Röver and Friede 11 in the context of Creutzfeldt-Jakob disease, a rare disease with a prevalence of 1 in 1,000,000. An RCT on doxycycline 89 was terminated prematurely with only 12 patients included. However, additional data on 88 patients were available from an observational study. The primary endpoint of all-cause mortality was analysed using Cox proportional hazards regression. Using the external information from the observational study led to Bayesian shrinkage intervals spanning only two-thirds of the confidence interval derived from the RCT data alone, showing a clear gain in efficiency.
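The intuition behind the efficiency gain can be conveyed with a much simpler fixed-effect sketch (the cited approach uses a hierarchical model that additionally accounts for heterogeneity between the sources, so the real gain is smaller and data-dependent): combining two normal estimates by precision weighting always yields a smaller standard error than either input. The numbers below are purely hypothetical:

```python
def pool_normal(est_rct, se_rct, est_ext, se_ext):
    """Precision-weighted combination of two independent normal estimates."""
    w1, w2 = 1.0 / se_rct**2, 1.0 / se_ext**2
    est = (w1 * est_rct + w2 * est_ext) / (w1 + w2)
    se = (w1 + w2) ** -0.5  # pooled standard error
    return est, se

# hypothetical log hazard ratios: a small RCT and a larger external study
est, se = pool_normal(-0.2, 0.6, -0.4, 0.3)
# 'se' is smaller than both 0.6 and 0.3, so the resulting interval is narrower
```

The pooled estimate is pulled towards the more precise source; in the hierarchical version, the borrowing is down-weighted when the sources disagree.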
The approach was also applied in a recent study 90 on children with Alport syndrome. The Alport syndrome is a rare kidney disease which typically leads to end-stage renal disease in early life and requires renal replacement. 91 Incorporating the results of real-world evidence from a prospective US cohort into the randomized data by Bayesian evidence synthesis resulted in a more precise estimate of the treatment effect indicated by a much shorter credible interval.

Boosting capture-recapture methods
Systematic reviews of clinical trials should be based on all relevant trials on the particular topic. To evaluate the comprehensiveness of systematic literature reviews, capture-recapture analyses have been proposed. These require the selection of an appropriate model. To this end, Rücker et al. 92 proposed to combine capture-recapture analysis with componentwise boosting. The boosting procedure makes it possible to specify mandatory variables that are always included in the model as well as optional variables, the latter being included only if relevant. This approach turned out to be robust against overfitting, and an appropriate model for statistical inference was developed automatically. In particular, Rücker et al. 92
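For intuition only, the simplest two-source capture-recapture estimator (Chapman's bias-corrected version of the Lincoln-Petersen estimate; the cited work fits richer, boosted regression models) can be sketched as follows, with hypothetical search counts:

```python
def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected Lincoln-Petersen estimate of the total
    number of items, given two searches finding n1 and n2 items of which
    m were found by both."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# hypothetical counts: two literature searches find 40 and 35 trials,
# 28 of which appear in both search results
n_hat = chapman_estimate(40, 35, 28)
# n_hat exceeds the 47 distinct trials actually found (40 + 35 - 28),
# quantifying how many relevant trials the searches may have missed
```

The gap between the estimated total and the number of trials actually retrieved is what serves as the measure of (in)completeness of the review.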

Discussion
A range of regularization approaches has been proposed to overcome problems such as overfitting, to deal with data sparsity or to improve the prediction and generalizability of results. Using a broad definition of regularization, namely the process of adding information in order to control model complexity, we reviewed a range of approaches within this framework including penalization, early stopping, ensembling and model averaging. We discussed aspects of their practical implementation including available R-packages. In this manuscript, we focused on R as a programming language and also demonstrated the use of regularization methods in an R implementation. However, regularization approaches are also implemented in other statistical software. For example, penalization approaches such as the LASSO or ridge regression are implemented in SAS in the GLMSELECT and REG procedures, while more complex penalization methods can be found in PROC TPSPLINE. A random forest implementation is given by PROC HPFOREST, for example. Bayesian methods can be incorporated in various ways: PROC FMM, PROC GENMOD, PROC LIFEREG and PROC PHREG allow for Bayesian analyses through a BAYES statement, while PROC BGLIMM and PROC MCMC are specifically tailored to Bayesian estimation. 93 Similarly, xtreg, lasso and boost provide implementations of random effects models, LASSO penalization and boosting in Stata, respectively.
Examples were provided to showcase the practical use of regularization, encouraging a more widespread use of these techniques in medicine. This is against the background of our review of recent issues of three general medical journals, which revealed that regularization approaches could be used more. The only exception is random effects models, which featured relatively regularly; other regularization approaches were rarely applied. In our view, there is room for improvement in the use of regularization methods in clinical medicine.
Their application can be considered on a regular basis, since in many settings they improve analyses and their interpretation. In situations where other approaches also work well, the only downside of the regularization approaches is increased complexity in the conduct of the analyses, which can pose challenges in terms of computational resources and expertise on the side of the data analyst. In our view, both can and should be overcome by investments in appropriate computing facilities and educational resources. Of course, the application of regularization approaches also entails some limitations:
• In general, their application is somewhat more challenging and requires a better understanding of the corresponding methodology. However, as we have shown in this review, the methodological gap between classical and regularized approaches is not always that big in the end.
• Some of the regularization approaches presented in this manuscript can be computationally intense, e.g. neural networks or random forests, especially when cross-validated tuning of hyperparameters is required.
• Some approaches are black-box methods (e.g. tree-based ensembles or deep neural networks) and will therefore be hard to interpret, limiting the possibility to communicate them to practitioners.
Finally, the selection of any statistical approach should of course be guided by the actual research question at hand and the corresponding goals (e.g. causal vs. exploratory vs. predictive analyses), and regularization approaches will not always be the best choice. For example, Riley et al. 94 demonstrated that penalization and shrinkage methods produced unreliable clinical prediction models, especially when the sample size was small. The review of NEJM, JAMA and BMJ shed some light on the current state of the use of regularization methods in medicine. Although the review clearly shows that regularization methods are underused in clinical applications, it is limited in scope since only three journals were searched. Moreover, the choice of journals to include in such a review remains somewhat arbitrary; for instance, one reviewer suggested including The Lancet as an additional high-impact general medicine journal. Furthermore, we focused on a relatively recent time period only and did not investigate any trends over time.
Before us, others have highlighted that existing methods are underused in clinical applications leading to suboptimal designs and analyses, sometimes even resulting in misleading interpretations. As an example, we refer to the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative (https://stratos-initiative.org/). 95 Several topic groups (TGs) of the initiative are also concerned with regularization approaches, in particular, TG 2 'Selection of variables and functional forms in multivariable analysis' 96 and TG 9 'High-dimensional data'.

Supplemental material
Supplemental material for this article is available online.