Using Predictive Fit to Inform Effect Metric Choice in Meta-Analysis

James E. Pustejovsky

2025-06-11

Effect Metric Menagerie

Group comparison of binary outcomes

  • Risk differences \(\pi_1 - \pi_0\)
  • Risk ratios (log-transformed) \(\log\left(\frac{\pi_1}{\pi_0}\right)\)
  • Odds ratios (log-transformed) \(\log\left(\frac{\pi_1 / (1 - \pi_1)}{\pi_0 / (1 - \pi_0)}\right)\)
  • Bivariate models for \(\pi_0, \pi_1\)
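As a minimal sketch of how the first three metrics are computed from the same pair of proportions, the function below takes event counts and sample sizes for the two groups. The function name, argument names, and example counts are hypothetical, and no continuity correction is applied, so it assumes no zero cells.

```python
import math

def binary_effect_sizes(x1, n1, x0, n0):
    """Risk difference, log risk ratio, and log odds ratio from 2x2 counts.

    x1, n1: events and sample size in group 1; x0, n0: same for group 0.
    Assumes 0 < x < n in both groups (no continuity correction).
    """
    p1, p0 = x1 / n1, x0 / n0
    return {
        "rd": p1 - p0,
        "log_rr": math.log(p1 / p0),
        "log_or": math.log((p1 / (1 - p1)) / (p0 / (1 - p0))),
    }

# Hypothetical counts: 30/100 events in group 1, 20/100 in group 0.
es = binary_effect_sizes(x1=30, n1=100, x0=20, n0=100)
```

All three values are deterministic functions of \(\pi_0\) and \(\pi_1\), so they summarize the same data on different scales.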

Group comparison of continuous outcomes

  • Raw mean differences \(\mu_1 - \mu_0\)
  • Standardized mean differences \(\delta = \frac{\mu_1 - \mu_0}{\sigma}\)
  • Percentage of maximum possible (POMP) differences
  • Response ratios (log-transformed) \(\lambda = \log\left(\frac{\mu_1}{\mu_0}\right)\)
  • Probability of superiority
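A rough sketch of three of these metrics computed from group means, standard deviations, and sample sizes. The names and numbers are hypothetical; the standardized mean difference uses a pooled standard deviation without a small-sample correction, and the response ratio assumes both means are positive.

```python
import math

def continuous_effect_sizes(m1, sd1, n1, m0, sd0, n0):
    """Raw mean difference, standardized mean difference, and log response ratio."""
    # Pooled standard deviation across the two groups.
    sd_pooled = math.sqrt(
        ((n1 - 1) * sd1**2 + (n0 - 1) * sd0**2) / (n1 + n0 - 2)
    )
    return {
        "raw_md": m1 - m0,
        "smd": (m1 - m0) / sd_pooled,      # no small-sample correction
        "log_rr": math.log(m1 / m0),       # requires m1 > 0 and m0 > 0
    }

# Hypothetical summary statistics.
es_cont = continuous_effect_sizes(m1=12.0, sd1=4.0, n1=50, m0=10.0, sd0=4.0, n0=50)
```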

Single-group prevalence

  • Raw proportions \(\pi\)
  • Arcsine-transformation \(a = \text{asin}\left(\sqrt{\pi}\right)\)
  • Freeman-Tukey double-arcsine transformation

Bivariate associations / psychometric

  • Pearson’s correlation \(\rho\)
  • Fisher’s \(z\)-transformation \(\zeta = \text{atanh}(\rho)\)
  • Reliability-corrected correlation \(\frac{\rho_{xy}}{\sqrt{\rho_{xx}\rho_{yy}}}\)
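A short sketch of the two transformations of a correlation listed above. The function names and input values are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's z-transformation: atanh(r)."""
    return math.atanh(r)

def reliability_corrected(r_xy, r_xx, r_yy):
    """Correlation divided by the square root of the product of reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: observed correlation 0.36, reliabilities 0.80 and 0.90.
z = fisher_z(0.36)
rho_corrected = reliability_corrected(0.36, 0.80, 0.90)
```

The corrected correlation depends on three parameters rather than one, which matters later when a model has to predict the summary data.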

Metric choice methodology

  • Choice between standardized mean difference and response ratio metrics

But is there any way to evaluate effect metric choice as a model assumption?

Use predictive fit to inform metric choice

  • Evaluate effect metrics by performance in predicting summary data for a new study.

    • Data vector \(\mathbf{d}_i\) consisting of summary statistics used to compute effect size estimates.
  • Use leave-one-out log predictive density (LOO-LPD) to measure predictive performance. \[ \text{LOO-LPD} = \frac{1}{k} \sum_{i=1}^{k} \log p\left(\mathbf{d}_i \left| \hat\mu_{(-i)}, \hat\tau_{(-i)}, \mathbf{X}_i, N_i\right.\right) \]

Leave-one-out predictive density

  • Without using observation \(i\), make a prediction for observation \(i\) by specifying a density \(p_{(-i)}(x_i)\) (must integrate to one).

  • Score higher by putting more density on the realized outcome.

\[ \text{Predictive score:} \ \log p_{(-i)}(x_i) \]

  • LOO-LPD is the sum of scores across \(N\) observations: \(\displaystyle{\text{LOO-LPD} = \sum_{i=1}^N \log p_{(-i)}(x_i)}\)

    • Higher LOO-LPD indicates better fit.

    • Agnostic to the metric of the prediction.
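The idea can be sketched with a toy example, under assumptions that are not from the slides: simulated log-normal data, and two plug-in normal models, one fit on the raw scale and one fit on the log scale. Both models are scored as densities for the same observed \(x_i\), so the log-scale model needs a Jacobian term \(-\log x_i\).

```python
import math
import random
from statistics import mean, stdev

def normal_logpdf(x, mu, sd):
    """Log density of N(mu, sd^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mu) / sd) ** 2

def loo_scores_normal(x):
    """Leave-one-out log predictive scores under a plug-in normal model."""
    scores = []
    for i, xi in enumerate(x):
        rest = x[:i] + x[i + 1:]                 # fit without observation i
        scores.append(normal_logpdf(xi, mean(rest), stdev(rest)))
    return scores

random.seed(1)
x = [random.lognormvariate(0.0, 1.0) for _ in range(200)]   # simulated data

# Model A: normal model on the raw scale.
lpd_raw = sum(loo_scores_normal(x))

# Model B: normal model on the log scale. Subtracting log(x_i) converts each
# score into a log density for x_i itself, so the two totals are comparable.
log_x = [math.log(v) for v in x]
lpd_log = sum(s - lx for s, lx in zip(loo_scores_normal(log_x), log_x))
```

Because the data were simulated as log-normal, the log-scale model should earn the higher total here.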

Predicting summary data

  • Many effect size metrics are functions of multiple parameters.

    • Group comparisons of binary outcomes are functions of \(\pi_0,\pi_1\)

    • Group comparisons of continuous outcomes are functions of \(\mu_0,\mu_1,\sigma\)

    • Reliability-corrected correlations are functions of \(\rho_{xy},\rho_{xx},\rho_{yy}\).

  • Problem:

    • Meta-analysis models for such metrics are not sufficiently generative for predicting the data vector \(\mathbf{d}_i\).
  • Solution:

    • Augment the meta-analysis model with an auxiliary model for some of the parameters.

Effectiveness of nicotine replacement therapy

  • Cochrane Systematic Review of effects of nicotine replacement therapy vs. control on smoking cessation, defined as abstinence at 6+ month follow-up (Hartmann-Boyce et al. 2018).

  • Sample sizes ranging from \(N_i\) = 36 to 5290 (median = 240.5, IQR = 153.5 - 428.5).

Random effects meta-analysis

  • Different ES metrics suggest very different implications and different degrees of heterogeneity

| Metric | Est | 95% CI | 80% PI | \(I^2\) (%) |
| --- | --- | --- | --- | --- |
| Odds ratio | 1.75 | 1.63-1.88 | 1.29-2.38 | 39.06 |
| Risk ratio | 1.57 | 1.48-1.66 | 1.23-1.99 | 36.88 |
| Complementary risk ratio | 1.07 | 1.06-1.08 | 1.02-1.13 | 65.51 |
| Risk difference | 0.06 | 0.05-0.07 | 0.02-0.11 | 63.50 |

Auxiliary modeling

  • Possible auxiliary models for \(\hat\pi_{0i}\) or \(\hat\pi_{1i}\):

    • Random effects meta-analysis/meta-regression of log-odds

    • Generalized linear mixed model (Normal-binomial)

    • Beta-binomial regression

Predictive model

\[ \begin{aligned} &\text{Auxiliary model:} \quad & \text{logit}\left(\pi_{0i}\right) &\sim N\left(\mu_0, \ \sigma^2\right) \\ &\text{RE meta-analysis model:} \quad & \theta_i &\sim N\left(\mu, \ \tau^2\right) \\ &\text{Observation model:} \quad & N_{0i} \hat\pi_{0i} &\sim Binom\left( N_{0i}, \ \pi_{0i} \right) \\ & & N_{1i} \hat\pi_{1i} &\sim Binom\left( N_{1i}, \ g(\pi_{0i}, \theta_i) \right) \end{aligned} \]
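A minimal sketch of how a predictive density for one study's pair of event counts could be evaluated under this kind of model, taking the treatment-arm count to be binomial with the treatment-arm sample size. Everything specific here is an assumption for illustration: the odds-ratio link for \(g\), the plug-in parameter values, the function names, and the crude grid integration over the two random effects (a real analysis would likely use better quadrature).

```python
import math

def expit(a):
    return 1.0 / (1.0 + math.exp(-a))

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)

def normal_grid(n_nodes=61, width=6.0):
    """Equally spaced standard-normal nodes; weights normalized to sum to one."""
    step = 2 * width / (n_nodes - 1)
    nodes = [-width + j * step for j in range(n_nodes)]
    raw = [math.exp(-0.5 * z * z) for z in nodes]
    total = sum(raw)
    return nodes, [w / total for w in raw]

def g_odds_ratio(pi0, theta):
    """Treatment-arm risk implied by control risk pi0 and log odds ratio theta."""
    return expit(math.log(pi0 / (1.0 - pi0)) + theta)

def log_pred_density(x0, n0, x1, n1, mu0, sigma, mu, tau, g=g_odds_ratio):
    """Log probability of counts (x0, x1), integrating over pi0 and theta."""
    nodes, weights = normal_grid()
    total = 0.0
    for zu, wu in zip(nodes, weights):
        pi0 = expit(mu0 + sigma * zu)            # auxiliary model draw
        p0 = binom_pmf(x0, n0, pi0)
        for zv, wv in zip(nodes, weights):
            pi1 = g(pi0, mu + tau * zv)          # RE meta-analysis model draw
            total += wu * wv * p0 * binom_pmf(x1, n1, pi1)
    return math.log(total)

# Hypothetical plug-in estimates and a hypothetical held-out study.
params = dict(mu0=-2.0, sigma=0.7, mu=math.log(1.75), tau=0.3)
score = log_pred_density(x0=12, n0=120, x1=20, n1=120, **params)
```

Swapping in a different `g` (for example a risk ratio or risk difference link, which would also need its output kept inside the unit interval) changes the metric while leaving the scored data the same.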

Metric comparison

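Normal-Binomial

| Metric | LPD | SE | Diff. vs. OR | SE of diff. |
| --- | --- | --- | --- | --- |
| Odds ratio | -995.8 | 21.1 | | |
| Risk ratio | -1002.0 | 22.0 | -6.2 | 2.7 |
| Complementary risk ratio | -1015.8 | 22.9 | -19.9 | 9.7 |
| Risk difference | -1384.5 | 30.3 | -388.7 | 18.3 |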
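The slides do not say how the standard error of a difference in LPD is computed. One common convention in leave-one-out model comparison, sketched here as an assumption, works from the paired study-level score differences. The function name and the scores below are made up for illustration.

```python
import math
from statistics import stdev

def lpd_difference(scores_a, scores_b):
    """Difference in total LPD (a minus b) and a paired standard error.

    scores_a, scores_b: per-study log predictive scores for the same studies.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    # SE of a sum of k paired differences: sqrt(k) * SD of the differences.
    return sum(diffs), math.sqrt(k) * stdev(diffs)

# Made-up pointwise scores for two metrics on four studies.
scores_rr = [-7.1, -6.8, -8.0, -7.5]
scores_or = [-7.0, -6.5, -7.4, -7.3]
diff, se_diff = lpd_difference(scores_rr, scores_or)
```

Pairing by study would be consistent with the SE of a difference being much smaller than the SE of either total, since study-level scores under two metrics tend to move together.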

Metric comparison

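| Model | Metric | Normal-Binomial LPD | SE | Diff. vs. OR | SE of diff. | Beta-Binomial LPD | SE | Diff. vs. OR | SE of diff. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RE Meta | Odds ratio | -995.8 | 21.1 | | | -992.8 | 20.5 | | |
| RE Meta | Risk ratio | -1002.0 | 22.0 | -6.2 | 2.7 | -998.4 | 21.4 | -5.6 | 2.6 |
| RE Meta | Complementary risk ratio | -1015.8 | 22.9 | -19.9 | 9.7 | -1012.3 | 22.2 | -19.5 | 10.3 |
| RE Meta | Risk difference | -1384.5 | 30.3 | -388.7 | 18.3 | -1380.7 | 29.6 | -387.9 | 18.4 |
| Bivariate Normal | Odds ratio | -993.7 | 20.9 | 2.1 | 3.1 | | | | |
| Bivariate GLMM | Odds ratio | -994.1 | 20.6 | 1.8 | 2.4 | | | | |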

Discussion

  • Treat effect metric choice as a modeling assumption.

  • Predictive fit assessment may be relevant and useful for meta-analysis.

    • Will often require use of auxiliary models.
  • Advantages of log predictive density scoring

    • Allows comparison across effect metrics and different forms of models.

    • Auxiliary model building exercise can clarify scientific context.

  • Disadvantages and open questions

    • Which parameters should be part of the auxiliary model?

    • Other predictive scoring rules that may be relevant?

    • Is the joint distribution of \(\mathbf{d}_i\) the right focus?

You’re gonna need a bigger model.

Predictive discrepancies

Reading comprehension and content knowledge

  • Kim and Cao (2025) reported a systematic review and meta-analysis of studies on the association between reading comprehension and content knowledge.

  • 380 correlation estimates, samples ranging from \(N_i\) = 23 to 3900 (median = 151, IQR = 76-335).

Bivariate associations

  • The data: Pearson correlation between two variables of interest from a sample of \(N_i\) observations, \(r_i\).

\(\rho\) metric

  • Effect size estimate \(r_i\), standard error \(\displaystyle{se_i = \frac{1 - r_i^2}{\sqrt{N_i}}}\)

  • Predictive model: \[ \begin{aligned} r_i &\dot{\sim} \ N\left(\rho_i, \ \frac{(1 - \rho_i^2)^2}{N_i}\right) \\ \rho_i &\sim \ N_{trunc}\left(\mu_\rho, \ \tau_\rho^2\right) \end{aligned} \]

\(\zeta = \text{atanh}(\rho)\) metric

  • Effect size estimate \(z_i = \text{atanh}(r_i)\), standard error \(\displaystyle{se_i = \frac{1}{\sqrt{N_i - 3}}}\)

  • Predictive model: \[ \begin{aligned} z_i &\dot{\sim} \ N\left(\zeta_i, \ \frac{1}{N_i - 3}\right) \\ \zeta_i &\sim \ N\left(\mu_\zeta, \ \tau_\zeta^2\right) \end{aligned} \]

  • Log predictive density: \[ \begin{aligned} \log d_r\left(r_i \left| \hat\mu_{\zeta (-i)}, \hat\tau_{\zeta (-i)}, N_i \right.\right) &= \log d_z\left(z_i \left| \hat\mu_{\zeta (-i)}, \hat\tau_{\zeta (-i)}, N_i \right.\right) - \log\left(1 - r_i^2\right) \end{aligned} \]
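The change of variables can be sketched as follows. Under the \(\zeta\)-metric model, \(z_i\) is marginally normal with mean \(\mu_\zeta\) and variance \(\tau_\zeta^2 + 1/(N_i - 3)\), and the term \(-\log(1 - r_i^2)\) is the log Jacobian of \(z = \text{atanh}(r)\). The function names and parameter values are hypothetical.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_density_r_via_z(r, n, mu_zeta, tau_zeta):
    """Log predictive density of r implied by the model on the z scale."""
    z = math.atanh(r)
    marginal_var = tau_zeta**2 + 1.0 / (n - 3)    # heterogeneity + sampling
    # d z / d r = 1 / (1 - r^2), so subtract log(1 - r^2) on the log scale.
    return normal_logpdf(z, mu_zeta, marginal_var) - math.log(1.0 - r * r)

# Hypothetical plug-in values: mu_zeta = 0.38, tau_zeta = 0.2, N_i = 100.
score_r = log_density_r_via_z(r=0.45, n=100, mu_zeta=0.38, tau_zeta=0.2)
```

With the Jacobian included, the result is a proper density over \(r \in (-1, 1)\), so it can be compared directly with the score from the \(\rho\)-metric model.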

Metric comparison

| Metric | Est. | 95% CI | 80% PI | LPD | SE |
| --- | --- | --- | --- | --- | --- |
| r | 0.36 | 0.34-0.38 | 0.13-0.59 | 63.36 | 11.80 |
| z | 0.36 | 0.34-0.38 | 0.11-0.57 | 67.52 | 11.26 |
| Difference | | | | -4.15 | 1.94 |

Discrepancies

Class attendance and college grades

  • Credé and colleagues (Credé et al. 2010) reported a systematic review and meta-analysis of studies on the association between class attendance and grades / GPA in college.

  • 99 correlation estimates, samples ranging from \(N_i\) = 23 to 3900 (median = 151, IQR = 76-335).

Metric comparison

| Metric | Est. | 95% CI | 80% PI | LPD | SE |
| --- | --- | --- | --- | --- | --- |
| r | 0.40 | 0.37-0.44 | 0.20-0.60 | 33.24 | 8.27 |
| z | 0.41 | 0.37-0.45 | 0.16-0.61 | 21.39 | 11.95 |
| Difference | | | | 11.85 | 4.72 |

Discrepancies

Reliability generalization of MIBS

  • Demir and colleagues (Demir et al. 2024) gathered 33 estimates of internal consistency (Cronbach \(\alpha\)) of the Mother-to-Infant Bonding Scale.

  • Sample sizes ranging from \(N_i\) = 13 to 2251 (median = 177, IQR = 98-260).

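| Metric | Est. | 95% CI | 80% PI | \(I^2\) (%) | LPD | SE |
| --- | --- | --- | --- | --- | --- | --- |
| Raw alpha | 0.72 | 0.68-0.76 | 0.58-0.87 | 97.01 | 18.79 | 5.12 |
| Bonett trans. | 0.74 | 0.69-0.78 | 0.51-0.86 | 96.34 | 17.48 | 3.86 |
| Hakstian-Whalen trans. | 0.73 | 0.68-0.77 | 0.53-0.86 | 96.37 | 19.27 | 3.69 |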

Incidence of olfactory loss in COVID-19 patients

  • Hannum and colleagues (Hannum et al. 2020) compiled data on rates of olfactory loss across 35 studies of COVID-19 patients.

  • Sample sizes ranging from \(N_i\) = 15 to 7178 (median = 95, IQR = 56.5 - 267.5).

  • Many different transformations of \(p_i\) are used as effect size measures (identity, logit, probit, arcsin-square-root, Freeman-Tukey).

  • Could use conventional random effects model or generalized linear mixed model.

  • Which predictive model to use?

\[ \begin{aligned} g(p_i) &\dot{\sim} \ N\left(g(\pi_i), \ \frac{h(\pi_i)}{N_i}\right) \qquad & N_i p_i &\sim \ Binom\left(N_i, \ \pi_i\right)\\ g(\pi_i) &\sim \ N\left(\mu_g, \ \tau_g^2\right) \qquad & g(\pi_i) &\sim \ N\left(\mu_g, \ \tau_g^2\right) \end{aligned} \]

Incidence of olfactory loss in COVID-19 patients

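| Model | Metric | Est. | 95% CI | 80% PI | Normal LPD | SE | Binomial LPD | SE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RE | logit | 0.48 | 0.38-0.58 | 0.17-0.81 | -178.57 | 12.43 | -178.71 | 12.53 |
| RE | probit | 0.49 | 0.39-0.58 | 0.17-0.81 | -181.35 | 14.09 | -181.35 | 14.16 |
| RE | arcsin | 0.49 | 0.40-0.58 | 0.17-0.81 | -173.53 | 11.18 | -173.62 | 11.18 |
| GLMM | logit | 0.48 | 0.38-0.59 | 0.16-0.82 | | | -189.90 | 19.26 |
| GLMM | probit | 0.49 | 0.39-0.58 | 0.17-0.82 | | | -183.24 | 15.21 |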

References

Ades, A. E., Guobing Lu, Sofia Dias, Evan Mayo-Wilson, and Daphne Kounali. 2015. “Simultaneous Synthesis of Treatment Effects and Mapping to a Common Scale: An Alternative to Standardisation.” Research Synthesis Methods 6 (1): 96–107. https://doi.org/10.1002/jrsm.1130.
Arends, Lidia R., Arno W. Hoes, Jacobus Lubsen, Diederik E. Grobbee, and Theo Stijnen. 2000. “Baseline Risk as Predictor of Treatment Benefit: Three Clinical Meta-Re-Analyses.” Statistics in Medicine 19 (24): 3497–518. https://doi.org/10.1002/1097-0258(20001230)19:24<3497::AID-SIM830>3.0.CO;2-H.
Credé, Marcus, Sylvia G. Roch, and Urszula M. Kieszczynka. 2010. “Class Attendance in College: A Meta-Analytic Review of the Relationship of Class Attendance with Grades and Student Characteristics.” Review of Educational Research 80 (2): 272–95. https://doi.org/10.3102/0034654310362998.
Cummings, Peter. 2011. “Arguments for and Against Standardized Mean Differences (Effect Sizes).” Archives of Pediatrics & Adolescent Medicine 165 (7): 592–96. https://doi.org/10.1001/archpediatrics.2011.97.
Demir, Emin, Sena Öz, Neriman Aral, and Figen Gürsoy. 2024. “A Reliability Generalization Meta-Analysis of the Mother-To-Infant Bonding Scale.” Psychological Reports 127 (1): 447–64. https://doi.org/10.1177/00332941221114413.
Downing, Beatrice C., Nicky J. Welton, Hugo Pedder, Ifigeneia Mavranezouli, Odette Megnin-Viggars, and A. E. Ades. 2025. “Synthesis of Depression Outcomes Reported on Different Scales: A Comparison of Methods for Modelling Mean Differences.” Research Synthesis Methods 16 (3): 460–78. https://doi.org/10.1017/rsm.2025.7.
Engels, Eric A., Christopher H. Schmid, Norma Terrin, Ingram Olkin, and Joseph Lau. 2000. “Heterogeneity and Statistical Significance in Meta-Analysis: An Empirical Study of 125 Meta-Analyses.” Statistics in Medicine 19 (13): 1707–28. https://doi.org/10.1002/1097-0258(20000715)19:13<1707::AID-SIM491>3.0.CO;2-P.
Fitzgerald, Kaitlyn G., and Elizabeth Tipton. 2025. “Using Extant Data to Improve Estimation of the Standardized Mean Difference.” Journal of Educational and Behavioral Statistics 50 (1): 128–48. https://doi.org/10.3102/10769986241238478.
Friedrich, Jan O., Neill K. J. Adhikari, and Joseph Beyene. 2011. “Ratio of Means for Analyzing Continuous Outcomes in Meta-Analysis Performed as Well as Mean Difference Methods.” Journal of Clinical Epidemiology 64 (5): 556–64. https://doi.org/10.1016/j.jclinepi.2010.09.016.
Guolo, Annamaria. 2022. “Measurement Errors in Control Risk Regression: A Comparison of Correction Techniques.” Statistics in Medicine 41 (1): 163–79. https://doi.org/10.1002/sim.9228.
Hannum, Mackenzie E, Vicente A Ramirez, Sarah J Lipson, et al. 2020. “Objective Sensory Testing Methods Reveal a Higher Prevalence of Olfactory Loss in COVID-19–Positive Patients Compared to Subjective Methods: A Systematic Review and Meta-Analysis.” Chemical Senses, September 29, bjaa064. https://doi.org/10.1093/chemse/bjaa064.
Hartmann-Boyce, Jamie, Samantha C Chepkin, Weiyu Ye, Chris Bullen, and Tim Lancaster. 2018. “Nicotine Replacement Therapy Versus Control for Smoking Cessation.” Cochrane Database of Systematic Reviews 2019 (1). https://doi.org/10.1002/14651858.CD000146.pub5.
Hopkins, Will G., and David S. Rowlands. 2024. “Standardization and Other Approaches to Meta-Analyze Differences in Means.” Statistics in Medicine 43 (16): 3092–108. https://doi.org/10.1002/sim.10114.
Kim, Young-Suk Grace, and Yucheng Cao. 2025. “Content Knowledge and Comprehension: A Meta-Analytic Review of Correlational and Causal Associations.” Psychological Bulletin 151 (10): 1219–44. https://doi.org/10.1037/bul0000502.
Lu, Guobing, Daphne Kounali, and A. E. Ades. 2014. “Simultaneous Multioutcome Synthesis and Mapping of Treatment Effects to a Common Scale.” Value in Health: The Journal of the International Society for Pharmacoeconomics and Outcomes Research 17 (2): 280–87. https://doi.org/10.1016/j.jval.2013.12.006.
Panagiotou, Orestis A., and Thomas A. Trikalinos. 2015. “Commentary: On Effect Measures, Heterogeneity, and the Laws of Nature.” Epidemiology 26 (5): 710. https://doi.org/10.1097/EDE.0000000000000359.
Poole, Charlie, Ian Shrier, and Tyler J. VanderWeele. 2015. “Is the Risk Difference Really a More Heterogeneous Measure?” Epidemiology 26 (5): 714. https://doi.org/10.1097/EDE.0000000000000354.
Schmid, Christopher H., Joseph Lau, Martin W. McIntosh, and Joseph C. Cappelleri. 1998. “An Empirical Study of the Effect of the Control Rate as a Predictor of Treatment Efficacy in Meta-Analysis of Clinical Trials.” Statistics in Medicine 17 (17): 1923–42. https://doi.org/10.1002/(SICI)1097-0258(19980915)17:17<1923::AID-SIM874>3.0.CO;2-6.
Stijnen, Theo, Taye H. Hamza, and Pinar Özdemir. 2010. “Random Effects Meta-Analysis of Event Outcome in the Framework of the Generalized Linear Mixed Model with Applications in Sparse Data.” Statistics in Medicine 29 (29): 3046–67. https://doi.org/10.1002/sim.4040.
Van Houwelingen, Hans C., Lidia R. Arends, and Theo Stijnen. 2002. “Advanced Methods in Meta‐analysis: Multivariate Approach and Meta‐regression.” Statistics in Medicine 21 (4): 589–624. https://doi.org/10.1002/sim.1040.
Yang, Yefeng, Coralie Williams, Alistair M. Senior, et al. 2024. “Bivariate Multilevel Meta-Analysis of Log Response Ratio and Standardized Mean Difference for Robust and Reproducible Environmental and Biological Sciences.” Pre-published May 16. https://www.biorxiv.org/content/10.1101/2024.05.13.594019v1.
Zhao, Yuxi, Elizabeth H. Slate, Chang Xu, Haitao Chu, and Lifeng Lin. 2022. “Empirical Comparisons of Heterogeneity Magnitudes of the Risk Difference, Relative Risk, and Odds Ratio.” Systematic Reviews 11 (1): 26. https://doi.org/10.1186/s13643-022-01895-7.