James E. Pustejovsky – Using Predictive Fit to Inform Effect Metric Choice in Meta-Analysis

Effect Metric Menagerie

Group comparison of binary outcomes

Risk differences \(\pi_1 - \pi_0\)
Risk ratios (log-transformed) \(\log\left(\frac{\pi_1}{\pi_0}\right)\)
Odds ratios (log-transformed) \(\log\left(\frac{\pi_1 / (1 - \pi_1)}{\pi_0 / (1 - \pi_0)}\right)\)
Bivariate models for \(\pi_0, \pi_1\)
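As a minimal sketch of how the first three metrics above are computed from a single two-group study, the following uses hypothetical event counts; the function name and the example numbers are illustrative assumptions, not taken from the talk.

```python
import math

def binary_effect_sizes(events1, n1, events0, n0):
    """Risk difference, log risk ratio, and log odds ratio from 2x2 counts.

    Assumes 0 < events < n in both groups, so no continuity correction is needed.
    """
    p1 = events1 / n1  # estimated risk in the treatment group
    p0 = events0 / n0  # estimated risk in the control group
    return {
        "risk_difference": p1 - p0,
        "log_risk_ratio": math.log(p1 / p0),
        "log_odds_ratio": math.log((p1 / (1 - p1)) / (p0 / (1 - p0))),
    }

# Hypothetical study: 30/100 events under treatment, 20/100 under control.
es = binary_effect_sizes(events1=30, n1=100, events0=20, n0=100)
```

All three metrics are functions of the same pair of proportions, which is why a model for one metric alone does not pin down the underlying counts.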

Group comparison of continuous outcomes

Raw mean differences \(\mu_1 - \mu_0\)
Standardized mean differences \(\delta = \frac{\mu_1 - \mu_0}{\sigma}\)
Percentage of maximum possible (POMP) differences
Response ratios (log-transformed) \(\lambda = \log\left(\frac{\mu_1}{\mu_0}\right)\)
Probability of superiority

Single-group prevalence

Raw proportions \(\pi\)
Arcsine transformation \(a = \text{asin}\left(\sqrt{\pi}\right)\)
Freeman-Tukey double-arcsine transformation

Bivariate associations / psychometric

Pearson’s correlation \(\rho\)
Fisher’s \(z\)-transformation \(\zeta = \text{atanh}(\rho)\)
Reliability-corrected correlation \(\frac{\rho_{xy}}{\sqrt{\rho_{xx}\rho_{yy}}}\)
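The two transformations of a correlation listed above can be sketched as follows; the function names and input values are hypothetical and chosen only for illustration.

```python
import math

def fisher_z(r):
    """Fisher's z-transformation of a correlation, atanh(r)."""
    return math.atanh(r)

def reliability_corrected(r_xy, r_xx, r_yy):
    """Correlation corrected for unreliability in both measures."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical inputs: observed correlation 0.4, both reliabilities 0.8.
z = fisher_z(0.4)
r_corrected = reliability_corrected(0.4, 0.8, 0.8)
```

The corrected correlation depends on three estimated quantities, so it is another case where the effect size is a function of several parameters.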

Metric choice methodology

Large literature on effect metrics for group comparison on binary outcomes.
- Theoretical arguments about interpretability, stability, non-collapsibility (Poole et al. 2015; Panagiotou and Trikalinos 2015).
- Risk differences tend to be more heterogeneous (Engels et al. 2000; Zhao et al. 2022).

Strong opinions about effect metrics for group comparison on continuous outcomes (Cummings 2011).
- Some novel alternatives to avoid standardization (Ades et al. 2015; Lu et al. 2014; Downing et al. 2025).
- Various methods for standardization (e.g., Hopkins and Rowlands 2024; Fitzgerald and Tipton 2025).

Choice between standardized mean difference and response ratio metrics
- Sensitivity analyses using both metrics (Friedrich et al. 2011).
- Model both metrics simultaneously (Yang et al. 2024).

But is there any way to evaluate effect metric choice as a model assumption?

Use predictive fit to inform metric choice

Evaluate effect metrics by performance in predicting summary data for a new study.
- Data vector \(\mathbf{d}_i\) consisting of summary statistics used to compute effect size estimates.

Use leave-one-out log predictive density (LOO-LPD), summed across the \(k\) studies, to measure predictive performance. \[ \text{LOO-LPD} = \sum_{i=1}^{k} \log p\left(\mathbf{d}_i \left| \hat\mu_{(-i)}, \hat\tau_{(-i)}, \mathbf{X}_i, N_i\right.\right) \]

Leave-one-out predictive density

Without using observation \(i\), make a prediction for observation \(i\) by specifying a density \(p_{(-i)}(x_i)\) (must integrate to one).

Score higher by putting more density on the realized outcome.

\[ \text{Predictive score:} \ \log p_{(-i)}(x_i) \]

LOO-LPD is the sum of scores across \(N\) observations: \(\displaystyle{\text{LOO-LPD} = \sum_{i=1}^N \log p_{(-i)}(x_i)}\)
- Higher LOO-LPD indicates better fit.
- Agnostic to the metric of the prediction.
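The leave-one-out scoring described above can be sketched with a deliberately simple predictive density. This sketch assumes a plug-in normal density whose mean and standard deviation are estimated from the remaining observations; that choice is an illustrative stand-in, not the meta-analytic model used in the talk.

```python
import math
import statistics

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def loo_lpd(xs):
    """Sum of leave-one-out log predictive scores, plus the individual scores.

    For each observation, the predictive density is a normal distribution
    fitted to all other observations (a plug-in assumption for illustration).
    """
    scores = []
    for i, x in enumerate(xs):
        rest = xs[:i] + xs[i + 1:]      # drop observation i
        mean = statistics.fmean(rest)   # estimates never see x
        sd = statistics.stdev(rest)
        scores.append(normal_logpdf(x, mean, sd))
    return sum(scores), scores

# Hypothetical observations.
total, scores = loo_lpd([0.0, 1.0, 2.0])
```

Because each score is the log of a proper density evaluated at the realized value, totals from different models of the same data can be compared directly.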

Predicting summary data

Many effect size metrics are functions of multiple parameters.
- Group comparisons of binary outcomes are functions of \(\pi_0,\pi_1\)
- Group comparisons of continuous outcomes are functions of \(\mu_0,\mu_1,\sigma\)
- Reliability-corrected correlations are functions of \(\rho_{xy},\rho_{xx},\rho_{yy}\).

Problem:
- Meta-analysis models for such metrics are not sufficiently generative for predicting the data vector \(\mathbf{d}_i\).
Solution:
- Augment the meta-analysis model with an auxiliary model for some of the parameters.

Effectiveness of nicotine replacement therapy

Cochrane Systematic Review of effects of nicotine replacement therapy vs. control on smoking cessation, defined as abstinence at 6+ month follow-up (Hartmann-Boyce et al. 2018).
Sample sizes ranging from \(N_i\) = 36 to 5290 (median = 240.5, IQR = 153.5 - 428.5).

Random effects meta-analysis

Different effect size metrics carry very different implications and show different degrees of heterogeneity.

Metric	Est.	95% CI	80% PI	\(I^2\) (%)
Odds ratio	1.75	1.63-1.88	1.29-2.38	39.06
Risk ratio	1.57	1.48-1.66	1.23-1.99	36.88
Complementary risk ratio	1.07	1.06-1.08	1.02-1.13	65.51
Risk difference	0.06	0.05-0.07	0.02-0.11	63.50

Auxiliary modeling

Possible auxiliary models for \(\hat\pi_{0i}\) or \(\hat\pi_{1i}\):
- Random effects meta-analysis/meta-regression of log-odds
- Generalized linear mixed model (Normal-binomial)
- Beta-binomial regression

Predictive model

\[ \begin{aligned} &\text{Auxiliary model:} \quad & \text{logit}\left(\pi_{0i}\right) &\sim N\left(\mu_0, \ \sigma^2\right) \\ &\text{RE meta-analysis model:} \quad & \theta_i &\sim N\left(\mu, \ \tau^2\right) \\ &\text{Observation model:} \quad & N_{0i} \hat\pi_{0i} &\sim Binom\left( N_{0i}, \ \pi_{0i} \right) \\ & & N_{1i} \hat\pi_{1i} &\sim Binom\left( N_{1i}, \ g(\pi_{0i}, \theta_i) \right) \end{aligned} \]
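A minimal sketch of how a predictive model of this form yields a log predictive density for one study's pair of event counts. It assumes the log odds ratio metric, so that \(g(\pi_{0i}, \theta_i) = \text{expit}\left(\text{logit}(\pi_{0i}) + \theta_i\right)\), treats the second argument of each normal distribution as a standard deviation in code, and integrates over the random effects by plain Monte Carlo. The function names, the parameter values, and the Monte Carlo approach are illustrative assumptions, not the talk's implementation.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def binom_logpmf(y, n, p):
    """Log binomial probability of y events in n trials (requires 0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + y * math.log(p) + (n - y) * math.log1p(-p))

def log_pred_density_or(y0, n0, y1, n1, mu0, sigma0, mu, tau, draws=20000, seed=1):
    """Monte Carlo log predictive density of (y0, y1) under the log-OR metric.

    mu0, sigma0: mean and SD of the control-group log-odds (auxiliary model).
    mu, tau:     mean and SD of the log odds ratios (random effects model).
    """
    rng = random.Random(seed)
    log_terms = []
    for _ in range(draws):
        eta0 = rng.gauss(mu0, sigma0)   # draw logit(pi_0)
        theta = rng.gauss(mu, tau)      # draw the study's log odds ratio
        p0 = expit(eta0)
        p1 = expit(eta0 + theta)        # g(pi_0, theta) for the odds ratio
        log_terms.append(binom_logpmf(y0, n0, p0) + binom_logpmf(y1, n1, p1))
    m = max(log_terms)                  # log-mean-exp, computed stably
    return m + math.log(sum(math.exp(t - m) for t in log_terms) / draws)

# Hypothetical new study and hypothetical leave-one-out parameter estimates.
lpd = log_pred_density_or(y0=20, n0=120, y1=30, n1=120,
                          mu0=-1.5, sigma0=0.5, mu=0.5, tau=0.2, draws=5000)
```

Swapping in a different link function \(g\) (for the risk ratio or risk difference) changes only the line that computes `p1`, which is what makes the resulting scores comparable across metrics.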

Metric comparison

		Normal-Binomial				Beta-Binomial
model	metric	LPD	SE	Diff. vs. OR	SE	LPD	SE	Diff. vs. OR	SE
RE Meta	Odds ratio	-995.8	21.1			-992.8	20.5
RE Meta	Risk ratio	-1002.0	22.0	-6.2	2.7	-998.4	21.4	-5.6	2.6
RE Meta	Complementary risk ratio	-1015.8	22.9	-19.9	9.7	-1012.3	22.2	-19.5	10.3
RE Meta	Risk difference	-1384.5	30.3	-388.7	18.3	-1380.7	29.6	-387.9	18.4
Bivariate Normal	Odds ratio	-993.7	20.9	2.1	3.1
Bivariate GLMM	Odds ratio	-994.1	20.6	1.8	2.4
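The slides do not state how the standard errors of the LPD differences were obtained. One common approach, sketched here purely as an assumption, treats the per-study score differences between two models as a sample and scales their standard deviation by \(\sqrt{k}\). The function name and scores below are hypothetical.

```python
import math
import statistics

def lpd_difference(scores_a, scores_b):
    """Summed difference in per-study log scores (a minus b) and its SE.

    Assumes both lists score the same k studies in the same order, and that
    the SE is sqrt(k) times the SD of the paired differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    total = sum(diffs)
    se = math.sqrt(k) * statistics.stdev(diffs)
    return total, se

# Hypothetical per-study log scores for two metrics.
total_diff, se_diff = lpd_difference([-1.0, -2.0, -3.0], [-1.5, -2.0, -3.5])
```

Pairing the scores by study matters: the SE of a difference computed this way can be much smaller than the SEs of the two totals, as in the table above.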

Discussion

Treat effect metric choice as a modeling assumption.
Predictive fit assessment may be relevant and useful for meta-analysis.
- Will often require use of auxiliary models.

Advantages of log predictive density scoring
- Allows comparison across effect metrics and different forms of models.
- Auxiliary model building exercise can clarify scientific context.

Disadvantages and open questions
- Which parameters should be part of the auxiliary model?
- Other predictive scoring rules that may be relevant?
- Is the joint distribution of \(\mathbf{d}_i\) the right focus?

You’re gonna need a bigger model.

Predictive discrepancies

Reading comprehension and content knowledge

Kim and Cao (2025) reported a systematic review and meta-analysis of studies on the association between reading comprehension and content knowledge.
380 correlation estimates, samples ranging from \(N_i\) = 23 to 3900 (median = 151, IQR = 76-335).

Bivariate associations

The data: Pearson correlation between two variables of interest from a sample of \(N_i\) observations, \(r_i\).

\(\rho\) metric

Effect size estimate \(r_i\), standard error \(\displaystyle{se_i = \frac{1 - r_i^2}{\sqrt{N_i}}}\)
Predictive model: \[ \begin{aligned} r_i &\dot{\sim} \ N\left(\rho_i, \ \frac{(1 - \rho_i^2)^2}{N_i}\right) \\ \rho_i &\sim \ N_{trunc}\left(\mu_\rho, \ \tau_\rho^2\right) \end{aligned} \]

\(\zeta = \text{atanh}(\rho)\) metric

Effect size estimate \(z_i = \text{atanh}(r_i)\), standard error \(\displaystyle{se_i = \frac{1}{\sqrt{N_i - 3}}}\)
Predictive model: \[ \begin{aligned} z_i &\dot{\sim} \ N\left(\zeta_i, \ \frac{1}{N_i - 3}\right) \\ \zeta_i &\sim \ N\left(\mu_\zeta, \ \tau_\zeta^2\right) \end{aligned} \]
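Log predictive density on the \(r\) scale, obtained from the \(z\)-scale density by the change of variables \(z_i = \text{atanh}(r_i)\) with \(dz_i / dr_i = 1 / (1 - r_i^2)\): \[ \begin{aligned} \log d_r\left(r_i \left| \hat\mu_{\zeta (-i)}, \hat\tau_{\zeta (-i)}, N_i \right.\right) &= \log d_z\left(z_i \left| \hat\mu_{\zeta (-i)}, \hat\tau_{\zeta (-i)}, N_i \right.\right) - \log\left(1 - r_i^2\right) \end{aligned} \]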
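A minimal sketch of this change of variables. It assumes the marginal predictive distribution of \(z_i\) implied by the model above, \(N\left(\mu_\zeta, \ \tau_\zeta^2 + 1/(N_i - 3)\right)\), with plug-in parameter values; the function names and inputs are hypothetical.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_density_r_from_z_model(r, n, mu_zeta, tau_zeta):
    """Log predictive density of r implied by the Fisher-z random effects model."""
    z = math.atanh(r)
    var = tau_zeta ** 2 + 1.0 / (n - 3)   # between-study plus sampling variance
    # Jacobian of z = atanh(r) is 1 / (1 - r^2), so subtract log(1 - r^2).
    return normal_logpdf(z, mu_zeta, var) - math.log(1.0 - r * r)

# Hypothetical study and hypothetical leave-one-out estimates.
score = log_density_r_from_z_model(r=0.36, n=151, mu_zeta=0.38, tau_zeta=0.2)
```

With the Jacobian term included, the density integrates to one over \(r \in (-1, 1)\), so scores from the \(z\) model can be compared directly with scores from a model specified on the \(r\) scale.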

Metric comparison

Metric	Est.	95% CI	80% PI	LPD	SE
r	0.36	0.34-0.38	0.13-0.59	63.36	11.80
z	0.36	0.34-0.38	0.11-0.57	67.52	11.26
Difference				-4.15	1.94

Discrepancies

Class attendance and college grades

Credé et al. (2010) reported a systematic review and meta-analysis of studies on the association between class attendance and grades / GPA in college.
99 correlation estimates, samples ranging from \(N_i\) = 23 to 3900 (median = 151, IQR = 76-335).

Metric comparison

Metric	Est.	95% CI	80% PI	LPD	SE
r	0.40	0.37-0.44	0.20-0.60	33.24	8.27
z	0.41	0.37-0.45	0.16-0.61	21.39	11.95
Difference				11.85	4.72

Discrepancies

Reliability generalization of MIBS

Demir et al. (2024) gathered 33 estimates of internal consistency (Cronbach’s \(\alpha\)) of the Mother-to-Infant Bonding Scale.
Sample sizes ranging from \(N_i\) = 13 to 2251 (median = 177, IQR = 98-260).

Metric	Est.	95% CI	80% PI	\(I^2\) (%)	LPD	SE
Raw alpha	0.72	0.68-0.76	0.58-0.87	97.01	18.79	5.12
Bonett trans.	0.74	0.69-0.78	0.51-0.86	96.34	17.48	3.86
Hakstian-Whalen trans.	0.73	0.68-0.77	0.53-0.86	96.37	19.27	3.69

Incidence of olfactory loss in COVID-19 patients

Hannum et al. (2020) compiled data on rates of olfactory loss across 35 studies of COVID-19 patients.
Sample sizes ranging from \(N_i\) = 15 to 7178 (median = 95, IQR = 56.5 - 267.5).

Many different transformations of \(p_i\) are used as effect size measures (identity, logit, probit, arcsin-square-root, Freeman-Tukey).
Could use conventional random effects model or generalized linear mixed model.

Which predictive model to use?

\[ \begin{aligned} g(p_i) &\dot{\sim} \ N\left(g(\pi_i), \ \frac{h(\pi_i)}{N_i}\right) \qquad & N_i p_i &\sim \ Binom\left(N_i, \ \pi_i\right)\\ g(\pi_i) &\sim \ N\left(\mu_g, \ \tau_g^2\right) \qquad & g(\pi_i) &\sim \ N\left(\mu_g, \ \tau_g^2\right) \end{aligned} \]
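As a worked illustration of the variance function \(h(\pi_i)\) in the normal approximation on the left, the following assumes (the slide does not say so) that it comes from a first-order delta-method expansion, \(\text{Var}\left(g(p_i)\right) \approx g'(\pi_i)^2 \, \pi_i (1 - \pi_i) / N_i\), so that \(h(\pi) = g'(\pi)^2 \, \pi (1 - \pi)\). Under that assumption, with \(\phi\) and \(\Phi\) denoting the standard normal density and distribution function:

\[ \begin{aligned} &\text{Identity:} \quad & g(\pi) &= \pi, & h(\pi) &= \pi (1 - \pi) \\ &\text{Logit:} \quad & g(\pi) &= \log\left(\frac{\pi}{1 - \pi}\right), & h(\pi) &= \frac{1}{\pi (1 - \pi)} \\ &\text{Probit:} \quad & g(\pi) &= \Phi^{-1}(\pi), & h(\pi) &= \frac{\pi (1 - \pi)}{\phi\left(\Phi^{-1}(\pi)\right)^2} \\ &\text{Arcsine square root:} \quad & g(\pi) &= \text{asin}\left(\sqrt{\pi}\right), & h(\pi) &= \frac{1}{4} \end{aligned} \]

Of these, only the arcsine square root scale has a variance function that does not depend on \(\pi_i\).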

Incidence of olfactory loss in COVID-19 patients

					Normal		Binomial
Model	Metric	Est.	95% CI	80% PI	LPD	SE	LPD	SE
RE	logit	0.48	0.38-0.58	0.17-0.81	-178.57	12.43	-178.71	12.53
RE	probit	0.49	0.39-0.58	0.17-0.81	-181.35	14.09	-181.35	14.16
RE	arcsin	0.49	0.40-0.58	0.17-0.81	-173.53	11.18	-173.62	11.18
GLMM	logit	0.48	0.38-0.59	0.16-0.82			-189.90	19.26
GLMM	probit	0.49	0.39-0.58	0.17-0.82			-183.24	15.21

References

Ades, A. E., Guobing Lu, Sofia Dias, Evan Mayo-Wilson, and Daphne Kounali. 2015. “Simultaneous Synthesis of Treatment Effects and Mapping to a Common Scale: An Alternative to Standardisation.” Research Synthesis Methods 6 (1): 96–107. https://doi.org/10.1002/jrsm.1130.

Arends, Lidia R., Arno W. Hoes, Jacobus Lubsen, Diederik E. Grobbee, and Theo Stijnen. 2000. “Baseline Risk as Predictor of Treatment Benefit: Three Clinical Meta-Re-Analyses.” Statistics in Medicine 19 (24): 3497–518. https://doi.org/10.1002/1097-0258(20001230)19:24<3497::AID-SIM830>3.0.CO;2-H.

Credé, Marcus, Sylvia G. Roch, and Urszula M. Kieszczynka. 2010. “Class Attendance in College: A Meta-Analytic Review of the Relationship of Class Attendance with Grades and Student Characteristics.” Review of Educational Research 80 (2): 272–95. https://doi.org/10.3102/0034654310362998.

Cummings, Peter. 2011. “Arguments for and Against Standardized Mean Differences (Effect Sizes).” Archives of Pediatrics & Adolescent Medicine 165 (7): 592–96. https://doi.org/10.1001/archpediatrics.2011.97.

Demir, Emin, Sena Öz, Neriman Aral, and Figen Gürsoy. 2024. “A Reliability Generalization Meta-Analysis of the Mother-To-Infant Bonding Scale.” Psychological Reports 127 (1): 447–64. https://doi.org/10.1177/00332941221114413.

Downing, Beatrice C., Nicky J. Welton, Hugo Pedder, Ifigeneia Mavranezouli, Odette Megnin-Viggars, and A. E. Ades. 2025. “Synthesis of Depression Outcomes Reported on Different Scales: A Comparison of Methods for Modelling Mean Differences.” Research Synthesis Methods 16 (3): 460–78. https://doi.org/10.1017/rsm.2025.7.

Engels, Eric A., Christopher H. Schmid, Norma Terrin, Ingram Olkin, and Joseph Lau. 2000. “Heterogeneity and Statistical Significance in Meta-Analysis: An Empirical Study of 125 Meta-Analyses.” Statistics in Medicine 19 (13): 1707–28. https://doi.org/10.1002/1097-0258(20000715)19:13<1707::AID-SIM491>3.0.CO;2-P.

Fitzgerald, Kaitlyn G., and Elizabeth Tipton. 2025. “Using Extant Data to Improve Estimation of the Standardized Mean Difference.” Journal of Educational and Behavioral Statistics 50 (1): 128–48. https://doi.org/10.3102/10769986241238478.

Friedrich, Jan O., Neill K. J. Adhikari, and Joseph Beyene. 2011. “Ratio of Means for Analyzing Continuous Outcomes in Meta-Analysis Performed as Well as Mean Difference Methods.” Journal of Clinical Epidemiology 64 (5): 556–64. https://doi.org/10.1016/j.jclinepi.2010.09.016.

Guolo, Annamaria. 2022. “Measurement Errors in Control Risk Regression: A Comparison of Correction Techniques.” Statistics in Medicine 41 (1): 163–79. https://doi.org/10.1002/sim.9228.

Hannum, Mackenzie E, Vicente A Ramirez, Sarah J Lipson, et al. 2020. “Objective Sensory Testing Methods Reveal a Higher Prevalence of Olfactory Loss in COVID-19–Positive Patients Compared to Subjective Methods: A Systematic Review and Meta-Analysis.” Chemical Senses, September 29, bjaa064. https://doi.org/10.1093/chemse/bjaa064.

Hartmann-Boyce, Jamie, Samantha C Chepkin, Weiyu Ye, Chris Bullen, and Tim Lancaster. 2018. “Nicotine Replacement Therapy Versus Control for Smoking Cessation.” Cochrane Database of Systematic Reviews 2019 (1). https://doi.org/10.1002/14651858.CD000146.pub5.

Hopkins, Will G., and David S. Rowlands. 2024. “Standardization and Other Approaches to Meta-Analyze Differences in Means.” Statistics in Medicine 43 (16): 3092–108. https://doi.org/10.1002/sim.10114.

Kim, Young-Suk Grace, and Yucheng Cao. 2025. “Content Knowledge and Comprehension: A Meta-Analytic Review of Correlational and Causal Associations.” Psychological Bulletin 151 (10): 1219–44. https://doi.org/10.1037/bul0000502.

Lu, Guobing, Daphne Kounali, and A. E. Ades. 2014. “Simultaneous Multioutcome Synthesis and Mapping of Treatment Effects to a Common Scale.” Value in Health: The Journal of the International Society for Pharmacoeconomics and Outcomes Research 17 (2): 280–87. https://doi.org/10.1016/j.jval.2013.12.006.

Panagiotou, Orestis A., and Thomas A. Trikalinos. 2015. “Commentary: On Effect Measures, Heterogeneity, and the Laws of Nature.” Epidemiology 26 (5): 710. https://doi.org/10.1097/EDE.0000000000000359.

Poole, Charlie, Ian Shrier, and Tyler J. VanderWeele. 2015. “Is the Risk Difference Really a More Heterogeneous Measure?” Epidemiology 26 (5): 714. https://doi.org/10.1097/EDE.0000000000000354.

Schmid, Christopher H., Joseph Lau, Martin W. McIntosh, and Joseph C. Cappelleri. 1998. “An Empirical Study of the Effect of the Control Rate as a Predictor of Treatment Efficacy in Meta-Analysis of Clinical Trials.” Statistics in Medicine 17 (17): 1923–42. https://doi.org/10.1002/(SICI)1097-0258(19980915)17:17<1923::AID-SIM874>3.0.CO;2-6.

Stijnen, Theo, Taye H. Hamza, and Pinar Özdemir. 2010. “Random Effects Meta-Analysis of Event Outcome in the Framework of the Generalized Linear Mixed Model with Applications in Sparse Data.” Statistics in Medicine 29 (29): 3046–67. https://doi.org/10.1002/sim.4040.

Van Houwelingen, Hans C., Lidia R. Arends, and Theo Stijnen. 2002. “Advanced Methods in Meta‐analysis: Multivariate Approach and Meta‐regression.” Statistics in Medicine 21 (4): 589–624. https://doi.org/10.1002/sim.1040.

Yang, Yefeng, Coralie Williams, Alistair M. Senior, et al. 2024. “Bivariate Multilevel Meta-Analysis of Log Response Ratio and Standardized Mean Difference for Robust and Reproducible Environmental and Biological Sciences.” Pre-published May 16. https://www.biorxiv.org/content/10.1101/2024.05.13.594019v1.

Zhao, Yuxi, Elizabeth H. Slate, Chang Xu, Haitao Chu, and Lifeng Lin. 2022. “Empirical Comparisons of Heterogeneity Magnitudes of the Risk Difference, Relative Risk, and Odds Ratio.” Systematic Reviews 11 (1): 26. https://doi.org/10.1186/s13643-022-01895-7.