
2025-06-11
Large literature on effect metrics for group comparison on binary outcomes.
Theoretical arguments about interpretability, stability, non-collapsibility (Poole et al. 2015; Panagiotou and Trikalinos 2015).
Risk differences tend to be more heterogeneous (Engels et al. 2000; Zhao et al. 2022).
Strong opinions about effect metrics for group comparison on continuous outcomes (Cummings 2011).
Some novel alternatives to avoid standardization (Ades et al. 2015; Lu et al. 2014; Downing et al. 2025).
Various methods for standardization (e.g., Hopkins and Rowlands 2024; Fitzgerald and Tipton 2025).
Choice between standardized mean difference and response ratio metrics:
Sensitivity analyses using both metrics (Friedrich et al. 2011).
Model both metrics simultaneously (Yang et al. 2024).
Evaluate effect metrics by performance in predicting summary data for a new study.

\[ \text{Predictive score:} \ \log p_{(-i)}(x_i) \]

LOO-LPD is the sum of scores across \(N\) observations: \(\displaystyle{\text{LOO-LPD} = \sum_{i=1}^N \log p_{(-i)}(x_i)}\)
Higher LOO-LPD indicates better fit.
Agnostic to the metric of the prediction.
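
A minimal sketch of how the LOO-LPD could be computed for a conventional random effects model. It assumes a plug-in normal predictive density \(N(\hat\mu_{(-i)}, \hat\tau^2_{(-i)} + v_i)\) and swaps in the DerSimonian-Laird estimator for simplicity; the estimator, the function names, and the data are illustrative assumptions, not taken from the analyses reported here.

```python
import math

def dl_estimate(y, v):
    """DerSimonian-Laird estimates (mu_hat, tau2_hat); an assumed, simple estimator."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mu_fe = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c)
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return mu, tau2

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def loo_lpd(y, v):
    """Return (LOO-LPD, pointwise scores) for effect estimates y with variances v."""
    scores = []
    for i in range(len(y)):
        # Refit the model without study i, then score study i.
        mu, tau2 = dl_estimate(y[:i] + y[i + 1:], v[:i] + v[i + 1:])
        scores.append(normal_logpdf(y[i], mu, tau2 + v[i]))
    return sum(scores), scores

# Made-up effect estimates and sampling variances, for illustration only.
y = [0.10, 0.35, 0.22, 0.48, 0.05, 0.31]
v = [0.02, 0.03, 0.01, 0.05, 0.02, 0.04]
total, pointwise = loo_lpd(y, v)
print(round(total, 3))
```

The plug-in density ignores uncertainty in \(\hat\mu_{(-i)}\) and \(\hat\tau_{(-i)}\); a fuller treatment would propagate it.
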
Many effect size metrics are functions of multiple parameters.
Group comparisons of binary outcomes are functions of \(\pi_0,\pi_1\).
Group comparisons of continuous outcomes are functions of \(\mu_0,\mu_1,\sigma\).
Reliability-corrected correlations are functions of \(\rho_{xy},\rho_{xx},\rho_{yy}\).
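
For concreteness, the usual textbook definitions behind these statements are written out below. The labels (CRR, \(\text{RR}_\mu\), \(\rho_{\text{corrected}}\)) are introduced only for illustration, and the direction of the complementary risk ratio is an assumption, chosen so that values above 1 correspond to \(\pi_1 > \pi_0\).

\[ \begin{aligned} \text{OR} &= \frac{\pi_1 / (1 - \pi_1)}{\pi_0 / (1 - \pi_0)}, & \text{RR} &= \frac{\pi_1}{\pi_0}, & \text{CRR} &= \frac{1 - \pi_0}{1 - \pi_1}, & \text{RD} &= \pi_1 - \pi_0 \\ \text{SMD} &= \frac{\mu_1 - \mu_0}{\sigma}, & \text{RR}_\mu &= \frac{\mu_1}{\mu_0}, & \rho_{\text{corrected}} &= \frac{\rho_{xy}}{\sqrt{\rho_{xx}\,\rho_{yy}}} \end{aligned} \]
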
Problem:
Solution:
Cochrane Systematic Review of effects of nicotine replacement therapy vs. control on smoking cessation, defined as abstinence at 6+ month follow-up (Hartmann-Boyce et al. 2018).
Sample sizes ranging from \(N_i\) = 36 to 5290 (median = 240.5, IQR = 153.5 - 428.5).


| Metric | Est. | 95% CI | 80% PI | \(I^2\) (%) |
|---|---|---|---|---|
| Odds ratio | 1.75 | 1.63-1.88 | 1.29-2.38 | 39.06 |
| Risk ratio | 1.57 | 1.48-1.66 | 1.23-1.99 | 36.88 |
| Complementary risk ratio | 1.07 | 1.06-1.08 | 1.02-1.13 | 65.51 |
| Risk difference | 0.06 | 0.05-0.07 | 0.02-0.11 | 63.50 |

Possible auxiliary models for \(\hat\pi_{0i}\) or \(\hat\pi_{1i}\):
Random effects meta-analysis/meta-regression of log-odds
Generalized linear mixed model (Normal-binomial)
Beta-binomial regression


\[ \begin{aligned} &\text{Auxiliary model:} \quad & \text{logit}\left(\pi_{0i}\right) &\sim N\left(\mu_0, \sigma\right) \\ &\text{RE meta-analysis model:} \quad & \theta_i &\sim N\left(\mu, \ \tau^2\right) \\ &\text{Observation model:} \quad & N_{0i} \hat\pi_{0i} &\sim Binom\left( N_{0i}, \ \pi_{0i} \right) \\ & & N_{1i} \hat\pi_{1i} &\sim Binom\left( N_{1i}, \ g(\pi_{0i}, \theta_i) \right) \end{aligned} \]
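
A hedged sketch of what the link \(g(\pi_{0i}, \theta_i)\) might look like for each metric: given the control-group risk and an effect on the metric's scale, it returns the implied treatment-group risk. The metric labels, the use of log scales for the ratio metrics, the direction of the complementary risk ratio, and the decision to raise an error for out-of-range risks are all assumptions made for illustration.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def g(pi0, theta, metric):
    """Treatment-group risk implied by control risk pi0 and effect theta."""
    if metric == "log_or":      # log odds ratio
        pi1 = expit(logit(pi0) + theta)
    elif metric == "log_rr":    # log risk ratio
        pi1 = pi0 * math.exp(theta)
    elif metric == "log_crr":   # log complementary risk ratio, (1 - pi0) / (1 - pi1)
        pi1 = 1.0 - (1.0 - pi0) * math.exp(-theta)
    elif metric == "rd":        # risk difference
        pi1 = pi0 + theta
    else:
        raise ValueError(f"unknown metric: {metric}")
    if not 0.0 <= pi1 <= 1.0:
        # Only the odds ratio guarantees a valid risk for every (pi0, theta).
        raise ValueError(f"{metric} implies pi1 = {pi1:.3f}, outside [0, 1]")
    return pi1

# Illustrative values: a 10% control risk combined with the pooled estimates.
pi1_or = g(0.10, math.log(1.75), "log_or")
pi1_rr = g(0.10, math.log(1.57), "log_rr")
pi1_crr = g(0.10, math.log(1.07), "log_crr")
pi1_rd = g(0.10, 0.06, "rd")
print(round(pi1_or, 3), round(pi1_rr, 3), round(pi1_crr, 3), round(pi1_rd, 3))
```

Every metric except the odds ratio can imply a risk outside \([0, 1]\) for some combinations of \(\pi_{0i}\) and \(\theta_i\), so a full model needs a rule for those cases.
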

Auxiliary model: Normal-Binomial

| Metric | LPD | SE | Diff. vs. OR | SE |
|---|---|---|---|---|
| Odds ratio | -995.8 | 21.1 | | |
| Risk ratio | -1002.0 | 22.0 | -6.2 | 2.7 |
| Complementary risk ratio | -1015.8 | 22.9 | -19.9 | 9.7 |
| Risk difference | -1384.5 | 30.3 | -388.7 | 18.3 |


| Model | Metric | Normal-Binomial LPD | SE | Diff. vs. OR | SE | Beta-Binomial LPD | SE | Diff. vs. OR | SE |
|---|---|---|---|---|---|---|---|---|---|
| RE Meta | Odds ratio | -995.8 | 21.1 | | | -992.8 | 20.5 | | |
| RE Meta | Risk ratio | -1002.0 | 22.0 | -6.2 | 2.7 | -998.4 | 21.4 | -5.6 | 2.6 |
| RE Meta | Complementary risk ratio | -1015.8 | 22.9 | -19.9 | 9.7 | -1012.3 | 22.2 | -19.5 | 10.3 |
| RE Meta | Risk difference | -1384.5 | 30.3 | -388.7 | 18.3 | -1380.7 | 29.6 | -387.9 | 18.4 |
| Bivariate Normal | Odds ratio | -993.7 | 20.9 | 2.1 | 3.1 | | | | |
| Bivariate GLMM | Odds ratio | -994.1 | 20.6 | 1.8 | 2.4 | | | | |

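
One common way to obtain a difference in LPD and its standard error is from the paired pointwise scores of two models. The sketch below assumes that approach (difference of sums, with SE equal to \(\sqrt{N}\) times the standard deviation of the pointwise differences); whether the tables above were computed exactly this way is not stated, and the function name and numbers are hypothetical.

```python
import math

def lpd_difference(scores_a, scores_b):
    """Return (difference in LPD, SE) from paired pointwise log predictive scores."""
    n = len(scores_a)
    d = [a - b for a, b in zip(scores_a, scores_b)]
    total = sum(d)
    mean = total / n
    var = sum((di - mean) ** 2 for di in d) / (n - 1)
    # The SE of a sum of n paired differences is sqrt(n) times their standard deviation.
    return total, math.sqrt(n * var)

# Made-up pointwise scores for two candidate metrics.
scores_a = [-1.0, -2.0, -3.0]
scores_b = [-1.5, -2.0, -2.5]
diff, se = lpd_difference(scores_a, scores_b)
print(diff, round(se, 3))
```

Pairing matters: both models score the same held-out studies, so the SE of the difference can be much smaller than the SEs of the individual LPDs.
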
Treat effect metric choice as a modeling assumption.
Predictive fit assessment may be relevant and useful for meta-analysis.
Advantages of log predictive density scoring
Allows comparison across effect metrics and different forms of models.
Auxiliary model building exercise can clarify scientific context.
Disadvantages and open questions
Which parameters should be part of the auxiliary model?
Other predictive scoring rules that may be relevant?
Is the joint distribution of \(\mathbf{d}_i\) the right focus?
You’re gonna need a bigger model.
Kim and Cao (2025) reported a systematic review and meta-analysis of studies on the association between reading comprehension and content knowledge.
380 correlation estimates, samples ranging from \(N_i\) = 23 to 3900 (median = 151, IQR = 76-335).


Effect size estimate \(r_i\), standard error \(\displaystyle{se_i = \frac{1 - r_i^2}{\sqrt{N_i}}}\)
Predictive model: \[ \begin{aligned} r_i &\dot{\sim} \ N\left(\rho_i, \ \frac{(1 - \rho_i^2)^2}{N_i}\right) \\ \rho_i &\sim \ N_{trunc}\left(\mu_\rho, \ \tau_\rho^2\right) \end{aligned} \]
Effect size estimate \(z_i = \text{atanh}(r_i)\), standard error \(\displaystyle{se_i = \frac{1}{\sqrt{N_i - 3}}}\)
Predictive model: \[ \begin{aligned} z_i &\dot{\sim} \ N\left(\zeta_i, \ \frac{1}{N_i - 3}\right) \\ \zeta_i &\sim \ N\left(\mu_\zeta, \ \tau_\zeta^2\right) \end{aligned} \]
Log-predictive density: \[ \begin{aligned} \log d_r\left(r_i \left| \hat\mu_{\zeta (-i)}, \hat\tau_{\zeta (-i)}, N_i \right.\right) &= \log d_z\left(z_i \left| \hat\mu_{\zeta (-i)}, \hat\tau_{\zeta (-i)}, N_i \right.\right) - \log\left(1 - r_i^2\right) \end{aligned} \]
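
The change of variables can be sketched in a few lines. The code assumes the marginal predictive distribution of \(z_i\) is \(N(\mu_\zeta, \tau_\zeta^2 + 1/(N_i - 3))\), as implied by the model for \(z_i\), and subtracts \(\log(1 - r_i^2)\) because \(dz/dr = 1/(1 - r^2)\). Function names and parameter values are hypothetical.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_density_r(r, mu_zeta, tau_zeta, n):
    """Log predictive density of r implied by the Fisher-z model."""
    z = math.atanh(r)
    var = tau_zeta ** 2 + 1.0 / (n - 3)
    # Jacobian of z = atanh(r): dz/dr = 1 / (1 - r^2).
    return normal_logpdf(z, mu_zeta, var) - math.log(1.0 - r ** 2)

# Sanity check with made-up parameters: the density should integrate to 1 over (-1, 1).
k = 20000
h = 2.0 / k
area = sum(
    math.exp(log_density_r(-1.0 + (j + 0.5) * h, 0.4, 0.2, 50)) * h
    for j in range(k)
)
print(round(area, 4))
```

With the Jacobian term included, scores from the \(z\) model are on the same scale as scores from the \(r\) model and can be compared directly.
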
| Metric | Est. | 95% CI | 80% PI | LPD | SE |
|---|---|---|---|---|---|
| r | 0.36 | 0.34-0.38 | 0.13-0.59 | 63.36 | 11.80 |
| z | 0.36 | 0.34-0.38 | 0.11-0.57 | 67.52 | 11.26 |
| Difference (r - z) | | | | -4.15 | 1.94 |


Credé et al. (2010) reported a systematic review and meta-analysis of studies on the association between class attendance and grades / GPA in college.
99 correlation estimates, samples ranging from \(N_i\) = 23 to 3900 (median = 151, IQR = 76-335).


| Metric | Est. | 95% CI | 80% PI | LPD | SE |
|---|---|---|---|---|---|
| r | 0.40 | 0.37-0.44 | 0.20-0.60 | 33.24 | 8.27 |
| z | 0.41 | 0.37-0.45 | 0.16-0.61 | 21.39 | 11.95 |
| Difference (r - z) | | | | 11.85 | 4.72 |


Demir et al. (2024) gathered 33 estimates of internal consistency (Cronbach's \(\alpha\)) of the Mother-to-Infant Bonding Scale.
Sample sizes ranging from \(N_i\) = 13 to 2251 (median = 177, IQR = 98-260).

| Metric | Est. | 95% CI | 80% PI | \(I^2\) (%) | LPD | SE |
|---|---|---|---|---|---|---|
| Raw alpha | 0.72 | 0.68-0.76 | 0.58-0.87 | 97.01 | 18.79 | 5.12 |
| Bonett trans. | 0.74 | 0.69-0.78 | 0.51-0.86 | 96.34 | 17.48 | 3.86 |
| Hakstian-Whalen trans. | 0.73 | 0.68-0.77 | 0.53-0.86 | 96.37 | 19.27 | 3.69 |

Hannum et al. (2020) compiled data on rates of olfactory loss across 35 studies of COVID-19 patients.
Sample sizes ranging from \(N_i\) = 15 to 7178 (median = 95, IQR = 56.5 - 267.5).

Many different transformations of \(p_i\) are used as effect size measures (identity, logit, probit, arcsin-square-root, Freeman-Tukey).
Could use conventional random effects model or generalized linear mixed model.
\[ \begin{aligned} g(p_i) &\dot{\sim} \ N\left(g(\pi_i), \ \frac{h(\pi_i)}{N_i}\right) \qquad & N_i p_i &\sim \ Binom\left(N_i, \ \pi_i\right)\\ g(\pi_i) &\sim \ N\left(\mu_g, \ \tau_g^2\right) \qquad & g(\pi_i) &\sim \ N\left(\mu_g, \ \tau_g^2\right) \end{aligned} \]
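
A sketch of some of the transformations \(g\) and the corresponding variance functions \(h\), derived here by the delta method as \(h(\pi) = g'(\pi)^2 \, \pi (1 - \pi)\). The dictionary layout and names are illustrative assumptions; the Freeman-Tukey transformation is omitted because it also depends on \(N_i\).

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for the probit transformation

# Each entry maps a name to (g, h): the transformation and its delta-method
# variance function, so that Var(g(p_i)) is approximately h(pi_i) / N_i.
TRANSFORMS = {
    "identity": (lambda p: p,
                 lambda pi: pi * (1.0 - pi)),
    "logit": (lambda p: math.log(p / (1.0 - p)),
              lambda pi: 1.0 / (pi * (1.0 - pi))),
    "probit": (lambda p: _STD.inv_cdf(p),
               lambda pi: pi * (1.0 - pi) / _STD.pdf(_STD.inv_cdf(pi)) ** 2),
    "arcsin": (lambda p: math.asin(math.sqrt(p)),
               lambda pi: 0.25),
}

def transformed_estimate(p, n, name):
    """Return (g(p), approximate sampling variance) for an observed proportion p."""
    g, h = TRANSFORMS[name]
    return g(p), h(p) / n

# Made-up example: 40% observed prevalence in a sample of 95.
for name in TRANSFORMS:
    est, var = transformed_estimate(0.40, 95, name)
    print(name, round(est, 3), round(var, 5))
```

Only the arcsin-square-root transformation has a variance that does not depend on \(\pi_i\); for the others, plugging \(p_i\) into \(h\) adds estimation error that the binomial GLMM formulation avoids.
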

| Model | Metric | Est. | 95% CI | 80% PI | LPD (Normal) | SE | LPD (Binomial) | SE |
|---|---|---|---|---|---|---|---|---|
| RE | logit | 0.48 | 0.38-0.58 | 0.17-0.81 | -178.57 | 12.43 | -178.71 | 12.53 |
| RE | probit | 0.49 | 0.39-0.58 | 0.17-0.81 | -181.35 | 14.09 | -181.35 | 14.16 |
| RE | arcsin | 0.49 | 0.40-0.58 | 0.17-0.81 | -173.53 | 11.18 | -173.62 | 11.18 |
| GLMM | logit | 0.48 | 0.38-0.59 | 0.16-0.82 | -189.90 | 19.26 | | |
| GLMM | probit | 0.49 | 0.39-0.58 | 0.17-0.82 | -183.24 | 15.21 | | |