In a recent _Psychological Methods_ paper, Haaf & Rouder (2023) suggest conducting meta-analysis of standardized mean differences after transforming them to point-biserial correlations and then applying Fisher's $z$-transformation. They argue that this leads to a variance-stabilized effect size estimator, which has sampling variance that is constant across varying values of the true parameter $\delta$. Bartoš et al. (2023) cite Haaf & Rouder (2023) as justification for conducting meta-analysis of standardized mean differences after applying this $d$-to-$r$-to-$z$ transformation. All this surprised me a bit because the transformation was unfamiliar, although I've written about these sorts of effect size conversions before (cf. Pustejovsky, 2014). In other work (Pustejovsky & Rodgers, 2018), I've used the variance-stabilizing transformation given in Hedges & Olkin (1985), which is different from this approach. What gives? Is this $d$-to-$r$-to-$z$ business _really_ variance stabilizing?
$z(d)$
The first step in the transformation proposed by Haaf & Rouder (2023) depends on the relative sample sizes of the two groups, which I will parameterize as $R = N_1 : N_2$. Letting $a = (R + 1)^2 / R$, the transformation function is given by
$$z(d; a) = \frac{1}{2}\left[\log\left(\sqrt{d^2 + a} + d\right) - \log\left(\sqrt{d^2 + a} - d\right)\right].$$
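This is the composition of the $d$-to-$r$ conversion for this design, $r = d / \sqrt{d^2 + a}$, with Fisher's transformation:
$$\tanh^{-1}(r) = \frac{1}{2}\log\left(\frac{1 + r}{1 - r}\right) = \frac{1}{2}\log\left(\frac{\sqrt{d^2 + a} + d}{\sqrt{d^2 + a} - d}\right).$$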
The shape of the transformation is depicted in Figure 1.
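The code below implements $z(d)$ as `d_to_z()`, along with the Hedges and Olkin transformation discussed later in the post as `d_to_h()`; both helper functions are reused throughout.

```r
library(tidyverse)

# d-to-r-to-z transformation (Haaf & Rouder, 2023)
d_to_z <- function(d, R) {
  r <- d / sqrt(d^2 + (R + 1)^2 / R)  # convert d to point-biserial r
  atanh(r)                            # Fisher's z-transformation
}

# Variance-stabilizing transformation (Hedges & Olkin, 1985)
d_to_h <- function(d, R) {
  a <- (R + 1)^2 / R
  sqrt(2) * sign(d) * (log(abs(d) + sqrt(d^2 + 2 * a)) - log(2 * a) / 2)
}

# Evaluate both transformations over a grid of d and R
transform_dat <- expand_grid(
  d = seq(-5, 5, 0.02),
  R = 1:3
) |>
  mutate(
    a = (R + 1)^2 / R,
    z = d_to_z(d, R),
    h = d_to_h(d, R),
    R_fac = factor(paste(R, "1", sep = ":"))
  )

ggplot(transform_dat) + 
  aes(d, z, color = R_fac) + 
  geom_abline(slope = 1/2, intercept = 0, linetype = "dashed") + 
  geom_line() + 
  theme_minimal() + 
  theme(legend.position = "inside", legend.position.inside = c(0.9, 0.2)) + 
  labs(color = "R") + 
  scale_x_continuous(breaks = seq(-5, 5, 1), minor_breaks = NULL)
```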
Figure 1: $d$-to-$r$-to-$z$ transformation for various allocation ratios $R$. The dashed line is $y = x / 2$.
One way to check whether this transformation is variance-stabilizing is by finding the [delta-method approximation](/posts/Multivariate-delta-method/) to its variance. I find the first derivative of $z(d)$ to be
$$z'(d; a) = \frac{1}{\sqrt{d^2 + a}},$$
and so
$$\begin{aligned}\text{Var}(z(d)) &\approx \left( z'(\delta; a) \right)^2 \times \text{Var}(d) \\ &= \frac{1}{\delta^2 + a} \times \frac{1}{N} \left(a + \frac{\delta^2}{2}\right) \\ &= \frac{1}{N} \left(\frac{a + \frac{1}{2}\delta^2}{a + \delta^2}\right).\end{aligned}$$
This looks like the following:
```r
# Plot the delta-method approximation to N * Var(z) as a function of delta
transform_dat %>%
  mutate(Vz = (a + d^2 / 2) / (a + d^2)) %>%
  ggplot() + 
  aes(d, Vz, color = R_fac) + 
  geom_hline(yintercept = 1) + 
  geom_line() + 
  theme_minimal() + 
  theme(legend.position = "inside", legend.position.inside = c(0.9, 0.8)) + 
  labs(x = expression(delta), y = expression(N %*% Var(z)), color = "R") + 
  scale_x_continuous(breaks = seq(-5, 5, 1), minor_breaks = NULL)
```
Figure 2: Delta method approximation for the variance of $z(d)$.
Based on this approximation, $\text{Var}(z(d))$ is not independent of $\delta$ and is instead _decreasing_ in the magnitude of the true effect size.
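To put a number on it: with $R = 1{:}1$ (so $a = 4$) and $\delta = 2$, the approximation gives $N \times \text{Var}(z(d)) = (4 + 2) / (4 + 4) = 0.75$, so the standard error of $z(d)$ shrinks to $\sqrt{0.75} \approx 87\%$ of its value at $\delta = 0$.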
$h(d)$
The variance-stabilizing transformation given in Hedges & Olkin (1985; see also Pustejovsky & Rodgers, 2018) is
$$h(d; a) = \sqrt{2} \times \text{sgn}(d) \times \sinh^{-1}\left(\frac{|d|}{\sqrt{2a}}\right) = \sqrt{2} \left(\frac{d}{|d|}\right) \left[\log\left(|d| + \sqrt{d^2 + 2a}\right) - \frac{1}{2}\log\left(2a\right)\right].$$
This transformation has derivative
$$h'(d; a) = \frac{\sqrt{2}}{\sqrt{d^2 + 2a}},$$
and so
$$\text{Var}(h(d)) \approx \left( h'(\delta; a) \right)^2 \times \text{Var}(d) = \frac{2}{\delta^2 + 2a} \times \frac{1}{N} \left(a + \frac{\delta^2}{2}\right) = \frac{1}{N}.$$
Figure 3 compares the two transformation functions, with $h(d)$ in red and $z(d)$ in blue. The functions are very similar over the range $[-2, 2]$ and only begin to diverge when $|d| > 3$, the realm of what would typically be considered _very_ large effect sizes.
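```r
# Plot z(d) and h(d) side by side for each allocation ratio
transform_dat |>
  mutate(R_fac = fct_rev(R_fac) |> fct_relabel(\(x) paste("R ==", x))) |>
  pivot_longer(cols = c(z, h), names_to = "stat", values_to = "val") |>
  ggplot() + 
  aes(d, val, color = stat) + 
  geom_line() + 
  facet_wrap(~ R_fac, labeller = "label_parsed") + 
  theme_minimal() + 
  theme(legend.position = "inside", legend.position.inside = c(0.05, 0.9)) + 
  scale_x_continuous(breaks = seq(-4, 4, 2)) + 
  labs(y = "", color = "")
```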
Figure 3: Comparison of transformation functions given by Haaf and Rouder (2023) and by Hedges and Olkin (1985)
This suggests that any differences in the performance of the transformations will be driven by behavior in the extremes of the distribution and will thus be less pronounced when $N_1$ and $N_2$ are large.
Two-sample simulations
I ran some quick simulations to look at how the sampling variances of $d$, $z(d)$, and $h(d)$ change as a function of the true average effect size $\delta$. These simulations are based on a data-generating process that is consistent with the stated assumptions of Haaf & Rouder (2023), in which two independent samples are drawn from normally distributed populations with a standardized mean difference of $\delta$, with total sample size $N$ and allocation ratio of $R = N_1:N_2$. A variance-stabilizing transformation should produce an estimator with true standard error (i.e., square root of the sampling variance) that is constant, not depending on $\delta$.
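Rather than generating raw data, the simulation code below draws $d$ directly from its exact sampling distribution under this model: letting $\tilde{a} = \sqrt{NR / (R+1)^2}$, the scaled statistic $\tilde{a} \times d$ follows a non-central $t$ distribution with $N - 2$ degrees of freedom and non-centrality parameter $\delta \tilde{a}$.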
```r
library(simhelpers)
library(future)
plan(multisession)

# Generate standardized mean differences d, z(d), h(d)
r_smd <- function(reps, delta = 0, N = 20, R = 1) {
  a <- sqrt(N * R / (R + 1)^2)  # scaling factor relating d to a t-statistic
  ncp <- delta * a
  tstats <- rt(reps, df = N - 2, ncp = ncp)
  d <- tstats / a
  z <- d_to_z(d, R = R)
  h <- d_to_h(d, R = R)
  data.frame(d = d, z = z, h = h)
}

# Summarize mean and variance
calc_M_SE <- function(dat) {
  M <- colMeans(dat, na.rm = TRUE)
  SE <- apply(dat, 2, sd, na.rm = TRUE)
  data.frame(stat = names(dat), M = M, SE = SE)
}

# Simulation driver
sim_d_twosample <- compose(calc_M_SE, r_smd)

# Parameter grid
sim_grid <- expand_grid(
  N = c(10, 30, 50),
  R = 1:3,
  delta = seq(0, 3, 0.1)
) |>
  mutate(reps = 2e4)

# Execute simulations
d_twosample_res <- sim_grid |>
  evaluate_by_row(
    sim_d_twosample, 
    system_time = FALSE,
    verbose = FALSE
  ) |>
  mutate(
    R_fac = factor(R) |> fct_relabel(\(x) paste0("R == ", x, ":1")),
    N_fac = factor(N) |> fct_relabel(\(x) paste("N ==", x)),
    SE_scaled = if_else(stat == "d", SE * sqrt((N - 2) * R / (R + 1)^2), sqrt(N - 2) * SE)
  )
```
Figure 4 shows the relationship between the parameter $\delta$ and the true standard error of the standardized mean difference estimator $d$, along with the true standard errors of the transformed effect size estimators $z(d)$ and $h(d)$. To facilitate comparison between the raw and transformed estimators, I have re-scaled the standard errors according to their values when $\delta = 0$ (multiplying by $\sqrt{(N - 2) R / (R + 1)^2}$ for $d$ and by $\sqrt{N - 2}$ for $z(d)$ and $h(d)$).
```r
ggplot(d_twosample_res) + 
  aes(delta, SE_scaled, color = stat) + 
  geom_hline(yintercept = 0) + 
  geom_line() + 
  facet_grid(N_fac ~ R_fac, scales = "free_y", labeller = "label_parsed") + 
  theme_minimal() + 
  labs(x = expression(delta), y = "True Standard Error (rescaled)", color = "")
```
Figure 4: True standard errors of untransformed $d$ and transformed ($z(d)$ and $h(d)$) effect size estimates under a two-sample homoskedastic normal model. Standard errors are rescaled so that SE = 1 when $\delta = 0$.
Consistent with the delta method approximation, it is evident that the sampling variance of $z(d)$ is not constant, but rather is _decreasing_ in $\delta$. For instance, when $R = 1{:}1$ and $N = 50$, the true SE of $z(d)$ decreases to 87% of its null value as $\delta$ increases to 2. This pattern is the opposite of what we see for the raw effect size $d$, which has standard error that increases to 125% of its null value under the same conditions. Between these two, the transformed $h(d)$ effect sizes have stable SEs across the range of $\delta$. Thus, $h(d)$ is variance stabilizing but $z(d)$ is not.
Dichotomized bivariate normal simulations
In Pustejovsky (2014), I argued that the appropriateness of $d$-to-$r$-to-$z$ transformations and the specific form of transformation to use both depend on features of the study's design. The two-sample normal model would be relevant for experimental studies, and it seems that the $z(d)$ transformation is not variance-stabilizing there. But perhaps it would work for other study designs and data-generating processes? In Pustejovsky (2014), I also looked at extreme groups designs and "dichotomization" designs, where a researcher is interested in the correlation between two continuous variables, one of which has been dichotomized based on some quantile of its distribution. Just out of curiosity, I will check whether the $r$-to-$z$ transformation works for a dichotomization design.
There are several ways to simulate a dichotomization design, which vary depending on a) whether you use a threshold defined by a population quantile or a sample quantile and b) whether, with the population quantile approach, you hold the per-group sample sizes fixed or only the total sample size (both are naturally fixed when defining the threshold based on a sample quantile). I will simulate using either population or sample quantiles, in each case while holding the total sample size fixed.1 This creates a complication with small sample sizes because it's possible to generate data where all of the observations fall above or below the threshold, so the standardized mean difference is undefined. I will handle this by just excluding such cases (i.e., conditioning on $N_1 > 0$ and $N_2 > 0$).
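In the code below, the latent correlation of the bivariate normal distribution is set to $\rho = \delta / \sqrt{\delta^2 + a}$, which is just the $d$-to-$r$ conversion evaluated at $\delta$.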
```r
# Generate standardized mean differences d, z(d), h(d)
r_bivariate <- function(reps, delta = 0, N = 20, R = 1, threshold = "population") {
  require(mvtnorm)
  
  # Latent correlation corresponding to the d-to-r conversion at delta
  rho <- delta / sqrt(delta^2 + (R + 1)^2 / R)
  Sigma <- rho + diag(1 - rho, nrow = 2L)
  
  d <- replicate(reps, {
    Z <- rmvnorm(n = N, mean = c(0, 0), sigma = Sigma)
    if (threshold == "population") {
      X <- ifelse(Z[,1] <= qnorm(R / (R + 1)), "A", "B")
      Xtb <- table(X)
      # Exclude samples where one group is empty
      if (any(Xtb == 0)) return(NA_real_)
    } else {
      X <- ifelse(rank(Z[,1]) <= N * R / (R + 1), "A", "B")
      Xtb <- table(X)
      R <- Xtb[1] / Xtb[2]
    }
    M <- tapply(Z[,2], X, mean)
    V <- tapply(Z[,2], X, var)
    Vpool <- sum((Xtb - 1) * V) / (N - 2)
    as.numeric(diff(M)) / sqrt(Vpool)
  }, simplify = FALSE) |>
    unlist()
  
  z <- d_to_z(d, R = R)
  h <- d_to_h(d, R = R)
  data.frame(d = d, z = z, h = h)
}

# Simulation driver
sim_bivariate <- compose(calc_M_SE, r_bivariate)

# Execute simulations
d_bivariate <- sim_grid |>
  cross_join(tibble(threshold = c("population", "sample"))) |>
  evaluate_by_row(
    sim_bivariate,
    system_time = FALSE,
    verbose = FALSE
  ) |>
  mutate(
    R_fac = factor(R) |> fct_relabel(\(x) paste0("R == ", x, ":1")),
    N_fac = factor(N) |> fct_relabel(\(x) paste("N ==", x)),
    SE_scaled = if_else(stat == "d", SE * sqrt((N - 2) * R / (R + 1)^2), sqrt(N - 2) * SE)
  )
```
The figures below show the true standard errors of $d$, $z(d)$, and $h(d)$ as a function of the standardized mean difference parameter $\delta$. They are constructed using the same layout as Figure 4 (click on the tab for "Population" or "Sample" to see the results for your preferred form of dichotomization). Is $d$-to-$r$-to-$z$ variance stabilizing for the dichotomization design? In short, nope. Applying it here leads to an even stronger mean-variance relation than in the two-sample data-generating process, with larger means producing effect sizes with smaller sampling variation. But the $h(d)$ transformation does not really work either. It too exhibits a negative relation between true SE and $\delta$, though less pronounced than that of $z(d)$. The untransformed effect size $d$ still shows a positive relation between true SE and $\delta$ under some conditions, but it is weaker than under the two-sample data-generating process. For all three effect sizes, the strength of the relation changes depending on sample size, something that was not evident in the two-sample data-generating process.
Population
```r
d_bivariate |>
  filter(threshold == "population") |>
  ggplot() + 
  aes(delta, SE_scaled, color = stat) + 
  geom_hline(yintercept = 0) + 
  geom_hline(yintercept = 1, color = "grey") + 
  geom_line() + 
  facet_grid(N_fac ~ R_fac, scales = "free_y", labeller = "label_parsed") + 
  theme_minimal() + 
  labs(x = expression(delta), y = "True Standard Error (rescaled)", color = "")
```
Figure 5: True standard errors of untransformed $d$ and transformed ($z(d)$ and $h(d)$) effect size estimates under a dichotomized bivariate normal model with a population quantile threshold. Standard errors are rescaled so that SE = 1 when $\delta = 0$.
Sample
```r
d_bivariate |>
  filter(threshold == "sample") |>
  ggplot() + 
  aes(delta, SE_scaled, color = stat) + 
  geom_hline(yintercept = 0) + 
  geom_hline(yintercept = 1, color = "grey") + 
  geom_line() + 
  facet_grid(N_fac ~ R_fac, scales = "free_y", labeller = "label_parsed") + 
  theme_minimal() + 
  labs(x = expression(delta), y = "True Standard Error (rescaled)", color = "")
```
Figure 6: True standard errors of untransformed $d$ and transformed ($z(d)$ and $h(d)$) effect size estimates under a dichotomized bivariate normal model with a sample quantile threshold. Standard errors are rescaled so that SE = 1 when $\delta = 0$.
Observations
In neither data-generating process does the $d$-to-$r$-to-$z$ transformation lead to variance stabilization. For effect sizes that are anything less than very large, the transformations appear to be quite close to linear. Here is a zoomed-in view of both transformations:
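```r
# Same comparison as Figure 3, zoomed in to |d| < 1.5
transform_dat |>
  mutate(R_fac = fct_rev(R_fac) |> fct_relabel(\(x) paste("R ==", x))) |>
  pivot_longer(cols = c(z, h), names_to = "stat", values_to = "val") |>
  filter(abs(d) < 1.5) |>
  ggplot() + 
  aes(d, val, color = stat) + 
  geom_line() + 
  facet_wrap(~ R_fac, labeller = "label_parsed") + 
  theme_minimal() + 
  theme(legend.position = "inside", legend.position.inside = c(0.05, 0.9)) + 
  scale_x_continuous(breaks = -1:1) + 
  scale_y_continuous(breaks = seq(-0.5, 0.5, 0.5)) + 
  labs(y = "", color = "")
```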
Figure 7: Comparison of transformation functions given by Haaf and Rouder (2023) and by Hedges and Olkin (1985), over the range [-1.5, 1.5].
It would seem that applying either transformation will often be inconsequential for one’s conclusions except in meta-analyses with quite large effect size estimates. In meta-analyses where the transformation is consequential, its performance seems to be contingent on the assumed data-generating process, yet meta-analysts rarely have sufficient information to assess distributional assumptions about the individual-level outcomes in primary studies.
A further difficulty with both of the transformations is that they depend on the allocation ratio $R$. In some meta-analyses of experimental studies, it may be that most or all primary studies have equally sized groups. But this will not always be the case. In meta-analyses where primary studies have varied allocation ratios, one is then faced with the question of how to apply the transformation: should you use the correct variance-stabilizing transformation for each effect size, which then puts each primary study _on a different metric_, or should you use a common transformation (i.e., the same value of $a$) and then accept that the transformation will not stabilize all of the effects? Personally, I would opt for neither.
In my view, using a variance-stabilizing transformation of $d$ from an experimental study seems like extra work and more hassle than it's worth. It adds an additional layer of technical complication to a meta-analysis, and the potential gains (in terms of variance stabilization) are contingent on assumptions that one is rarely in a position to adequately assess. Better to work with untransformed effect sizes, acknowledging the extent of the approximations involved and, when necessary, finding other ways to deal with the mean-variance relation of standardized mean differences.
References
Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32(2), 159–188.
Bartoš, F., Maier, M., Wagenmakers, E.-J., Doucouliagos, H., & Stanley, T. D. (2023). Robust Bayesian meta-analysis: Model-averaging across complementary publication bias adjustment methods. Research Synthesis Methods, 14(1), 99–116. https://doi.org/10.1002/jrsm.1594
Haaf, J. M., & Rouder, J. N. (2023). Does every study? Implementing ordinal constraint in meta-analysis. Psychological Methods, 28(2), 472–487. https://doi.org/10.1037/met0000428
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Academic Press.
Pustejovsky, J. E. (2014). Converting from d to r to z when the design uses extreme groups, dichotomization, or experimental control. Psychological Methods, 19(1), 92–112. https://doi.org/10.1037/a0033788
Pustejovsky, J. E., & Rodgers, M. (2018). Testing for funnel plot asymmetry of standardized mean differences. Research Synthesis Methods. https://doi.org/10.1002/jrsm.1332
Footnotes
1. Interested readers could modify the code to handle the further case, with population quantile thresholds and sample sizes fixed per group, by generating observations from skew-normal distributions for each group. See Azzalini (2005) and the `rsn()` function from the {sn} package.