Effect size: a statistical basis for clinical practice

a Graduate student, School of Dentistry, Federal University of Goiás, Goiás, Brazil
b Associate Professor, Division of Prosthesis and Scientific Methodology, School of Dentistry, Federal University of Goiás
c Head, Division of Endodontics, School of Dentistry, Federal University of Goiás, Goiás, Brazil
d Associate Professor, Division of Orthodontics, School of Dentistry, Federal University of Goiás, Goiás, Brazil

ABSTRACT


INTRODUCTION
Effect size (ES) is the statistical measure that quantifies the strength of a phenomenon [1]. It is also known as effect magnitude and applies to different epidemiological designs, including observational and interventional studies. This estimate measures the magnitude of the difference between groups, the strength of the association between variables, and the risk of occurrence of a given event. As this measure is standardized, it allows estimates from different studies to be compared and is used in a pooled manner in meta-analyses [2].
Hypothesis tests and the concept of the P value were the main statistical tools used to analyze the strength of scientific evidence in quantitative studies throughout the twentieth century. However, the use of the P value as a statistical reference to show the effectiveness of treatments has been questioned in several publications [3,4,5], culminating in the 2016 American Statistical Association recommendation to avoid conclusions based exclusively on P values [6]. The classical methodology has certain drawbacks, such as: 1) low reproducibility; 2) dependence on sample size and variance; 3) frequent clinical inconsistency; 4) use of arbitrary cut-off values (P > 0.05 to accept and P < 0.05 to reject the null hypothesis); and 5) presenting dichotomous results only, that is, statistically significant or non-significant [7,8,9,10]. A more appropriate interpretation includes knowing how much one intervention or association is better or greater when compared to another, and not simply whether or not there is a difference or association [11]. The concept of ES serves to fill this gap. Statistical significance and ES are currently regarded as complementary, and it is recommended that they be applied together, especially for the analysis and interpretation of primary outcomes [8,12,13].
Although the ES estimate carries relevant information and its reporting alongside the P value has been widely recommended, few studies have explored or applied this concept in the health field [10]. In that light, the purpose of this article is to describe the conceptual basis of the most common measures of ES and to present information on their application, calculation and interpretation, with a view to helping readers understand the conclusions of studies and thus improve clinical practice.

Data interpretation
ES can be applied to different study designs. In clinical trials, it is used to detect how much better or worse one intervention is when compared to another, and not simply whether or not there is a difference, as explored by the concept of statistical significance [11]. It thus assists in making a clinical decision about the superiority (or otherwise) of a given intervention. When the ES is large enough, it differentiates between two treatments or indicates whether one is preferable to another from a clinical point of view [12]. In observational studies, ES can be understood as the strength of association between outcome and predictor variables, and it suggests not only whether there is an association but also how strong it is [7,14]. In addition, ES can also be expressed by relative risk [7,15].

Meta-analysis
In meta-analysis, ES is extracted from individual primary studies, either directly or from data transformation, and then pooled to synthesize a standardized measure with greater statistical power [10,16]. In this way, the greater accuracy provided by the joint data can be used to resolve controversies between primary studies and give an objective estimate of the scientific evidence. However, the interpretation and comparison of ES require careful consideration of the sources of variability [13].

Sample size calculation and statistical power
ES estimation is applied for calculating sample size (n) and statistical power (1-β) [17]. For the purpose of calculating a reasonable sample size, ES can be estimated from similar published articles, from pilot study results, or from the minimum difference that experts would consider important [2]. An improper n affects the veracity of P values and compromises internal validity. An undersized sample increases the probability of a type II error (β), while an oversized sample (big data) increases the probability of a type I error. Thus, large samples can give rise to a reduced P value, thereby exaggerating the importance of the difference between interventions or associations [18]. For this reason, n should be calculated a priori. Otherwise, a post hoc statistical power analysis could confirm, or not, the validity of the study [15].
Statistical power is the probability of correctly rejecting the null hypothesis [4]. It can be influenced by three factors: level of significance (α), sample size, and ES [10]. As a general rule, the smaller the variance and the greater the ES, the smaller the sample size required to reach a given power, and vice versa [19]. In other words, a strong association between two variables or a large existing difference will be easily detected in the sample, so that a small sample will be able to demonstrate this effect. To be detected with the same power, however, a weak association or small difference will require a larger sample size [20].
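The relationship between ES, α, power and n described above can be illustrated with a minimal sketch. It assumes a two-sided comparison of two independent means with equal group sizes and uses the common normal approximation n = 2((z(1-α/2) + z(1-β))/d)² per group; the function name `n_per_group` is hypothetical, and dedicated software using the t distribution will return slightly larger values.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, given a standardized mean difference d.

    Assumption: normal approximation with equal group sizes.
    """
    if d == 0:
        raise ValueError("d must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


if __name__ == "__main__":
    # Smaller effects require larger samples for the same alpha and power.
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: n per group = {n_per_group(d)}")
```

Under these assumptions, halving the expected ES roughly quadruples the required sample size, which is why the ES assumed a priori should be justified.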

Calculating ES
ES estimates can be calculated by various formulas linked to different statistical tests, and it is advisable to report which formula was used when mentioning ES scores [10,11,13]. The most commonly used measures are grouped here into the following categories: I - Group difference, II - Association strength, III - Risk estimation, and IV - Multivariate data (Table 1).

I - Group difference
This category evaluates the difference between means or frequencies when two or more groups are involved. The difference between means can be expressed in absolute or standardized terms. The simple difference between two means, for example, is the absolute difference, while the standardized difference is scaled by the variability and so is more suitable for comparing multiple studies [2].

I.1 Difference of mean between two groups
The d, g and Δ formulas are used for this purpose. They have similar numerators (the absolute difference between means) but different denominators: the population, pooled, and control group standard deviations, respectively [10].

d
This measure was proposed by Cohen in 1962 and is the most commonly used standardized mean difference [1]. It requires the following assumptions: normal distribution, unpaired groups, and variables measured on a continuous scale [1]. This index is used when the population standard deviation is known (δ formula) or when the pooled standard deviation is used (d formula) (Table 1) [10]. The d mean (dm) index is applied for the paired t test, using the mean of the standard deviations of the two groups [7].

g
This measure was proposed by Hedges in 1982 [10,21]. The formula uses the pooled sample standard deviation and adds the correction factor J = 1 - 3/(4df - 1), where df is the degrees of freedom. This approach is commonly used for groups with different sample sizes, and is also used for small samples (n < 20) or when the population standard deviation is unknown. It is commonly used for the t-test and meta-analysis.

Delta (Δ)
This measure was proposed by Glass in 1976 [10,22]. It is an alternative approach to both d and g when the standard deviations of the groups are significantly different. In this case, the standard deviation of the control group is chosen rather than a combined standard deviation.
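As an illustration of how d, g and Δ share a numerator but differ in the denominator, the following sketch computes all three from raw data. The function names are hypothetical, and the degrees of freedom in the correction factor J are assumed to be n1 + n2 - 2, the conventional choice for two independent groups.

```python
import math
from statistics import mean, stdev


def pooled_sd(exp, ctrl):
    """Pooled (combined) sample standard deviation of two independent groups."""
    n1, n2 = len(exp), len(ctrl)
    var_sum = (n1 - 1) * stdev(exp) ** 2 + (n2 - 1) * stdev(ctrl) ** 2
    return math.sqrt(var_sum / (n1 + n2 - 2))


def cohen_d(exp, ctrl):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    return (mean(exp) - mean(ctrl)) / pooled_sd(exp, ctrl)


def hedges_g(exp, ctrl):
    """Hedges' g: Cohen's d multiplied by the correction J = 1 - 3/(4df - 1).

    Assumption: df = n1 + n2 - 2.
    """
    df = len(exp) + len(ctrl) - 2
    j = 1 - 3 / (4 * df - 1)
    return j * cohen_d(exp, ctrl)


def glass_delta(exp, ctrl):
    """Glass's delta: mean difference divided by the control group SD."""
    return (mean(exp) - mean(ctrl)) / stdev(ctrl)


if __name__ == "__main__":
    experimental = [5, 6, 7, 8, 9]  # fictitious data
    control = [3, 4, 5, 6, 7]       # fictitious data
    print("d =", round(cohen_d(experimental, control), 3))
    print("g =", round(hedges_g(experimental, control), 3))
    print("delta =", round(glass_delta(experimental, control), 3))
```

Because J is always below 1, g is slightly smaller in magnitude than d, and the gap narrows as the samples grow.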

I.2 Difference of mean between more than two groups
Eta squared (η²) and partial eta squared (partial η²)
η² was introduced by Fisher in 1925 [8]. η² and partial η² are the most commonly reported ES estimates for one-way analysis of variance (1-way ANOVA). They were designed to compare three or more groups measured by continuous variables [19]. η² becomes increasingly biased as sample size decreases because it lacks a population correction factor [11]. In addition, in designs with multiple factors, η² tends to underestimate ES as the number of factors increases. For this reason, partial η² is preferable in such designs [11,23]. It should be emphasized that the order of preference is η² > partial η² > ω², according to the sample size [23].

Epsilon squared (ε²) and omega squared (ω²)
ε² and ω² were proposed by Kelley in 1935 and Hays in 1963, respectively [8,11]. These are ES estimates for the ANOVA test that use a population correction factor. ω² and ε² are more conservative than η², thereby reducing small-sample bias [11]. ω² is not appropriate for comparing groups with reduced samples; in such cases, η² should be used [24]. Both ω² and ε² are better suited to comparing ES between studies with the same experimental design [25].
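To make the difference between η², ε² and ω² concrete, the sketch below derives the sums of squares of a one-way ANOVA from raw data and applies the formulas listed in Table 1. The function name and the returned dictionary keys are hypothetical; note that in a one-way design partial η² coincides with η², so it is not computed separately.

```python
from statistics import mean


def anova_effect_sizes(groups):
    """Eta squared, epsilon squared and omega squared for a one-way ANOVA.

    `groups` is a list of lists, one list of observations per group.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    # Between-group, total and residual sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_resid = ss_total - ss_between

    df_between = len(groups) - 1
    df_resid = len(all_values) - len(groups)
    ms_resid = ss_resid / df_resid  # mean square of residuals

    corrected = ss_between - df_between * ms_resid
    return {
        "eta2": ss_between / ss_total,
        "epsilon2": corrected / ss_total,
        "omega2": corrected / (ss_total + ms_resid),
    }


if __name__ == "__main__":
    data = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]  # fictitious data
    for name, value in anova_effect_sizes(data).items():
        print(f"{name} = {value:.3f}")
```

With positive residual variance the three estimates are ordered η² > ε² > ω², which reflects the more conservative behaviour of the corrected indices.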

I.3 Frequency difference
This subcategory is used for nominal qualitative data. It compares the difference in frequency between groups, and the typical statistical test involved is the chi-square test.

phi (φ)
The φ correlation coefficient was proposed by Karl Pearson and is used in a 2×2 contingency table. The values range from -1 to 1 [13].
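A minimal sketch of the frequency-based measures follows, computing φ and Cramér's V from the chi-square statistic as listed in Table 1. The function names are hypothetical and no continuity correction is applied. Because it is derived from χ², the φ computed here is non-negative; the signed version that ranges from -1 to 1 requires the cell-product formula instead.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and total n for a contingency table
    given as a list of rows (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def phi_coefficient(table):
    """phi = sqrt(chi2 / n), intended for 2x2 tables."""
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / n)


def cramers_v(table):
    """V = sqrt(chi2 / (n * k)), where k is the smaller of
    (rows - 1) and (columns - 1)."""
    chi2, n = chi_square(table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    counts = [[30, 10], [10, 30]]  # fictitious 2x2 table
    print("phi =", phi_coefficient(counts))
    print("V =", cramers_v(counts))
```

For a 2×2 table the two indices coincide; V extends the same idea to larger tables.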

II - Association strength
This category evaluates the strength of the shared variance between two variables (predictor and outcome).

Pearson (r) and Spearman (rs) correlation coefficients
r was proposed by Karl Pearson and rs by Charles Spearman. r measures the strength of association between two continuous variables and assumes normality and homoscedasticity. rs is the nonparametric version, indicated for ordinal variables and for continuous variables that are not normally distributed. Neither has a unit of measurement, and both range from -1 to +1 [26,27].
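The relationship between the two coefficients can be sketched as follows: Spearman's rs is Pearson's r applied to the ranks of the data. The helper names are hypothetical, and ties are assumed to receive the average of the ranks they span.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation from deviations around the means."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(
        sum((b - my) ** 2 for b in y)
    )
    return num / den


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]        # fictitious data
    y = [1, 4, 9, 16, 25]      # monotonic but non-linear in x
    print("r  =", round(pearson_r(x, y), 4))
    print("rs =", round(spearman_rs(x, y), 4))
```

In this fictitious example the relationship is perfectly monotonic but not linear, so rs reaches its maximum while r stays below 1.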

R²
R² is called the "coefficient of determination", also referred to as r² or r-squared. It is calculated as the square of r, ranges from 0 to 1, and is used in regression analysis [7]. When multiplied by 100, R² gives the percentage of the variance of either variable that is shared with the other variable.

f² (Cohen)
f² is recommended for comparing more than two groups by means of repeated measurements in regression and hierarchical linear models, whose objective is to evaluate the relationship of the variable with the outcome. The global effect size is derived from R², where R is the correlation between the dependent and independent variables (both continuous) [28]. The local effect size can be estimated by modifying the global formula, where RB represents the variable of interest and RA the other variables. RAB is the combined ratio [28].
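The global and local versions of f² can be sketched with Cohen's standard expressions, f² = R²/(1 - R²) and f² = (R²AB - R²A)/(1 - R²AB). The function and parameter names are hypothetical; R²A is assumed to come from the model containing only the other variables and R²AB from the model that also contains the variable of interest.

```python
def cohen_f2_global(r2):
    """Global f2 for a whole model: R2 / (1 - R2)."""
    if not 0 <= r2 < 1:
        raise ValueError("r2 must be in [0, 1)")
    return r2 / (1 - r2)


def cohen_f2_local(r2_ab, r2_a):
    """Local f2 for the variable of interest B.

    r2_ab: R2 of the model with the other variables A plus B.
    r2_a:  R2 of the model with the other variables A only.
    """
    if not 0 <= r2_a <= r2_ab < 1:
        raise ValueError("expected 0 <= r2_a <= r2_ab < 1")
    return (r2_ab - r2_a) / (1 - r2_ab)


if __name__ == "__main__":
    # Fictitious values: adding B raises R2 from 0.40 to 0.50.
    print("global f2 =", cohen_f2_global(0.50))
    print("local f2  =", round(cohen_f2_local(0.50, 0.40), 3))
```

The local index expresses the variance uniquely explained by the variable of interest relative to the variance left unexplained by the full model.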

III - Risk estimation
This category compares the chance or risk of an outcome between two or more groups [2]. A value of 1 represents no effect.

Odds Ratio (OR)
OR is an association measure applied to cross-sectional and retrospective (case-control) designs. It is reported as an ES index obtained from contingency tables (association between exposure and outcome) [29]. OR compares, between two groups, the odds of a given event, that is, the chance of its occurrence against the chance of its non-occurrence [23].

Relative Risk (RR)
RR is an association measure applied to prospective designs (clinical trials and cohort studies). It is the ratio of the incidence observed in the exposed group to that observed in the non-exposed group [7]. The score ranges from 0 upward, with values below 1 indicating a lower risk and values above 1 a higher risk in the exposed group, and it can be expressed as a percentage when multiplied by 100.
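The two risk measures can be contrasted on the same 2×2 table. The sketch below assumes the usual layout, with a and b the exposed participants with and without the event and c and d the non-exposed participants with and without the event; the function names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """OR = (a / b) / (c / d): odds of the event in the exposed group
    divided by the odds in the non-exposed group."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """RR = incidence in the exposed group / incidence in the
    non-exposed group."""
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return incidence_exposed / incidence_unexposed


if __name__ == "__main__":
    # Fictitious data: 20 of 100 exposed and 10 of 100 non-exposed
    # participants develop the outcome.
    print("OR =", odds_ratio(20, 80, 10, 90))
    print("RR =", relative_risk(20, 80, 10, 90))
```

Both measures equal 1 when exposure has no effect, and the OR moves further from 1 than the RR as the outcome becomes more common.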

IV - Multivariate data
This category deals with multivariate analysis, which plays a crucial role in understanding complex data sets that require the simultaneous examination of several variables.

Adjusted eta squared (η² adjusted)
η² adjusted is used for comparing three or more groups with predictor (independent) variables in multivariate analysis (MANOVA and ANCOVA). Its main advantage over η² is that it analyzes the effect of a specific variable while controlling for the effect of the other variables in the study (Hays, 1994). It has also been proposed for improving the comparability of ES findings between studies with the same methodological design.

Adjusted R squared (R² adjusted)
R² has a corrected version called R² adjusted, used in multiple regression, which seeks to correct the variance errors shared by multiple predictors [11].
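A short sketch of the adjustment follows, assuming the standard expression R² adjusted = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the sample size and k the number of predictors. The function name is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a multiple regression with n observations
    and k predictor variables."""
    if n - k - 1 <= 0:
        raise ValueError("n must exceed k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


if __name__ == "__main__":
    # Fictitious values: the same R2 is penalized more as predictors are added.
    for k in (1, 2, 5):
        print(f"k = {k}: adjusted R2 = {adjusted_r2(0.50, n=21, k=k):.3f}")
```

The penalty grows with the number of predictors and shrinks as the sample size increases, so the adjusted value is never larger than the unadjusted R².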

INTERPRETING ES
The extraction of maximum useful information from statistical research data helps the researcher to interpret results. ES estimates describe the observed effect and approximate the practical relevance of the study. In addition, alongside the statistical significance test, they emphasize the power of the tests and reduce the influence of random error due to mere sampling variation. In general, the larger the ES, the larger the effect and impact caused by the variable under study. It should be noted that the effect size being sought is that of the population, but as this value is not available, the sample effect size should be used to estimate the probable effect size of the population [15].
There is no consensus as to what constitutes a small, moderate, or large ES, because there are many different means of calculation and variations differ depending on the field of investigation [1,29]. According to Cohen (1988), d values are classified as small (0.20 ≤ d ≤ 0.49), medium (0.50 ≤ d ≤ 0.79) and large (d ≥ 0.80) (Table 2) (Figure 1) [1]. ES estimates should also be reported in conjunction with a confidence interval (CI) because a sample ES is a random variable. A wide CI should be interpreted with caution due to imprecision [30]. A large ES without statistical significance implies that the sample size needs to be increased, while the opposite, statistical significance (P value) in conjunction with a small ES, implies that significance occurred only because of the large sample size (Table 3) [30,31]. Statistical errors can be better detected by reporting the CI and ES estimates (Figure 2) [20].
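The interplay between ES, CI and sample size can be sketched as follows. The classification uses Cohen's cut-offs for d as quoted above; the label "below small" for values under 0.20 is a hypothetical addition. The CI relies on a common large-sample approximation for the standard error of d, SE = sqrt((n1 + n2)/(n1·n2) + d²/(2(n1 + n2))), which is an assumption of this sketch and not a formula from Table 1.

```python
import math
from statistics import NormalDist


def classify_d(d):
    """Label the magnitude of d using Cohen's conventional cut-offs."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "below small"  # hypothetical label for d under 0.20


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate CI for d (large-sample normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z * se, d + z * se


if __name__ == "__main__":
    # Fictitious scenario: the same medium ES observed in a small
    # and in a large study.
    for n in (20, 200):
        low, high = d_confidence_interval(0.5, n, n)
        print(f"n per group = {n}: d = 0.5 ({classify_d(0.5)}), "
              f"95% CI {low:.2f} to {high:.2f}")
```

In this fictitious scenario the interval from the smaller study includes zero while the interval from the larger study does not, although the point estimate and its label are identical.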
Effect size for clinical practice | Barros et al.

d of Cohen: $d = \dfrac{M_1 - M_2}{SD_{pooled}}$, with $SD_{pooled} = \sqrt{\dfrac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}$ (combined standard deviation)
$M_1$ = mean of the experimental group; $SD_1$ = standard deviation of the experimental group; $n_1$ = sample size of the experimental group; $M_2$ = mean of the control group; $SD_2$ = standard deviation of the control group; $n_2$ = sample size of the control group

δ: $\delta = \dfrac{\mu_1 - \mu_2}{\sigma}$ (population standard deviation)
$\mu_1$ = population mean of the experimental group; $\mu_2$ = population mean of the control group; $\sigma$ = population standard deviation

d m: $d_m = \dfrac{M_1 - M_2}{(SD_1 + SD_2)/2}$ (mean of the standard deviations)
$M_1$ = mean of the experimental group; $M_2$ = mean of the control group; $SD_1$ = standard deviation of the experimental group; $SD_2$ = standard deviation of the control group

g of Hedges: $g = J \cdot \dfrac{M_1 - M_2}{SD_{pooled}}$, with $J = 1 - \dfrac{3}{4\,df - 1}$ (combined standard deviation)
$M_1$ = mean of the experimental group; $SD_1$ = standard deviation of the experimental group; $n_1$ = sample size of the experimental group; $M_2$ = mean of the control group; $SD_2$ = standard deviation of the control group; $n_2$ = sample size of the control group; $df$ = degrees of freedom (n - 1)

Δ of Glass: $\Delta = \dfrac{M_1 - M_2}{SD_{control}}$
$M_1$ = mean of the experimental group; $M_2$ = mean of the control group; $SD_{control}$ = standard deviation of the control group

η²: $\eta^2 = \dfrac{SS_E}{SS_T}$
$SS_E$ = sum of squares for the exposure variable; $SS_T$ = total variance of the outcome variables

partial η²: $\eta^2_{partial} = \dfrac{SS_E}{SS_E + SS_R}$
$SS_E$ = sum of squares for the exposure variable; $SS_R$ = sum of squared errors

ε²: $\varepsilon^2 = \dfrac{SS_B - df_B \cdot MS_R}{SS_T}$
$SS_B$ = between-group sum of squares; $SS_T$ = total variance; $MS_R$ = mean square of residuals; $df_B$ = degrees of freedom between groups

ω²: $\omega^2 = \dfrac{SS_B - df_B \cdot MS_R}{SS_T + MS_R}$
$SS_B$ = between-group sum of squares; $SS_T$ = total variance; $MS_R$ = mean square of residuals; $df_B$ = degrees of freedom between groups

phi (φ): $\varphi = \sqrt{\dfrac{\chi^2}{n}}$
$\chi^2$ = chi-square of independence; $n$ = sample size

V of Cramér: $V = \sqrt{\dfrac{\chi^2}{n \cdot df}}$
$\chi^2$ = chi-square of independence; $n$ = sample size; $df$ = the lower of the degrees of freedom for the number of rows and columns

r xy of Pearson: $r_{xy} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}$
$\sum$ = sum; $x$ and $y$ = variable values; $\bar{x}$ and $\bar{y}$ = simple arithmetic means of the x and y values

r of Spearman: $r_s = \dfrac{\sum (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum (R_i - \bar{R})^2}\,\sqrt{\sum (S_i - \bar{S})^2}}$
$R_i$ and $S_i$ = values sorted by ranks; $\bar{R}$ and $\bar{S}$ = mean (average) of R and S; $n$ = number of pairs; $\sum$ = sum

R²: $R^2 = \dfrac{SS_E}{SS_T}$
$SS_E$ = sum of squares for the effects; $SS_T$ = data variance

f²: $f^2 = \dfrac{R^2}{1 - R^2}$ (global); $f^2 = \dfrac{R^2_{AB} - R^2_A}{1 - R^2_{AB}}$ (local)
$R_B$ = variable of interest; $R_A$ = other variables; $R_{AB}$ = combined ratio

OR: $OR = \dfrac{X / (1 - X)}{Y / (1 - Y)}$
$X$ = outcome probability in the treatment group; $Y$ = outcome probability in the control group

RR: $RR = \dfrac{X}{Y}$ (association between exposure and outcome)
$X$ = outcome incidence in the exposed group; $Y$ = outcome incidence in the non-exposed group

η² adjusted: expressed in terms of $SS_E$ and $SS_R$
$SS_E$ = sum of squares for the effects; $SS_R$ = residual sum of squares

R² adjusted: $R^2_{adjusted} = 1 - \left[\dfrac{(1 - R^2)(n - 1)}{n - k - 1}\right]$
$n$ = sample size; $k$ = number of predictor variables; $R^2$ = squared correlation coefficient

Figure 2. Confidence interval to support effect size interpretation (fictitious data).

Table 1. Formulas for effect size estimation

Table 2. Common effect size uses and values
* Variable according to the decision context and comparative value of the specific research area. SD (standard deviation); ANOVA (analysis of variance); ANCOVA (analysis of covariance); MANOVA (multivariate analysis of variance).

Table 3. Interpretation of results based on data variation (fictitious data)