Abstract
Background Most quantitative tests do not perfectly discriminate between subjects with and without a given disease and their results do not always allow certainty about disease status for diagnostic or screening purposes. We propose a method to construct a three-zone partition for quantitative tests to avoid the binary constraint of a ‘black or white’ decision that often does not fit the reality of clinical or screening practice. This partition intentionally includes a grey zone between positive and negative conclusions.
Methods and Results We show that the width of this grey zone depends on the difference between the means of test results for subjects with and without the disease, the variability of the test results and its components (biological, measurement), and the level of the misclassification risks (false positive, false negative) required by the context of use. We illustrate the method by application to the tuberculin skin test and iron deficiency markers in children.
Conclusion This method can be used both to display the discriminatory performance of a quantitative test in a variety of contexts and to scrutinize its components of variability. Due to the simplicity of the graphical representations, the grey zone approach may be useful during the development of quantitative tests and the publication of their performance.
Keywords Discrimination, diagnostic, screening, quantitative test, reliability, measurement, graphical method
Diagnostic and screening discrimination problems require a rule that enables a new subject to be assigned to the correct population, e.g. with or without a given disease, with the lowest rate (or cost) of misclassification. In the case of a quantitative test and a binary decision, the discrimination rule classifies as diseased a subject whose value is above (or below) an optimal cutpoint, determined under given constraints (e.g. rate or cost of false positives/negatives).1,2 However, since most tests do not discriminate perfectly between subjects with and without a given disease, certainty about disease status cannot be obtained for results within a given range of (intermediate) values. To deal with this problem, the construction of a three-zone partition, including a middle inconclusive zone of intermediate values has been proposed3 and applied to categorical and ordinal tests.4,5
In this paper, we extend the grey zone approach to quantitative diagnostic and screening tests, and illustrate it by application to the tuberculin skin test and iron deficiency markers in children. The graphical representations allowed by this approach are intended to help in the development of quantitative tests, and the evaluation and reporting of their measurement properties.
Construction of the grey zone for diagnostic or screening discrimination
Definition and construction of the grey zone
We define the grey zone of a quantitative test as an area of values where the discriminatory performance is ‘insufficient’, in the sense that a value in the grey zone does not allow the target disease to be scored as either present or absent. It is thus the range of values that do not eliminate uncertainty about the disease status.
In situations where perfect discrimination between subjects with and without a given disease is possible, e.g. some IgM-antibody serological tests, the construction of a grey zone is irrelevant. However, there is often a significant overlap between distributions of test results for subjects with and without the disease, and the grey zone may be wide. Its width obviously depends on the level of the overlap of distributions due to ‘true’ overlap but also on the measurement error. It also depends on the requirement of the clinical or screening context in terms of likelihood ratios (LR). Indeed, to confirm or exclude the presence of a target disease, high positive LR (LR+) and low negative LR (LR−) values are necessary to ensure post-test probabilities close to 1 or 0 for a range of pre-test probability values. The required degree of closeness of post-test probabilities to 1 or 0 depends on the context. Sometimes, post-test probabilities over 0.99 or even 0.999 (or under 0.01 or 0.001) are required to confirm or exclude the presence of a target disease. Examples include confirming the diagnosis of human immunodeficiency virus (HIV) infection to initiate antiretroviral therapy or to exclude Down’s syndrome by a prenatal screening test. Otherwise, clinicians or Public Health professionals may find slightly lower probabilities e.g. 0.95 (0.05) sufficient to decide whether a target disease is present.
Once the analysis of the context has provided suitable values of LR, the identification of the two cut-off points delimiting the grey zone is straightforward: one, gup, associated with the minimal desirable value of LR+; the other, glow, associated with the maximal desirable value of LR−. These cut-off points define an area of inconclusive values: the grey zone.
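The construction described above can be sketched numerically. The example below is only an illustration under stated assumptions: the two normal distributions are hypothetical (they are not the reference data used in this paper), a result is called positive when it exceeds the cut-off, and the LR requisites are taken as LR+ = 10 and LR− = 0.1. The helper names (`lr_pos`, `lr_neg`, `solve_increasing`) are introduced here for illustration.

```python
from statistics import NormalDist

# Hypothetical distributions (not from the paper): test values are assumed
# normal, with higher values in subjects with the disease.
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)


def lr_pos(c):
    """LR+ at cut-off c: sensitivity / (1 - specificity)."""
    return (1 - diseased.cdf(c)) / (1 - healthy.cdf(c))


def lr_neg(c):
    """LR- at cut-off c: (1 - sensitivity) / specificity."""
    return diseased.cdf(c) / healthy.cdf(c)


def solve_increasing(f, target, lo, hi, iters=200):
    """Bisection for f(c) = target, assuming f increases on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Upper limit: smallest cut-off reaching the minimal desirable LR+.
g_up = solve_increasing(lr_pos, 10.0, healthy.mean, diseased.mean)
# Lower limit: largest cut-off still below the maximal desirable LR-.
g_low = solve_increasing(lr_neg, 0.1, healthy.mean, diseased.mean)

print(f"grey zone: {g_low:.2f} to {g_up:.2f}")
```

With empirical data, the same two limits would be read from plots of LR+ and LR− against the test values rather than solved for analytically. If the two cut-offs cross (g_low above g_up), no grey zone is needed for the chosen requisites.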
The tuberculin skin test
Results of the tuberculin skin test are important for care management decisions, and in particular whether to initiate antituberculous therapy in HIV-seropositive patients.6–9 Let us suppose that clinicians in a specialized AIDS unit want to use the tuberculin skin test to rule in or rule out a diagnosis of tuberculosis in an HIV-1 infected patient with signs of ‘probable’ tuberculosis, the pre-test probability being estimated to be between 0.30 and 0.70. Clinicians require a post-test probability of disease of (1) about 0.95 after a positive result, and thus an LR+ over 44, to accept the hypothesis and treat the patient without further exploration; and (2) about 0.05 after a negative result, and thus an LR− under 0.022, to reject the diagnosis and seek another (these values are subjective probabilities provided by clinicians working in such AIDS units). The construction of a grey zone (Figure 1, panels A and B) for these LR values (using reference data on the distribution of tuberculin skin test results in healthy and tuberculosis-infected subjects10) gives the interval 7.8–16.6 mm. An immediate inspection of the width of the grey zone shows the poor discrimination ability of this test in the context considered: the grey zone corresponds to one-third of the range of possible values. Also, the usual cut-off point of 10 mm, shown in bold in Figure 1, is clearly located well within the grey zone.
The reticulocyte haemoglobin content test
Early identification of iron deficiency in children is important for prevention of several systemic complications including the impairment of mental and motor development. A recent study11 suggested that reticulocyte haemoglobin content (CHr) could be used for screening children for iron deficiency. We used the data reported to derive the distributions of CHr in iron-deficient children and children with satisfactory iron status: these distributions were largely overlapping (Figure 2, panel A). We constructed a grey zone for the main use of the test discussed by the authors: to detect iron deficiency, in order to treat it, in a population where the frequency of the disease is about 10%. To construct the grey zone, we considered that the post-test probability of disease after a negative result should be below 0.1% (due to the severe developmental consequences of failure to detect), which implies an LR− under 0.01. We also considered that the treatment, which has few side effects, could usefully be given if the post-test probability after a positive result was over 0.70, which implies an LR+ over 20 (as in the first example, these values are subjective probabilities obtained by interviewing paediatricians). The grey zone constructed with such values of LR (Figure 2, panel B) is the interval 22.0–28.2 pg, i.e. more than one-third of the range of observed values. Note that the ‘optimal cut-off value’ suggested by the authors, 26 pg (based on receiver operating characteristic [ROC] curve analysis), is clearly inside the grey zone. Also note that the grey zone would have approximately the same limits for every value of LR+ ⩾ 20 and LR− ⩽ 0.1 (Figure 2, panel B). In this example, the grey zone width appears relatively insensitive to LR requisites.
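The LR requisites quoted in the two examples follow from the odds form of Bayes' theorem (post-test odds = pre-test odds × LR). The short sketch below reproduces them; the function names are introduced here for illustration. For the tuberculin rule-out threshold, the pre-test probability is taken as 0.70, the upper end of the range used for this example when the proportion of grey results is computed, since this is the value that recovers an LR− close to 0.022.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def required_lr(pre_test, post_test):
    """Likelihood ratio needed to move pre_test to post_test."""
    return odds(post_test) / odds(pre_test)


# Tuberculin skin test: rule in at 0.95 from the least favourable pre-test
# probability (0.30); rule out at 0.05 from the least favourable one (0.70).
tb_rule_in = required_lr(0.30, 0.95)     # about 44.3
tb_rule_out = required_lr(0.70, 0.05)    # about 0.0226

# CHr screening: pre-test probability 0.10, targets 0.70 and 0.001.
iron_rule_in = required_lr(0.10, 0.70)    # about 21
iron_rule_out = required_lr(0.10, 0.001)  # about 0.009

print(round(tb_rule_in, 1), round(tb_rule_out, 4),
      round(iron_rule_in, 1), round(iron_rule_out, 4))
```

The computed values (about 44.3, 0.0226, 21 and 0.009) are close to the rounded thresholds used in the text (44, 0.022, 20 and 0.01).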
The grey zone and diagnostic or screening discrimination
Clinicians need conclusive tests that allow the diagnostic process to be terminated as quickly but as confidently as possible.12 For quantitative tests, this means that cutpoints associated with very high LR+ and very low LR− should be identified to rule in or to rule out a diagnostic hypothesis.
Conclusive tests are also required for screening.13 There should be as few false negative cases as possible, and false positives are also unwelcome because they are often further investigated by more invasive or costly methods. The grey zone approach would allow a differentiated attitude towards results: definitely negative (no further action), most certainly positive (requiring verification) and ‘grey’ (requiring another test or a follow-up).
For a given clinical or screening context, and a range of estimated pre-test probability values, our method can therefore be used with a candidate quantitative test to construct a three-zone partition including the grey zone according to LR requisites. Similarly, it could help evaluation of the discriminatory performance of a test in various clinical or screening contexts with different LR requisites, and help to choose between several tests or thresholds in a given context.
Identifying the proportion of results that will fall within the grey zone will also help assess the usefulness of a test in practice (see below).
Entering the grey zone and analysing its determinants
The limits of the grey zone are determined by an analysis of the context and its requirements in terms of LR (according to the usual range of pre-test probabilities and targeted post-test probabilities). However, the mathematical management of LR is relatively complex, and dialogue with non-specialists about LR can be difficult.14 The presentation and discussion of the grey zone concept therefore needs to be as simple as possible. Fortunately, once the limits of the grey zone (gup and glow) have been determined by an analysis of LR, the reasoning and the interpretation of the grey zone may be pursued with concepts and values of sensitivity and specificity that are much more understandable by non-specialists.
Risks of misclassification and expected proportion of test values in the grey zone
The construction of a grey zone for a test therefore implies three possible responses: ‘positive’, ‘inconclusive’ or ‘grey’, and ‘negative’. A subject with the disease should ideally be classified as ‘positive’ and a subject without the disease as ‘negative’, and consequently there are four risks of misclassification (Figure 3):
the risk of a subject with the disease being classified as ‘negative’, called λ (lambda) and determined by the value of the lower limit of the grey zone; λ is 1 minus the sensitivity of the test at glow;
the risk of a subject without the disease being classified as ‘positive’, called υ (upsilon) and determined by the value of the upper limit of the grey zone; υ is 1 minus the specificity of the test at gup;
the risk of a subject with the disease being classified as ‘grey’, called υ′ (as it mainly depends on the value of the upper limit of the grey zone); υ′ is 1 minus the sensitivity of the test at gup minus λ;
the risk of a subject without the disease being classified as ‘grey’, called λ′ (as it mainly depends on the value of the lower limit of the grey zone); λ′ is 1 minus the specificity of the test at glow minus υ.
Estimating these risks is straightforward using a plot of sensitivity and specificity (Figure 4).
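When the distributions can be treated as normal, the four risks can also be computed directly from the two limits. The sketch below assumes hypothetical distributions and limits (none of these values come from the paper) and higher test values in subjects with the disease; the function name is introduced here for illustration.

```python
from statistics import NormalDist


def misclassification_risks(g_low, g_up, healthy, diseased):
    """Return (lambda, upsilon, upsilon_prime, lambda_prime)."""
    lam = diseased.cdf(g_low)                 # diseased called 'negative'
    ups = 1 - healthy.cdf(g_up)               # healthy called 'positive'
    ups_prime = diseased.cdf(g_up) - lam      # diseased falling in the grey zone
    lam_prime = healthy.cdf(g_up) - healthy.cdf(g_low)  # healthy in the grey zone
    return lam, ups, ups_prime, lam_prime


# Hypothetical example values.
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)
g_low, g_up = 0.5, 1.5

lam, ups, ups_prime, lam_prime = misclassification_risks(
    g_low, g_up, healthy, diseased)
print(f"lambda={lam:.3f} upsilon={ups:.3f} "
      f"upsilon'={ups_prime:.3f} lambda'={lam_prime:.3f}")
```

For each population the three outcomes are exhaustive: λ + υ′ plus the sensitivity at gup equals 1, and υ + λ′ plus the specificity at glow equals 1.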
Proportion of results in the grey zone
The proportion of results in the grey zone is of utmost importance when considering the usefulness of a diagnostic or screening test. This proportion can be easily computed using collected (real) data. It can also be evaluated a priori. Indeed, the probability of a value being in the grey zone p(g), depends on the risks λ′ and υ′ and the probability of having the disease, p(D):
\[p(g)\ =\ {\lambda}{^\prime}[1\ {-}\ p(D)]\ +\ {\upsilon}{^\prime}p(D)\]
The tuberculin skin test
The grey zone determined above in the context of a probable diagnosis of tuberculosis (p(D) between 0.30 and 0.70) is 7.8–16.6 mm. These limits correspond (Figure 1, panel C) to υ = λ = 0.025, υ′ = 0.435 and λ′ = 0.295, and the expected proportion of values inside the grey zone would therefore be between 0.34 and 0.39.
The reticulocyte haemoglobin content test
The grey zone determined above in the context of screening in a population where p(D) is 0.10 is 22.0–28.2 pg. These limits give λ = 0.04, υ = 0.02, λ′ = 0.46 and υ′ = 0.83 and the expected proportion of values inside the grey zone would be 0.50 (Figure 2, panel C).
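The expected proportions quoted for the two examples can be checked by applying the formula for p(g) to the reported risks. The function name below is introduced for illustration; the numerical inputs are those given in the text.

```python
def grey_proportion(lam_prime, ups_prime, p_disease):
    """p(g) = lambda' * [1 - p(D)] + upsilon' * p(D)."""
    return lam_prime * (1 - p_disease) + ups_prime * p_disease


# Tuberculin skin test: lambda' = 0.295, upsilon' = 0.435, p(D) from 0.30 to 0.70.
tb_low = grey_proportion(0.295, 0.435, 0.30)
tb_high = grey_proportion(0.295, 0.435, 0.70)

# CHr screening: lambda' = 0.46, upsilon' = 0.83, p(D) = 0.10.
chr_screen = grey_proportion(0.46, 0.83, 0.10)

print(round(tb_low, 2), round(tb_high, 2), round(chr_screen, 2))
# prints 0.34 0.39 0.5
```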
Width of the grey zone and its determinants
Once the requisites in terms of LR values have been determined by the analysis of the clinical or screening context, the grey zone can be constructed for the test under consideration. The width of this grey zone depends on the overlap of the distributions of test values for subjects with and without the disease, and in turn, on the difference of location and level of dispersion of these distributions.
Where normal distributions of test values can be obtained, possibly after transformation, the limits of the grey zone gup and glow can be expressed in a relatively simple analytical form (Appendix) and computed with the simplest parameters of the distributions of the test result:
\[g_{up}\ =\ \mathit{{\bar{X}}_{H}}\ +\ \mathit{z}_{{\upsilon}}\mathit{s}_{\mathit{H}}\ and\ g_{low}\ =\ \mathit{{\bar{X}}_{D}}\ {-}\ \mathit{z}_{{\lambda}}\mathit{s}_{\mathit{D}}\ =\ \mathit{{\bar{X}}_{D}}\ {-}\ \mathit{kz}_{{\lambda}}\mathit{s}_{\mathit{H}}\]
where X̄H(sH) and X̄D (sD = ksH) are the sample means (standard deviations) of the test results for subjects without and with the disease, respectively (we suppose the test gives higher values in subjects with the disease); and zυ and zλ are the upper (1 − υ)th and (1 − λ)th quantiles of the standard normal distribution.
The width of the grey zone increases with the overall variability of the test results and with the stringency of the risk requirements (the smaller the accepted risks υ, of false positive, and λ, of false negative, the larger zυ and zλ); it decreases as the true difference between the means of the distributions increases (Appendix).
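The normal-case expressions for gup and glow are easy to evaluate. The sketch below uses hypothetical sample statistics (not taken from the paper) and the standard library's normal quantile function; the function name is introduced here for illustration.

```python
from statistics import NormalDist


def grey_zone_normal(mean_h, s_h, mean_d, s_d, ups, lam):
    """Limits of the grey zone when both distributions are normal.

    g_up  = mean_h + z_ups * s_h
    g_low = mean_d - z_lam * s_d
    """
    z_ups = NormalDist().inv_cdf(1 - ups)   # upper (1 - upsilon)th quantile
    z_lam = NormalDist().inv_cdf(1 - lam)   # upper (1 - lambda)th quantile
    return mean_d - z_lam * s_d, mean_h + z_ups * s_h


# Hypothetical values: means 0 and 2, common standard deviation 1.
g_low, g_up = grey_zone_normal(0.0, 1.0, 2.0, 1.0, ups=0.025, lam=0.025)
width = g_up - g_low

# Stricter risks widen the zone; a larger difference in means narrows it.
g_low_strict, g_up_strict = grey_zone_normal(0.0, 1.0, 2.0, 1.0, 0.01, 0.01)
g_low_far, g_up_far = grey_zone_normal(0.0, 1.0, 3.0, 1.0, 0.025, 0.025)

print(f"width at 0.025: {width:.2f}")
print(f"width at 0.01: {g_up_strict - g_low_strict:.2f}")
print(f"width with larger difference: {g_up_far - g_low_far:.2f}")
```

A negative width means that the two limits cross, so that a single cut-off would satisfy both risk requirements.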
The grey zone for two or several tests
When tests are performed sequentially, it is sufficient to replace the pre-test probability of the second (or (n + 1)th) test by the post-test probability of the first (or nth) test to construct the partition. When the results of the tests are obtained simultaneously, a multidimensional graphical display of the grey zones can be constructed. We illustrate this case with two tests for screening for iron deficiency in children.
The authors of the study detailed above11 reported data for several markers of iron deficiency in children. We used these data to establish distributions in iron-deficient and healthy children, and further to construct the grey zones according to the screening context (pre- and post-test probabilities and LR requisites as detailed above). Figure 5 shows the graphical representation of the grey zones for both CHr ([22.0–28.2 pg], see above) and mean corpuscular haemoglobin (MCH, [20.9–28.8 pg]): the CHr grey zone is narrower than the MCH grey zone, for which the expected proportion of grey values is 0.92! Plotting the data for individuals (not available in this report) onto this Figure would have shown that the proportion of subjects ‘grey’ for both tests (in the central grey intersection zone) is smaller than that observed with each test individually.
The same approach can be used to evaluate the effect of the repetition of the same test. However, a more complete approach to repetition is presented in the next section.
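For sequential use, the updating of probabilities from one test to the next can be sketched as follows. The pre-test probability and the likelihood ratios are hypothetical, and the function name is introduced for illustration. The sketch also assumes that the likelihood ratio of the second test remains valid in subjects already tested with the first, which must be checked in practice.

```python
def post_test_probability(pre_test, lr):
    """Bayes' theorem on the odds scale: post-test odds = pre-test odds * LR."""
    post_odds = pre_test / (1 - pre_test) * lr
    return post_odds / (1 + post_odds)


# Hypothetical sequence: pre-test probability 0.10, then two positive results.
after_first = post_test_probability(0.10, 20.0)
# The post-test probability of the first test is the pre-test probability
# of the second one.
after_second = post_test_probability(after_first, 5.0)

print(round(after_first, 3), round(after_second, 3))
```

The grey zone of the second test is then constructed with LR requisites derived from `after_first` rather than from the initial pre-test probability.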
The grey zone and the evaluation and minimization of the measurement error
Quantitative tests are subject to measurement imprecision due to the limitations of the observer or the method or both. Moreover, within-subject biological variation (e.g. postprandial or circadian) may be significant.15 Our method can also be used to display the reliability of the test and scrutinize the components of variance. Assessment of reliability is based on the analysis of variance and the computation of intraclass correlation coefficients (ICC).16–18 The normality of distributions and similarity of variances, which are generally obtained after a suitable transformation (usually log-transformation), allow the use of this approach.
The components of variance and the grey zone
The total variance of a test, \(\sigma^{2}_{TOT}\), is the sum of the between-subject variance, \(\sigma^{2}_{B}\), and the within-subject variance, \(\sigma^{2}_{W}\) (\(\sigma^{2}_{TOT} = \sigma^{2}_{B} + \sigma^{2}_{W}\)).
Therefore, we can construct and delimit two sub-zones of uncertainty inside the grey zone, which reflect these two components:
(1) The first subzone, associated with \(\sigma^{2}_{B}\), is the incompressible dark grey zone. Its limits are obtained by using \(s^{2}_{B}\), the estimate of \(\sigma^{2}_{B}\), in place of the total variance \(\sigma^{2}_{TOT}\) (i.e. \(\sigma^{2}_{H}\)):
\[g_{up,DARK}\ =\ \mathit{{\bar{X}}_{H}}\ +\ \mathit{z}_{{\upsilon}}\mathit{s}_{\mathit{B}}\ and\ g_{low,DARK}\ =\ \mathit{{\bar{X}}_{D}}\ {-}\ \mathit{kz}_{{\lambda}}\mathit{s}_{\mathit{B}}.\]
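(2) The second subzone, associated with \(\sigma^{2}_{W}\), is the compressible light grey zone: the part of the grey zone that remains outside the dark grey zone computed from \(s^{2}_{B}\) (Appendix).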
The graphical representation of the grey zone in reliability studies
As graphical representations of the grey zone are easily understandable, it would be convenient to couple its representation with the graphical method to evaluate reliability described by Bland and Altman (Appendix).19,20
Indeed simultaneous representation of the differences between assessments and the grey zone and its sub-zones would allow the reliability of a quantitative test to be analysed, and in particular, visualization of the proportion of subjects inside the grey zone.
The tuberculin skin test
A recent study21 evaluated the reliability of two techniques of tuberculin skin test measurement. The diameter of skin induration was measured along the long axis of the forearm both by the customary palpation method (P) and by the ballpoint-pen technique (BP).
The differences between the measures recorded by the two observers for both techniques in 69 patients with non-null values are shown in Figure 6 (panels P1 and BP1). There were relationships between the differences and the means, so that log-transformations were needed. The mean (SD) of differences on the log-scale was 0.01 (0.29) for palpation and −0.04 (0.25) for BP, giving the limits of agreement shown in Figure 6 (panels P2 and BP2). The values of the ICC were 0.84 for palpation and 0.88 for BP.
We estimated \({\sigma}^{2}_{\mathit{B}}\) \({\sigma}^{2}_{\mathit{B}}\)
Further uses of the grey zone: to scrutinize and minimize the components of variance
As the within-subject variance may include inter-observer, intra-observer, instrumental, and possibly biological components of variance, different Bland and Altman analyses of the differences in measurement are therefore possible: difference between observers, between evaluations for a single observer, between times for a single subject, etc. These analyses can be coupled to the construction of subzones of uncertainty, reflecting each component of the variability of the test.
If we consider a test with inter- and intra-observer (or residual) components of variability (as, for example, in the tuberculin skin test analysis presented above), the within-subject variance \(\sigma^{2}_{W}\) can be partitioned accordingly (\(\sigma^{2}_{W} = \sigma^{2}_{INTER} + \sigma^{2}_{INTRA}\)), and the share of the light grey zone attributable to inter-observer variability is given by \(\frac{w_{LIGHT/INTER}}{w_{LIGHT}} = \frac{s_{INTER}}{s_{W}}\) (Appendix).
The magnitude of the components of variance as displayed by the width of their associated sub-zones may help optimize strategies to limit measurement error.
(1) The mean of measures of a repeated test can be used, instead of individual values, to decrease the intra-observer component of the within-subject variability, and therefore to shrink the compressible light grey zone.
(2) Using a sole observer allows the ‘measurement component’ of variability to be decreased to the intra-observer variability. The intra-observer reliability value is considered as the ‘asymptotic value’ when there are several observers who are similarly trained and experienced with the measurement method.
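The effect of these two strategies on the light grey zone can be sketched numerically. The example assumes hypothetical variance components (chosen so that the inter-observer component dominates), equal risks of 0.025 and k = 1. It also assumes that the intra-observer variance of the mean of m independent repeated measures is the single-measure variance divided by m. The light grey zone width is computed as (zυ + kzλ)(σTOT − σB), which is the Appendix expression for w − wDARK written with standard deviations.

```python
from math import sqrt
from statistics import NormalDist


def light_width(var_b, var_w, ups=0.025, lam=0.025, k=1.0):
    """Width of the light grey zone: (z_ups + k*z_lam) * (sigma_TOT - sigma_B)."""
    z = NormalDist().inv_cdf(1 - ups) + k * NormalDist().inv_cdf(1 - lam)
    return z * (sqrt(var_b + var_w) - sqrt(var_b))


# Hypothetical variance components (not estimated from any data set).
VAR_B, VAR_INTER, VAR_INTRA = 1.0, 0.16, 0.04

# Design as studied: several observers, one measure each.
as_designed = light_width(VAR_B, VAR_INTER + VAR_INTRA)
# Strategy (1): mean of two measures, halving the intra-observer variance.
mean_of_two = light_width(VAR_B, VAR_INTER + VAR_INTRA / 2)
# Strategy (2): a sole observer, leaving only the intra-observer variance.
single_observer = light_width(VAR_B, VAR_INTRA)

print(f"{as_designed:.3f} {mean_of_two:.3f} {single_observer:.3f}")
```

With these hypothetical values, averaging two measures shrinks the light grey zone only slightly, whereas using a sole observer removes most of it, because the inter-observer component dominates the within-subject variance.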
The tuberculin skin test
For the ballpoint technique, a two-way random effect analysis of variance allowed the ratio \(\frac{s_{INTER}}{s_{W}}\) to be estimated.
The consequences for the light grey zone of using the mean of two measures were tested. The shrinkage of the grey zone was minor (Figure 7, panel 2), as expected, because the intra-observer variability constituted only a small proportion of the within-subject variability. In contrast, using a single observer reduced the light grey zone substantially and brought the width of the grey zone close to that of the dark incompressible component (Figure 7, panel 3).
Discussion
We propose a method to construct a three-zone partition for quantitative test results. This partition intentionally includes a grey zone between positive and negative conclusions about the condition tested; this grey zone is defined according to the requirements of the clinical or screening context in terms of likelihood ratios (LR). Its width depends on the difference between the means of test results for subjects with and without the disease, the variability of the test results, and the level of the misclassification risks (false positive, false negative) required by the context of use. The visual aspects of this method are useful: (1) for discrimination, as they help in the choice of the limits of the zones according to the context; and (2) to assess the components of variability of a quantitative diagnostic test. Due to its simplicity and its graphical representations, we hope it will be useful during the development of diagnostic and screening tests.
Above all, our approach allows the binary constraint of a ‘black or white’ decision to be avoided, as this is often inappropriate to clinical or screening practice. A test result falling in the grey zone is not uninformative as it could lead one to seek further evidence, thereby transforming the test result from a decisive to contributory role. Several controversies concerning suitable thresholds for quantitative tests would have probably been avoided if such an approach had been used. A good example is the recent debate concerning the change in the criteria for the diagnosis of type 2 diabetes, and the shift in the threshold from 7.8 mmol/l to 7.0 mmol/l of fasting plasma glucose.22
Our approach also provides a complementary or alternative representation to effect scores23 and especially to ROC curves for the evaluation of the discriminatory performance of a quantitative test and the choice of thresholds. The conventional ROC curves give symmetrical parts to sensitivity and specificity, and only recent refinements of the ROC curve methodology have dealt with unequal costs of misclassification; however, these refinements are complex.24
Another advantage of our method is that it gives a visual representation of both the relationship between the width of the grey zone and the range of possible values, and the proportion of observations within this zone. This can be done by coupling the grey zone construction to the Bland and Altman method to assess reliability, a method now familiar to many clinicians and biologists. A quantitative test whose grey zone contains one-third or a half of observed values (as was the case for the two examples) is obviously of little value in practice.
In assessing reliability by this method, the light grey zone reflects the ‘measurement component’ of variability in a given design. Thus, the subzones give a simple representation of the components of variance of a measurement method. In the absence of transformation, the width of the compressible light grey zone is proportional to the within-subject standard deviation for the design. A simultaneous representation of the light grey zone and the limits of agreement provided by the Bland-Altman method exploits this proportionality.
The main difficulty in implementing the grey zone approach is determining appropriate values of LR. This involves analysis of the clinical or screening context (expressed in terms of pre-test probabilities) and requirements (expressed in terms of post-test probabilities) and may be difficult. In particular, pre-test probabilities may vary according to the epidemiological context, the care facility, information already gathered about diagnostic or risk factors, and other factors; furthermore, ‘subjective probabilities’ produced by clinicians or experts may be unreliable. (Post-test probability requisites may also vary, albeit to a lesser extent.) The rule of thumb proposed by the Evidence-Based Medicine group, i.e. to consider LR+ over 10 and LR− below 0.1 as indicating conclusive tests,25 may be used as a first approximation, although much higher/lower values of LR+/LR− (however seldom attained by current screening or diagnostic tests) would be required in many contexts. Another approach would be to consider the sensitivity of the LR values and the resulting limits of the grey zone associated with various scenarios or hypotheses. A two-way sensitivity analysis, varying pre- and post-test probabilities simultaneously and studying the effect on LR, should be performed. The location of the resulting interval of values on the LR curves would further indicate the stability of the grey zone limits: their location in (or near) the straight vertical parts of the LR curves would be reassuring (as in our second example, see above). Sensitivity analyses would also allow the stability of the grey zone limits to be tested when empirical data concerning the test are limited and cannot provide reliable estimates of LR (i.e. when the confidence intervals for LR are wide) and/or do not include many cut-off points.
Another limitation of this method is its reliance on several assumptions for evaluation and minimization of the measurement error. The use of analysis of variance and ICC requires: the distributions of the test results to be normal in both healthy and diseased subjects; and the measurement error to be constant across the range of test values. Logarithmic transformations may in general allow these requirements to be satisfied, but render the computation more complex and assessment of the graphical representations less immediate. Further investigation with non-parametric ICC is needed before the grey zone can be adapted for the evaluation and minimization of the measurement error, when distributions cannot be normalized or measurement cannot be made constant across the range of test values. For a simple application to evaluation of diagnostic or screening discrimination, no assumption is necessary: the grey zone construction only requires plotting both LR+ and LR− against the values of the test. Otherwise, the methodology is non-specific and the recommendations of Reid et al.26 must be followed to avoid the various biases (spectrum bias, verification bias, review bias) affecting the evaluation of the performance of screening and diagnostic tests.
In conclusion, our method allows simple graphical representation of both the discriminatory performance and the components of variability of quantitative diagnostic and screening tests. These representations may be useful supports during the development, evaluation and publication of the performances of such tests.
Appendix
Limits and width of the grey zone
Let X and Y be the interval-scaled results of a candidate diagnostic or screening test in subjects without and with the disease, respectively, with X ∼ N(μH, \(\sigma^{2}_{H}\)) and Y ∼ N(μD, \(\sigma^{2}_{D}\)). For given risks υ and λ, the limits of the grey zone are:
\[g_{up}\ =\ {\mu}_{\mathit{H}}\ +\ \mathit{z}_{{\upsilon}}{\sigma}_{\mathit{H}}\ and\ g_{low}\ =\ {\mu}_{\mathit{D}}\ {-}\ \mathit{z}_{{\lambda}}{\sigma}_{\mathit{D}}\]
where zυ and zλ are the (1 − υ)th and (1 − λ)th quantiles of the standard normal distribution (Figure 4). Replacing population values of means and standard deviations by their sample estimates, we obtain:
\[g_{up}\ =\ \mathit{{\bar{X}}_{H}}\ +\ \mathit{z}_{{\upsilon}}\mathit{s}_{\mathit{H}}\ and\ g_{low}\ =\ \mathit{{\bar{X}}_{D}}\ {-}\ \mathit{z}_{{\lambda}}\mathit{s}_{\mathit{D}}\]
If we let Δ = μD − μH, and σD = kσH, the width, w, of the grey zone is:
\[\mathit{w}\ =\ g_{up}\ {-}\ g_{low}\ =\ (\mathit{z}_{{\upsilon}}\ +\ \mathit{kz}_{{\lambda}}){\sigma}_{\mathit{H}}\ {-}\ {\Delta}\]
(1)
Components of variance and the light and dark grey zones
The variance of a test, \(\sigma^{2}_{TOT}\), is the sum of the between-subject variance, \(\sigma^{2}_{B}\), and the within-subject variance, \(\sigma^{2}_{W}\): \(\sigma^{2}_{TOT} = \sigma^{2}_{B} + \sigma^{2}_{W}\).
Let ρI be the one-way random effect intraclass correlation coefficient (ICC):16 \(\rho_{I} = \frac{\sigma^{2}_{B}}{\sigma^{2}_{TOT}} = \frac{\sigma^{2}_{B}}{\sigma^{2}_{B} + \sigma^{2}_{W}}\). Then \(\sigma_{TOT} = \frac{1}{\sqrt{\rho_{I}}}\sigma_{B}\) and, taking \(\sigma_{H} = \sigma_{TOT}\) in equation (1), the width of the grey zone becomes:
\[\mathit{w}\ =\ \left[(\mathit{z}_{{\upsilon}}\ +\ \mathit{kz}_{{\lambda}})\frac{1}{\sqrt{{\rho}_{\mathit{I}}}}{\sigma}_{\mathit{B}}\right]\ {-}\ {\Delta}\]
(2)
When ρI → 1 or σW → 0, w → wDARK = (zυ + kzλ)σB − Δ which is the incompressible dark grey zone, the limits of which are:
\[g_{up,DARK}\ =\ {\mu}_{\mathit{H}}\ +\ \mathit{z}_{{\upsilon}}{\sigma}_{\mathit{B}}\ and\ g_{low,DARK}\ =\ {\mu}_{\mathit{D}}\ {-}\ \mathit{kz}_{{\lambda}}{\sigma}_{\mathit{B}}\]
The width of the compressible light grey zone is therefore:
\[\mathit{w}_{\mathit{LIGHT}}\ =\ \mathit{w}\ {-}\ \mathit{w}_{\mathit{DARK}}\ =\ (\mathit{z}_{{\upsilon}}\ +\ \mathit{kz}_{{\lambda}})\ (\frac{1}{\sqrt{{\rho}_{\mathit{I}}}}\ {-}\ 1){\sigma}_{\mathit{B}}\]
(3)
Note that since \(\sigma_{B} = \frac{\sqrt{\rho_{I}}}{\sqrt{1 - \rho_{I}}}\sigma_{W}\), the width of the light grey zone can also be written:
\[\mathit{w}_{\mathit{LIGHT}}\ =\ (\mathit{z}_{{\upsilon}}\ +\ \mathit{kz}_{{\lambda}})\frac{1\ {-}\sqrt{{\rho}_{\mathit{I}}}}{\sqrt{1\ {-}\ {\rho}_{\mathit{I}}}}{\sigma}_{\mathit{W}}\]
(4)
The estimations of gup,DARK, glow,DARK, wDARK and wLIGHT require prior computation of the estimated components of variance \(\sigma^{2}_{B}\) and \(\sigma^{2}_{W}\), and of \(\hat{\rho}_{I}\).
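The relationships between equations (2), (3) and (4) can be checked numerically. In the sketch below, the ICC of 0.88 is the value reported for the ballpoint-pen technique and is used only as an illustrative input; σB, Δ, the risks and k are hypothetical, and the function names are introduced for illustration. The within-subject standard deviation is derived from the definition of ρI, which gives σB = [√ρI / √(1 − ρI)] σW.

```python
from math import isclose, sqrt
from statistics import NormalDist


def z_sum(ups, lam, k=1.0):
    """z_ups + k * z_lam for the chosen risks."""
    return NormalDist().inv_cdf(1 - ups) + k * NormalDist().inv_cdf(1 - lam)


def zone_widths(sigma_b, rho, delta, ups, lam, k=1.0):
    """Return (w, w_dark, w_light) from equations (2) and (3)."""
    z = z_sum(ups, lam, k)
    w = z * sigma_b / sqrt(rho) - delta            # equation (2)
    w_dark = z * sigma_b - delta                   # limit when rho -> 1
    w_light = z * (1 / sqrt(rho) - 1) * sigma_b    # equation (3)
    return w, w_dark, w_light


def light_width_from_sigma_w(sigma_w, rho, ups, lam, k=1.0):
    """Equation (4): light grey zone width expressed with sigma_W."""
    return z_sum(ups, lam, k) * (1 - sqrt(rho)) / sqrt(1 - rho) * sigma_w


SIGMA_B, RHO, DELTA = 1.0, 0.88, 2.0          # sigma_B and delta are hypothetical
SIGMA_W = SIGMA_B * sqrt((1 - RHO) / RHO)     # from the definition of rho

w, w_dark, w_light = zone_widths(SIGMA_B, RHO, DELTA, 0.025, 0.025)
w_light_eq4 = light_width_from_sigma_w(SIGMA_W, RHO, 0.025, 0.025)

print(f"w={w:.3f} dark={w_dark:.3f} light={w_light:.3f}")
```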
The Bland and Altman method and the grey zone
The Bland and Altman method is based on the construction of a residual-like plot of the difference between the results of two measures against their mean. The mean d̄ and standard deviation sd of differences between pairs of repeated measurements are combined to define the limits of agreement d̄ ± 2sd, which correspond to a 95% range for the difference between two repeated measurements. The method assumes that sd is constant across the range of measurements, and, in the frequent case of the measurement error being proportional to the mean, requires a log-transformation: the limits of agreement antilogged back into the natural scale give a range of proportional agreement between repeated measurements.
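Since \(s_{d} = \sqrt{2}\, s_{W}\), the width of the light grey zone, which is proportional to the within-subject standard deviation (equation 4), is also proportional to \(s_{d}\) and therefore to the width of the limits of agreement.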
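The limits of agreement and the within-subject standard deviation can be computed as sketched below. The paired measurements are hypothetical. The log-scale summary (mean 0.01, SD 0.29) is the one reported for palpation, and antilogging with `exp` assumes natural logarithms, which the report does not state.

```python
from math import exp, sqrt
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Return (lower, upper, s_w) for paired repeated measurements.

    Limits are d_bar +/- 2 * s_d; s_w = s_d / sqrt(2).
    """
    diffs = [a - b for a, b in zip(first, second)]
    d_bar, s_d = mean(diffs), stdev(diffs)
    return d_bar - 2 * s_d, d_bar + 2 * s_d, s_d / sqrt(2)


# Hypothetical paired readings by two observers (mm).
obs1 = [10.0, 12.0, 15.0, 8.0, 20.0]
obs2 = [11.0, 11.0, 16.0, 9.0, 18.0]
lower, upper, s_w = limits_of_agreement(obs1, obs2)

# Reported log-scale summary for palpation: mean 0.01, SD 0.29.
log_lower, log_upper = 0.01 - 2 * 0.29, 0.01 + 2 * 0.29
# Antilog to a range of proportional agreement (ASSUMES natural logs).
ratio_lower, ratio_upper = exp(log_lower), exp(log_upper)

print(round(lower, 2), round(upper, 2), round(s_w, 2))
print(round(log_lower, 2), round(log_upper, 2))
```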
Components of variance and the inter- and intra-observer light grey zones
The within-subject variance \(\sigma^{2}_{W}\) can be partitioned into an inter-observer component, \(\sigma^{2}_{INTER}\), and an intra-observer (or residual) component, \(\sigma^{2}_{INTRA}\): \(\sigma^{2}_{W} = \sigma^{2}_{INTER} + \sigma^{2}_{INTRA}\).
If we let \(\tau = \frac{\sigma^{2}_{INTER}}{\sigma^{2}_{INTER} + \sigma^{2}_{INTRA}}\), then \(\sigma_{W} = \frac{1}{\sqrt{\tau}}\sigma_{INTER}\) and equation (4) becomes:
\[\mathit{w}_{\mathit{LIGHT}}\ =\ (\mathit{z}_{{\upsilon}}\ +\ \mathit{kz}_{{\lambda}})\left[\frac{1\ {-}\ \sqrt{{\rho}_{\mathit{I}}}}{\sqrt{1\ {-}\ {\rho}_{\mathit{I}}}}\right]\left(\frac{1}{\sqrt{{\tau}}}\right){\sigma}_{\mathit{INTER}}\]
(5)
When σINTRA → 0 or τ → 1,
\[\mathit{w}_{\mathit{LIGHT}}\ {\rightarrow}\mathit{w}_{\mathit{LIGHT/INTER}}\ =\ (\mathit{z}_{{\upsilon}}\ +\ \mathit{kz}_{{\lambda}})\left[\frac{1\ {-}\ \sqrt{{\rho}_{\mathit{I}}}}{\sqrt{1\ {-}\ {\rho}_{\mathit{I}}}}\right]{\sigma}_{\mathit{INTER}}\]
(6)
Thus \(\frac{\mathit{w}_{\mathit{LIGHT/INTER}}}{\mathit{w}_{\mathit{LIGHT}}}\ =\ \frac{{\sigma}_{\mathit{INTER}}}{{\sigma}_{\mathit{W}}}\ =\ \sqrt{{\tau}}\)
The estimation of wLIGHT/INTER requires prior computation of both the estimated components of variance \({\hat{{\sigma}}}^{2}_{\mathit{INTER}}\) and \({\hat{{\sigma}}}^{2}_{\mathit{INTRA}}\), together with \({\hat{{\sigma}}}_{\mathit{B}}\), \({\hat{{\sigma}}}_{\mathit{W}}\) and \({\hat{{\rho}}}_{\mathit{I}}\).
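The relation between equations (5) and (6) can be illustrated with a short Python sketch. It takes the factor (zυ + kzλ) as a single given input, here called `z_sum`, and keeps ρI fixed as in equation (6). The function name and the numeric values are hypothetical. The sketch computes τ, wLIGHT and wLIGHT/INTER and checks that their ratio equals √τ.

```python
from math import sqrt


def light_grey_widths(z_sum, rho_i, sigma_inter, sigma_intra):
    """Return (tau, w_LIGHT, w_LIGHT/INTER) following equations (5) and (6).

    z_sum stands for the factor (z_upsilon + k * z_lambda), taken as given.
    """
    # Share of the within-subject variance due to inter-observer variability.
    tau = sigma_inter**2 / (sigma_inter**2 + sigma_intra**2)
    # Bracketed factor common to equations (4), (5) and (6).
    shape = (1.0 - sqrt(rho_i)) / sqrt(1.0 - rho_i)
    w_light = z_sum * shape * sigma_inter / sqrt(tau)  # equation (5)
    w_light_inter = z_sum * shape * sigma_inter        # equation (6)
    return tau, w_light, w_light_inter


# Hypothetical inputs, for illustration only.
tau, w_light, w_inter = light_grey_widths(
    z_sum=3.92, rho_i=0.8, sigma_inter=0.6, sigma_intra=0.8
)
print(tau, w_light, w_inter, w_inter / w_light)
```

With these inputs τ is 0.36, so the inter-observer part accounts for 60% of the width of the light grey zone, which is the kind of split displayed in Figure 7.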
KEY MESSAGES
Most quantitative tests do not perfectly discriminate between subjects with and without a given disease.
We propose a method to construct a three-zone partition for quantitative tests which intentionally includes a grey zone between positive and negative conclusions.
This method allows the binary constraint of a ‘black or white’ decision to be avoided, as this is often inappropriate to clinical or screening practice.
This method can be used both to display the discriminatory performance of a quantitative test in a variety of contexts and to scrutinize its components of variability.
Figure 1
Panel A: Histograms of tuberculin skin test results (non-null values) in subjects with (n = 3826) and without tuberculosis (n = 643 694) according to Rose et al.10 Panel B: Construction of the grey zone for the tuberculin test for LR+ = 44 and LR− = 0.022, using a plot of both LR− and LR+ for different values of the test. Panel C: Determination of the risks of misclassification associated with the grey zone for the tuberculin test using a plot of both sensitivity and specificity; υ = 0.025, λ = 0.025, υ′ = 0.435, λ′ = 0.295
Figure 2
Panel A: Histograms of reticulocyte haemoglobin content (CHr) results in children with (n = 43) and without iron deficiency (n = 167) drawn from data reported by Brugnara et al.11 Panel B: Construction of the grey zone for the CHr test for LR+ = 20 and LR− = 0.01, using a plot of both LR− and LR+ for different values of the test. Panel C: Determination of the risks of misclassification associated with the grey zone for the CHr test using a plot of both sensitivity and specificity; λ = 0.04, υ = 0.02, λ′ = 0.46, υ′ = 0.83. (In this example, where test values are lower in diseased subjects, the risks λ and λ′ depend on the value at the upper limit of the grey zone and the risks υ and υ′ on the value at the lower limit.)
Figure 3
Interpretation of the grey zone for a quantitative test. The area under the curve of probability density for subjects without the disease over gup, the upper limit of the grey zone, represents the risk υ; the area under the curve of probability density for subjects with the disease under glow, the lower limit of the grey zone, represents the risk λ. The risk of a subject with the disease being classified as ‘grey’ is υ′ and the risk of a subject without the disease being classified as ‘grey’ is λ′
Figure 4
Determination of the risks of misclassification υ, λ, υ′, λ′ associated with the grey zone for a quantitative test using a plot of both sensitivity and specificity
Figure 5
Grey zones for reticulocyte haemoglobin content (CHr) and mean corpuscular haemoglobin (MCH) results in children with (n = 43) and without iron deficiency (n = 167) drawn from data reported by Brugnara et al.11 In this example, the values of both tests are lower in diseased subjects. Because the two tests are strongly positively correlated, two of the nine possible combinations of results, (+/−) and (−/+), are very unlikely.
Figure 6
Plots of the difference between measures of two observers against the average of measures, n = 69 pairs of non-null measures by palpation (P) and the ballpoint-pen method (BP). TST means tuberculin skin test21
Upper panels (P1 and BP1) use the original scales (mm). Upper middle panels (P2 and BP2) use log-transformed values (base e). The horizontal lines indicate the mean difference and mean difference ± 2 standard deviations of differences. Lower middle panels (P3 and BP3) show the grey zones and subzones superimposed (dark grey for the dark grey zone, and light grey for the light grey zone) for υ = λ = 0.025 in log-scales.
Lower panels (P4 and BP4) show the grey zones and subzones superimposed (as above) for υ = λ = 0.025 in the original scale.
Figure 7
Dark grey zones and subzones of the light grey zones superimposed on plots of difference between measures against the average of measures, for the ballpoint-pen method. TST means tuberculin skin test. Computations were conducted on log-transformed values (not shown) and results were antilogged back to natural scales
Upper panel (1) is the same as panel BP4 of Figure 6, but shows the relative parts of the light grey zone that are due to the intra-observer and inter-observer components of variability. Lower left panel (2) shows the influence on the grey zone of using means of two repeated measures for each observer to minimize intra-observer variability. Lower right panel (3) shows the influence on the grey zone of using a single observer to avoid inter-observer variability.
References
1 Begg CB. Advances in statistical methodology for diagnostic medicine in the 1980’s. Stat Med 1991;10:1887–95.
2 Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561–77.
3 Feinstein AR. The inadequacy of binary models for the clinical reality of three-zone diagnostic decisions. J Clin Epidemiol 1990;43:109–13.
4 Simel DL, Samsa GP, Matchar DB. Likelihood ratios for continuous test results, making the clinicians’ job easier or harder? J Clin Epidemiol 1993;46:85–93.
5 Jamart J. Chance-corrected sensitivity and specificity for three-zone diagnostic tests. J Clin Epidemiol 1992;45:1035–39.
6 Subcommittee of the Joint Tuberculosis Committee of the British Thoracic Society. Guidelines on the management of tuberculosis and HIV infection in the United Kingdom. BMJ 1992;304:1231–33.
7 Pape JW, Jean SS, Ho JL, Hafner A, Johnson WD Jr. Effect of isoniazid prophylaxis on incidence of active tuberculosis and progression of HIV infection. Lancet 1993;342:268–72.
8 Bass JB Jr, Farer LS, Hopewell PC et al. Treatment of tuberculosis and tuberculosis infection in adults and children. Am J Respir Crit Care Med 1994;149:1359–74.
9 De Cock KM, Grant A, Porter JD. Preventive therapy for tuberculosis in HIV-infected persons: international recommendations, research, and practice. Lancet 1995;345:833–36.
10 Rose DN, Schechter CB, Adler JJ. Interpretation of the tuberculin skin test. J Gen Intern Med 1995;10:635–42.
11 Brugnara C, Zurakowski D, DiCanzio J, Boyd T, Platt O. Reticulocyte hemoglobin content to diagnose iron deficiency in children. JAMA 1999;281:2225–30.
12 Kassirer JP, Kopelman RI. Learning Clinical Reasoning. Baltimore: Williams & Wilkins, 1991.
13 Morrison AS. Screening. In: Rothman KJ, Greenland S (eds). Modern Epidemiology, 2nd Edn. Philadelphia: Lippincott, Williams & Wilkins, 1998.
14 Reid MC, Lane DA, Feinstein AR. Academic calculations versus clinical judgments: practicing physicians’ use of quantitative measures of test accuracy. Am J Med 1998;104:374–80.
15 Healy MJ. Measuring measuring errors. Stat Med 1989;8:893–906.
16 Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–28.
17 Müller R, Büttner P. A critical discussion of intraclass correlation coefficients. Stat Med 1994;13:2465–76.
18 Bland JM, Altman DG. A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Comput Biol Med 1990;20:337–40.
19 Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. The Statistician 1983;32:307–17.
20 Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–10.
21 Pouchot J, Grasland A, Collet C, Coste J, Esdaile JM, Vinceneux P. The reliability of tuberculin skin test measurement. Ann Intern Med 1997;126:210–14.
22 Davidson MB, Schriger DL, Peters AL, Lorber B. Relationship between fasting plasma glucose and glycosylated hemoglobin: potential for false-positive diagnoses of type 2 diabetes using new diagnostic criteria. JAMA 1999;281:1203–10.
23 Blakeley DD, Oddone EZ, Hasselblad V, Simel DL, Matchar DB. Noninvasive carotid artery testing. A meta-analytic review. Ann Intern Med 1995;122:360–67.
24 Hilden J, Glasziou P. Regret graphs, diagnostic uncertainty and Youden’s Index. Stat Med 1996;15:969–86.
25 Jaeschke R, Guyatt G, Sackett DL and the Evidence-Based Medicine Working Group. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. Are the results of the study valid? JAMA 1994;271:389–91.
26 Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research. Getting better but still not good. JAMA 1995;274:645–51.
© International Epidemiological Association 2003