Sensitivity and specificity

Sensitivity and specificity are statistical measures of the performance of a test, or how well it works in reality. These measures are important because the effectiveness of a test can be very counter-intuitive. While common sense says a positive HIV blood test or a shop-lifting alarm going off is sure to be right, in reality it may be wrong most of the time.

Definitions

 * Sensitivity is essentially how good a test is at finding something if it's there. It measures how often the test correctly identifies a positive among all cases that are positive by the gold standard test. For example, a blood test for a virus may have a sensitivity of 99% or more, meaning that for every 100 infected people tested, 99 or more will be correctly identified. This is a good figure to take note of, but it doesn't necessarily reflect a test's true effectiveness, as will become apparent.


 * Specificity is a measure of how well a test avoids false positives. A sniffer dog looking for drugs would have a low specificity if it is often led astray by things that aren't drugs, such as cosmetics or food. Specificity can be considered as the percentage of genuinely negative cases that the test correctly identifies as negative. Again, this can be 99% or more for good tests, although for a particularly unruly and easily distracted sniffer dog it would be much, much lower.
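The two definitions above can be sketched as simple ratios. This is a minimal illustration, not a standard library API: the function names and all the counts are invented for the example, with "true" status decided by the gold standard.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of gold-standard positives that the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of gold-standard negatives that the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical blood test: 100 infected people tested, 99 detected.
blood_test_sensitivity = sensitivity(true_pos=99, false_neg=1)

# Hypothetical distractible sniffer dog: 100 drug-free bags, 40 wrongly flagged.
sniffer_dog_specificity = specificity(true_neg=60, false_pos=40)

print(f"blood test sensitivity:  {blood_test_sensitivity:.0%}")
print(f"sniffer dog specificity: {sniffer_dog_specificity:.0%}")
```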

Both of these figures, along with the rates of false positives and false negatives, are established against a "gold standard". This is a test that is unambiguous, or so close to unambiguous that it makes no difference. In the sniffer dog case, this could be a thorough examination by a trained human police officer; for medical diagnosis, it can be further and more intensive tests and observations such as biopsies. These "gold standard" tests may be too costly or time-consuming to carry out all the time, which is why other tests that may not be 100% effective are used instead.

These statistics may also be subject to what is known as a "window period", that is, a time during which the test's ability to detect the condition is much diminished. The HIV antibody blood test is a common example, with a window period of up to three months during which it will not work, while nucleic acid testing has a window period of around 12 days.

Sensitivity and specificity can be easily confused with positive predictive value (PPV) and negative predictive value (NPV), which indicate the proportion of positive tests that are true positives and the proportion of negative tests that are true negatives, respectively. Sensitivity and specificity are useful in evaluating the value of a test applied to many, while PPV and NPV are useful to assess the merit of a specific test result.
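The difference between the two pairs of measures is only in which total each count is divided by. The sketch below makes that explicit; the function name and the counts are hypothetical, chosen purely for illustration.

```python
def summarise(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute all four measures from gold-standard-checked counts."""
    return {
        # Divided by what is really there (properties of the test itself).
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Divided by what the test reported (properties of a given result).
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical study: 100 true cases and 900 non-cases.
results = summarise(tp=90, fp=50, tn=850, fn=10)
for name, value in results.items():
    print(f"{name:12s} {value:.1%}")
```

In this invented data set the test has a sensitivity of 90%, yet only about 64% of its positive results are true positives, which is why the two ideas should not be conflated.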

Counter-intuitive false positives
High sensitivity and specificity may be good, but as anyone who has been buzzed at airport security without carrying a bomb will tell you, tests that have them can still produce a disproportionate number of false positives, so their effectiveness is counter-intuitively low. This is because the natural rate of positives (according to a gold standard, as described above) has a massive effect on the percentage of false positives and on a test's apparent effectiveness. In short, if this natural rate is high, the test appears better, but if the rate is low, the number of false positives may well exceed the number of real positives, making the test almost useless. Nor is either figure meaningful on its own: a sensitivity of 100% can be achieved by assuming everyone has a bomb (at the cost of 0% specificity), while a specificity of 100% follows from assuming no one has a bomb (at the cost of 0% sensitivity).
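The two degenerate "tests" mentioned above can be checked directly. The following is a small sketch under an assumed population of 1,000 passengers with a single bomb carrier; the helper name and the numbers are hypothetical.

```python
def sensitivity_and_specificity(predictions, truth):
    """Return (sensitivity, specificity) for parallel lists of booleans."""
    pairs = list(zip(predictions, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fn = sum(1 for p, t in pairs if not p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    return tp / (tp + fn), tn / (tn + fp)


# Assumed population: 1 bomb carrier among 1,000 passengers.
truth = [True] + [False] * 999

everyone_flagged = [True] * len(truth)    # "assume everyone has a bomb"
no_one_flagged = [False] * len(truth)     # "assume no one has a bomb"

print(sensitivity_and_specificity(everyone_flagged, truth))  # perfect sensitivity, zero specificity
print(sensitivity_and_specificity(no_one_flagged, truth))    # zero sensitivity, perfect specificity
```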

Example
The concept is best illustrated by an example. Imagine a book shop installing an anti-theft device: a scanner at the entrance that looks for magnetic tags on stolen goods. Out of all the potential thieves that would walk through the scanner with stolen goods, the scanner will buzz 99.9% of the time (this is the sensitivity). Only 1 in 1,000 thieves will ever get away with it, which is quite good. Every so often, it will buzz someone who is innocent: a tag wasn't deactivated, their phone set off the scanner, it just went crazy, or whatever other reason. Assume the specificity is at a similar level of 99.9%, so for every 1,000 innocent people walking through the scanner, it will buzz once (a false positive rate of 0.1%).

For simplicity, the sensitivity can be treated as pretty much 100%, so every shop-lifter will be caught. But it immediately becomes apparent that not everyone is out to steal goods, and this must be factored into the evaluation of the scanner's performance! What if the shop is in a high crime area, where 1 in every 100 visitors to the store tries to steal something? That's 10 in 1,000. As the sensitivity is above 99%, all 10 thieves will be caught, but out of the 990 or so innocent visitors, about one will trigger a false positive; 10 sticky-fingered bandits and 1 innocent bystander will set off the alarm. So although the test is 99.9% sensitive, only about 90% of the alarms it produces (10 out of 11) are genuine.

If the background rate is lower, the test becomes less effective: if 1 in 1,000 customers are actually shop-lifters, then the rate of true positives is equal to the rate of false positives. The alarm will still buzz for about 1 in every 1,000 innocent people, and will with near certainty catch the one thief. In this case, only 50% of the alarms are real. If the rate of shop-lifting is lower still, below the false positive rate, it can easily be the case that the vast majority of alarms are, in fact, false positives. This is usually the case in airport security where, despite scaremongering, terrorism is remarkably rare, so the vast majority of alarms will be due to keys, belt buckles, and steel-capped boots, and very few, if any, will be due to weapons. This doesn't negate the deterrent effect of such devices, but it does mean their practical use "as intended" is significantly lower (see Security theatre).
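The book shop arithmetic can be reproduced with Bayes' theorem. This is a rough sketch: the function name is made up for the example, and the third background rate (1 in 10,000) is an extra assumed scenario added to show the trend.

```python
def share_of_alarms_that_are_real(sens: float, spec: float, base_rate: float) -> float:
    """Probability that a given alarm is a true positive (Bayes' theorem)."""
    true_alarms = sens * base_rate                 # thieves who set off the scanner
    false_alarms = (1 - spec) * (1 - base_rate)    # innocents who set off the scanner
    return true_alarms / (true_alarms + false_alarms)


SENSITIVITY = 0.999
SPECIFICITY = 0.999

# 1 in 100 and 1 in 1,000 come from the example; 1 in 10,000 is an added assumption.
for base_rate in (1 / 100, 1 / 1_000, 1 / 10_000):
    share = share_of_alarms_that_are_real(SENSITIVITY, SPECIFICITY, base_rate)
    print(f"thieves per visitor {base_rate:.4f}: {share:.1%} of alarms are real")
```

With the same scanner, the share of genuine alarms falls from roughly nine in ten, to one in two, to fewer than one in ten as thieves become rarer.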

Dangers
The problem is that this is very, very counter-intuitive. Not everyone will be able to recognise and calculate the real effectiveness of a test, whether it be a medical diagnosis or a forensic instrument. It is easy to determine the sensitivity of a test, since tests are often evaluated in controlled conditions that make this easier. Products and procedures will often be marketed on their sensitivity alone, so their rate of false positives may be high without anyone being aware of it. This problem has also made its way into some political or crime policy, and as a result more innocent than guilty people can be adversely affected.

Often, tests are improved by working on their sensitivity alone. But as was shown in the example above, when the background rate of positives is low, the sensitivity on its own is almost meaningless. Some may prefer the increased number of false positives, however, as they can prompt people to go for additional or more thorough testing. This is usually the case with medical diagnosis.

It is also important to remember that for every testing application, there is an optimal point (which varies with the specifics of the application) beyond which sensitivity and specificity become a tradeoff, and neither can be increased without sacrificing the other. Very few, if any, tests applied to subjects as complex as human beings can ever exclude both false positives and false negatives to a high degree of completeness.
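One common way the tradeoff shows up is when a test produces a score and a cut-off decides what counts as positive. The sketch below assumes such a test; the scores and thresholds are invented for illustration and do not come from any real instrument.

```python
# Hypothetical test scores (higher = more suspicious), invented for illustration.
POSITIVE_SCORES = [0.90, 0.80, 0.70, 0.55, 0.40]   # positive by the gold standard
NEGATIVE_SCORES = [0.60, 0.50, 0.30, 0.20, 0.10]   # negative by the gold standard


def rates_at(threshold: float) -> tuple[float, float]:
    """Return (sensitivity, specificity) when scores >= threshold are flagged."""
    caught = sum(score >= threshold for score in POSITIVE_SCORES)
    cleared = sum(score < threshold for score in NEGATIVE_SCORES)
    return caught / len(POSITIVE_SCORES), cleared / len(NEGATIVE_SCORES)


for threshold in (0.25, 0.45, 0.65):
    sens, spec = rates_at(threshold)
    print(f"threshold {threshold:.2f}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Because the two groups of scores overlap, raising the cut-off clears more innocents only by missing more genuine positives, and lowering it does the reverse.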

This can lead to strong value conflicts. Consider a test which identifies sexual offenders for registry. Few would contest the argument that a low rate of false negatives is desirable; equally few would contest the argument that a low rate of false positives is desirable. Once such a test passes its optimal point, though, registering more genuine sex offenders entails branding more innocent people with the label.

Examples of where this can apply are also numerous. Airport security, medical testing, computer security, spam filtering, psychological evaluation, parole evaluation, anti-bacterial products… all these areas of life are at the mercy of the problems posed by not correctly understanding sensitivity and specificity and how they relate to real world results.

Relationship to statistics
In statistical terminology, specificity and sensitivity are related to the main types of error in hypothesis testing. The sensitivity of a test controls the rate of false negatives, which are known as Type II errors. The specificity, on the other hand, controls the rate of false positives, known as Type I errors.
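Written out, and assuming the null hypothesis is taken to be "the condition is absent", the correspondence looks like this, where α is the Type I error rate and β is the Type II error rate (symbols introduced here for illustration):

```latex
\begin{align*}
\text{sensitivity} &= P(\text{test positive} \mid \text{condition present}) = 1 - \beta \\
\text{specificity} &= P(\text{test negative} \mid \text{condition absent}) = 1 - \alpha
\end{align*}
```

Under that convention, sensitivity plays the role of statistical power.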

Log odds
The calculations can be simplified, to a degree, by expressing the odds as "log odds", that is, a formulation of probability that works logarithmically, like the decibel scale. Repeating a test multiple times allows you to increase your confidence by a certain factor each time (the likelihood ratio). Expressed logarithmically, this is akin to simply adding the logarithm of the likelihood ratio each time the test is taken, rather than multiplying and having to repeat the calculations multiple times.

In the example given by Brian Lee and Jacob Sanders, a hypothetical test correctly diagnoses 99% of the time (its sensitivity) and gives a false positive 3% of the time (a specificity of 97%), with a background incidence of 1 in 10,000. In real numbers, a single round of testing 10,000 people yields about 1 correctly identified individual and about 300 false positives. The application of multiple tests is pretty much Bayes' theorem being applied repeatedly; each application is known as an "update", as the probability is being updated based on new evidence or a new iteration of a test. The likelihood ratio of a positive result works out as 33 (99/3), and 10 × log10(33) = 15.19. Each positive iteration of the test therefore adds 15.19 dB to the log odds that the person actually has the condition.


 * -40.00 dB = 0.01% (1 in 10,000)
 * -40.00 dB + 15.19 dB = -24.81 dB
 * -24.81 dB + 15.19 dB = -9.62 dB
 * -9.62 dB + 15.19 dB = 5.57 dB = 78.3%

So three positive iterations of the test are needed to reach roughly 78% certainty, assuming each repeat errs independently of the others.
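The decibel bookkeeping can be sketched in a few lines. The helper names are hypothetical, and the sketch follows the approximation used above of treating "1 in 10,000" as odds of 1:10,000 (that is, exactly -40 dB). Carried through without rounding each step, the final figure comes out near 78.2%; the small difference from 78.3% is due to rounding 15.19 dB at each step.

```python
import math


def odds_to_decibels(odds: float) -> float:
    """Express odds on a decibel-style log scale."""
    return 10 * math.log10(odds)


def decibels_to_probability(decibels: float) -> float:
    """Convert log odds in decibels back to a probability."""
    odds = 10 ** (decibels / 10)
    return odds / (1 + odds)


prior_db = -40.0                                # approx. 1 in 10,000 background incidence
evidence_db = odds_to_decibels(0.99 / 0.03)     # likelihood ratio of one positive result

log_odds = prior_db
for iteration in range(1, 4):
    log_odds += evidence_db                     # each independent positive result adds the same amount
    probability = decibels_to_probability(log_odds)
    print(f"after {iteration} positive test(s): {log_odds:+.2f} dB, probability {probability:.1%}")

final_probability = decibels_to_probability(log_odds)
```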

Visualisation


Assume you want to locate all the blue dots in the above diagram. It makes sense to have a test that correctly identifies them as blue; imagine they represent people infected with a disease, for example. The ones identified as such are grouped within the smaller, darker red circle. But the background incidence and the rate of false positives affect the true effectiveness of the test. On both diagrams, roughly 1 in 10 white spots are incorrectly grouped in with the blue; these are the false positives. On the left-hand side, the incidence of blue spots is low, so only 50% of those that pass the test and are grouped in the dark red circle are actually blue. On the right-hand diagram, even though the sensitivity and specificity haven't changed (1 in 10 white spots are incorrectly identified and almost all blue ones are correctly identified), the relative incidence of false positives is much lower, and 85% of those passing the test have been correctly identified.