When I was at medical school, I was taught that the four most important words you can say to a patient are ‘You don’t have cancer’. I’m sure you would agree that this statement would be of utmost importance to get right. In order to make this statement, the observer must be as certain as is feasibly possible that the statement is indeed correct at that moment in time (that cancer has been proven to be absent). In statistical terminology, this statement qualifies as what is known as a True Negative. Unfortunately, medicine is not black and white, and doctors can get it wrong. This doesn’t happen too often (well, more often than we would like, but not often enough for everyone to just give up), but when it does, the results can be catastrophic for patients.

When it comes to AI systems making diagnoses or reporting findings, it is important to ensure that novel technologies are also accurate, or at least as accurate as humans. This is done by a process of clinical validation, the aim of which is to assess the ‘accuracy’ of a system. We certainly don’t want AI to start incorrectly telling people that they do or don’t have cancer more than humans! In this article, I’ll go over the most relevant metrics for reporting accuracy for AI developers in a clinical setting.

What is Accuracy anyway?

‘Accuracy’ is not a defined scientific statistical term, but it essentially encompasses the notion of how well an Index Test (the system being tested) performs, often compared to another pre-existing system, which in the context of AI is usually human doctors. ‘Accuracy’ can be assessed using surrogate measurements such as reliability, sensitivity, precision and error-rate, to name but a few.

‘Accuracy’ of an Index Test is ideally measured by comparison to a Gold Standard error-free test. However, in the field of medical diagnosis (or other areas such as radiology or pathology findings), there is often no error-free Gold Standard. Therefore, the best available method for establishing a ‘ground truth’, known as a Clinical Reference Standard (CRS), should be used. The CRS could be defined as ‘the consensus from a group of qualified doctors’, or more simply ‘the expert opinion of one qualified doctor’. The whole point of the CRS is to be as close as possible to a perfect system. The definition of the CRS is important when approaching clinical validation, as it will affect the methodology, analysis and results of any statistical tests. Too often I see published literature on the accuracy of AI systems compared to only one or two doctors, and this may not be enough. However, it is probably unreasonable to take up hundreds of doctors’ time, so do your best!

Several statistical methodologies exist to measure ‘accuracy’. However, it must be pointed out that most statistical tests are designed to assess ‘accuracy’ of a test for one target condition in a binary classification (a given disease, condition or physiological measurement), not a large number of variables.

Alternatively, inter-test comparisons (kappa agreement, concordance, reliability) may be used to assess ‘accuracy’, especially if there is no specifically defined target condition. However, these tests only measure agreement between AI and humans, and do not reflect ‘accuracy’. For example, poor agreement does not tell you which of the two tests is better, only that they disagree. Therefore, I’ll be ignoring these tests in this article.

Finally, any methodological assessment of ‘diagnostic accuracy’ should, as far as possible, remove bias, be applicable to the context, be transparent, and appropriately powered. For this reason, the international STARD Criteria for reporting ‘diagnostic accuracy’ were produced, a peer-reviewed framework that enables researchers to adequately document and rationalise their studies and findings.

Measuring ‘Accuracy’ of Binary Classifications

In environments where results are dichotomous (binary, either present or absent), without indeterminate results, statistical binary classifiers can be used.

By charting the Index Test as compared to a Gold Standard, a simple 2x2 tabulation known as a Confusion Matrix or Contingency Table can be created.

Standard confusion matrix

Calculations based on true and false positive/negatives can provide surrogate measures of ‘accuracy’.

In general, the more ‘accurate’ a system, the greater number of True Positives and True Negatives occur, and the fewer False Positives and False Negatives.
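As a rough illustration of how the four cells are tallied, here is a minimal Python sketch. The function name `confusion_counts`, the `ConfusionCounts` type and the eight made-up cases are assumptions for illustration only, not part of any particular library or study.

```python
from typing import NamedTuple, Sequence


class ConfusionCounts(NamedTuple):
    tp: int  # Index Test positive, reference positive
    fp: int  # Index Test positive, reference negative
    fn: int  # Index Test negative, reference positive
    tn: int  # Index Test negative, reference negative


def confusion_counts(index_test: Sequence[bool],
                     reference: Sequence[bool]) -> ConfusionCounts:
    """Tally the four cells of the 2x2 table from paired binary results."""
    if len(index_test) != len(reference):
        raise ValueError("index_test and reference must have the same length")
    tp = fp = fn = tn = 0
    for predicted, actual in zip(index_test, reference):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fp, fn, tn)


# Made-up results for eight cases (True = condition reported present).
ai_results = [True, True, False, False, True, False, False, False]
reference_results = [True, False, False, False, True, True, False, False]

counts = confusion_counts(ai_results, reference_results)
print(counts)
```

Every metric discussed in the rest of this article is a ratio built from these four counts.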

Two Worlds Collide

There is considerable overlap between medical & data science metrics when reporting diagnostic ‘accuracy’. In the clinical world, the main ratios used are the true column ratios — True Positive Rate and True Negative Rate (Sensitivity and Specificity). In data science, the main ratios are the true positive ratios — Positive Predictive Value (PPV) and True Positive Rate (TPR) known respectively as Precision and Recall.

Traditionally, AI systems in non-medical sectors are rated on Precision and Recall only. This is because the True Negatives may not matter when applying a model to a non-clinical problem (e.g. a system that is designed for document or information retrieval).

Precision

Precision (green cells) is a measurement of how relevant a positive result is, and is also known as the Positive Predictive Value (PPV).

It is calculated as follows:

Precision = TP / (TP + FP)

Precision is a useful measurement of ‘accuracy’ when the Gold Standard is entirely error-free. However, caution must be taken when there is possible error in the Gold Standard. This is because the Index Test may actually be more ‘accurate’ (e.g. the AI is better than a human doctor), but because it is only being compared to the CRS, any True Positives produced by the Index Test that the CRS missed will erroneously be classed as False Positives, artificially decreasing the reported Precision. For example, suppose the consensus opinion of doctors is that there isn’t a cancer when there is, and the AI system correctly says there is a cancer. The AI system is correct, but because we are only comparing the result to the incorrect consensus opinion of doctors, its result is counted as a False Positive and the Precision of the AI system is affected negatively.
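To make the effect of an imperfect reference standard concrete, the sketch below recomputes Precision after some supposed False Positives turn out to be real. The counts are invented for illustration, and the ‘later review’ is a hypothetical scenario rather than a described study.

```python
def precision(tp: int, fp: int) -> float:
    """Precision (PPV) = TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when there are no positive results")
    return tp / (tp + fp)


# Hypothetical study: the Index Test flags 50 cases,
# and the reference standard agrees with 40 of them.
reported = precision(tp=40, fp=10)

# Hypothetical later review: 5 of the 10 'False Positives' were real
# cases that the reference standard had missed.
corrected = precision(tp=45, fp=5)

print(f"reported precision:  {reported:.2f}")
print(f"corrected precision: {corrected:.2f}")
```

With these numbers the reported Precision understates the corrected figure, purely because of errors in the reference standard.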

Recall / Sensitivity

Recall (red cells) is a measurement of the proportion of correct positive results, and is also known as the True Positive Rate, or Sensitivity.

It is calculated as follows:

Recall = TP / (TP + FN)

Recall / Sensitivity can only be truly useful when the Gold Standard is error-free, as it assumes that all positive results from the Gold Standard are indeed positive. If a human incorrectly reports a condition as present and the Index Test correctly reports it as absent, that correct result is counted as a False Negative, and the reported Recall will be artificially decreased.

It is worth noting that neither Precision nor Recall / Sensitivity take into account the True Negatives, and therefore do not provide a complete picture of ‘accuracy’. For this reason, the traditional data science metrics of Precision and Recall are not sufficient in a clinical setting for assessing accuracy. Remember — ‘You don’t have cancer’ is a True Negative, and neither Precision nor Recall tell us anything about whether an AI system can make that statement.
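A small sketch can show why these two metrics say nothing about negatives. The two systems and their counts below are made up for illustration; they differ only in the number of True Negatives.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


# Two hypothetical systems with identical TP, FP and FN counts.
# System A was tested on few disease-free cases, system B on many.
system_a = {"tp": 40, "fp": 10, "fn": 10, "tn": 40}
system_b = {"tp": 40, "fp": 10, "fn": 10, "tn": 4000}

for name, c in (("A", system_a), ("B", system_b)):
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    # TN never enters either calculation.
    print(f"System {name}: precision={p:.2f} recall={r:.2f} TN={c['tn']}")
```

Both systems score identically on Precision and Recall, even though system B correctly cleared a hundred times as many patients.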

Specificity

Specificity (yellow cells) is a measurement of the proportion of correct negative results, and is also known as the True Negative Rate.

It is calculated as follows:

Specificity = TN / (TN + FP)

True Negative Rate is crucial in diagnostic tests, as it is a measure of how often an Index Test correctly negates a diagnosis. This is important clinically, as ruling out a diagnosis has a large impact on the level of triage / further investigations / treatment required, not to mention the emotional impact a False Negative can have on a patient.

However, the True Negative rate can only be measured if the Gold Standard produces true negatives, which doctors tend not to do when making clinical diagnostic decisions. Imagine if a doctor’s differential diagnosis list had to include all possible negative diagnoses: it would be almost impossible to do so! In practice, it can be assumed that any diagnosis a doctor does not include in a differential has been excluded.
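One way to picture that assumption is with sets. In the sketch below, the closed list of candidate conditions and both differentials are hypothetical; the key assumption (labelled in the code) is that anything the doctor did not list counts as a negative.

```python
def specificity(tn: int, fp: int) -> float:
    """Specificity (True Negative Rate) = TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical closed list of conditions the system is able to output.
candidate_conditions = {
    "pneumonia", "pulmonary embolism", "lung cancer", "asthma", "heart failure",
}
doctor_differential = {"pneumonia", "asthma"}      # reference standard
ai_differential = {"pneumonia", "heart failure"}   # Index Test

# ASSUMPTION: any candidate the doctor did not list is treated as a negative.
reference_negatives = candidate_conditions - doctor_differential

tn = len(reference_negatives - ai_differential)  # neither listed it
fp = len(reference_negatives & ai_differential)  # AI listed it, doctor did not

print(f"TN={tn} FP={fp} specificity={specificity(tn, fp):.2f}")
```

The result is only as trustworthy as the assumption: a diagnosis the doctor omitted through oversight would be counted as a negative here.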

Negative Predictive Value

Negative Predictive Value (NPV, blue cells) is a measurement of how relevant a negative result is.

It is calculated as follows:

NPV = TN / (TN + FN)

As for the True Negative Rate, it can only be calculated if the Gold Standard produces True Negatives.

An AI system with a high NPV, Specificity and Sensitivity would be very likely to be approved for clinical use.
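Putting the four ratios side by side can help. The following sketch computes Sensitivity, Specificity, PPV and NPV from one table; the function name `clinical_metrics` and the counts are illustrative assumptions.

```python
from typing import Dict


def clinical_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Return the four column and row ratios of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # True Positive Rate / Recall
        "specificity": tn / (tn + fp),  # True Negative Rate
        "ppv": tp / (tp + fp),          # Precision
        "npv": tn / (tn + fn),          # Negative Predictive Value
    }


# Hypothetical validation set of 200 cases.
metrics = clinical_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that NPV and Specificity both need the TN count, so neither can be reported if the reference standard never produces negatives.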

Combining Binary Classifications

ROC Curve

Plotting 1-Specificity and Sensitivity pairs for each diagnostic cut-off gives a Receiver Operating Characteristic Curve (ROC Curve).

The higher the curve towards the upper left corner, the more ‘accurate’ the diagnostic test is.

The Area Under the Curve (AUC) gives a single numeric indicator of the discriminative power of a diagnostic test.

AUC is generally more useful when comparing two different diagnostic tests, rather than comparing an Index Test against a Gold Standard, as it does not give information on how good a test is at ruling-in or excluding a diagnosis or finding. It also requires a statistical significance calculation (p value) to be performed in order to place the result into context. (I’m not even going to touch on the arguments for and against p values!).
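For readers who want to see where the curve and the area come from, here is a minimal pure-Python sketch. It assumes the Index Test outputs a score per case and that a case is called positive when its score is at or above the cut-off; the scores and labels are made up, and the area is estimated with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (1 - specificity, sensitivity) pairs, one per distinct cut-off."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    points = [(0.0, 0.0)]
    # Lower the cut-off step by step; score >= cut-off means 'positive'.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under ROC points ordered by increasing 1 - specificity."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up scores for eight cases; True = condition present per the reference.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, True, False, False]

points = roc_points(scores, labels)
print(points)
print(f"AUC = {auc(points):.4f}")
```

An AUC of 0.5 corresponds to the diagonal (no discriminative power) and 1.0 to a curve that reaches the upper left corner.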

If two ROC curves do not intersect, then one method dominates the other, as it outperforms the other across the whole range of cut-offs. If two curves intersect, then there is a crossover point: on one side of it one system performs better, and on the other side the other system does. For instance, AI systems may be far more specific than humans, but humans may be more sensitive. This might manifest as two curves intersecting at the point at which both systems perform equally well.

It is interesting to note that an AI system, when compared to a perfect Gold Standard, can never have a better AUC. This is because it is impossible to improve on the perfect Gold Standard. The best you could hope for is equivalency. Luckily for AI developers, but not for patients, humans are error prone!

F1 Score

The harmonic mean of Precision and Recall (purple cells) is known as the F1 score, and is a commonly used measure of classification systems in machine learning.

It is calculated as follows:

F1 score = 2 x ((Precision x Recall) / (Precision + Recall))

Importantly, however, this score does not take True Negatives (TN) into account. How well an AI system handles negative results (including the diagnostic ‘miss’ rate) is important to ascertain, and hence Specificity and/or NPV should also be reported where possible.
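The formula above is the harmonic mean of the two rates, which punishes imbalance more than a simple average would. A short sketch, using invented Precision and Recall values, shows the difference:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = 2 x (Precision x Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here when both rates are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical systems: one balanced, one with high precision but poor recall.
balanced = f1_score(precision=0.8, recall=0.8)
lopsided = f1_score(precision=0.9, recall=0.1)
simple_average = (0.9 + 0.1) / 2

print(f"balanced F1:    {balanced:.2f}")
print(f"lopsided F1:    {lopsided:.2f}")
print(f"simple average: {simple_average:.2f}")
```

The lopsided system averages 0.5 arithmetically but scores far lower on F1, because a system that misses most cases is not rescued by being right when it does fire.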

Likelihood and Diagnostic Odds Ratios

In order to encompass both positive and negative rates, a combined single indicator incorporating the probability of detection and the probability of false alarm is required. False Positives are ‘false alarms’, and False Negatives are ‘false re-assurances’.

The Positive Likelihood Ratio (LR+ = TPR / FPR) is the ratio of the probability of detection to the probability of false alarm, and a high LR+ value is a good indicator for including a diagnosis.

The Negative Likelihood Ratio (LR- = FNR / TNR) is the ratio of the probability of false re-assurance to the probability of a correct negative result, and a low LR- value is a good indicator for ruling out a diagnosis.

These two ratios (incorporating all 4 pink cells) can be combined to form the Diagnostic Odds ratio (DOR) which is a measurement of the diagnostic accuracy of a test.

It is calculated as follows:

DOR = LR+ / LR- = (TPR/FPR) / (FNR/TNR)

The advantages of the DOR are that it does not depend on underlying prevalence, is a single indicator that doesn’t require statistical significance, has a simple calculation for 95% Confidence Intervals, and is medically recognised.
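As a sketch of the calculation, the code below computes LR+, LR- and the DOR from a 2x2 table, with a 95% Confidence Interval based on the commonly used standard error of the log odds ratio, sqrt(1/TP + 1/FP + 1/FN + 1/TN). The function name and counts are illustrative assumptions, and the sketch assumes no cell is zero (otherwise a continuity correction would be needed).

```python
import math
from typing import Dict


def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Likelihood ratios, DOR and an approximate 95% CI (all cells must be > 0)."""
    if min(tp, fp, fn, tn) <= 0:
        raise ValueError("all four cells must be positive for this sketch")
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    tnr = tn / (fp + tn)
    lr_pos = tpr / fpr
    lr_neg = fnr / tnr
    dor = lr_pos / lr_neg  # algebraically equal to (TP x TN) / (FP x FN)
    # Standard error of ln(DOR), then a normal-approximation interval.
    se_log = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return {
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
        "ci_low": math.exp(math.log(dor) - 1.96 * se_log),
        "ci_high": math.exp(math.log(dor) + 1.96 * se_log),
    }


# Hypothetical validation set of 200 cases.
result = diagnostic_odds_ratio(tp=80, fp=10, fn=20, tn=90)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A DOR of 1 means the test does not discriminate at all; the further above 1, the better, and a Confidence Interval that excludes 1 suggests the result is unlikely to be due to chance.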

Effectiveness

The effectiveness of a test is probably the closest to the lay-definition of ‘accuracy’, in that it is a measure of the proportion of correct outputs by the Index Test from all cases.

It is calculated as follows:

Effectiveness = (TP + TN) / (TP + TN + FP + FN)

However, effectiveness is considerably affected by underlying prevalence, making it an inferior statistical test. For example, given a set sensitivity and specificity, a disease with high prevalence in a population is more likely to be diagnosed by the Index Test than one with a low prevalence. For that reason, effectiveness is never reported as a single indicator, and is always reported alongside other measures such as PPV and NPV. In the case of testing an AI system, extremely high prevalence may be an issue when calculating effectiveness, as every single case tested has a diagnosis.
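The prevalence effect can be sketched numerically. Below, Sensitivity and Specificity are held fixed at invented values while prevalence changes; the last lines show a hypothetical test that never reports disease at all.

```python
def effectiveness(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Proportion of all cases classified correctly: (TP + TN) / all cases."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive Predictive Value at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same hypothetical test (Sensitivity 0.9, Specificity 0.8) in three populations.
for prevalence in (0.01, 0.50, 0.99):
    eff = effectiveness(0.9, 0.8, prevalence)
    pos_value = ppv(0.9, 0.8, prevalence)
    print(f"prevalence={prevalence:.2f} effectiveness={eff:.3f} PPV={pos_value:.3f}")

# A test that always says 'negative' (Sensitivity 0, Specificity 1)
# still looks very effective when the disease is rare.
always_negative = effectiveness(0.0, 1.0, 0.01)
print(f"always-negative test at 1% prevalence: {always_negative:.2f}")
```

The same test yields different effectiveness and PPV figures in each population, which is why a single effectiveness number is hard to interpret without knowing the prevalence in the test set.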

Summary

Studies not meeting strict methodological standards usually over- or under-estimate the indicators of test performance, which limits the applicability of their results. There are several statistical methodologies available to AI developers, each with their own advantages and disadvantages. When validating AI systems in a clinical setting it is important to consider whether True Negatives will be reported, as this will affect which statistical tests can be applied.

Currently, doctors’ differential diagnoses, reports or findings do not include all possible negatives (this would be too exhaustive), so an assumption will have to be made that an excluded diagnosis/finding is a negative. However, this could give some erroneous results by incorrectly counting as negatives diagnoses/findings that were excluded due to error/fatigue/knowledge bias. Conversely, without True Negatives, only Precision, Recall and F1 score can be deduced, which do not provide the complete picture in terms of medically relevant negative diagnoses/findings and omissions.

The final message is that there is no 100% correct way of reporting how accurate an AI system is in clinical settings, as each one has its own quirks! My advice would be to report ROC curves, AUC and DOR, as that way you are covering all aspects of positive and negative rates using statistical language that medics understand. Thankfully ROC curves are gaining popularity in machine learning circles, so for those of you applying your skills to medical problems, you’ll be able to make your results understood by the doctors you encounter.

References:

http://www.stard-statement.org/

http://bmjopen.bmj.com/content/6/11/e012799

http://www.ifcc.org/ifccfiles/docs/190404200805.pdf

http://gim.unmc.edu/dxtests/roc3.htm