In recent years, the number of publications on machine learning and deep learning in radiology has increased sharply. Some journals report that around a quarter of all their publications in 2018 related to these topics in one way or another. With so much research being published, it is important to be able to assess its scientific quality. To guide authors, reviewers and readers in evaluating AI-related research, Radiology’s Editorial Board published a brief guide that can serve as interim guidance until more formal guidelines on AI research are published.
So, what are the suggested items to look for in publications regarding machine learning and artificial intelligence?
1. All image sets (i.e. training, validation and testing set) should be clearly defined.
The image sets for training, internal validation and independent testing should be carefully selected and without overlap. Inclusion and exclusion criteria should be clearly mentioned.
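One common way to keep the three sets free of overlap is to split at the patient level rather than the image level, so that images of the same patient never end up in more than one set. The following is a minimal sketch under that assumption; the function names, the 70/15/15 split and the `(image_id, patient_id)` records are hypothetical and not taken from the guide.

```python
import random


def split_by_patient(patient_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign each unique patient to exactly one of train/validation/test."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)  # reproducible shuffle
    n = len(unique_ids)
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    train = set(unique_ids[:n_train])
    val = set(unique_ids[n_train:n_train + n_val])
    test = set(unique_ids[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


def assert_no_overlap(train, val, test):
    """Raise if any patient appears in more than one set."""
    if train & val or train & test or val & test:
        raise ValueError("patient overlap between image sets")


# Hypothetical image records: 60 images from 20 patients (3 images each).
images = [(f"img{i:03d}", f"pat{i // 3:02d}") for i in range(60)]

train, val, test = split_by_patient([pid for _, pid in images])
assert_no_overlap(train, val, test)

# Images follow their patient into the corresponding set.
train_images = [img for img, pid in images if pid in train]
```

Note that this only addresses internal splits; an external test set from a different institution, as recommended in the next item, has to be collected separately.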
2. An external dataset should be used for final statistical reporting.
Validation of the AI model’s results on an external and independent dataset is useful to exclude overfitting and document the model’s generalizability.
3. Multivendor images should be used for all datasets.
Images from different vendors may look different even to the radiologist’s eye. To prevent an AI model from being vendor-specific, images from a variety of vendors should be included in all steps.
4. Size of training, validation and testing sets should be justified.
Estimating the number of images needed can be difficult. However, if a clear estimate cannot be given, the authors should at least evaluate how model performance depends on the number of training images.
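One way to look at performance as a function of training set size is a learning curve: the same model is trained on increasingly large subsets and scored on a fixed test set. The sketch below illustrates the idea with a deliberately simple stand-in, assuming synthetic one-dimensional data and a threshold classifier instead of real images and a real network; all names and numbers are hypothetical.

```python
import random
import statistics


def make_samples(n, rng):
    """Synthetic 1-D feature per 'image': class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    samples = []
    for i in range(n):
        label = i % 2  # alternate labels so every subset contains both classes
        samples.append((rng.gauss(1.5 * label, 1.0), label))
    return samples


def fit_threshold(train):
    """Toy model: decision threshold halfway between the two class means."""
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, data):
    """Fraction of samples for which 'x > threshold' matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


def learning_curve(train_pool, test_set, sizes):
    """Train on the first n samples for each n and score on the fixed test set."""
    return [(n, accuracy(fit_threshold(train_pool[:n]), test_set)) for n in sizes]


rng = random.Random(42)
train_pool = make_samples(2000, rng)
test_set = make_samples(1000, rng)  # held out, never used for fitting

curve = learning_curve(train_pool, test_set, sizes=[4, 16, 64, 256, 1024, 2000])
for n, acc in curve:
    print(f"{n:5d} training samples -> test accuracy {acc:.3f}")
```

If the curve has flattened at the largest sizes, adding more training data of the same kind is unlikely to help much; if it is still rising, the training set is probably too small.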
5. A widely accepted reference standard is mandatory.
An established gold standard should be used to label the images. The radiological report as produced in clinical routine may not always be optimal (e.g. an enlarged lymph node in CT may have been reported as malignant, but only histopathological analysis can reliably determine the cause of enlargement).
6. Preparation of image data should be described.
Was manual interaction needed to prepare the images for the AI algorithm (e.g. definition of bounding boxes)? Or, did the algorithm simply consume all images of a specific DICOM-series? These considerations are important to estimate the usability of the algorithm in clinical routine.
7. The performance of the AI system should be compared to that of a radiology expert.
To determine an algorithm’s potential impact on clinical routine, its performance should be compared to that of a radiology expert. Outperforming students, for example, might be nice, but an algorithm that is inferior to an expert in the field is unlikely to have any practical impact.
8. The AI algorithm’s performance and decision making should be clear.
To alleviate the fear that an AI algorithm may be a black box, so-called saliency maps may be used to indicate which parts of the image were deemed relevant. But, more importantly, clinically useful performance metrics such as sensitivity, specificity, PPV and NPV should be reported instead of a single AUC value.
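The four metrics named above follow directly from the counts in a 2x2 confusion matrix. A minimal sketch, assuming a binary finding and hypothetical counts that are not from any study:

```python
def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),  # diseased cases correctly flagged
        "specificity": _ratio(tn, tn + fp),  # healthy cases correctly cleared
        "ppv": _ratio(tp, tp + fp),          # positive calls that are correct
        "npv": _ratio(tn, tn + fn),          # negative calls that are correct
    }


# Hypothetical test set: 100 diseased and 900 healthy cases.
metrics = diagnostic_metrics(tp=80, fp=50, tn=850, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, PPV and NPV depend on disease prevalence in the test set, which is one reason to report all four alongside the composition of the test set.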
9. In order to verify claims of performance, the AI algorithm should be accessible in some form.
The AI algorithm should be publicly available in some form so that independent researchers could potentially validate any performance claims made. This does not necessarily mean that it should be freely available, but researchers should be given some form of access to the algorithm so that the results can be verified.
Until more formalized reporting guidelines, such as CONSORT-AI and SPIRIT-AI, are published, these suggestions published by the RSNA could help authors, reviewers and readers to evaluate the scientific quality of AI-related research. They can be summarized as the following checklist:
- Are all three image sets (training, validation and test sets) defined?
- Is an external test set used for final statistical reporting?
- Have multivendor images been used to evaluate the AI algorithm?
- Are the sizes of the training, validation and test sets justified?
- Was the AI algorithm trained using a standard of reference that is widely accepted in our field?
- Was the preparation of images for the AI algorithm adequately described?
- Were the results of the AI algorithm compared with those of radiology experts and/or pathology?
- Was the manner in which the AI algorithm makes decisions demonstrated?
- Is the AI algorithm publicly available?
 Xiaoxuan Liu, Livia Faes, Melanie J Calvert, Alastair K Denniston on behalf of the CONSORT/SPIRIT-AI Extension Group (2019). The Lancet, DOI: 10.1016/S0140-6736(19)31819-7