A recent paper published in Nature Machine Intelligence has found several common yet serious flaws in machine learning (ML) models built for COVID-19 diagnosis and prognosis.
At the start of the pandemic, companies such as DarwinAI and major U.S. chipmaker Nvidia, along with groups such as the American College of Radiology, kickstarted efforts to produce technology that can detect COVID-19 from simple CT scans, X-rays, and other forms of medical imaging.
The technology is meant to help frontline health workers differentiate between COVID-19 and pneumonia and arrive at a clearer patient diagnosis; some of these tools were said to predict from a CT scan whether a person would die or need a ventilator.
However, a consortium of artificial intelligence (AI) researchers and healthcare professionals from the universities of Cambridge and Manchester, with specialties including infectious diseases, radiology, and ontology, has uncovered a number of shortfalls in the technology, claiming that major changes are needed before this form of machine learning can be used in a clinical setting.
The authors of the paper examined more than 2,200 papers. After removing duplicates and irrelevant titles, they narrowed the pool to 320 papers, which were given a full-text review for quality.
In the end, only 62 papers were considered fit to be part of what the authors called a systematic review of published research and preprints shared on open research paper repositories such as arXiv, bioRxiv, and medRxiv.
The analysis found that roughly half of the 62 papers made no attempt to perform external validation of training data, did not assess model sensitivity or robustness, and did not report the demographics of people represented in training data.
According to the paper, “Frankenstein” datasets, which are made with duplicate images assembled from other datasets and redistributed under a new name, were a cause for concern. The problem is harder to catch because only one in five COVID-19 diagnosis or prognosis models shared their code so that others could reproduce the results claimed in the literature.
“This repackaging of datasets, although pragmatic, inevitably leads to problems with algorithms being trained and tested on identical or overlapping datasets while believing them to be from distinct sources,” the authors noted.
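The overlap problem the authors describe can be checked mechanically, at least in its simplest form. The following is a minimal sketch, not the reviewers' method: it assumes each dataset is available as a mapping from file name to raw image bytes, and the function names (content_hash, find_overlap) are hypothetical. It only catches byte-identical copies; images that were resized or re-encoded when repackaged would need a perceptual comparison instead.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(data: bytes) -> str:
    """Return a SHA-256 digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()


def find_overlap(
    train: Dict[str, bytes], test: Dict[str, bytes]
) -> List[Tuple[str, str]]:
    """Return (train_name, test_name) pairs whose bytes are identical.

    Hypothetical helper: both arguments map a file name to file contents.
    """
    # Index the training set by content hash, ignoring file names,
    # since a repackaged image may have been renamed.
    by_hash: Dict[str, List[str]] = {}
    for name, data in train.items():
        by_hash.setdefault(content_hash(data), []).append(name)

    # Look up every test image in that index.
    overlap: List[Tuple[str, str]] = []
    for name, data in test.items():
        for train_name in by_hash.get(content_hash(data), []):
            overlap.append((train_name, name))
    return overlap


if __name__ == "__main__":
    # Toy stand-ins for image files; real use would read bytes from disk.
    train_set = {"siteA/001.png": b"scan-1", "siteA/002.png": b"scan-2"}
    test_set = {"merged/x17.png": b"scan-2", "merged/x18.png": b"scan-3"}
    print(find_overlap(train_set, test_set))
```

Any pair this reports means a model would be tested on an image it had already seen in training, which is one way the optimistic performance figures discussed in the review can arise.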
The authors later highlighted that, in their current reported form, none of the machine learning models included in the review are likely candidates for clinical translation for the diagnosis or prognosis of COVID-19.
“Despite the huge efforts of researchers to develop machine learning models for COVID-19 diagnosis and prognosis, we found methodological flaws and many biases throughout the literature, leading to highly optimistic reported performance,” the research paper read.
The researchers also found that ML models developed using medical imaging data had virtually no assessment for bias and were generally trained without sufficient images. Nearly every paper reviewed was found to be at high or uncertain risk of bias, with only six considered at low risk.
In addition, publicly available datasets are known to suffer from low-quality image formats and are not large enough to train reliable AI models. According to the paper’s methodology, the researchers used the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) and the radiomics quality score (RQS) to help assess the datasets and models.
“The urgency of the pandemic led to many studies using datasets that contain obvious biases or are not representative of the target population. Before evaluating a model, it is crucial that authors report the demographic statistics for their datasets, including age and sex distributions,” the paper said.
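As an illustration of the kind of demographic reporting the paper calls for, here is a minimal sketch using only the Python standard library. The record layout (a list of dictionaries with "age" and "sex" keys) and the function name summarize_demographics are assumptions made for this example, not something the paper specifies.

```python
from collections import Counter
from statistics import mean, median
from typing import Any, Dict, List, Optional


def summarize_demographics(
    records: List[Dict[str, Optional[Any]]]
) -> Dict[str, Any]:
    """Summarize age and sex distributions for a dataset.

    Hypothetical record layout: {"age": int or None, "sex": str or None}.
    Missing values are counted rather than silently dropped.
    """
    ages = [r["age"] for r in records if r.get("age") is not None]
    sexes = Counter(r.get("sex") or "unknown" for r in records)
    return {
        "n": len(records),
        "age_mean": mean(ages) if ages else None,
        "age_median": median(ages) if ages else None,
        "age_min": min(ages) if ages else None,
        "age_max": max(ages) if ages else None,
        "age_missing": len(records) - len(ages),
        "sex_counts": dict(sexes),
    }


if __name__ == "__main__":
    # Made-up records purely for demonstration.
    patients = [
        {"age": 30, "sex": "F"},
        {"age": 50, "sex": "M"},
        {"age": None, "sex": None},
    ]
    for key, value in summarize_demographics(patients).items():
        print(f"{key}: {value}")
```

Reporting the missing-value counts alongside the distributions matters here, because a dataset with mostly unknown ages cannot be shown to represent the target population.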
The authors stressed that higher-quality datasets, manuscripts documented well enough to be reproducible, and external validation are all required to increase the likelihood of models being taken forward and integrated into future clinical trials to establish independent technical and clinical validation as well as cost-effectiveness.