Identifying the Best Image Classification Algorithm for COVID-19 Diagnosis with a Small, Imbalanced Chest X-Ray Dataset


With more than 82 million COVID-19 cases worldwide [1], automated chest radiograph interpretation could provide substantial benefit for efficient and accurate diagnosis of COVID-19 patients. In this project, families of deep learning neural networks are trained on publicly available chest X-ray datasets to identify the best image classification algorithm for automating the diagnosis of respiratory illnesses. The lack of large chest X-ray datasets for novel respiratory diseases is a challenge for this project, and one that is common when applying deep learning to medical diagnosis. Through extensive experimentation with different algorithms, this project adds to the evidence on the power of pretraining for transfer learning in healthcare applications, as well as the importance of addressing data imbalance via different sampling methods. As in many real-world medical imaging applications, publicly available chest X-ray datasets are not abundant, and ground truth data of COVID-19 diagnosis is especially hard to come by [2]. To overcome this issue, neural networks are first pretrained on the ChestX-ray14 database generated by NIH. Transfer learning from ImageNet has become a widespread method for deep learning applications to medical imaging; however, pretraining on a domain-specific dataset significantly improves performance during transfer learning. Second, to address the imbalance within the training data, I implemented two alternative data sampling methods. Fixed-fraction-per-batch sampling is the only method that ensures COVID-19 chest X-ray images are present in every batch. Third, three families of neural networks that represent state-of-the-art image classification architectures are analyzed: DenseNet, EfficientNet and ResNet.
Precision and Recall were measured to evaluate the performance of each algorithm, where an algorithm is defined by the combination of pretraining dataset, data sampling method, and neural network architecture. Based on extensive experimentation, the algorithm with pretraining on the ChestX-ray14 database, the fixed-fraction-per-batch sampling method, and the DenseNet169 model has the highest Recall and Precision for COVID-19 chest X-ray images, at 87% and 84% respectively.


On February 11, 2020, CDC reported that COVID-19 is a respiratory disease caused by SARS-CoV-2, a new coronavirus discovered in 2019 [3]. Doctors and infectious disease specialists soon recognized that early diagnosis is crucial for triage and treatment in order to slow the spread of the disease. In the face of this highly contagious disease, health organizations turned to nasopharyngeal swabs, a procedure for collecting specimens from the surface of the respiratory mucosa, for diagnosis. However, their accuracy, up to 73% based on research [2], is influenced by the severity of the disease and the time from symptom onset. CT is a sensitive tool for early detection of peripheral ground glass opacities; however, there are infection control issues related to transporting patients to CT suites, and decontaminating CT rooms introduces inefficiencies. Conventional chest radiography, on the other hand, can reveal patterns that indicate the presence of COVID-19, such as ground glass densities and peripheral air space opacities. Many recent studies illustrate the growing interest in and use of chest X-ray (CXR) imaging around the globe; some predict a greater reliance on portable chest X-rays and point to their high value for critically ill patients. In comparison to CT scanners, chest X-ray imaging is widely available around the world due to its relatively low cost and comparatively lower contamination rates [2]. In addition, portable chest X-ray units allow imaging to take place in an isolated room, thus reducing the risk of infection.

Prior Related Works

A number of approaches to the twin challenges of diagnosis and data scarcity have been advanced by universities and related organizations. CheXNet, an algorithm developed by Stanford’s Machine Learning Group, claims to detect pneumonia from chest X-rays at a level exceeding practicing radiologists. CheXNet is trained on ChestX-ray14, which includes over 100,000 frontal X-ray images covering 14 diseases, and achieves an F1 score of 0.435. This 121-layer convolutional neural network outperforms the radiologist average of 0.387.

Tartaglione et al. performed an extensive experimental evaluation of different combinations of pretraining and transfer learning with standard CNN models [2]. The authors discuss the effect of different pretraining datasets (ChestXRay and RSNA) and training datasets (CORDA and COVID-ChestXRay) on performance, which is measured using balanced accuracy and diagnostic odds ratio. A research team led by Pingkun Yan of Rensselaer Polytechnic Institute developed an algorithm to predict whether or not a COVID-19 patient would need ICU intervention. Yan et al. combined chest computed tomography (CT) images that assess the severity of a patient’s lung infection with non-imaging data, such as demographic information, vital signs, and laboratory blood test results [4]. Yan notes that the impact of such work could go well beyond COVID-19; the algorithm could also help predict mortality risk for patients with other lung diseases and so help manage their condition. COVID-Net, an open source initiative, is a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images. Wong et al. aimed to assess the feasibility of computer-aided severity scoring of SARS-CoV-2 using deep learning. In their study, they used transfer learning, a technique to improve the performance of deep neural networks, initializing the network parameters from deep neural networks trained on COVIDx, a dataset introduced in the Wang study.


The challenges of using chest X-rays to diagnose COVID-19 are twofold. As in most real-world applications, publicly available chest X-ray image datasets are not abundant, and ground truth data of COVID-19 diagnosis is especially hard to come by. Most standard Convolutional Neural Networks (CNNs) are trained on ImageNet, which includes images such as flower species, dog breeds, furniture, and other household objects. Previously, medical image classification with convolutional neural networks has been applied to pneumonia as well as a variety of other lung diseases with large datasets of CXRs. In transfer learning, we take what well-trained, well-constructed networks have learned from large datasets and apply it to boost the performance of a detector on a smaller dataset [1]. The main contribution of this work is an extensive experimentation with different combinations of pretraining datasets, data sampling methods, and neural network architectures. Furthermore, other hyperparameters such as learning rate, image resolution, and batch size are varied to explore possible improvements in performance.

Specifically, each algorithm is determined by a combination of the following three variables. First, given the naturally occurring shortage of COVID-19 chest X-ray images, would pretraining on a domain-relevant dataset be beneficial compared to using ImageNet pretrained weights? Second, which data sampling method best addresses the imbalanced training dataset, which might otherwise lead to underperformance on COVID-19 prediction? Third, among the neural network architectures of DenseNet, EfficientNet and ResNet, which would be the best choice?


In this project, I study families of deep learning neural networks that are trained on publicly available chest X-ray datasets to automate the diagnosis of respiratory illnesses. Specifically, the learned networks are used to classify anonymised chest X-ray images into three classes: healthy, COVID-19 and non-COVID pneumonia.

First, three families of neural networks that represent state-of-the-art image classification architectures are analysed: DenseNet, EfficientNet and ResNet. DenseNets have been shown to be the best architecture for X-ray predictive models [1]. Three variations of DenseNets are used in this project: DenseNet121, DenseNet169 and DenseNet201. The EfficientNet family is known for its high accuracy and low footprint in other applications. Three variations of EfficientNets are used: EfficientNetB4, EfficientNetB5 and EfficientNetB6. ResNet, a well-studied convolutional neural network, is included to calibrate the predictive performance of this project against other research. Three variations of ResNets are used: ResNet34, ResNet50 and ResNet101.

Because publicly available chest X-ray datasets are scarce, researchers have been relying on transfer learning, a technique that uses the weights from a pretrained network to speed up the learning process. This technique has proven effective in several computer vision tasks [2], even when transferring weights from completely different domains. In this project, the first improvement to the predictive power of the neural networks comes from pretraining each network on an independently acquired, non-overlapping dataset, which must be domain-relevant and much larger than the transfer learning dataset. “Independently acquired” means that the dataset was gathered by an independent organization through its own means. For my pretraining dataset, I use the ChestX-ray14 database generated by NIH [2]. It contains over 100,000 images and includes 13 different classes of thoracic conditions, such as Hernia, Fibrosis, and Pneumonia, as well as normal patients. To fit within the computing resources used in this project, I downsize the original dataset and keep it well-balanced across all 14 classes.

The second improvement in this project addresses the natural shortage of COVID-19 chest X-ray images. Based on my research, this shortage leads to imbalanced training data and causes neural networks to underperform on underrepresented classes, which is the case for COVID-19. Two alternative methods are implemented to customise the data sampling configuration. The first is the equal-weight-per-epoch method, where the sampling probabilities for each class are changed such that, during one training epoch, the number of images drawn from each class is equal to a third of the total number of images. The second is the fixed-fraction-per-batch method, where a fixed number of images from the COVID-19 class, determined by a specified fraction, is included in each training batch, and the rest of the batch is distributed between the other two classes in proportion to their population.
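
As a rough illustration of the equal-weight-per-epoch idea, the sketch below computes one sampling weight per image as the inverse of its class frequency, so that each class carries about a third of the total sampling probability. The class counts are made-up placeholders rather than the project's actual numbers, and the function name is hypothetical; per-image weights of this kind are the input that PyTorch's `WeightedRandomSampler` expects.

```python
from collections import Counter

def equal_weight_per_epoch(labels):
    """Return one sampling weight per image so every class has equal total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    # Each class gets total probability 1 / n_classes, shared equally by its images.
    return [1.0 / (n_classes * counts[y]) for y in labels]

# Hypothetical, heavily imbalanced label list (not the real dataset sizes).
labels = ["normal"] * 800 + ["pneumonia"] * 550 + ["covid"] * 50
weights = equal_weight_per_epoch(labels)

# Total sampling probability assigned to each class.
class_mass = Counter()
for y, w in zip(labels, weights):
    class_mass[y] += w
print({c: round(m, 3) for c, m in class_mass.items()})
```

With these weights, a rare COVID-19 image is drawn far more often than a single normal image, which is what equalises the class totals over an epoch.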

Given both improvements, I repeat transfer learning for each neural network while varying two hyperparameters, that is, manually controlled parameters that are external to the neural network and affect the learning process. The first hyperparameter selects the pretrained weights, and the second customises the sampling configuration. As the control for each neural network, pretrained weights learned from the classic ImageNet dataset are used, and no customised training data sampling method is applied. As a result, for each neural network, six sets of performance metrics are collected along with their hyperparameters: one control set and five experiment sets.
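
One way to read the six sets per network is as a grid of two pretraining choices by three sampling choices. The sketch below enumerates that grid and flags the control run; the string labels and the function name are my own shorthand for illustration, not identifiers from the project code.

```python
from itertools import product

# Shorthand labels for the two hyperparameters (assumed names).
PRETRAINING = ["ImageNet", "ChestX-ray14"]
SAMPLING = ["imbalanced", "equal-weight-per-epoch", "fixed-fraction-per-batch"]
CONTROL = ("ImageNet", "imbalanced")

def experiment_grid():
    """List every (pretraining, sampling) pair and flag the control run."""
    return [
        {"pretraining": p, "sampling": s, "is_control": (p, s) == CONTROL}
        for p, s in product(PRETRAINING, SAMPLING)
    ]

grid = experiment_grid()
for run in grid:
    print(run)
```

Under this reading, each of the nine networks is trained six times: once as the control and five times as an experiment.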


COVID-Net [2], an ongoing open-source project on the detection of COVID-19, is used in this project to aggregate the chest X-ray image dataset with ground truth labels, containing 13,975 images across 13,870 patient cases, from normal and healthy patients, COVID-19 patients, as well as patients infected with other respiratory viruses. The dataset is constructed from five different open-source chest radiography datasets, but the two main sources are:

  1. The “COVID-19 Image Data Collection”, also known as the Cohen dataset [4], is a dataset of anonymized COVID-19 images collected from public sources and hospitals/physicians.
  2. The RSNA Pneumonia Detection Challenge [5], created by RSNA, NIH, and the Society of Thoracic Radiology, offers images for healthy and non-COVID pneumonia cases.

To generate the datasets, follow the steps below:

  1. git clone
  2. git clone
  3. git clone
  4. For the Cohen dataset, download only the COVID-19 image folder and metadata file from this link:
  5. For the RSNA Pneumonia Detection Challenge database, download only stage_2_train_images, stage_2_detailed_class_info, stage_2_train_labels from this link:
  6. In a Jupyter notebook, combine the five datasets and map patients from each to their respective classes: COVID-19, normal, and pneumonia.
  7. Split the combined dataset into train and test sets by patient ID first, and then split the train set into train and validation sets.
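
Splitting by patient ID matters because one patient can have several images, and the same patient should not appear in more than one set. The sketch below shows one way to do such a split in plain Python; the record layout, the split fractions, and the function name are assumptions for illustration, not the project's actual values.

```python
import random
from collections import defaultdict

def split_by_patient(records, test_frac=0.1, val_frac=0.1, seed=0):
    """Split (patient_id, image_path, label) records so no patient spans two sets."""
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec[0]].append(rec)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # First carve out the test patients, then take validation from what remains.
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = patients[:n_test]
    remaining = patients[n_test:]
    n_val = max(1, int(len(remaining) * val_frac))
    val_ids = remaining[:n_val]
    train_ids = remaining[n_val:]

    def collect(ids):
        return [rec for pid in ids for rec in by_patient[pid]]

    return collect(train_ids), collect(val_ids), collect(test_ids)

# Toy records: 20 hypothetical patients with two images each.
records = [(f"p{i:02d}", f"img_{i}_{j}.png", "normal") for i in range(20) for j in range(2)]
train, val, test = split_by_patient(records)
print(len(train), len(val), len(test))
```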


Throughout this project, for each of the nine neural networks, these are the main steps for transfer learning:

  1. Import its pretrained weights
  2. Modify the neural network to include a classifier based on a multilayer perceptron (MLP)
  3. Define the layers for which weights will be updated during the learning process
  4. Implement suitable loss function and optimizer algorithm
  5. Optionally customise sampling method of the chest X-ray image dataset
  6. Train the modified neural network
  7. Measure predictive power based on a hold-out validation dataset

As control for each neural network, pretrained weights learned from the classic ImageNet dataset are used (step 1), and no customized training data sampling method is applied (step 5). Keeping the neural network (steps 2, 3 and 4) as well as validation method (step 7) the same, I then vary the pretrained weights (step 1) and configure the sampling method (step 5). Therefore, the neural network, the validation method as well as the training data are the controlled variables.

The pretraining steps are very similar to transfer learning in general, except that all layers of the neural network are trained (step 3), and the loss function and optimiser algorithm are implemented for multi-class multi-label classification (step 4). To implement domain-specific pretraining, download the ChestX-ray14 database generated by NIH and train all layers for each neural network. Save checkpoint paths for each neural network.

Here are the significant implementation details for transfer learning:

  1. Define Model class for each neural network. Specify architecture and number of classes.
    • Import from torchvision.models package nine different neural networks, optionally with pretrained weights from the ImageNet dataset.
    • Modify the neural network to include a feed forward MLP classifier layer using ReLU activation and dropout. Adjust input, hidden and output features such that the dimension of the input tensor is 1000 (number of classes in ImageNet) and the dimension of the output tensor is 3 (for healthy, COVID-19 and non-COVID pneumonia).
    • Add a condition for either training all neural network layers or freezing all model parameters except the newly built classifier, so the layers for which weights will be updated during training are defined.
  2. Define a Trainer class that can receive parameters for neural network used, choice of sampling method (optional), and path to saved pretrained weights (optional).
    • To optionally load a model from a prior checkpoint, create a method to receive a model checkpoint, verify any missing or mismatched keys, and load the model state dictionary.
    • For equal-weight-per-epoch sampling method, use WeightedRandomSampler from package.
    • For fixed-fraction-per-batch sampling method, use 0.3 as default value of a hyperparameter for fraction of COVID-19 images in each batch. (Experiments will be done to tune the default fraction based on top performance metrics of the neural networks.)
  3. Train the modified neural network:
    • Load train, test and validation datasets from a CSV file.
    • Specify image transformations.
    • Use a loss function of Negative Log Likelihood (NLL) and an Adam optimizer.
    • Iterate through the number of epochs, calculate and backpropagate the loss, compute accuracy per batch, and log time taken.
  4. At the end of each epoch, measure predictive power based on a hold-out validation dataset:
    • Calculate the average precision, F1 score, and confusion matrix.
    • Save the checkpoint based on the average precision of the COVID-19 class.
  5. Add a hyperparameter to allow toggling between training only the MLP classifiers and training all layers of the neural network. By default, only weights for the MLP classifier are updated during transfer learning. Additional experiments will be done to train all layers of the neural network and compare the performance metrics with top performers of pure transfer learning.
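
To make the fixed-fraction-per-batch method concrete, the sketch below builds batches of dataset indices in which a fixed number of COVID-19 images, set by the fraction (0.3 by default, as above), appears in every batch. It is a minimal illustration under my own assumptions, not the project's implementation: the batch size, the label strings, and the function name are hypothetical, COVID-19 images are drawn with replacement, and the remaining slots are filled from a shuffled pool of the other two classes so that they appear roughly in proportion to their population.

```python
import random

def fixed_fraction_batches(labels, batch_size=32, covid_fraction=0.3,
                           covid_label="covid", seed=0):
    """Yield lists of dataset indices with a fixed number of COVID-19 images per batch."""
    rng = random.Random(seed)
    covid_idx = [i for i, y in enumerate(labels) if y == covid_label]
    other_idx = [i for i, y in enumerate(labels) if y != covid_label]
    n_covid = max(1, round(batch_size * covid_fraction))
    n_other = batch_size - n_covid

    rng.shuffle(other_idx)
    # One epoch is one pass over the majority-class images (an incomplete tail is dropped).
    for start in range(0, len(other_idx) - n_other + 1, n_other):
        # Minority images are sampled with replacement, so they repeat across batches.
        batch = rng.choices(covid_idx, k=n_covid) + other_idx[start:start + n_other]
        rng.shuffle(batch)
        yield batch

# Hypothetical label list (not the real dataset sizes).
labels = ["normal"] * 800 + ["pneumonia"] * 550 + ["covid"] * 50
batches = list(fixed_fraction_batches(labels))
print(len(batches), sum(labels[i] == "covid" for i in batches[0]))
```

In PyTorch, an iterable of index lists like this can be supplied through the `batch_sampler` argument of `DataLoader`; a generator would need to be re-created for each epoch.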


For both the pretraining of neural networks and transfer learning, a number of classification performance metrics are evaluated. For transfer learning, two metrics, average precision and F1 score, are computed during training using scikit-learn, an open-source library that provides metric computation among numerous other machine learning tools. Precision and Recall are then calculated manually from the confusion matrix for each neural network and hyperparameter setting.

In regard to the pretraining of neural networks, Average Precision and ROC AUC were the metrics used. ROC AUC is the area under the Receiver Operating Characteristic (ROC) curve and provides an aggregate measure of performance across all possible classification thresholds.

For formal definition:

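  1. Recall (COVID-19) – TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives; in other words, of all the samples that truly belong to a class, how many does the model predict correctly?
  2. Precision (COVID-19) – TP / (TP + FP), where FP is the number of false positives; in other words, of all the samples the model predicts to be a certain class, how many are correct?
  3. F1 score – 2 × Precision × Recall / (Precision + Recall); the harmonic mean of precision and recall
  4. Average Precision – area under the precision-recall curve
  5. ROC AUC – area under the ROC curve, which plots the true positive rate against the false positive rate at different thresholds.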
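
The manual calculation of Precision and Recall from a confusion matrix can be sketched as follows. The matrix values are invented for illustration and are not results from this project; rows are assumed to hold true classes and columns predicted classes, which is also the layout scikit-learn's `confusion_matrix` returns.

```python
def per_class_metrics(confusion, class_index):
    """Compute precision, recall and F1 for one class.

    confusion[i][j] is the number of samples with true class i predicted as class j.
    """
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp                     # true class, predicted otherwise
    fp = sum(row[class_index] for row in confusion) - tp      # other classes, predicted as this one
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up 3x3 confusion matrix; rows and columns ordered as normal, pneumonia, COVID-19.
confusion = [
    [85, 10, 5],
    [12, 80, 8],
    [6, 7, 87],
]
precision, recall, f1 = per_class_metrics(confusion, class_index=2)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```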

Results and Discussion

Pretrain or not Pretrain?

To understand the substantial gap in performance between pretraining on CXR14 and on ImageNet, the Precision and Recall metrics are used to contrast the control and experimental groups. Here, the control consists of image classification algorithms that are pretrained on ImageNet and trained on raw, imbalanced data. Experimental groups include algorithms that are pretrained on CXR14 and/or trained with a data sampling method. For the purpose of this section, the main focus is ImageNet pretraining versus domain-specific CXR14 pretraining.

It is key to point out that an extremely high Precision metric with a low Recall metric is misleading in terms of the algorithm’s performance; concurrently, a high Recall metric in combination with a low Precision metric does not immediately indicate ideal performance in an algorithm. A high Precision essentially means that out of all of the samples the algorithm predicted as COVID-19, a majority of the samples have a true label of COVID-19; however, a low recall signifies that in all of the true COVID-19 samples, the algorithm correctly predicted only a few. Ideally, an algorithm has both high Precision and high Recall; this would indicate that not only did it classify a majority of all the true COVID-19 classes correctly but also in all of its predicted COVID-19 samples, a majority of them are in fact COVID-19 classes.

With that said, the control group (in grey) is isolated in the top left corner; while Precision is extremely high, reaching values of 0.96 and 0.95, the Recall metric for the control group is consistently below a threshold of 0.52. The two experimental groups (in red) with ImageNet pretraining and data sampling methods are mostly contained in the bottom right corner. Recall metrics for these groups are high at 0.86; however, their Precision is consistently below a threshold of 0.8, significantly lower than that of the Chest X-Ray 14 pretrained results. The Precision advantage of CXR14 pretraining is clearest when results are compared at the same value of Recall.

Which Neural Network?

In this experiment, nine neural networks from three families were trained. For the DenseNet family, DenseNet121, DenseNet169, and DenseNet201 were utilized. For the ResNet family, ResNet34, ResNet50, and ResNet101 were utilized. Lastly, from the EfficientNet family, EfficientNetB4, EfficientNetB5, and EfficientNetB6 were utilized. Each of these CNNs is a pretrained model that has previously been trained on ImageNet and contains the weights and biases that represent the features of ImageNet. While the data points for each neural network are slightly more dispersed and less clustered, notable details can be extracted from the scatter plot.

The most significant observation is that the Recall metric for the control group is consistently less than 0.6, which is a clear indicator of underperformance on COVID-19 prediction. The family with significantly higher performance is DenseNet; surprisingly, DenseNet169 has one of the highest combinations of Recall and Precision, at 0.870 and 0.837. EfficientNet has much lower performance; its Precision is below a threshold of 0.7 for most of its trained models.

Which Sampling Method?

Three distinct sampling methods are evaluated:

  1. Imbalanced
  2. Fixed-Frac (fixed-fraction-per-batch)
  3. Equal-Weight (equal-weight-per-epoch)

One notable observation is that the Recall of the imbalanced configuration is never greater than 0.7; keep in mind that the control uses the imbalanced method. At the same level of Recall, fixed-fraction-per-batch shows slightly higher Precision than equal-weight-per-epoch, but overall the results of the two methods overlap.


Currently, COVID-19 diagnosis relies heavily on nasopharyngeal swabs; however, their accuracy, reported at up to 73%, can be influenced by the severity of the disease and the time from symptom onset. Due to their widespread availability and comparatively lower contamination rates, chest X-rays can be used to identify COVID-19 in a patient. At the same time, automating the interpretation of chest X-rays would bring substantial benefits in many medical settings.

In this experiment, we aimed to identify the best image classification algorithm to diagnose COVID-19; in particular, we took on this challenge with a small, imbalanced chest X-ray dataset. Algorithms, each combining a pretraining dataset, a sampling method configuration, and a neural network, were tested, and Precision and Recall were the main metrics used to evaluate performance. At 87% Recall and 83.7% Precision for the COVID-19 class, the highest performing image classification algorithm used Chest X-ray 14 pretraining, the fixed-fraction-per-batch sampling method, and the DenseNet169 model.

Transfer learning from ImageNet using standard CNN models and corresponding pretrained weights has become a widespread method for deep learning applications to medical imaging. However, there are fundamental differences in data sizes, features and task specifications between natural image classification and the target medical tasks. Pretraining on a domain-specific dataset, in this case chest X-ray images, gives weights that are customized for the task at hand and thus significantly improves performance during transfer learning. With data sampling applied, networks pretrained on ImageNet never broke the threshold of 0.8 for Precision, whereas networks pretrained on Chest X-ray 14 reach Precisions higher than 0.9. This considerable gap between vanilla transfer learning and pretraining on a domain-specific dataset is caused by the fundamental differences between ImageNet and chest X-ray images. While an ImageNet image usually contains a clear subject, the diagnostic signal in a chest X-ray lies in local regions of interest, such as the white opaque patches that indicate lung consolidations and ground glass densities. As a result, many of the weights a CNN learns from ImageNet are not applicable to chest X-rays; only the low-level layers responsible for detecting edges are carried over.

The fixed-fraction-per-batch sampling configuration is the only method that ensures COVID-19 positive chest X-ray images are present in every batch. With both the imbalanced and equal-weight-per-epoch configurations, COVID-19 images are absent from many batches, and performance during transfer learning is severely affected. Class imbalance is a common challenge in medical image diagnosis; often, images of patients who are positive for the target disease are scarce. When the distribution of classes is skewed, the result is poor predictive power, especially for the minority class. With the fixed-fraction-per-batch sampling configuration, the network is guaranteed to see images of the minority class in every batch during training.


[1] CDC. “Coronavirus (COVID-19) frequently asked questions.” Centers for Disease Control and Prevention, 2021, ncov/faq.html. Accessed 23 August 2020.

[2] Tartaglione, Enzo. “Unveiling COVID-19 from Chest X-ray with deep learning: a hurdles race with small data.” Cornell University, 11 April 2020, Accessed 12 October 2020.

[3] CDC. “United States COVID-19 Cases and Deaths by State.” Centers for Disease Control and Prevention, 2021, tracker/#cases_casesper100klast7days. Accessed 3 September 2020.

[4] Yang, Yang. “Laboratory Diagnosis and Monitoring the Viral Shedding of SARS-CoV-2 Infection.” ScienceDirect, 25 November 2020, Accessed 15 December 2020.

[5] Jacobi, Adam. “Portable chest X-ray in coronavirus disease-19 (COVID-19): A pictorial review.” NCBI, 8 April 2020, 10.1016/j.clinimag.2020.04.001. Accessed 12 January 2021.

[6] Devopedia. “ImageNet.” Devopedia, 2019, Accessed 23 January 2021.

[7] Galiatsatos, Panagis. “What Coronavirus Does to the Lungs.” Johns Hopkins Medicine, 13 April 2020, Accessed 12 September 2020.
