Machine learning powered by artificial neural networks (or variations thereof) has become increasingly important in recent years, with applications ranging from facial recognition to ECG analysis. Applications in automated diagnosis have become increasingly important in countries with a growing shortage of qualified physicians.
In this paper, we focus on optimisation of the structure and properties of an artificial neural network, built from scratch, in order to analyse (and diagnose pathology on) ECGs. We used an open database, ECG-ViEW II, containing 979,273 ECG samples from 461,178 patients over the years 1994 to 2013. We then built a neural network in Python capable of analysing a subset of these ECG data, with a number of variable learning criteria as explained below. We then varied three of these criteria (momentum, batch size, and alpha; see section 2.1.3) experimentally, finding optima in their numerical values. These criteria could serve as a future model for criteria in analysing more diagnoses on an ECG. Using these optima for the final learning model, we then performed 10 test-runs with the full database, finding the average error rate for diagnosis. These diagnoses differentiated between Angina Pectoris and non-heart-related diagnoses, achieving an average of 92.5% of ECGs correctly diagnosed. We then evaluated our neural network model and discussed possible future implications and improvements.
The objective of this research is to experimentally find optimum learning criteria for neural networks specifically trained for ECG analysis, in order to maximise learning rate and accuracy of such programs and serve as a model for future research.
Keywords: Neural Network, Machine Learning, ECG, Cardiology, NHS, Diagnosis, Deep Learning, Optimisation, Biology, Mathematics, Electrocardiography, Python.
The hypothesis for the project is that significant improvements can be made to neural network performance in ECG analysis through optimisation of network structure.
A neural network is a form of data processing that runs data through a process similar to that of the human brain. A neural network is given a set of inputs. These inputs are sent along weighted synapses, which multiply the data by a set amount, to layers of artificial neurons. Each neuron, a point at which synapses meet, sums the weighted inputs from the previous layer and puts this sum through an “activation function”, specific to the neural network, to help process the data. The neurons then pass on values to the next layer through more synapses. This process continues through the layers until the output layer is reached, which returns the predicted outcomes of the network. For example, if a neural network were created with the purpose of recognising written numbers, the inputs would be pixel values. These values would be passed through the network to give an output value of the number that had been written. Neural network programs use a process called deep learning in order to generate networks that give the correct outputs for the given task. The number and size of the layers are known as the “structure” of the neural network. See Figure 2.1.1-1 for the visual structure; the synapses are represented by the lines (edges) in the diagram, and the neurons are represented by the coloured circles. Green circles represent input data, while purple circles represent hidden layers and pink circles represent the output. A visual understanding of the structure of the neural network is helpful for understanding the choices relating to network structure later in the paper.
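The layer-by-layer process described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the library used in this study: the function names, the bias terms, the random weights, and the input size of 10 (counted from the inputs listed later in the paper) are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate a batch of inputs through each layer in turn."""
    activation = x
    for w, b in zip(weights, biases):
        # Each neuron sums its weighted inputs, then applies the activation.
        activation = sigmoid(activation @ w + b)
    return activation

# Assumed layer sizes: 10 inputs, two hidden layers of 8 neurons, 2 outputs.
sizes = [10, 8, 8, 2]
rng = np.random.default_rng(0)
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

batch = rng.normal(size=(5, sizes[0]))  # five made-up input vectors
outputs = forward(batch, weights, biases)
print(outputs.shape)
```

Each row of `outputs` holds one value per output neuron; with untrained random weights these values carry no diagnostic meaning.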
Backpropagation is a deep-learning technique often applied to teach neural networks. This process involves randomly generated networks being fed test cases (data with expected outputs). The backpropagation algorithm then compares the network’s results with those expected, giving a value for the network’s error. Given the error values, the algorithm adjusts the networks according to a gradient descent function, which attempts to change networks in a direction that will reduce error as much as possible. Those networks which produce lower error values are adjusted to a lesser extent, and those that are more error-prone are changed more. In this way, the networks converge on a solution, a minimum of error. However, this comes with a problem (see Figure 2.1.2-1). In many cases, the non-linear nature of the questions at hand leads to an issue called local minima. This is where a backpropagation algorithm converges on a solution that appears to be the “optimal network”: a minimum of error.
However, this is only a local minimum, and there is a better solution elsewhere in the “search space” (the mathematical area in which all solutions lie). By this time, the algorithm is trapped at the local minimum, never finding the global one. This means that the true solution will never be uncovered. That is highly problematic, especially in an area such as ECG analysis because even a small error can lead to incorrect diagnosis, which is troublesome in a real-life medical environment.
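The local-minimum problem can be seen on a one-dimensional toy example. The sketch below is an illustration only: the function `f`, the step size, and the starting points are chosen for the example and have nothing to do with the ECG error surface. Plain gradient descent ends up in a different basin depending on where it starts.

```python
def f(x):
    """Toy 'error' curve with one shallow and one deep minimum."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, rate=0.01, steps=500):
    """Repeatedly step downhill along the gradient."""
    for _ in range(steps):
        x -= rate * grad_f(x)
    return x

x_right = gradient_descent(2.0)   # settles in the shallower (local) minimum
x_left = gradient_descent(-2.0)   # settles in the deeper (global) minimum
print(round(x_right, 3), round(f(x_right), 3))
print(round(x_left, 3), round(f(x_left), 3))
```

Started from the right, the descent never reaches the lower error available on the left, which is the behaviour the paragraph above describes.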
Such neural networks also have the problem of “overfitting”; the neural network becomes so optimised on the training cases that it uses random patterns (which are not true in the general case) to reach the correct output. This means that when the neural network is tested on other data, these patterns do not hold and it does not perform as well.
In order to prevent the learning algorithm from becoming trapped in local minima, the gradient descent function operates “stochastically”. That is to say, instead of descending the “error landscape” (i.e. the higher-dimensional graph with all the linear variables and error rates) perfectly, the gradient descent function moves around with a degree of randomness and momentum, as we will expand upon later in 2.3. This means that the gradient descent function is less likely to become trapped in small local minima, as it might spontaneously leave these due to stochastic movement, only to return to this minimum if no lower minimum exists, i.e. if the minimum is the global minimum. However, the degree of this stochastic movement has to be balanced against the risk of the learning algorithm being unable to settle in any minimum at all. This fine-tuning of values known as “learning criteria” is a factor we focused on optimising for neural networks.
To tackle the problem of overfitting as outlined in 2.1.2, different “batch sizes” can be used. The “batch size” is the number of sets of inputs on which the neural network trains at any one time. A larger batch size will mean fewer random patterns are likely to exist amongst the data, helping to prevent overfitting. Larger batch sizes also help provide more accurate and less “noisy” error rates to help update the neural network in the next generation. However, smaller batch sizes allow for easier processing, especially in computers with limited memory/RAM. Smaller batch sizes also have the advantage that the “noisy” feedback helps to allow a descent into a global minimum, as discussed before.
After each batch has been passed through the forward and backward propagation steps, the network updates the weights according to two criteria referred to as “alpha” and “momentum”. Alpha is a factor used to determine how much the weights should change with each update: a higher alpha corresponds to a greater change. If the value of alpha is too high, the neural network will jump too quickly between values, never settling on a minimum, but if the value is too low, the learning process will slow down significantly. Momentum controls how the previous update affects the current one; higher values of momentum can be used to reduce noise, escape local minima, and speed up training, but if momentum is too high it can lead to overfitting. Higher values of momentum are typically used alongside low values of alpha.
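The roles of alpha and momentum can be made concrete with a small sketch. The update below is the classical momentum rule, which is an assumption here: the exact rule used by the library in this study is not spelled out in the text. The function name and the toy error surface are hypothetical; the values of alpha and momentum are the final ones reported later in the paper.

```python
import numpy as np

def momentum_step(weights, velocity, gradient, alpha, momentum):
    """One weight update using classical momentum (assumed update rule)."""
    # The new step blends the previous step with the current gradient.
    velocity = momentum * velocity - alpha * gradient
    return weights + velocity, velocity

# Toy error surface E(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, v, gradient=w, alpha=0.012, momentum=0.6)
print(np.linalg.norm(w))
```

Setting `momentum=0` recovers plain gradient descent with step size `alpha`, which makes the effect of the momentum term easy to compare on the same toy surface.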
The network will run each batch (the number of which is (total number of training cases)/(batch size)) in the training cases, updating the neural network each time. The training cases are then randomly shuffled into different batches, and this process is repeated a number of times. This is known as the number of “epochs”. More epochs can often lead to better results, but this is also more computationally heavy, and can lead to overfitting past a certain number.
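The batching and reshuffling described above might look like the following sketch. The generator name is hypothetical, and keeping the final, smaller batch (rather than dropping it) is an assumption; the figures of 6514 training cases, batch size 500, and 10 epochs are taken from elsewhere in this paper.

```python
import math
import numpy as np

def iterate_minibatches(n_cases, batch_size, epochs, seed=0):
    """Yield (epoch, indices) pairs, reshuffling the cases every epoch."""
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        order = rng.permutation(n_cases)
        for start in range(0, n_cases, batch_size):
            # The last batch of an epoch may be smaller than batch_size.
            yield epoch, order[start:start + batch_size]

n_cases, batch_size, epochs = 6514, 500, 10
batches_per_epoch = math.ceil(n_cases / batch_size)
updates = sum(1 for _ in iterate_minibatches(n_cases, batch_size, epochs))
print(batches_per_epoch, updates)
```

The total number of weight updates grows with the number of epochs and shrinks as the batch size increases, which is one reason both values affect training time.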
Electrocardiograms, or ECGs, are a measure of heart activity. More specifically, they measure the electrical activity of the heart as voltage over time, often from a number of locations on the body. While the specifics of how these are taken are not necessary here, the result is a graph of voltage against time like the one shown in Figure 2.2.1-1. This graph has a number of variables, as labelled in the figure; these variables can be used as indicators of various cardiac diseases, and for this reason ECGs are commonly used as diagnostic tools.
While diagnosis of some conditions such as atrial fibrillation and ventricular fibrillation from ECG is trivial and does not require extensive medical training, interpretation of ECGs is often difficult, and can be error prone, even for experienced cardiologists.
The ECG data used as input for the neural net, as well as the other input criteria, are outlined in the table below. Some of the information is taken from “Making Sense of the ECG, Third edition”; such information has been put in italics and given quotation marks, and was used to provide a more accurate overview of pathologies and descriptions than citing individual papers would. The table is meant to provide context to the data fed into the neural net.
|Input Characteristics||Input||Description||Possible pathology||Expected Duration|
|Patient type parameters||Sex||Male/Female||Different sexes are predisposed to different cardiovascular conditions||N/A|
|Patient type parameters||Birth Year Category||0-9, 10-19, etc.||Age has an effect on likely cardiovascular conditions||N/A|
|ECG Parameters||RR Interval||As shown in Figure 2.2.1-1||An RR interval that is too short or too long indicates tachycardia or bradycardia respectively||0.6-1.2 s|
|ECG Parameters||PR Interval||As shown in Figure 2.2.1-1||“A PR interval shorter than 120 ms suggests that the electrical impulse is bypassing the AV node, as in Wolff-Parkinson-White syndrome. A PR interval consistently longer than 200 ms diagnoses first degree atrioventricular block. The PR segment (the portion of the tracing after the P wave and before the QRS complex) is typically completely flat, but may be depressed in pericarditis.”||120 to 200 ms|
|ECG Parameters||QRS Duration||As shown in Figure 2.2.1-1||“If the QRS complex is wide (longer than 120 ms) it suggests disruption of the heart's conduction system, such as in LBBB, RBBB, or ventricular rhythms such as ventricular tachycardia. Metabolic issues such as severe hyperkalemia, or tricyclic antidepressant overdose can also widen the QRS complex. An unusually tall QRS complex may represent left ventricular hypertrophy while a very low-amplitude QRS complex may represent a pericardial effusion or infiltrative myocardial disease.”||80 to 100 ms|
|ECG Parameters||QT Interval||As shown in Figure 2.2.1-1|| ||360-440 ms|
|ECG Parameters||QTc Interval||As shown in Figure 2.2.1-1||“A prolonged QTc interval is a risk factor for ventricular tachyarrhythmias and sudden death. Long QT can arise as a genetic syndrome, or as a side effect of certain medications. An unusually short QTc can be seen in severe hypercalcemia.”||<440 ms|
|ECG Parameters||P Axis||Direction of P wave/angle on graph (relative to lead)||“The P wave is typically upright in most leads except for aVR; an unusual P wave axis (inverted in other leads) can indicate an ectopic atrial pacemaker. If the P wave is of unusually long duration, it may represent atrial enlargement. Typically a large right atrium gives a tall, peaked P wave while a large left atrium gives a two-humped bifid P wave.”||N/A|
|ECG Parameters||QRS Axis||Direction of QRS complex/angle on graph (relative to lead)||“If the QRS complex is wide (longer than 120 ms) it suggests disruption of the heart's conduction system, such as in LBBB, RBBB, or ventricular rhythms such as ventricular tachycardia. Metabolic issues such as severe hyperkalemia, or tricyclic antidepressant overdose can also widen the QRS complex. An unusually tall QRS complex may represent left ventricular hypertrophy while a very low-amplitude QRS complex may represent a pericardial effusion or infiltrative myocardial disease.”||N/A|
|ECG Parameters||T Axis||Direction of T wave/angle on graph (relative to lead)||“Inverted T waves can be a sign of myocardial ischemia, left ventricular hypertrophy, high intracranial pressure, or metabolic abnormalities. Peaked T waves can be a sign of hyperkalemia or very early myocardial infarction.”||N/A|
The ECG-ViEW II Database is an open database of ECGs containing 979,273 ECG samples from 461,178 patients over the years 1994 to 2013. This database contains RR interval, PR interval, QRS duration, QT interval, QTc interval, P axis, QRS axis, and T axis for each ECG, the meaning of which is outlined above. This data is shown alongside other information on the patient, importantly including diagnosis. This is available at http://www.ecgview.org. This database was chosen because it contains a large number of cases, and is freely available without a special license.
We selected two potential diagnoses (Angina Pectoris and non-heart-related diagnosis) from the database; these were chosen to be distinguishable on an ECG, but not trivially so, in order to provide a good sample of possible diagnoses for which our diagnostic tool would be applicable. We then selected all records with our selected diagnosis criteria and where all other relevant data was complete (which was not true in all cases). This left us with 7514 cases with which to train the neural network.
The overall goal of the project is to use an example of differentiating Angina Pectoris from unrelated diseases to help find general optimum learning values for training a neural network for ECG diagnosis. With the context in which we were able to do this now clear, we can provide an overview of the project:
- Create a basic neural network model, and select appropriate ECG data from ECG-ViEW II with which to feed the neural network
- Conduct unrecorded pre-tests on the neural network to find ballpark estimates for learning criteria
- Train the neural network on the data, successively changing each of 3 learning criteria (batch size, momentum and alpha), finding the value of each learning criterion yielding the lowest error rate (i.e. misdiagnosis rate)
- Test the optimised neural network (with each of the optimum learning criteria) on the data again, measuring error rate
- Construct a conservative estimate for the highest error rate
- Conclude on implications & limitations
To maximise the control we had over the learning process, we built a library in Python for training and testing neural networks. The decision was made to use Python as the programming language for this study on account of its readable syntax and fast development speed. Although other programming languages can perform computations faster in many cases, we made use of the NumPy library for linear algebra which provides Python bindings for exceptionally fast C code. This allowed us to work with matrices and tensors in Python to optimise the efficiency of our networks. The library we created is heavily customisable, which enabled us to easily tweak hyperparameters such as the structure and learning criteria of the network, while still keeping the algebra and calculus at the core of the program behind a layer of abstraction. Using our own library also allowed us to debug our neural networks effectively, so we had complete transparency about the state and structure of a neural network at any stage of learning. The library we created is freely available on GitHub (https://github.com/Genora51/neural) under the MIT Open Source License. Inspiration for the structure of this library was taken from a previously cited work by Silipo and Marchesi.
This is a sample of the code we wrote for this study. In this sample, a neural network is created according to a set of hyperparameters. The training data is fetched from a database, and a neural network is instantiated according to the parameters specified in Section 3.2:
# Fetch data
inputs, outputs = fetch_data('ecgview.db', OUTPUT_SIZE)
# Split into training and testing data
train_in, test_in = split_array(inputs, -TEST_SIZE)
train_out, test_out = split_array(outputs, -TEST_SIZE)
# Create a neural network
nn = NeuralNetwork(
    # layer sizes
    sizes=[INPUT_SIZE, HIDDEN_SIZE, HIDDEN_SIZE, OUTPUT_SIZE],
    # activation function for each layer
    # what error function (quadratic)
The network is then trained on the data from the ECG-ViEW II Database, using those parameters and our learning criteria (HP_ALPHA, HP_MOMENT etc.), using the backpropagation algorithm provided by the custom library to compute training:
alpha=HP_ALPHA, momentum=HP_MOMENT, batch_size=HP_BSIZE, epochs=HP_EPOCHS
The results of this training are tested against some test data which was not used in training, and the program calculates and outputs how many of these cases were misdiagnosed:
import numpy as np

# Run neural net with testing data
result = nn(test_in)
# Round results to get 'choice' as one-hot vector
result = np.max(result, axis=-1)[:, None] - result
result = np.floor(1 - result)
# Compare with expected results
diff = np.abs(test_out - result)
# Output number of errors (each wrong choice differs in two positions)
errs = int(np.sum(diff) // 2)
print(errs, 'incorrect out of', len(test_out))
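To show how the rounding step in the listing above turns raw outputs into a count of misdiagnoses, here is a small self-contained sketch. The output and target arrays are made up for illustration, and the `argmax` comparison is an alternative formulation added here rather than part of the study's code. The rounding trick relies on every output lying between 0 and 1, as sigmoid outputs do.

```python
import numpy as np

# Hypothetical network outputs for four test cases and two diagnoses.
raw = np.array([[0.9, 0.2],
                [0.3, 0.6],
                [0.7, 0.4],
                [0.1, 0.8]])
expected = np.array([[1, 0],
                     [0, 1],
                     [0, 1],
                     [0, 1]])

# Same rounding trick as the listing: the row maximum maps to 1, the rest to 0.
gap = np.max(raw, axis=-1)[:, None] - raw
one_hot = np.floor(1 - gap)

# A wrong choice differs from the target in two positions, hence the // 2.
errors = int(np.sum(np.abs(expected - one_hot)) // 2)

# Equivalent check using argmax, which is perhaps easier to read.
errors_argmax = int(np.sum(raw.argmax(axis=-1) != expected.argmax(axis=-1)))
print(errors, errors_argmax)
```

Both counts agree on this example; the third case is the only one where the larger output does not match the expected diagnosis.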
We had to choose some aspects of the neural network to keep constant in order to provide a valid experiment. Simultaneous variation of all variables with a large enough sample size would have taken an unreasonable amount of time and computational power. For the constant aspects of the neural network, we therefore chose values which were likely to be well-optimised (through various unrecorded pre-tests and based on previous experience and research). The aspects of the neural network which were kept constant are outlined below:
- Structure of the neural network: The structure of the neural network is multifaceted, as the number of neurons in each layer can be varied, along with the number of layers. For this reason, we did a number of unrecorded tests to find a suitable structure to the neural network, and used this as a constant throughout testing. This was a “fully connected feed-forward network”, with 2 hidden layers, of which the first had 8 neurons and the second also had 8 neurons.
- Number of Epochs: This was chosen to be 10, a good balance between thorough training and not overfitting/using too much computational power.
- Activation Function of the neurons: We chose a sigmoid function as the activation function; this has been shown in other papers to be most suitable for pattern-recognition based networks.
For our independent variables, we chose the learning criteria (alpha and momentum, outlined in 2.1.3) and the batch size. We optimised the learning criteria (momentum then alpha) before the batch size, as we thought that the learning criteria may influence the optimum batch size, but that the opposite was less likely to be true.
The independent variables were each measured 3 times at 10 appropriate different values, and the value giving the lowest average error rate (i.e. lowest proportion of incorrect diagnoses between Angina Pectoris vs unrelated conditions to heart) for each variable was chosen as optimum. Once all the optimum criteria had been found in the order above, a final neural network was set up in the configuration found to be best. This was tested 10 times, and an average was taken on the error rate. Statistical analysis was then conducted (for p < 0.05) on the 10 final results to yield a conservative estimate for error rate.
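The selection procedure described above (three trials at each of ten values, keeping the value with the lowest average error) can be sketched as follows. The helper `best_value` and the stand-in `fake_trial` are hypothetical; in the real experiment each trial would train and test a freshly reset network, and the error curve below is invented purely so that the example runs.

```python
from statistics import mean

def best_value(candidates, run_trial, repeats=3):
    """Return the candidate with the lowest mean error over repeated trials."""
    averages = {
        value: mean(run_trial(value, trial) for trial in range(repeats))
        for value in candidates
    }
    return min(averages, key=averages.get), averages

# Stand-in for a real training run: a made-up error curve with a minimum
# near 0.6, plus a small trial-dependent offset. Not data from the study.
def fake_trial(momentum, trial):
    return 8.0 + 20.0 * (momentum - 0.6) ** 2 + 0.1 * trial

candidates = [round(0.1 * k, 1) for k in range(10)]  # 0.0, 0.1, ..., 0.9
best, averages = best_value(candidates, fake_trial)
print(best)
```

Optimising one criterion at a time in this way keeps the number of training runs manageable, at the cost of not exploring interactions between criteria.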
All but 1000 of the data were used as training cases. The final 1000 in each experiment were used as test cases, i.e. kept unseen from the network until the training was complete, after which the trained network was tested on them. This means each trial yielded an error rate with a resolution of 0.1% (one case in 1000). This was repeated 3 times for each trial, and an average taken, as above. Between each trial, the neural network was completely reset to prevent trials interfering with one another.
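A hold-out split of this kind can be sketched as below. The function `holdout_split` is a hypothetical stand-in written for this example (it is not the `split_array` helper from the study's library), and the array of integers simply stands in for the 7514 selected records.

```python
import numpy as np

def holdout_split(array, test_size):
    """Keep the final `test_size` rows unseen; train on everything before."""
    return array[:-test_size], array[-test_size:]

cases = np.arange(7514)  # stand-in for the 7514 selected records
train, test = holdout_split(cases, 1000)

# With 1000 test cases, one misdiagnosis changes the error rate by 0.1%.
resolution_percent = 100 / len(test)
print(len(train), len(test), resolution_percent)
```

This leaves 6514 cases for training, consistent with the figure of approximately 6500 training cases quoted later in the paper.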
The experiments were conducted on a MacBook Air (1.4 GHz Intel Core i5, 4GB RAM), roughly representative of the hardware capabilities available in a common hospital.
Note on significant figures throughout results
All error rates are given to 4 significant figures; this precision is justified because the exact number of errors is counted by the computer, so 4 significant figures accurately represent the data and allow us to draw the most accurate conclusions possible, backed up with appropriate data. The numbers of errors themselves are counted integers, and so are given as exact values rather than to a number of significant figures.
The momentum value in the neural network’s training procedure specifies how much the previous generation affects the current one. Lower values can cause slow and local training, while values that are too high can lead to overfitting. It is thus important to find a momentum value which balances fast and global learning.
Alpha controls the rate of learning for each generation. Lower values of alpha can lead to slower and less successful learning, while higher values can lead to non-convergent training. Like momentum, it is important to find a value which balances speed with effectiveness.
Batch size, much like alpha and momentum, must be balanced to produce successful learning. This hyperparameter controls how many training cases are analysed in each iteration, and can be used to increase speed and improve convergence. However, high batch sizes can be computationally intensive and can lead to local minima, so once again the correct balance is required to optimise learning.
Lowest error rate is highlighted black on graph and bold on table.
|Momentum Value||Trial 1 Error rate||Trial 2 Error rate||Trial 3 Error rate||Average error rate (%)||Maximum (Y/N)|
|Alpha Value||Trial 1 Error rate||Trial 2 Error rate||Trial 3 Error rate||Average error rate||Maximum (Y/N)|
For batch size, instead of using evenly spaced batch sizes, we decided to use batch sizes that were roughly similarly spaced on a log scale. This is more appropriate as variations in batch size such as 20-50 have a larger effect than variations in batch size such as 1000-1030; order of magnitude differences are more important.
|Batch size||Trial 1 Error rate||Trial 2 Error rate||Trial 3 Error rate||Average error rate||Maximum (Y/N)|
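A set of batch sizes that is roughly evenly spaced on a log scale, as described above, could be generated as in the sketch below. The range of 10 to 5000 and the count of 10 are hypothetical values chosen for the example; the batch sizes actually tested in the study are not reproduced here.

```python
import numpy as np

# Hypothetical range; the study's actual batch sizes are not reproduced here.
smallest, largest, count = 10, 5000, 10

# geomspace gives values with a constant ratio between neighbours.
spaced = np.geomspace(smallest, largest, num=count)
batch_sizes = np.unique(spaced.round().astype(int))
print(batch_sizes)
```

Each value is roughly double the previous one, so small batch sizes are sampled as densely (in relative terms) as large ones.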
For the final network, the following criteria were used, as established in 4.1.1-4.2:
- Batch size: 500
- Alpha: 0.012
- Momentum: 0.6
- Structure: 2 hidden layers, each of 8 neurons
- Activation function for hidden layers: Sigmoid
- Epochs: 10
With these final criteria, the performance of the neural network is shown below:
|Trial No||Accuracy (%)||% Majority Diagnosis|
Here, the % Majority diagnosis is the percentage of test data in the given trial number which fall under the same (majority) diagnosis, i.e. the largest proportion of the results which have the same diagnosis.
The accuracy value here denotes the percentage of correct responses from the neural network for each trial.
The fact that the % Accuracy is significantly higher than the % Majority diagnosis in each trial indicates that the neural network has learned a reliable strategy for diagnosis rather than selecting the same diagnosis in all cases. This is encouraging, as it suggests that even the limited ECG data available for this study was sufficient for accurate diagnosis with the learning methods used. This is further discussed in 5.2 and 5.3.
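The comparison between accuracy and the majority-diagnosis baseline can be sketched as follows. The function name and the ten labels are invented for illustration and are not results from the study; the point is only that a network which always chose the most common diagnosis would score exactly the majority percentage.

```python
import numpy as np

def accuracy_and_majority(predicted, actual):
    """Return (accuracy %, majority-class %) for integer class labels."""
    accuracy = 100.0 * np.mean(predicted == actual)
    majority = 100.0 * np.bincount(actual).max() / len(actual)
    return accuracy, majority

# Made-up labels for illustration: 0 = non-heart-related, 1 = Angina Pectoris.
actual = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
predicted = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1])
acc, maj = accuracy_and_majority(predicted, actual)
print(acc, maj)
```

An accuracy well above the majority percentage, as in this toy case, is what indicates that the predictions depend on the inputs rather than on class frequency alone.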
It is worth noting that the results do not show a clear downward or upward trend for any of the learning criteria, instead showing minima. This is not anomalous; rather, it implies that too much or too little of any learning criterion can hinder network performance, as expected and explained in 2.1.3.
Performing prior optimisation of learning criteria such as alpha and momentum proved particularly useful to the neural network’s performance. The curves produced by the initial hyperparameter observations (Section 4.2) each suggest that the fine-tuning of these parameters greatly improved the effectiveness of the learning procedure. This supports the existing literature (see Section 2.1.3) relating to the effects of each hyperparameter on neural network performance, and provides experimental evidence for the trends predicted in that section.
There were a number of innate limits/problems with the method that could be improved upon in future such research:
- Only a small proportion of possible optimisation criteria were considered. A future study might automate the process of finding optima in each individual criterion, and might therefore be able to consider more variables.
- The computational power of the computer was limited. This machine learning would not actually need to be conducted with each diagnosis, so a more powerful computer doing the deep learning would not mean use of such a diagnostic tool is any less clinically feasible. Research at a university might be able to use a more powerful computer, allowing the use of more complicated network structures and larger epoch values.
- The number of repeats for each trial criterion was only 3. With more repeats, a more accurate average could be calculated, yielding more precise results. This might in turn allow better optimisation.
- Only a small proportion of possible diagnoses were considered. Due to computational limits, only one possible ECG diagnosis was considered. This in turn limited both the size of the test cases, and also the scope of application of this possible network. In the future, a more powerful computer might be able to consider more diagnoses.
Despite the methodological limits described above, the method was still fairly successful at identifying the optimum learning criteria; from the rough estimate learning criteria used at the start, the error rate was reduced from its highest of 19.633% (for α=0.002) to 7.50% (final average). This is a reduction in error rate of 61.8% from the original network, by changing only the 3 variables described. Overall, this would indicate that the method used for optimisation of learning criteria is effective, and could be applicable in future studies optimising a greater number of learning criteria, without some of the limits above.
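The reduction quoted above is a relative change, which can be checked directly from the two error rates given in the text:

```python
initial_error = 19.633  # highest error rate reported (alpha = 0.002), in %
final_error = 7.50      # final average error rate, in %

# Relative reduction: the drop in error as a fraction of the starting error.
relative_reduction = (initial_error - final_error) / initial_error
print(f"{100 * relative_reduction:.1f}%")
```

This is distinct from the absolute reduction of roughly 12 percentage points between the same two figures.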
Our final trials produced an average rate of 92.5% successful diagnosis. Assuming that the diagnosis data follow a binomial distribution with n = 1000, we can calculate a conservative estimate with p < 0.05 for the minimum diagnosis success rate. This calculation yields a result of 0.909; in other words, we may take the diagnosis success rate to be ≥ 90.9% (approximately 91%) with 95% confidence. This suggests that the neural network trained in this study, with a limited dataset, training time, and available computational power, would still be able to successfully diagnose approximately 91% of new cases given only the ECG parameters used in this study.
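One way to arrive at a bound of this kind is the normal approximation to the binomial distribution. The sketch below is an assumption about the method, since the exact calculation used is not described in the text, and the function name is hypothetical. With z = 1.96 it reproduces the 0.909 figure quoted above.

```python
import math

def lower_bound(p_hat, n, z=1.96):
    """Normal-approximation lower confidence bound for a binomial proportion."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error

# 92.5% observed success over n = 1000 test cases, as reported above.
print(round(lower_bound(0.925, 1000), 3))           # two-sided 95% (z = 1.96)
print(round(lower_bound(0.925, 1000, z=1.645), 3))  # one-sided 95% (z = 1.645)
```

The two-sided bound is the more conservative of the two, which fits the stated aim of a conservative estimate.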
The hypothesis for this paper was that significant improvements can be made to neural network performance in ECG analysis through optimisation of network structure. The 95% confidence that a diagnostic success rate of approximately 91% or higher has been achieved, as outlined in 5.1.3, is sufficient to reject the null hypothesis (that structure optimisation has no effect on success rate) and accept the hypothesis for the experiment above, using a p value of 0.05.
The research has a number of implications for future application of neural networks to ECG diagnosis. The ones we thought were most pertinent are outlined below:
- The learning criteria found are applicable to other similar neural networks. While not all learning criteria were optimised, those that were might now serve as a basis for future research that requires learning criteria for similarly structured neural networks. This might be helpful in the creation of a similar, but more powerful, neural network capable of differentiating between many ECG diagnoses.
- The methodology used is effective in finding optimum learning criteria for neural networks more generally. The reduction in error rate through the optimisation techniques used was substantial, and so the methodology in this paper may serve as a basis for future, larger scale, optimisation of neural networks for similar tasks. This future optimisation may, however, be more automated as discussed in 5.1.1.
- The promising results of this approach at a small scale suggest that more advanced diagnoses from limited ECG data would be possible with greater amounts of time and computing power. With only approximately 6500 ECG cases used in the short training process, achieving > 90% successful diagnosis is not only significant but also encouraging: given more time and data, a similar approach could provide a highly accurate diagnosis of multiple discernible conditions. These results might therefore be used to advise medical professionals in future applications.
- The learning criteria optimised were important in performance of the neural network. The large reduction in error rate through the optimisation would suggest that the learning criteria optimised were important in the overall learning performance; this is important in future research as it suggests a degree of priority should be placed on these criteria.
As mentioned above, future research could:
- Follow a similar methodology to this paper, in optimising other network criteria such as network structure. The more labour-intensive nature of this might make an automated optimisation program suitable. A hypothesis might be that the same method used in the paper will also be effective in optimising other criteria.
- Using the learning criteria in this paper, train a network to either have more input variables, or be able to distinguish between more possible diagnoses. A hypothesis might be that the optimised learning criteria found in this paper will also improve network performance over pseudorandom criteria, even when distinguishing between many diagnoses, or using more input variables.
We would like to thank Stuart Cork, Micky Bullock, and Chris Reynolds for providing both of us with continual support throughout our progression in computer science and mathematics.
We would also like to thank the creators of the ECG-ViEW II database for allowing open access to their database, facilitating research such as this.
 Sekhon, Abhjeet, and Pankaj Agarwal. 2016. “Face Recognition Using Artificial Neural Networks.” International Journal of Computer Science and Information Technologies 7 (2): 896–99. https://pdfs.semanticscholar.org/4655/8789edd8359b5addf89a96727e381c85f7bf.pdf.
 Silipo, R., and C. Marchesi. 1998. “Artificial Neural Networks for Automatic ECG Analysis.” IEEE Transactions on Signal Processing 46 (5): 1417–25. https://doi.org/10.1109/78.668803.
 BBC News. 2019. “NHS Staff Shortage: How Many Doctors and Nurses Come from Abroad?” Reality Check, May 12, 2019. https://www.bbc.co.uk/news/world-48205445.
 Kim, Young-Gun, Dahye Shin, Man Young Park, Sukhoon Lee, Min Seok Jeon, Dukyong Yoon, and Rae Woong Park. 2017. “ECG-ViEW II, a Freely Accessible Electrocardiogram Database.” Edited by Christian Schultz. PLOS ONE 12 (4): e0176222. https://doi.org/10.1371/journal.pone.0176222.
 Custódio, Caio. 2017. “How to Add Bias and Weight to Neural Network Diagram?” TeX – LaTeX Stack Exchange. November 16, 2017. https://tex.stackexchange.com/questions/401681/how-to-add-bias-and-weight-to-neural-network-diagram.
 Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.
 Laradji, Issam. 2015. “Non-Convex Optimization.” University of British Columbia. https://www.cs.ubc.ca/labs/lci/mlrg/slides/non_convex_optimization.pdf.
 Kawaguchi, Kiyoshi. 2000. “2.4.5 Local Minimum Problem.” University of Texas at El Paso. June 17, 2000. http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node23.html.
 Caruana, Rich, Steve Lawrence, and Lee Giles. 2000. “Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping.” MIT Press Cambridge. https://papers.nips.cc/paper/1895-overfitting-in-neural-nets-backpropagation-conjugate-gradient-and-early-stopping.pdf.
 Saad, David. 1998. Online Learning in Neural Networks. S.L.: Cambridge University Press. https://archive.org/details/onlinelearningin0000unse.
 Hinton, Geoffrey, Nish Srivastava, and Kevin Swersky. 2016. “Overview of Mini-Batch Gradient Descent.” http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
 Zhang, Aston, Zack Lipton, Mu Li, and Alex Smola. 2019. “Dive into Deep Learning Release 0.7.” https://en.d2l.ai/d2l-en.pdf.
 Helvete, Hank van. 7AD. Das Normale EKG Und Seine Anteile. Wikipedia. https://commons.wikimedia.org/wiki/File:EKG_Komplex.svg.
 Guglin, Maya E., and Deepak Thatai. 2006. “Common Errors in Computer Electrocardiogram Interpretation.” International Journal of Cardiology 106 (2): 232–37. https://doi.org/10.1016/j.ijcard.2005.02.007.
 Houghton, Andrew R, and David Gray. 2008. Making Sense of the ECG. 3rd ed. London: Hodder Arnold. http://med-mu.com/wp-content/uploads/2018/06/Making-Sense-of-the-ECG-A-Hands-On-Guide-4EChy-Yong.pdf.
 Maas, A H E M, and Y E A Appelman. 2010. “Gender Differences in Coronary Heart Disease.” Netherlands Heart Journal : Monthly Journal of the Netherlands Society of Cardiology and the Netherlands Heart Foundation 18 (12): 598–602. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3018605/.
 Dhingra, Ravi, and Ramachandran S. Vasan. 2012. “Age As a Risk Factor.” Medical Clinics of North America 96 (1): 87–91. https://doi.org/10.1016/j.mcna.2011.11.003.
 NHS. 2019. “Supraventricular Tachycardia (SVT).” NHS.Uk. 2019. https://www.nhs.uk/conditions/supraventricular-tachycardia-svt/.
 Christensen, Buck. 2019. “Normal Electrocardiography (ECG) Intervals: Normal Electrocardiography Intervals.” Medscape.Com. MedScape. April 18, 2019. https://emedicine.medscape.com/article/2172196-overview.
 Enyinna Nwankpa, Chigozie, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. 2018. “Activation Functions: Comparison of Trends in Practice and Research for Deep Learning.” https://arxiv.org/pdf/1811.03378.pdf.
- All error rates are given as percentages (over 1000 test cases).
- It is also worth noting that lower batch sizes significantly increase training time, so a higher batch size is preferable for efficient learning.
- Future research may look at which individual criteria from a larger range play the greatest role in network performance.
About the author
Noah Grodzinski and Geno Racklin Asher (coauthored paper) are both A-Level students at University College School, London. Noah is applying to study natural sciences, and Geno is applying to study mathematics.