## Parallel Neural Networks in Golang
Daniela Kalwarowskyj and Erich Schikuta
University of Vienna
Faculty of Computer Science, RG WST A-1090 Vienna, Währingerstr. 29, Austria dkalwarowskyj@yahoo.com erich.schikuta@univie.ac.at
Abstract. This paper describes the design and implementation of parallel neural networks (PNNs) with the novel programming language Golang. Our approach follows the classical Single-Program Multiple-Data (SPMD) model, where a PNN is composed of several sequential neural networks that are trained with a proportional share of the training dataset. For this purpose we use the MNIST dataset, which contains binary images of handwritten digits. Our analysis focuses on different activation functions and on optimizations in the form of stochastic gradients and the initialization of weights and biases. We conduct a thorough performance analysis, in which network configurations and different performance factors are analyzed and interpreted. Golang and its inherent parallelization support proved well suited for parallel neural network simulation, yielding considerably decreased processing times compared to sequential variants.
Keywords: Backpropagation Neural Network Simulation · Parallel and Sequential Implementation · MNIST · Golang Programming Language
## 1 Introduction
When reading a letter, our trained brain rarely has a problem understanding its meaning. Inspired by the way our nervous system perceives visual input, the idea emerged to write a mechanism that could 'learn' and furthermore use this 'knowledge' on unknown data. Learning is accomplished by repeating exercises and comparing results with given solutions. The neural network studied in this paper uses the MNIST dataset to train and test its capabilities. The actual learning is achieved using backpropagation. In the course of our research, we concentrate on a single sequential feed forward neural network (SNN) and extend it into multiple SNNs learning in parallel. These parallel networks are then fused into one parallel neural network (PNN). The two types of networks are compared with respect to accuracy, confidence, computational performance and learning speed, i.e., the time it takes the networks to learn the given task.
The specific contribution of the paper is twofold: on the one hand, a thorough analysis of sequential and parallel implementations of feed forward neural networks with respect to time, accuracy and confidence, and on the other hand, a feasibility study of Golang [9] and its tools for parallel simulation.
The structure of the paper is as follows: In the next section, we give a short overview of related work. Section 3 presents the mathematical fundamentals of neural networks. The parallelization approach is laid out in section 4, followed by a description of the Golang implementation. A comprehensive analysis of the sequential and parallel neural networks with respect to accuracy, confidence, computational performance and learning speed is presented in section 5. Finally, the paper closes with a summary of the findings.
## 2 Related Work and Baseline Research
Artificial neural networks and their parallel simulation have gained considerable attention in the scientific community. Parallelization is a classic approach for speeding up execution times and exploiting the full potential of modern processors. Still, not every algorithm can profit from parallelization, as concurrent execution might add non-negligible overhead. This can also be the case for data-parallel neural networks, where accuracy problems usually occur, as the results have to be merged.
In the literature a huge number of papers on parallelizing neural networks can be found. An excellent source of references is the survey by Tal Ben-Nun and Torsten Hoefler [1]. However, little research has been done on using Golang for this endeavour.
In the following, only specific references are listed that directly influenced the presented approach. The authors of [8] presented a parallel backpropagation algorithm dealing with the accuracy problem only by using a MapReduce and Cascading model. In the course of our work on parallel and distributed systems [16,2,14] we developed several approaches for the parallelization of neural networks. In [6], two novel parallel training approaches were presented for face-recognizing backpropagation neural networks. The authors use the OpenMP environment for classic CPU multithreading and CUDA for parallelization on GPU architectures. Aside from that, they differentiated between topological data parallelism and structural data parallelism [15], where the latter is the focus of the approach presented here. In [10] a comparison of different parallelization approaches on a cluster computer is given; the results differed depending on network size, dataset size and number of processors. Besides parallelizing the backpropagation algorithm for training speed-up, alternative training algorithms like the Resilient Backpropagation described in [13] might lead to faster convergence. One major difference to standard backpropagation is that every weight and bias has a different and variable learning rate. A detailed comparison of both training algorithms in the case of spam classification was given in [12].
## 3 Fundamentals
In the following we present the mathematical fundamentals of neural networks to allow for easier understanding and better applicability of our implementation approach described afterwards.
Forward propagation To calculate an output in the last layer, the input values need to be propagated through each layer. This process is called forward propagation and is done by applying an activation function to each neuron's input sum. The input sum $z_k^l$ of a neuron $k$ in layer $l$ is the sum of each neuron $j$'s activation $a_j^{l-1}$ from the previous layer multiplied with the weight $w_{kj}^l$:
$$z_k^l = \sum_j w_{kj}^l a_j^{l-1} + b_k^l \quad (1)$$
The additional term $b_k^l$ stands for the bias value, which allows the activation function to be shifted to the left or to the right. For better readability, the input sums for a whole layer can be stored in a vector $z^l$ and defined by:
$$z^l = W^l x^{l-1} + b^l \quad (2)$$
Here, $W^l$ is the weight matrix storing all weights leading into layer $l$. To obtain the output of a layer, or, in the case of the last layer $x^L$, the output of the neural network, an activation function $\varphi$ needs to be applied:
$$x^l = \varphi(z^l) = \varphi(W^l x^{l-1} + b^l) \quad (3)$$
Activation functions do not have to be unique in a network and can be combined. The implementation presented in this paper uses the rectifier activation function
$$\varphi_{\mathrm{rectifier}}(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } z \geq 0 \end{cases} \quad (4)$$
for hidden neurons and the softmax activation function
$$\varphi_{\mathrm{softmax}}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \quad (5)$$
for output neurons. For classification, each class is represented by one neuron in the last layer. Due to the softmax function, the output values of those neurons sum up to 1 and can therefore be seen as the probabilities of being that class.
Backpropagation For proper classification the network has to be trained beforehand. To this end, a cost function tells us how well the network performs, for example the cross entropy error with expected outputs $e$ and actual outputs $x$:
$$C = -\sum_i e_i \log(x_i) \quad (6)$$
The aim is to minimize the cost function by finding the optimal weights and biases with the gradient descent optimization algorithm. To do so, a training instance is forward propagated through the network to obtain an output. Subsequently, the partial derivatives of the cost function with respect to each weight and bias in the network are computed:
$$\frac{\partial C}{\partial w_{kj}} = \frac{\partial C}{\partial z_k} \frac{\partial z_k}{\partial w_{kj}} \quad (7)$$
$$\frac{\partial C}{\partial b_k} = \frac{\partial C}{\partial z_k} \frac{\partial z_k}{\partial b_k} \quad (8)$$
As a first step, $\frac{\partial C}{\partial z_k}$ needs to be calculated for every neuron $k$ in the last layer $L$:
$$\delta_k^L = \frac{\partial C}{\partial z_k^L} = \frac{\partial C}{\partial x_k^L} \varphi'(z_k^L) \quad (9)$$
In case of the cross entropy error function, the error signal vector δ of the softmax output layer is simply the actual output vector minus the expected output vector:
$$\delta^L = \frac{\partial C}{\partial z^L} = x^L - e^L \quad (10)$$
To obtain the errors for the remaining layers of the network, the output layer's error signal vector δ L has to be propagated back through the network, hence the name of the algorithm:
$$\delta^l = (W^{l+1})^T \delta^{l+1} \odot \varphi'(z^l) \quad (11)$$
$(W^{l+1})^T$ is the transposed weight matrix, $\odot$ denotes the Hadamard or entry-wise product, and $\varphi'$ is the first derivative of the activation function.
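Equation (11) can be sketched in Go for a ReLU layer, where $\varphi'(z)$ is 1 for $z \geq 0$ and 0 otherwise; the function name `backpropDelta` is illustrative, not from the paper's implementation:

```go
package main

import "fmt"

// backpropDelta implements equation (11): the next layer's error signal is
// pulled back through the transposed weight matrix and gated entry-wise by
// the derivative of the activation (here ReLU': 1 for z >= 0, else 0).
func backpropDelta(Wnext [][]float64, deltaNext, z []float64) []float64 {
	delta := make([]float64, len(z))
	for j := range delta {
		var s float64
		for k := range Wnext { // (W^{l+1})^T delta^{l+1}
			s += Wnext[k][j] * deltaNext[k]
		}
		if z[j] >= 0 { // Hadamard product with ReLU'(z^l)
			delta[j] = s
		}
	}
	return delta
}

func main() {
	Wnext := [][]float64{{1, 2}, {3, 4}}
	deltaNext := []float64{0.5, -0.5}
	z := []float64{1, -1}
	fmt.Println(backpropDelta(Wnext, deltaNext, z)) // [-1 0]
}
```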
Gradient Descent Knowing the error of each neuron, the changes to the weights and biases can be determined by
$$\Delta w_{kj}^l = -\eta \frac{\partial C}{\partial w_{kj}^l} = -\eta\, \delta_k^l\, x_j^{l-1} \quad (12)$$
$$\Delta b_k^l = -\eta \frac{\partial C}{\partial b_k^l} = -\eta\, \delta_k^l \quad (13)$$
The constant $\eta$ regulates the strength of the changes applied to the weights and biases and is also referred to as the learning rate; $x_j^{l-1}$ stands for the output of the $j$-th neuron of layer $l-1$. The changes are applied by adding them to the old weights and biases. Depending on the update frequency, a distinction is made between stochastic gradient descent, batch gradient descent and mini-batch gradient descent. In the first case, the weights and biases are updated after every training instance (by repeating all of the aforementioned steps instance-wise). In contrast, batch gradient descent updates only once, after accumulating the gradients of all training samples. Mini-batch gradient descent is a combination of both: the weights and biases are updated after a specified number of training instances, the *mini-batch size*. As with batch gradient descent, the gradients of all instances are averaged before the updates.
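For the softmax output layer, equations (10), (12) and (13) combine into a simple per-instance (stochastic) update; the following is a sketch under that assumption, and the name `updateOutputLayer` is ours:

```go
package main

import "fmt"

// updateOutputLayer applies equations (10), (12) and (13) for one training
// instance: delta = x - e, then W[k][j] += -eta*delta*a[j] and b[k] += -eta*delta.
func updateOutputLayer(W [][]float64, b, a, x, e []float64, eta float64) {
	for k := range W {
		delta := x[k] - e[k] // equation (10)
		for j := range W[k] {
			W[k][j] -= eta * delta * a[j] // equation (12)
		}
		b[k] -= eta * delta // equation (13)
	}
}

func main() {
	W := [][]float64{{0.2, 0.4}}
	b := []float64{0.1}
	a := []float64{1, 0.5} // activations from the previous layer
	x := []float64{0.8}    // actual output
	e := []float64{1.0}    // expected output
	updateOutputLayer(W, b, a, x, e, 0.1)
	fmt.Println(W, b) // weights and bias nudged to reduce the error
}
```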
## 4 Parallel Neural Networks
This section describes the technology stack, the parallelization model and implementation details of the provided PNN.
## 4.1 Technology Stack
Go, often referred to as Golang, is a compiled, statically typed, open source programming language developed by a team at Google and released in November 2009. It is distributed under a BSD-style license, meaning that copying, modifying and redistributing are allowed under a few conditions.
As Andrew Gerrand, who works on the project, states in [9], Go grew from a dissatisfaction with the development environments and languages in use at Google. It is designed to be expressive, concise, clean and efficient. Hence, Go compiles quickly and is as easy to read as it is to write. This is partly due to gofmt, the Go source code formatter, which gives Go programs a single style and relieves programmers of discussions such as where to set the braces. As uniform presentation makes code easier to read and therefore to work on, gofmt also saves time and benefits the scalability of programming teams [11].

The integrated garbage collector offers another great convenience and takes away the time-consuming effort of memory allocation and freeing known from C/C++. Despite the known overhead of and criticism about Java's garbage collector, the author of [11] claims that Go's is different and more efficient, and that garbage collection is almost essential for a concurrent language like Go because of the trickiness that can result from managing ownership of a piece of memory as it is passed around among concurrent executions.

That being said, built-in support for concurrency is one of the most interesting aspects of Go, offering a great advantage over older languages like C++ or Java. One major component of Go's concurrency model are goroutines, which can be thought of as lightweight threads with negligible overhead, as the cost of managing them is cheap compared to threads. If a goroutine blocks, the runtime automatically moves the blocked code off the executing thread and schedules code that can run, leading to high-performance concurrency [9]. Communication between goroutines takes place over channels, a concept derived from "Communicating Sequential Processes" [5]. A channel can be used to send and receive messages of the type associated with it. Since receiving blocks until something is sent, channels can be used for synchronization, preventing race conditions by design.
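A minimal illustration of this synchronization property: the receive below blocks until the goroutine sends, so no explicit lock is needed.

```go
package main

import "fmt"

func main() {
	done := make(chan string)
	// The goroutine sends its result over the channel; the receiving side
	// blocks until the send happens, synchronizing the two by design.
	go func() {
		done <- "training finished"
	}()
	fmt.Println(<-done) // prints "training finished"
}
```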
Another difference to common object oriented programming languages can be found in Go's object oriented design. Its approach omits classes and type-based inheritance such as subclassing, meaning that there is no type hierarchy. Instead, Go features polymorphism through interfaces and struct embedding, and therefore encourages the composition over inheritance principle. An interface is a set of methods, which is implemented implicitly by every data type that satisfies the interface [11].
Finally, files are organized in packages, with every source file starting with a package statement. Packages can be used by importing them via their unique path. If a package path in the form of a URL refers to a remote repository, the remote package can be fetched with the go get command and subsequently imported like a local package. Additionally, Go will not compile a program that imports unused packages.
## 4.2 Parallelization Model
For the parallelization of neural network operations we apply the classical Single-Program Multiple-Data (SPMD) approach well known from high-performance computing [3]. It is a programming technique where several tasks execute the same program but with different input data, and the calculated output data is merged into a common result. Thus, based on the fundamentals of a single feed forward neural network, we generate several such networks and set them up to work together in a parallel manner.
Fig. 1. Design of a Parallel Neural Network
<details>
<summary>Image 1 Details</summary>

The diagram shows the MNIST dataset at the bottom, distributed via dashed lines to three identical feed forward neural networks (FNNs) that process their share of the data independently. Dashed lines from these three networks converge into a single FNN at the top, indicating the aggregation of their results into the combined parallel network.

</details>
The parallel design is visualized in figure 1. At the bottom it shows the dataset, which is divided into as many slices as there are networks, referred to as child networks (CNs). Each child network learns only a slice of the dataset. Ultimately, the results of all parallel child networks are merged into one final parallel neural network (PNN). The combination of these CNs can be done in various ways. In the presented network, the PNN's weights are the average of the weights calculated by all parallel CNs after a set number of epochs. The same procedure is used for the biases, i.e., all biases are averaged to obtain the combined bias values.
In Golang it is important to take into consideration that a program designed in a parallel fashion does not necessarily execute in parallel, as a concurrent program can be parallel but does not have to be. The language offers goroutines; a goroutine 'is a function executing concurrently with other goroutines in the same address space' and is scheduled by the Go runtime. A goroutine is started by calling a function with the go keyword. It can be equipped with a WaitGroup, which ensures that the program does not finish until all running goroutines are done. More about the implementation is explained in the next section.
## 4.3 Implementation Details
The main interface to which any trainable network binds is the TrainableNetwork interface. This interface is used throughout the whole learning and testing process. Parallel as well as simple neural networks implement this interface, which allows for easy and interchangeable usage of both network types throughout the code. Since a parallel neural network is built from multiple sequential neural networks (SNNs), we start with the implementation of an SNN. The provided implementation of an SNN allows for a flexible network structure: for example, the number of layers and neurons, as well as the activation functions used on a layer, can be chosen freely. All information required for creating a network is stored within a NeuroConfig struct on a network instance. These settings can easily be adjusted in a configuration file (default name config.yaml ) located in the same directory as the executable.
A network is built out of layers. A minimal network is composed of at least an input layer and an output layer. Beyond this minimum, the depth of a network can be freely adjusted by providing a desired number of hidden layers. Internally, layers are represented by the NeuroLayer struct. A layer holds weights and biases, which are represented by matrices. The Gonum package is used to simplify the implementation; it provides a matrix implementation as well as most necessary linear algebraic operations.
In the implementation, utility functions are provided for the convenient creation of new layers with initialized weights and biases. The library rand offers a function NormFloat64 , which by default draws from a normal distribution with mean 0 and standard deviation 1. Weights are randomly generated using that normal distribution, seeded by the current time in nanoseconds.
The provided network supports several activation functions. The activation function is defined on a per layer basis which enables the use of several activations within one network.
A PNN is a combination of at least two SNNs. The ParallelNetwork struct represents the PNN in the implementation. As the SNNs are trained individually before being combined into the output network of a PNN, it is necessary to keep references to the networks, managed in a slice. In the context of a PNN, the SNNs are referred to as child networks (CNs).
In a PNN the training process is executed on all CNs in parallel using goroutines. First, the dataset is split according to the number of CNs. Afterwards, each CN is started in a goroutine with its slice of the training dataset, so the mini-batches of all CNs execute in parallel. Within those mini-batches, another mutex-protected concurrent goroutine is started for forward and backpropagation. The mutex ensures safe access to the data across multiple goroutines.
The last step of training is to combine the CNs into one PNN. The provided network uses the "average" approach as combination function: after training the CNs for a set number of epochs, their weights and biases are summed onto the PNN and then divided by the number of CNs. The result is the finished PNN.
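The averaging combination can be sketched as follows; the helper `averageWeights` is illustrative, and in the actual implementation the matrices would be Gonum matrices rather than nested slices:

```go
package main

import "fmt"

// averageWeights combines the child networks' weight matrices by summing
// them element-wise and dividing by the number of CNs.
func averageWeights(cns [][][]float64) [][]float64 {
	n := float64(len(cns))
	rows, cols := len(cns[0]), len(cns[0][0])
	avg := make([][]float64, rows)
	for i := range avg {
		avg[i] = make([]float64, cols)
		for j := range avg[i] {
			for _, w := range cns {
				avg[i][j] += w[i][j] // sum over all child networks
			}
			avg[i][j] /= n // divide by the number of CNs
		}
	}
	return avg
}

func main() {
	cn1 := [][]float64{{1, 2}}
	cn2 := [][]float64{{3, 4}}
	fmt.Println(averageWeights([][][]float64{cn1, cn2})) // [[2 3]]
}
```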
## 5 Performance Evaluation
First, a PNN consisting of 10 CNs and an SNN are tested using different activation functions on the hidden layer, while always using the softmax function on the output layer. After deciding on an activation function, network configurations are tested: the number of neurons is only observed, not thoroughly tested, while the number of networks is evaluated on differently sized PNNs. Finally, the performance of both network types is compared with respect to time, accuracy, confidence and costs.
## 5.1 MNIST Dataset
For our analysis, we use the MNIST dataset, which holds handwritten digits and allows supervised learning. Using this dataset, the network learns to read handwritten digits. Since learning is achieved by repeating a task, the MNIST dataset has a 'training-set of 60,000 examples, and a test-set of 10,000 examples' [7]. Each set is composed of an image-set and a label-set, which holds the information for the desired output and makes it possible to verify the network's output. All pictures are centered and of a uniform size of 28x28 pixels. First, we train with the training-set. When the learning phase is over, the network is supposed to be able to fulfill its task [8]. To evaluate its efficiency, the neural network is run on the test-set, since the samples of this set are still unknown to it. It is important to use unseen data to test a network, since this better shows the generalization of a network and therefore its true efficiency. We are aware that MNIST is a rather small dataset. However, it was chosen on purpose, because it is used in many similar parallelization approaches and therefore allows for relatively easy comparison of results.
## 5.2 Activation Functions in Single and Parallel Neural Networks
To elaborate which function performs best in terms of accuracy for the implemented single and parallel neural networks, a test using the same network design and settings for each network is performed while changing only the function used on the hidden layer. The settings used were one hidden layer of 256 neurons, a batch size of 50, a learning rate η of 0.05, and an output layer calculated with softmax. This setup is used on a single FNN and on a PNN consisting of 10 child networks. Figure 2 presents the performance results of the activation functions. Each network's setup is one hidden layer, to which either the tangent hyperbolic, leaky ReLU, ReLU or sigmoid function was applied.
Fig. 2. Accuracy comparison of a parallel vs. a simple NN with different activation functions and a softmax function for the output layer. The networks have one hidden layer with 256 neurons, and training was performed with a learning rate of 0.05 and a batch size of 50 over 20 epochs.
<details>
<summary>Image 2 Details</summary>

Horizontal bar chart comparing the accuracy (90-100%) of the Tanh, LReLU, ReLU and Sigmoid activation functions for a single network versus a PNN of 10 networks. The single network outperforms the 10-network PNN for every activation function: ReLU reaches the highest single-network accuracy (about 98%), closely followed by Tanh (about 97.5%), while Sigmoid is lowest in both settings (about 95% for one network and about 91.5% for ten).

</details>
In this comparison, the single neural network that learned using the ReLU function, closely followed by the TanH function, reached the best result within 20 epochs. Testing different configurations showed that most activation functions reached higher accuracy with small learning rates. Sigmoid is one function that proved most efficient when the learning rate is not too small: raising the learning rate to 0.6 increases the sigmoid function's merit significantly on both network types. Throughout our testing, ReLU on the hidden layers in combination with softmax on the output layer proved to reliably deliver good results. Therefore, in the following sections, ReLU is applied to all hidden layers and softmax to the output layer.
## 5.3 Network Configurations
Number of Neurons. Choosing an efficient number of neurons is important, but hard to do. There is no calculation that helps to define an effectively working number or range of neurons for a certain configuration of a neural network. Varying the number of neurons between 20 and 600 delivered good accuracy. These are only observations, however, and need to be studied with a more sophisticated approach.
Number of Networks. To evaluate the performance of PNNs in terms of accuracy, PNNs with different numbers of CNs are composed and trained. The training runs over 20 epochs with a learning rate of 0.1 and a batch size of 50. All CNs are built with one hidden layer consisting of 256 neurons. The ReLU function is used on the hidden layer and the softmax function on the output layer. After every epoch, the networks are tested with the test dataset. The results are visualized in figure 3.
Fig. 3. Accuracy of PNNs, built with different numbers of CNs, over 20 epochs
<details>
<summary>Image 3 Details</summary>

Line chart showing test accuracy (93-98%) over 20 epochs for PNNs built with 2, 10, 20 and 30 CNs. The 2-CN PNN rises fastest, reaching about 97.8% after two epochs and plateauing there; the 10-CN PNN climbs steadily to about 97.2%, the 20-CN PNN to about 96.5%, and the 30-CN PNN, consistently the lowest, to about 95.7% after 20 epochs.

</details>
Figure 3 illustrates a clear loss in accuracy of PNNs as the number of CNs grows. An accuracy of 94.5%, for example, is reached by a PNN with 2 CNs after only one epoch, while a PNN with 30 CNs needs 12 epochs for it. With respect to the number of networks, this graph shows that more is not always better. Considering that this test ran over only a small number of epochs, the potential of a PNN with more CNs cannot be read from it. To find out how well a PNN can perform, a test was run with three PNNs over 300 epochs:
Table 1 shows steady growth until 200 epochs. After that, the accuracy only fluctuates slightly, showing that a local minimum has been reached. Over the 300 epochs, the differences in accuracy between the PNNs shrink significantly, yet their ranking does not change. The PNNs built out of a smaller
Table 1. Accuracy behaviour for different epochs
| CNs of PNN | 20 Epochs | 100 Epochs | 150 Epochs | 200 Epochs | 250 Epochs | 300 Epochs |
|------------|-----------|------------|------------|------------|------------|------------|
| 2 | 97.76 | 98.08 | 98.13 | 98.16 | 98.14 | 98.17 |
| 10 | 96.58 | 97.43 | 97.96 | 98.03 | 98.09 | 98.05 |
| 20 | 95.69 | 97.50 | 97.71 | 97.92 | 97.90 | 97.97 |
number of CNs perform slightly better. Since the PNNs are built by averaging weights and biases, it also seemed interesting to compare the average accuracy of the CNs with the accuracy of the resulting PNN, to grade the combination function used. The results are illustrated in figure 4.
Fig. 4. Comparison of the average accuracy of all CNs, out of which the final PNN is formed, with that PNN's accuracy
<details>
<summary>Image 4 Details</summary>

### Visual Description
Combined bar and line chart comparing, at epochs 100, 200 and 300, the accuracy of PNNs built from 2, 10 and 20 CNs (bars) with the average accuracy of their constituent CNs (lines). The y-axis spans roughly 97.5% to 98.5%. For the 2-CN PNN, the bars sit near 98.1% to 98.15%, with the average-CN line slightly above at about 98.2%. For the 10-CN PNN, the bars rise from about 97.7% to 98.0%, the average-CN line from about 97.8% to 98.15%. For the 20-CN PNN, the bars climb from about 97.8% to 97.9%, with the average-CN line starting just below and crossing near epoch 200. The differences between each PNN and its CN average stay within about 0.1 to 0.2 percentage points.
</details>
Figure 4 shows that the efficiency of the averaging function grows with the number of CNs. The first graph, drawn with 2 CNs, shows that the resulting PNN performs worse than the average of the CNs it has been built from. Growing the number of CNs to 10, the average of the CNs approximates the PNN. The last graph shows that a PNN composed of 20 CNs outperforms the average of its CNs after 200 epochs and levels with it after 300 epochs. It has to be noted that the differences in accuracy are very small, within a range of only 0.1 to 0.2 percent. Overall, the combination function works efficiently.
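The averaging combination function discussed above can be sketched in Go. This is a minimal sketch under the assumption that every CN shares the same topology, so corresponding weights (or biases) can be merged element-wise; the names are illustrative, not taken from the authors' code.

```go
package main

import "fmt"

// average merges the weight (or bias) vectors of all CNs into
// the single vector used by the PNN. It assumes all CNs have
// identical topology, so index i refers to the same connection
// in every network.
func average(cnWeights [][]float64) []float64 {
	merged := make([]float64, len(cnWeights[0]))
	for _, w := range cnWeights {
		for i, v := range w {
			merged[i] += v
		}
	}
	for i := range merged {
		merged[i] /= float64(len(cnWeights))
	}
	return merged
}

func main() {
	// Two CNs, two weights each.
	fmt.Println(average([][]float64{{0.25, 0.5}, {0.75, 1.0}})) // [0.5 0.75]
}
```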
## 5.4 Comparing the Performances
Time. Time is the main reason to have a network working in parallel. To test the effect of parallelism on the time required to train a PNN, the provided neural network is tested on three systems. The first system is equipped with an Intel i7-3635QM processor with 4 physical and 4 logical cores at a base clock of 2.4 GHz; the second holds an Intel i9-8950HK processor with 6 physical and 6 logical cores at 2.9 GHz; and the third works with an AMD Ryzen Threadripper 1950X with 16 physical and 16 logical cores at 3.4 GHz. The first, second and third systems are referred to as 4 core, 6 core and 16 core in the following.
Fig. 5. Time in seconds needed to train a PNN, limited to one goroutine per composed CN.
<details>
<summary>Image 5 Details</summary>

### Visual Description
Bar chart of training time in seconds (y-axis, 0 to about 3000) against the number of goroutines (x-axis: 1, 2, 4, 6, 8, 12, 16, 32, 64) for the 4-core, 6-core and 16-core systems. Training time falls sharply as goroutines are added: on the 4-core system from about 2750 s with one goroutine to roughly 200-250 s at 16 and more goroutines; on the 6-core system from about 1700 s down to roughly 150-200 s; on the 16-core system from about 1000 s down to roughly 100 s at 16 goroutines and beyond. Each system's curve flattens once the goroutine count reaches its physical core count, with only marginal gains, and on the 4-core system occasionally slight losses, beyond that point.
</details>
In figure 5 the benefit of parallelism in terms of time is clearly visible. The results show the average time in seconds each system needed for training a PNN consisting of one CN per goroutine. For the bar chart in figure 5, the time requirements relative to those with one goroutine are listed in table 2.
The time in figure 5 starts at a high level and decreases with an increasing number of goroutines on all three systems. Especially in the range of 1 to 4 goroutines, a formidable decrease in training time is visible, which only starts to level out when a system's physical core count is reached. The 4 core thus starts to level out after 4 goroutines, the 6 core after 6 goroutines and the 16 core after 16 goroutines, even though all systems support hyper-threading. Beyond a system's core count, the average training time still decreases slightly with more goroutines. This should be due to the ability to work in parallel and concurrently: as one slot finishes, a
Table 2. Average time required to train a PNN in comparison to one goroutine, which represents 100 percent
| System/Goroutines | 1 | 2 | 4 | 6 | 8 | 12 | 16 | 32 | 64 |
|-------------------|---|---|---|---|---|----|----|----|----|
| 4 | 100% | 58% | 38% | 38% | 37% | 37% | 37% | 36% | 35% |
| 6 | 100% | 61% | 31% | 24% | 24% | 23% | 23% | 23% | 22% |
| 16 | 100% | 51% | 26% | 18% | 14% | 11% | 9% | 10% | 9% |
waiting thread can start running immediately, without waiting for the rest of the running threads to finish. All three systems show high time savings through parallelizing the neural networks. While time requirements decreased on every system, the actual savings differ greatly: the 16 core system needed 91 percent less time on average with 64 goroutines than with 1 goroutine, whereas the 4 core system only took 65 percent less time. As the 16 core system is far more powerful than the 4 core system, it can exploit a greater degree of parallelism and therefore displays an even stronger positive effect of parallelism on time requirements. Based on figure 5 and table 2, parallelism within neural networks can be seen as a useful feature.
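The one-goroutine-per-CN scheme underlying figure 5 can be sketched with Go's `sync.WaitGroup`. Here `trainShard` is a hypothetical stand-in for training one CN on its proportional share of the data; it merely sums its shard so that the sketch stays runnable.

```go
package main

import (
	"fmt"
	"sync"
)

// trainShard stands in for training one CN on its share of the
// training data; here it just sums the shard.
func trainShard(shard []float64) float64 {
	s := 0.0
	for _, v := range shard {
		s += v
	}
	return s
}

func main() {
	data := []float64{1, 2, 3, 4, 5, 6, 7, 8}
	numCNs := 4
	per := len(data) / numCNs

	results := make([]float64, numCNs)
	var wg sync.WaitGroup
	for i := 0; i < numCNs; i++ {
		wg.Add(1)
		go func(i int) { // one goroutine per CN
			defer wg.Done()
			results[i] = trainShard(data[i*per : (i+1)*per])
		}(i)
	}
	wg.Wait() // all CNs finished; combine results afterwards
	fmt.Println(results) // [3 7 11 15]
}
```

Each goroutine writes only its own slot of `results`, so no further synchronization is needed before the combination step.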
Fig. 6. Comparison of accuracy and confidence of a PNN composed of 10 CNs and an SNN with one hidden layer of 256 neurons
<details>
<summary>Image 6 Details</summary>

### Visual Description
Line chart titled "relu-784-256-10(softmax)" plotting accuracy (solid lines) and confidence (dashed lines) as fractions from 0.0 to 1.0 against epochs 0 to 10, for the SNN ("1-simple NN") and the PNN of 10 CNs ("10-parallel NN"). Both networks improve steeply within the first two epochs and then plateau. The SNN's accuracy settles around 0.95 to 0.98, with its confidence curve rising quickly and closing the gap to it. The PNN's accuracy levels off around 0.85 to 0.9, with a persistent gap between its accuracy and confidence curves over all epochs.
</details>
Accuracy and Confidence of Networks. In this section the performance in terms of accuracy and confidence is compared between a PNN and an SNN.
For the test illustrated by figure 6, both types of network started their training from the same random network. They have exactly the same build, except that one is trained as an SNN while the other is cloned 10 times to build a PNN with 10 CNs.
In figure 6 the SNN performs better than the PNN in both accuracy and confidence. While the SNN's accuracy and confidence overlap after 8 epochs, the PNN shows a gap between both lines at all times. This suggests that the SNN is "sure" about its outputs, while the PNN is more volatile. The SNN's confidence curve is much steeper than the PNN's and quickly approximates its accuracy curve. Both accuracy curves start off almost identically, but the PNN levels off at about 90 percent while the SNN still rises to about 94 percent. After those points both accuracy curves run almost horizontally and in parallel, and the gap remains constant until the end of the test. Even small differences within the range of 90 to 100 percent are significant, which makes the SNN considerably more efficient than the PNN in terms of accuracy and cost.
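A sketch of how such accuracy and confidence curves can be scored in Go follows. Accuracy is the fraction of correct argmax predictions; confidence is taken here as the mean activation of the winning output neuron. The latter is an assumption, since the paper does not spell out its exact confidence formula.

```go
package main

import "fmt"

// accuracyAndConfidence scores a batch of softmax outputs.
// Accuracy: fraction of samples where argmax equals the label.
// Confidence: mean of the winning output activation (assumed
// definition, for illustration only).
func accuracyAndConfidence(outputs [][]float64, labels []int) (float64, float64) {
	correct, confSum := 0.0, 0.0
	for k, out := range outputs {
		best := 0
		for i, v := range out {
			if v > out[best] {
				best = i
			}
		}
		if best == labels[k] {
			correct++
		}
		confSum += out[best]
	}
	n := float64(len(outputs))
	return correct / n, confSum / n
}

func main() {
	outputs := [][]float64{
		{0.125, 0.75, 0.125}, // predicts class 1, confidently
		{0.5, 0.25, 0.25},    // predicts class 0, less confidently
	}
	labels := []int{1, 2} // second prediction is wrong
	acc, conf := accuracyAndConfidence(outputs, labels)
	fmt.Println(acc, conf) // 0.5 0.625
}
```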
Cost of Networks. To see how successful the training of different PNNs is, the costs of three parallel networks with a varying number of CNs have been recorded over 300 epochs. The results are illustrated in figure 7.
Fig. 7. Average Costs of PNNs over 300 epochs. The vertical lines show the lowest cost for each PNN.
<details>
<summary>Image 7 Details</summary>

### Visual Description
Line chart of average cost (y-axis, roughly 0.01 to 0.03) against epochs 0 to 300 for PNNs composed of 2, 10 and 20 CNs. All three curves drop steeply within the first 50 epochs and then flatten with minor oscillations. The 2-CN PNN falls fastest, from about 0.028 to around 0.016 by epoch 50; the 10-CN PNN descends most smoothly, from about 0.027 to around 0.017 by epoch 300; the 20-CN PNN starts highest at about 0.029 and reaches roughly 0.016 by epoch 300. The 2-CN and 20-CN curves show a brief spike near epoch 95. By epoch 300 the three curves have nearly converged.
</details>
It shows that the costs of all three PNNs sink rapidly within the first 50 epochs. Afterwards, the error decreases more slowly, drawing a soft curve that flattens out almost to stagnation. Apparently, all PNNs move quickly towards a minimum at the beginning of training, then slow down and finally get stuck, only moving slightly up and down around the minimum. Similar to earlier tests, a PNN built with fewer CNs performs better: more CNs leave the curve further up the y-axis, as the 2-PNN outperforms both the 10- and 20-PNN. It also reaches its best configuration, i.e. the point where costs are lowest, significantly earlier than the other tested PNNs. The 10- and 20-PNN reach their best costs within a relatively close range of epochs, but late compared to the 2-PNN. Figure 7 clearly shows a decrease in quality for PNNs formed from more CNs, which indicates that the combination function needs optimization. In the long term, costs behave the same as accuracy: after 300 epochs the difference has almost leveled out.
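The average cost tracked in figure 7 can be sketched as a mean quadratic cost over a batch, i.e. the squared distance between the network output and the one-hot target, averaged over samples. The paper does not state its exact cost formula, so this definition is an assumption for illustration.

```go
package main

import "fmt"

// averageCost computes the mean quadratic cost over a batch:
// half the squared distance between output and one-hot target,
// averaged over all samples (assumed cost definition).
func averageCost(outputs, targets [][]float64) float64 {
	total := 0.0
	for k := range outputs {
		for i := range outputs[k] {
			d := outputs[k][i] - targets[k][i]
			total += d * d
		}
	}
	return total / (2 * float64(len(outputs)))
}

func main() {
	out := [][]float64{{0.9, 0.1}, {0.2, 0.8}}
	tgt := [][]float64{{1, 0}, {0, 1}} // one-hot labels
	fmt.Println(averageCost(out, tgt))
}
```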
## 6 Findings and Conclusion
This paper presents and analyses PNNs composed of several sequential neural networks. The PNNs are tested for time, accuracy and cost and compared to an SNN.
The parallelization approach shows excellent speedup on three different multicore systems: the time necessary for training a PNN decreases steadily as the number of CNs, i.e. the number of goroutines, increases.
On all three tested systems the training time decreased steadily with an increasing number of CNs, i.e. goroutines. While the gains were significant for the first few added goroutines, they leveled out after reaching a system's number of cores. A PNN with 2 CNs takes 40% to 50% less time than an SNN, and a PNN with 4 CNs takes 60% to 70% less.
While time is a strong point of the PNN, accuracy also depends on the number of CNs a PNN is formed from. Fewer CNs resulted in longer training times but better accuracy in fewer epochs; more CNs made training faster but the learning process slower. After 20 epochs a PNN composed of 2 CNs reached an accuracy of almost 98%, while a PNN composed of 20 CNs only slightly exceeded the 96% line. When both PNNs were trained for a longer period this difference shrank dramatically: trained for 300 epochs, the accuracies differed by only 0.2% in favor of the PNN made out of 2 CNs. While this proved the ability to learn with a small dataset, it also demonstrated that bigger datasets deliver a better result faster; the PNNs can improve by 0.41% and 2.28% when trained for a longer period. These results were achieved using averaging as the combination function, and the chances of achieving an even better accuracy by improving the combination function are high. The cost of a PNN also depends on the number of CNs; it behaves like accuracy and can likewise be improved by an optimized combination function. However, a
thorough analysis on the effects of improved combination functions is planned for future work and is beyond the scope of this paper.
Summing up, PNNs proved to be very time efficient but are still lacking in terms of accuracy. Even without the plenty of other optimizations available, e.g. adjusted learning rates [4], a PNN proved to be more time efficient than an SNN. However, until the issue of accuracy has been taken care of, the SNN surpasses the PNN in practice.
We close the paper with a final word on the feasibility of Golang for parallel neural network simulation: Data parallelism proved to be an efficient parallelization strategy. In combination with the programming language Go, a parallel neural network implementation is coded as fast as a sequential one, as no special efforts are necessary for concurrent programming thanks to Go's concurrency primitives, which offer a simple solution for multithreading.
## References
1. Ben-Nun, T., Hoefler, T.: Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR) 52 (4), 1-43 (2019)
2. Brezany, P., Mueck, T.A., Schikuta, E.: A software architecture for massively parallel input-output. In: Waśniewski, J., Dongarra, J., Madsen, K., Olesen, D. (eds.) Applied Parallel Computing Industrial Computation and Optimization. pp. 85-96. Springer Berlin Heidelberg, Berlin, Heidelberg (1996)
3. Darema, F.: The spmd model: Past, present and future. In: European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting. pp. 1-1. Springer (2001)
4. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
5. Hoare, C.A.R.: Communicating sequential processes. In: The origin of concurrent programming, pp. 413-443. Springer (1978)
6. Huqqani, A.A., Schikuta, E., Ye, S., Chen, P.: Multicore and gpu parallelization of neural networks for face recognition. Procedia Computer Science 18 (Supplement C), 349 - 358 (2013), 2013 International Conference on Computational Science
7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), 2278-2324 (1998)
8. Liu, Y., Jing, W., Xu, L.: Parallelizing backpropagation neural network using mapreduce and cascading model. Computational intelligence and neuroscience 2016 (2016)
9. Meyerson, J.: The go programming language. IEEE Software 31 (5), 104-104 (Sept 2014)
10. Pethick, M., Liddle, M., Werstein, P., Huang, Z.: Parallelization of a backpropagation neural network on a cluster computer. In: International Conference on Parallel and Distributed Computing and Systems (PDCS 2003) (2003)
11. Pike, R.: Go at Google: Language design in the service of software engineering. https://talks.golang.org/2012/splash.article (2012), [Online; accessed 06-January-2018]
12. Prasad, N., Singh, R., Lal, S.P.: Comparison of back propagation and resilient propagation algorithm for spam classification. In: 2013 Fifth International Conference on Computational Intelligence, Modelling and Simulation. pp. 29-34 (Sept 2013)
13. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: IEEE International Conference on Neural Networks. pp. 586-591. IEEE (1993)
14. Schikuta, E., Weishaupl, T.: N2Grid: neural networks in the grid. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541). vol. 2, pp. 1409-1414 (2004)
15. Schikuta, E.: Structural data parallel neural network simulation. In: Proceedings of 11th Annual International Symposium on High Performance Computing Systems (HPCS'97), Winnipeg, Canada (1997)
16. Schikuta, E., Fuerle, T., Wanek, H.: ViPIOS: The Vienna parallel input/output system. In: Pritchard, D., Reeve, J. (eds.) Euro-Par'98 Parallel Processing. pp. 953-958. Springer Berlin Heidelberg, Berlin, Heidelberg (1998)