## Playing with neural networks

#### The Code and Basic Setup

At first, let us consider a simple setup of a neural network. It can be implemented by the following Python code, which generalizes some work found in this GitHub article. The code creates and trains a neural network consisting of n layers (variable 'layers', see CONFIG section) with m neurons each (variable 'neurons_per_layer'); the resulting structure is visualized in Fig. 1.

The network takes a single parameter as input and produces one output value. It can be tested using the 'predict' method found at the end of the code. Running the 'train' method teaches the network to reproduce an input number between 0 and 1 as its output, i.e. to learn the identity function. For further explanation of the code structure, read the article mentioned at the top first, and note that the following lines do nothing else but generalize the training process to any number of hidden layers, rather than just 0 or 1.

```python
import numpy as np

####### CONFIG ##############
training_set_size = 20
layers = 3  # number of hidden layers + input layer + output layer,
            # needs to be >= 2 (2 means there is no hidden layer)
neurons_per_layer = 10
####### END CONFIG ##########

def nonlin(x, deriv=False):  # Sigmoid function and its derivative
    if deriv:
        # Note: expects the sigmoid OUTPUT, since sigma'(z) = sigma(z) * (1 - sigma(z))
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

syn = [None] * (layers + 2)  # weight matrices between the layers

def train():
    print("Training: training set size: " + str(training_set_size) +
          " network: " + str(layers) + " layers " +
          str(neurons_per_layer) + " neurons/layer")
    # Create training set
    X = np.random.random((training_set_size, 1))
    y = np.random.random((training_set_size, 1))
    for i in range(training_set_size):
        y[i][0] = X[i][0] * 1  # ***TASK*** Here as an example: identity for numbers between 0 and 1
    np.random.seed(1)
    # Create random weights for all neurons, drawn from [-1, 1)
    syn[0] = np.array(2 * np.random.random((1, neurons_per_layer)) - 1)
    for k in range(1, layers - 1):
        syn[k] = np.array(2 * np.random.random((neurons_per_layer, neurons_per_layer)) - 1)
    syn[layers - 1] = np.array(2 * np.random.random((neurons_per_layer, 1)) - 1)
    # Backpropagation learning using gradient descent
    for j in range(20000):  # number of iterations set to 20000
        l = [None] * (layers + 1)        # layer activations
        l_error = [None] * (layers + 1)  # per-layer errors
        l_delta = [None] * (layers + 1)  # per-layer deltas
        l[0] = X
        # Forward pass
        for k in range(1, layers + 1):
            l[k] = nonlin(np.dot(l[k - 1], syn[k - 1]))
        l_error[layers] = y - l[layers]
        if (j % 5000) == 0:
            print("Error:" + str(np.mean(np.abs(l_error[layers]))))
        # Backward pass: propagate the error through all layers
        l_delta[layers] = l_error[layers] * nonlin(l[layers], deriv=True)
        for k in range(1, layers):
            ki = layers - k
            l_error[ki] = l_delta[ki + 1].dot(syn[ki].T)
            l_delta[ki] = l_error[ki] * nonlin(l[ki], deriv=True)
        # Weight update
        for k in range(1, layers + 1):
            ki = layers - k
            syn[ki] += l[ki].T.dot(l_delta[ki + 1])

def predict(arg1):
    # Method for testing the trained (or untrained) network
    x = np.array([[arg1]])
    ld = [None] * (layers + 2)
    ld[0] = x
    for m in range(1, layers + 1):
        ld[m] = nonlin(np.dot(ld[m - 1], syn[m - 1]))
    return ld[layers]
```
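One detail worth noting: 'nonlin(x, deriv=True)' expects the *output* of the sigmoid, not its input, because \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\). A small standalone check of this trick:

```python
import numpy as np

def nonlin(x, deriv=False):  # same helper as in the code above
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

z = 0.3
s = nonlin(z)  # sigmoid output at z
# Analytic derivative of the sigmoid at z:
analytic = np.exp(-z) / (1 + np.exp(-z)) ** 2
# Passing the sigmoid OUTPUT back in with deriv=True gives the same value:
deriv_via_output = nonlin(s, deriv=True)
```

This is why the backward pass can apply 'nonlin(..., deriv=True)' directly to the stored activations 'l[k]' without recomputing the sigmoid.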

#### Evaluating Examples

We start by looking at Fig. 2, which shows the prediction results of a 3-layer network with 10 neurons per (input or hidden) layer as a function of the training set size. As in the code shown above, the network was asked to learn the identity function. It may seem as though a bigger training set results in a smaller prediction error and therefore in a better outcome. This is a fallacy: examining more examples with larger training sets, we will see the prediction result worsen again.

The identity function is a very basic learning example for neural networks, requiring only one input parameter. A further single-input problem would be, e.g., finding the square of a number x (0 < x < 1): \(f(x) = x^2\).
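Switching the training task only requires editing the line marked ***TASK*** in the 'train' method. A minimal sketch of the change for the square function:

```python
# In train(), replace the ***TASK*** line with:
#     y[i][0] = X[i][0] ** 2
# Minimal demonstration of the new target mapping:
X = [[0.2], [0.5], [0.9]]
y = [[x[0] ** 2] for x in X]  # targets are the squares of the inputs
```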

It does not take much time to update the Python code so that the neural network can handle two input parameters, which makes learning processes for more complex tasks accessible.

These include:

- The addition of two numbers x and y (both between 0 and 1): \(f(x,y) = x + y\).
- The multiplication of two numbers x and y (both between 0 and 1): \(f(x,y) = xy\).
- Some kind of distance calculation: \(f(x,y) = 1\) if \((x-y)^2 < 0.1\), \(f(x,y) = 0\) otherwise.
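For the two-input version, only the shapes change: the training set gets two columns, the first weight matrix maps two inputs into the first layer, and 'predict' takes two arguments. A sketch of the resulting forward pass (variable names here are illustrative, with untrained random weights):

```python
import numpy as np

np.random.seed(1)
neurons_per_layer = 10
# First weight matrix now maps 2 inputs into the first layer:
syn0 = 2 * np.random.random((2, neurons_per_layer)) - 1
syn1 = 2 * np.random.random((neurons_per_layer, 1)) - 1

def nonlin(x):  # sigmoid, as in the code above
    return 1 / (1 + np.exp(-x))

def predict(a, b):  # two-parameter version of the predict method
    l0 = np.array([[a, b]])          # shape (1, 2)
    l1 = nonlin(np.dot(l0, syn0))    # shape (1, neurons_per_layer)
    return nonlin(np.dot(l1, syn1))  # shape (1, 1)

out = predict(0.3, 0.7)
```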

##### Addition and Multiplication

Watch the following animated GIFs, which show the prediction quality of the neural network as it is trained with an increasing amount of data.

In each GIF the upper heat map shows the actual prediction, while the lower heat map visualizes the squared deviation between this prediction and the exact value obtained by conventional computation without neural structures (just evaluating the function directly).
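The lower heat map can be computed directly, as a grid of squared deviations. A minimal sketch, using addition as the target and a hypothetical stand-in for the trained network's 'predict' method:

```python
import numpy as np

def target(x, y):  # the function the network is supposed to learn
    return x + y

def predict(x, y):  # hypothetical stand-in for the trained network
    return x + y + 0.05  # pretend prediction with a small constant error

xs = np.linspace(0, 1, 50)
ys = np.linspace(0, 1, 50)
X, Y = np.meshgrid(xs, ys)
# Data for the lower heat map: squared deviation at every grid point
sq_dev = (predict(X, Y) - target(X, Y)) ** 2
```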

You will recognize that in these two basic examples the learning process goes well as soon as the training set is big enough. From time to time, for specific training set sizes, a sudden short deterioration of the result can be observed. But the general trend again appears to promise higher accuracy for bigger training set sizes.

##### More Complicated Function with Discontinuity

Now we consider the function which assigns 1 if the squared distance of the two parameters is smaller than 0.1 and 0 otherwise, and we try to make the neural network learn it. This task is much harder to train than the two preceding examples involving simple addition and multiplication. An advanced, modified implementation which promises to improve the backpropagation algorithm, and therefore the overall prediction result, may be elucidated in future articles. For now, we stick to the Python code shown at the beginning.
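Written out as code, the target function from the list above is simply (a minimal sketch):

```python
def distance_indicator(x, y):
    """1 if the squared distance of x and y is below 0.1, else 0."""
    return 1 if (x - y) ** 2 < 0.1 else 0
```

The discontinuity along the curve \((x-y)^2 = 0.1\) is what makes this target so much harder for the network than the smooth addition and multiplication functions.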

There are three GIF animations for this problem. The first one shows the prediction results and errors for a network consisting of 3 neurons (no hidden layer: an input layer with 2 neurons plus an output layer with 1 neuron). The neural structure used for the second one has 13 neurons (2 hidden layers plus the input layer, with 4 neurons per layer, plus the output layer), and for the third GIF there were 31 neurons (4 hidden layers plus the input layer, with 6 neurons per layer, plus the output layer) contributing to the prediction.

Three observations can be written down:

- First of all, the runs for all three network sizes have in common that the prediction result becomes unstable and worsens for big training set sizes.
- While the 3 neurons of the first network obviously do not manage to reach an acceptable result, the deeper networks with 13 and 31 neurons respectively achieve better results for all parameters except those whose squared distance approaches 0.1, which corresponds to the discontinuity of the function.
- Also note that there are, repeatedly and at seemingly randomly distributed training set sizes, results which can locally be considered good.

##### Square Function

For the task of finding the square of a number x (0 < x < 1), \(f(x) = x^2\), we go one step further and additionally examine how the observations made in the previous part persist under different structures and sizes of the neural network (at a constant number of iterations). Consequently, the number of layers n and the number of neurons per layer m are of interest at this point.

The hue in the following animated diagram expresses the size of the total prediction error made by a neural network with n layers (horizontal axis) and m neurons per layer (vertical axis). The animation shows how this total prediction error depends on n and m as the training set size rises. A blue field means a small (i.e. good) error, while a red field refers to a bad prediction result.
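Such a diagram can be produced by looping over both structure parameters and recording a total error per combination. A sketch of the scaffolding only, with a hypothetical stand-in for the train-and-evaluate step:

```python
import numpy as np

def total_prediction_error(n_layers, n_neurons):
    # Hypothetical stand-in: in the real experiment this would train a
    # network with the given structure and sum |prediction - x^2| over
    # a grid of test inputs. A dummy value keeps the sketch runnable.
    return 1.0 / (n_layers * n_neurons)

layer_values = range(2, 8)    # n, horizontal axis of the diagram
neuron_values = range(1, 11)  # m, vertical axis of the diagram
# One row per m, one column per n -- the data behind the heat map
error_grid = np.array([[total_prediction_error(n, m)
                        for n in layer_values]
                       for m in neuron_values])
```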

As in the preceding paragraph, we can make certain observations:

- For very small training set sizes the blue area indicating good result quality spreads very fast in the horizontal as well as in the vertical direction.
- In the course of the animation both propagations stagnate and then even invert, moving back towards low numbers of layers and of neurons per layer.
- The inversion of the horizontal propagation sets in much later than its vertical counterpart.

In a similar way we can examine how the prediction results depend on the number of iterations when the training set size is held constant. It turns out that the same propagation pattern can be observed, excluding the inversion part.

But which field in the two diagrams above is the "bluest" one, marking the best neural network structure for the given number of iterations or training units, respectively?

Aiming for the lowest prediction errors, and taking into account that the lowest prediction error always occurs for networks with a low number of layers n (from 2 up to 5), we can learn from Fig. 10 what is summarized in the following table:

| shape of the neural network | training units needed | iterations needed |
| --- | --- | --- |
| a lot of neurons per layer | fewer | more |
| just a few neurons per layer | more | fewer |

So adding additional layers to your network does not significantly affect the learning behavior (apart from slowing it down), nor does it affect the quality of the result. But the general trend suggests that by adding more neurons to each layer, you need less training data and more backpropagation iterations to reach the lowest prediction error.