Most general view of artificial neural networks
In the previous post I explained what artificial neural networks (ANNs) are. However, that post only covered the most common ANN architecture, which consists of artificial neurons organized into layers. To truly understand what an ANN is, it is important to look at the more general case. In general an ANN doesn’t have to have layers. It doesn’t even have to have neurons. It really is just a function, possibly a very complex one.
All an ANN does is map inputs to outputs. Inputs and outputs are composed of real numbers. To make the input/output data easier for people (programmers, researchers) to work with, the numbers can be grouped into vectors, matrices or tensors. (A tensor is just a more general form of a matrix: while a matrix has 2 dimensions, a tensor may have any number of dimensions.) However, it doesn’t really matter how the numbers are grouped, because every matrix or tensor with specific dimensions can be represented as a vector of a specific length. Again: you can group the numbers any way you like, but it makes no difference from the mathematical standpoint.
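To make the grouping argument concrete, here is a minimal sketch in plain Python. The helper names `flatten` and `unflatten` are my own, chosen for illustration, and row-major order is just one possible convention. It shows a 2 x 3 matrix turned into a vector of length 6 and back without losing anything.

```python
def flatten(matrix):
    """Turn a list of rows into one flat list (row-major order)."""
    return [value for row in matrix for value in row]


def unflatten(vector, rows, cols):
    """Rebuild a rows x cols matrix from a flat list of rows * cols numbers."""
    if len(vector) != rows * cols:
        raise ValueError("vector length does not match the requested shape")
    return [vector[r * cols:(r + 1) * cols] for r in range(rows)]


matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

vector = flatten(matrix)             # the same six numbers, grouped differently
restored = unflatten(vector, 2, 3)   # regrouping recovers the original matrix
```

The numbers themselves never change, only the way they are grouped, which is why the grouping is a convenience for people rather than something the function depends on.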
An ANN can use any mathematical function, but since the fundamental purpose of creating an ANN is to have it automatically learn to convert inputs to appropriate outputs, there is one component that can be considered mandatory. Besides its inputs x1..xn and its outputs y1..yk, the function needs parameters that can be changed arbitrarily by the training process. In the image above you can see these parameters, called w1..wm. They do not come in with the input data but are part of the ANN itself. They are usually called “weights”. You can think of them as the internal memory of the network: this is how the network remembers things. The output depends on both the input and the weights.
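As a rough illustration, the idea of “a function with inputs and weights” can be written down directly. The function `ann` below, its use of `tanh`, and the particular numbers are all hypothetical choices of mine. Any other formula that combines `x` and `w` would fit the general definition equally well.

```python
import math


def ann(x, w):
    """A hypothetical 'network': just a function of inputs x and weights w."""
    x1, x2 = x          # inputs arrive with the data
    w1, w2, w3 = w      # weights belong to the network itself
    y1 = math.tanh(w1 * x1 + w2 * x2 + w3)
    return [y1]         # outputs y1..yk (here k = 1)


x = [0.5, -1.0]
out_a = ann(x, [0.1, 0.2, 0.3])    # same input ...
out_b = ann(x, [1.0, -2.0, 0.0])   # ... different weights, different output
```

The same input produces different outputs under different weights, which is what gives a training process something to adjust.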
An ANN can be considered “trained” if it produces “correct” outputs for the inputs it is given. What “correct” means is determined entirely by the people who are trying to solve some specific problem (for example, distinguishing cat images from dog images). When the network is created the weights have random values, and in this state it will most likely not produce correct responses. The weights have to be adjusted so that the ANN function produces correct outputs.
That input-to-output mapping is achieved with a function that converts input numbers to output numbers. Surprisingly, that is all an ANN really is. In practice, however, only functions with specific properties can be trained by an algorithm that finishes in finite time, uses finite resources, and ends up producing correct outputs for the inputs it is given.
Most common practical characteristics of artificial neural networks
The truth is most mathematical functions are practically useless for solving any input to output mapping problem. There might be multiple reasons for this uselessness:
- The function might not have enough parameters to hold all the information necessary to map all the inputs to correct outputs. As I wrote before, the weights are the internal memory of the network. If there aren’t enough weights, it might not be possible for the network to remember everything that is needed. It is therefore important to make sure the number of weights is sufficient.
- The function might be too simple to model the relationship between the input and the output. For example, if the input of the network were two-dimensional (consisting of x1 and x2 only), you could think of any instance of the input as a point on a plane. Let’s say you have red points and blue points on a plane and you want to return a number below 0.5 for red points and above 0.5 for blue points. If the function is linear (its output is a weighted sum of the inputs, so the boundary between outputs below and above 0.5 is always a straight line, whatever the weights are) then it will only work if the red points can be separated from the blue points with a straight line. For more complex cases a “curved” function may be needed. It is therefore important to make sure the function is complex enough that there exists a set of weights for which all inputs are mapped to correct outputs (with errors low enough to be satisfactory).
- A practical training algorithm may not exist for such a function. ANNs are trained by changing the weights slightly, for a given input or set of inputs, in a way that makes the output closer to the expected one (this is famously called “gradient descent”). No clearly better general way of training them is currently known. However, this method is only possible if the function is differentiable with respect to all of its parameters. An intuitive explanation of what this means is that whenever any of the inputs or weights changes by a very small amount, the output also changes by only a very small amount. In other words, the output changes gradually when the inputs or weights change gradually. This is a very important requirement: if it is met, we can gradually change the weights to get closer and closer to the correct output for a given input, and we can also figure out exactly how to change the weights to get closer to a desired output. If this requirement were not met, a very small change in the parameters might result in a big and apparently random jump in the output values. In such a case it may not be possible or practical to create or run a training algorithm that makes the network converge on the right set of weights.
- Even if all of the above points are taken care of, the function might still not be good enough, either because training it would take an unreasonable amount of computational power or because it was constructed in such a way that gradually changing the weights can only reach a local optimum (a set of weights that is better than its neighbours but not the best possible one).
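The gradual weight adjustment described in the list above can be sketched in a few lines. This is only an illustration under my own assumptions: the `model` function, the four red/blue points, the mean squared error, the learning rate and the step count are all made up for the example. The gradient is estimated by nudging each weight by a tiny amount, which is a simple stand-in for the analytic gradients that real training code normally computes. The weights start at zero instead of random values so that the run is reproducible.

```python
import math


def model(x, w):
    """Hypothetical differentiable function: a sigmoid of a weighted sum."""
    z = w[0] * x[0] + w[1] * x[1] + w[2]
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, data):
    """Mean squared error between the outputs and the expected targets."""
    return sum((model(x, w) - t) ** 2 for x, t in data) / len(data)


def numerical_gradient(w, data, eps=1e-5):
    """Estimate how the loss changes when each weight is nudged slightly."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((loss(up, data) - loss(down, data)) / (2 * eps))
    return grad


def train(w, data, learning_rate=1.0, steps=500):
    """Repeatedly move every weight a little in the direction that lowers the loss."""
    for _ in range(steps):
        grad = numerical_gradient(w, data)
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    return w


# Red points have target 0.0, blue points have target 1.0 (made-up data
# that can be separated with a straight line).
data = [([0.0, 0.0], 0.0), ([0.2, 0.1], 0.0),
        ([1.0, 1.0], 1.0), ([0.9, 0.8], 1.0)]

initial_w = [0.0, 0.0, 0.0]
trained_w = train(initial_w, data)
```

Because `model` changes smoothly with its weights, each small step lowers the error a little. With a function that jumps unpredictably, the estimated gradient would tell us nothing useful about which way to move.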
Since there are so many practical issues with using “any” function to construct an ANN, what people usually do is arrange artificial neurons into multiple layers. It turns out that the above issues are easier to solve in ANN architectures based on this idea. This is probably because there are so many research papers that discuss solutions based on this architecture, and many solutions to common issues have already been figured out, well tested and documented. However, that doesn’t mean there is no completely different kind of architecture that would be even easier to train.
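To tie this back to the general view, a layered network is still just one function of inputs and weights. The sketch below is a hypothetical example of mine: two inputs, two hidden neurons and one output neuron, with `tanh` as an assumed activation and nine arbitrary weights stored in one flat list.

```python
import math


def layer(inputs, weights, biases):
    """One layer: each neuron applies tanh to a weighted sum of all inputs."""
    return [math.tanh(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]


def layered_ann(x, w):
    """Two inputs -> two hidden neurons -> one output, using 9 flat weights."""
    hidden = layer(x, [w[0:2], w[2:4]], w[4:6])   # 4 weights + 2 biases
    return layer(hidden, [w[6:8]], [w[8]])        # 2 weights + 1 bias


weights = [0.5, -0.3, 0.8, 0.1, 0.0, 0.2, 1.0, -1.0, 0.05]
output = layered_ann([1.0, 2.0], weights)
```

From the outside, `layered_ann(x, w)` has exactly the same shape as any other function of inputs and weights. The layers are only a convenient way of building it.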