Basic Neural Network Question: Why are the nodes in a neural network made to have values ranging between 0 and 1?

I am brand new to studying machine learning. Why are the nodes in a neural network made to have values ranging between [imath]0[/imath] and [imath]1[/imath]? It seems like they should range between [imath]0[/imath] and [imath]\frac{1}{n}[/imath], where [imath]n[/imath] is the number of nodes in that same layer. Is it wrong to think of a node's value as a probability within the space of all nodes in its layer (either the output layer or a hidden layer), in which case the values within each entire layer should sum to [imath]1[/imath]?
 
Are they? Are you talking about some specific network? Or some particular AI paper?
I am basing what I said on this four-part video series on neural networks, first mentioned in the first video at 10:14. Within the input layer, each node value has the "physical" interpretation of the grayscale value of a particular pixel. The output layer seems to have the interpretation of a probability space, in which the probability of an image being a given digit is twice that of another digit iff the node value of the former is twice that of the latter (perhaps this is wrong, and the node values only purport to be monotonically increasing in the probabilities, such that you can use the output of the network to infer the most probable digit, but not a confidence level?).

I misstated the problem when I said the values should range between [imath]0[/imath] and [imath]\frac{1}{n}[/imath]; what I meant is that the node values in the output layer should sum to [imath]1[/imath]. However, applying something like a sigmoid function to each node in the output layer independently likely won't produce this. Instead, that approach seems to treat a single node as a probability space rather than the entire layer, which doesn't make sense to me.
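To make the concern concrete, here is a small Python sketch (my own illustration, not code from the video; the raw output values are made up) showing that per-node sigmoids land in (0,1) but do not make the layer sum to 1, whereas normalizing over the whole layer does:

[code]
import math

def sigmoid(x):
    # logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# hypothetical raw output-layer values for a 4-node output layer
raw_outputs = [2.0, -1.0, 0.5, 3.0]

# sigmoid applied to each node independently: each value is in (0, 1),
# but the layer as a whole does NOT sum to 1
per_node = [sigmoid(z) for z in raw_outputs]
print(per_node, sum(per_node))        # sum is about 2.72, not 1

# normalizing over the whole layer (what SoftMax, discussed in the replies
# below, does) restores the "probability space over the layer" reading
exps = [math.exp(z) for z in raw_outputs]
layer_probs = [e / sum(exps) for e in exps]
print(layer_probs, sum(layer_probs))  # sums to 1 (up to rounding)
[/code]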

A hidden layer is obviously harder to conceptualize than the output layer, so even given my conclusion, I'm not sure if I should expect each hidden layer to act as a probability space.

There might also be an asterisk to account for sampling error in the training data, such that a layer should only approach acting as a probability space in the limit toward infinite training data?
 
I think someone edited the question title using copy/paste and the LaTeX has come out wrong.
 
I am too lazy to watch the whole video at this point, so for now I'll sum up what (I think) I know on the topic.

Yes, it is customary to convert the output layer (usually of a classification-type network) to a set of positive numbers adding up to 1. This way they can indeed be interpreted as the probabilities of the input belonging to a specific class. This is often done using the SoftMax function.
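For reference, the usual definition of SoftMax on raw outputs [imath]z_1,\dots,z_n[/imath] is [imath]\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}[/imath]: every value is positive and the layer sums to [imath]1[/imath] by construction.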

Here is an example of SoftMax use. Say you have a NN which determines whether there is a dog present in the input picture. The output layer has 2 raw values whose range is all over the place. If you get -5 for the dog and -20 for "no dog", is this a strong indication or not? How about 10 and 8? You want a reasonable range for making decisions. If you use SoftMax, then (-5, -20) gets mapped to (0.99999969, 0.00000031) and (10, 8) to (0.88079708, 0.11920292) -- much easier to interpret IMHO.
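If you want to verify those numbers, here is a tiny Python sketch (plain math, no NN library) that reproduces them:

[code]
import math

def softmax(values):
    # subtract the max before exponentiating for numerical stability;
    # this does not change the result
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([-5, -20]))  # ~[0.99999969, 0.00000031]
print(softmax([10, 8]))    # ~[0.88079708, 0.11920292]
[/code]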

For hidden layers the outputs of "activators" (non-linear components like the sigmoid function) do not have to be in the (0,1) interval. Sigmoid is not the only non-linearity; in fact, it is not even the most popular. For reasons I am not familiar with, ReLU, which maps all values to the [imath][0,\infty)[/imath] interval, seems to be more popular. Its close relative LeakyReLU actually maps to [imath](-\infty,\infty)[/imath]. The use of the sigmoid function and mapping values to (0,1) in the video might simply be the choice of the lecturer (but I haven't watched the whole video).
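To make the ranges concrete, here is a small Python sketch of the three activations mentioned (the 0.01 negative slope for LeakyReLU is just a common default, not something from the video):

[code]
import math

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # clips negatives to 0, so outputs lie in [0, infinity)
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # keeps a small slope for negative inputs, so outputs cover the whole real line
    return x if x > 0 else slope * x

for x in (-3.0, 0.0, 2.5):
    print(x, sigmoid(x), relu(x), leaky_relu(x))
[/code]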

As for the interpretation of values in hidden layers, it looks to my uneducated eye like a rich and difficult research subject. It might occasionally be easy in a small neural net, like the one presented in the video (again, I don't know without watching the whole video), but for actual useful large networks with hundreds or thousands of hidden layers this might be quite a challenge.

Hope this is helpful, but if not I might try watching the video sometime later (next year? :) ) and, hopefully, get a better idea of the issues there.
 
I think that pretty much answers my questions, thanks! The videos also connect with some of what you said, such as the final hidden layer not actually corresponding to probabilities of human-intelligible facets of the hand-written numbers (loops, straight lines, etc.). I just wasn't sure whether it should be understood as a probability space over some indescribable black-box events, and should still obey the probability axioms.
 