I am too lazy to watch the whole video at this point, so for now I'll sum up what (I think) I know on the topic.
Yes, it is customary to convert the output layer (usually of a classification-type network) to a set of positive numbers adding up to 1. This way they can indeed be interpreted as the probabilities of the input belonging to a specific class. This is often done using the SoftMax function.
Here is an example of SoftMax use. Say you have a NN which determines whether there is a dog present in the input picture. The output layer has 2 values whose range is all over the place. If you get -5 for the dog and -20 for "no dog", is this a strong indication or not? How about 10 and 8? You want a reasonable range for making decisions. If you use SoftMax, then (-5, -20) gets mapped to (0.99999969, 0.00000031) and (10, 8) to (0.88079708, 0.11920292) -- much easier to interpret IMHO.
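If you want to check those numbers yourself, here is a minimal NumPy sketch of SoftMax (the max-subtraction is just a standard numerical-stability trick, not anything specific to the video):

[code]
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating so large inputs don't overflow;
    # this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([-5.0, -20.0])))  # ~[0.99999969, 0.00000031]
print(softmax(np.array([10.0, 8.0])))    # ~[0.88079708, 0.11920292]
[/code]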
For hidden layers the outputs of "activators" (non-linear components like the sigmoid function) do not have to be in the (0,1) interval. Sigmoid is not the only non-linearity; in fact, it is not even the most popular. For reasons I am not familiar with, ReLU, which maps all values to the [imath][0,\infty)[/imath] interval, seems to be more popular. Its close relative LeakyReLU actually maps to [imath](-\infty,\infty)[/imath]. The use of the sigmoid function and the mapping of values to (0,1) in the video might simply be the choice of the lecturer (but I haven't watched the whole video).
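To make those ranges concrete, here is a quick sketch of the three activations mentioned above (the 0.01 slope for LeakyReLU is just a commonly used default I picked for illustration):

[code]
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)             # output in [0, inf)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # output in (-inf, inf)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(sigmoid(x))
print(relu(x))
print(leaky_relu(x))
[/code]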
As for the interpretation of values in hidden layers, it looks to my uneducated eye like a rich and difficult research subject. It might occasionally be easy in a small neural net, like the one presented in the video (again, I don't know without watching the whole video), but for actually useful large networks with hundreds or even thousands of hidden layers this might be quite a challenge.
Hope this is helpful, but if not I might try watching the video sometime later (next year?) and, hopefully, get a better idea of the issues there.