Deep Learning Lecture Series Update (11/20) and some notes on information theory

Everyone,

I apologize for cancelling today's Deep Learning lecture. The material will be pushed back to the 27th. You do, however, get additional time to look into the ideas of information theory before class.

I recommend this section from the appendix of Amazon's deep learning book: 18.11. Information Theory — Dive into Deep Learning 0.17.0 documentation

The big conceptual leap (as opposed to a technical or mathematical leap) that you have to make in understanding information theory is to realize that "information" is a value-laden concept; it is not value-neutral.

Information theory centers on the concept of "Shannon entropy", where information is valued according to how surprising the contents of a message are to the receiver. There are, however, alternative ways of understanding information: Fisher information in statistics, for example, concerns itself with the amount of information that an observable variable carries about some parameter. In information theory, we stick with Shannon's conceptualization.
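
To make the "surprise" framing concrete, here is a minimal sketch in Python (the function names are mine, not from the book): the surprisal of an outcome is -log2(p), and entropy is just the average surprisal over the whole distribution.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal (self-information) of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(probs: list[float]) -> float:
    """Shannon entropy: the expected surprisal of a distribution, in bits."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# A fair coin is maximally surprising on average for two outcomes...
print(entropy([0.5, 0.5]))    # 1.0 bit
# ...while a heavily biased coin carries almost no information per flip.
print(entropy([0.99, 0.01]))  # ~0.081 bits
```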

There are several connections between this and modeling. For example, we generally want our priors to maximize entropy, since a maximum entropy prior encodes nothing beyond our stated assumptions (in other words, it best represents your current state of knowledge). The normal distribution (bell curve) is the maximum entropy distribution over the real line if we only fix the mean and the variance of the data. As we alter our assumptions about the data, the maximum entropy distribution changes with them. When we model using whichever distribution is maximum entropy under our constraints, we are following the principle of maximum entropy.
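
You can check the claim about the normal distribution numerically. The sketch below (standard textbook entropy formulas, not from the linked chapter) compares the differential entropy of three distributions that all share mean 0 and variance 1; the Gaussian comes out on top, as the principle predicts.

```python
import math

# Differential entropies (in nats) of three distributions,
# each scaled to have mean 0 and variance 1.
sigma2 = 1.0

# Gaussian: h = 0.5 * ln(2 * pi * e * sigma^2)
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# Laplace with variance 2b^2 = sigma^2: h = 1 + ln(2b)
b = math.sqrt(sigma2 / 2)
h_laplace = 1 + math.log(2 * b)

# Uniform on [-a, a] with variance a^2 / 3 = sigma^2: h = ln(2a)
a = math.sqrt(3 * sigma2)
h_uniform = math.log(2 * a)

print(f"Gaussian: {h_gaussian:.4f} nats")  # ~1.4189 (the maximum)
print(f"Laplace:  {h_laplace:.4f} nats")   # ~1.3466
print(f"Uniform:  {h_uniform:.4f} nats")   # ~1.2425
```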

When we fully specify a model, we can choose how we wish to fit the model to the data. Maximum likelihood is a very common way to fit a broad class of discriminative models (more on what this means later). Linear models are usually fit using maximum likelihood in statistics. In machine learning, we typically fit them instead with an iterative method called minibatch stochastic gradient descent.
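
As a preview of how this looks in practice, here is a minimal sketch (with made-up hyperparameters) of minibatch SGD fitting a linear model; because the noise is Gaussian, minimizing squared error is the same as maximizing the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 + Gaussian noise.
# Under Gaussian noise, maximizing the likelihood of (w, b)
# is equivalent to minimizing the mean squared error.
X = rng.normal(size=(1000, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=1000)

w, b = 0.0, 0.0           # initial parameters
lr, batch_size = 0.1, 32  # hypothetical hyperparameters

for epoch in range(20):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb          # residuals on the minibatch
        w -= lr * (2 * err * xb).mean()  # gradient of MSE w.r.t. w
        b -= lr * (2 * err).mean()       # gradient of MSE w.r.t. b

print(w, b)  # should approach the true values 2.0 and 1.0
```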
