It’s Only Natural: An Excessively Deep Dive into Natural Gradient Optimization


If we use a KL divergence as a way of scaling our gradient steps, that means we see two parameter configurations as “farther apart” in this space if, for a given set of input features, they would induce predicted class distributions that are very different in terms of KL divergence.
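To make that concrete, here is a minimal sketch (a toy softmax classifier; the names, shapes, and step sizes are my own, purely illustrative choices) that measures the “distance” between two parameter configurations as the average KL divergence between the class distributions they predict on the same batch of inputs:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_kl(W_old, W_new, X):
    """Average KL(p_old || p_new) between the class distributions
    two parameter configurations predict on the same inputs X."""
    p_old = softmax(X @ W_old)   # shape (batch, classes)
    p_new = softmax(X @ W_new)
    kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))                    # a batch of input features
W_old = rng.normal(size=(5, 3))                  # current parameters
W_near = W_old + 1e-3 * rng.normal(size=(5, 3))  # tiny move in parameter space
W_far  = W_old + 1.0 * rng.normal(size=(5, 3))   # large move in parameter space

print(predictive_kl(W_old, W_near, X))  # ~0: predictions barely change
print(predictive_kl(W_old, W_far, X))   # much larger: predictions change a lot
```

The point of measuring distance this way is that two steps of the same Euclidean length in parameter space can produce wildly different KL values, depending on how sensitive the predicted distribution is to those particular parameters.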
The Fisher Thing

So far, we’ve discussed why scaling the distance of our update step in parameter space is unsatisfyingly arbitrary, and suggested a less arbitrary alternative: scaling our steps so that they move, at most, a certain distance in terms of KL divergence from the class distribution our model had previously been predicting.
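Written as an explicit constraint (one common way to formalize the idea; the notation below is mine, not a quote from any particular paper), the update looks like:

$$
\theta_{t+1} \;=\; \arg\min_{\theta} \; \mathcal{L}(\theta)
\quad \text{subject to} \quad
\mathbb{E}_{x}\left[\, D_{\mathrm{KL}}\big( p_{\theta_t}(y \mid x) \,\|\, p_{\theta}(y \mid x) \big) \,\right] \le \epsilon,
$$

where $p_{\theta}(y \mid x)$ is the class distribution the model predicts under parameters $\theta$, and $\epsilon$ is the maximum allowed divergence per step.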
Gradient with respect to loss

Typically, your classification loss is a cross entropy function, but more broadly, it’s some function that takes as input your model’s predicted probability distribution and the true target values, and returns higher values the farther your predicted distribution is from the target.
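As a minimal illustration of that “higher when farther from the target” behavior (the probabilities here are made up for the example), cross entropy for a single example is just the negative log probability the model assigns to the true class:

```python
import numpy as np

def cross_entropy(pred_probs, target_index):
    # Negative log probability assigned to the true class
    return -np.log(pred_probs[target_index])

target = 0                                # the true class
close  = np.array([0.90, 0.05, 0.05])     # predicted distribution near the target
far    = np.array([0.10, 0.45, 0.45])     # predicted distribution far from the target

print(cross_entropy(close, target))  # ~0.105
print(cross_entropy(far, target))    # ~2.303
```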
