From Attention in Transformers to Dynamic Routing in Capsule Nets

In the last step, the outputs of all the attention heads are concatenated and transformed linearly to compute the output of the multi-head attention component:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

So, in terms of the parameters that are learned, for each layer we have one transformation matrix, W^O, which is applied to the concatenation of the outputs from all the attention heads, and a set of three transformation matrices for each attention head, i.e., W_i^Q, W_i^K, and W_i^V.

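To make these shapes concrete, here is a minimal NumPy sketch of multi-head attention built from exactly these parameters (the per-head matrices W_i^Q, W_i^K, W_i^V and the output projection W^O). The head count, dimensions, and function names are illustrative assumptions, not part of the original description.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """One head of scaled dot-product attention over a (seq_len, d_model) input."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each (seq_len, d_k or d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (seq_len, seq_len)
    return softmax(scores) @ V                      # (seq_len, d_v)

def multi_head_attention(X, heads, W_O):
    """Run each head, concatenate the results, and apply the output projection W_O."""
    concat = np.concatenate([attention_head(X, *w) for w in heads], axis=-1)
    return concat @ W_O                             # (seq_len, d_model)

# Toy shapes (assumed): d_model = 8, h = 2 heads with d_k = d_v = 4, 3 tokens
rng = np.random.default_rng(0)
d_model, h, d_k = 8, 2, 4
X = rng.standard_normal((3, d_model))
heads = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))
print(multi_head_attention(X, heads, W_O).shape)    # (3, 8)
```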
Capsule networks were originally proposed to process images in a more natural way. In the convolutional capsule layers, the weight matrix of each capsule type is convolved over the input, similar to how kernels are applied in CNNs. The pose matrices of the primary capsules are simply a linear transformation of the outputs of the lower-layer kernels, and their activations are the sigmoid of a weighted sum of the same set of lower-layer kernel outputs.

Source: staff.fnwi.uva.nl
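
As a rough illustration of that primary-capsule construction, the following NumPy sketch forms pose matrices as a linear map of the lower-layer kernel outputs and activations as the sigmoid of a weighted sum of the same outputs. The 4x4 pose size, capsule count, and variable names are assumptions for the example, not taken from a specific implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def primary_capsules(features, W_pose, w_act, b_act, pose_dim=4):
    """Form primary-capsule poses and activations from lower-layer kernel outputs.

    features: (H, W, C) feature maps from the preceding convolutional layer
    W_pose:   (C, n_caps * pose_dim * pose_dim) linear map producing the poses
    w_act:    (C, n_caps) weights for the activation logits
    b_act:    (n_caps,) activation biases
    """
    H, Wd, C = features.shape
    flat = features.reshape(-1, C)                                  # (H*W, C)
    poses = (flat @ W_pose).reshape(H, Wd, -1, pose_dim, pose_dim)  # linear transform
    activations = sigmoid(flat @ w_act + b_act).reshape(H, Wd, -1)  # sigmoid of weighted sum
    return poses, activations

# Toy example (assumed shapes): 6x6 feature maps, 32 channels, 8 capsule types, 4x4 poses
rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 6, 32))
W_pose = rng.standard_normal((32, 8 * 16))
w_act = rng.standard_normal((32, 8))
b_act = np.zeros(8)
poses, acts = primary_capsules(feats, W_pose, w_act, b_act)
print(poses.shape, acts.shape)   # (6, 6, 8, 4, 4) (6, 6, 8)
```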