Machine learning systems are stuck in a rut
New primitives that don’t fit into these existing kernels can be compiled into custom kernels using, e.g., Tensor Comprehensions or PlaidML, but the current state of the art only really supports small code fragments and frequently doesn’t get close to peak performance (e.g. a factor of 8x slower after a one-hour search, for the conventional 2D convolution the authors used as an experiment). Tuning the performance of a single non-standard kernel is hard enough, but full programs must typically evaluate a large graph of kernels. To make use of pre-optimised kernels, it’s necessary to use one of a small number of parameter layouts that have been chosen ahead of time to be optimal in isolation.
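To make the "optimal in isolation" point concrete, here is a minimal sketch of a three-kernel chain in which each kernel has a cost per parameter layout. Everything in it is an illustrative assumption rather than a measurement from the paper: the kernel names, the layout names (`NCHW`, `NHWC`), the per-kernel costs, and the fixed `TRANSPOSE_COST` charged whenever two neighbouring kernels disagree on layout. It compares picking each kernel's best layout on its own against picking layouts for the chain as a whole.

```python
# Hypothetical per-kernel costs (arbitrary units) for two parameter layouts.
# These numbers are made up for illustration only.
KERNELS = [
    ("conv_a", {"NCHW": 10, "NHWC": 12}),
    ("custom_op", {"NCHW": 30, "NHWC": 11}),
    ("conv_b", {"NCHW": 10, "NHWC": 12}),
]

# Assumed cost of converting a tensor from one layout to the other.
TRANSPOSE_COST = 8


def isolated_cost(kernels, transpose_cost):
    """Pick each kernel's cheapest layout on its own, then pay for any
    layout conversions needed between consecutive kernels."""
    total = 0
    prev_layout = None
    for _name, costs in kernels:
        layout = min(costs, key=costs.get)
        total += costs[layout]
        if prev_layout is not None and layout != prev_layout:
            total += transpose_cost
        prev_layout = layout
    return total


def global_cost(kernels, transpose_cost):
    """Choose layouts for the whole chain at once with dynamic programming.
    best[layout] is the cheapest way to finish the current kernel in that layout."""
    best = dict(kernels[0][1])
    for _name, costs in kernels[1:]:
        best = {
            layout: cost + min(
                prev_cost + (0 if prev_layout == layout else transpose_cost)
                for prev_layout, prev_cost in best.items()
            )
            for layout, cost in costs.items()
        }
    return min(best.values())


if __name__ == "__main__":
    print("isolated choice:", isolated_cost(KERNELS, TRANSPOSE_COST))
    print("global choice:  ", global_cost(KERNELS, TRANSPOSE_COST))
```

With these made-up numbers the per-kernel choice pays for two layout conversions and ends up more expensive than keeping the whole chain in a single layout, even though that layout is slightly worse for the two convolutions. Real programs are graphs rather than chains and have many more candidate layouts, so the search is correspondingly harder.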