Fast Vector Broadcasting in Java, CPU and CUDA (2018)

Broadcasting, however, is more interesting: it is not built into Neanderthal, so we can learn a lot by trying to write an implementation that is on par with the optimized one that comes with Nd4j.

Except for the largest dimension, Nd4j’s GPU broadcasting is slower not only than Nd4j’s CPU implementation, and not only than a less naive Neanderthal broadcasting on the CPU; it is slower than even the most naive broadcasting implementation on the CPU.

A less naive Neanderthal broadcasting on the GPU

As I mentioned, on top of the summing synchronization handicap that Nd4j has, the naive version carries a handicap of its own: creating an array and populating it with ones.
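To make the cost of the "array of ones" concrete, here is a minimal sketch in plain Java that runs on the CPU only. It is not Neanderthal or Nd4j code; the class `BroadcastSketch` and both method names are hypothetical, and the row-major `double[]` layout is an assumption made for illustration. The first method adds a vector to every row directly. The second gets the same result as a rank-1 update with a freshly allocated vector of ones, which is one way such an array can end up on the hot path.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from Neanderthal or Nd4j.
final class BroadcastSketch {

    // Adds x to every row of a row-major (rows x cols) matrix, in place.
    static void addRowVector(double[] a, int rows, int cols, double[] x) {
        if (a.length != rows * cols || x.length != cols) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int i = 0; i < rows; i++) {
            int offset = i * cols;
            for (int j = 0; j < cols; j++) {
                a[offset + j] += x[j];
            }
        }
    }

    // Same result expressed as a rank-1 update: a += ones * x^T.
    // The ones vector is allocated and filled on every call, which is
    // the extra work the direct version above avoids.
    static void addRowVectorViaOnes(double[] a, int rows, int cols, double[] x) {
        if (a.length != rows * cols || x.length != cols) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[] ones = new double[rows];
        Arrays.fill(ones, 1.0);
        for (int i = 0; i < rows; i++) {
            int offset = i * cols;
            for (int j = 0; j < cols; j++) {
                a[offset + j] += ones[i] * x[j];
            }
        }
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6};   // 2 x 3 matrix, row-major
        double[] x = {10, 20, 30};
        addRowVector(a, 2, 3, x);
        System.out.println(Arrays.toString(a));
    }
}
```

Both methods produce the same matrix. The second one pays for an allocation and a fill before any useful arithmetic starts; a less naive version would create the ones once and reuse them, or skip them entirely as the first method does.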

Source: dragan.rocks