Beyond data and model parallelism for deep neural networks
FlexFlow encompasses both of these in its sample (data parallelism), and parameter (model parallelism) dimensions, and also adds an operator dimension (more model parallelism) describing how operators within a DNN should be parallelised, and an attribute dimension with defines how different attributes within a sample should be partitioned (e.g. height and width of an image). The execution simulator takes an operator graph, a device topology, and a parallelization strategy as inputs, ad predicts the execution time. The following charts show the training throughput for the best strategies found by FlexFlow, as compared to vanilla data parallelism or expert-defined strategies:
The strategies found by FlexFlow reduce per-iteration data transfers by 2-5.5x compared to other parallelisation strategies, and also reduce task computation time.