\begin{abstract}
The runtime and scalability of large neural networks can be significantly affected by how the operations in their dataflow graphs are placed on devices.
With increasingly complex neural network architectures and heterogeneous device characteristics, finding a reasonable placement is extremely challenging even for domain experts.
Most existing automated device placement approaches are impractical because they require a significant amount of compute and cannot generalize to new, previously unseen graphs.
To address both limitations, we propose an efficient end-to-end method based on a scalable sequential attention mechanism over a graph neural network that is transferable to new graphs.
On a diverse set of representative deep learning models, including Inception-v3, AmoebaNet, Transformer-XL, and WaveNet, our method achieves on average a 16\% improvement over human experts and a 9.2\% improvement over the prior art, with 15$\times$ faster convergence.
To further reduce the computation cost, we pre-train the policy network on a set of dataflow graphs and use a superposition network to fine-tune it on each individual graph, achieving state-of-the-art performance on large held-out graphs with over 50k nodes, such as an 8-layer GNMT.
\end{abstract}