\begin{abstract}
\addchaptertocentry{\abstractname} % Add the abstract to the table of contents
A recent trend in \acl{NLP} is the exponential growth in \ac{LM} size, which prevents research groups without the necessary hardware infrastructure from taking part in the development process.
This study investigates methods for \ac{KD} to provide efficient alternatives to large-scale models. In this context, \ac{KD} means the extraction of information about language encoded in a \acl{NN} and \acl{LKB}.

To test our hypothesis that efficient architectures can gain knowledge from \ac{LM}s and extract valuable information from lexical sources, we developed two methods.
First, we present a technique to learn confident probability distributions for \acl{MLM} by weighting the predictions of multiple teacher networks. Second, we propose a method for \ac{WSD} and lexical \ac{KD} that is general enough to be adapted to many \ac{LM}s.

Our results show that \ac{KD} with multiple teachers leads to improved training convergence. When using our lexical pre-training method, \ac{LM} characteristics are not lost, leading to increased performance in \ac{NLU} tasks over the state-of-the-art while adding no parameters. Moreover, the improved semantic understanding of our model increased task performance beyond \ac{WSD} and \ac{NLU} in a real-world problem scenario (\acl{PD}).

This study suggests that sophisticated training methods and network architectures can be superior to scaling the number of trainable parameters.
On this basis, we suggest that the research community encourage the development and use of efficient models and weigh the impacts of growing \ac{LM} size equally against task performance.
\end{abstract}