\begin{abstract}
    \addchaptertocentry{\abstractname} % Add the abstract to the table of contents
    A recent trend in \acl{NLP} is the exponential growth in \ac{LM} size, which prevents research groups without the necessary hardware infrastructure from taking part in the development process.
    This study investigates methods for \ac{KD} to provide efficient alternatives to large-scale models. In this context, \ac{KD} means extracting the information about language that is encoded in a \acl{NN} and a \acl{LKB}.

    To test our hypothesis that efficient architectures can gain knowledge from \ac{LM}s and extract valuable information from lexical sources, we developed two methods.
    First, we present a technique for learning confident probability distributions for \acl{MLM} by weighting the predictions of multiple teacher networks. Second, we propose a method for \ac{WSD} and lexical \ac{KD} that is general enough to be adapted to many \ac{LM}s.

    Our results show that \ac{KD} with multiple teachers leads to improved training convergence. With our lexical pre-training method, \ac{LM} characteristics are not lost, which increases performance in \ac{NLU} tasks over the state of the art while adding no parameters. Moreover, the improved semantic understanding of our model increased task performance beyond \ac{WSD} and \ac{NLU} in a real-world problem scenario (\acl{PD}).

    This study suggests that sophisticated training methods and network architectures can be superior to scaling the number of trainable parameters.
    On this basis, we recommend that the research community encourage the development and use of efficient models and weigh the impacts of growing \ac{LM} size equally against task performance.
\end{abstract}