\begin{abstract}
Scholarly articles in mathematical fields feature mathematical
statements such as theorems, propositions, etc., as well as their proofs.
Extracting them from the PDF representation of the articles requires
understanding scientific text along with visual and font-based
indicators. We pose this problem as a multimodal classification problem
using text, font features, and bitmap image renderings of the PDF as different modalities.
In this paper, we propose a multimodal machine learning approach for
the extraction of theorem-like environments and proofs, based on late fusion
of features extracted by individual unimodal classifiers, taking into
account the sequential succession of blocks in the document.
For the text
modality, we pretrain a new language model on an 11~GB scientific corpus;
experiments show performance on our task similar to that of a model
(\textsf{RoBERTa})
pretrained on 160~GB, while converging faster and requiring much less fine-tuning data.
Font-based information is captured by training a 128-cell LSTM on the sequence
of font names and sizes within each block. Bitmap renderings are handled
by an \textsf{EfficientNetv2} deep network tuned to classify each image block. Finally, a
simple CRF-based approach uses the features of the multimodal
model along with information on block sequences. Experimental results
show the benefits of using a multimodal approach over any single modality,
as well as major performance improvements from the CRF modeling of block
sequences.
\end{abstract}
