\begin{abstract}
Scholarly articles in mathematical fields feature mathematical
statements such as theorems, propositions, etc., as well as their proofs.
Extracting them from the PDF representation of the articles requires
understanding scientific text as well as visual and font-based
indicators. We pose this task as a multimodal classification problem,
using text, font features, and bitmap image renderings of the PDF as
distinct modalities.
In this paper we propose a multimodal machine learning approach for
extraction of theorem-like environments and proofs, based on late fusion
of features extracted by individual unimodal classifiers, taking into
account the sequential succession of blocks in the document.
For the text
modality, we pretrain a new language model on an 11~GB scientific corpus;
experiments show that it performs similarly on our task to a model
(\textsf{RoBERTa})
pretrained on 160~GB, while converging faster and requiring much less
fine-tuning data.
For font-based information, we train a 128-cell LSTM on the sequence
of font names and sizes within each block. Bitmap renderings are handled
by an \textsf{EfficientNetV2} deep network tuned to classify each image block. Finally, a
simple CRF-based approach uses the features of the multimodal
model along with information on block sequences. Experimental results
show the benefits of a multimodal approach over any single modality,
as well as major performance improvements from the CRF modeling of block
sequences.
\end{abstract}
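
As a rough formalization of the pipeline summarized above, the following
sketch may help fix ideas. All notation here is introduced for illustration
only and is an assumption rather than the paper's own: $h_i^{\mathrm{text}}$,
$h_i^{\mathrm{font}}$, and $h_i^{\mathrm{img}}$ denote the feature vectors
produced for block $i$ by the three unimodal classifiers, and concatenation
is assumed as the fusion operator, which is one common form of late fusion.
\[
  z_i = \bigl[\, h_i^{\mathrm{text}} \,;\, h_i^{\mathrm{font}} \,;\, h_i^{\mathrm{img}} \,\bigr]
\]
A linear-chain CRF over the label sequence $y = (y_1, \dots, y_n)$ of the
$n$ blocks of a document could then be written, with hypothetical per-label
weight vectors $w_y$ and a transition score matrix $T$, as
\[
  p(y \mid z) = \frac{1}{Z(z)}
  \exp\Bigl( \sum_{i=1}^{n} w_{y_i}^{\top} z_i
           + \sum_{i=2}^{n} T_{y_{i-1},\, y_i} \Bigr),
  \qquad
  Z(z) = \sum_{y'} \exp\Bigl( \sum_{i=1}^{n} w_{y'_i}^{\top} z_i
           + \sum_{i=2}^{n} T_{y'_{i-1},\, y'_i} \Bigr).
\]
Under these assumptions, the first sum scores each block from its fused
multimodal features, while the second sum captures the sequential succession
of blocks, for instance that a proof tends to follow a theorem-like
environment.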
