Element       Training set                   Test set
              English   Italian  Percent.    English   Italian  Percent.
Paragraphs     18,896    18,506     97.9%      2,067     2,010     97.2%
Questions      87,599    54,159     61.8%     10,570     7,609     72.0%
Answers        87,599    54,159     61.8%     34,726    21,489     61.9%

Table 1: The quantities of the elements of the final dataset obtained by translating the SQuAD dataset, with the percentage of material w.r.t. the original dataset. The Italian test set was derived from the English development set, since the English test set is not publicly available.

      DrQA-IT   BERT-IT
EM       56.1     64.96
F1       65.9     75.95

Table 2: Results of BERT-IT over the SQuAD-IT dataset.

language exists, so that the language model underlying BERT can be acquired over any text collection independently of the input language. As a consequence, a pre-trained model acquired over documents written in more than one hundred languages exists. It will be applied in the next section to train and evaluate such a QA model over a dataset of examples in Italian.

3 Experimental Evaluation

In order to assess the applicability of the BERT architecture to the targeted QA task, a multilingual pre-trained model has been downloaded^2: in particular, this model has been acquired over documents written in one hundred languages; it is composed of 12 layers of Transformers and associates each input token with a word embedding of 768 dimensions. For consistency with (Devlin et al., 2019), 5 epochs have been considered to fine-tune the model.

We trained the architecture over SQuAD-IT^3, a dataset made available by (Croce et al., 2019). This dataset includes more than 50,000 question/paragraph pairs obtained by automatically translating the original SQuAD dataset. The details about the number of sentences are reported in Table 1, where a comparison with the original SQuAD in English is reported.

^2 models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
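The element counts in Table 1 can be computed directly from the dataset file, assuming SQuAD-IT follows the SQuAD v1.1 JSON schema (articles containing paragraphs, each with a list of question/answer pairs); the inline example below is a hypothetical toy record in that format, not taken from the actual corpus.

```python
# Minimal sketch of counting dataset elements as in Table 1,
# assuming the SQuAD v1.1 JSON layout: data -> paragraphs -> qas -> answers.
sample = {
    "data": [{
        "title": "Roma",  # hypothetical toy article, not from SQuAD-IT
        "paragraphs": [{
            "context": "Roma è la capitale d'Italia.",
            "qas": [{
                "id": "q1",
                "question": "Qual è la capitale d'Italia?",
                "answers": [{"text": "Roma", "answer_start": 0}],
            }],
        }],
    }]
}

def count_elements(dataset):
    """Return (paragraphs, questions, answers) counts for a SQuAD-style dict."""
    n_par = n_q = n_a = 0
    for article in dataset["data"]:
        for par in article["paragraphs"]:
            n_par += 1
            for qa in par["qas"]:
                n_q += 1
                n_a += len(qa["answers"])
    return n_par, n_q, n_a

print(count_elements(sample))  # -> (1, 1, 1)
```

Note that the training set has one answer per question, while the test set provides multiple gold answers per question, which is why the answer counts in Table 1 exceed the question counts there.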
The parameters of the neural network were set equal to those of the original work, including the word embedding resource. Two evaluation metrics are used: exact string match (EM) and the F1 score, which measures the weighted average of precision and recall at the token level. EM is the stricter measure, computed as the percentage of answers perfectly retrieved by the system, i.e. the text extracted by the span produced by the system is exactly the same as the gold standard. The adopted token-based F1 score relaxes this constraint by measuring the overlap (the number of shared tokens) between the provided answers and the gold standard.

Performances are reported in Table 2, together with the results achieved by a variant of the DrQA system (Chen et al., 2017) evaluated against the same SQuAD-IT dataset, as reported in (Croce et al., 2019). The improvements are impressive, as both EM and F1 are improved by more than 10%. These results are in line with the impact of BERT on the original English dataset. In the final version of this paper we will provide an in-depth comparison between DrQA and BERT.
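The two metrics described above can be sketched as follows. This is a simplified illustration: the official SQuAD evaluation script additionally normalizes answers (lowercasing, stripping punctuation and articles) before comparison, which is omitted here.

```python
from collections import Counter

def exact_match(prediction, gold):
    """EM: 1 if the predicted span equals the gold answer exactly, else 0."""
    return int(prediction.strip() == gold.strip())

def f1_score(prediction, gold):
    """Token-level F1: harmonic mean of precision and recall
    computed over the tokens shared by prediction and gold answer."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A partial span scores 0 on EM but still earns partial credit on F1.
print(exact_match("la capitale", "la capitale d'Italia"))          # -> 0
print(round(f1_score("la capitale", "la capitale d'Italia"), 2))   # -> 0.8
```

With multiple gold answers per question, as in the SQuAD test set, each metric is taken as the maximum over the available gold answers.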
4 Conclusions

This paper explores the application of Bidirectional Encoder Representations within the QA task