ParSQuAD: Persian Question Answering Dataset based on Machine Translation of SQuAD 2.0

Document Type: Original Article

Authors

1 Department of Software Engineering, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran

2 Department of Linguistics, University of Isfahan, Isfahan, Iran

Abstract

Recent developments in Question Answering (QA) have improved state-of-the-art results, and various datasets have been released for this task. Because substantial English training datasets are available, most published work addresses English QA. Persian, by contrast, lacks such datasets, so less research has been done on Persian QA and results are difficult to compare. This paper introduces the Persian Question Answering Dataset (ParSQuAD), based on machine translation of the SQuAD 2.0 dataset. Many errors were discovered during translation; therefore, two versions of ParSQuAD were generated, depending on whether these errors were corrected manually or automatically. The result is the first large-scale QA training resource for Persian. In addition, we trained three baseline models, BERT, ALBERT, and Multilingual-BERT (mBERT), on both versions of ParSQuAD. On the test set, mBERT achieves an F1 score of 56.66% and an exact match ratio of 52.86% with the first version, and 70.84% and 67.73%, respectively, with the second version. It obtained the best results of the three models on each version of ParSQuAD.


 

Negin Abadani received her B.Sc. in computer science from the University of Isfahan in 2019. She is currently pursuing her M.Sc. in software engineering at the University of Isfahan. Her research interests are Question Answering, Natural Language Processing, Deep Learning, and Data Mining. She is a member of the BIGDATA Research Group at the University of Isfahan.

 

Jamshid Mozafari is a research assistant in Natural Language Processing and Information Retrieval at the BIGDATA Lab of the University of Isfahan. He received his B.Sc. and M.Sc. degrees in computer engineering from the University of Kurdistan and the University of Isfahan in 2016 and 2019, respectively. His interests include Question Answering, Information Retrieval, and Machine Reading Comprehension.

 

Afsaneh Fatemi received her B.S. degree in software engineering from Isfahan University of Technology in 1995, and her M.S. and Ph.D. degrees in software engineering from the University of Isfahan in 2002 and 2012, respectively. She is currently an assistant professor in the Department of Software Engineering at the University of Isfahan. Her current research interests include Big Data applications and challenges, especially in Question Answering systems and social networks. She has been a member of the Big Data Research Group of the University of Isfahan since 2016.

Mohammadali Nematbakhsh is a full professor of software engineering in the School of Computer Engineering at the University of Isfahan. He received his B.Sc. in electrical engineering from Louisiana Tech University in 1981 and his M.Sc. and Ph.D. degrees in electrical and computer engineering from the University of Arizona in 1983 and 1987, respectively. He worked for Micro Advanced Co. and Toshiba Corporation for many years before joining the University of Isfahan. He has published more than 160 research papers, several US-registered patents, and two database books that are widely used in universities. His main research interests include the intelligent Web and big data processing. He is also the head of the Big Data Research Group of the University of Isfahan.

 

Arefeh Kazemi received her B.Sc. degree in software engineering and her M.Sc. degree in artificial intelligence from the University of Isfahan, Isfahan, Iran, in 2008 and 2010, respectively. She obtained her Ph.D. degree in artificial intelligence from the University of Isfahan in 2017. She is currently an assistant professor in the Computational Linguistics branch at the University of Isfahan. Her main research interests include Natural Language Processing, Computational Linguistics, and Data Mining.