SBU-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation

Document Type: Original Article

Authors

1 Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran

2 University of Tehran, Tehran, Iran

Abstract

Word Sense Disambiguation (WSD) is a long-standing task in Natural Language Processing (NLP) that aims to automatically identify the most relevant meaning of a word in a given context. Standard WSD test collections are an important prerequisite for developing and evaluating WSD systems in a given language. Although many WSD test collections have been developed for a variety of languages, no standard All-words WSD benchmark is available for Persian. In this paper, we address this gap by introducing SBU-WSD-Corpus, the first standard test set for the Persian All-words WSD task. SBU-WSD-Corpus is manually annotated with senses from the Persian WordNet (FarsNet) sense inventory. Three annotators performed the annotation using SAMP, a sense annotation tool based on the FarsNet lexical graph. SBU-WSD-Corpus consists of 19 Persian documents from different domains, such as Sports, Science, and Arts. It includes 5892 content words of Persian running text and 3371 manually sense-annotated words (2073 nouns, 566 verbs, 610 adjectives, and 122 adverbs). To provide baselines for future studies on the Persian All-words WSD task, we evaluate several WSD models on SBU-WSD-Corpus.

Hossein Rouhizadeh is a Ph.D. student at the University of Geneva. His research interests include using machine learning and natural language processing methods to analyse texts in the biomedical domain.

Dr. Mehrnoush Shamsfard has been with Shahid Beheshti University since 2004. She is currently an associate professor in the Faculty of Computer Science and Engineering and the head of the faculty's NLP Research Laboratory. Her main fields of interest are natural language processing, knowledge and ontology engineering, text mining, and the semantic and intelligent web.

Dr. Vahide Tajalli holds a Ph.D. in linguistics from the University of Tehran and has eight years of experience in computational linguistics. She collaborates with the NLP Research Laboratory of Shahid Beheshti University.