A Distant Supervised Approach for Relation Extraction in Farsi Texts

Document Type: Original Article

Authors

1 Department of Computer Engineering, Science and Research Branch, Azad University, Tehran, Iran.

2 Assistant Professor, ICT Research Institute (ITRC), Tehran, Iran

Abstract

The volume of Farsi information on the Internet has grown in recent years, but most of it takes the form of unstructured or semi-structured free text. Quick and accurate access to the knowledge contained in these texts requires information extraction methods that can generate knowledge bases. Relation extraction, a sub-task of information extraction, has received much attention in recent years. While many such systems have been developed for English and other widely studied languages, information extraction systems for Farsi have received less attention from researchers. This study presents a system for semi-automatic relation extraction that uses Persian Wikipedia articles as a reliable, semi-structured source. The system extracts relations with the help of patterns that are obtained automatically through a distant supervision approach. To apply distant supervision, the large Wikidata knowledge base is used as a source that is synchronized with Wikipedia. The results show an average precision of 76.81% across all relations, which indicates an improvement in precision over other methods for Farsi.
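The general idea of pattern-based distant supervision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the facts, sentences, relation name, and the helper functions `learn_patterns` and `extract` are all hypothetical, and English sentences stand in for Persian Wikipedia text to keep the example short. Known facts (standing in for Wikidata) are matched against sentences, the text between the two entities is collected as a candidate pattern, and the most frequent pattern is then applied to an unseen sentence.

```python
import re
from collections import Counter

# Hypothetical knowledge-base facts (subject, relation, object),
# standing in for Wikidata statements.
KB_FACTS = [
    ("Hafez", "place_of_birth", "Shiraz"),
    ("Saadi", "place_of_birth", "Shiraz"),
    ("Ferdowsi", "place_of_birth", "Tus"),
]

# Hypothetical sentences, standing in for Wikipedia article text.
SENTENCES = [
    "Hafez was born in Shiraz.",
    "Saadi was born in Shiraz.",
    "Ferdowsi was born in Tus.",
    "Hafez wrote poems about Shiraz.",  # mentions both entities but is noise
]


def learn_patterns(facts, sentences):
    """Count the text found between a known subject and object, per relation."""
    patterns = {}
    for subj, rel, obj in facts:
        for sent in sentences:
            s, o = sent.find(subj), sent.find(obj)
            # Distant supervision assumption: a sentence containing both
            # entities of a known fact is taken to express that relation.
            if s != -1 and o != -1 and s < o:
                middle = sent[s + len(subj):o].strip()
                patterns.setdefault(rel, Counter())[middle] += 1
    return patterns


def extract(sentence, relation, pattern):
    """Apply one learned pattern to a sentence and return matching triples."""
    regex = re.compile(r"(\w+) " + re.escape(pattern) + r" (\w+)")
    return [(m.group(1), relation, m.group(2)) for m in regex.finditer(sentence)]


learned = learn_patterns(KB_FACTS, SENTENCES)
# Keep only the most frequent pattern to filter out noisy matches.
best_pattern, _count = learned["place_of_birth"].most_common(1)[0]
triples = extract("Rumi was born in Balkh.", "place_of_birth", best_pattern)
print(best_pattern, triples)
```

Keeping only the most frequent pattern is one simple way to discard noisy matches such as the fourth sentence. A real system would need tokenization, entity recognition, and pattern scoring suited to Farsi.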



Shireen Atarod received her bachelor’s degree in information technology (IT) engineering from Hamedan University of Technology (HUT), Hamedan, Iran, in 2012. She received her master’s degree in e-commerce from the Science and Research Branch of Islamic Azad University (SRBIAU) in 2018. Her research interests include relation extraction, supervised and semi-supervised machine learning, and text mining.

Alireza Yari received his B.Sc. degree in control system engineering in 1993 from the University of Tehran, Iran, and his M.Sc. and Ph.D. degrees in system engineering in 2000 from Kitami Institute of Technology, Japan. He is currently conducting research in the Information Technology research faculty of the Iran Telecom Research Center (ITRC). His research interests include web processing and cyber-linguistics applications, such as web search engines.