Learning an Efficient Text Augmentation Strategy: A Case Study in Sentiment Analysis

Document Type: Original Article

Author

Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran

10.22133/ijwr.2024.441414.1202

Abstract

Contemporary machine learning models, such as deep neural networks, require substantial labeled datasets for effective training. In areas such as natural language processing, however, labeled data is often scarce, and this shortage can lead to overfitting. Data augmentation, which transforms data points in ways that preserve class labels while adding useful variation, has become an effective strategy for addressing this challenge. This paper introduces a text augmentation method for sentiment analysis based on deep reinforcement learning. The technique uses a Deep Q-Network (DQN) to search for an efficient augmentation strategy over four text transformations: random deletion, synonym replacement, random swapping, and random insertion. Additionally, several deep learning networks, including CNN, Bi-LSTM, Transformer, BERT, and XLNet, were evaluated for the training phase. Experimental findings show that the proposed technique achieves an accuracy of 65.1% with only 20% of the dataset and 69.3% with 40% of the dataset. Furthermore, with just 10% of the dataset, the method yields an F1-score of 62.1%, rising to 69.1% with 40% of the dataset, outperforming previous approaches. Evaluation on the SemEval dataset demonstrates that reinforcement learning can efficiently augment text datasets for improved sentiment analysis results.
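To make the action space concrete, the sketch below implements the four transformations named above in the style of EDA (Wei and Zou, 2019). It is a minimal illustration rather than the paper's implementation: the WordNet synonym source, the edit parameters, and the `q_network`/`state` placeholders in the closing comment are assumptions introduced here.

```python
import random

# Hypothetical sketch of the four EDA-style operations named in the abstract
# (random deletion, synonym replacement, random swap, random insertion),
# following Wei and Zou (2019). Synonyms are drawn from WordNet via NLTK;
# the paper's own parameters and synonym source may differ.
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')


def synonym_replacement(words, n=1):
    """Replace up to n words that have WordNet synonyms with a random synonym."""
    out = words[:]
    candidates = [i for i, w in enumerate(out) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(out[i]) for l in s.lemmas()}
        lemmas.discard(out[i])
        if lemmas:
            out[i] = random.choice(sorted(lemmas))
    return out


def random_insertion(words, n=1):
    """Insert a synonym of a randomly chosen word at a random position, n times."""
    out = words[:]
    for _ in range(n):
        syns = wordnet.synsets(random.choice(out))
        if syns:
            new_word = syns[0].lemmas()[0].name().replace("_", " ")
            out.insert(random.randrange(len(out) + 1), new_word)
    return out


def random_swap(words, n=1):
    """Swap the words at two randomly chosen positions, n times."""
    out = words[:]
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p=0.1):
    """Drop each word independently with probability p, keeping at least one."""
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]


# The four operations form the DQN agent's action space; in an epsilon-greedy
# step the agent would either explore or pick the action with the highest
# predicted Q-value (q_network and state are placeholders, not the paper's code):
#   action = random.choice(ACTIONS) if random.random() < eps \
#            else ACTIONS[int(q_network(state).argmax())]
ACTIONS = [synonym_replacement, random_insertion, random_swap, random_deletion]
```

Each operation returns a modified copy of the token list, so an agent can apply a chosen action to a sentence (e.g., `random_swap("the movie was great".split())`) without mutating the original example.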

References


  • R. Raileanu, M. Goldstein, D. Yarats, I. Kostrikov and R. Fergus, “Automatic Data Augmentation for Generalization in Deep Reinforcement Learning,” arXiv preprint arXiv:2006.12862, 2020, https://doi.org/10.48550/arXiv.2006.12862.
  • X. Wang, K. Wang and S. Lian, “A survey on face data augmentation for the training of deep neural networks,” Neural Comput Appl, vol. 32, no. 19, pp. 15503–15531, 2020, http://doi.org/10.1007/s00521-020-04748-3.
  • A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon and C. Ré, “Learning to compose domain-specific transformations for data augmentation,” arXiv preprint arXiv:1709.01643, 2017, https://doi.org/10.48550/arXiv.1709.01643.
  • C. Shorten and T. M. Khoshgoftaar, “A survey on Image Data Augmentation for Deep Learning,” J Big Data, vol. 6, no. 1, 2019, http://doi.org/10.1186/s40537-019-0197-0.
  • B. Li, Y. Hou and W. Che, “Data augmentation approaches in natural language processing: A survey,” AI Open, vol. 3, pp. 71–90, 2022, http://doi.org/10.1016/j.aiopen.2022.03.001.
  • S. Y. Feng et al., “A Survey of Data Augmentation Approaches for NLP,” arXiv preprint arXiv:2105.03075, 2021, https://doi.org/10.48550/arXiv.2105.03075.
  • S. Kobayashi, “Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 452–457, http://doi.org/10.18653/v1/N18-2072.
  • J. Wei and K. Zou, “EDA: Easy data augmentation techniques for boosting performance on text classification tasks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6382–6388, http://doi.org/10.18653/v1/D19-1670.
  • L. F. A. O. Pellicer, T. M. Ferreira and A. H. R. Costa, “Data augmentation techniques in natural language processing,” Appl Soft Comput, vol. 132, p. 109803, Jan. 2023, http://doi.org/10.1016/j.asoc.2022.109803.
  • C. Shorten, T. M. Khoshgoftaar and B. Furht, “Text Data Augmentation for Deep Learning,” J Big Data, vol. 8, no. 1, p. 101, Dec. 2021, http://doi.org/10.1186/s40537-021-00492-0.
  • R. Liu, G. Xu, C. Jia, W. Ma, L. Wang and S. Vosoughi, “Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9031–9041, http://doi.org/10.18653/v1/2020.emnlp-main.726.
  • G. Raille, S. Djambazovska and C. Musat, “Fast cross-domain data augmentation through neural sentence editing,” arXiv preprint arXiv:2003.10254, 2020, https://doi.org/10.48550/arXiv.2003.10254.

  • R. Hataya, J. Zdenek, K. Yoshizoe and H. Nakayama, “Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation,” Lecture Notes in Computer Science, vol. 12370, pp. 1–16, 2020, http://doi.org/10.1007/978-3-030-58595-2_1.
  • G. Daval-Frerot and Y. Weis, “WMD at SemEval-2020 Tasks 7 and 11: Assessing humor and propaganda using Unsupervised Data Augmentation,” in Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), co-located with COLING 2020, 2020, pp. 1865–1874, http://doi.org/10.18653/v1/2020.semeval-1.246.
  • T. Dao, A. Gu, A. J. Ratner, V. Smith, C. De Sa and C. Ré, “A kernel theory of modern data augmentation,” in 36th International Conference on Machine Learning, ICML 2019, 2019, pp. 1528–1537.
  • D. Zhang, T. Li, H. Zhang and B. Yin, “On Data Augmentation for Extreme Multi-label Classification,” arXiv preprint arXiv:2009.10778, 2020, https://doi.org/10.48550/arXiv.2009.10778.
  • X. Zuo, Y. Chen, K. Liu and J. Zhao, “KnowDis: Knowledge Enhanced Data Augmentation for Event Causality Detection via Distant Supervision,” in Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), 2020, pp. 1544–1550, http://doi.org/10.18653/v1/2020.coling-main.135.
  • X. Dai and H. Adel, “An Analysis of Simple Data Augmentation for Named Entity Recognition,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3861–3867, http://doi.org/10.18653/v1/2020.coling-main.343.
  • S. Longpre, Y. Wang and C. DuBois, “How effective is task-agnostic data augmentation for pretrained transformers?,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4401–4411, http://doi.org/10.18653/v1/2020.findings-emnlp.394.
  • C. Rastogi, N. Mofid and F.-I. Hsiao, “Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification,” arXiv preprint arXiv:2007.00875, 2020, https://doi.org/10.48550/arXiv.2007.00875.
  • B. Peng, C. Zhu, M. Zeng and J. Gao, “Data Augmentation for Spoken Language Understanding via Pretrained Models,” arXiv preprint arXiv:2004.13952, 2020, https://doi.org/10.48550/arXiv.2004.13952.
  • G. Yan, Y. Li, S. Zhang and Z. Chen, “Data Augmentation for Deep Learning of Judgment Documents,” in Intelligence Science and Big Data Engineering. Big Data and Machine Learning: 9th International Conference, IScIDE 2019, Nanjing, China, October 17–20, 2019, pp. 232–242, https://doi.org/10.1007/978-3-030-36204-1_19.
  • C. Coulombe, “Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs,” arXiv preprint arXiv:1812.04718, 2018, https://doi.org/10.48550/arXiv.1812.04718.
  • M. Regina, M. Meyer and S. Goutal, “Text Data Augmentation: Towards better detection of spear-phishing emails,” arXiv preprint arXiv:2007.02033, 2020, https://doi.org/10.48550/arXiv.2007.02033.
  • J. Min, R. T. McCoy, D. Das, E. Pitler and T. Linzen, “Syntactic data augmentation increases robustness to inference heuristics,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2339–2352, https://doi.org/10.18653/v1/2020.acl-main.212.
  • Y. Zhang, T. Ge and X. Sun, “Parallel data augmentation for formality style transfer,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3221–3228, https://doi.org/10.18653/v1/2020.acl-main.294.
  • A. Anaby-Tavor et al., “Do not have enough data? Deep learning to the rescue!,” in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, pp. 7383–7390, https://doi.org/10.1609/aaai.v34i05.6233.
  • N. Thakur, N. Reimers, J. Daxenberger and I. Gurevych, “Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 296–310, https://doi.org/10.18653/v1/2021.naacl-main.28.
  • H. Guo, Y. Mao and R. Zhang, “Augmenting Data with Mixup for Sentence Classification: An Empirical Study,” arXiv preprint arXiv:1905.08941, 2019, https://doi.org/10.48550/arXiv.1905.08941.
  • P. Yu, R. Zhang, Y. Zhao, Y. Zhang, C. Li and C. Chen, “SDA: Improving Text Generation with Self Data Augmentation,” arXiv preprint arXiv:2101.03236, 2021, https://doi.org/10.48550/arXiv.2101.03236.
  • J. Fang and P. Li, “Data Augmentation with Reinforcement Learning for Document-Level Event Coreference Resolution,” in Natural Language Processing and Chinese Computing: 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, 2020, pp. 751–763, https://doi.org/10.1007/978-3-030-60450-9_59.
  • J. Kim and K. E. Kim, “Data Augmentation for Learning to Play in Text-Based Games,” in Proc. IJCAI, 2022, pp. 3143–3149, https://doi.org/10.24963/ijcai.2022/436.
  • V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015, https://doi.org/10.1038/nature14236.
  • C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, May 1992, https://doi.org/10.1007/BF00992698.
  • “SemEval 2017 Task 4A.” [Online]. Available: https://alt.qcri.org/semeval2017/task4/
  • T. Niu and M. Bansal, “Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models,” arXiv preprint arXiv:1809.02079, 2018, https://doi.org/10.48550/arXiv.1809.02079.
  • W. Y. Wang and D. Yang, “That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2557–2563, https://doi.org/10.18653/v1/D15-1306.
  • A. W. Yu et al., “QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension,” arXiv preprint arXiv:1804.09541, Apr. 2018, https://doi.org/10.48550/arXiv.1804.09541.
  • H. Wang, J. He, X. Zhang and S. Liu, “A short text classification method based on N-gram and CNN,” Chinese Journal of Electronics, vol. 29, no. 2, pp. 248–254, 2020, https://doi.org/10.1049/cje.2020.01.001.
  • B. Jang, M. Kim, G. Harerimana, S. U. Kang and J. W. Kim, “Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism,” Applied Sciences, vol. 10, no. 17, p. 5841, 2020, https://doi.org/10.3390/app10175841.
  • S. Singh and A. Mahmood, “The NLP cookbook: modern recipes for transformer based deep learning architectures,” IEEE Access, vol. 9, pp. 68675–68702, 2021, https://doi.org/10.1109/ACCESS.2021.3077350.
  • H. Li, Y. Ma, Z. Ma and H. Zhu, “Weibo text sentiment analysis based on BERT and deep learning,” Applied Sciences, vol. 11, no. 22, p. 10774, 2021, https://doi.org/10.3390/app112210774.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, vol. 32, 2019.