International Journal of Web Research

Consistent Responses to Paraphrased Questions as Evidence Against Hallucination: A Study on Hallucinations in LLMs

Document Type: Original Article

Authors
Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran.
Abstract
The increasing adoption of large language models (LLMs) has intensified concerns about hallucinations: outputs that are syntactically fluent but factually incorrect. In this paper, we propose a method for detecting such hallucinations by evaluating the consistency of a model's responses to paraphrased versions of the same question. The underlying assumption is that if a model produces consistent answers across different paraphrases, the output is more likely to be accurate. To test this method, we developed a system that generates multiple paraphrases of each question and analyzes the consistency of the corresponding responses. Experiments were conducted with two LLMs, GPT-4o and LLaMA 3 70B Chat, on both Persian and English datasets. The method achieved an average accuracy of 99.5% for GPT-4o and 98% for LLaMA 3 70B, indicating that our approach effectively identifies hallucination-free outputs across languages. Furthermore, by automating the consistency evaluation with an instruction-tuned language model, we enabled scalable and unbiased detection of semantic agreement across paraphrased responses.
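The pipeline the abstract describes (paraphrase the question, collect an answer per paraphrase, then have an instruction-tuned model judge whether the answers agree) can be sketched in a few lines. The code below is a minimal illustration, not the authors' released implementation: the prompts, function names, paraphrase count, and the use of the OpenAI chat API with `gpt-4o` are all assumptions made for the example.

```python
# Minimal sketch of paraphrase-consistency hallucination detection.
# Prompts, function names, and model choice are illustrative assumptions,
# not the authors' actual code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str, model: str = "gpt-4o") -> str:
    """Send a single-turn prompt to the model and return its text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding for the judge and paraphraser
    )
    return resp.choices[0].message.content.strip()


def paraphrase(question: str, n: int = 3) -> list[str]:
    """Generate n meaning-preserving paraphrases of the question, one per line."""
    text = ask(
        f"Rewrite the following question in {n} different ways, "
        f"one per line, preserving its meaning:\n{question}"
    )
    return [line.strip() for line in text.splitlines() if line.strip()][:n]


def consistent(answers: list[str]) -> bool:
    """Use an instruction-tuned model as the judge of semantic agreement."""
    joined = "\n".join(f"- {a}" for a in answers)
    verdict = ask(
        "Do the following answers all convey the same fact? "
        f"Reply YES or NO.\n{joined}"
    )
    return verdict.upper().startswith("YES")


def likely_hallucination_free(question: str) -> bool:
    """True if answers across the original question and its paraphrases agree."""
    variants = [question] + paraphrase(question)
    answers = [ask(v) for v in variants]
    return consistent(answers)


if __name__ == "__main__":
    print(likely_hallucination_free("What is the capital of Australia?"))
```

Using a model as the consistency judge, rather than string overlap, lets semantically equivalent but differently worded answers count as agreeing, which is the property the abstract's automated evaluation relies on.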