Is the answer in the text? Challenging ChatGPT with evidence retrieval from instructive text
- Generative language models have recently shown remarkable success in generating answers to questions in a given textual context. However, these answers may suffer from hallucination, wrongly cite evidence, and spread misleading information. In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system. We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts. We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark of a set of WikiHow articles exhaustively annotated with evidence sentences to questions that comes with a special challenge: All questions are about the article’s topic, but not all can be answered using the provided context. We interestingly find that when using a regular question-answering prompt, ChatGPT neglects to detect the unanswerable cases. When provided with a few examples, it learns to better judge whether a text provides answer evidence or not. Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.


| Author: | Sophie Henning, Talita Anthonio, Wei Zhou, Heike Adel, Mohsen Mesgar, Annemarie Friedrich |
|---|---|
| URN: | urn:nbn:de:bvb:384-opus4-1266985 |
| Frontdoor URL | https://opus.bibliothek.uni-augsburg.de/opus4/126698 |
| URL: | https://aclanthology.org/2023.findings-emnlp.949/ |
| ISBN: | 979-8-89176-061-5 |
| Parent Title (English): | Findings of the Association for Computational Linguistics: EMNLP 2023, 6–10 December 2023, Singapore |
| Publisher: | Association for Computational Linguistics (ACL) |
| Place of publication: | Stroudsburg, PA |
| Editor: | Houda Bouamor, Juan Pino, Kalika Bali |
| Type: | Conference Proceeding |
| Language: | English |
| Date of Publication (online): | 2025/12/02 |
| Year of first Publication: | 2023 |
| Publishing Institution: | Universität Augsburg |
| Release Date: | 2025/12/03 |
| First Page: | 14229 |
| Last Page: | 14241 |
| Institutes: | Fakultät für Angewandte Informatik |
| Fakultät für Angewandte Informatik / Institut für Informatik | |
| Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Computerlinguistik | |
| Dewey Decimal Classification: | 0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science |
| Licence: | CC-BY 4.0: Creative Commons: Attribution |



