Towards a Curatorial Agent for Heritage Institutions: Web Source Credibility Verification for Grounding Domain-Specific LLMs
Keywords: Hallucinations, BERT, BART, URL Classification, Source Credibility, Question Generation
Abstract. The hallucination problem is the main cause of the limited reliability of Large Language Models (LLMs) when used in cultural institutions such as museums and galleries. One proposed solution is to ground the LLM in real data found on the Web. However, because the cultural heritage domain requires factual accuracy, cultural institutions cannot fully rely on data obtained from the Web. To make such data suitable for the heritage use case, additional source filtering and verification must be applied. In this paper, we propose a pipeline for verifying web sources, as well as a question-generating agent designed to guide heritage experts in collecting the right sources for their needs. In our evaluation, the proposed system successfully filters web-scraped sources for a given search keyword, achieving moderate results in both classification tasks. In addition, our contributions include the curation of a custom dataset for training both models and the estimation of an optimal training and dataset configuration for the proposed ‘curatorial question generation’ task.