GHOST

Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark

Johns Hopkins University¹, NVIDIA², University of Stuttgart³

Introduction

GHOST Benchmark Overview
  • GHOST is an object-centric hallucination benchmark for evaluating Multimodal Large Language Models (MLLMs) on fine-grained object-level understanding.
  • We evaluate individual objects via compositional triplets: (object type, attribute, relation).
  • We introduce Consistency Checks (CC) across positive (true) and hard negative (false) statements for the same object and define the GHOST Consistency Score (GCS) to quantify hallucination tendencies.
  • GHOST contains 765 images and 3,173 compositional triplets, resulting in 38,072 questions, and we evaluate 20 state-of-the-art MLLMs.
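To make the consistency-check idea concrete, here is a minimal sketch of how a compositional triplet could expand into paired positive (true) and hard-negative (false) yes/no questions. The phrasings and the negative substitutions are illustrative assumptions, not the benchmark's exact prompts or generation pipeline.

```python
def consistency_pairs(obj: str, attribute: str, relation: str,
                      neg_attribute: str, neg_relation: str) -> dict:
    """Expand one (object, attribute, relation) triplet into positive /
    hard-negative question pairs, one pair per component.

    The negative attribute/relation are hypothetical hard negatives
    supplied by the caller (e.g., a plausible but false alternative).
    """
    return {
        "object": (
            f"Is there a {obj} in the image?",        # positive (answer: yes)
            f"Is there no {obj} in the image?",       # hard negative (answer: no)
        ),
        "attribute": (
            f"Is the {obj} {attribute}?",
            f"Is the {obj} {neg_attribute}?",
        ),
        "relation": (
            f"Is the {obj} {relation}?",
            f"Is the {obj} {neg_relation}?",
        ),
    }

# Example triplet: (cat, black, on the sofa), with hypothetical negatives.
pairs = consistency_pairs("cat", "black", "on the sofa",
                          neg_attribute="white",
                          neg_relation="under the table")
```

A model passes the consistency check for a triplet only if it answers every positive question affirmatively and every hard negative negatively; agreeing with both a statement and its negation reveals a hallucination that single-question accuracy would miss.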

GHOST Leaderboard

We report GCS (higher is better) overall and broken down by component: Object, Attribute, and Relation. Values are taken from the paper.

Model                Size   Overall  Object  Attribute  Relation

Tiny MLLMs (<4B)
  LLaVA-OneVision    0.5B   45.6     68.8    57.5       10.5
  MiniCPM-V          2B     44.5     63.9    49.8       19.8
  PaliGemma          3B     46.6     75.7    53.3       10.6
  VILA-1.5           3B     53.2     78.7    62.2       18.7
Small–Medium MLLMs (4B–13B)
  Phi-3.5-V          4B     55.8     78.3    52.2       36.9
  Chameleon          7B     41.4     44.7    40.8       38.8
  Mantis-LLaMA3      8B     54.0     77.8    55.3       29.0
  VILA-1.5           8B     60.3     84.1    63.6       33.2
  Eagle-X4-Plus      8B     64.1     86.2    63.1       43.1
  LLaVA-OneVision    8B     64.4     86.5    67.5       39.1
  Idefics            9B     39.2     50.5    47.5       19.5
  LLaVA-1.5          13B    58.2     80.2    56.4       37.9
  VILA-1.5           13B    64.5     84.3    68.8       40.2
Large MLLMs (>13B)
  MiniCPM-LLaMA3     18B    63.9     79.9    65.4       46.4
  CogVLM2            20B    47.1     74.0    51.2       16.1
  InternVL-Chat      26B    61.1     84.2    61.8       37.5
  VILA-1.5           40B    66.0     85.9    67.7       44.4
  LLaVA-OneVision    72B    68.4     84.0    68.2       53.1
Proprietary MLLMs
  Gemini 1.5 Pro     –      64.1     79.3    67.1       46.0
  GPT-4o             –      69.0     82.2    68.9       56.0

GHOST Consistency Score (GCS)

Consistency vs. Accuracy

To evaluate an MLLM and reveal hallucinations via consistency checks, we propose the GHOST Consistency Score (GCS). Unlike image-level metrics that treat questions independently, GCS penalizes hallucinations according to their frequency using exponentially decaying weights.

$$ \text{GCS} = 1 - \left( \sum_{i=1}^{N_{\text{hallu}}} \frac{1}{2^{i-1}} \right) \Big/ \left( \sum_{i=1}^{N_{\text{total}}} \frac{1}{2^{i-1}} \right) $$

Here, \(N_{\text{hallu}}\) is the number of hallucinations (\(\mathrm{FP}+\mathrm{FN}\)), and \(N_{\text{total}}\) is the total number of questions in the category (object, attribute, relation). The weights \(w_i = 2^{-(i-1)}\) emphasize that even a single hallucination is highly informative of misunderstanding. The overall score averages categories:

$$ \text{Overall GCS} = \tfrac{1}{3}\, (\, \text{GCS}_{\text{obj}} + \text{GCS}_{\text{attr}} + \text{GCS}_{\text{rel}}\,) $$
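The two formulas above can be sketched directly from the hallucination and question counts. This is a minimal reimplementation from the stated definitions, not the authors' released code; the function names are my own.

```python
def gcs(num_hallucinations: int, num_total: int) -> float:
    """GHOST Consistency Score for one category (object, attribute, or relation).

    GCS = 1 - (sum of weights over hallucinated answers) /
              (sum of weights over all questions),
    with exponentially decaying weights w_i = 2^-(i-1), so the first
    hallucination alone removes roughly half of the achievable score.
    """
    if num_total == 0:
        return 1.0  # no questions, nothing to hallucinate on
    hallu_mass = sum(2.0 ** -(i - 1) for i in range(1, num_hallucinations + 1))
    total_mass = sum(2.0 ** -(i - 1) for i in range(1, num_total + 1))
    return 1.0 - hallu_mass / total_mass

def overall_gcs(gcs_obj: float, gcs_attr: float, gcs_rel: float) -> float:
    """Overall GCS: unweighted mean of the three per-category scores."""
    return (gcs_obj + gcs_attr + gcs_rel) / 3.0
```

For example, with 10 questions in a category, a single hallucination already drops the category score to about 0.5, reflecting the benchmark's view that even one inconsistency strongly signals misunderstanding.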

Vision Encoder and LLM Size Effects

Vision encoder effect
  • Stronger encoders reduce hallucinations across all categories, with the largest gains on relations.
  • Quality beats quantity: MoE/SigLIP-style encoders (e.g., Eagle-X4) outperform models trained on far more data (e.g., CogVLM2).
  • Practicality: Better encoders improve reliability for resource-constrained and on-device deployments.
LLM size effect
  • Scaling helps: Larger LLMs consistently achieve higher GCS, especially on relations.
  • But: When hard negatives are added, all sizes show residual inconsistencies—highlighting limits of pure scaling.
  • Guidance: Prefer a balanced recipe—adequate LLM capacity paired with a strong vision encoder.

Takeaways. (1) Encoder quality is a first-order lever for reducing object-level hallucinations and boosting relation understanding; invest in strong visual encoders (e.g., SigLIP, MoE Vision). (2) LLM scaling improves consistency, but consistency checks with hard negatives still surface weaknesses—so combine capacity with encoder quality and evaluate with object-centric consistency.

BibTeX

@misc{vs2025ghost,
  title={{GHOST}: Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark},
  author={VS, Vibashan and Chang, Nadine and Schmalfuss, Jenny and Patel, Vishal M. and Yu, Zhiding and Alvarez, Jose M.},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Website template adapted from Nerfies. Images © the authors.