We report the GHOST Consistency Score (GCS; higher is better) overall and broken down by component: Object, Attribute, and Relation. All values below are taken from the paper.
| Model | Size | Overall | Object | Attribute | Relation |
|---|---|---|---|---|---|
| **Tiny MLLMs (<4B)** | | | | | |
| LLaVA-OneVision | 0.5B | 45.6 | 68.8 | 57.5 | 10.5 |
| MiniCPM-V | 2B | 44.5 | 63.9 | 49.8 | 19.8 |
| PaliGemma | 3B | 46.6 | 75.7 | 53.3 | 10.6 |
| VILA-1.5 | 3B | 53.2 | 78.7 | 62.2 | 18.7 |
| **Small–Medium MLLMs (4B–13B)** | | | | | |
| Phi-3.5-V | 4B | 55.8 | 78.3 | 52.2 | 36.9 |
| Chameleon | 7B | 41.4 | 44.7 | 40.8 | 38.8 |
| Mantis-LLaMA3 | 8B | 54.0 | 77.8 | 55.3 | 29.0 |
| VILA-1.5 | 8B | 60.3 | 84.1 | 63.6 | 33.2 |
| Eagle-X4-Plus | 8B | 64.1 | 86.2 | 63.1 | 43.1 |
| LLaVA-OneVision | 8B | 64.4 | 86.5 | 67.5 | 39.1 |
| Idefics | 9B | 39.2 | 50.5 | 47.5 | 19.5 |
| LLaVA-1.5 | 13B | 58.2 | 80.2 | 56.4 | 37.9 |
| VILA-1.5 | 13B | 64.5 | 84.3 | 68.8 | 40.2 |
| **Large MLLMs (>13B)** | | | | | |
| MiniCPM-LLaMA3 | 18B | 63.9 | 79.9 | 65.4 | 46.4 |
| CogVLM2 | 20B | 47.1 | 74.0 | 51.2 | 16.1 |
| InternVL-Chat | 26B | 61.1 | 84.2 | 61.8 | 37.5 |
| VILA-1.5 | 40B | 66.0 | 85.9 | 67.7 | 44.4 |
| LLaVA-OneVision | 72B | 68.4 | 84.0 | 68.2 | 53.1 |
| **Proprietary MLLMs** | | | | | |
| Gemini 1.5 Pro | – | 64.1 | 79.3 | 67.1 | 46.0 |
| GPT-4o | – | 69.0 | 82.2 | 68.9 | 56.0 |
To evaluate an MLLM and expose hallucinations via consistency checks, we propose the GHOST Consistency Score (GCS). Unlike image-level metrics that treat each question independently, GCS penalizes hallucinations with exponentially decaying weights, so the first hallucination in a category carries the largest penalty.
$$ \text{GCS} = 1 - \left( \sum_{i=1}^{N_{\text{hallu}}} \frac{1}{2^{i-1}} \right) \Big/ \left( \sum_{i=1}^{N_{\text{total}}} \frac{1}{2^{i-1}} \right) $$
Here, \(N_{\text{hallu}}\) is the number of hallucinations (\(\mathrm{FP}+\mathrm{FN}\)), and \(N_{\text{total}}\) is the total number of questions in the category (object, attribute, relation). The weights \(w_i = 2^{-(i-1)}\) emphasize that even a single hallucination is highly informative of misunderstanding. The overall score averages categories:
$$ \text{Overall GCS} = \tfrac{1}{3}\, (\, \text{GCS}_{\text{obj}} + \text{GCS}_{\text{attr}} + \text{GCS}_{\text{rel}}\,) $$
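The two formulas above can be sketched in a few lines of Python (function names are illustrative, not taken from the paper's code):

```python
def gcs(n_hallu: int, n_total: int) -> float:
    """GHOST Consistency Score for one category (object, attribute, or relation).

    Both the hallucination count and the total question count are accumulated
    with weights w_i = 2^{-(i-1)}, so a single hallucination already removes
    roughly half of the achievable score.
    """
    def weight_sum(n: int) -> float:
        return sum(2.0 ** -(i - 1) for i in range(1, n + 1))
    return 1.0 - weight_sum(n_hallu) / weight_sum(n_total)


def overall_gcs(gcs_obj: float, gcs_attr: float, gcs_rel: float) -> float:
    """Overall GCS: the unweighted mean of the three category scores."""
    return (gcs_obj + gcs_attr + gcs_rel) / 3.0
```

For example, with 10 questions in a category, one hallucination drops the category score to about 0.50, reflecting that even a single inconsistency is highly informative of misunderstanding.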
Takeaways. (1) Encoder quality is a first-order lever for reducing object-level hallucinations and boosting relation understanding; invest in strong visual encoders (e.g., SigLIP, MoE Vision). (2) LLM scaling improves consistency, but consistency checks with hard negatives still surface weaknesses; combine capacity with encoder quality, and evaluate with object-centric consistency.
@misc{vs2025ghost,
title={{GHOST}: Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark},
author={VS, Vibashan and Chang, Nadine and Schmalfuss, Jenny and Patel, Vishal M. and Yu, Zhiding and Alvarez, Jose M.},
year={2025},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Website template adapted from Nerfies. Images © the authors.