We report the GHOST Consistency Score (GCS; higher is better) overall and broken down by component: Object, Attribute, and Relation. All values below are taken from the paper.
| Model | Size | Overall | Object | Attribute | Relation |
|---|---|---|---|---|---|
| **Tiny MLLMs (<4B)** | | | | | |
| LLaVA-OneVision | 0.5B | 45.6 | 68.8 | 57.5 | 10.5 |
| MiniCPM-V | 2B | 44.5 | 63.9 | 49.8 | 19.8 |
| PaliGemma | 3B | 46.6 | 75.7 | 53.3 | 10.6 |
| VILA-1.5 | 3B | 53.2 | 78.7 | 62.2 | 18.7 |
| **Small–Medium MLLMs (4B–13B)** | | | | | |
| Phi-3.5-V | 4B | 55.8 | 78.3 | 52.2 | 36.9 |
| Chameleon | 7B | 41.4 | 44.7 | 40.8 | 38.8 |
| Mantis-LLaMA3 | 8B | 54.0 | 77.8 | 55.3 | 29.0 |
| VILA-1.5 | 8B | 60.3 | 84.1 | 63.6 | 33.2 |
| Eagle-X4-Plus | 8B | 64.1 | 86.2 | 63.1 | 43.1 |
| LLaVA-OneVision | 8B | 64.4 | 86.5 | 67.5 | 39.1 |
| Idefics | 9B | 39.2 | 50.5 | 47.5 | 19.5 |
| LLaVA-1.5 | 13B | 58.2 | 80.2 | 56.4 | 37.9 |
| VILA-1.5 | 13B | 64.5 | 84.3 | 68.8 | 40.2 |
| **Large MLLMs (>13B)** | | | | | |
| MiniCPM-LLaMA3 | 18B | 63.9 | 79.9 | 65.4 | 46.4 |
| CogVLM2 | 20B | 47.1 | 74.0 | 51.2 | 16.1 |
| InternVL-Chat | 26B | 61.1 | 84.2 | 61.8 | 37.5 |
| VILA-1.5 | 40B | 66.0 | 85.9 | 67.7 | 44.4 |
| LLaVA-OneVision | 72B | 68.4 | 84.0 | 68.2 | 53.1 |
| **Proprietary MLLMs** | | | | | |
| Gemini 1.5 Pro | – | 64.1 | 79.3 | 67.1 | 46.0 |
| GPT-4o | – | 69.0 | 82.2 | 68.9 | 56.0 |
To evaluate an MLLM and expose hallucinations via consistency checks, we propose the GHOST Consistency Score (GCS). Unlike image-level metrics that treat each question independently, GCS penalizes hallucinations with exponentially decaying weights, so the first hallucination in a category carries the largest penalty.
$$ \text{GCS} = 1 - \left( \sum_{i=1}^{N_{\text{hallu}}} \frac{1}{2^{i-1}} \right) \Big/ \left( \sum_{i=1}^{N_{\text{total}}} \frac{1}{2^{i-1}} \right) $$
Here, \(N_{\text{hallu}}\) is the number of hallucinations (\(\mathrm{FP}+\mathrm{FN}\)), and \(N_{\text{total}}\) is the total number of questions in the category (object, attribute, relation). The weights \(w_i = 2^{-(i-1)}\) emphasize that even a single hallucination is highly informative of misunderstanding. The overall score averages categories:
$$ \text{Overall GCS} = \tfrac{1}{3}\, (\, \text{GCS}_{\text{obj}} + \text{GCS}_{\text{attr}} + \text{GCS}_{\text{rel}}\,) $$
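The two formulas above can be sketched in a few lines of Python (function names are illustrative, not taken from the paper's code):

```python
def gcs(n_hallu: int, n_total: int) -> float:
    """GHOST Consistency Score for one category (object, attribute, or relation).

    Both the hallucination count and the total question count are accumulated
    with weights w_i = 2^{-(i-1)}, so a single hallucination already removes
    roughly half of the achievable score.
    """
    def weight_sum(n: int) -> float:
        return sum(2.0 ** -(i - 1) for i in range(1, n + 1))
    return 1.0 - weight_sum(n_hallu) / weight_sum(n_total)


def overall_gcs(gcs_obj: float, gcs_attr: float, gcs_rel: float) -> float:
    """Overall GCS: the unweighted mean of the three category scores."""
    return (gcs_obj + gcs_attr + gcs_rel) / 3.0
```

For example, with 10 questions in a category, one hallucination drops the category score to about 0.50, reflecting that even a single inconsistency is highly informative of misunderstanding.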
Takeaways. (1) Encoder quality is a first-order lever for reducing object-level hallucinations and boosting relation understanding; invest in strong visual encoders (e.g., SigLIP, MoE Vision). (2) LLM scaling improves consistency, but consistency checks with hard negatives still surface weaknesses; combine capacity with encoder quality, and evaluate with object-centric consistency.
@misc{vs2025ghost,
title={{GHOST}: Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark},
author={VS, Vibashan and Chang, Nadine and Schmalfuss, Jenny and Patel, Vishal M. and Yu, Zhiding and Alvarez, Jose M.},
year={2025},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Website template adapted from Nerfies. Images © the authors.