MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

MSD-Score evaluates image captions without references by modeling local visual evidence and token-level textual claims as distributions, then combining fine-grained discrepancy with global image-text similarity.

Shichao Kan1 Xuyang Zhang1 Haojie Zhang1 Zhe Zhu1 Yigang Cen2 Yixiong Liang1 Lianlei Shan3 Linna Zhang4 Zhe Qu1 Jiazhi Xia1,*

1Central South University 2Beijing Jiaotong University 3University of Chinese Academy of Sciences 4Guizhou University

*Corresponding author

  • CapArena: 57.6 caption-level agreement
  • DocENT: 64.8 overall accuracy
  • COCO-CF Easy: 63.89 pairwise accuracy
  • Pascal-50S: 86.9 mean pairwise accuracy

Abstract

Reference-free caption evaluation with explicit local verification.

Evaluating captions without references is difficult because global embedding similarity often misses hallucinated objects, missing attributes, or incorrect relations. MSD-Score models image patches and text tokens as von Mises-Fisher mixtures on the unit hypersphere. A weighted bi-directional KL divergence captures complementary coverage and support failures, while Soft-MSD adaptively combines this local discrepancy with global similarity. The result is a deterministic, decomposable, and reproducible signal for faithful offline caption evaluation.

What It Adds

MSD-Score checks whether image regions are covered and whether caption tokens are visually supported.

  • Local structure matters. Patch-token mismatch is the core source of hallucination and omission failures.
  • Bi-directional divergence is diagnostic. Coverage and support are measured separately instead of being blurred into one pooled similarity score.
  • Soft-MSD stays practical. Local verification is fused with global similarity under uncertainty-aware weighting.

Motivation

Global similarity can hide local grounding errors.

The central motivation of MSD-Score is that two captions can be globally similar to an image while differing in whether their token-level claims are visually grounded.

  • Failure mode 1: coverage. Image content can be omitted even when the global image-text similarity remains high.
  • Failure mode 2: support. Hallucinated words may be diluted by mean pooling, so an incorrect caption can score nearly as high as a correct one.
  • MSD view. Treating local embeddings as distributions gives a direct way to identify unsupported text and missing visual evidence.
Motivation figure showing why global pooled similarity misses coverage and support failures.

Method

From point similarity to distributional alignment.

MSD-Score preserves local image-text structure and evaluates whether image regions and caption tokens agree at multiple scales.

01

Local representations

Extract normalized image patch embeddings and caption token embeddings instead of collapsing both modalities into one vector.
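As a concrete sketch, the local representations are just L2-normalized patch and token embeddings. The snippet below uses random vectors as stand-ins for the output of a CLIP-style encoder; the 196-patch grid and 512-dimensional width are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project rows of x onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
# Placeholders for encoder outputs: one row per image patch / caption token.
patch_emb = l2_normalize(rng.normal(size=(196, 512)))  # image patch embeddings
token_emb = l2_normalize(rng.normal(size=(12, 512)))   # caption token embeddings
```

Keeping both sets of local embeddings, rather than mean-pooling each modality into a single vector, is what makes the later coverage and support checks possible.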

02

vMF mixtures

Fit fixed-kappa hyperspherical mixtures to represent local semantic modes in high-dimensional, few-token regimes.
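One way to fit a fixed-kappa vMF mixture is expectation-maximization on the sphere: with a shared concentration, the vMF normalizing constant is identical across components and cancels in the responsibilities, leaving a numerically simple update. The sketch below, with farthest-point initialization and an illustrative choice of K, kappa, and iteration budget, is an assumed fitting procedure rather than the paper's exact recipe.

```python
import numpy as np

def fit_vmf_mixture(X, K=3, kappa=30.0, iters=50, seed=0):
    """EM for a von Mises-Fisher mixture with a shared, fixed kappa.
    X: (n, d) rows of unit vectors. The vMF normalizer is constant across
    components, so it drops out of the E-step soft assignments."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialization keeps the initial mean directions spread out.
    idx = [int(rng.integers(len(X)))]
    for _ in range(K - 1):
        dissim = 1.0 - (X @ X[idx].T).max(axis=1)
        idx.append(int(np.argmax(dissim)))
    mu, pi = X[idx].copy(), np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: numerically stabilized responsibilities.
        logits = kappa * X @ mu.T + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: mixture weights and renormalized mean directions.
        pi = r.mean(axis=0)
        m = r.T @ X
        mu = m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
    return mu, pi
```

Fixing kappa rather than estimating it per component sidesteps the ill-conditioned concentration update in high dimensions with few tokens, which is exactly the regime caption evaluation operates in.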

03

Bi-directional KL and Soft-MSD

Use coverage and support divergences, then fuse local discrepancy with global similarity under uncertainty-aware weighting.
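Mixture-to-mixture KL has no closed form, so one possibility (an assumption here, not necessarily the paper's estimator) is a plug-in Monte Carlo estimate that evaluates both unnormalized mixture densities at the local embeddings themselves; with a shared kappa the vMF constant cancels in the log ratio. The fixed-weight `soft_msd` below is a simplified stand-in for the paper's uncertainty-aware weighting, with hypothetical constants `alpha` and `beta`.

```python
import numpy as np

def log_mix_unnorm(X, mu, pi, kappa):
    """Log of the unnormalized vMF-mixture density at unit vectors X.
    Under a shared kappa the per-component vMF constant is identical,
    so it is dropped: it cancels in any log-ratio of two such mixtures."""
    logits = kappa * X @ mu.T + np.log(pi)
    m = logits.max(axis=1)
    return m + np.log(np.exp(logits - m[:, None]).sum(axis=1))

def kl_plugin(Xp, mu_p, pi_p, mu_q, pi_q, kappa):
    """Plug-in Monte Carlo estimate of D(P || Q), treating rows of Xp as
    samples from P. Coverage: Xp = image patches, Q = text mixture.
    Support: Xp = caption tokens, Q = image mixture (roles swapped)."""
    return float(np.mean(log_mix_unnorm(Xp, mu_p, pi_p, kappa)
                         - log_mix_unnorm(Xp, mu_q, pi_q, kappa)))

def soft_msd(coverage_div, support_div, global_sim, alpha=0.5, beta=0.1):
    """Fixed-weight stand-in for uncertainty-aware fusion: subtract the
    weighted bi-directional divergence from global similarity.
    alpha and beta are illustrative constants, not the paper's weights."""
    local = alpha * coverage_div + (1.0 - alpha) * support_div
    return global_sim - beta * local
```

Keeping the two divergence directions separate until the final weighting is what lets coverage failures (uncaptioned image content) and support failures (ungrounded tokens) be reported independently.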

MSD-Score framework: local vision-language alignment, vMF mixture modeling, bi-directional divergence, and uncertainty-aware fusion.

Results

Strong alignment with human and diagnostic benchmarks.

MSD-Score improves reference-free caption evaluation across human preference benchmarks, factual error datasets, and controlled counterfactual tests.

Human preference correlation and interpretability overview.
Controlled counterfactual hallucination evaluation on COCO-CF.

Diagnostics

KL decomposition gives score-relevant attribution.

The divergence can be decomposed into patch-level and token-level contributions, producing diagnostic heatmaps for unsupported details and missing visual evidence.
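Because a sample-based divergence estimate is a mean of per-sample log-density ratios, each token's (or patch's) log-ratio term can serve directly as its contribution score. The sketch below scores caption tokens against hyperspherical mixtures fitted to each modality; the function name, inputs, and kappa are illustrative assumptions about how such a decomposition could be implemented.

```python
import numpy as np

def token_attribution(tokens, mu_img, pi_img, mu_txt, pi_txt, kappa=10.0):
    """Per-token contribution to the support divergence: the log-ratio of
    the text mixture to the image mixture at each token embedding. Large
    positive values flag tokens the text mixture explains well but the
    image mixture does not, i.e. visually unsupported claims.
    (Illustrative decomposition; not the paper's exact formulation.)"""
    def log_mix(X, mu, pi):
        # Unnormalized mixture log-density; the shared-kappa vMF constant
        # cancels when the two modalities' log-densities are subtracted.
        logits = kappa * X @ mu.T + np.log(pi)
        m = logits.max(axis=1)
        return m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
    return log_mix(tokens, mu_txt, pi_txt) - log_mix(tokens, mu_img, pi_img)
```

Sorting tokens by this score surfaces the most likely hallucinated words; running the symmetric computation over patches surfaces image regions with no textual coverage, which is what the heatmaps visualize.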

MSD-Score attribution examples
Heatmaps reveal unsupported text and missing visual evidence.
SugarCrepe: fine-grained compositional errors are better separated by local distributional verification.

Citation

Cite MSD-Score.

Use this BibTeX entry in papers, slides, benchmark reports, or project pages that reference MSD-Score.

BibTeX
@misc{kan2026msdscore,
  title         = {MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation},
  author        = {Kan, Shichao and Zhang, Xuyang and Zhang, Haojie and Zhu, Zhe and Cen, Yigang and Liang, Yixiong and Shan, Lianlei and Zhang, Linna and Qu, Zhe and Xia, Jiazhi},
  year          = {2026},
  eprint        = {2605.06080},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.06080}
}