microCLIP: Unsupervised CLIP Adaptation via Coarse–Fine Token Fusion for Fine-Grained Image Classification

Badges: arXiv · Project · License · Python


Authors

Sathira Silva $^1$ · Eman Ali $^{1,2}$ · Chetan Arora $^3$ · Muhammad Haris Khan $^1$
$^1$Mohamed Bin Zayed University of Artificial Intelligence · $^2$Alexandria University · $^3$IIT Delhi


Abstract

Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP’s visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion’s evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
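As a rough illustration of the coarse–fine fusion described above, the sketch below pools patch embeddings into a saliency-guided [FG] token and mixes it with the global [CLS] token. This is not the released implementation; the module names (`SOAPool`, `TokenFusion`), the log-saliency attention bias, and the learnable fusion weight `alpha` are assumptions made for exposition only.

```python
# Illustrative sketch only -- not the official microCLIP code. Module names
# (SOAPool, TokenFusion), the log-saliency attention bias, and the learnable
# fusion weight `alpha` are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SOAPool(nn.Module):
    """Saliency-Oriented Attention Pooling (sketch): attend over patch tokens
    with weights biased by an external saliency map (e.g. an NCut mask)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * dim ** -0.5)
        self.scale = dim ** -0.5

    def forward(self, patch_tokens, saliency):
        # patch_tokens: (B, N, D); saliency: (B, N), values in [0, 1]
        attn = (self.query @ patch_tokens.transpose(1, 2)) * self.scale   # (B, 1, N)
        attn = attn + saliency.clamp_min(1e-6).log().unsqueeze(1)         # favour salient patches
        attn = attn.softmax(dim=-1)
        return (attn @ patch_tokens).squeeze(1)                           # (B, D) -> [FG] token


class TokenFusion(nn.Module):
    """Fuse the coarse [CLS] token with the saliency-guided [FG] token (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.soap = SOAPool(dim)
        self.alpha = nn.Parameter(torch.tensor(0.5))    # fusion weight (assumed learnable)

    def forward(self, cls_token, patch_tokens, saliency):
        fg = self.soap(patch_tokens, saliency)                   # fine-grained token
        fused = self.alpha * cls_token + (1 - self.alpha) * fg   # coarse-fine fusion
        return F.normalize(fused, dim=-1)                        # for CLIP-style cosine logits
```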


Overall Architecture

Overall architecture of microCLIP


Results

Comparison to Zero-shot and Unsupervised Adaptation (UA) Baselines

Top-1 accuracy comparison across 13 datasets (ViT-B/32 backbone)

Ablation Studies

Effect of coarse vs. fine-grained cues:

Ablation on coarse-feature baselines

Effect of SOAP:

Ablation on Attention Pooling (SOAP vs baselines)

Dynamic Knowledge Aggregation:

Ablation on pseudo-labeler
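For intuition, the pseudo-labeler ablated here follows the idea stated in the abstract: a convex combination of the frozen LLM/CLIP prior with TokenFusion's evolving logits. The snippet below is a minimal sketch of that combination, not the paper's exact update rule; the function name and the fixed weight `lam` are assumptions (in practice the weight could be scheduled over training).

```python
# Minimal sketch of the Dynamic Knowledge Aggregation idea from the abstract.
# The function name and the fixed convex weight `lam` are assumptions.
import torch


def aggregate_pseudo_labels(prior_logits: torch.Tensor,
                            fusion_logits: torch.Tensor,
                            lam: float = 0.5) -> torch.Tensor:
    """prior_logits:  frozen LLM/CLIP text-prior head, shape (B, C)
    fusion_logits: learnable TokenFusion head, shape (B, C)
    Returns hard pseudo-labels from the convex combination of the two heads."""
    probs = lam * prior_logits.softmax(dim=-1) + (1.0 - lam) * fusion_logits.softmax(dim=-1)
    return probs.argmax(dim=-1)
```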

Two-headed classifier initialization:

Two-headed classifier ablation

Backbone Scaling

Results with ViT-B/16 backbone


Visualizations

Sharper local attention via SOAP-guided [FG]:

Attention maps (Birdsnap/RESISC)

[CLS] vs [FG] attention across datasets:

Attention comparison between CLS and FG tokens

Pseudo-label accuracy progression:

Pseudo-labeling accuracy curves

NCut saliency masks:

NCut-based saliency maps on Birdsnap


Acknowledgements

This work builds upon the MUST repository.
We thank the authors of MetaCLIP for releasing their codebase, which we use in additional experiments.
We also acknowledge CuPL for providing GPT-3 generated class descriptions, included in all_prompts/.


Citation

@misc{silva2025microclipunsupervisedclipadaptation,
      title={microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification}, 
      author={Sathira Silva and Eman Ali and Chetan Arora and Muhammad Haris Khan},
      year={2025},
      eprint={2510.02270},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02270}, 
}