microCLIP: Unsupervised CLIP Adaptation via Coarse–Fine Token Fusion for Fine-Grained Image Classification
Authors
Sathira Silva $^1$ · Eman Ali $^{1,2}$ · Chetan Arora $^3$ · Muhammad Haris Khan $^1$
$^1$Mohamed Bin Zayed University of Artificial Intelligence · $^2$Alexandria University · $^3$IIT Delhi
Abstract
Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP’s visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion’s evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
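For a concrete picture of the two mechanisms summarized above, the snippet below is a minimal PyTorch sketch, not the released implementation: it pools patch tokens into an $\texttt{[FG]}$ token under a saliency bias, fuses it with the global $\texttt{[CLS]}$ token, and convexly mixes frozen LLM/CLIP prior logits with the adapter's evolving logits to refine pseudo-labels. The module names (`SaliencyAttentionPool`, `TokenFusion`), the log-saliency attention bias, the linear fusion layer, and the fixed mixing weight `alpha` are illustrative assumptions; see the repository for the actual SOAP, TokenFusion, and Dynamic Knowledge Aggregation code.

```python
# Minimal sketch (not the released implementation) of saliency-guided [FG]
# pooling, coarse-fine token fusion, and convex pseudo-label aggregation.
# Module names, shapes, and the fixed alpha below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyAttentionPool(nn.Module):
    """Pool patch embeddings into a single [FG] token, biased by a saliency map."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * dim ** -0.5)
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D); saliency: (B, N), e.g. from an NCut mask.
        # Turn saliency into an additive attention bias so salient patches
        # dominate the pooled [FG] token.
        bias = torch.log(saliency.clamp_min(1e-6)).unsqueeze(1)        # (B, 1, N)
        bias = bias.repeat_interleave(self.attn.num_heads, dim=0)      # (B*H, 1, N)
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        fg, _ = self.attn(q, patch_tokens, patch_tokens, attn_mask=bias)
        return fg.squeeze(1)                                           # (B, D)


class TokenFusion(nn.Module):
    """Fuse the coarse [CLS] token with the saliency-guided [FG] token."""

    def __init__(self, dim: int):
        super().__init__()
        self.pool = SaliencyAttentionPool(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, cls_token, patch_tokens, saliency):
        fg = self.pool(patch_tokens, saliency)
        fused = self.fuse(torch.cat([cls_token, fg], dim=-1))
        return F.normalize(fused, dim=-1)


def aggregate_pseudo_labels(prior_logits: torch.Tensor,
                            fused_logits: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    """Convexly combine the frozen LLM/CLIP prior with TokenFusion's evolving
    logits and take the argmax as the refined pseudo-label. The mixing weight
    is fixed here for simplicity; the paper's aggregation is dynamic."""
    mixed = alpha * prior_logits.softmax(-1) + (1.0 - alpha) * fused_logits.softmax(-1)
    return mixed.argmax(dim=-1)
```

Keeping the prior term frozen while only the fused logits evolve mirrors the stabilizing role of the frozen text-based classifier described in the abstract.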
Overall Architecture
Results
Comparison to Zero-Shot and Unsupervised Adaptation (UA) Baselines
Ablation Studies
Effect of coarse vs. fine-grained cues:
Effect of SOAP:
Dynamic Knowledge Aggregation:
Two-headed classifier initialization:
Backbone Scaling
Visualizations
Sharper local attention via SOAP-guided [FG]:
[CLS] vs. [FG] attention across datasets:
Pseudo-label accuracy progression:
NCut saliency masks:
Acknowledgements
This work builds upon the MUST repository.
We thank the authors of MetaCLIP for releasing their codebase, which we use in additional experiments.
We also acknowledge CuPL for providing the GPT-3-generated class descriptions included in all_prompts/.
Citation
@misc{silva2025microclipunsupervisedclipadaptation,
  title={microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification},
  author={Sathira Silva and Eman Ali and Chetan Arora and Muhammad Haris Khan},
  year={2025},
  eprint={2510.02270},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.02270},
}