Learning the Power of "No": Foundation Models with Negations

Jaisidh Singh1 Ishaan Shrivastava2 Mayank Vatsa3 Richa Singh3 Aparna Bharati4
1University of Tübingen, 2Metafusion, 3IIT Jodhpur, 4Lehigh University

Abstract

Negation is a fundamental aspect of natural language reasoning, yet foundational vision-language models (VLMs) like CLIP face significant challenges in accurately interpreting it. These models often process text prompts holistically, making it difficult to isolate and understand the role of negated terms. To overcome this limitation, we present CC-Neg: a novel dataset consisting of 228,246 images, each paired with both true captions and their corresponding negated versions. CC-Neg provides a critical benchmark to assess and improve foundational VLMs’ ability to process negations, focusing specifically on how the presence of terms like ‘not’ alters the semantic relationship between images and their textual descriptions. To illustrate the effectiveness of the CC-Neg dataset in enhancing negation understanding, we introduce the CoN-CLIP framework, which incorporates targeted modifications to CLIP’s contrastive loss function. When trained with CC-Neg, CoN-CLIP achieves a 3.85% average improvement in top-1 accuracy for zero-shot image classification across eight datasets, and a 4.4% performance boost on challenging compositionality benchmarks such as SugarCREPE. These results highlight CoN-CLIP’s enhanced understanding of the nuanced semantic relationships involving negation. Our code and the CC-Neg benchmark are available here.
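The abstract describes CoN-CLIP as modifying CLIP's contrastive loss so that negated captions from CC-Neg inform training. As a rough illustration only, here is a minimal numpy sketch of one plausible way negated captions could enter a CLIP-style objective, namely as extra hard negatives appended to the similarity matrix. The function names, the concatenation scheme, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    # Numerically stable cross-entropy with one target index per row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def negation_aware_contrastive_loss(img, txt_true, txt_neg, temperature=0.07):
    """Hypothetical CLIP-style loss: each image's negated caption is
    appended as an additional hard negative (not the paper's exact loss)."""
    # L2-normalize embeddings so dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt_true = txt_true / np.linalg.norm(txt_true, axis=1, keepdims=True)
    txt_neg = txt_neg / np.linalg.norm(txt_neg, axis=1, keepdims=True)
    # Logits have shape [B, 2B]: true captions first, negated captions after.
    all_txt = np.concatenate([txt_true, txt_neg], axis=0)
    logits = img @ all_txt.T / temperature
    # Image i should match true caption i; all negated captions are negatives.
    targets = np.arange(len(img))
    return softmax_cross_entropy(logits, targets)
```

Under this sketch, a caption containing "not" is pushed away from the image it contradicts, which is the kind of supervision a dataset of paired true and negated captions makes possible.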

Citation

@inproceedings{singh2024learning,
  title={Learning the Power of "No": Foundation Models with Negations},
  author={Singh, Jaisidh and Shrivastava, Ishaan and Vatsa, Mayank and Singh, Richa and Bharati, Aparna},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  year={2025}
}