Recognize Anything
A Strong Image Tagging Model

Youcai Zhang¹*, Xinyu Huang¹*, Jinyu Ma¹*, Zhaoyang Li¹*, Zhaochuan Luo¹, Yanchun Xie¹,
Yuzhuo Qin¹, Tong Luo¹, Yaqian Li¹, Shilong Liu², Yandong Guo³, Lei Zhang²

*Equal Contribution

¹OPPO Research Institute, ²International Digital Economy Academy (IDEA), ³AI² Robotics

RAM Paper | RAM Demo | Official Code

The Recognize Anything Model (RAM) can recognize any common category with high accuracy.
When combined with localization models (Grounded-SAM), RAM forms a strong and general pipeline for visual semantic analysis.
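The tag-then-localize idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `tag_then_localize`, `Region`, `fake_tagger`, and `fake_localizer` are hypothetical names invented here, not part of the RAM or Grounded-SAM code, and the two stand-in functions return hard-coded values in place of real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    """One localized instance of a recognized tag."""
    tag: str
    box: Box


def tag_then_localize(
    image: object,
    tagger: Callable[[object], Sequence[str]],
    localizer: Callable[[object, str], Sequence[Box]],
) -> List[Region]:
    """Run a tagger, then ask a localizer for the regions matching each tag."""
    regions: List[Region] = []
    for tag in tagger(image):
        for box in localizer(image, tag):
            regions.append(Region(tag=tag, box=box))
    return regions


# Stand-ins for real models (assumptions, hard-coded for illustration only).
def fake_tagger(image: object) -> Sequence[str]:
    return ["dog", "grass"]


def fake_localizer(image: object, tag: str) -> Sequence[Box]:
    boxes = {"dog": [(10.0, 20.0, 110.0, 220.0)]}
    return boxes.get(tag, [])  # no box when the tag cannot be grounded


regions = tag_then_localize(None, fake_tagger, fake_localizer)
```

In a real setup the tagger role would be played by a recognition model such as RAM and the localizer role by a model such as Grounded-SAM; passing them in as plain callables keeps the two stages independent.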

Highlight

Recognition and localization are two fundamental computer vision tasks.

The Segment Anything Model (SAM) excels at localization but falls short on recognition tasks.

The Recognize Anything Model (RAM) exhibits exceptional recognition abilities, in terms of both accuracy and scope.

The advantages of RAM are summarized as follows:

Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization;

Reproducible and affordable. RAM has a low reproduction cost, relying on an open-source, annotation-free dataset;

Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.

Superior Recognition Ability

RAM can recognize more valuable tags than other models.

RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.

RAM even surpasses fully supervised methods (ML-Decoder).

RAM exhibits competitive performance with the Google tagging API.

Extensive Recognition Scopes

RAM automatically recognizes 6400+ common tags, covering more valuable categories than OpenImages V6.

With its open-set capability, RAM can potentially recognize any common category.
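Image tagging is a multi-label problem: each tag in the vocabulary receives its own score, and every tag that clears a threshold is reported. The snippet below is a generic sketch of that final selection step, not RAM's actual implementation; the `select_tags` helper, the 0.5 default threshold, and the example scores are all made up for illustration.

```python
from typing import Dict, List


def select_tags(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the tags whose score meets the threshold, highest score first."""
    kept = [(tag, score) for tag, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in kept]


# Made-up per-tag scores for a single image (illustrative values only).
example_scores = {"dog": 0.93, "grass": 0.71, "car": 0.08, "frisbee": 0.55}
tags = select_tags(example_scores)
```

Raising the threshold trades recall for precision: with `threshold=0.8`, only `"dog"` would be kept from the example scores.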

BibTeX

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}
