Recognize Anything
A Strong Image Tagging Model

OPPO Research Institute, International Digital Economy Academy (IDEA), AI2 Robotics

The Recognize Anything Model (RAM) can recognize any common category with high accuracy.
When combined with localization models (Grounded-SAM), RAM forms a strong and general pipeline for visual semantic analysis.
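
For concreteness, here is a minimal tagging sketch in Python. It is a sketch under assumptions, not a definitive recipe: the module paths (ram, ram.models), the helpers get_transform and inference_ram, and the checkpoint filename are all modeled on the open-source recognize-anything repository.

    # Minimal RAM tagging sketch; module paths, helper names, and the checkpoint
    # filename are assumptions modeled on the open-source repository.
    import torch
    from PIL import Image

    from ram import get_transform, inference_ram
    from ram.models import ram

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Preprocess the image to the resolution the model was trained at.
    transform = get_transform(image_size=384)

    # Build the tagging model from a pretrained checkpoint (path is an assumption).
    model = ram(pretrained="ram_swin_large_14m.pth", image_size=384, vit="swin_l")
    model.eval().to(device)

    image = transform(Image.open("demo.jpg")).unsqueeze(0).to(device)

    # The reference inference helper returns English and Chinese tag strings.
    english_tags, chinese_tags = inference_ram(image, model)
    print(english_tags)  # e.g. "dog | grass | park" -- tags separated by '|'

The resulting tag string can then serve as the text prompt for Grounded-SAM (Grounding DINO + SAM), which turns each recognized tag into boxes and masks.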

Highlights

Recognition and localization are two foundation computer vision tasks.

  • The Segment Anything Model (SAM) excels at localization but falls short on recognition tasks.
  • The Recognize Anything Model (RAM) exhibits exceptional recognition abilities, in terms of both accuracy and scope.
  • The advantages of RAM are summarized as follows:
    • Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization;
    • Reproducible and affordable. RAM has a low reproduction cost, thanks to an open-source and annotation-free training dataset;
    • Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.

Superior Recognition Ability

RAM can recognize more valuable tags than other models.

  • RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
  • RAM even surpasses fully supervised models (e.g., ML-Decoder).
  • RAM is competitive with the Google tagging API.

Extensive Recognition Scope

  • RAM automatically recognizes 6400+ common tags, covering more valuable categories than OpenImages V6.
  • With its open-set capability, RAM can recognize any common category (see the sketch below).
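
The sketch below shows how that open-set mode can be driven: arbitrary category names are embedded with a CLIP text encoder and swapped in for the default label queries. It is hedged throughout; build_openset_label_embedding, inference_ram_openset, and the model attributes being overwritten (tag_list, label_embed, num_class, class_threshold) are assumptions modeled on the repository's open-set inference script.

    # Open-set tagging sketch: replace the default 6400+ tag list with
    # user-defined categories. All helper and attribute names below are
    # assumptions modeled on the repository's open-set inference script.
    import numpy as np
    import torch
    from PIL import Image

    from ram import get_transform, inference_ram_openset
    from ram.models import ram
    from ram.utils import build_openset_label_embedding

    device = "cuda" if torch.cuda.is_available() else "cpu"
    transform = get_transform(image_size=384)
    model = ram(pretrained="ram_swin_large_14m.pth", image_size=384, vit="swin_l")

    # Embed arbitrary category names (CLIP text encoder under the hood) and
    # install them as the model's label queries -- no retraining required.
    categories = ["corgi", "skateboard", "traffic cone"]  # hypothetical examples
    label_embed, tag_list = build_openset_label_embedding(categories)
    model.tag_list = np.array(tag_list)
    model.label_embed = torch.nn.Parameter(label_embed.float())
    model.num_class = len(tag_list)
    model.class_threshold = torch.ones(model.num_class) * 0.5  # uniform threshold

    model.eval().to(device)
    image = transform(Image.open("demo.jpg")).unsqueeze(0).to(device)
    print(inference_ram_openset(image, model))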

BibTeX

    @article{zhang2023recognize,
      title={Recognize Anything: A Strong Image Tagging Model},
      author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
      journal={arXiv preprint arXiv:2306.03514},
      year={2023}
    }

    @article{huang2023tag2text,
      title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
      author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
      journal={arXiv preprint arXiv:2303.05657},
      year={2023}
    }