InsDet

Overview

Instance Detection (InsDet) is a practically important task in robotics applications, e.g., elderly-assistant robots need to fetch specific items from a cluttered kitchen, micro-fulfillment robots for the retail need to pick items from mixed boxes or shelves. Different from Object Detection (ObjDet) detecting all objects belonging to some predefined classes, InsDet aims to detect specific object instances defined by some examples capturing the instance from multiple views.

This year, we plan to run a competition on our InsDet dataset, which is the instance detection benchmark dataset which is larger in scale and more challenging than existing InsDet datasets. The major strengths of our InsDet dataset over prior InsDet datasets include (1) both high-resolution profile images of object instances and high-resolution testing images from more realistic indoor scenes, simulating real-world indoor robots locating and recognizing object instances from a cluttered indoor scene in a distance (2) a realistic unified InsDet protocol to foster the InsDet research.

Participants in this challenge will be tasked with predicting the bounding boxes for each given instance from testing images. This exciting opportunity allows researchers, students, and data scientists to apply their expertise in computer vision and machine learning to address instance detection problem. We refer participants to the user guide for details.

Important Dates

~~March 24, 2025, InsDet (6.60 GB) will be released.~~

~~March 24, 2025, EvalAI server will be open.~~

~~May 1, 2025, Scenes Test (2.43 GB) will be released.~~

~~May 5, 2025, Evaluation scripts will be uploaded to github repo.~~

~~June 5, 2025, Challenge will be closed.~~

~~June 6, 2025, Invitations to top teams for presentation at the workshop will be released.~~

June 11, 2025, Workshop day. InsDet Challenge will be at 4:45 - 5:20 Beijing Time.

LeaderBoard

Congratulations to all teams for your excellent performance!
We invite the following three teams: Grounding-X, leinad (ZERO) and hust_zwb to give a short talk (max 10 minutes) for your distinct performance. Please send us your technical reports or slides or videos before 12 pm GMT on June 10. If you have any questions, feel free to contact us or submit issues on our github repo.

Dataset

The InsDet dataset contains 100 object instances with multi-view profile images, 200 pure background images and 160 scene images. Participants can download the dataset from the InsDet dataset.

Objects.
100 different Object instances. Each profile image has a resolution of 3072×3072 pixels (some instances are 3456×3456). Each instance is captured at 24 rotation positions (every 15° in azimuth) with a 45° elevation view. When capturing profile images for each instance, inspired by prior arts, we paste a QR code on the tabletop, which enables pose estimation, e.g., using COLMAP. We use the GrabCut toolbox to derive foreground masks of instances in profile images. This removes background pixels (such as QR code regions) in the profile images.
In practice, we center-crop foreground instances from profile images and downsize the center-crops to 1024×1024, and save images and masks.
Background.
200 high-resolution background images of indoor scenes that do not include any given instances from Objects.

Scenes.
160 high-resolution images (6144×8192) in cluttered scenes, where some instances are placed in reasonable locations. We tag these images as easy or hard based on scene clutter and object occlusion levels.

The Scenes-Test has the same structure of Scenes, but most images are captured from different scenarios. This part includes 320 high-resolution images (6144×8192) in cluttered scenes.

Benchmarking Protocol

Goal.
Developing instance detectors using profile images (cf. visuals below) and optionally some random background images (e.g., to apply cut-paste-learn^[1]). The detector should detect object instances of interest in real-world testing images.
Environment for model development.
1. A set of object instances, each of which has some visual examples captured from multiple views. Participants should develop a model to successfully detect these object instances.
2. Some random background images (not used in testing). Participants might use them to synthesize images. Participants can also download and use other external background images in training.
Environment for testing.
Real-world indoor scene images, in which participants' algorithms should detect object instances of interest.

Importantly, participants are not allowed to develop any instance detectors on the Real-world indoor scene images we provided. Furthermore, for participants who are invited for a 10-min presentation (online or video).

Evaluation & Submission

Following the COCO dataset^[8], we tag testing object instances as small, medium, and large according to their bounding box area. The following 12 metrics are used for characterizing the performance of an instance detector on InsDet dataset. Additionally, we will also evaluate AP on easy and hard scenes separately.

Official Baseline

Faster RCNN^[2] + Cut-Paste-Learn strategy^[1]. We train Faster RCNN with data generated using the Cut-Paste-Learn strategy. The detailed hyperparameters used in data generation and model training are mentioned in the paper [3].
SAM^[4] + DINOv2^[5] + StableMatching^[6,7]. We proposed a simple non-learned method using the off-the-shelf class-agnostic segmentation model (Segment Anything Model, SAM), the self-supervised feature representation DINOv2 and classical StableMatching algorithm in the paper [3].

Submission

The generated JSON or CSV file should adhere to the following dictionary format:

[{"image_id": 0, 
  "category_id": 79, 
  "bbox": [976, 632, 64, 80], 
  "score": 99.32915569311469, 
  "image_width": 8192, 
  "image_height": 6144,
  "scale": 1, 
  "image_name": 
  "easy.leisure_zone.rgb_000.jpg"},
  ...
 {"image_id": 159, 
  "category_id": 9, 
  "bbox": [921, 803, 28, 106], 
  "score": 99.32927090665571, 
  "image_width": 8192, 
  "image_height": 6144, 
  "scale": 1, 
  "image_name": "hard.pantry_room_001.rgb_019.jpg"}]

References

[1] Dwibedi, Debidatta, et al. "Cut, paste and learn: Surprisingly easy synthesis for instance detection." Proceedings of the IEEE international conference on computer vision. 2017.

[2] Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE international conference on computer vision. 2015.

[3] Shen, Qianqian, et al. “A High-Resolution Dataset for Instance Detection with Multi-View Instance Capture.” Thirty-seventh conference on neural information processing systems datasets and benchmarks track. 2023.

[4] Kirillov, Alexander, et al. "Segment anything." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[5] Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).

[6] David Gale and Lloyd S Shapley. College admissions and the stability of marriage. The American Mathematical Monthly, 69(1):9–15, 1962.

[7] David G McVitie and Leslie B Wilson. The stable marriage problem. Communications of the ACM, 14(7):486–490, 1971.

[8] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

BibTeX

If you find our work useful, please consider citing our papers:

@inproceedings{shen2025solving,
            title={Solving Instance Detection from an Open-World Perspective},
            author={Shen, Qianqian and Zhao, Yunhan and Kwon, Nahyun and Kim, Jeeeun and Li, Yanan and Kong, Shu},
            booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            year={2025}
            }

@inproceedings{shen2023high,
            title={A high-resolution dataset for instance detection with multi-view object capture},
            author={Shen, Qianqian and Zhao, Yunhan and Kwon, Nahyun and Kim, Jeeeun and Li, Yanan and Kong, Shu},
            booktitle={Conference on Neural Information Processing Systems (NeurIPS) Datasets & Benchmark Track},
            year={2023}
          }

Organizers

Qianqian Shen
Zhejiang University

Yunhan Zhao
University of California, Irvine

Shu Kong
University of Macau

Object Instance Detection

in conjunction with the 5th Workshop on VPLOW, CVPR 2025 in Nashville, USA.

Overview

Important Dates

LeaderBoard

Dataset

Objects.

Background.

Scenes.