HKU and Alibaba's visual-AI "Anywhere Door" can seamlessly teleport objects into a scene with one click
Source: Qubit
With two mouse clicks, an object can be seamlessly "teleported" into a photo, with lighting and perspective adapted automatically.
This AI "Anywhere Door" from Alibaba and HKU achieves zero-shot object insertion into images.
With it, online shoppers can directly preview how clothes would look on them.
AnyDoor can teleport multiple objects at a time.
Zero-shot generation of realistic results
Compared with existing models of the same kind, AnyDoor works zero-shot: there is no need to fine-tune the model for specific items.
In fact, other reference-based models can only maintain semantic consistency.
In plain terms, if the object to be teleported is a cat, those models can only guarantee that the result also contains a cat; they cannot guarantee it looks like the same cat.
AnyDoor also performs well at moving, repositioning, and even reposing objects that are already in an image.
Working principle
Before feeding the image containing the target object to its ID extractor, AnyDoor first removes the background.
Then AnyDoor performs self-supervised object feature extraction and converts the result into ID tokens.
The encoder used in this step is built on DINOv2, currently the strongest self-supervised vision model.
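As a rough illustration of this step, an ID extractor of this kind can be sketched with the public DINOv2 weights. The variant choice, pre-processing, and the `extract_id_tokens` helper below are assumptions for illustration, not AnyDoor's released code:

```python
import torch
import torch.nn.functional as F

# Public DINOv2 backbone (ViT-g/14 here; the exact variant AnyDoor uses
# is an assumption in this sketch).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14").eval()

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def extract_id_tokens(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) in [0, 1]; mask: (1, H, W) binary object mask."""
    fg = image * mask                        # background removal
    fg = (fg - IMAGENET_MEAN) / IMAGENET_STD
    # Resize to a multiple of DINOv2's 14-pixel patch size.
    fg = F.interpolate(fg.unsqueeze(0), size=(224, 224), mode="bilinear")
    with torch.no_grad():
        feats = dinov2.forward_features(fg)
    # Use the global token plus the patch tokens as the object's ID tokens.
    cls_tok = feats["x_norm_clstoken"].unsqueeze(1)  # (1, 1, C)
    patch_tok = feats["x_norm_patchtokens"]          # (1, 256, C)
    return torch.cat([cls_tok, patch_tok], dim=1)    # (1, 257, C)
```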
To adapt to changes in viewing angle and lighting, detail information must be extracted in addition to the item's overall features.
To avoid over-constraining the generation at this step, the team chose to represent the detail information as high-frequency maps.
At the same time, AnyDoor uses a Hadamard (element-wise) product to extract the RGB information of the image.
Combining this information with a mask that filters out edge information yields an HF-Map that contains only high-frequency details.
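A minimal sketch of that HF-Map construction, assuming Sobel filters for the high-frequency edges and a small mask erosion to drop the object's outer silhouette (both filter choices are assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def high_frequency_map(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) in [0, 1]; mask: (1, H, W) binary object mask."""
    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]]).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)
    img = image.unsqueeze(0)  # (1, 3, H, W)
    # Depthwise Sobel filtering per RGB channel extracts high-frequency edges.
    gx = F.conv2d(img, sobel_x.repeat(3, 1, 1, 1), padding=1, groups=3)
    gy = F.conv2d(img, sobel_y.repeat(3, 1, 1, 1), padding=1, groups=3)
    edges = gx.abs() + gy.abs()
    # Hadamard (element-wise) product re-injects the RGB information.
    hf = edges * img
    # Erode the mask (min-pool) so the outer silhouette edge is filtered out.
    eroded = -F.max_pool2d(-mask.unsqueeze(0), kernel_size=7, stride=1, padding=3)
    return (hf * eroded).squeeze(0)
```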
Using the obtained tokens, AnyDoor synthesizes the image with a text-to-image diffusion model.
Specifically, AnyDoor uses Stable Diffusion with ControlNet.
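AnyDoor trains its own ControlNet-style branch on the HF-Map rather than reusing a public one, so the snippet below only illustrates the generic Stable Diffusion + ControlNet wiring with the diffusers library, with a public canny ControlNet standing in for AnyDoor's detail branch:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Public stand-in weights; AnyDoor's trained branch is not shown here.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

condition = Image.open("hf_map.png")  # hypothetical HF-Map rendered as an image
result = pipe(
    prompt="",            # AnyDoor conditions on ID tokens rather than text
    image=condition,      # spatial detail guidance via the ControlNet branch
    num_inference_steps=30,
).images[0]
result.save("composite.png")
```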
That is roughly AnyDoor's workflow. Its training also involves some special strategies.
Although AnyDoor targets still images, part of the data used for training is extracted from videos.
AnyDoor's training data is formed by separating an object from its background and labeling the two as a pair.
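A minimal sketch of this pair construction, assuming per-frame object masks are available (the helper below is hypothetical):

```python
import random

def make_training_pair(frames, masks):
    """frames: list of (3, H, W) image tensors of the same object;
    masks: matching binary object masks.
    One frame supplies the object crop, another supplies the target
    scene and position, giving a (object, scene, location) pair."""
    i, j = random.sample(range(len(frames)), 2)
    object_image = frames[i] * masks[i]  # object with background removed
    scene_image = frames[j]              # target scene (ground truth)
    target_mask = masks[j]               # where the object should land
    return object_image, scene_image, target_mask
```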
But while video data is good for learning, it has quality issues that need to be addressed.
So the team designed an adaptive timestep sampling strategy to collect change and detail information at different denoising steps.
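In spirit, such a sampling strategy might look like the sketch below, which biases video pairs toward early, noisier diffusion steps (large overall change) and high-quality still images toward later steps (fine detail); the exact split is an assumption:

```python
import torch

T = 1000  # total diffusion timesteps (typical Stable Diffusion setting)

def sample_timesteps(batch_size: int, from_video: bool) -> torch.Tensor:
    if from_video:
        # Bias toward large t: video pairs teach pose and view change.
        return torch.randint(T // 2, T, (batch_size,))
    # Bias toward small t: clean images teach high-frequency detail.
    return torch.randint(0, T // 2, (batch_size,))
```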
The ablation results show that as these strategies are added, both the CLIP and DINO scores increase step by step.
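For context, a CLIP score of this kind is typically the cosine similarity between CLIP image embeddings of the reference object and the generated result; the checkpoint and crop handling below are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(reference: Image.Image, generated: Image.Image) -> float:
    """Cosine similarity of CLIP image embeddings, in [-1, 1]."""
    inputs = processor(images=[reference, generated], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])
```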
Team Profile
The paper's first author is Xi Chen, a doctoral student at the University of Hong Kong and a former algorithm engineer at Alibaba Group.
His supervisor, Hengshuang Zhao, is the paper's corresponding author; his research covers machine vision and machine learning.
In addition, researchers from Alibaba DAMO Academy and Cainiao Group also participated in this project.
Paper address: