HKU and Alibaba's visual AI "Anywhere Door" can seamlessly teleport objects into a scene with one click

Source: Qubit

With a couple of mouse clicks, an object can be seamlessly "teleported" into a photographed scene, with the lighting angle and perspective adapted automatically.

This AI version of the "anywhere door," built by Alibaba and the University of Hong Kong (HKU), achieves zero-shot object insertion into images.

With it, online shoppers could directly preview how a piece of clothing looks when worn.

Because the function closely resembles an "anywhere door," the research team named it AnyDoor.

AnyDoor can teleport multiple objects at a time.

Not only that, but it can also move existing objects in the image.

After watching it in action, some netizens marveled that it might next evolve to teleporting objects into video.

Zero-shot generation with realistic results

Compared with existing similar models, AnyDoor works zero-shot: there is no need to fine-tune the model for each specific item.

Beyond beating these fine-tuning-based models, AnyDoor is also more accurate than other reference-based models.

In fact, other reference-based models can only maintain semantic consistency.

In layman's terms, if the object to be teleported is a cat, other models can only guarantee that the result also contains a cat; they cannot guarantee that it looks like the same cat.

Zooming in on AnyDoor's results, can you spot any flaws?

User-study results also confirm that AnyDoor outperforms existing models in both quality and accuracy (rated on a 4-point scale).

AnyDoor also performs well at moving, swapping, and even reposing objects in existing images.

So, how does AnyDoor achieve these functions?

How it works

To teleport an object, it must first be extracted from its source image.

Before feeding the image containing the target object to the extractor, AnyDoor first removes its background.

AnyDoor then performs self-supervised object extraction and converts the object into tokens.

The encoder used in this step is built on DINOv2, the current state-of-the-art self-supervised vision model.
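As a concrete illustration of this step (not the official AnyDoor code), here is a minimal sketch of extracting identity tokens from a background-removed object crop with a DINOv2 backbone loaded via torch.hub; the model variant, preprocessing, and token handling are assumptions.

```python
# Minimal sketch: identity tokens for a background-removed object crop via DINOv2.
import torch
from PIL import Image
from torchvision import transforms

# Load a published DINOv2 backbone (ViT-g/14 chosen here for illustration).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),          # 224 is a multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

object_crop = Image.open("object_no_background.png").convert("RGB")  # hypothetical file
x = preprocess(object_crop).unsqueeze(0)    # shape: (1, 3, 224, 224)

with torch.no_grad():
    feats = encoder.forward_features(x)              # dict of global + patch features
    global_token = feats["x_norm_clstoken"]          # (1, 1536) overall identity summary
    patch_tokens = feats["x_norm_patchtokens"]       # (1, 256, 1536) fine-grained tokens

# These tokens would then be projected and passed to the generator as conditioning.
```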

To adapt to changes in angle and lighting, AnyDoor extracts additional detail information on top of the object's overall features.

To avoid over-constraining the generation at this step, the team designed a high-frequency map representation for the detail information.

Convolving the target image with a high-pass filter such as the Sobel operator yields an image containing its high-frequency details.

At the same time, AnyDoor uses a Hadamard (element-wise) product to retain the RGB color information of the image.

Combining this information with a mask that filters out edge information yields an HF-Map containing only high-frequency details; a rough sketch of the procedure is shown below.
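Here is a minimal, assumed sketch of how such a high-frequency map could be built: Sobel high-pass filtering, a Hadamard product with the RGB image, and an eroded mask to discard the object's outer contour. The exact filtering and masking details in AnyDoor may differ.

```python
# Assumed sketch of HF-Map construction: Sobel filtering + Hadamard product + mask.
import cv2
import numpy as np

def high_frequency_map(image_bgr: np.ndarray, object_mask: np.ndarray) -> np.ndarray:
    """image_bgr: HxWx3 uint8; object_mask: HxW uint8 with 1 = object, 0 = background."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # High-pass filtering with Sobel operators along x and y.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.magnitude(gx, gy)
    edges = (edges / (edges.max() + 1e-6))[..., None]   # HxWx1, scaled to [0, 1]

    # Hadamard (element-wise) product: keep RGB color only where detail is strong.
    hf = image_bgr.astype(np.float32) * edges

    # Erode the mask so the object's outer contour is filtered out, then apply it.
    kernel = np.ones((5, 5), np.uint8)
    inner_mask = cv2.erode(object_mask, kernel, iterations=2)[..., None]
    return (hf * inner_mask).astype(np.uint8)
```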

The final step is to inject this information into the generator.

Using the obtained tokens, AnyDoor synthesizes the image with a text-to-image diffusion model.

Specifically, AnyDoor uses Stable Diffusion with ControlNet.
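The sketch below only illustrates the general idea of injecting a spatial condition through a ControlNet branch on top of Stable Diffusion, using off-the-shelf diffusers checkpoints as stand-ins. AnyDoor's actual generator is custom-trained: it replaces text conditioning with the object's ID tokens and feeds the HF-Map placed into the scene through its control branch.

```python
# Illustrative only: ControlNet-style spatial conditioning of Stable Diffusion.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# In AnyDoor the spatial condition would be the HF-Map pasted at the target location;
# here a generic edge/detail map stands in for it (hypothetical file name).
condition = Image.open("hf_map_in_scene.png")

image = pipe(
    prompt="",                      # AnyDoor swaps text conditioning for object ID tokens
    image=condition,                # spatial detail injected through the ControlNet branch
    num_inference_steps=30,
).images[0]
image.save("composited_scene.png")
```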

That is roughly how AnyDoor works. On the training side, there are also some special strategies.

###### The training data set used by AnyDoor

Although AnyDoor targets still images, part of the data used for training is extracted from videos.

For the same object, frames with different backgrounds can be extracted from a video.

AnyDoor's training data is formed by separating the object from the background in each frame and pairing the object with the scene, as sketched below.
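A rough, assumed sketch of how such (object, scene) pairs might be mined from a video clip: one frame supplies the segmented reference object, another frame of the same clip supplies the target scene. The `segment_object` callback is purely hypothetical and stands in for whatever segmentation model is used.

```python
# Assumed sketch: building (object, scene) training pairs from a video clip.
import random
import cv2

def make_training_pair(video_path: str, segment_object):
    """segment_object(frame) -> uint8 binary mask (1 = object); hypothetical helper."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    # Pick two distinct frames: one as the object source, one as the target scene.
    src_idx, tgt_idx = random.sample(range(len(frames)), 2)
    src_frame, tgt_frame = frames[src_idx], frames[tgt_idx]

    # Separate the object from its background in the source frame.
    src_mask = segment_object(src_frame)
    object_crop = cv2.bitwise_and(src_frame, src_frame, mask=src_mask)

    # The object's location in the target frame serves as the placement label.
    tgt_mask = segment_object(tgt_frame)
    return object_crop, tgt_frame, tgt_mask   # (reference object, scene, location label)
```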

But while video data is good for learning, its quality issues need to be addressed.

So the team designed an adaptive timestep sampling strategy that collects change information and detail information at different stages.
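The exact sampling rule is not spelled out here, but the idea can be sketched as biasing which diffusion timesteps each data source trains on: noisier timesteps for video frames (coarse changes), cleaner timesteps for high-quality images (fine detail). The boundary and bias direction below are assumptions.

```python
# Hedged sketch of adaptive timestep sampling (details assumed, not the paper's rule).
import torch

def sample_timesteps(batch_is_video: torch.Tensor,
                     num_train_timesteps: int = 1000,
                     split: float = 0.5) -> torch.Tensor:
    """batch_is_video: bool tensor of shape (B,), True for samples that came from video."""
    t = torch.empty_like(batch_is_video, dtype=torch.long)
    boundary = int(num_train_timesteps * split)

    # Video-derived samples: bias toward large t (more noise, learn coarse structure).
    n_video = int(batch_is_video.sum())
    t[batch_is_video] = torch.randint(boundary, num_train_timesteps, (n_video,))

    # High-quality image samples: bias toward small t (less noise, learn fine detail).
    n_image = int((~batch_is_video).sum())
    t[~batch_is_video] = torch.randint(0, boundary, (n_image,))
    return t
```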

The ablation experiments show that as these strategies were added, both the CLIP and DINO similarity scores gradually increased.
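For reference, a CLIP similarity score of this kind can be computed by embedding the generated object and the reference object with CLIP's image encoder and taking their cosine similarity; the paper's exact evaluation protocol may differ, and the DINO score is analogous with a DINO backbone.

```python
# Sketch: CLIP image-similarity score between a generated object and its reference.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_score(generated_crop: Image.Image, reference_crop: Image.Image) -> float:
    inputs = processor(images=[generated_crop, reference_crop], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)         # (2, 512) image embeddings
    emb = emb / emb.norm(dim=-1, keepdim=True)           # L2-normalize
    return float((emb[0] * emb[1]).sum())                # cosine similarity

# Hypothetical file names, for illustration only.
score = clip_image_score(Image.open("generated.png"), Image.open("reference.png"))
print(f"CLIP score: {score:.3f}")
```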

Team Profile

The first author of the paper is Xi Chen, a doctoral student at the University of Hong Kong, who used to be an algorithm engineer at Alibaba Group.

Xi Chen's supervisor, Hengshuang Zhao, is the corresponding author of the paper; his research interests include computer vision and machine learning.

In addition, researchers from Alibaba DAMO Academy and Cainiao Group also participated in this project.

Paper address:
