JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

ICLR 2024


Jingyang Zhang1, Shiwei Li1, Yuanxun Lu3, Tian Fang1, David McKinnon1, Yanghai Tsin1, Long Quan2, Yao Yao3
1Apple, 2HKUST, 3Nanjing University

Paper | Code

Abstract




We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model. We demonstrate the effectiveness of JointNet by using RGBD diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation.



Method Overview




Inspired by ControlNet, we create a copy of the original diffusion model to handle the additional dense modality. To enable information exchange, the two branches are densely connected, and every exchanged tensor passes through a zero-initialized convolution layer, so the forward pass is identical to the original model before any fine-tuning.
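For concreteness, the connection pattern can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than the paper's implementation: the block name, channel count, and the placement of the cross-branch convolutions are assumptions; it only shows how a frozen pre-trained RGB block and a trainable copy for the dense modality exchange features through zero-initialized convolutions.

```python
# Illustrative sketch of a JointNet-style cross-branch connection.
# Names and shapes are placeholders, not the paper's code.
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so it contributes nothing
    until fine-tuning updates its weights."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class JointBlock(nn.Module):
    """Pairs a frozen pre-trained RGB block with a trainable copy for the
    dense modality; features are exchanged via zero-initialized convolutions."""
    def __init__(self, rgb_block: nn.Module, channels: int):
        super().__init__()
        self.rgb_block = rgb_block                    # pre-trained, locked
        self.dense_block = copy.deepcopy(rgb_block)   # trainable copy
        for p in self.rgb_block.parameters():
            p.requires_grad = False
        self.rgb_to_dense = zero_conv(channels)
        self.dense_to_rgb = zero_conv(channels)

    def forward(self, rgb_feat: torch.Tensor, dense_feat: torch.Tensor):
        # Before fine-tuning, both zero convs output zeros, so the RGB path
        # behaves exactly like the original pre-trained block.
        rgb_out = self.rgb_block(rgb_feat) + self.dense_to_rgb(dense_feat)
        dense_out = self.dense_block(dense_feat) + self.rgb_to_dense(rgb_feat)
        return rgb_out, dense_out
```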




Compared to the naive solution of zero-expanding the first and last convolution layers (Direct Extend), JointNet provides a smooth transition to joint training and thus prevents catastrophic forgetting. With Direct Extend, the RGB generation capability degrades in early iterations and only gradually recovers as the model relearns from the new dataset.
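For reference, the sketch below illustrates what "zero-expanding" a convolution means in the Direct Extend baseline: the pre-trained layer is widened to accept the extra depth channels, with the new weights initialized to zero. The 4-channel latent and 320 output channels follow a typical latent-diffusion first layer and are assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of the "Direct Extend" baseline: widen a pre-trained
# convolution so it also accepts depth channels, zero-initializing the new
# weights so the extra inputs initially have no effect.
import torch
import torch.nn as nn

def zero_expand_in_channels(conv: nn.Conv2d, extra_in: int) -> nn.Conv2d:
    """Return a conv taking `extra_in` additional input channels, copying the
    existing weights and zero-initializing the new ones."""
    new_conv = nn.Conv2d(
        conv.in_channels + extra_in, conv.out_channels,
        kernel_size=conv.kernel_size, stride=conv.stride,
        padding=conv.padding, bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example (assumed shapes): a 4-channel latent input layer extended to also
# take 4 depth-latent channels.
first_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)
extended = zero_expand_in_channels(first_conv, extra_in=4)
```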



Joint Generation




Bi-directional Image-Depth Conversion


Once the joint image-depth distribution is captured, bi-directional conversion between the two modalities can be modeled as channel-wise inpainting.
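A minimal sketch of channel-wise inpainting is given below, written against a diffusers-style scheduler interface (`add_noise`, `step`, `timesteps`); the `model` call signature, the channel layout of the joint RGBD tensor, and the mask convention are assumptions rather than the paper's API. The idea is simply to keep overwriting the known channels (RGB for depth prediction, depth for depth-conditioned image generation) with a noised copy of the ground truth at every denoising step, so only the unknown channels are generated.

```python
# Hedged sketch of channel-wise inpainting with a joint RGBD diffusion model.
# `model` and `scheduler` are placeholders; a diffusers-style scheduler
# interface is assumed and scheduler.set_timesteps(...) has already been called.
import torch

@torch.no_grad()
def inpaint_channels(model, scheduler, known, known_mask, text_emb):
    """known:      (B, C, H, W) joint RGBD tensor with the given channels filled in.
    known_mask: (B, C, 1, 1) binary mask, 1 for channels that are observed."""
    x = torch.randn_like(known)
    for t in scheduler.timesteps:
        # Overwrite the observed channels with a noised copy of the ground
        # truth at the current noise level.
        noise = torch.randn_like(known)
        known_noisy = scheduler.add_noise(known, noise, t)
        x = known_mask * known_noisy + (1 - known_mask) * x
        # One reverse-diffusion step on the full joint tensor.
        eps = model(x, t, text_emb)          # assumed model signature
        x = scheduler.step(eps, t, x).prev_sample
    # Re-impose the clean observed channels in the final result.
    return known_mask * known + (1 - known_mask) * x
```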



High-res Depth Refinement


The low-res depth maps from MiDaS can be upsampled and refined by tile-based diffusion.
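One possible way to realize this tile-based pass is to upsample the coarse depth map, refine overlapping crops independently, and average the overlaps. The sketch below assumes a `refine_tile` helper (e.g., the channel-wise inpainting above run on a crop) and illustrative tile/overlap sizes; none of these are the paper's settings, and it assumes the image is at least one tile large.

```python
# Illustrative sketch of tile-based refinement of an upsampled depth map.
# `refine_tile` is an assumed helper standing in for a diffusion refinement call.
import torch
import torch.nn.functional as F

def refine_depth_tiled(rgb, coarse_depth, refine_tile, tile=512, overlap=64):
    """rgb: (1, 3, H, W); coarse_depth: (1, 1, h, w) low-res prediction."""
    H, W = rgb.shape[-2:]
    # Upsample the coarse prediction to full resolution first.
    depth = F.interpolate(coarse_depth, size=(H, W), mode="bilinear",
                          align_corners=False)
    out = torch.zeros_like(depth)
    weight = torch.zeros_like(depth)
    stride = tile - overlap
    for y in range(0, H - overlap, stride):
        for x in range(0, W - overlap, stride):
            # Clamp the window so the last tiles stay inside the image.
            y0, x0 = min(y, H - tile), min(x, W - tile)
            rgb_t = rgb[..., y0:y0 + tile, x0:x0 + tile]
            dep_t = depth[..., y0:y0 + tile, x0:x0 + tile]
            refined = refine_tile(rgb_t, dep_t)
            out[..., y0:y0 + tile, x0:x0 + tile] += refined
            weight[..., y0:y0 + tile, x0:x0 + tile] += 1.0
    # Average overlapping regions.
    return out / weight.clamp(min=1.0)
```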



RGBD Panorama Generation


A beach with palm trees

A photo of a beautiful ocean with coral reef

A photo of a botanical garden




Acknowledgements: The website template was borrowed from Lior Yariv. Image sliders are based on dics.