JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

ICLR 2024


Jingyang Zhang1, Shiwei Li1, Yuanxun Lu3, Tian Fang1, David McKinnon1, Yanghai Tsin1, Long Quan2, Yao Yao3
1Apple, 2HKUST, 3Nanjing University

Paper | Code

Abstract




We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model. We demonstrate the effectiveness of JointNet by using RGBD diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation.



Method Overview




Inspired by ControlNet, we create a copy of the original diffusion model to handle the additional dense modality. To enable information exchange, the two branches are densely connected, and every exchanged tensor passes through a zero-initialized convolution layer, so the forward pass is identical to the original model before any fine-tuning.
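For concreteness, the connection pattern can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than the paper's implementation: the block name, channel count, and the placement of the cross-branch convolutions are assumptions; it only shows how a frozen pre-trained RGB block and a trainable copy for the dense modality exchange features through zero-initialized convolutions.

```python
# Illustrative sketch of a JointNet-style cross-branch connection.
# Names and shapes are placeholders, not the paper's code.
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so it contributes nothing
    until fine-tuning updates its weights."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class JointBlock(nn.Module):
    """Pairs a frozen pre-trained RGB block with a trainable copy for the
    dense modality; features are exchanged via zero-initialized convolutions."""
    def __init__(self, rgb_block: nn.Module, channels: int):
        super().__init__()
        self.rgb_block = rgb_block                    # pre-trained, locked
        self.dense_block = copy.deepcopy(rgb_block)   # trainable copy
        for p in self.rgb_block.parameters():
            p.requires_grad = False
        self.rgb_to_dense = zero_conv(channels)
        self.dense_to_rgb = zero_conv(channels)

    def forward(self, rgb_feat: torch.Tensor, dense_feat: torch.Tensor):
        # Before fine-tuning, both zero convs output zeros, so the RGB path
        # behaves exactly like the original pre-trained block.
        rgb_out = self.rgb_block(rgb_feat) + self.dense_to_rgb(dense_feat)
        dense_out = self.dense_block(dense_feat) + self.rgb_to_dense(rgb_feat)
        return rgb_out, dense_out
```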




Compared to the naive solution of zero-expanding the first and last convolution layers (Direct Extend), JointNet provides a smooth transition to joint training and thus prevents catastrophic forgetting. With Direct Extend, the RGB generation capability degrades in early iterations and only gradually recovers as the model relearns from the new dataset.
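For reference, the sketch below illustrates what "zero-expanding" a convolution means in the Direct Extend baseline: the pre-trained layer is widened to accept the extra depth channels, with the new weights initialized to zero. The 4-channel latent and 320 output channels follow a typical latent-diffusion first layer and are assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of the "Direct Extend" baseline: widen a pre-trained
# convolution so it also accepts depth channels, zero-initializing the new
# weights so the extra inputs initially have no effect.
import torch
import torch.nn as nn

def zero_expand_in_channels(conv: nn.Conv2d, extra_in: int) -> nn.Conv2d:
    """Return a conv taking `extra_in` additional input channels, copying the
    existing weights and zero-initializing the new ones."""
    new_conv = nn.Conv2d(
        conv.in_channels + extra_in, conv.out_channels,
        kernel_size=conv.kernel_size, stride=conv.stride,
        padding=conv.padding, bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example (assumed shapes): a 4-channel latent input layer extended to also
# take 4 depth-latent channels.
first_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)
extended = zero_expand_in_channels(first_conv, extra_in=4)
```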



Joint Generation




Bi-directional Image-Depth Conversion


Once the joint image-depth distribution is captured, bi-directional conversion between the two modalities can be modeled as channel-wise inpainting.
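A minimal sketch of channel-wise inpainting is given below, written against a diffusers-style scheduler interface (`add_noise`, `step`, `timesteps`); the `model` call signature, the channel layout of the joint RGBD tensor, and the mask convention are assumptions rather than the paper's API. The idea is simply to keep overwriting the known channels (RGB for depth prediction, depth for depth-conditioned image generation) with a noised copy of the ground truth at every denoising step, so only the unknown channels are generated.

```python
# Hedged sketch of channel-wise inpainting with a joint RGBD diffusion model.
# `model` and `scheduler` are placeholders; a diffusers-style scheduler
# interface is assumed and scheduler.set_timesteps(...) has already been called.
import torch

@torch.no_grad()
def inpaint_channels(model, scheduler, known, known_mask, text_emb):
    """known:      (B, C, H, W) joint RGBD tensor with the given channels filled in.
    known_mask: (B, C, 1, 1) binary mask, 1 for channels that are observed."""
    x = torch.randn_like(known)
    for t in scheduler.timesteps:
        # Overwrite the observed channels with a noised copy of the ground
        # truth at the current noise level.
        noise = torch.randn_like(known)
        known_noisy = scheduler.add_noise(known, noise, t)
        x = known_mask * known_noisy + (1 - known_mask) * x
        # One reverse-diffusion step on the full joint tensor.
        eps = model(x, t, text_emb)          # assumed model signature
        x = scheduler.step(eps, t, x).prev_sample
    # Re-impose the clean observed channels in the final result.
    return known_mask * known + (1 - known_mask) * x
```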



High-res Depth Refinement


The low-res depth maps from MiDaS can be upsampled and refined by tile-based diffusion.
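One possible way to realize this tile-based pass is to upsample the coarse depth map, refine overlapping crops independently, and average the overlaps. The sketch below assumes a `refine_tile` helper (e.g., the channel-wise inpainting above run on a crop) and illustrative tile/overlap sizes; none of these are the paper's settings, and it assumes the image is at least one tile large.

```python
# Illustrative sketch of tile-based refinement of an upsampled depth map.
# `refine_tile` is an assumed helper standing in for a diffusion refinement call.
import torch
import torch.nn.functional as F

def refine_depth_tiled(rgb, coarse_depth, refine_tile, tile=512, overlap=64):
    """rgb: (1, 3, H, W); coarse_depth: (1, 1, h, w) low-res prediction."""
    H, W = rgb.shape[-2:]
    # Upsample the coarse prediction to full resolution first.
    depth = F.interpolate(coarse_depth, size=(H, W), mode="bilinear",
                          align_corners=False)
    out = torch.zeros_like(depth)
    weight = torch.zeros_like(depth)
    stride = tile - overlap
    for y in range(0, H - overlap, stride):
        for x in range(0, W - overlap, stride):
            # Clamp the window so the last tiles stay inside the image.
            y0, x0 = min(y, H - tile), min(x, W - tile)
            rgb_t = rgb[..., y0:y0 + tile, x0:x0 + tile]
            dep_t = depth[..., y0:y0 + tile, x0:x0 + tile]
            refined = refine_tile(rgb_t, dep_t)
            out[..., y0:y0 + tile, x0:x0 + tile] += refined
            weight[..., y0:y0 + tile, x0:x0 + tile] += 1.0
    # Average overlapping regions.
    return out / weight.clamp(min=1.0)
```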



RGBD Panorama Generation


A beach with palm trees

A photo of a beautiful ocean with coral reef

A photo of a botanical garden




Acknowledgements: The website template was borrowed from Lior Yariv. Image sliders are based on dics.