arxiv:2504.14202

Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis

Published on Apr 19, 2025

Authors:

Abstract

We propose a novel framework for ID-preserving generation that uses a multi-modal encoding strategy rather than injecting identity features into pre-trained models via adapters. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, and this representation conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with the face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline, by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
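The abstract does not give the form of the alignment loss. As a rough illustration only, the idea of aligning one joint representation with three embedding spaces can be sketched as a sum of symmetric contrastive (InfoNCE-style) terms. Everything here is an assumption made for illustration and is not taken from the paper: the function names `info_nce` and `alignment_loss`, the temperature of 0.07, the equal weights, and the premise that the face, text, and image embeddings have already been projected to the same dimension as the joint embedding.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] within a batch.

    Similarities are cosine similarities divided by a temperature.
    The temperature value is an assumption, not a value from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def alignment_loss(joint, face, text, image, weights=(1.0, 1.0, 1.0)):
    """Hypothetical loss: weighted sum of symmetric InfoNCE terms that pull
    the joint embedding toward the face, text, and image embeddings."""
    loss = 0.0
    for weight, target in zip(weights, (face, text, image)):
        loss += weight * 0.5 * (info_nce(joint, target) + info_nce(target, joint))
    return loss


# Toy usage with random vectors standing in for real encoder outputs.
rng = random.Random(0)
batch = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
matched = alignment_loss(batch, batch, batch, batch)
shuffled = alignment_loss(batch[1:] + batch[:1], batch, batch, batch)
print(f"matched pairs: {matched:.4f}, mismatched pairs: {shuffled:.4f}")
```

In this toy setup the loss is lower when each joint embedding is paired with its own face, text, and image embeddings than when the pairing is shifted by one position, which is the behavior a contrastive alignment objective is meant to reward.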
