ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text


(April 2025) Official implementation of ColorizeDiffusion.

ColorizeDiffusion is a Stable Diffusion-based sketch colorization framework that achieves high-quality colorization results with arbitrary sketch-reference input pairs.

The foundational paper for this repository is ColorizeDiffusion (arXiv e-print).
- Version 1: base training, 512px. Released; checkpoint names start with mult.
- Version 1.5: resolves spatial entanglement, 512px. Released; checkpoint names start with switch.
- Version 2: enhanced background and style transfer, 768px. Released; checkpoint names start with v2.
- Version XL: enhanced embedding guidance for character colorization and geometry disentanglement, 1024px. Available soon.

Getting Started


conda env create -f environment.yaml
conda activate hf

User Interface


We implement a fully featured UI. To launch it, run:

python -u app.py

The default server address is http://localhost:7860.
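Port 7860 is the Gradio default. If you need to expose the UI on another host or port, the usual Gradio pattern looks like the sketch below; this is illustrative only, and the actual entry point and argument handling live in app.py:

```python
# Illustrative only: a typical Gradio launch with an explicit host and port.
import gradio as gr

def build_demo() -> gr.Blocks:
    with gr.Blocks() as demo:
        ...  # sketch/reference upload widgets and inference options
    return demo

if __name__ == "__main__":
    build_demo().launch(server_name="0.0.0.0", server_port=7860)
```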

Important inference options

| Option | Description |
|--------|-------------|
| BG enhance | Low-level feature injection for v2 models. |
| FG enhance | Not used by the currently open-sourced models. |
| Reference strength | Decrease it to increase semantic fidelity to the sketch input. |
| Foreground strength | Similar to reference strength, but applied only to the foreground region. Requires FG enhance or BG enhance to be activated. |
| Preprocessor | Sketch preprocessing. Extract is suggested if the sketch input is a complicated pencil drawing. |
| Line extractor | Line extractor used when the preprocessor is set to Extract. |
| Sketch guidance scale | Classifier-free guidance scale for the sketch image; 1 is suggested (see the sketch below). |
| Attention injection | Noised low-level feature injection; roughly doubles inference time. |
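For reference, the sketch guidance scale follows the standard classifier-free guidance formulation, in which the model extrapolates from an unconditional prediction towards the sketch-conditioned one; a scale of 1 simply returns the conditional prediction. A minimal illustrative sketch (not the repository's exact code):

```python
import torch

def cfg_combine(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance combination.

    scale == 1 returns the sketch-conditioned prediction unchanged;
    larger values push the result further away from the unconditional one.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```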

768-level Cross-content colorization results (from v2)


1536-level Character colorization results (from XL)


Manipulation


The colorization results can be manipulated with text prompts; see ColorizeDiffusion (arXiv e-print) for details.

Manipulation is deactivated by default. To activate it, run:

python -u app.py -manipulate

For local manipulations, a visualization is provided to show the correlation between each prompt and tokens in the reference image.

The manipulation result and correlation visualization for the following settings:

Target prompt: the girl's blonde hair
Anchor prompt: the girl's brown hair
Control prompt: the girl's brown hair
Target scale: 8
Enhanced: false
Thresholds: 0.5, 0.55, 0.65, 0.95

As shown above, the manipulation unavoidably changes some unrelated regions, since it is applied to the reference embeddings.
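Conceptually, a global manipulation of this kind shifts the reference embedding along the direction from the anchor prompt to the target prompt, which is why unrelated regions can also be affected. A minimal sketch of the idea (hypothetical function and tensor names, not the repository's code):

```python
import torch

def manipulate_reference(ref_tokens: torch.Tensor,
                         e_target: torch.Tensor,
                         e_anchor: torch.Tensor,
                         target_scale: float) -> torch.Tensor:
    """Shift reference image tokens along the anchor-to-target text direction.

    ref_tokens: (N, D) reference image token embeddings.
    e_target, e_anchor: (D,) text embeddings of the target and anchor prompts.
    """
    direction = e_target - e_anchor
    direction = direction / (direction.norm() + 1e-8)
    # Every token receives the same shift, so unrelated regions can change too.
    return ref_tokens + target_scale * direction
```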

Manipulation options

| Option | Description |
|--------|-------------|
| Group index | The index of the selected manipulation sequence's parameter group. |
| Target prompt | The prompt specifying the desired visual attribute of the image after manipulation. |
| Anchor prompt | The prompt specifying the anchored visual attribute of the image before manipulation. |
| Control prompt | Used for local manipulation (cross-attention-based models). The prompt specifying the target regions. |
| Enhance | Whether this manipulation should be enhanced (more likely to influence unrelated attributes). |
| Target scale | The scale used to progressively control the manipulation. |
| Thresholds | Used for local manipulation (cross-attention-based models). Four hyperparameters that reduce the influence on irrelevant visual attributes, where 0.0 < threshold 0 < threshold 1 < threshold 2 < threshold 3 < 1.0 (see the sketch after this table). |
| < threshold 0 | Regions most related to the control prompt. Indicated by deep blue. |
| threshold 0 - threshold 1 | Regions related to the control prompt. Indicated by blue. |
| threshold 1 - threshold 2 | Neighbouring but unrelated regions. Indicated by green. |
| threshold 2 - threshold 3 | Unrelated regions. Indicated by orange. |
| > threshold 3 | Most unrelated regions. Indicated by brown. |
| Add | Click Add to save the current manipulation in the sequence. |
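As a rough illustration of how the four thresholds partition the reference tokens into the five color-coded regions, the sketch below assumes a per-token score in [0, 1] where smaller values mean the token is more related to the control prompt; the score definition and function name are hypothetical, not the repository's implementation:

```python
import torch

def bucket_reference_tokens(scores: torch.Tensor,
                            thresholds=(0.5, 0.55, 0.65, 0.95)) -> torch.Tensor:
    """Map per-token relatedness scores to the five regions in the table above.

    scores: (N,) values in [0, 1]; smaller = more related to the control prompt
            (hypothetical definition, e.g. 1 - normalized correlation).
    Returns indices 0..4:
        0 -> most related (deep blue), 1 -> related (blue),
        2 -> neighbouring (green), 3 -> unrelated (orange),
        4 -> most unrelated (brown).
    """
    edges = torch.as_tensor(thresholds, dtype=scores.dtype)
    return torch.bucketize(scores, edges)
```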

Training

Our implementation is based on Accelerate and DeepSpeed.
Before starting training, collect your data and organize the training dataset as follows:

[dataset_path]
├── image_list.json    # Optional, for image indexing
├── color/             # Color images
│   ├── 0001.zip
│   │   ├── 10001.png
│   │   ├── 100001.jpg
│   │   └── ...
│   ├── 0002.zip
│   └── ...
├── sketch/            # Sketch images
│   ├── 0001.zip
│   │   ├── 10001.png
│   │   ├── 100001.jpg
│   │   └── ...
│   ├── 0002.zip
│   └── ...
└── mask/              # Mask images (required for fg-bg training)
    ├── 0001.zip
    │   ├── 10001.png
    │   ├── 100001.jpg
    │   └── ...
    ├── 0002.zip
    └── ...

For details of dataset organization, check data/dataloader.py.
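The exact indexing and pairing logic lives in data/dataloader.py; as a rough illustration of the layout above, a paired sample could be read from the zip shards like this (hypothetical helper, not the repository's loader):

```python
import io
import zipfile
from pathlib import Path

from PIL import Image

def load_pair(dataset_root: str, shard: str, name: str):
    """Load the (color, sketch) pair stored under the same shard and file name."""
    root = Path(dataset_root)
    images = []
    for split in ("color", "sketch"):
        with zipfile.ZipFile(root / split / shard) as zf:
            with zf.open(name) as f:
                images.append(Image.open(io.BytesIO(f.read())).convert("RGB"))
    return tuple(images)

# e.g. color, sketch = load_pair("/path/to/dataset", "0001.zip", "10001.png")
```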
Training command example:

accelerate launch --config_file [accelerate_config_file] \
    train.py \
    --name base \
    --dataroot [dataset_path] \
    --batch_size 64 \
    --num_threads 8 \
    -cfg configs/train/sd2.1/mult.yaml \
    -pt [pretrained_model_path]

Refer to options.py for training/inference/validation arguments.
Note that the batch size here is the micro batch size per GPU; if you run the command on 8 GPUs, the total batch size is 512.
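In other words, the effective (global) batch size is the per-GPU micro batch size multiplied by the number of GPUs and, if configured in the Accelerate/DeepSpeed config, the gradient accumulation steps:

```python
# Effective batch size for the command above (gradient accumulation assumed to be 1).
micro_batch_size = 64   # --batch_size, per GPU
num_gpus = 8
grad_accum_steps = 1    # taken from the Accelerate/DeepSpeed config, if any

effective_batch_size = micro_batch_size * num_gpus * grad_accum_steps  # 512
```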

Code reference

  1. Stable Diffusion v2
  2. Stable Diffusion XL
  3. SD-webui-ControlNet
  4. Stable-Diffusion-webui
  5. K-diffusion
  6. Deepspeed
  7. sketchKeras-PyTorch

Citation

@article{2024arXiv240101456Y,
       author = {{Yan}, Dingkun and {Yuan}, Liang and {Wu}, Erwin and {Nishioka}, Yuma and {Fujishiro}, Issei and {Saito}, Suguru},
        title = "{ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text}",
      journal = {arXiv e-prints},
         year = {2024},
          doi = {10.48550/arXiv.2401.01456},
}

@InProceedings{Yan_2025_WACV,
    author    = {Yan, Dingkun and Yuan, Liang and Wu, Erwin and Nishioka, Yuma and Fujishiro, Issei and Saito, Suguru},
    title     = {ColorizeDiffusion: Improving Reference-Based Sketch Colorization with Latent Diffusion Model},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2025},
    pages     = {5092-5102}
}

@article{2025arXiv250219937Y,
    author = {{Yan}, Dingkun and {Wang}, Xinrui and {Li}, Zhuoru and {Saito}, Suguru and {Iwasawa}, Yusuke and {Matsuo}, Yutaka and {Guo}, Jiaxian},
    title = "{Image Referenced Sketch Colorization Based on Animation Creation Workflow}",
    journal = {arXiv e-prints},
    year = {2025},
    doi = {10.48550/arXiv.2502.19937},
}

@article{yan2025colorizediffusionv2enhancingreferencebased,
      title={ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities}, 
      author={Dingkun Yan and Xinrui Wang and Yusuke Iwasawa and Yutaka Matsuo and Suguru Saito and Jiaxian Guo},
      year={2025},
      journal = {arXiv e-prints},
      doi = {10.48550/arXiv.2504.06895},
}