
These models are released solely for academic research purposes. They are initialized from Stable Diffusion v2.1 base (CreativeML Open RAIL++-M License) and further trained on a subset of the Re-LAION-5B (research-safe) dataset. Users are expected to review and comply with the terms and licenses of both the pretrained model and the training dataset, and bear responsibility for any use of these models.

SD v2-1-base, covariance mismatch experiments

This repository contains versions of the U-Net of Stable Diffusion v2.1 base trained under various settings for the article "Covariance Mismatch in Diffusion Models". The weights are initialized from the pretrained model (Stable Diffusion v2.1 base), and training was done on a 100,000-sample subset of the Re-LAION-5B research-safe dataset.

These models are intended for academic research use only and are not suitable for production deployment.

Training settings:

  • Original: White noise on original data (original): typical training setting with covariance mismatch.
  • Colored noise: Colored noise on original data (colorednoise): the covariance of the noise is modified to match the covariance of the data. The noise is colored in the DCT or DFT domain (colorednoiseDCT, colorednoiseDFT).
  • White data: White noise on whitened data (whitedata): the covariance of the data is modified to match the covariance of the noise. The data is whitened in the DCT or DFT domain (whitedataDCT, whitedataDFT). The loss is either unweighted (whitedata) or weighted by the components' original variance (whitedatacolorloss).
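
To make the three settings concrete, here is a notational sketch. The notation is shorthand introduced here, not taken from the article: $x_0$ is a clean VAE latent with covariance $\Sigma$, $\epsilon$ is the noise, $\bar\alpha_t$ is the usual cumulative noise-schedule coefficient, $B$ is an orthonormal DCT or DFT matrix, and $v$ is the per-frequency variance spectrum. It assumes that $\Sigma$ is approximately diagonal in the chosen basis and ignores any mean subtraction.

```latex
\begin{aligned}
\text{forward process:} \quad & x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon \\
\text{original:} \quad & \operatorname{Cov}(x_0) = \Sigma, \quad \operatorname{Cov}(\epsilon) = I \quad (\text{mismatch when } \Sigma \neq I) \\
\text{colored noise:} \quad & \epsilon = B^{-1} \operatorname{diag}(\sqrt{v})\, B\, z, \quad z \sim \mathcal{N}(0, I) \;\Rightarrow\; \operatorname{Cov}(\epsilon) \approx \Sigma \\
\text{white data:} \quad & \tilde{x}_0 = B^{-1} \operatorname{diag}(1/\sqrt{v})\, B\, x_0 \;\Rightarrow\; \operatorname{Cov}(\tilde{x}_0) \approx I
\end{aligned}
```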

The models are trained for 20,000 steps.

Example usage 1: inference on smaller range of noise levels

In this example, we perform inference over a smaller range of noise levels, $SNR \in [0.0386, 0.1930]$, instead of the original one used in Stable Diffusion v2.1 base ($SNR \in [0.0047, 1175.4403]$). For ease of implementation, we keep the original noise schedule and only use the range of timesteps $t \in [600, 800]$ (instead of $t \in [1, 1000]$).
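
The correspondence between timesteps and SNR values quoted above can be checked with a few lines of plain Python. This is a minimal sketch that assumes the scheduler uses the usual Stable Diffusion "scaled_linear" schedule with `beta_start=0.00085`, `beta_end=0.012` and 1000 training timesteps. These values are not stated in this card, so check them against the scheduler config before relying on them. The function name `snr_per_timestep` is hypothetical.

```python
import math

# Assumed scheduler settings (the usual Stable Diffusion "scaled_linear" schedule).
NUM_TRAIN_TIMESTEPS = 1000
BETA_START = 0.00085
BETA_END = 0.012


def snr_per_timestep():
    """Return SNR = alpha_bar / (1 - alpha_bar) for each timestep index 0..999."""
    lo, hi = math.sqrt(BETA_START), math.sqrt(BETA_END)
    snrs = []
    alpha_bar = 1.0
    for i in range(NUM_TRAIN_TIMESTEPS):
        # "scaled_linear": betas are linear in sqrt-space, then squared.
        beta = (lo + (hi - lo) * i / (NUM_TRAIN_TIMESTEPS - 1)) ** 2
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


snr = snr_per_timestep()
for index in (0, 599, 799, 999):
    print(f"timestep index {index}: SNR = {snr[index]:.4f}")
```

Under these assumptions, timestep indices 599 and 799 (the `start` and `stop` values used in the script) give SNR values of about 0.193 and 0.0386, and index 999 gives about 0.0047. In double precision the largest SNR comes out near 1175.47 rather than the quoted 1175.4403; the small gap is presumably a single-precision rounding effect.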

[Figure: smaller range of noise levels]

import numpy as np
from diffusers import DiffusionPipeline, EulerDiscreteScheduler, UNet2DConditionModel

subfolder = "unet_colorednoiseDCT" # choose among "unet_original", "unet_colorednoiseDCT", "unet_colorednoiseDFT", "unet_whitedataDCT", "unet_whitedataDFT", "unet_whitedatacolorlossDCT", "unet_whitedatacolorlossDFT", "unet_original_600_800", "unet_colorednoiseDCT_600_800", "unet_colorednoiseDFT_600_800"

unet = UNet2DConditionModel.from_pretrained("EPFL-IVRL/sd2.1-base-covariance-mismatch", subfolder=subfolder)    
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base", unet=unet)    
pipe = pipe.to("cuda")

# EulerDiscreteScheduler to allow custom timesteps, "leading" to compute "init_noise_sigma" correctly when using custom timesteps
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config, timestep_spacing="leading")       
from torch_dct import dct_2d, idct_2d
import torch
from diffusers.utils import _get_model_file
from safetensors.torch import load_file

stats = load_file(_get_model_file("EPFL-IVRL/sd2.1-base-covariance-mismatch", weights_name="stats.safetensors", subfolder="relaion2B-en-research-safe-subset100000/fourier_stats"))
variance_spectrum_dft = stats["variance_spectrum_vae64"].to("cuda")
stats = load_file(_get_model_file("EPFL-IVRL/sd2.1-base-covariance-mismatch", weights_name="stats.safetensors", subfolder="relaion2B-en-research-safe-subset100000/dct_stats"))
variance_spectrum_dct = stats["variance_spectrum_vae64"].to("cuda")

# Generate
prompt = "A colorful castle, vibrant colors, detailed."
generator = torch.manual_seed(123456)
initial_noise = torch.randn((1, 4, 64, 64), generator=generator).to("cuda")

# Color initial noise if necessary (if the model is trained with colored noise)
dct = dct_2d(initial_noise, norm='ortho')
dct *= torch.sqrt(variance_spectrum_dct)
initial_noise = idct_2d(dct, norm='ortho')
#ft = torch.fft.fftshift(torch.fft.fftn(initial_noise, dim=(-2, -1), norm="ortho"), dim=(-2, -1))
#ft *= torch.sqrt(variance_spectrum_dft)
#initial_noise = torch.real(torch.fft.ifftn(torch.fft.ifftshift(ft, dim=(-2, -1)), dim=(-2, -1), norm="ortho"))

# Inference timesteps
start = 799
stop = 599
num_steps = 50
timesteps = np.linspace(start, stop, num_steps).astype(int).tolist()

# Generate
generated = pipe(
    prompt=prompt,
    timesteps=timesteps,
    latents=initial_noise,
    output_type="latent",
).images

# Unwhiten (color) generated VAE code if necessary (if the model generates white data)
#dct = dct_2d(generated, norm='ortho')
#dct *= torch.sqrt(variance_spectrum_dct)
#generated = idct_2d(dct, norm='ortho')
#ft = torch.fft.fftshift(torch.fft.fftn(generated, dim=(-2, -1), norm="ortho"), dim=(-2, -1))
#ft *= torch.sqrt(variance_spectrum_dft)
#generated = torch.real(torch.fft.ifftn(torch.fft.ifftshift(ft, dim=(-2, -1)), dim=(-2, -1), norm="ortho"))
    
image = pipe.vae.decode(1 / 0.18215 * generated).sample[0]                
image = (image / 2 + 0.5).clamp(0, 1)
image = image.cpu().detach().permute(1, 2, 0).numpy()               
image_pil = pipe.numpy_to_pil(image)[0]
image_pil.show()

[Figure: generated image for each training setting: unet_original, unet_colorednoiseDCT, unet_colorednoiseDFT, unet_whitedataDCT, unet_whitedataDFT, unet_whitedatacolorlossDCT, unet_whitedatacolorlossDFT, unet_original_600_800, unet_colorednoiseDCT_600_800, unet_colorednoiseDFT_600_800]
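
The coloring and unwhitening steps in the script above are the same operation: transform to the frequency domain, scale each frequency by the square root of its variance, and transform back. Whitening is the inverse, dividing instead of multiplying. The sketch below illustrates this round trip with NumPy's orthonormal 2D FFT on a made-up variance spectrum. The names `toy_spectrum`, `color` and `whiten` are hypothetical, and the real statistics are the ones loaded from `stats.safetensors`. Unlike the script, the toy spectrum is stored in unshifted FFT layout, so no `fftshift` is needed.

```python
import numpy as np


def color(x, variance_spectrum):
    """Scale each 2D frequency of x by sqrt(variance) (white -> colored)."""
    ft = np.fft.fft2(x, norm="ortho")
    return np.real(np.fft.ifft2(ft * np.sqrt(variance_spectrum), norm="ortho"))


def whiten(x, variance_spectrum):
    """Inverse of color: divide each 2D frequency by sqrt(variance)."""
    ft = np.fft.fft2(x, norm="ortho")
    return np.real(np.fft.ifft2(ft / np.sqrt(variance_spectrum), norm="ortho"))


# Toy, made-up spectrum: more variance at low frequencies, symmetric in +/- frequency
# so that real inputs stay real after filtering.
fy = np.fft.fftfreq(64)[:, None]
fx = np.fft.fftfreq(64)[None, :]
toy_spectrum = 1.0 / (1.0 + 100.0 * (fx**2 + fy**2))

rng = np.random.default_rng(0)
white = rng.standard_normal((4, 64, 64))  # fft2 acts on the last two axes
colored = color(white, toy_spectrum)
recovered = whiten(colored, toy_spectrum)

print("max round-trip error:", np.abs(recovered - white).max())
print("variance before / after coloring:", white.var(), colored.var())
```

Because the two operations are exact inverses, a model trained on whitened data needs its output colored again before VAE decoding, which is what the commented-out "Unwhiten" lines in the script do.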

Example usage 2: training and inference on a single noise level

In this example, we perform inference over a single noise level, $SNR = 0.1930$, instead of the wide range of noise levels used in Stable Diffusion v2.1 base ($SNR \in [0.0047, 1175.4403]$). For ease of implementation, we keep the original noise schedule and only use the timestep $t = 600$ (instead of $t \in [1, 1000]$).

import numpy as np
from diffusers import DiffusionPipeline, EulerDiscreteScheduler, UNet2DConditionModel

subfolder = "unet_colorednoiseDCT_600" # choose among "unet_original_600", "unet_colorednoiseDCT_600", "unet_colorednoiseDFT_600"

unet = UNet2DConditionModel.from_pretrained("EPFL-IVRL/sd2.1-base-covariance-mismatch", subfolder=subfolder)    

# As the model is trained on a single timestep, we patch unet.forward to always use this timestep instead of the provided one.
# Note: `start` is looked up when the patched forward is called, so it only needs to be defined (see below) before the pipeline runs.
original_forward = unet.forward
unet.forward = lambda sample, _, encoder_hidden_states, *a, **k: original_forward(sample, start, encoder_hidden_states, *a, **k)

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base", unet=unet)    
pipe = pipe.to("cuda")

# EulerDiscreteScheduler to allow custom timesteps, "leading" to compute "init_noise_sigma" correctly when using custom timesteps
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config, timestep_spacing="leading")       
from torch_dct import dct_2d, idct_2d
import torch
from diffusers.utils import _get_model_file
from safetensors.torch import load_file

stats = load_file(_get_model_file("EPFL-IVRL/sd2.1-base-covariance-mismatch", weights_name="stats.safetensors", subfolder="relaion2B-en-research-safe-subset100000/fourier_stats"))
variance_spectrum_dft = stats["variance_spectrum_vae64"].to("cuda")
stats = load_file(_get_model_file("EPFL-IVRL/sd2.1-base-covariance-mismatch", weights_name="stats.safetensors", subfolder="relaion2B-en-research-safe-subset100000/dct_stats"))
variance_spectrum_dct = stats["variance_spectrum_vae64"].to("cuda")

# Generate
prompt = "A colorful castle, vibrant colors, detailed."
generator = torch.manual_seed(12345678)
initial_noise = torch.randn((1, 4, 64, 64), generator=generator).to("cuda")

# Color initial noise if necessary (if the model is trained with colored noise)
dct = dct_2d(initial_noise, norm='ortho')
dct *= torch.sqrt(variance_spectrum_dct)
initial_noise = idct_2d(dct, norm='ortho')
#ft = torch.fft.fftshift(torch.fft.fftn(initial_noise, dim=(-2, -1), norm="ortho"), dim=(-2, -1))
#ft *= torch.sqrt(variance_spectrum_dft)
#initial_noise = torch.real(torch.fft.ifftn(torch.fft.ifftshift(ft, dim=(-2, -1)), dim=(-2, -1), norm="ortho"))

# Inference timesteps
start = 599
stop = 199
num_steps = 3
timesteps = np.linspace(start, stop, num_steps).astype(int).tolist()

# Generate
generated = pipe(
    prompt=prompt,
    timesteps=timesteps,
    latents=initial_noise,
).images[0].show()

[Figure: generated image for each training setting: unet_original_600, unet_colorednoiseDCT_600, unet_colorednoiseDFT_600]
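
The `unet.forward` patch used above can be checked in isolation. The sketch below uses a hypothetical stand-in class (`ToyUNet`, not a diffusers class) that only records the timestep it receives. It shows that the sampling loop can step through several timesteps while the patched model always sees the training timestep. The timestep list is built the same way as in the script, here without NumPy.

```python
class ToyUNet:
    """Hypothetical stand-in for a U-Net: records the timesteps it is called with."""

    def __init__(self):
        self.seen_timesteps = []

    def forward(self, sample, timestep, encoder_hidden_states, **kwargs):
        self.seen_timesteps.append(timestep)
        return sample

    def __call__(self, *args, **kwargs):
        # Like torch.nn.Module, dispatch to the (possibly patched) instance attribute.
        return self.forward(*args, **kwargs)


train_timestep = 599  # the single timestep the model was trained on
unet = ToyUNet()

# Patch forward so the timestep supplied by the caller is ignored.
original_forward = unet.forward
unet.forward = lambda sample, _, encoder_hidden_states, **k: original_forward(
    sample, train_timestep, encoder_hidden_states, **k
)

# Equivalent of np.linspace(599, 199, 3).astype(int).tolist()
start, stop, num_steps = 599, 199, 3
timesteps = [int(start + (stop - start) * i / (num_steps - 1)) for i in range(num_steps)]

for t in timesteps:
    unet("latents", t, encoder_hidden_states="text embedding")

print(timesteps)            # timesteps requested by the sampling loop
print(unet.seen_timesteps)  # timesteps actually passed to the model
```

The scheduler still uses the requested timesteps to set its step sizes; only the conditioning passed to the model is pinned to the training timestep.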

Model Description

Citation

@article{everaert2024covariancemismatch,
    author   = {Everaert, Martin Nicolas and Süsstrunk, Sabine and Achanta, Radhakrishna},
    title    = {{C}ovariance {M}ismatch in {D}iffusion {M}odels}, 
    journal  = {Infoscience preprint Infoscience:20.500.14299/242173},
    month    = {November},
    year     = {2024},
}

Training details

  • Dataset size: 100k image-caption pairs from Re-LAION-5B research-safe
  • Hardware: 1 × NVIDIA A100-SXM4-80GB
  • Pretrained model: Stable Diffusion v2.1 base
  • Optimizer: AdamW (32-bit, no quantization)
    • betas: (0.9, 0.999)
    • weight_decay: 0.01
    • eps: 1e-08
    • lr: Constant 1e-05
  • Batch size: 32 (no gradient accumulation)
  • Caption dropout: 10%
  • Exponential Moving Average (EMA) decay: 0.99
  • Training steps: 20,000
  • Training Time:
    • unet_original: 9h52min
    • unet_colorednoiseDCT: 9h51min
    • unet_colorednoiseDFT: 9h55min
    • unet_whitedataDCT: 9h47min
    • unet_whitedataDFT: 9h47min
    • unet_whitedatacolorlossDCT: 9h50min
    • unet_whitedatacolorlossDFT: 9h43min
    • unet_original_600_800: 9h51min
    • unet_colorednoiseDCT_600_800: 9h52min
    • unet_colorednoiseDFT_600_800: 9h53min
    • unet_original_600: 9h48min
    • unet_colorednoiseDCT_600: 9h48min
    • unet_colorednoiseDFT_600: 9h54min
  • Covariance realignment method:
    • unet_original: no covariance realignment, original data (not whitened), white noise (not colored), no reweighting of components in the loss
    • unet_colorednoiseDCT: original data, colored noise (DCT), no reweighting
    • unet_colorednoiseDFT: original data, colored noise (DFT), no reweighting
    • unet_whitedataDCT: whitened data (DCT), white noise, no reweighting
    • unet_whitedataDFT: whitened data (DFT), white noise, no reweighting
    • unet_whitedatacolorlossDCT: whitened data (DCT), white noise, loss reweighted by component variance
    • unet_whitedatacolorlossDFT: whitened data (DFT), white noise, loss reweighted by component variance
    • unet_original_600_800: no covariance realignment, original data, white noise, no reweighting
    • unet_colorednoiseDCT_600_800: original data, colored noise (DCT), no reweighting
    • unet_colorednoiseDFT_600_800: original data, colored noise (DFT), no reweighting
    • unet_original_600: original data, white noise, no reweighting
    • unet_colorednoiseDCT_600: original data, colored noise (DCT), no reweighting
    • unet_colorednoiseDFT_600: original data, colored noise (DFT), no reweighting
  • Training range of noise levels:
    • unet_*: full original range $SNR \in [0.0047, 1175.4403]$
    • unet_*_600_800: smaller range $SNR \in [0.0386, 0.1930]$
    • unet_*_600: fixed noise level $SNR = 0.1930$
  • Training loss: [Figures: training loss curve for each model: unet_original, unet_colorednoiseDCT, unet_colorednoiseDFT, unet_whitedataDCT, unet_whitedataDFT, unet_whitedatacolorlossDCT, unet_whitedatacolorlossDFT, unet_original_600_800, unet_colorednoiseDCT_600_800, unet_colorednoiseDFT_600_800, unet_original_600, unet_colorednoiseDCT_600, unet_colorednoiseDFT_600]
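
For readers less familiar with some of the hyperparameters listed above, here is a minimal, framework-free sketch of what EMA decay and caption dropout typically mean. It is not the authors' training code. It assumes that caption dropout replaces the caption with the empty prompt, and that the EMA follows the common convention `ema = decay * ema + (1 - decay) * value`. The helper names `ema_update` and `maybe_drop_caption` are hypothetical.

```python
import random

# Values taken from the list above; everything else is illustrative.
EMA_DECAY = 0.99
CAPTION_DROPOUT = 0.10
BATCH_SIZE = 32
TRAINING_STEPS = 20_000


def ema_update(ema_value, new_value, decay=EMA_DECAY):
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return decay * ema_value + (1.0 - decay) * new_value


def maybe_drop_caption(caption, rng, p=CAPTION_DROPOUT):
    """Replace the caption by the empty prompt with probability p (assumed behavior)."""
    return "" if rng.random() < p else caption


rng = random.Random(0)
captions = [maybe_drop_caption("a photo of a castle", rng) for _ in range(10_000)]
dropped = sum(c == "" for c in captions) / len(captions)
print(f"fraction of dropped captions: {dropped:.3f}")

# A single scalar "weight" that jumps from 0 to 1: the EMA follows with a lag
# of roughly 1 / (1 - decay) = 100 steps.
ema = 0.0
for _ in range(100):
    ema = ema_update(ema, 1.0)
print(f"EMA after 100 steps: {ema:.3f}")

print("samples seen during training:", BATCH_SIZE * TRAINING_STEPS)
```

With a batch size of 32 and 20,000 steps, each model sees 640,000 samples, about 6.4 passes over the 100k-sample subset.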