Robust Learning of Diffusion Models with Extremely Noisy Conditions

1 University of Auckland
2 Shandong University
3 University of Melbourne
4 King Abdullah University of Science and Technology

Corresponding author: jingfeng.zhang@auckland.ac.nz
Figure 1: (a) Structure of our robust diffusion model: a lightweight prediction head that predicts pseudo conditions \( \hat{y} \) is added at the output of the U-Net encoder of the diffusion model, and temporal ensembling is then adopted to update the pseudo conditions. (b) Learning dynamics of conditional diffusion models on CIFAR-10 under 40% symmetric noise. Y-axis: controllability (Top-1 ACC of generated images, 1k images/class, 10 classes); X-axis: training epochs. Generations are evaluated using a pretrained CIFAR-10 classifier (Top-1 ACC 92.89%, silver dash-dot line). We compare the pseudo condition (PC, orange curve) and PC with Reverse-time Diffusion Condition (RDC, green curve), both with early stopping (star markers), against the vanilla conditional diffusion [1] (gray curve).

Abstract

Conditional diffusion models achieve generative controllability by incorporating external conditions. However, their performance degrades significantly under noisy conditions, such as corrupted labels in image generation or unreliable observations and states in control policy generation. This paper introduces a robust learning framework to address extremely noisy conditions in conditional diffusion models. We empirically demonstrate that existing noise-robust methods fail when the noise level is high. To overcome this, we propose learning pseudo conditions as surrogates for clean conditions and progressively refining them via temporal ensembling. Additionally, we develop a Reverse-time Diffusion Condition (RDC) technique, which diffuses pseudo conditions to reinforce the memorization effect and further facilitate their refinement. Experimentally, our method achieves state-of-the-art performance across a range of noise levels on both class-conditional image generation and visuomotor policy generation tasks.

Method

Noisy conditions \(\tilde{y}\) entangle the clean conditional distribution \(p_0(x|y)\), producing a noisy conditional distribution \(p_0(x|\tilde{y})\). In this distribution, each noisy condition \(\tilde{y}\) can correspond to demonstrations from multiple clean conditions \(y\). This reduces feature consistency and makes vanilla conditional diffusion poorly controllable (Figure 1(b)). We correct noisy conditions by explicitly breaking this entanglement.

Pseudo Condition (PC)

  1. Initialization: For each demonstration \( x_i \) in the noisy dataset, we initialize a pseudo condition \( \hat{y}_i \) as an all-zero vector of the same shape as \( \tilde{y}_i \), disrupting entangled clusters.
  2. Update: We update \( \hat{y} \) to approximate the clean condition \( y \) using temporal ensembling and early stopping. The per-sample prediction \( \hat{y}_{\phi} \) used in the update is produced by a lightweight prediction head attached to the U-Net encoder of the diffusion model (Figure 1(a)).
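The two steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the momentum value `alpha` and the assumption that the prediction head outputs per-class probabilities are ours.

```python
import numpy as np

def init_pseudo_conditions(n_samples, n_classes):
    """Step 1: initialize every pseudo condition as an all-zero vector,
    disrupting the clusters entangled by the noisy conditions."""
    return np.zeros((n_samples, n_classes), dtype=np.float32)

def temporal_ensemble_update(ensemble, head_probs, alpha=0.6):
    """Step 2: temporal ensembling -- accumulate the prediction head's
    outputs across epochs with an exponential moving average.

    ensemble:   (N, C) current pseudo conditions.
    head_probs: (N, C) this epoch's head predictions for each sample.
    alpha:      EMA momentum (an assumed value).
    """
    return alpha * ensemble + (1.0 - alpha) * head_probs
```

Early stopping then halts the refinement before the model starts memorizing the noisy conditions (Figure 1(b)).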

Figure 2: Evolution of pseudo conditions over training, without (first row) and with (second row) early stopping. In each row, the left panel shows a scatter plot of points with colors ranging from purple (0) to red/blue (classes 1/-1), and the right panel shows the associated loss curve.

Reverse-time Diffusion Condition (RDC)

  1. Augmentation: Inspired by the idea that diffusion training can be interpreted as data augmentation with Gaussian noise [2], we pass the pseudo condition \( \hat{y} \) through a diffusion process to augment its learning: \[ \left\{ \begin{aligned} &\text{Forward SDE:}~~ \mathrm{d}\hat{y} = f(\hat{y}, t)\, \mathrm{d}t + g(t)\, \mathrm{d}\mathbf{w}\text{,}\\ &\text{Reverse SDE:}~~ \mathrm{d}\hat{y} = \left( f(\hat{y}, t) - g(t)^2\, \nabla_{\hat{y}} \log p_t(\hat{y}) \right) \mathrm{d}t + g(t)\, \mathrm{d}\mathbf{\bar{w}} \text{,} \end{aligned} \right. \]
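For the widely used VP-style SDE, the forward marginal \( \hat{y}_t \mid \hat{y}_0 \) is Gaussian with a closed-form mean and standard deviation, so diffused pseudo conditions can be sampled in one step. The sketch below assumes a linear \( \beta \)-schedule with illustrative constants; the paper's actual drift and diffusion coefficients may differ.

```python
import numpy as np

def diffuse_pseudo_condition(y_hat, t, rng, beta_min=0.1, beta_max=20.0):
    """Sample y_t | y_0 under a VP-style forward SDE at time t in [0, 1].

    y_t = mean_coef * y_0 + std * eps,  eps ~ N(0, I),
    where mean_coef and std follow from integrating the linear
    beta-schedule (beta_min, beta_max are assumed constants).
    """
    # log of the mean coefficient: -0.5 * \int_0^t beta(s) ds
    log_mean_coef = -0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mean_coef = np.exp(log_mean_coef)
    std = np.sqrt(1.0 - np.exp(2.0 * log_mean_coef))
    eps = rng.standard_normal(y_hat.shape)
    return mean_coef * y_hat + std * eps
```

At small \( t \) the sample stays close to \( \hat{y} \); at \( t \to 1 \) it approaches pure Gaussian noise, providing progressively stronger augmentation of the pseudo condition.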

Image Generation Conditioned on Noisy Labels

We evaluated class-conditional image generation on noisy CIFAR-10 and CIFAR-100, using class labels as conditions and images as demonstrations. We trained the baseline EDM and our method (PC+RDC) on the CIFAR-10 and CIFAR-100 datasets under symmetric noise (uniform mislabeling) with noise rates \( \eta \) = 20%, 40%, 60%, 80%. Table 1 shows that our method significantly surpasses the baseline EDM across all moderate and extreme noise levels.

Table 1: Conditional Generation Performance Comparison with EDM and TDSM on the CIFAR-10 and CIFAR-100 datasets under Symmetric Noise.
Dataset Noise Level Method FID (↓) IS (↑) Density (↑) Coverage (↑) CW-FID (↓) CW-Density (↑) CW-Coverage (↑)
CIFAR-10 0% EDM 1.92 10.03 103.08 81.90 10.23 102.63 81.57
CIFAR-10 20% EDM 2.00 9.91 100.03 81.13 16.21 88.45 77.80
CIFAR-10 20% Ours 2.02 10.05 107.90 94.28 10.24 106.24 93.84
CIFAR-10 40% EDM 2.07 9.83 100.94 80.93 30.45 73.02 71.63
CIFAR-10 40% Ours 2.17 10.04 105.88 93.80 10.64 102.96 93.18
CIFAR-10 60% EDM 3.67 9.70 99.14 83.99 51.69 53.47 74.12
CIFAR-10 60% Ours 3.23 9.68 99.33 86.84 33.53 68.00 81.56
CIFAR-10 80% EDM 5.84 9.45 99.36 61.73 79.42 38.40 51.18
CIFAR-10 80% Ours 4.25 9.58 103.53 78.73 68.39 47.35 56.70
CIFAR-100 0% EDM 2.51 12.80 87.98 77.63 66.97 82.58 75.78
CIFAR-100 20% EDM 2.96 12.28 83.01 75.02 79.91 66.47 70.11
CIFAR-100 20% Ours 3.18 12.95 98.32 93.52 71.57 90.49 91.51
CIFAR-100 40% EDM 3.36 11.86 81.70 73.92 100.04 49.77 60.64
CIFAR-100 40% Ours 4.60 12.73 84.75 89.25 76.56 75.74 87.90
CIFAR-100 60% EDM 7.07 12.54 93.55 83.53 117.75 42.92 74.37
CIFAR-100 60% Ours 5.57 12.03 91.89 87.45 104.34 67.39 84.03
CIFAR-100 80% EDM 11.13 12.66 92.09 71.53 146.97 25.02 52.57
CIFAR-100 80% Ours 11.50 10.94 83.08 73.03 133.09 35.64 59.98

Figures 3 and 4 illustrate the generation processes of our model trained on CIFAR-10 and CIFAR-100 under 40% symmetric noise. To better visualize the robustness of our model trained with noisy conditions, we fix the initial random noise of the demonstration and vary only the label condition. Taking Figure 3 as an example, each row corresponds to one class in the dataset, and each column shows the generation process under different class conditions starting from the same random noise, allowing a clear comparison of how the condition affects the generated image. It can be observed that, even under 40% noise, our method robustly generates images for all given classes.


Visualization Results on CIFAR-10 under 40% symmetric noise.
Figure 3: Visualization Results on CIFAR-10 under 40% symmetric noise. Each row shows generations from different random noise samples for the same class, while each column shows generations from the same random noise across different classes (from top to bottom: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).


Visualization Results on CIFAR-100 under 40% symmetric noise.
Figure 4: Visualization Results on CIFAR-100 under 40% symmetric noise. Each row shows generations from different random noise samples for the same class, while each column shows generations from the same random noise across different classes (from top to bottom: apple, aquarium fish, baby, bear, beaver, bed, bee, beetle, bicycle, bottle, bowl, boy, bridge, bus, butterfly, camel, can, castle, caterpillar, cattle).

Visuomotor Policy Generation Conditioned on Noisy Images

We conducted visuomotor policy generation experiments on the noisy Push-T dataset, where the task is to push a gray T-shaped block to a green target using image observations. Condition noise is simulated via radial (center magnification) and tangential (misalignment stretch) camera distortions, applied with probability \( \eta \) = 20%, 40%, 60%, 80% at preset intensities. Figure 5 illustrates the differences between the original clean observations and the corresponding synthesized noisy observations in the training set under 60% camera distortion.


Figure 5: Examples from the noisy Push-T dataset under 60% camera distortion. Each subfigure consists of two images: the left image shows the original clean Push-T image observation, and the right image shows the corresponding synthesized noisy observation.
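A minimal sketch of how such distortions could be synthesized, assuming a simple Brown-Conrady-style warp; the coefficients `k_radial` and `k_tangential` are illustrative assumptions, not the paper's preset intensities.

```python
import numpy as np

def distort_observation(img, eta, rng, k_radial=0.15, k_tangential=0.05):
    """Apply a radial + tangential camera warp with probability eta.

    img: (H, W, C) image array. Distortion coefficients are assumed
    values for illustration only.
    """
    if rng.random() >= eta:
        return img  # observation left clean
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Normalized coordinates centered on the image
    x = (xs - w / 2) / (w / 2)
    y = (ys - h / 2) / (h / 2)
    r2 = x**2 + y**2
    # Radial term magnifies toward the center; tangential term stretches
    x_src = x * (1 + k_radial * r2) + k_tangential * (r2 + 2 * x**2)
    y_src = y * (1 + k_radial * r2) + k_tangential * (2 * x * y)
    # Map back to pixel indices and resample (nearest neighbor)
    xi = np.clip((x_src * w / 2 + w / 2).round().astype(int), 0, w - 1)
    yi = np.clip((y_src * h / 2 + h / 2).round().astype(int), 0, h - 1)
    return img[yi, xi]
```

With `eta = 0.6`, roughly 60% of training observations would be warped, matching the noise rates used in the benchmark.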

Given these noisy Push-T datasets, we trained both the baseline Diffusion Policy (DP) and our method. After training, the policies generated by the DP model and our model were evaluated using Target Area Coverage (TAC), which measures the block-target IoU, averaged over three seeds and 500 random initializations. As shown in Table 2, our method outperforms the baseline across all noise levels and demonstrates more stable performance.
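TAC as described reduces to an intersection-over-union between the block's footprint and the target region. A sketch over an assumed boolean workspace grid (the benchmark may compute coverage geometrically rather than on a raster):

```python
import numpy as np

def target_area_coverage(block_mask, target_mask):
    """IoU between the pushed block's footprint and the goal region.

    block_mask, target_mask: boolean arrays over a discretized workspace
    (the grid discretization is an assumption of this sketch).
    """
    inter = np.logical_and(block_mask, target_mask).sum()
    union = np.logical_or(block_mask, target_mask).sum()
    return inter / union if union > 0 else 0.0
```

The reported scores would then be this value averaged over seeds and episode initializations.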


Table 2: TAC Comparison with the Baseline DP on the Noisy Push-T Dataset under Camera Distortion.
Noise Level 20% 40% 60% 80%
DP 76.64 ± 1.67 73.02 ± 2.53 68.35 ± 5.00 68.46 ± 3.31
Ours 80.26 ± 1.07 73.44 ± 1.42 72.74 ± 1.21 71.78 ± 3.24

Figure 6 offers a clear visual comparison of policy generation between our method and the DP baseline. The first row displays policies from the baseline, and the second row shows those from our method, demonstrating that our method completes the pushing task more effectively.


Figure 6: Visualization Results on Push-T under 60% image observation distortion. First row: policies generated by the baseline DP method. Second row: policies generated by our method.

References

  1. [1] Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35, 26565-26577.
  2. [2] Kingma, D., & Gao, R. (2023). Understanding diffusion objectives as the ELBO with simple data augmentation. Advances in Neural Information Processing Systems, 36, 65484-65516.

BibTeX

@article{chen2025robustlearningdiffusionmodels,
  title={Robust Learning of Diffusion Models with Extremely Noisy Conditions},
  author={Chen, Xin and Dobbie, Gillian and Wang, Xinyu and Liu, Feng and Wang, Di and Zhang, Jingfeng},
  journal={arXiv preprint arXiv:2510.10149},
  year={2025},
  url={https://arxiv.org/abs/2510.10149}
}