This is a ~10M parameter diffusion model trained on LibriSpeech train-clean-360. It can roughly be reproduced by running the command below for 277K iterations:

```
python3 -u train_diffusion.py \
    --predictor unet \
    --base-channels 32 \
    --grad-checkpoint \
    --batch 4 \
    --ema-rate 0.9999 \
    $DATA/LibriSpeech/train-clean-360
```

The final quartile losses for the model were:

```
q0=0.02244 q1=0.00119 q2=0.00010 q3=0.00003
```

When sampling with 100 steps (and using x0 constraints), the samples sound like noisy babbling. The samples seem to use several different voices (although voice is mostly consistent within a sample), ranging widely in pitch. The volume also appears to vary widely.

For evaluations, I sampled with this command:

```
python3 sample_diffusion.py \
    --checkpoint-path model_diffusion_ema.pt \
    --sample-steps 50 \
    --schedule 'lambda x: x**2' \
    --constrain \
    --num-samples 10000 \
    --batch-size 16 \
    --sample-path samples-50step-sqschedule
```

 * Class score: 47.1 (higher is better)
 * Frechet score: 2494 (lower is better)
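
For reference, the `--schedule 'lambda x: x**2'` flag suggests the sampler's timesteps are warped by a quadratic function rather than spaced uniformly. The exact convention used by `sample_diffusion.py` is not shown here, so the following is only a minimal sketch of that kind of remapping, assuming evenly spaced times in [0, 1] are passed through the schedule before the reverse-diffusion steps:

```python
# Sketch only: one common way a schedule like `lambda x: x**2` can be applied.
# The actual convention in sample_diffusion.py may differ.
import numpy as np

def make_timesteps(num_steps, schedule=lambda x: x**2):
    # Evenly spaced times from 1.0 (pure noise) down to 0.0 (clean audio).
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    # Squaring shrinks the gaps near t=0, so more steps are spent at low
    # noise levels where fine detail is refined.
    return schedule(ts)

print(make_timesteps(5))  # [1.   0.64 0.36 0.16 0.04 0.  ]
```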
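
The Fréchet score above is reported by the repo's evaluation tooling; as a rough guide, a score of this kind is typically the Fréchet distance between Gaussians fitted to classifier features of real and generated samples (as in FID). The feature extractor is not shown here, so this is only a sketch of the distance itself under that assumption:

```python
# Sketch (assumption): standard Frechet distance between feature statistics,
# not necessarily the exact computation used by this repo's evaluation script.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    # feats_*: arrays of shape [num_samples, feature_dim].
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary parts from
    # numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```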