This is a ~60M parameter VQ-VAE with a small MFCC-based encoder and a ~50M parameter diffusion decoder. Both components expect µ-law encoded audio as input. The model encodes audio at a rate of 50 codes per second, where each code is 14 bits (as constrained by the VQ layer), for an effective bitrate of 700 bits per second. The decoder is speaker-conditional, so the model supports rudimentary speaker conversion. Most of the training flags are stored in run_info.json. However, the learning rate was annealed to 1e-5 near the end of training in an attempt to improve reliability; it is unclear whether this helped.
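For reference, a minimal sketch of what "µ-law encoded inputs" means and of the bitrate implied by the code rate above. The function names here are illustrative, not part of this repo; this uses the standard µ-law companding formula (µ = 255, as in 8-bit telephony).

```python
import math

MU = 255  # standard mu-law companding constant

def mu_law_encode(x: float, mu: int = MU) -> float:
    """Compress a sample in [-1, 1] to [-1, 1] with mu-law companding."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_decode(y: float, mu: int = MU) -> float:
    """Invert mu-law companding back to a linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# Bitrate implied by the code rate stated above:
# 50 codes/s * 14 bits/code = 700 bits/s.
bitrate_bps = 50 * 14
```

Companding compresses the dynamic range before quantization, which is why raw waveform models commonly operate on µ-law samples rather than linear PCM.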