This is a ~60M parameter VQ-VAE with a small MFCC-based encoder and a ~50M parameter diffusion decoder. Both components expect µ-law encoded audio as input. The model encodes audio at a rate of 50 codes per second, where each code is 14 bits (as constrained by the VQ layer), for an effective bitrate of 700 bits per second. The decoder is speaker-conditional, so the model supports rudimentary speaker conversion. Most of the training flags are stored in run_info.json. However, the learning rate was annealed to 1e-5 near the end of training in an attempt to improve reliability; it is unclear whether this helped.
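For reference, a minimal sketch of what "µ-law encoded inputs" means and of the bitrate implied by the code rate above. The function names here are illustrative, not part of this repo; this uses the standard µ-law companding formula (µ = 255, as in 8-bit telephony).

```python
import math

MU = 255  # standard mu-law companding constant

def mu_law_encode(x: float, mu: int = MU) -> float:
    """Compress a sample in [-1, 1] to [-1, 1] with mu-law companding."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_decode(y: float, mu: int = MU) -> float:
    """Invert mu-law companding back to a linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# Bitrate implied by the code rate stated above:
# 50 codes/s * 14 bits/code = 700 bits/s.
bitrate_bps = 50 * 14
```

Companding compresses the dynamic range before quantization, which is why raw waveform models commonly operate on µ-law samples rather than linear PCM.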