
Machine Learning Project

by Candy Lee 2021. 6. 17.

Code will be uploaded soon!

GitHub : https://github.com/kelee-lab/kelee-MLproj.github.io

Demo Page : https://kelee-lab.github.io/kelee-MLproj.github.io/

 

Key Paper: HiFi-GAN [J. Kong et al., NeurIPS 2020]

(Link : https://arxiv.org/abs/2010.05646)

 

Abstract

We chose HiFi-GAN [4] as our baseline model. Starting from it, we modified the generator's structure by introducing a new processing block, the Mel Supervised Learning Block (MSLB), to generate more realistic and cleaner waveform audio. The main idea of the MSLB is to attach a 1D convolutional layer at each upsampling layer of the generator, extract an intermediate ("middle") mel-spectrogram from it, and compare that against the given ground-truth mel; we call this process mel-based supervised learning. Additionally, we propose an experimental condition in which the input is the raw wav file itself rather than the two-file split format of wave.npy and feats.npy. In the experiments we label this condition "input-wave" and compare it against various model conditions. After training the resulting HGMG model, we found that it generated more natural waveform audio, as judged by pixel-wise mel-spectrogram comparisons.
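Since the project code has not been uploaded yet, the block below is only a minimal sketch of how an MSLB could look in PyTorch (the framework HiFi-GAN is implemented in). The kernel size, the interpolation step, and the L1 comparison are illustrative assumptions, not the final implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MSLB(nn.Module):
    """Mel Supervised Learning Block (illustrative sketch).

    A 1D convolution projects the intermediate feature map of one
    generator upsampling stage down to num_mels channels, yielding a
    "middle" mel-spectrogram that can be supervised directly against
    the ground-truth mel.
    """

    def __init__(self, in_channels: int, num_mels: int = 80):
        super().__init__()
        self.to_mel = nn.Conv1d(in_channels, num_mels, kernel_size=7, padding=3)

    def forward(self, feats: torch.Tensor, gt_mel: torch.Tensor):
        # feats:  (B, C, T') output of one upsampling layer
        # gt_mel: (B, num_mels, T) ground-truth mel-spectrogram
        mid_mel = self.to_mel(feats)
        # Resample the ground truth to this stage's time resolution so
        # the two spectrograms can be compared element-wise.
        gt = F.interpolate(gt_mel, size=mid_mel.size(-1), mode="linear",
                           align_corners=False)
        return mid_mel, F.l1_loss(mid_mel, gt)

In training, one such block would sit after each upsampling layer, and the per-stage losses would presumably be summed into the generator objective.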

Training Models and Conditions (based on config_v2.json)

To evaluate the effect of the different mel-generation pipelines, we append "-wave" to the name of every model trained under the real-time input-wave condition; those models were trained for 1500 epochs, while the precomputed-input models ran for 3100 (a sketch of the two loading paths follows the table).

Model_Log Name   Data Format                  Dataset    (Epochs, Batch Size)
HiFi             wave.npy, feats.npy          NIKL       (3100, 8)
HGMG             wave.npy, feats.npy          NIKL       (3100, 8)
HiFi-wave        Real-time feats (wav input)  NIKL       (1500, 8)
HGMG-wave        Real-time feats (wav input)  NIKL       (1500, 8)
HiFi-LJ          wave.npy, feats.npy          LJSpeech   (3100, 8)
HGMG-LJ          wave.npy, feats.npy          LJSpeech   (3100, 8)
HiFi-wave-LJ     Real-time feats (wav input)  LJSpeech   (1500, 8)
HGMG-wave-LJ     Real-time feats (wav input)  LJSpeech   (1500, 8)
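For concreteness, here is a hypothetical sketch of the two data paths compared above; the function names, file layout, and STFT parameters (a standard 22.05 kHz HiFi-GAN-style setup) are assumptions, not the project's actual loader.

import librosa
import numpy as np

def load_precomputed(stem):
    # Baseline conditions: audio and mel features come from the two
    # precomputed files wave.npy and feats.npy.
    wav = np.load(f"{stem}/wave.npy")
    mel = np.load(f"{stem}/feats.npy")
    return wav, mel

def load_input_wave(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # "-wave" conditions: the raw .wav file is the only stored input;
    # the mel-spectrogram is computed on the fly when the item is loaded.
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression, as is typical for vocoder training targets.
    return wav, np.log(np.clip(mel, 1e-5, None))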

 

References

[1] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499, 2016.

[2] A. van den Oord et al., "Parallel WaveNet: Fast High-Fidelity Speech Synthesis," in Proc. ICML, PMLR, 2018.

[3] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, et al., "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis," in NeurIPS, pp. 14881-14892, 2019.

[4] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," in NeurIPS, 2020.

[5] J. Yang, J. Lee, Y. Kim, H. Cho, and J. Kim, "VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network," in Proc. INTERSPEECH, pp. 200–204, 2020.

[6] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020.

[7] R. Yamamoto, E. Song, M.-J. Hwang, and J.-M. Kim, "Parallel Waveform Synthesis Based on Generative Adversarial Networks with Voicing-Aware Conditional Discriminators," in Proc. ICASSP, 2021.

[8] A. van den Oord, Y. Li, and O. Vinyals, "Representation Learning with Contrastive Predictive Coding," arXiv:1807.03748, 2018.

[9] A. H. Liu, Y.-A. Chung, and J. Glass, "Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies," arXiv:2011.00406, 2020.

[10] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised Pre-training for Speech Recognition," arXiv:1904.05862, 2019.

[11] J.-H. Kim, S.-H. Lee, J.-H. Lee, and S.-W. Lee, "Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis," in Proc. INTERSPEECH, 2021.

 
