
Machine Learning Project

by Candy Lee 2021. 6. 17.

Code will be uploaded soon!

GitHub : https://github.com/kelee-lab/kelee-MLproj.github.io

Demo Page : https://kelee-lab.github.io/kelee-MLproj.github.io/

 

Key Paper: HiFi-GAN [J. Kong et al., NeurIPS 2020]

(Link : https://arxiv.org/abs/2010.05646)

 

Abstract

We chose HiFi-GAN [4] as our baseline model. Starting from it, we modified the generator's structure by introducing a new processing block, the Mel Supervised Learning Block (MSLB), to generate more realistic and cleaner waveform audio. The main idea of the MSLB is to attach a 1D convolutional layer at each upsampling layer of the generator, extract an intermediate ("middle") mel-spectrogram from it, and compare that against the given ground-truth mel; we call this process mel-based supervised learning. Additionally, we propose an experimental condition in which the input is the raw wav file itself rather than the two-file split format of wave.npy and feats.npy. In the experiments we label this condition "input-wave" and compare it against various model conditions. After training the resulting HGMG model, we found that it generated more natural waveform audio, as judged by pixel-wise mel-spectrogram comparisons.
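Since the project code has not been uploaded yet, the block below is only a minimal sketch of how an MSLB could look in PyTorch (the framework HiFi-GAN is implemented in). The kernel size, the interpolation step, and the L1 comparison are illustrative assumptions, not the final implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MSLB(nn.Module):
    """Mel Supervised Learning Block (illustrative sketch).

    A 1D convolution projects the intermediate feature map of one
    generator upsampling stage down to num_mels channels, yielding a
    "middle" mel-spectrogram that can be supervised directly against
    the ground-truth mel.
    """

    def __init__(self, in_channels: int, num_mels: int = 80):
        super().__init__()
        self.to_mel = nn.Conv1d(in_channels, num_mels, kernel_size=7, padding=3)

    def forward(self, feats: torch.Tensor, gt_mel: torch.Tensor):
        # feats:  (B, C, T') output of one upsampling layer
        # gt_mel: (B, num_mels, T) ground-truth mel-spectrogram
        mid_mel = self.to_mel(feats)
        # Resample the ground truth to this stage's time resolution so
        # the two spectrograms can be compared element-wise.
        gt = F.interpolate(gt_mel, size=mid_mel.size(-1), mode="linear",
                           align_corners=False)
        return mid_mel, F.l1_loss(mid_mel, gt)

In training, one such block would sit after each upsampling layer, and the per-stage losses would presumably be summed into the generator objective.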

Training Models and Conditions (based on config_v2.json)

To evaluate the effect of the different mel-generation pipelines, we append "-wave" to the name of every model trained under the real-time input-wave condition; those models were trained for 1500 epochs, while the precomputed-input models ran for 3100 (a sketch of the two loading paths follows the table).

Model_Log Name   Data Format                  Dataset    (Epochs, Batch Size)
HiFi             wave.npy, feats.npy          NIKL       (3100, 8)
HGMG             wave.npy, feats.npy          NIKL       (3100, 8)
HiFi-wave        Real-time feats (wav input)  NIKL       (1500, 8)
HGMG-wave        Real-time feats (wav input)  NIKL       (1500, 8)
HiFi-LJ          wave.npy, feats.npy          LJSpeech   (3100, 8)
HGMG-LJ          wave.npy, feats.npy          LJSpeech   (3100, 8)
HiFi-wave-LJ     Real-time feats (wav input)  LJSpeech   (1500, 8)
HGMG-wave-LJ     Real-time feats (wav input)  LJSpeech   (1500, 8)
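For concreteness, here is a hypothetical sketch of the two data paths compared above; the function names, file layout, and STFT parameters (a standard 22.05 kHz HiFi-GAN-style setup) are assumptions, not the project's actual loader.

import librosa
import numpy as np

def load_precomputed(stem):
    # Baseline conditions: audio and mel features come from the two
    # precomputed files wave.npy and feats.npy.
    wav = np.load(f"{stem}/wave.npy")
    mel = np.load(f"{stem}/feats.npy")
    return wav, mel

def load_input_wave(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # "-wave" conditions: the raw .wav file is the only stored input;
    # the mel-spectrogram is computed on the fly when the item is loaded.
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression, as is typical for vocoder training targets.
    return wav, np.log(np.clip(mel, 1e-5, None))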

 

References

[1] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499, 2016.

[2] A. van den Oord et al., "Parallel WaveNet: Fast High-Fidelity Speech Synthesis," in Proc. ICML, PMLR, 2018.

[3] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, et al., "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis," in NeurIPS, pp. 14881-14892, 2019.

[4] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," in NeurIPS, 2020.

[5] J. Yang, J. Lee, Y. Kim, H. Cho, and J. Kim, "VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network," in Proc. INTERSPEECH, pp. 200–204, 2020.

[6] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020.

[7] R. Yamamoto, E. Song, M.-J. Hwang, and J.-M. Kim, "Parallel Waveform Synthesis Based on Generative Adversarial Networks with Voicing-Aware Conditional Discriminators," in Proc. ICASSP, 2021.

[8] A. van den Oord, Y. Li, and O. Vinyals, "Representation Learning with Contrastive Predictive Coding," arXiv:1807.03748, 2018.

[9] A. H. Liu, Y.-A. Chung, and J. Glass, "Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies," arXiv:2011.00406, 2020.

[10] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised Pre-training for Speech Recognition," arXiv:1904.05862, 2019.

[11] J.-H. Kim, S.-H. Lee, J.-H. Lee, and S.-W. Lee, "Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis," in Proc. INTERSPEECH, 2021.

 
