Generative Adversarial Speaker Embedding Networks for Domain Robust End-to-End Speaker Verification

Gautam Bhattacharya, Joao Monteiro, Jahangir Alam, Patrick Kenny

📌 Abstract

  • Proposes a new approach to domain-invariant speaker embeddings using GANs: the generator produces embeddings from source and target data, and the discriminator identifies whether each embedding comes from the source or the target domain

  • Several GAN variants are trained within this framework and applied to speaker verification

  • The end-to-end model is optimized with an angular margin loss

img




Ⅰ. Introduction

- Speaker embedding: a low-dimensional vector representation that carries information about an individual's identity


✔ Neural network-based speaker embeddings

  • Applied widely, including speech recognition, synthesis and source separation, and speaker verification

✔ End-to-end speaker verification systems

  • Extract an embedding from each of two speech recordings, then compute a score using, for example, the cosine distance between the embeddings
  • For the model to be robust, the distance metric generally has to be optimized directly (end-to-end)
  • However, such models are considered difficult to train for speaker verification
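
The scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are made-up 4-dimensional vectors, the decision threshold is hypothetical, and the score is written as cosine similarity (higher means more similar), which is the usual way a "cosine distance" back end is scored.

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (higher = more similar)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

def verify(emb_enroll, emb_test, threshold=0.5):
    """Accept the trial if the score reaches a (hypothetical) threshold."""
    return cosine_score(emb_enroll, emb_test) >= threshold

# Toy 4-dimensional embeddings, made up for illustration.
enroll = [0.2, 0.9, -0.1, 0.4]
same = [0.25, 0.85, -0.05, 0.35]
other = [-0.7, 0.1, 0.6, -0.2]
print(cosine_score(enroll, same), cosine_score(enroll, other))
```

In practice the threshold would be tuned on held-out trials rather than fixed in advance.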

✔ Embeddings are used in the same way as in i-vector systems

  • LDA (Linear Discriminant Analysis) for dimensionality reduction
  • PLDA (Probabilistic Linear Discriminant Analysis) for verification

✔ Uses the NIST SRE 2016 dataset

  • Introduces a mismatch between the training data (English) and the test data (Cantonese and Tagalog), i.e. domain or covariate shift
  • Provides a small amount of unlabeled target data for domain compensation

✔ In recent work, the authors of this paper proposed training domain-invariant speaker embeddings with domain adversarial training and end-to-end cosine scoring (Domain Adversarial Neural Speaker Embeddings, DANSE)

  • Uses gradient reversal to achieve domain invariance and the minimization objective of the adversarial game


✔ This paper extends that earlier work to unsupervised domain adaptation/invariance using GANs

< Advantages >

  • Provides better gradients than gradient reversal for learning an invariant mapping
  • The GAN framework is more general and more extensible than gradient reversal


✔ Various GAN variants

  • Each produces a different transformation of the feature space
  • Combining these feature spaces improves performance
  • Proposes a modification of the Auxiliary Classifier GAN (AuxGAN)
  • The GAN models outperform the DANSE model
  • Averaging the scores of the different GAN models improves on x-vector performance




Ⅱ. Domain Adaptation with GANs

✔ GAN

  • Generator: maps target data into the domain of the source data
  • Discriminator: distinguishes the domains of source data and target data
img
  • Found that different discriminator configurations, corresponding to different GAN variants, yield different transformations of the feature space
  • In a vanilla GAN, the discriminator is trained by optimizing a binary cross-entropy (BCE) loss


✔ GAN game (original GAN loss)

img

E, D: embedding (generator) and discriminator functions

img
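
For reference, the textbook GAN minimax game applied in the embedding space can be written as below. This is an assumed form, with source embeddings playing the role of the "real" class and target embeddings the "fake" class; the labeling convention and notation of the paper's own equation may differ. $X_s$ and $X_t$ denote the source and target data distributions.

$$
\min_{E} \max_{D} \; \mathbb{E}_{x_s \sim X_s}\big[\log D(E(x_s))\big] + \mathbb{E}_{x_t \sim X_t}\big[\log\big(1 - D(E(x_t))\big)\big]
$$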


✔ Gradient reversal model

img
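
As a rough sketch, gradient reversal in its usual form (as in domain-adversarial training) can be summarized as follows. The symbols $\mathcal{L}_{task}$, $\mathcal{L}_{dom}$, $\lambda$, and $R_\lambda$ are assumptions introduced here for illustration and may not match the paper's notation. The reversal layer $R_\lambda$ is the identity in the forward pass and flips the sign of the gradient in the backward pass:

$$
R_\lambda(h) = h, \qquad \frac{\partial R_\lambda}{\partial h} = -\lambda I
$$

so a single backward pass through $D(R_\lambda(E(x)))$ lets $D$ minimize the domain loss while $E$ is pushed to maximize it:

$$
\min_{E,\,C} \max_{D} \; \mathcal{L}_{task}\big(C(E(x_s)),\, y_s\big) - \lambda\, \mathcal{L}_{dom}\big(D(E(x)),\, d\big)
$$

Here $y_s$ is the speaker label of a source sample and $d$ is the domain label of $x$.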




Ⅲ. Generative Adversarial Speaker Embedding Networks

✔ Goal of this paper

  • The speaker embedding model learns domain-invariant features through a GAN game between the feature extractor (generator) and the domain discriminator
  • The GAN provides domain invariance, while the embeddings must still discriminate between speakers


✔ Loss function (AM-softmax/GAN loss)

  • Directly optimizes the cosine similarity between classes
img

C, E: classifier and embedding (generator) functions


img

s, m: scale factor and margin
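
A minimal sketch of the additive-margin softmax loss for a single sample, assuming its standard form: the target-class cosine is reduced by the margin m, all cosines are scaled by s, and the result goes through a softmax cross-entropy. The default values s = 30 and m = 0.6 are the ones these notes report in the experimental setup; the cosine values are made up.

```python
import math

def am_softmax_loss(cosines, target, s=30.0, m=0.6):
    """AM-softmax loss for one sample.

    cosines: cosine similarity between the embedding and each class weight
    target:  index of the true speaker class
    s, m:    scale factor and additive margin
    """
    # Subtract the margin from the target class only, then scale everything.
    logits = [s * (c - m) if j == target else s * c
              for j, c in enumerate(cosines)]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]

cosines = [0.8, 0.1, -0.3]  # toy values
print(am_softmax_loss(cosines, target=0))         # with margin
print(am_softmax_loss(cosines, target=0, m=0.0))  # plain scaled softmax
```

With the margin, the same cosines give a larger loss, which is what forces the target cosine to exceed the others by at least m.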


  • The domain discriminator is trained with a BCE loss
img
  • Finally, the generator (embedding) is trained with the loss below to fool the discriminator
img
  • The embedding function is updated twice: first with the task loss, then with the adversarial loss
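
The two adversarial losses can be sketched as below. This assumes the common convention that the discriminator labels source embeddings 1 and target embeddings 0, and that the generator is trained with flipped labels on target embeddings; the paper's exact convention may differ. The probabilities stand in for D(E(x)) outputs and are made up.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

def discriminator_loss(d_source, d_target):
    """D is trained to output 1 for source embeddings and 0 for target ones."""
    losses = [bce(p, 1) for p in d_source] + [bce(p, 0) for p in d_target]
    return sum(losses) / len(losses)

def generator_loss(d_target):
    """E is trained with flipped labels so target embeddings look like source."""
    return sum(bce(p, 1) for p in d_target) / len(d_target)

# D(E(x)) outputs for a toy mini-batch (made-up probabilities).
d_source = [0.9, 0.8]
d_target = [0.2, 0.3]
print(discriminator_loss(d_source, d_target))  # small: D separates the domains
print(generator_loss(d_target))                # large: E has not fooled D yet
```

In a training loop, the embedding network would take one update from the speaker classification loss and a second from `generator_loss`, matching the two updates noted above.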


3.1. Auxiliary Classifier GAN

✔ AuxGAN (ACGAN)

  • Augments the GAN with an auxiliary loss for conditional image generation

  • The aim is to predict side information (class labels, etc.)

  • D (discriminator): two classifiers, one deciding whether the data is real or fake and one classifying the data's category

  • G (generator): generates fake data from label information and z (noise)

img

✔ Objective function of the original ACGAN

  • Log-likelihood of the source $L_s$ and log-likelihood of the class $L_c$

    $L_s$: same as the original GAN objective (real/fake decision)
    $L_c$: decides the class of the data (similar to the conditional GAN, CGAN)


  • D (discriminator) maximizes $L_s + L_c$
  • G (generator) maximizes $L_c - L_s$
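
Written out, the two terms of the original ACGAN formulation for image generation are usually given as below, where $S$ is the real/fake source variable and $C$ the class variable. This shows the original ACGAN only; the modified objective used for speaker embeddings is not reproduced here.

$$
L_s = \mathbb{E}\big[\log P(S = \text{real} \mid X_{real})\big] + \mathbb{E}\big[\log P(S = \text{fake} \mid X_{fake})\big]
$$

$$
L_c = \mathbb{E}\big[\log P(C = c \mid X_{real})\big] + \mathbb{E}\big[\log P(C = c \mid X_{fake})\big]
$$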


img


✔ Objective function of the ACGAN used in the paper

img


3.2. GAN Variants

🔹 Uses several GAN variants

  • Standard GAN
  • Least-Squares GAN
  • Relativistic GAN
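
The three variants differ mainly in the discriminator loss. The sketch below writes each loss in its commonly cited form, assuming source embeddings play the "real" role and target embeddings the "fake" role, and pairing source and target samples one-to-one for the relativistic case. The raw discriminator outputs are made-up numbers, and the paper's exact formulations may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean(values):
    return sum(values) / len(values)

def standard_d_loss(c_source, c_target):
    """Standard GAN: BCE on sigmoid outputs (source=1, target=0)."""
    return (-mean([math.log(sigmoid(c)) for c in c_source])
            - mean([math.log(1.0 - sigmoid(c)) for c in c_target]))

def lsgan_d_loss(c_source, c_target):
    """Least-squares GAN: squared error to the targets 1 and 0."""
    return (0.5 * mean([(c - 1.0) ** 2 for c in c_source])
            + 0.5 * mean([c ** 2 for c in c_target]))

def relativistic_d_loss(c_source, c_target):
    """Relativistic GAN: score how much more source-like a source sample
    is than a paired target sample."""
    return -mean([math.log(sigmoid(cs - ct))
                  for cs, ct in zip(c_source, c_target)])

# Raw (pre-sigmoid) discriminator outputs for a toy mini-batch.
c_source = [1.2, 0.8]
c_target = [-0.3, 0.1]
for loss_fn in (standard_d_loss, lsgan_d_loss, relativistic_d_loss):
    print(loss_fn.__name__, loss_fn(c_source, c_target))
```

Each loss is lower when the discriminator separates the two domains well, but the gradients passed back to the embedding network differ between variants.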

🔹 Each variant transforms the feature space in a different way

  • All models show nearly the same performance

🔹 Combining the outputs of all GAN models

  • Averaging the scores (cosine distance scores) gives the best performance




Ⅳ. Experiments and Results

✔ Training data (source)


  • NIST SRE 2004-2010 and Switchboard Cellular audio are used to train the proposed DANSE model and the x-vector and i-vector baselines
  • Data is augmented with noise and reverberation (128K noisy recordings added, 220K used in total)
  • To train the adversarial models, speakers with 5 or fewer utterances are filtered out, leaving about 6000 speakers
  • The x-vector and i-vector systems use the Kaldi toolkit
  • Most speakers are English speakers, recorded over the telephone
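
The speaker filtering step can be sketched as below, assuming the data is available as (speaker, utterance) pairs. The data layout and the helper name `filter_speakers` are assumptions made for illustration; dropping speakers with 5 or fewer utterances corresponds to keeping those with at least 6.

```python
from collections import defaultdict

def filter_speakers(utterances, min_utts=6):
    """Keep only speakers that have at least `min_utts` utterances.

    utterances: list of (speaker_id, utterance_id) pairs
    """
    by_speaker = defaultdict(list)
    for speaker_id, utt_id in utterances:
        by_speaker[speaker_id].append(utt_id)
    return {spk: utts for spk, utts in by_speaker.items() if len(utts) >= min_utts}

# Made-up example: spk_a has 7 utterances, spk_b only 5.
utterances = ([("spk_a", f"a{i}") for i in range(7)]
              + [("spk_b", f"b{i}") for i in range(5)])
kept = filter_speakers(utterances)
print(sorted(kept))
```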


✔ Model


  • The embedding (generator) function consists of a convolutional layer on a 3X 2 3 input, 4 residual blocks, an attentive statistics layer, and 2 fully connected layers (512, 512)
  • The classifier consists of a fully connected layer (64) and an AM-softmax output layer (the fully connected layer gives the final domain-invariant speaker embedding)
  • The discriminator consists of 2 fully connected layers (256, 256) and a binary cross-entropy output layer
  • ELU (Exponential Linear Unit) activations are used on all layers
  • Batch normalization is used on the layers that use the attentive statistics layer
  • The s and m parameters of the AM-softmax loss are set to 30 and 0.6, respectively
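
The attentive statistics layer turns a variable number of frame-level features into one fixed-size vector. The sketch below follows the commonly described form of attentive statistics pooling (softmax attention weights, then a weighted mean and a weighted standard deviation); it is an illustration rather than the paper's exact layer. In a real model the attention scores come from a small learned network, whereas here they are simply passed in.

```python
import math

def attentive_stats_pooling(frames, scores):
    """Pool frame-level features into one utterance-level vector.

    frames: list of T feature vectors (each of dimension D)
    scores: list of T unnormalized attention scores (one per frame)
    Returns the weighted mean and weighted std, concatenated (length 2*D).
    """
    # Softmax over frames, with max subtraction for numerical stability.
    peak = max(scores)
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    # Clamp at zero to absorb small negative values from rounding.
    std = [math.sqrt(max(s - mu ** 2, 0.0)) for s, mu in zip(second, mean)]
    return mean + std

frames = [[1.0, 0.0], [3.0, 2.0]]  # two toy frames of dimension 2
pooled = attentive_stats_pooling(frames, scores=[0.0, 0.0])
print(pooled)
```

With equal scores the weights are uniform, so the output reduces to the ordinary mean and standard deviation over frames.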


✔ Optimization

  • The embedding feature extractor is pre-trained with cross-entropy training
  • The three networks (embedding feature extractor, classifier, discriminator) use different optimizers
  • The discriminator uses RMSprop with lr = 0.003; the classifier and embedding use SGD with lr = 0.001


✔ Data sampling

  • During training, audio chunks are sampled at random from each recording in the training set
  • Each recording is sampled 10 times (one epoch)
  • For each mini-batch of source data, a mini-batch of unlabeled adaptation data is sampled at random in the same way for GAN training
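
The sampling scheme can be sketched as follows. Recordings are represented only by their lengths in frames, and the chunk length, batch size, and helper names are hypothetical choices for illustration; the notes do not state the values actually used.

```python
import random

def sample_chunk(num_frames, chunk_len, rng):
    """Pick a random (start, end) frame window inside one recording."""
    if num_frames <= chunk_len:
        return 0, num_frames
    start = rng.randrange(num_frames - chunk_len + 1)
    return start, start + chunk_len

def sample_batches(source_lengths, target_lengths, batch_size, chunk_len, seed=0):
    """Draw one source mini-batch and one unlabeled target mini-batch."""
    rng = random.Random(seed)
    src_ids = rng.sample(range(len(source_lengths)), batch_size)
    tgt_ids = rng.sample(range(len(target_lengths)), batch_size)
    source = [(i, *sample_chunk(source_lengths[i], chunk_len, rng)) for i in src_ids]
    target = [(i, *sample_chunk(target_lengths[i], chunk_len, rng)) for i in tgt_ids]
    return source, target

source_lengths = [1200, 800, 3000, 500]  # frames per recording (made up)
target_lengths = [900, 1500, 400]
src, tgt = sample_batches(source_lengths, target_lengths, batch_size=2, chunk_len=400)
print(src, tgt)
```

Each entry is a (recording index, start frame, end frame) triple; the source batch would feed both the speaker classifier and the discriminator, while the target batch feeds only the adversarial losses.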


✔ Speaker Verification

  • At test time, the domain discriminator is discarded, since it is not needed for embedding extraction
  • The 64-dimensional last hidden layer gives the final speaker embedding
  • Verification trials are scored with cosine distance
  • Performance is reported as EER (equal error rate)
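
EER is the operating point where the false-accept rate equals the false-reject rate. The sketch below approximates it by sweeping the threshold over the observed scores, assuming higher scores mean "same speaker". The scores are made up, and evaluation toolkits typically interpolate between points rather than taking the closest one.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores.

    Returns the operating point where false-accept and false-reject rates
    are closest, reported as their average.
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False accepts: non-target trials scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # False rejects: target trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

target = [0.9, 0.8, 0.7, 0.4]      # same-speaker trial scores (made up)
nontarget = [0.6, 0.3, 0.2, 0.1]   # different-speaker trial scores (made up)
print(equal_error_rate(target, nontarget))
```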


✔ Model block

img
img


✔ Performance comparison of the proposed adversarial speaker embeddings and the baseline systems

  • Among the baseline systems, the DNN-based x-vector system outperforms the i-vector system just by adding LDA dimensionality reduction
  • All GAN-based models perform better than DANSE
  • Averaging the scores of the AuxGAN (ACGAN), LSGAN, and RelGAN embeddings gives the largest improvement


img




Ⅴ. Conclusion

  • Proposes a new framework for learning domain-invariant speaker embeddings using GANs
  • Training several GAN variants and combining their scores yields substantially improved performance
  • The model is optimized end-to-end and scores trials with a simple cosine distance


  • Future work will consider other adversarial strategies, such as combining feature-space and data-space GANs and GAN-based feature-space augmentation