Text-Independent Speaker Verification with Adversarial Learning on Short Utterances

Kai Liu, Huan Zhou

๐Ÿ“Œ Abstract

๋ฌธ์ œ์ : Text-independent speaker verification์€ ์งง์€ ๋ฐœํ™” ์กฐ๊ฑด์—์„œ ์‹ฌ๊ฐํ•œ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๊ฒช์Œ ํ•ด๊ฒฐ๋ฐฉ๋ฒ•: short embedding์„ enhanced embedding์— ์ง์ ‘ ๋งคํ•‘ํ•˜์—ฌ ํŒ๋ณ„๋ ฅ(discriminability)์„ ๋†’์ด๋„๋ก adversarialํ•˜๊ฒŒ ํ›ˆ๋ จ๋œ embedding model ์ œ์•ˆ

  • ํŠนํžˆ, loss criteria(๊ธฐ์ค€)์ด ๋งŽ์€ Wasserstein GAN ์‚ฌ์šฉ
  • ์—ฌ๋Ÿฌ loss function์€ ๋šœ๋ ทํ•˜๊ฒŒ ์ตœ์ ํ™”ํ•˜๋ ค๋Š” ๋ชฉํ‘œ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋‚˜ ๊ทธ ์ค‘ ์ผ๋ถ€๋Š” ํ™”์ž ๊ฒ€์ฆ ์—ฐ๊ตฌ์— ๋„์›€์ด ๋˜์ง€ ์•Š์Œ
  • ๋Œ€๋ถ€๋ถ„์˜ ์ด์ „ ์—ฐ๊ตฌ์™€ ๋‹ฌ๋ฆฌ ์ด ์—ฐ๊ตฌ์˜ ์ฃผ์š” ๋ชฉํ‘œ ๋Š” ์ˆ˜๋งŽ์€ ablation ์—ฐ๊ตฌ ๋กœ ๋ถ€ํ„ฐ loss criteria์˜ ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆ ใ€€โ†’ ์œ„์—์„œ ๋งํ•˜๋Š” SV์—์„œ ๋„์›€์ด ๋˜์ง€ ์•Š๋Š” loss๋“ค์„ ์ œ๊ฑฐํ•˜๋ฉด์„œ loss์— ๋”ฐ๋ฅธ ์˜ํ–ฅ์„ ์กฐ์‚ฌ
  • VoxCeleb dataset์— ๋Œ€ํ•œ ์‹คํ—˜์—์„œ ์ผ๋ถ€ criteria๋Š” SV ์„ฑ๋Šฅ์— ์ด๋กœ์šด ๋ฐ˜๋ฉด ์ผ๋ถ€ criteria๋Š” ์‚ฌ์†Œํ•œ ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คŒ
  • ๋งˆ์ง€๋ง‰์œผ๋กœ, finetuning์—†์ด ์‚ฌ์šฉํ•œ Wasserstein GAN์€ baseline์„ ๋„˜์–ด ์˜๋ฏธ ์žˆ๋Š” ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ•˜๋ฉฐ, EER์—์„œ๋Š” 4%์˜ ์ƒ๋Œ€์  ๊ฐœ์„ ๊ณผ 2์ดˆ๊ฐ„์˜ ์งง์€ ๋ฐœํ™”์˜ challengeํ•œ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋Š” 7%์˜ minDCF๋ฅผ ๋‹ฌ์„ฑ

โ… . Introduction ๐ŸŒฑ

  • TI-SV: ๋“ฑ๋ก๋œ ํ™”์ž์™€ ํ…Œ์ŠคํŠธ ์Œ์„ฑ(๋‚ด์šฉ ์ œ์•ฝ X)์„ ํ†ตํ•ด ํ™”์ž์˜ ์‹ ์›์„ ๊ฒ€์ฆ
  • ์ค‘์š”ํ•œ ๋‹จ๊ณ„: ์ž„์˜์˜ ์ง€์†์‹œ๊ฐ„์„ ๊ฐ–๋Š” ์Œ์„ฑ์„ ๊ณ ์ • ์ฐจ์›์˜ speaker representation์œผ๋กœ ๋งคํ•‘ํ•˜๋Š” ๊ฒƒ (acoustic feature โ†’ speaker feature)
  • Baseline System: GhostVLAD-aggregated embedding(G-vector); ๊ธด ๋ฐœํ™”, ์งง์€ ๋ฐœํ™”์—์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ์žก์Œ ํ™˜๊ฒฝ์—์„œ x-vector๋ณด๋‹ค ์ด์ ์ด ์žˆ์–ด SV ์‹œ์Šคํ…œ์— ๋” ์œ ๋ฆฌ
  • NIST-SRE 2010 test set์—์„œ full-duration์ด 5์ดˆ๋กœ ๋‹จ์ถ•๋˜์—ˆ์„ ๋•Œ i-vector/PLDA system ์„ฑ๋Šฅ์ด 2.48%์—์„œ 24.78% ๋กœ ๊ฐ์†Œ, ์ตœ๊ทผ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ์ˆ  ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฅผ ๋ณด์™„ํ•˜๋Š” ์—ฐ๊ตฌ๊ฐ€ ๋งŽ์ด ์ง„ํ–‰ ์ค‘
  • ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” Wasserstein GAN์˜ adversarial ํ•™์Šต์„ ์ด์šฉํ•˜์—ฌ ํ–ฅ์ƒ๋œ ์ฐจ๋ณ„์„ฑ์„ ๊ฐ€์ง„ ์ƒˆ๋กœ์šด embedding์„ ์ œ์•ˆ (๊ฐ™์€ ํ™”์ž์˜ ์งง์€ ๋ฐœํ™”์™€ ๊ธด ๋ฐœํ™”์—์„œ ์ถ”์ถœํ•œ G-vector๋ฅผ ํ™œ์šฉํ•˜์—ฌ)

โ…ก. Related Work ๐ŸŒฟ

โœ” GAN ์ด๋ž€: ์ƒ์„ฑ์ž(Generator)์™€ ์‹๋ณ„์ž(Discriminator)๊ฐ€ ์‹ธ์šฐ๋ฉด์„œ ํ•™์Šตํ•˜๋Š” ๋ชจ๋ธ

  • Generator : Discriminator๋ฅผ ์†์ด๋„๋ก ํ•™์Šต
  • Discriminator : real sample ๐‘ฆ์™€ noise ๐œ‚๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋œ fake sample ๐‘”์˜ ์ฐจ์ด๋ฅผ ํ•™์Šต

<br/>

โœ” Adversarial Learning

  • The minimax loss is optimized alternately between the two models (until they reach an equilibrium)

$$\min_G \max_D \; \mathbb{E}_{y \sim p_{data}}[\log D(y)] + \mathbb{E}_{\eta \sim p_\eta}[\log(1 - D(G(\eta)))]$$

  • Training is difficult because of vanishing and exploding gradients; the Wasserstein GAN (WGAN) addresses this mathematically
  • The discriminator is designed to find a good $f_w$, and the new loss function is constructed to estimate the Wasserstein distance

$$W(p_r, p_g) = \sup_{\|f_w\|_L \le 1} \mathbb{E}_{y \sim p_r}[f_w(y)] - \mathbb{E}_{g \sim p_g}[f_w(g)]$$
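As a working illustration, the critic objective above is just a difference of sample means over critic scores. A minimal NumPy sketch with toy scores (not the paper's model):

```python
import numpy as np

def wgan_critic_objective(d_real, d_fake):
    """Empirical Wasserstein critic objective E[f_w(y)] - E[f_w(g)].
    The critic (discriminator) maximizes this difference; the
    generator minimizes the -E[f_w(g)] term."""
    return float(np.mean(d_real) - np.mean(d_fake))

# Toy critic scores for a batch of real and fake samples.
d_real = np.array([1.0, 2.0, 3.0])
d_fake = np.array([0.0, 1.0, 2.0])
print(wgan_critic_objective(d_real, d_fake))  # 1.0
```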

โ…ข. Proposed Approach ๐ŸŒณ

  • ์ œ์•ˆํ•˜๋Š” ์ „๊ธ‰ ๋ฐฉ์‹์€ ์•„๋ž˜์˜ ๊ตฌ์กฐ์™€ ๊ฐ™์Œ
img

$๐‘ฅ, ๐‘ฆ$ : ๊ฐ™์€ speaker์˜ ๊ฐ๊ฐ ์งง๊ณ  ๊ธด ๋ฐœํ™”์— ํ•ด๋‹นํ•˜๋Š” D์ฐจ์›์˜ G-vector
$๐‘ง$ : speaker ID label
$๐บ_๐‘“$ : embedding generator
$๐บ_๐‘$ : speaker label predictor
$๐บ_๐‘‘$ : Distance calculator
$๐ท_๐‘ค$ : Wasserstein discriminator


  • ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์˜ ํ•ต์‹ฌ์ ์ธ task๋Š” discriminability์ด ํ–ฅ์ƒ๋œ embedding์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ

โœ” loss functions

  • WGAN loss
$$L_{WGAN} = \mathbb{E}_{y}[D_w(y)] - \mathbb{E}_{x}[D_w(G_f(x))]$$


  • Conditional WGAN loss: a new loss function based on the Wasserstein distance for a conditional GAN

    • Given $x$ (the short-utterance embedding), $D_w$ measures the difference between the $G_f$ output distribution and the real distribution, with $x$ concatenated to both the real and the fake sample during training

$$L_{cWGAN} = \mathbb{E}_{y}[D_w(y, x)] - \mathbb{E}_{x}[D_w(G_f(x), x)]$$
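The conditioning itself amounts to concatenating the short-utterance embedding to whatever the critic scores; a small sketch with an illustrative (assumed) embedding dimension:

```python
import numpy as np

def conditional_critic_input(sample, condition):
    """Concatenate the conditioning embedding x to a real (y) or
    fake (G_f(x)) sample so the critic D_w scores the pair."""
    return np.concatenate([sample, condition], axis=-1)

D = 512                      # illustrative G-vector dimension (assumption)
x = np.random.randn(D)       # short-utterance embedding (condition)
y = np.random.randn(D)       # long-utterance embedding (real sample)
print(conditional_critic_input(y, x).shape)  # (1024,)
```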


โšก๏ธ WGAGN loss / Conditional WGAN loss ์ค‘ ํ•˜๋‚˜๋งŒ ์‚ฌ์šฉํ•˜๊ณ , ๊ทธ ์ฐจ์ด๋ฅผ ์„ฑ๋Šฅ ํ‰๊ฐ€ ์‹ค์‹œ

<br/>

  • FID loss: Fréchet Inception Distance

    • A metric for the distance between the real-sample and fake-sample embedding distributions

$$L_{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$
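A NumPy sketch of the Fréchet distance under the simplifying assumption of diagonal covariances, so the matrix square root becomes element-wise:

```python
import numpy as np

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Frechet distance between N(mu_r, diag(var_r)) and
    N(mu_g, diag(var_g)):
    ||mu_r - mu_g||^2 + sum(var_r + var_g - 2*sqrt(var_r * var_g))."""
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g))
    return float(mean_term + cov_term)

# Identical covariances; means one apart in each of 4 dimensions.
print(fid_diagonal(np.zeros(4), np.ones(4), np.ones(4), np.ones(4)))  # 4.0
```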


  • Class loss: multi-class cross-entropy loss

    • A loss that separates embeddings by speaker

$$L_{class} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{z_i}^{T} g_i + b_{z_i}}}{\sum_{c=1}^{C} e^{W_c^{T} g_i + b_c}}$$

$N$ : batch size
$C$ : number of classes
$g_i$ : the $i$-th generated embedding
$z_i$ : its label index
$W \in \Re^{D \times C},\ b \in \Re^{C}$ : weight matrix and bias


  • Triplet loss

    • Penalizes violations of the desired separation between classes

$$L_{trip} = \sum_{\gamma \in \Gamma} \max\left(0,\ \|g_a - g_p\|^2 - \|g_a - g_n\|^2 + \Psi\right)$$

$\Gamma$ : the set of all possible embedding triplets $\gamma = (g_a, g_p, g_n)$ in the training set
$g_a$ : anchor input
$g_p$ : positive input
$g_n$ : negative input
$\Psi \in \Re^{+}$ : safety margin between positive and negative pairs


  • Center loss

    • Minimizes intra-class variation

$$L_{cent} = \frac{1}{2}\sum_{i=1}^{m} \|x_i - c_{y_i}\|^2$$

$c_{y_i}$ : the center of class $y_i$ in the deep-feature space
$x_i$ : the $i$-th deep feature, belonging to class $y_i$
$m$ : mini-batch size


  • Cosine distance loss

    • Accounts for the similarity between the enhanced embedding obtained from the generator and the real (target) sample

$$L_{cos} = 1 - \bar{e}_g \cdot \bar{e}_y$$

$\bar{e}$ : L2-normalized embedding


:star: โœ” Generator์™€ Discriminator์˜ ์ตœ์ข… Loss

  • $G_f$
img
  • $D_w$
img
  • WGAN ํ›ˆ๋ จ ํ›„ generative model $๐บ_๐‘“$ ์œ ์ง€

    • Test ๋‹จ๊ณ„์—์„œ ์งง์€ ๋ฐœํ™” embedding $๐‘ฅ$๋ฅผ $๐บ_๐‘“$์— ๋„ฃ์–ด enhanced embedding($g$)๋ฅผ ์–ป์Œ

โ…ฃ. Experiments and Results ๐ŸŒบ

โœ” Experimental setup

  • Train: a subset of VoxCeleb2 (164,716 utterances from 1,057 speakers)
  • Test: a subset of VoxCeleb1 (13,265 utterances from 40 speakers)
  • Short utterances are obtained by randomly cropping 2-second segments

โœ” Baseline system

  • G-vector (VGG-ResNet34s)

โœ” Hyper Parameter

  • Learning rate 0.0001
  • Adam Optimizer
  • Weight clipping -0.01 ~ 0.01 threshold ($๐ท_๐‘ค$)
  • Batch size 128
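The weight-clipping step listed above is an element-wise clamp applied to every critic parameter after each update; a sketch:

```python
import numpy as np

def clip_critic_weights(params, threshold=0.01):
    """WGAN weight clipping: clamp each D_w parameter tensor to
    [-threshold, threshold] to (crudely) enforce the Lipschitz
    constraint the Wasserstein formulation requires."""
    return [np.clip(w, -threshold, threshold) for w in params]

params = [np.array([0.5, -0.002, -0.3])]
print(clip_critic_weights(params)[0])  # elements clamped to [0.01, -0.002, -0.01]
```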


โœ” ๋‹ค์–‘ํ•œ loss function์˜ ์˜ํ–ฅ ์—ฐ๊ตฌ

img
img
- FID loss์€ ๊ธ์ •์ ์ธ ์˜ํ–ฅ (v1 vs v2)
- Conditional WGAN์ด WGAN๋ณด๋‹ค ๋‚˜์Œ (v3 vs v4)
- Triplet loss๋ฅผ ๋„ฃ์œผ๋ฉด ์กฐ๊ธˆ ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ž„ (v7 vs v2)
- Triplet b(fake)๋ณด๋‹ค Triplet a(real, fake ๋ชจ๋‘)๊ฐ€ ํฌ๊ฒŒ ์„ฑ๋Šฅ ํ–ฅ์ƒ (v3 vs v8)
- Softmax๋Š” ๊ธ์ •์ ์ธ ์˜ํ–ฅ (v3 vs v5)
- Center loss์€ ๋ถ€์ •์ ์ธ ์˜ํ–ฅ (v6 vs v7)
- Cosine loss์€ ๊ธ์ •์  ์˜ํ–ฅ (v6 vs v8)


  • ์ถ”๊ฐ€์ ์ธ training function(softmax, cosine, triplet)์ด ๋ชจ๋‘ ํ›ˆ๋ จ์— ๊ธ์ •์ ์ธ ์˜ํ–ฅ์„ ๋ฏธ์นจ
  • SV์‹œ์Šคํ…œ์— FID, conditional WGAN์€ ๋งค์šฐ ์œ ์šฉ, ์ถ”๊ฐ€ ์กฐ์‚ฌ ๊ฐ€์น˜๊ฐ€ ์žˆ์Œ


โœ” Baseline system๊ณผ ๋น„๊ต

  • ์‹คํ—˜ ์ค‘ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์•˜๋˜ v3 system๊ณผ G-vector baseline system ๋น„๊ต
    • EER๊ณผ minDCF
img


  • Baseline๋ณด๋‹ค ์งง์€ duration์— ๋Œ€ํ•ด ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„
    • ์ƒ๋Œ€์ ์œผ๋กœ EER์€ 4.2% ๊ฐœ์„ ํ•˜์˜€์œผ๋ฉฐ, minDCF๋Š” 7.2% ๊ฐœ์„  โ€“ 1์ดˆ task์—์„œ๋„ ์ƒ๋Œ€์  EER 3.8% ํ–ฅ์ƒ
  • ์‹œ๊ฐ„ ์ œ์•ฝ์œผ๋กœ FID loss๋Š” ์ตœ์ข… system์— ์ถ”๊ฐ€ํ•˜์ง€ ์•Š์•˜์œผ๋ฉฐ hyper-parameter, loss weight($\alpha, \beta, \gamma, \lambda, \epsilon$)์™€ triplet margin $\Psi$์— ๋Œ€ํ•œ ๋ฏธ์„ธ์กฐ์ •์ด ์—†์—ˆ์Œ
    • ์ œ์•ˆํ•œ system์˜ ๊ฐœ์„ ๋  ์—ฌ์ง€๊ฐ€ ๋งŽ์ด ๋‚จ์•„์žˆ์Œ

โ…ค. Conclusion ๐ŸŒž

  • ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” WGAN์„ ์ ์šฉ ํ•˜์—ฌ ๋ฐœํ™”๊ฐ€ ์งง์€ speaker verification application์˜ ํ–ฅ์ƒ๋œ embedding์„ ์„ฑ๊ณต์ ์œผ๋กœ ํ•™์Šต
  • ์ œ์•ˆ๋œ WGAN ๊ธฐ๋ฐ˜ ์ปค๋„ ์‹œ์Šคํ…œ ๊ทธ๋ฆฌ๊ณ  ๊ทธ ์œ„์—, GAN ํ›ˆ๋ จ์—์„œ ๋งŽ์€ loss criteria์˜ ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆ
  • ์ตœ์ข… ์ œ์•ˆ ์‹œ์Šคํ…œ์€ ๋„์ „์ ์ธ ์งง์€ ์Šคํ”ผ์ปค ๊ฒ€์ฆ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ baseline system์„ ๋Šฅ๊ฐ€
  • ์ „๋ฐ˜์ ์œผ๋กœ, ์ƒ๋‹นํ•œ ์ง„๋ณด์™€ ์—ฐ๊ตฌ๊ฐ€ ์ง„์ „๋˜๋Š” ์ž ์žฌ์  ๋ฐฉํ–ฅ์„ ๋ณด์—ฌ์คŒ