End-to-End DNN based Speaker Recognition Inspired by i-vector and PLDA

Johan Rohdin, Anna Silnova, Mireia Diez, Oldrich Plchot, Pavel Matejka, Lukas Burget

📌 Abstract

  • DNN-based End-to-End systems have recently proven competitive not only for text-dependent tasks but also for text-independent tasks with short utterances
  • For text-independent tasks with long utterances, however, i-vector + PLDA systems still perform better
  • The paper proposes a speaker verification system that mimics the i-vector + PLDA baseline (it is trained End-to-End, but regularized so that it does not stray far from the baseline system)
  • This resolves the performance degradation caused by overfitting, and the system outperforms the i-vector + PLDA baseline on both long and short utterances




Ⅰ. Introduction

[ Characteristics of previously introduced DNN-based speaker recognition systems ]

  1. Replace or improve one of the components of the i-vector + PLDA system (feature extraction, calculation of sufficient statistics, i-vector extraction or PLDA) with an NN (Neural Network)
    • Use bottleneck features instead of MFCC features
    • Use an NN acoustic model instead of the GMM-UBM when computing sufficient statistics
    • Use an NN that complements or replaces PLDA
  2. Extract speaker embeddings from an NN trained to classify speaker IDs (representative examples: embeddings such as d-vector and x-vector)
    • Acoustic features are fed in as input and the loss is computed against the speaker labels; part of the NN (a hidden layer, i.e. the DNN part of a TDNN + fully-connected DNN) is then used as an utterance-level feature
    • Effective for text-dependent tasks and for text-independent tasks with short utterances
    • For text-independent tasks with relatively long utterances, performance is lower than i-vector + PLDA


Proposed Method : an End-to-End speaker verification system that mimics the i-vector + PLDA baseline

1. f2s (NN module for extracting sufficient statistics)

2. s2i (NN module for extracting i-vectors)

3. DPLDA (Discriminative PLDA module for computing scores)

  • The three modules are first trained individually to mimic the baseline; they are then combined and trained further End-to-End on both short and long utterances

  • During this further training, regularization keeps the parameters obtained from individual training from changing too much (this prevents the system from drifting too far from the baseline and reduces the risk of overfitting)

  • The system is evaluated on three different datasets derived from NIST SRE (they contain speech in various languages and test performance on both long utterances of more than 2 minutes and short utterances of less than 40 s)
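
The regularization idea above can be pictured as a penalty on how far the jointly trained parameters move from their individually trained values. The following is a minimal sketch, assuming a squared-L2 penalty; the function name `regularized_loss` and the weight `lam` are hypothetical, and the summary does not state the exact form the authors use.

```python
import numpy as np

def regularized_loss(task_loss, params, init_params, lam=0.01):
    """Task loss plus a penalty pulling params toward their initial values.

    Assumption: squared L2 distance; `lam` is a hypothetical weight.
    """
    penalty = sum(np.sum((p - p0) ** 2) for p, p0 in zip(params, init_params))
    return float(task_loss + lam * penalty)

# Toy parameters: one weight matrix and one bias vector.
init = [np.ones((2, 2)), np.zeros(3)]
moved = [np.ones((2, 2)) + 0.5, np.zeros(3)]

unchanged_loss = regularized_loss(1.0, init, init)    # penalty is zero
moved_loss = regularized_loss(1.0, moved, init)       # penalty grows with the shift
print(unchanged_loss, moved_loss)
```

A larger `lam` keeps the End-to-End system closer to the individually trained modules, at the cost of less freedom to adapt.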




Ⅱ. Database and Baseline Systems

  1. Training and testing are based on the PRISM dataset, with three evaluation sets
    • (1) The female part of the original (long) telephone utterances in the NIST SRE 2005~2010 data
    • (2) The recordings of (1) cut into several short utterances (enrollment: 20~50 s, test: 30~40 s)
    • (3) The NIST SRE 2016 evaluation set (both male and female, single enrollment)


  2. Generative (PLDA) and Discriminative (DPLDA) Baseline
    • Features: 60-dimensional MFCC (20 dimensions plus Δ and ΔΔ)
    • Only the telephone data in the training set is used (short-utterance durations follow a uniform distribution between 10 and 60 s; of 85,858 utterances in total, 22,766 are short)
    • PLDA/DPLDA: UBM with 2048 components, 400-dimensional i-vectors


PLDA

  • i-vectors are mean-normalized (using the mean of the i-vectors of all training data) and length-normalized
  • No additional domain adaptation or score normalization is performed
  • The training data is reduced to 68,994 utterances so that each speaker has 6 utterances

DPLDA

  • Binary cross-entropy is optimized with the L-BFGS optimizer (the PLDA model is used as the initialization for training)
  • i-vectors are mean-normalized and length-normalized
  • LDA is applied to reduce the i-vectors to 250 dimensions
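
The mean and length normalization used by both baselines is simple to state in code. Here is a minimal NumPy sketch on random stand-in data; the function name is hypothetical, and the LDA projection to 250 dimensions is not shown.

```python
import numpy as np

def center_and_length_normalize(ivectors, train_mean):
    """Subtract the training-set mean, then scale each row to unit L2 norm."""
    centered = ivectors - train_mean
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.maximum(norms, 1e-12)  # guard against zero vectors

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 400))   # toy stand-in for 400-dim i-vectors
train_mean = train.mean(axis=0)       # mean over all training i-vectors
normed = center_and_length_normalize(train, train_mean)
print(normed.shape, np.allclose(np.linalg.norm(normed, axis=1), 1.0))
```

Enrollment and test i-vectors would be normalized with the same `train_mean`, not with their own mean.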




Ⅲ. Proposed End-to-End DNN Architecture

1. Feature to Sufficient statistics : f2s [feature vectors → sufficient statistics]

  • Predicts the GMM responsibilities (posteriors) for each frame of the input utterance
    • The 60-dimensional MFCCs are preprocessed and used as input
    • A context of ±15 frames around the current frame is considered (31 frames in total) → reduced to 6 values per feature dimension
    • 6 * 60 → 360 dimensions
  • Hidden layers : 4 (activation function : sigmoid, 1500 nodes each)
  • Output : 2048 (the number of components in the GMM-UBM baseline) - softmax
  • Optimizer : SGD (stochastic gradient descent)
  • Loss : categorical cross-entropy (labels : posteriors from the GMM-UBM)
  • Frames are pooled into sufficient statistics (computed over all frames from the softmax-layer posteriors and the non-preprocessed MFCCs)
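
The pooling step can be sketched as follows, assuming the usual zero-order statistics N (sum of posteriors per component) and first-order statistics F (posterior-weighted sum of features). The posteriors here are a random softmax standing in for the f2s output, and the function name is hypothetical.

```python
import numpy as np

def pool_sufficient_stats(posteriors, feats):
    """posteriors: (T, C) per-frame responsibilities; feats: (T, D) raw MFCCs.

    Returns N with shape (C,) and F with shape (C, D).
    """
    N = posteriors.sum(axis=0)      # zero-order: soft frame count per component
    F = posteriors.T @ feats        # first-order: posterior-weighted feature sums
    return N, F

T, C, D = 500, 2048, 60             # frames, UBM components, MFCC dimension
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, C))
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)   # softmax: each row sums to 1
mfcc = rng.normal(size=(T, D))

N, F = pool_sufficient_stats(post, mfcc)
print(N.shape, F.shape)             # N sums to the number of frames T
```

Whatever the utterance length T, the pooled output has a fixed size, which is what lets the following module work at the utterance level.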


2. Sufficient statistics to i-vectors : s2i [sufficient statistics → i-vector]

  • The sufficient statistics produced by f2s are used as input (2048x60 dimensions)
  • They are converted into a MAP-adapted supervector (122,880 dimensions), which is reduced to 4000 dimensions with PCA
  • Hidden layers : 3 (activation function : tanh, layers 1-2 : 600 nodes, layer 3 : 250 nodes) - the last layer length-normalizes the i-vector
  • Loss : average cosine distance between the NN output and the reference i-vectors (reduced to 250 dimensions with LDA and length-normalized)
  • Optimizer : SGD, L1-regularization
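
As an illustration of the s2i training objective, the sketch below computes an average cosine distance between predicted and reference i-vectors. It assumes the distance is defined as 1 minus the cosine similarity, which the summary does not spell out; the last line checks the supervector dimension.

```python
import numpy as np

def average_cosine_distance(pred, ref):
    """Mean of (1 - cosine similarity) over paired rows of pred and ref."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred_n * ref_n, axis=1)))

ref = np.array([[1.0, 0.0], [0.0, 2.0]])
same_direction = average_cosine_distance(ref, ref)       # 0.0
opposite_direction = average_cosine_distance(ref, -ref)  # 2.0
supervector_dim = 2048 * 60                              # 122880
print(same_direction, opposite_direction, supervector_dim)
```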




3. i-vector to scores (DPLDA)

  • Given two i-vectors (denoted ϕ), the score is the Log-Likelihood Ratio (LLR) of the PLDA model
  • DPLDA obtains the parameters (Λ, Γ, c, k) of this LLR expression by discriminative training
  • It decides whether the two i-vectors come from the same speaker (the parameters are obtained by optimizing binary cross-entropy or an SVM objective)
  • DPLDA is normally trained on all training data at once (full batch), but in End-to-End training this requires too much memory and time, so minibatches made of randomly selected subsets of the training data are used instead
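
In the DPLDA literature the pairwise score is usually written as $s_{ij} = \phi_i^T \Lambda \phi_j + \phi_j^T \Lambda \phi_i + \phi_i^T \Gamma \phi_i + \phi_j^T \Gamma \phi_j + (\phi_i + \phi_j)^T c + k$, which matches the parameter set (Λ, Γ, c, k) named above. The sketch below evaluates that form for all pairs in a batch and an unweighted binary cross-entropy on the scores. The function names are hypothetical, and the unweighted mean is an assumption (target and non-target trials may be weighted differently in practice).

```python
import numpy as np

def dplda_scores(phi, Lam, Gam, c, k):
    """All-pairs scores s[i, j] for i-vectors stacked as rows of phi (n, d)."""
    cross = phi @ Lam @ phi.T                        # phi_i^T Lam phi_j
    quad = np.einsum("id,de,ie->i", phi, Gam, phi)   # phi_i^T Gam phi_i
    lin = phi @ c                                    # phi_i^T c
    return (cross + cross.T
            + quad[:, None] + quad[None, :]
            + lin[:, None] + lin[None, :] + k)

def binary_cross_entropy(scores, same_speaker):
    """Mean logistic loss; same_speaker is a boolean (n, n) label matrix."""
    signs = np.where(same_speaker, 1.0, -1.0)
    return float(np.mean(np.logaddexp(0.0, -signs * scores)))

rng = np.random.default_rng(0)
d = 3
phi = rng.normal(size=(4, d))
# With Lam = I, Gam = 0, c = 0, k = 0 the score reduces to 2 * phi_i . phi_j.
S = dplda_scores(phi, np.eye(d), np.zeros((d, d)), np.zeros(d), 0.0)
labels = np.eye(4, dtype=bool)      # toy labels: only self-pairs are "same speaker"
loss = binary_cross_entropy(S, labels)
print(S.shape, loss)
```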


< Minibatch selection rule >

  1. For each speaker, randomly group the utterances into pairs
    • If a speaker has only one utterance, the same utterance is used to form the pair
    • If a speaker has an odd number of utterances, one of the pairs contains three utterances
  2. For each minibatch, randomly select N pairs and use all trials that can be formed from the selected utterances (once the last pair has been selected, return to step 1)
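
The selection rule above can be sketched in plain Python. The function names, the toy speaker dictionary, and the handling of a final, smaller minibatch are illustrative assumptions rather than details from the paper.

```python
import random
from itertools import combinations

def make_pairs(utts_by_speaker, rng):
    """Step 1: randomly group each speaker's utterances into pairs."""
    pairs = []
    for spk, utts in utts_by_speaker.items():
        utts = list(utts)
        rng.shuffle(utts)
        if len(utts) == 1:
            pairs.append((spk, [utts[0], utts[0]]))  # lone utterance paired with itself
            continue
        groups = [utts[i:i + 2] for i in range(0, len(utts) - 1, 2)]
        if len(utts) % 2 == 1:
            groups[-1].append(utts[-1])              # odd count: one group of three
        pairs.extend((spk, g) for g in groups)
    return pairs

def minibatches(utts_by_speaker, n_pairs, seed=0):
    """Step 2: yield the utterances of n_pairs pairs; re-pair when exhausted."""
    rng = random.Random(seed)
    while True:
        pairs = make_pairs(utts_by_speaker, rng)
        rng.shuffle(pairs)
        for start in range(0, len(pairs), n_pairs):
            batch = pairs[start:start + n_pairs]
            yield [(spk, u) for spk, group in batch for u in group]

def trials(batch):
    """All trials within a minibatch, labeled True for same-speaker pairs."""
    return [(a[1], b[1], a[0] == b[0]) for a, b in combinations(batch, 2)]

data = {"spkA": ["a1", "a2", "a3"], "spkB": ["b1"], "spkC": ["c1", "c2", "c3", "c4"]}
pairs = make_pairs(data, random.Random(0))
batch = next(minibatches(data, n_pairs=2))
print(len(pairs), len(batch), len(trials(batch)))
```

Because utterances are drawn pair-wise, every minibatch is guaranteed to contain same-speaker trials, which purely random utterance sampling would rarely produce.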


4. End-to-End System

  • The modules trained individually above are combined and trained further End-to-End

  • The drawback is that this requires a very large amount of memory

  1. PCA : to connect f2s and s2i, the PCA projection has to become part of the network, which requires 122,880x4000 parameters
    • Before the full End-to-End training, only the s2i NN and the DPLDA model are trained jointly

    • As in the individual s2i training, the input is fixed as long as f2s is not updated, so PCA-projected features can be used directly as input


  2. f2s : to train the DPLDA module, all frames of many different utterances have to be processed at once
    • The training procedure is modified so that fewer intermediate results are kept in memory
    • The sufficient statistics are computed for one utterance at a time, and the outputs of all layers in block A (the frame-level f2s layers) are then discarded
    • Keeping the layer outputs of block A would mean storing (number of frames n_f) x (1500+1500+1500+1500+2048) variables in memory
    • After pooling into the sufficient statistics F and N, the size is only 2048x60
    • Optimizer : ADAM
    • The learning rate is halved whenever $C^{prm}_{min}$ does not improve over an epoch
    • The training data is the same as for DPLDA
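
To make the memory argument concrete, the arithmetic below compares the per-utterance layer outputs of the f2s block with the pooled statistics, and counts the parameters of the PCA projection. The 4-byte float size and the frame count (2 minutes at 100 frames per second) are assumptions for illustration only.

```python
BYTES_PER_FLOAT32 = 4               # assumption: single-precision storage

def f2s_activation_floats(n_frames):
    """Layer outputs kept for backprop: 4 hidden layers plus the softmax output."""
    return n_frames * (1500 + 1500 + 1500 + 1500 + 2048)

pooled_floats = 2048 * 60           # first-order statistics F after pooling
pca_params = 2048 * 60 * 4000       # PCA projection treated as a dense layer

n_frames = 12_000                   # assumption: 2 minutes at 100 frames/s
act_mib = f2s_activation_floats(n_frames) * BYTES_PER_FLOAT32 / 2**20
pooled_mib = pooled_floats * BYTES_PER_FLOAT32 / 2**20
pca_gib = pca_params * BYTES_PER_FLOAT32 / 2**30

print(f"f2s layer outputs per utterance: {act_mib:.1f} MiB")
print(f"pooled statistics per utterance: {pooled_mib:.2f} MiB")
print(f"PCA projection matrix:           {pca_gib:.2f} GiB")
```

Under these assumptions the layer outputs are several hundred times larger than the pooled statistics, which is why discarding them after pooling each utterance makes joint training feasible.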


Ⅳ. Results and Discussion

  • Results table: systems in which only part of the baseline is replaced by an NN, and the End-to-End systems
> Rows 1-2 : performance of the PLDA and DPLDA baselines
> Row 3 : performance when the UBM is replaced by the f2s NN
> Row 4 : performance when the i-vector extractor is replaced by the s2i NN
> Row 5 : performance when s2i is trained on the output of the f2s module instead of the UBM sufficient statistics
> Row 6 : performance when DPLDA is used instead of PLDA
> Row 7 : performance when only s2i and DPLDA are trained jointly
> Row 8 : performance when all modules are trained jointly


  • There was no large difference between jointly training all three modules (row 8) and jointly training two modules (row 7)


< Three possible explanations >

  1. The minibatches may be too small for stable training (when the three modules are trained jointly, N=75 is the maximum)
  2. The model may be stuck in a local minimum (because the subsequent modules were trained on the output of f2s)
  3. The design of f2s is quite constrained (it only estimates the posteriors and cannot modify the features used to compute the statistics)


Ⅴ. Conclusion

  • Developed an End-to-End speaker verification system that outperforms the i-vector + PLDA baseline on three different datasets covering various languages and both long and short utterances
  • Constraining the system to behave similarly to the i-vector + PLDA system mitigates the overfitting that degrades End-to-End performance
  • Jointly training two of the system's three submodules worked well, but jointly training all three was not effective
    • Future work: improve the joint training of all three modules so that it gives better performance
  • The system is designed for single enrollment; future work is to extend it to handle multiple enrollment utterances