Attention-Based Models for Text-Dependent Speaker Verification

F A Rezaur Rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, Li Wan

📌 Abstract

  • Attention-based models: able to summarize the entire length of an input sequence
  • Have shown outstanding performance in diverse areas such as speech recognition, machine translation, and image captioning
  • Analyzes the use of an attention mechanism in an end-to-end text-dependent speaker verification system
  • Studies variants of the attention layer and compares different pooling methods on the attention weights
  • Compares performance against an LSTM baseline without any attention mechanism




Ⅰ. Introduction

✔ Global-password text-dependent speaker verification (SV) system

  • Enrollment and test utterances are constrained to a specific phrase (text-dependent)
  • "Ok Google" and "Hey Google" are used as the phrases (global passwords)


✔ Currently most common training approaches

  • An end-to-end architecture that simulates the enrollment and verification stages
  • In [6], an architecture that directly mimics the i-vector + PLDA system regularizes the model for better performance, but requires existing i-vector and PLDA models for initialization
  • In [7], an LSTM network outperforms the previous end-to-end DNN on the TD-SV task


โœ”์ด์ „ ๋…ผ๋ฌธ์—์„œ์˜ ๋ฌธ์ œ์ 

  • ๋ฌต์Œ๊ณผ ๋ฐฐ๊ฒฝ ์žก์Œ์ด ๋งŽ์ด ์—†์Œ
  • ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” keyword ๊ฒ€์ถœ์— ์˜ํ•ด ๋ถ„ํ• ๋œ 800ms์˜ ์งง์€ frame์ด์ง€๋งŒ, ๋ฌต์Œ๊ณผ ์žก์Œ์ด ์žˆ์Œ


โœ”์ด์ƒ์ ์ธ Embedding ์ƒ์„ฑ

  • Ideally built only from the frames corresponding to the phonemes
  • An attention layer is used to emphasize the most relevant elements of the input sequence




Ⅱ. Baseline Architecture

TE2E (tuple-based end-to-end) model

✔ Baseline end-to-end training architecture

[Figure 1: TE2E baseline end-to-end training architecture]
  • In training, a tuple of one evaluation utterance $x_{j\sim}$ and $N$ enrollment utterances $x_{k_n}$ (for $n = 1, \dots, N$) is fed into the LSTM network

$\{x_{j\sim}, (x_{k_1}, \dots, x_{k_N})\}$ ; input
$x$ : fixed-length log-mel filterbank features
$j, k$ : the speakers of the utterances ($j$ and $k$ may be equal)
If $x_{j\sim}$ and the $N$ enrollment utterances come from the same speaker, the tuple is positive $(j = k)$; otherwise it is negative

  • $h_t$ : the output of the last LSTM layer at frame $t$ (a fixed-dimensional vector)
  • The output at the last frame is taken as the d-vector $\omega = h_T$

$\{\omega_{j\sim}, (\omega_{k_1}, \dots, \omega_{k_N})\}$ ; output
The tuple $(\omega_{k_1}, \dots, \omega_{k_N})$ is averaged to obtain the centroid


$$c_k = E_n[\omega_{k_n}] = \frac{1}{N}\sum_{n=1}^{N} \omega_{k_n}$$


✔ Definition of the cosine similarity function

$$s = w \cdot \cos(\omega_{j\sim}, c_k) + b$$

($w$ and $b$ are learnable scalars)


✔ Definition of the loss function

$$L(x_{j\sim}, c_k) = \delta(j, k)\bigl(1 - \sigma(s)\bigr) + \bigl(1 - \delta(j, k)\bigr)\,\sigma(s)$$

($\sigma$ is the sigmoid function; $\delta(j, k) = 1$ if $j = k$ and $0$ otherwise)
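
As a concrete reference, below is a minimal NumPy sketch of the TE2E scoring and loss defined above. The function names and the initial values of the learnable scalars `w` and `b` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of TE2E scoring and loss (illustrative, not the paper's code).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def te2e_loss(omega_eval, omega_enroll, same_speaker, w=1.0, b=0.0):
    """omega_eval: (dim,) evaluation d-vector; omega_enroll: (N, dim) enrollment d-vectors."""
    c_k = omega_enroll.mean(axis=0)          # centroid of the enrollment d-vectors
    s = w * cosine(omega_eval, c_k) + b      # similarity score
    sigma = 1.0 / (1.0 + np.exp(-s))         # sigmoid
    # delta(j, k) = 1 for a positive tuple (same speaker), 0 otherwise
    return (1.0 - sigma) if same_speaker else sigma
```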




Ⅲ. Attention-based Model

3.1 Basic attention layer

✔ Differences from the baseline system

  • The baseline uses the output of the last frame directly as the d-vector $\omega$
  • The attention layer instead learns a scalar score $e_t$ for the last-layer LSTM output $h_t$ at each frame $t$, and defines the d-vector $\omega$ as their weighted sum (see the code sketch below)
$$e_t = f_t(h_t), \qquad t = 1, \dots, T$$
  • The normalized weights $\alpha_t$ and the resulting weighted-sum d-vector are defined as follows:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{s=1}^{T} \exp(e_s)}$$

$$\omega = \sum_{t=1}^{T} \alpha_t h_t$$


  • The difference at the architecture level:
[Figure 2: (a) LSTM-based baseline d-vector vs. (b) LSTM with a basic attention layer]
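
A minimal sketch of the basic attention layer, assuming the per-frame outputs of the last LSTM layer are already available as a `(T, m)` array; `score_fn` stands in for any scoring function $f_t$ from Section 3.2 below.

```python
# Basic attention layer: per-frame scalar scores, softmax weights, weighted sum.
import numpy as np

def attention_dvector(h, score_fn):
    """h: (T, m) last-layer LSTM outputs; returns (d-vector, weights alpha)."""
    e = np.array([score_fn(h_t, t) for t, h_t in enumerate(h)])  # scores e_t
    e -= e.max()                                   # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()            # softmax over frames
    omega = (alpha[:, None] * h).sum(axis=0)       # d-vector as weighted sum
    return omega, alpha
```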


3.2 Scoring functions

  • Bias-only attention: here $b_t$ is a scalar and does not depend on the LSTM output $h_t$

$$e_t = f_t(h_t) = b_t$$

  • Linear attention: here $w_t$ is an $m$-dimensional vector and $b_t$ is a scalar; each frame has its own parameters

$$e_t = f_t(h_t) = w_t^T h_t + b_t$$

  • Shared-parameter linear attention: the same $m$-dimensional vector $w$ and scalar $b$ are used for every frame

$$e_t = f_t(h_t) = w^T h_t + b$$

  • Non-linear attention: here $W_t$ is an $m' \times m$ matrix, and $b_t$ and $v_t$ are $m'$-dimensional vectors (the dimension $m'$ is tuned on the training data)

$$e_t = f_t(h_t) = v_t^T \tanh(W_t h_t + b_t)$$

  • Shared-parameter non-linear attention: the same parameters $W$, $b$, and $v$ are shared across all frames (all five functions are sketched in code below)

$$e_t = f_t(h_t) = v^T \tanh(W h_t + b)$$
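
Under the same assumptions as the sketch in Section 3.1, the five scoring functions might look as follows; the shapes follow the text above, while the random initialization exists only to make the sketch runnable (in practice all of these parameters are trained jointly with the network).

```python
# Sketches of the five scoring functions (parameters would normally be trained).
import numpy as np

T, m, m_prime = 80, 64, 64
rng = np.random.default_rng(0)
b_t = rng.standard_normal(T)                     # per-frame scalar biases
w_frames = rng.standard_normal((T, m))           # per-frame m-dim vectors
W_frames = rng.standard_normal((T, m_prime, m))  # per-frame m' x m matrices
v_frames = rng.standard_normal((T, m_prime))
c_frames = rng.standard_normal((T, m_prime))     # per-frame m'-dim biases
w_sh, b_sh = rng.standard_normal(m), 0.1         # shared linear parameters
W_sh, v_sh, c_sh = W_frames[0], v_frames[0], c_frames[0]  # shared non-linear

def bias_only(h, t):        return b_t[t]
def linear(h, t):           return w_frames[t] @ h + b_t[t]
def shared_linear(h, t):    return w_sh @ h + b_sh
def nonlinear(h, t):        return v_frames[t] @ np.tanh(W_frames[t] @ h + c_frames[t])
def shared_nonlinear(h, t): return v_sh @ np.tanh(W_sh @ h + c_sh)
```

Any of these can be plugged into the earlier `attention_dvector` sketch, e.g. `omega, alpha = attention_dvector(h, shared_nonlinear)`.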


3.3 Attention layer variants

  • Unlike the basic attention layer, two variants are introduced: cross-layer attention and divided-layer attention

✔ Cross-layer attention

  • Original method: the scores $e_t$ and weights $\alpha_t$ are computed from the outputs $h_t$ $(1 \le t \le T)$ of the last LSTM layer
  • Variant: they are computed instead from the outputs $h'_t$ $(1 \le t \le T)$ of an intermediate LSTM layer (Figure 3(a) shows the case where the second-to-last layer is used)
  • The d-vector $\omega$ is still the weighted sum over the last layer's outputs $h_t$ (sketched below)
$$e_t = f_t(h'_t), \qquad \omega = \sum_{t=1}^{T} \alpha_t h_t$$
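
Relative to the basic layer this is a one-line change, sketched below under the same assumptions: the scores are computed from the intermediate layer's outputs, while the weighted sum still runs over the last layer's outputs.

```python
# Cross-layer attention: score on h'_t (intermediate layer), sum over h_t (last layer).
import numpy as np

def cross_layer_dvector(h, h_prime, score_fn):
    """h, h_prime: (T, m) outputs of the last and an intermediate LSTM layer."""
    e = np.array([score_fn(hp, t) for t, hp in enumerate(h_prime)])
    alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()  # softmax
    return (alpha[:, None] * h).sum(axis=0)
```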


✔ Divided-layer attention

  • The dimension of the last LSTM layer's output $h_t$ is doubled, and that output is split evenly into two halves, part-a $h_t^a$ and part-b $h_t^b$
  • Part-b is used to compute the weights, and the d-vector is the weighted sum over part-a (see the sketch below)
$$h_t = \begin{bmatrix} h_t^a \\ h_t^b \end{bmatrix}, \qquad e_t = f_t(h_t^b), \qquad \omega = \sum_{t=1}^{T} \alpha_t h_t^a$$
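
A minimal sketch of divided-layer attention, assuming the doubled last-layer output arrives as a `(T, 2m)` array that is split into two halves:

```python
# Divided-layer attention: part-b drives the weights, part-a forms the d-vector.
import numpy as np

def divided_layer_dvector(h, score_fn):
    """h: (T, 2m) doubled last-layer outputs; returns the d-vector."""
    half = h.shape[1] // 2
    h_a, h_b = h[:, :half], h[:, half:]        # part-a / part-b
    e = np.array([score_fn(hb, t) for t, hb in enumerate(h_b)])
    alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()  # softmax
    return (alpha[:, None] * h_a).sum(axis=0)  # weighted sum over part-a
```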


3.4 Weights pooling

✔ Another modification of the basic attention layer

  • Instead of using the normalized weights $\alpha_t$ directly to average the LSTM outputs $h_t$, the weights are used selectively via maxpooling

✔ Two maxpooling methods are used

  • Sliding window maxpooling: only the largest weight within each sliding window is kept; the rest are set to 0
  • Global top-K maxpooling: only the K largest weights are kept; the rest are set to 0 (both methods are sketched in code after the figure below)
[Figure: illustration of the two maxpooling methods over the attention weights]

$t$-th pixel: the weight $\alpha_t$
Brighter pixels indicate larger weights




Ⅳ. Experiments

4.1 Datasets and basic setup

✔ Datasets

  • Utterances mixing "Ok Google" and "Hey Google"
  • About 630K speakers and about 150M utterances (test data: 665 speakers)
  • On average, enrollment uses 4.5 utterances and evaluation uses 10 utterances per speaker

✔ Basic setup

  • The baseline is a 3-layer LSTM
  • Each layer has dimension 128, with a linear layer projecting down to 64 dimensions
  • Keyword detection segments out windows of length T = 80 frames (800 ms) containing only the global password, from which 40-dimensional log-mel-filterbank features are computed
  • The MultiReader technique is used to train on a mixture of the two keywords

4.2 Basic attention layer

  • The basic attention layer with various scoring functions is compared against the non-attention baseline
[Table 1: EER(%) of the non-attention baseline vs. the basic attention layer with each scoring function]
  • Bias-only and linear attention barely improve the EER
  • Non-linear attention improves performance, especially with shared parameters

4.3 Variants

  • The basic attention layer is compared with its two variants (cross-layer and divided-layer)
  • The shared-parameter non-linear scoring function, the best performer in the previous experiment, is used throughout
[Table 2: EER(%) of basic, cross-layer, and divided-layer attention]
  • For cross-layer attention, the scores are trained on the second-to-last layer
  • Although divided-layer attention doubles the dimension of the last LSTM layer, it performs slightly better than basic and cross-layer attention

4.4 Weights pooling

  • Different pooling methods applied to the attention weights are compared
  • The shared-parameter non-linear scoring function and divided-layer attention are used
  • Sliding window maxpooling: 10-frame window size with a 5-frame step size
  • Global top-K maxpooling: K = 5
[Table 3: EER(%) with no pooling, sliding window maxpooling, and global top-K maxpooling]
  • Sliding window maxpooling achieves a slightly lower EER

✔ Visualization of the attention weights for each method

[Figure: visualized attention weights with no pooling, sliding window maxpooling, and global top-K maxpooling]
  • Without pooling, a 4-phoneme (O-kay-Goo-gle) or 3-phoneme (Hey-Goo-gle) pattern is visible in the weights
  • With pooling, the larger attention weights fall on the later part of the utterance rather than the beginning
  • This is likely because the LSTM accumulates information from its previous states, so frames toward the end of the utterance carry more information




Ⅴ. Conclusion

  • This paper experiments with various attention mechanisms for a keyword-based text-dependent speaker verification system

  • Best combination
    1. Use the shared-parameter non-linear scoring function
    2. Apply divided-layer attention to the last LSTM layer
    3. Apply sliding window maxpooling to the attention weights
  • Combining all three yields a 14% relative improvement over the baseline LSTM's EER of 1.72%, i.e., down to roughly 1.48%

  • ๋™์ผํ•œ attention mechanism(ํŠนํžˆ, shared-parameter scoring function)์€ Text-independentํ•œ ํ™”์ž ๊ฒ€์ฆ ๋ฐ ํ™”์ž ์‹๋ณ„์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Œ