Attentive Statistics Pooling for Deep Speaker Embedding

Koji Okabe, Takafumi Koshinaka, Koichi Shinoda

๐Ÿ“Œ Abstract

  • Text-independent(๋ฌธ์žฅ ๋…๋ฆฝ : ๋ฐœํ™” ๋‚ด์šฉ์ด ๋™์ผํ•˜์ง€ ํ•˜์ง€ ์•Š์Œ)ํ•œ Speaker Verification(ํ™”์ž ๊ฒ€์ฆ : ๋“ฑ๋ก๋œ ํ™”์ž์ธ์ง€ ์•„๋‹Œ์ง€ ํŒ๋‹จ, SV)์—์„œ Deep speaker embedding์„ ์œ„ํ•œ attentive statistics pooling ์ œ์•ˆ

  • ๊ธฐ์กด์˜ speaker embedding์—์„œ๋Š” ๋‹จ์ผ ๋ฐœํ™”์˜ ๋ชจ๋“  frame์—์„œ frame-level์˜ ํŠน์ง•์„ ๋ชจ๋‘ ํ‰๊ท  ๋‚ด์–ด utterance-level์˜ ํŠน์ง•์„ ํ˜•์„ฑ

  • ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ attention mechanism์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ frame๋งˆ๋‹ค ๋‹ค๋ฅธ weight(๊ฐ€์ค‘์น˜)๋ฅผ ๋ถ€์—ฌํ•˜๊ณ , weighted mean(๊ฐ€์ค‘ ํ‰๊ท )๊ณผ weighted standard deviations(๊ฐ€์ค‘ ํ‘œ์ค€ ํŽธ์ฐจ)๋ฅผ ์ƒ์„ฑ

โœ” NISE SRE 2012 ๋ฐ VoxCeleb data set์—์„œ ๊ธฐ์กด ๋ฐฉ๋ฒ•์— ๋น„ํ•ด EER์ด ๊ฐ๊ฐ 7.5%, 8.1% ๊ฐ์†Œ




๐Ÿ“Œ Introduction

  • ํ™”์ž ์ธ์‹์€ ์ง€๋‚œ 10๋…„๋™์•ˆ i-vector paradigm๊ณผ ์ง„ํ™”ํ•˜์˜€๊ณ , i-vector๋Š” ๊ณ ์ •๋œ ์ €์ฐจ์›์˜ ํŠน์ง• ๋ฒกํ„ฐ ํ˜•ํƒœ๋กœ ์Œ์„ฑ ๋ฐœํ™” ํ˜น์€ ํ™”์ž๋ฅผ ํ‘œํ˜„

  • ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ํ•™์Šต์„ ํ†ตํ•ด Deep learning์ด ์„ฑ๋Šฅ ํ–ฅ์ƒ์— ํฌ๊ฒŒ ๊ธฐ์—ฌํ•˜๋ฉฐ, ํ™”์ž ์ธ์‹์„ ์œ„ํ•œ ํŠน์ง• ์ถ”์ถœ์— Deep learning์„ ๋„์ž…์ด ์ฆ๊ฐ€

  • ์ดˆ๊ธฐ ์—ฐ๊ตฌ์—์„œ๋Š” ASR(Automatic Speech Recognition)์˜ ์Œํ–ฅ ๋ชจ๋ธ์—์„œ ๋„์ถœ๋œ DNN์„ UBM์œผ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๊ธฐ์กด์˜ GMM๊ธฐ๋ฐ˜ UBM๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์ง€๋งŒ ์–ธ์–ด ์˜์กด์„ฑ ๋‹จ์ ๊ณผ ํ›ˆ๋ จ์„ ์œ„ํ•ด ์Œ์†Œ transcription์ด ํ•„์š”

  • ์ตœ๊ทผ DNN์€ ์ด๋Ÿฌํ•œ i-vector framework์™€ ๋…๋ฆฝ์ ์œผ๋กœ ํ™”์ž ๋งˆ๋‹ค ๊ณ ์œ ํ•œ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ์ถ”์ถœํ•˜๋Š”๋ฐ ์œ ์šฉํ•˜๋‹ค๊ณ  ๋ฐํ˜€์ง (ํŠนํžˆ, ์งง์€ ๋ฐœํ™” ์กฐ๊ฑด์—์„œ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„)

  • For text-dependent SV (where the spoken content is fixed), an end-to-end neural network method was proposed that obtains an utterance-level feature with an LSTM (a structure producing a single output at the last frame) and outperformed the conventional i-vector

  • Since text-independent SV takes utterances of variable length as input, an average pooling layer is introduced to turn the frame-level speaker feature vectors into a speaker embedding vector of fixed dimension

  • ๋Œ€๋ถ€๋ถ„ ์ตœ๊ทผ ์—ฐ๊ตฌ์—์„œ DNN์ด i-vector๋ณด๋‹ค ๋” ๋‚˜์€ ์ •ํ™•๋„๋ฅผ ๊ฐ–๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ Snyder ์™ธ๋Š” average pooling๋ฅผ ํ™•์žฅํ•œ statistics pooling (ํ‰๊ท  ๋ฐ ํ‘œ์ค€ ํŽธ์ฐจ ๊ณ„์‚ฐ)์„ ์ฑ„ํƒ

  • ๊ทธ๋Ÿฌ๋‚˜ ์•„์ง ์ •ํ™•๋„ ํ–ฅ์ƒ์— ๋Œ€ํ•œ ํ‘œ์ค€ ํŽธ์ฐจ pooling์˜ ํšจ์œจ์„ฑ์€ ๋ณด๊ณ ํ•˜์ง€ ์•Š์Œ


  • ์ตœ๊ทผ ๋‹ค๋ฅธ ์—ฐ๊ตฌ์—์„œ๋Š” ์ด์ „์— ๊ธฐ๊ณ„ ๋ฒˆ์—ญ์—์„œ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๊ฐ€์ ธ์˜จ attention mechanism๊ณผ ํ†ตํ•ฉ

  • ํ™”์ž ์ธ์‹์—์„œ๋„ ์ค‘์š”๋„ ๊ณ„์‚ฐ ์‹œ, speaker embedding ์ถ”์ถœํ•˜๋Š” network์˜ ์ผ๋ถ€๋กœ ์ž‘๋™ํ•˜๋Š” ์ž‘์€ attention network ์‚ฌ์šฉ

  • ๊ณ„์‚ฐ๋œ ์ค‘์š”๋„๋Š” frame-level์˜ ํŠน์ง• ๋ฒกํ„ฐ์˜ weighted mean ๊ณ„์‚ฐํ•  ๋•Œ ์‚ฌ์šฉํ•˜์—ฌ speaker embedding์ด ์ค‘์š”ํ•œ frame์— ์ดˆ์ ์„ ๋งž์ถค

  • ๊ทธ๋Ÿฌ๋‚˜ ์ด์ „ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ณ ์ • ๊ธธ์ด์˜ text-independent ํ˜น์€ text-dependent ํ™”์ž ์ธ์‹๊ณผ ๊ฐ™์€ ์ œํ•œ๋œ ์ž‘์—…์—์„œ๋งŒ ์ˆ˜ํ–‰

- ๋ณธ ๋…ผ๋ฌธ์—์„œ attention mechanism์œผ๋กœ ๊ณ„์‚ฐ๋œ ์ค‘์š”๋„๋กœ importance-weighted standard deviation๊ณผ weighted mean์‚ฌ์šฉํ•œ ์ƒˆ๋กœ์šด pooling๋ฐฉ๋ฒ•์ธ attentive statistics pooling๋ฅผ ์ œ์•ˆ

  • ๊ฐ€๋ณ€ ๊ธธ์ด์˜ text-independentํ•œ ํ™˜๊ฒฝ์—์„œ attentive statisitics pooling์„ ์‚ฌ์šฉํ•˜๋Š” ์ฒซ ๋ฒˆ์งธ ์‹œ๋„ ์ด๋ฉฐ, ๋‹ค์–‘ํ•œ pooling layer ๋น„๊ต๋ฅผ ํ†ตํ•ด ํ‘œ์ค€ ํŽธ์ฐจ๊ฐ€ ํ™”์ž ํŠน์„ฑ์— ๋ฏธ์น˜๋Š” ํšจ๊ณผ๋ฅผ ์‹คํ—˜์ ์œผ๋กœ ๋ณด์—ฌ์คŒ




๐Ÿ“Œ Deep speaker embedding

  • ๊ธฐ์กด์˜ DNN์„ ์‚ฌ์šฉํ•œ speaker embedding ์ถ”์ถœ ๋ฐฉ๋ฒ•

input : acoustic features (MFCC, filter-bank, etc.)
a neural network such as a TDNN, CNN, or LSTM for extracting frame-level features
a pooling layer that converts the variable-length frame-level features into a fixed-dimensional vector
a fully-connected layer for extracting the utterance-level feature (one hidden layer is given a small number of nodes and used as a bottleneck feature)


[Figure: DNN-based deep speaker embedding architecture (frame-level layers, pooling layer, utterance-level layers)]
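
A minimal PyTorch sketch of this pipeline (my own illustration, not the authors' code; the layer sizes are placeholders and dilated 1-D convolutions stand in for the TDNN/CNN/LSTM):

```python
import torch.nn as nn

class SpeakerEmbeddingNet(nn.Module):
    """Conventional deep speaker embedding: frame-level layers -> pooling -> bottleneck."""
    def __init__(self, feat_dim=40, frame_dim=1500, emb_dim=512, num_speakers=1000):
        super().__init__()
        # frame-level feature extractor (dilated Conv1d layers stand in for a TDNN)
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, frame_dim, kernel_size=1), nn.ReLU(),
        )
        # utterance-level layers: the bottleneck output is used as the speaker embedding
        self.bottleneck = nn.Linear(frame_dim, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_speakers)  # speaker classes used for training

    def forward(self, x):              # x: (batch, feat_dim, num_frames), num_frames may vary
        h = self.frame_layers(x)       # (batch, frame_dim, num_frames')
        pooled = h.mean(dim=2)         # average pooling over frames -> fixed dimension
        embedding = self.bottleneck(pooled)
        return self.classifier(embedding)
```

At test time the bottleneck output is taken as the embedding and compared with a back-end scorer such as PLDA.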




๐Ÿ“Œ High-order pooling with attention

< Statistics pooling - the conventional pooling method >

  • The mean and standard deviation of the frame-level features are computed and concatenated (⊙ : Hadamard product); see the sketch after the equations below
μ = (1/T) Σ_{t=1}^{T} h_t
σ = sqrt( (1/T) Σ_{t=1}^{T} h_t ⊙ h_t − μ ⊙ μ )
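
As a concrete sketch of these two equations (assuming frame-level features h of shape (batch, frame_dim, num_frames); not the authors' code), statistics pooling can be written as:

```python
import torch

def statistics_pooling(h, eps=1e-8):
    """h: (batch, frame_dim, num_frames) frame-level features."""
    mean = h.mean(dim=2)                          # μ = (1/T) Σ_t h_t
    var = (h * h).mean(dim=2) - mean * mean       # (1/T) Σ_t h_t ⊙ h_t − μ ⊙ μ
    std = torch.sqrt(var.clamp(min=eps))          # σ (eps keeps the square root stable)
    return torch.cat([mean, std], dim=1)          # concatenation -> (batch, 2 * frame_dim)
```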

< Attention mechanism >

  • ๊ธฐ๊ณ„ ๋ฒˆ์—ญ์—์„œ ๊ธด ๋ฌธ์žฅ์˜ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์ด ์ถœ๋ ฅ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ ํŠน์ • ๋‹จ์–ด๋ฅผ ์ง‘์ค‘ํ•ด์„œ ๋ณด๋Š” ๋ฐฉ๋ฒ•์„ ๋„์ž…
[Figure: attention-based encoder-decoder for machine translation]

  • decoder์˜ ์‹œ๊ฐ„ i(ํ˜„์žฌ)์—์„œ hidden state ๋ฒกํ„ฐ๋Š” ์‹œ๊ฐ„ i-1(์ด์ „)์˜ hidden state ๋ฒกํ„ฐ์™€ ์‹œ๊ฐ„ i-1(์ด์ „)์—์„œ decoder์˜ output, ๊ทธ๋ฆฌ๊ณ  ์‹œ๊ฐ„ i(ํ˜„์žฌ)์—์„œ์˜ context ๋ฒกํ„ฐ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๊ณ„์‚ฐ

s_i = f(s_{i-1}, y_{i-1}, c_i)

  • context ๋ฒกํ„ฐ๋Š” ์‹œ๊ฐ„ i์—์„œ ์ž…๋ ฅ x์— ๋Œ€ํ•œ ๊ธธ์ด T ์ „์ฒด์— ๋Œ€ํ•œ encoder hidden state ๋ฒกํ„ฐ์˜ ๊ฐ€์ค‘ํ•ฉ์œผ๋กœ ๊ณ„์‚ฐ

c_i = Σ_{j=1}^{T} α_{ij} h_j ,   α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T} exp(e_{ik})

  • ์‹œ๊ฐ„ i์—์„œ j๋ฒˆ์งธ ๋‹จ์–ด์˜ energy๋Š” ์‹œ๊ฐ„ i-1(์ด์ „)์—์„œ decoder hidden state์™€ย j๋ฒˆ์งธ encoder hidden state๊ฐ€ ์ž…๋ ฅ์ธ aligment model(a) ๊ฒฐ๊ณผ๊ฐ’ (alignment model์€ tanh, ReLU ๋“ฑ activation function)

e_{ij} = a(s_{i-1}, h_j)
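
Putting the three equations above together, one attention step could look like the following NumPy sketch (the tanh alignment model, the parameter names W_s, W_h, v, and all shapes are illustrative assumptions):

```python
import numpy as np

def attention_step(s_prev, enc_states, W_s, W_h, v):
    """s_prev: (d_dec,) decoder state at time i-1; enc_states: (T, d_enc) encoder states."""
    # energies e_ij = a(s_{i-1}, h_j), here with a tanh alignment model
    energies = np.tanh(s_prev @ W_s + enc_states @ W_h) @ v   # shape (T,)
    # softmax over the T source positions -> attention weights α_ij
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()
    # context vector c_i = Σ_j α_ij h_j
    context = weights @ enc_states
    return context, weights
```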


< Attentive statistics pooling >

e_t = v^T f(W h_t + b) + k
α_t = exp(e_t) / Σ_{τ=1}^{T} exp(e_τ)

The weights computed with the attention mechanism are used to update the mean and standard deviation (see the sketch after the equations below)

μ̃ = Σ_{t=1}^{T} α_t h_t
σ̃ = sqrt( Σ_{t=1}^{T} α_t h_t ⊙ h_t − μ̃ ⊙ μ̃ )
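
A minimal PyTorch sketch of attentive statistics pooling as described by the equations above (my reading, not the authors' code; the attention dimension and the use of tanh for f(·) are assumptions):

```python
import torch
import torch.nn as nn

class AttentiveStatisticsPooling(nn.Module):
    def __init__(self, frame_dim=1500, attn_dim=64, eps=1e-8):
        super().__init__()
        self.proj = nn.Linear(frame_dim, attn_dim)   # W h_t + b
        self.score = nn.Linear(attn_dim, 1)          # v^T f(W h_t + b) + k
        self.eps = eps

    def forward(self, h):                            # h: (batch, num_frames, frame_dim)
        e = self.score(torch.tanh(self.proj(h)))     # frame scores e_t, (batch, num_frames, 1)
        alpha = torch.softmax(e, dim=1)              # attention weights α_t over the frames
        mean = (alpha * h).sum(dim=1)                # weighted mean μ̃ = Σ_t α_t h_t
        var = (alpha * h * h).sum(dim=1) - mean * mean
        std = torch.sqrt(var.clamp(min=self.eps))    # weighted standard deviation σ̃
        return torch.cat([mean, std], dim=1)         # (batch, 2 * frame_dim)
```

With α_t fixed to 1/T for every frame, this reduces to the plain statistics pooling above.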




๐Ÿ“Œ Experimental settings

i-vector

input : 60-dimensional MFCC
UBM : 2048 mixtures
TV matrix, i-vector : 400 dimensions
Similarity score : PLDA
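
For context (this is not spelled out in the notes above), the TV matrix and i-vector dimensions refer to the standard total-variability model, in which an utterance-dependent GMM mean supervector M is modeled as

M = m + T w

where m is the UBM mean supervector, T is the total variability matrix, and w is the 400-dimensional i-vector; PLDA then scores pairs of such i-vectors.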


Deep speaker embedding

input : 20-dimensional (SRE 12) / 40-dimensional (VoxCeleb) MFCC
hidden layers : 5-layer TDNN (activation function : ReLU, nodes : 512)
pooling dimension : 1500
frame-level features are formed from a 15-frame context of acoustic feature vectors (MFCC)
2 fully-connected layers (1st : 512-dim bottleneck feature, activation function : ReLU, batch normalization); these settings are wired up in the sketch below
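
These settings could be wired up roughly as follows (a hedged sketch: the kernel sizes and dilations are my assumptions, chosen so that the five TDNN layers span about 15 frames, and the second fully-connected layer is shown as the speaker-classification output):

```python
import torch.nn as nn

def build_frame_level_tdnn(feat_dim=40):        # 40-dim MFCC for VoxCeleb (20-dim for SRE 12)
    # five TDNN layers (512 nodes, ReLU); contexts of ±2, ±2, ±3 frames give a 15-frame span
    return nn.Sequential(
        nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
        nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
        nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
        nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),  # 1500-dim features fed to the pooling layer
    )

def build_utterance_level_head(num_speakers):
    # 1st fully-connected layer = 512-dim bottleneck (the speaker embedding), ReLU + batch norm
    return nn.Sequential(
        nn.Linear(2 * 1500, 512), nn.ReLU(), nn.BatchNorm1d(512),
        nn.Linear(512, num_speakers),
    )
```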