
📎End-to-End Speech Recognition by Following my Development History - Guest Lecturer Shinji Watanabe

Previous notes

End-to-End Speech Recognition by Following my Development History | Guest Lecturer Shinji Watanabe - YouTube

Impression in ~2015!

  • Attention based encoder decoder
  • No conditional independence assumption
  • Attention mechanism allows overly flexible alignments (too hard to train)

$$\arg \max_W p(W|X) = \arg \max_W \prod_j p(w_j \mid w_{<j}, X)$$
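
As a rough illustration of this factorization, here is a minimal greedy-decoding sketch in PyTorch-style code; the `decoder` callable and all names are hypothetical, not the lecture's actual implementation:

```python
import torch

def greedy_decode(encoder_out, decoder, sos_id, eos_id, max_len=100):
    """Greedy decoding under p(W|X) = prod_j p(w_j | w_<j, X).

    encoder_out: encoded acoustic features for the utterance X.
    decoder:     hypothetical callable returning log p(w_j | w_<j, X)
                 over the vocabulary for the current prefix.
    """
    hyp = [sos_id]
    for _ in range(max_len):
        # Each step conditions on the full prefix w_<j and on X:
        # no conditional independence assumption between outputs.
        log_probs = decoder(torch.tensor(hyp), encoder_out)  # shape: (vocab,)
        next_token = int(log_probs.argmax())
        hyp.append(next_token)
        if next_token == eos_id:
            break
    return hyp
```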

Implementation 2016

CTC

  • Relies on a conditional independence assumption, similar to HMMs (see the CTC loss sketch after this list)
  • Output sequence is not well modeled (no language model)

LAS

  • Attention allows overly flexible alignments
    • Too hard to train from scratch
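
To make the CTC point above concrete, a minimal sketch of the CTC objective using PyTorch's `torch.nn.CTCLoss` (shapes and random tensors are purely illustrative):

```python
import torch
import torch.nn as nn

# Illustrative shapes: T input frames, N utterances, C output labels (0 = blank).
T, N, C, S = 50, 4, 30, 10
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # per-frame label posteriors
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC marginalizes over all monotonic alignments but treats per-frame outputs
# as conditionally independent given the input (similar to HMMs), so the
# output sequence itself is not modeled (no language model).
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```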

Input/output alignment: the alignment of a good ASR system should be monotonic (a single diagonal line), but it is hard for an end-to-end trained model to learn such a monotonic alignment.

assets/images/Pasted image 20230405023640.png

Use the benefits of both CTC and attention

Hybrid CTC/Attention networks work well!

  • Attention-based models take too long to converge during training, and the benefit of their parallel training is not actually much greater than with RNNs (see the multi-task loss sketch below).
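
The hybrid training objective interpolates the two losses with a weight λ; a minimal sketch, where the 0.3 default is just a commonly seen value, not necessarily the lecture's:

```python
import torch

def hybrid_ctc_attention_loss(loss_ctc: torch.Tensor,
                              loss_att: torch.Tensor,
                              ctc_weight: float = 0.3) -> torch.Tensor:
    """Multi-task objective: L = lambda * L_ctc + (1 - lambda) * L_att.

    The CTC branch pushes the model toward monotonic alignments, which
    regularizes the attention branch and speeds up convergence.
    """
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```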

assets/images/Pasted image 20230405023716.png

+ joint decoding
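
During beam search, each partial hypothesis can be scored by interpolating the attention decoder's log probability with a CTC prefix score; a simplified sketch, where the CTC prefix scorer is assumed to exist elsewhere:

```python
def joint_score(att_logp: float, ctc_prefix_logp: float,
                ctc_weight: float = 0.3) -> float:
    """Joint decoding score for one partial hypothesis:
    score = lambda * log p_ctc(prefix | X) + (1 - lambda) * log p_att(prefix | X).
    The CTC term penalizes hypotheses whose alignment is not monotonic.
    """
    return ctc_weight * ctc_prefix_logp + (1.0 - ctc_weight) * att_logp
```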

assets/images/Pasted image 20230405023759.png

assets/images/Pasted image 20230405023815.png

Discussions

assets/images/Pasted image 20230405023843.png

Check

  • Attention pattern! (see the plotting sketch below)
  • Learning curves!
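
One quick way to run these checks is to plot the attention weight matrix of a decoded utterance and look for a roughly diagonal band; a matplotlib sketch with hypothetical inputs:

```python
import matplotlib.pyplot as plt

def plot_attention(attn, utt_id="sample"):
    """attn: (decoder_steps, encoder_frames) attention weight matrix.

    A healthy ASR model usually shows a near-diagonal (monotonic) band;
    diffuse or scattered weights suggest training has not converged yet.
    """
    plt.imshow(attn, aspect="auto", origin="lower", cmap="viridis")
    plt.xlabel("encoder frames (input)")
    plt.ylabel("decoder steps (output)")
    plt.title(f"Attention pattern: {utt_id}")
    plt.colorbar()
    plt.tight_layout()
    plt.show()
```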

assets/images/Pasted image 20230405023859.png

Speech recognition pipeline

The traditional pipeline requires too many independence assumptions to make the problem tractable.
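
For reference, the classical pipeline follows the standard noisy-channel decomposition, with each factor trained as a separate module under its own independence assumptions (this is the textbook formulation, not a slide from the lecture):

$$\hat{W} = \arg \max_W p(W|X) \approx \arg \max_W \sum_{S} p(X|S)\, p(S|W)\, p(W)$$

where $p(X|S)$ is the acoustic model over hidden states $S$, $p(S|W)$ the lexicon/pronunciation model, and $p(W)$ the language model.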

assets/images/Pasted image 20230405023914.png