Progress in Multi-speaker Diarization Technology and Its Applications (2024.3)

Outline
1. Research background
2. Industrial version: a modular system
3. Improvements
4. Deployed applications

1. Research Background

Multi-speaker diarization (speaker diarization): given a recording in which several people speak in turn, the system must determine who is speaking in each time interval.
[Diagram: audio -> multi-speaker diarization system -> segmentation information ("who spoke when").]
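The segmentation information is usually a list of (start, end, speaker) segments; evaluations such as DIHARD exchange it as RTTM files. Below is a minimal, illustrative sketch (the segment values and file id are made up, not from the slides) that writes such segments as standard RTTM SPEAKER lines.

```python
# Minimal sketch: turn "who spoke when" segments into RTTM SPEAKER lines.
# The segments and file id below are invented for illustration only.
segments = [
    (0.00, 3.20, "spk1"),   # (start_sec, end_sec, speaker_label)
    (3.20, 7.85, "spk2"),
    (7.10, 9.00, "spk1"),   # overlaps spk2: two people talking at once
]

def to_rttm(file_id, segments):
    """Format diarization segments as RTTM SPEAKER lines."""
    lines = []
    for start, end, spk in segments:
        dur = end - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.2f} {dur:.2f} <NA> <NA> {spk} <NA> <NA>"
        )
    return "\n".join(lines)

print(to_rttm("meeting_001", segments))
```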

1. Research Background: Application Scenarios

Typical applications: meeting minutes, multi-speaker transcription, intelligent customer service, call-recording quality inspection, and more.
Terminal devices: voice recorders, smartphones, personal computers.
Supporting vendors/products: 科大訊飛 iFLYTEK (smart office notebook), (AI meeting minutes), 聲云 (speech transcription), etc.

1. Research Background: Competitions and Datasets

[Timeline figure, 2000-2023: CALLHOME, Rich Transcription (RT), AMI, MIXER6, DIHARD (I, II, III), CHiME-6, VoxSRC (20, 21, 22, 23), AliMeeting / M2MeT, AISHELL-4, M2MeT 2.0, CHiME-7; earlier work built on modular architectures, later work on end-to-end architectures.]

Research trend: from simple scenarios to complex scenarios.
Challenges: noise and interference, unknown number of speakers, overlapping speech, etc.
Applications: offline => online, single microphone => multiple microphones, adaptation to new scenarios.

1. Research Background: Modular Systems

Clustering methods: AHC [1], SC [2,3], VB/VBx [4,5], UIS-RNN [6], DNC [7].

[1] K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105–112, 1978.
[2] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, pp. 395–416, 2007.
[3] T. Park, Kyu J. Han, Manoj Kumar, and Shrikanth S. Narayanan, "Auto-tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381–385, 2020.
[4] M. Diez, L. Burget, S. Wang, J. Rohdin, H. Cernocky, "Bayesian HMM based x-vector Clustering for Speaker Diarization," Interspeech, 2019, pp. 346–350.
[5] M. Diez, L. Burget, F. Landini, J. Cernocky, "Analysis of Speaker Diarization based on Bayesian HMM with Eigenvoice Priors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 355–368, 2020.
[6] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully Supervised Speaker Diarization," ICASSP, 2019.
[7] Q. J. Li, F. L. Kreyssig, C. Zhang, P. C. Woodland, "Discriminative Neural Clustering for Speaker Diarisation," IEEE Spoken Language Technology Workshop (SLT 2021), Jan 2021, Shenzhen, China.
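For context, a minimal sketch of how one of these clustering methods is typically applied, here spectral clustering (SC) on a cosine-affinity matrix of segment embeddings via scikit-learn. The embeddings and the speaker count are stand-ins for illustration and are not taken from the slides; in practice the speaker count can be estimated, e.g. with the normalized maximum eigengap (NME) of [3].

```python
# Illustrative sketch (not from the slides): spectral clustering of segment
# embeddings, assuming the number of speakers is known or estimated elsewhere.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 128))        # stand-in for per-segment x-vectors

# Cosine affinity between L2-normalized embeddings, clipped to [0, 1].
norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
affinity = np.clip(norm @ norm.T, 0.0, 1.0)

num_speakers = 3                               # assumed known for this sketch
labels = SpectralClustering(
    n_clusters=num_speakers,
    affinity="precomputed",                    # use the cosine affinity directly
    random_state=0,
).fit_predict(affinity)

print(labels[:10])                             # cluster index (speaker id) per segment
```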

1. Research Background: End-to-End Systems

EEND [1]: end-to-end model based on Bi-LSTM.
SA-EEND [2]: end-to-end model based on a Transformer encoder.
EDA-EEND [3]: EEND variant that can also predict the number of speakers.
TS-VAD [4]: target-speaker voice activity detection model.
...

[1] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-end Neural Speaker Diarization with Permutation-free Objectives," in Interspeech, 2019, pp. 4300–4304.
[2] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Self-Attention," 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 2019, pp. 296–303.
[3] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," in Interspeech, 2020, pp. 269–273.
[4] I. Medennikov, M. Korenevsky, et al., "Target-speaker Voice Activity Detection: a Novel Approach for Multi-speaker Diarization in a Dinner Party Scenario," arXiv, vol. abs/2005.07272, 2020.
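The "permutation-free" objective of EEND [1] scores the network's per-frame speaker activities against the reference labels under every speaker permutation and keeps the lowest loss. A minimal numpy sketch of that idea (toy tensors only, no neural network):

```python
# Minimal sketch of a permutation-free (PIT) diarization loss as used by EEND [1].
# y_pred: frame-wise speaker-activity probabilities, shape (frames, speakers).
# y_true: reference 0/1 activities, same shape; toy values for illustration.
import itertools
import numpy as np

def bce(p, t, eps=1e-7):
    """Frame-averaged binary cross-entropy between probabilities p and targets t."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())

def permutation_free_loss(y_pred, y_true):
    """Try every assignment of output channels to reference speakers, keep the best."""
    n_spk = y_true.shape[1]
    losses = [
        bce(y_pred[:, list(perm)], y_true)
        for perm in itertools.permutations(range(n_spk))
    ]
    return min(losses)

y_true = np.array([[1, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
y_pred = np.array([[0.1, 0.9], [0.2, 0.8], [0.7, 0.9], [0.8, 0.1]], dtype=float)
print(permutation_free_loss(y_pred, y_true))   # loss under the best speaker permutation
```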

1. Research Background: Summary of Clustering and End-to-End Algorithms

| Algorithm | Training                | Input feature     | Overlap detection | Number of speakers          |
| AHC       | unsupervised clustering | x-vector          | not supported     | distance threshold          |
| VB        | unsupervised clustering | i-vector          | not supported     | initialization and tuning   |
| VBx       | unsupervised clustering | x-vector          | not supported     | initialization and tuning   |
| SC        | unsupervised clustering | x-vector          | not supported     | threshold / NME             |
| UIS-RNN   | supervised              | d-vector          | not supported     | output nodes                |
| DNC       | supervised              | d-vector          | not supported     | output nodes                |
| EEND      | supervised              | acoustic features | supported         | designed for 2 speakers     |
| TS-VAD    | supervised              | i-vector          | supported         | output nodes                |

Online versions: research mainly builds on the EEND [1,2] or UIS-RNN [3,4] frameworks.
Microphone-array versions: multi-channel TS-VAD [5] or joint front-end/back-end optimization.
Specific scenarios: different strategies for different scenarios [6].

[1] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, P. Garcia, and K. Nagamatsu, "Online end-to-end neural diarization with speaker-tracing buffer," in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 841–848.
[2] E. Han, C. Lee, and A. Stolcke, "BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers," in ICASSP, 2021.
[3] E. Fini and A. Brutti, "Supervised online diarization with sample mean loss for multi-domain data," in ICASSP, 2020, pp. 7134–7138.
[4] X. Wan, K. Liu, H. Zhou, "Online speaker diarization equipped with discriminative modeling and guided inference," in Interspeech, 2021.
[5] I. Medennikov, M. Korenevsky, et al., "Target-speaker Voice Activity Detection: a Novel Approach for Multi-speaker Diarization in a Dinner Party Scenario," arXiv, vol. abs/2005.07272, 2020.
[6] Y.-X. Wang, J. Du, M.-K. He, S.-T. Niu, L. Sun, C.-H. Lee, "Scenario-dependent speaker diarization for DIHARD-III challenge," in Interspeech, 2021.

2. Industrial Version: Modular System

2.1 Audio segmentation
Function: cut the audio into short segments, turning diarization into a clustering problem.
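A minimal sketch of this step (the speech regions, the 1.5 s segment length, and the 0.5 s minimum length are illustrative assumptions, not values from the slides): each VAD speech region is cut into fixed-length subsegments so that every piece can later be embedded and clustered.

```python
# Illustrative sketch: split VAD speech regions into fixed-length subsegments.
# Region boundaries and the 1.5 s segment length are made-up example values.
def uniform_split(speech_regions, seg_len=1.5, min_len=0.5):
    """Cut (start, end) speech regions into subsegments of at most seg_len seconds."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t < end:
            seg_end = min(t + seg_len, end)
            if seg_end - t >= min_len:          # drop fragments that are too short
                segments.append((round(t, 2), round(seg_end, 2)))
            t = seg_end
    return segments

speech_regions = [(0.0, 4.2), (6.0, 7.3)]       # example output of a VAD front end
print(uniform_split(speech_regions))
# [(0.0, 1.5), (1.5, 3.0), (3.0, 4.2), (6.0, 7.3)]
```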

2.2 Speaker-embedding extraction
Function: extract a segment-level speaker representation (one embedding vector per segment).
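As a minimal illustration of "one embedding per segment" (this is not the actual x-vector network; the frame features, 100 frames/s rate, and segment list are invented), frame-level features can be pooled over each segment's frames:

```python
# Illustrative stand-in for segment-level embedding extraction: mean-pool
# frame-level features over each segment. A real system would run an x-vector
# (or similar) network here; the data and frame rate are example values only.
import numpy as np

FRAMES_PER_SEC = 100
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(1000, 40))        # 10 s of fake 40-dim frame features

def segment_embedding(frame_feats, start_sec, end_sec):
    """Average the frames that fall inside [start_sec, end_sec)."""
    lo = int(start_sec * FRAMES_PER_SEC)
    hi = int(end_sec * FRAMES_PER_SEC)
    return frame_feats[lo:hi].mean(axis=0)

segments = [(0.0, 1.5), (1.5, 3.0), (6.0, 7.3)]  # from the segmentation step above
embeddings = np.stack([segment_embedding(frame_feats, s, e) for s, e in segments])
print(embeddings.shape)                          # (num_segments, feature_dim)
```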

2.3 Clustering: agglomerative hierarchical clustering (AHC)
Function: group the segments of the same speaker into one cluster.

Reference: K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105–112, 1978.
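A minimal AHC sketch with SciPy (the embeddings and the 0.5 merge threshold are illustrative assumptions; real systems tune the threshold on development data): segments whose cosine distance stays below the threshold end up in the same speaker cluster.

```python
# Illustrative AHC over segment embeddings using cosine distance (SciPy).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 128))          # stand-in for segment x-vectors

dists = pdist(embeddings, metric="cosine")       # condensed pairwise cosine distances
tree = linkage(dists, method="average")          # average-linkage agglomeration
labels = fcluster(tree, t=0.5, criterion="distance")  # cut the tree at the threshold

print(labels)                                    # speaker cluster id per segment
```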

First-generation product (combined with ASV-Subtools*)

[System diagram: raw audio -> voice activity detection (VAD) -> speaker diarization (SD) -> speech recognition (ASR) -> recognition post-processing, producing per-speaker output (speaker 1, speaker 2, speaker 3, speaker 4).]
* /Snowdar/asv-subtools

Algorithm flow: VAD -> uniform segmentation -> x-vector extraction with Subtools -> PCA dimensionality reduction -> cosine scoring -> AHC clustering.
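A sketch of the scoring half of this flow (PCA dimensionality reduction followed by cosine scoring); the 32-dimensional target, the embedding size, and the data are assumptions for illustration, and the resulting distances would feed the AHC step shown above.

```python
# Illustrative PCA reduction + cosine scoring of segment x-vectors, following the
# first-generation flow; dimensions and data are example values only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
xvectors = rng.normal(size=(50, 512))            # stand-in for Subtools x-vectors

reduced = PCA(n_components=32).fit_transform(xvectors)   # PCA dimensionality reduction

norm = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
scores = norm @ norm.T                           # cosine similarity between segments
distances = 1.0 - scores                         # cosine distance, input to AHC

print(scores.shape, float(distances.max()))
```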

Remaining problem: overlapping speech
Speaker overlap:
Does speech overlap occur in the target region?
Who overlaps with whom?
Image and audio: https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all

3. Improvements: Neural Network Segmentation

Solution: split the audio into chunks and, for each chunk, use a neural network to decide which speakers are active (at most 3 speakers per chunk).
Chunks of 5 seconds with a 2.5-second shift.
When extracting x-vectors, remove the overlapping speech and merge the speech of the same speaker (see the sketch after the reference below).
Image: https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all

H. Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe."
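A minimal sketch of that last step (the per-frame activities, the chunk length of 10 frames, and the 8-dimensional frame embeddings are all invented for illustration): given per-frame speaker activities from the neural segmentation of one chunk, drop the frames where more than one speaker is active, then average the remaining frames per speaker to obtain one clean embedding per local speaker.

```python
# Illustrative sketch: clean embedding pooling from a neural segmentation output.
# "activities" marks which of up to 3 local speakers is active in each frame of a
# chunk; the values, 10 frames, and 8-dim frame embeddings are made up.
import numpy as np

activities = np.array([                           # (frames, local speakers)
    [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0],   # frames 2-3: overlap -> discard
    [0, 1, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1],
    [0, 0, 1], [0, 0, 1],
])
rng = np.random.default_rng(0)
frame_emb = rng.normal(size=(10, 8))              # stand-in frame-level embeddings

non_overlap = activities.sum(axis=1) == 1         # keep frames with exactly one speaker
speaker_embeddings = {}
for spk in range(activities.shape[1]):
    mask = non_overlap & (activities[:, spk] == 1)  # this speaker, overlap removed
    if mask.any():                                # merge all of this speaker's frames
        speaker_embeddings[spk] = frame_emb[mask].mean(axis=0)

print({spk: emb.shape for spk, emb in speaker_embeddings.items()})
```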
