Abstract + Introduction
- difficult to effectively train a dual-encoder for dense passage retrieval
- retriever needs to identify positive passages for each question from a large collection
- there might be a large number of unlabeled positives (IR datasets may contain labeling errors, i.e., missing positive labels)
- it is expensive to acquire large-scale training data for open-domain QA
In short, large-scale open-domain QA models are hard to train, so the paper proposes three methods to improve the overall pipeline:
- cross-batch negatives
- denoised hard negatives
- leveraging large-scale unsupervised data “labeled” by a cross-encoder for data augmentation
Method
The similarity function is the dot product between the question and passage embeddings.
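As a minimal sketch (the notation is an assumption, following DPR-style conventions: $E_Q$ and $E_P$ denote the question and passage encoders):

$$
\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)
$$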
Cross-batch Negatives
- in-batch negatives have been widely used in previous work
- first compute the passage embeddings on each single GPU, then share these passage embeddings among all the GPUs; each GPU's embeddings are reused as negatives by the others, so with N GPUs the number of available negatives grows by a factor of N (see the sketch below)
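A minimal PyTorch sketch of cross-batch negatives, assuming a distributed process group is already initialized and each GPU holds one aligned batch of question/positive-passage embeddings; this is an illustration under those assumptions, not the authors' implementation, and the function name `cross_batch_loss` is hypothetical:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def cross_batch_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Cross-batch negatives: gather passage embeddings from all GPUs so each
    question contrasts against N*B passages instead of only the local B.

    q_emb: (B, d) question embeddings on this GPU
    p_emb: (B, d) positive passage embeddings on this GPU
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Collect passage embeddings from every GPU.
    gathered = [torch.zeros_like(p_emb) for _ in range(world_size)]
    dist.all_gather(gathered, p_emb)
    # all_gather does not propagate gradients, so re-insert the local tensor
    # to keep the gradient path for this GPU's own passages.
    gathered[rank] = p_emb
    all_p = torch.cat(gathered, dim=0)  # (N*B, d)

    # Dot-product similarity against all N*B passages.
    scores = q_emb @ all_p.t()  # (B, N*B)
    # The positive for local question i sits at global index rank*B + i.
    batch_size = q_emb.size(0)
    labels = torch.arange(batch_size, device=q_emb.device) + rank * batch_size
    return F.cross_entropy(scores, labels)
```

Compared with plain in-batch negatives, each question now contrasts against N·B−1 negatives instead of B−1, at the cost of one `all_gather` per step.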
Denoised Hard Negatives
- Traditional methods may bring in false negatives (unlabeled positives)
- a manual check of 100 questions finds that about 70% of the top-retrieved candidate negatives are actually positives or highly relevant
- large IR datasets commonly suffer from missing positive labels
- utilize a well-trained cross-encoder to remove top-retrieved passages that are likely to be false negatives
- Cross-encoder architecture is more powerful for capturing semantic similarity via deep interaction
A retrieved passage counts as a true negative only if the cross-encoder's score falls below a threshold (< 0.1); the cross-encoder thus picks out passages that are genuinely negative to help the model learn (as sketched below).
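A minimal sketch of the denoising filter, assuming `cross_encoder_score` is a hypothetical callable returning a relevance probability in [0, 1]; the 0.1 threshold follows the notes above:

```python
def denoise_hard_negatives(question, retrieved_passages, positives,
                           cross_encoder_score, threshold=0.1):
    """Keep a top-retrieved passage as a hard negative only if it is not a
    labeled positive AND the cross-encoder is confident it is irrelevant."""
    hard_negatives = []
    for passage in retrieved_passages:
        if passage in positives:
            continue  # already a labeled positive, never a negative
        if cross_encoder_score(question, passage) < threshold:
            hard_negatives.append(passage)  # confidently negative
        # passages scoring >= threshold are dropped as likely false negatives
    return hard_negatives
```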
Data Augmentation
- utilize cross-encoder to annotate unlabeled questions for data augmentation
- only select the predicted positive and negative passages with high confidence scores estimated by the cross-encoder
The cross-encoder is likewise used to generate labels for unlabeled data: score > 0.9 = positive, score < 0.1 = negative (see the sketch below).
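A matching sketch for the pseudo-labeling step, with the same hypothetical `cross_encoder_score` and the 0.9 / 0.1 thresholds from the notes:

```python
def pseudo_label(question, candidate_passages, cross_encoder_score,
                 pos_threshold=0.9, neg_threshold=0.1):
    """Label unlabeled (question, passage) pairs with the cross-encoder,
    keeping only high-confidence predictions and discarding the rest."""
    positives, negatives = [], []
    for passage in candidate_passages:
        score = cross_encoder_score(question, passage)
        if score > pos_threshold:
            positives.append(passage)
        elif score < neg_threshold:
            negatives.append(passage)
        # scores in between are too uncertain to use as training signal
    return positives, negatives
```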
Training Pipeline
- Step 1: train a dual-encoder (M_D) on the labeled data with cross-batch negatives; the negatives here are randomly sampled, or BM25 top-k also works
- Step 2: train a cross-encoder (M_C); the positives are the original positives in the data, and the negatives are top-k passages retrieved by the previous dual-encoder that do not belong to the positive passages (hard negatives by M_D)
- Step 3: train another dual-encoder (M_D1), using the previous step's data for label denoising: besides being a hard negative by M_D, a passage must also be predicted negative by the cross-encoder (M_C) to count as a true negative (denoised)
- Step 4: pseudo-label data: M_D1 first retrieves the top-k passages, M_C then assigns labels, and finally a dual-encoder M_D2 is trained on the result. The dual-encoder is responsible for retrieving the top-k passages, while the cross-encoder is responsible for denoising and for labeling the unlabeled data (see the pipeline sketch after this list)
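A high-level sketch tying the four steps together; every helper here (`train_dual_encoder`, `retrieve_hard_negatives`, `train_cross_encoder`, `pseudo_label_with`) is a hypothetical stand-in for a full training or retrieval routine, so this shows the data flow rather than runnable training code:

```python
def training_pipeline(labeled_data, unlabeled_questions, corpus):
    # Step 1: dual-encoder M_D with cross-batch negatives on labeled data;
    # initial negatives are random samples (or BM25 top-k).
    m_d = train_dual_encoder(labeled_data, negatives="random_or_bm25")

    # Step 2: cross-encoder M_C; negatives are M_D's top-k retrievals
    # that are not labeled positives (hard negatives by M_D).
    hard_negs = retrieve_hard_negatives(m_d, labeled_data, corpus)
    m_c = train_cross_encoder(labeled_data, hard_negs)

    # Step 3: dual-encoder M_D1 with denoised hard negatives:
    # keep only hard negatives that M_C also scores as negative (< 0.1).
    denoised = [n for n in hard_negs
                if m_c.score(n.question, n.passage) < 0.1]
    m_d1 = train_dual_encoder(labeled_data, negatives=denoised)

    # Step 4: M_D1 retrieves top-k for unlabeled questions, M_C labels them
    # (> 0.9 positive, < 0.1 negative), and M_D2 trains on the augmented set.
    pseudo = pseudo_label_with(m_d1, m_c, unlabeled_questions, corpus)
    m_d2 = train_dual_encoder(labeled_data + pseudo, negatives=denoised)
    return m_d2
```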
Experiments
Experimental results (the four panels of the results figure):
- Top-left
  - compares the model's performance after each step of the pipeline; you can see a difference of about 10 points after denoising!
- Top-right
  - shows the proportion of the dual-encoder's top-k retrieved passages that need to be denoised; it is indeed quite high
- Bottom-left
  - with data augmentation, the more data is generated, the better the model performs
- Bottom-right
  - overall performance once the downstream QA architecture is plugged in
Conclusion
Having read through this paper, I think the most useful part is the denoising; it is also the key to the overall performance improvement.