Differentiable Data Selection (DDS)
Differentiable Data Selection (DDS) is a general machine learning method for optimising the weighting of training examples to improve a predetermined objective. In the paper, this objective is the average loss across the different languages: the authors directly optimise the weights of the training data from each language to minimise this loss on a multilingual development set. Specifically, DDS uses a technique called bilevel optimisation to learn a data scorer P(x,y;ψ), parameterised by ψ, where (x,y) is a sentence pair from the training data, such that training on data sampled from the scorer optimises the model's performance on the development set.
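The bilevel idea can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a language-level scorer (a softmax over one logit per language) rather than a sentence-level one, a toy quadratic loss in place of an NMT model, and a reward equal to the dot product between the development-set gradient after the model update and each language's training gradient before it. The names train, grad, the learning rates and the target vectors are all hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad(theta, target):
    # Gradient of the toy loss 0.5 * ||theta - target||^2 (stand-in for an NMT loss)
    return [t - g for t, g in zip(theta, target)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(train_targets, dev_target, steps=200, lr_model=0.1, lr_scorer=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                 # model parameters
    psi = [0.0] * len(train_targets)   # scorer logits, one per language (assumption)
    for _ in range(steps):
        probs = softmax(psi)
        # Training gradients of every language at the current parameters
        train_grads = [grad(theta, t) for t in train_targets]
        # Inner level: sample a language from the scorer and take one SGD step
        i = rng.choices(range(len(psi)), weights=probs)[0]
        theta = [t - lr_model * g for t, g in zip(theta, train_grads[i])]
        # Outer level: reward = alignment between the dev gradient (after the
        # step) and each language's training gradient (before the step)
        g_dev = grad(theta, dev_target)
        rewards = [dot(g_dev, g) for g in train_grads]
        # REINFORCE-style update: sum_i R_i * d/dpsi log P(i; psi),
        # which for a softmax is R_j - P(j) * sum_i R_i in component j
        total = sum(rewards)
        psi = [p + lr_scorer * (r - total * pr)
               for p, r, pr in zip(psi, rewards, probs)]
    return theta, softmax(psi)

# Language 0 points in the same direction as the dev set, language 1 opposes it
theta, probs = train([[1.0, 0.0], [-1.0, 0.0]], [1.0, 0.0])
print(probs)
```

In this toy setting the scorer shifts most of its probability mass to the language whose gradients agree with the development set, which is the behaviour the bilevel formulation is meant to produce.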
To test the effectiveness of the proposed method, the authors use the 58-languages-to-English parallel data from Qi et al. (2018). They train a multilingual NMT model on each of two sets of languages that differ in their level of language diversity:
Related: 4 Low Resource Languages (LRLs) (Azerbaijani:aze, Belarusian:bel, Galician:glg, Slovak:slk) and a related High Resource Language (HRL) for each LRL (Turkish:tur, Russian:rus, Portuguese:por, Czech:ces)
Diverse: 8 languages with varying amounts of data, picked without consideration for relatedness (Bosnian:bos, Marathi:mar, Hindi:hin, Macedonian:mkd, Greek:ell, Bulgarian:bul, French:fra, Korean:kor)
For each set of languages, they test two varieties of translation: 1) many-to-one (M2O): translating 8 languages to English; 2) one-to-many (O2M): translating English into 8 different languages.
Baselines: They compare the proposed method with three standard heuristic-based methods:
- Uniform (τ=∞): datasets are sampled uniformly, so that LRLs are over-sampled to match the size of the HRLs;
- Temperature: scales the proportional distribution by τ=5 to slightly over-sample the LRLs;
- Proportional (τ=1): datasets are sampled proportional to their size, so that there is no oversampling of the LRLs.
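The three baselines can be read as one formula with different temperatures. The sketch below assumes the common formulation in which the sampling probability of dataset i is proportional to (n_i / N)^(1/τ), where n_i is its size and N the total; the function name sampling_probs and the dataset sizes are hypothetical and not taken from the paper.

```python
import math

def sampling_probs(sizes, tau):
    """Per-dataset sampling probabilities p_i proportional to (n_i / N)^(1/tau)."""
    total = sum(sizes.values())
    if math.isinf(tau):
        # tau = infinity: the exponent is 0, so every dataset gets equal weight
        scaled = {lang: 1.0 for lang in sizes}
    else:
        scaled = {lang: (n / total) ** (1.0 / tau) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}

# Hypothetical dataset sizes, for illustration only
sizes = {"hrl": 200_000, "lrl": 5_000}

proportional = sampling_probs(sizes, 1)         # no oversampling of the LRL
temperature = sampling_probs(sizes, 5)          # LRL slightly over-sampled
uniform = sampling_probs(sizes, math.inf)       # LRL sampled as often as the HRL

for name, p in [("proportional", proportional),
                ("temperature", temperature),
                ("uniform", uniform)]:
    print(name, {lang: round(v, 3) for lang, v in p.items()})
```

Raising τ moves the distribution from proportional towards uniform, so the LRL share grows monotonically with the temperature.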