Language-Agnostic Sentence Representations

In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages. We showcase several applications of multilingual sentence embeddings, with code to reproduce our results, in the directory “tasks”. Our model was trained on the following languages:
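Because translations of the same sentence land close together in the shared embedding space, a common application is matching sentences across languages by nearest-neighbour search over cosine similarity. Below is a minimal sketch of that idea; the toy vectors and sentence pairings are illustrative stand-ins for real encoder output, not actual embeddings.

```python
import numpy as np

# Hypothetical stand-in for encoder output: in practice these vectors would
# come from the multilingual sentence encoder; fixed toy vectors keep the
# snippet self-contained.
emb_en = np.array([[0.90, 0.10, 0.00],   # "Hello world"
                   [0.00, 0.80, 0.60]])  # "Good morning"
emb_fr = np.array([[0.88, 0.12, 0.05],   # "Bonjour le monde"
                   [0.05, 0.75, 0.65]])  # "Bon matin"

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

sim = cosine_matrix(emb_en, emb_fr)
# For each English sentence, pick the closest French sentence.
matches = sim.argmax(axis=1)
print(matches)  # each English sentence is paired with its translation
```

The same similarity matrix underlies tasks such as bitext mining and cross-lingual retrieval, where margin-based variants of this score are typically used.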

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.
