adirik / multilingual-e5-small

Multilingual E5-small language embedding model

Multilingual E5-Small

A text embedding model based on the XLM-RoBERTa architecture and trained on a diverse mix of multilingual datasets.

The model takes a single query and a passage, and outputs a text embedding for each together with a relevancy score. For example:

  • query: “What is the recommended sugar intake for women”
  • passage: “As a general guideline, the CDC’s average recommended sugar intake for women ages 19 to 70 is 20 grams per day.”

You can embed a single document directly. If you also supply a query, specify the task along with it, because the model was trained this way; omitting the task may degrade performance. Define the task as a concise, one-sentence instruction. Natural-language instructions of this kind let you tailor the embeddings to different scenarios.
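As a rough illustration of pairing a query with a one-sentence task instruction, the sketch below builds a single instructed query string. The `Instruct: ...` / `Query: ...` template follows the format described for instruction-tuned E5 models in the cited paper; this page does not state whether the API expects that exact template or applies any formatting itself, so the helper name `format_instructed_query`, the template, and the example task text are all assumptions.

```python
def format_instructed_query(task: str, query: str) -> str:
    """Prefix a query with a one-sentence task instruction.

    Hypothetical helper: the template is an assumption, not a documented
    requirement of this API.
    """
    # Collapse newlines and repeated spaces so the instruction stays on one line.
    task = " ".join(task.split())
    if not task:
        raise ValueError("task description must not be empty")
    return f"Instruct: {task}\nQuery: {query}"


# Example task description (an assumption chosen for illustration).
task = "Given a web search query, retrieve relevant passages that answer the query"
instructed = format_instructed_query(
    task, "What is the recommended sugar intake for women"
)
print(instructed)
```

Only the query side carries the instruction in this sketch; the passage is left unchanged.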

See the original paper and model page for more details.

How to use the API

The model returns embeddings of both the query and passage, along with a relevancy score. The API input arguments are as follows:

  • query: Question / topic to be queried
  • passage: The passage / text for which a relevancy score is to be computed with respect to the query.
  • normalize: Whether to output normalized embeddings. Defaults to False.
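The three arguments above can be assembled into one input payload, as in the minimal sketch below. The helper names `build_input`, `l2_normalize`, and `cosine_similarity` are hypothetical. The sketch also assumes that "normalized" means L2 (unit-length) normalization, and it shows cosine similarity only as one common way to compare two embeddings; this page does not specify how the relevancy score is actually computed or what the output structure looks like.

```python
import math


def build_input(query: str, passage: str, normalize: bool = False) -> dict:
    """Assemble the documented arguments into a single input payload."""
    if not query.strip() or not passage.strip():
        raise ValueError("query and passage must be non-empty")
    return {"query": query, "passage": passage, "normalize": normalize}


def l2_normalize(vec):
    """Scale a vector to unit length (assumed meaning of normalize=True)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


payload = build_input(
    query="What is the recommended sugar intake for women",
    passage=(
        "As a general guideline, the CDC's average recommended sugar intake "
        "for women ages 19 to 70 is 20 grams per day."
    ),
)

# Sending the payload with the Replicate Python client might look like this;
# the version identifier is a placeholder and must be filled in:
#   import replicate
#   output = replicate.run("adirik/multilingual-e5-small:<version>", input=payload)
```

When embeddings are already unit length, the cosine similarity reduces to a plain dot product, which is one practical reason to request normalized output.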

Limitations

Using this model for inputs longer than 512 tokens is not recommended.
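One way to stay under the limit is to split long passages before sending them. The sketch below splits on whitespace with a fixed word budget, which is only an approximation: real token counts depend on the model's tokenizer, subword tokenization usually yields more tokens than words, and whitespace splitting does not work for languages written without spaces. The function name `split_by_word_budget` and the 200-word default are assumptions chosen to leave headroom, not values taken from the model documentation.

```python
from typing import List


def split_by_word_budget(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Word count is a rough stand-in for token count; use the model's own
    tokenizer when an exact limit matters.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


long_passage = "word " * 450
chunks = split_by_word_budget(long_passage)
print(len(chunks))  # number of chunks produced for a 450-word passage
```

Each chunk could then be submitted as a separate passage and the per-chunk results combined in whatever way suits the application.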

Citation

@article{wang2023improving,
  title={Improving Text Embeddings with Large Language Models},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2401.00368},
  year={2023}
}

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}