Introduction
Youtu-Embedding-V1 is a general-purpose text representation model, available through an API. It stands out from comparable models for the following reasons:
- Innovative: Youtu-Embedding-V1 is trained with CoDi, a proprietary framework developed by Tencent. Its unified data format, loss functions, and sampling strategies let the model converge well across multiple tasks with fewer parameters and at a lower training cost. For more details, please refer to our paper.
- Versatile: Youtu-Embedding-V1 demonstrates strong generalization across a range of tasks, including information retrieval, semantic textual similarity, natural language inference, text classification, and text clustering. Furthermore, we support custom instructions to adapt the model to specific downstream scenarios.
- Extensible: We plan to open-source the training framework used for Youtu-Embedding-V1 in the near future, so that you can adapt the framework to other types of tasks.
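The custom instructions mentioned under "Versatile" can be illustrated with a small sketch. It assumes that a custom instruction follows the same `Instruction: ... \nQuery: ` template as the retrieval instruction in the Usage section. The helper `build_instruction` is hypothetical and not part of the API.

```python
def build_instruction(task_description: str) -> str:
    # Hypothetical helper: wraps a task description in the template used by the
    # retrieval example in the Usage section. That this template also applies
    # to other tasks is an assumption, not something the API documents here.
    return f"Instruction: {task_description} \nQuery: "


# Task description copied from the retrieval example in the Usage section.
instruction = build_instruction(
    "Given a search query, retrieve web passages that answer the question"
)
print(repr(instruction))
```

The resulting string would be passed as the `Instruction` field of the request, in place of the hard-coded retrieval instruction.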
Performance
| Model | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|---|---|---|---|---|---|---|---|---|
| gte-Qwen2-1.5B-instruct | 67.12 | 67.79 | 72.53 | 54.61 | 79.50 | 68.21 | 71.86 | 60.05 |
| bge-multilingual-gemma2 | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| ritrieve_zh_v1 | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
| Conan-embedding-v2 | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Seed1.6-embedding | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
| QZhou-Embedding | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
| Youtu-Embedding-V1 | 77.46 | 78.74 | 78.04 | 79.67 | 89.69 | 73.85 | 80.95 | 70.28 |
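
As a reading aid for the table, the sketch below checks how the Mean(Type) column relates to the six per-type columns. It assumes Mean(Type) is the unweighted average of the Class., Clust., Pair Class., Rerank., Retr., and STS scores. The rows above are consistent with that reading up to rounding, but the table itself does not state the definition.

```python
# Per-type scores for Youtu-Embedding-V1, copied from the table above.
type_scores = {
    "Class.": 78.04,
    "Clust.": 79.67,
    "Pair Class.": 89.69,
    "Rerank.": 73.85,
    "Retr.": 80.95,
    "STS": 70.28,
}

# Assumed definition: unweighted mean over the six task types.
mean_type = sum(type_scores.values()) / len(type_scores)
reported_mean_type = 78.74  # Mean(Type) value reported in the table

# Any small gap is expected, because the inputs are already rounded.
print(f"recomputed={mean_type:.4f} reported={reported_mean_type:.2f}")
```

Mean(Task) cannot be recomputed this way, since it presumably averages over individual tasks rather than task types.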
Usage
API Integration Guide: Tencent Cloud API
import json
import os

from tencentcloud.common import credential
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.lkeap.v20240522 import lkeap_client, models

model_name = "youtu-embedding-v1"


def encode(client, inputs, is_query=False):
    # Queries carry a retrieval instruction; documents are embedded without one.
    if is_query:
        instruction = "Instruction: Given a search query, retrieve web passages that answer the question \nQuery: "
    else:
        instruction = ""
    params = {
        "Model": model_name,
        "Inputs": inputs,
        "Instruction": instruction
    }
    req = models.GetEmbeddingRequest()
    req.from_json_string(json.dumps(params))
    resp = client.GetEmbedding(req)
    resp = json.loads(resp.to_json_string())
    # One embedding vector is returned per input string.
    outputs = [item["Embedding"] for item in resp["Data"]]
    return outputs


# Read the API credentials from environment variables.
secret_id = os.getenv("TENCENTCLOUD_SECRET_ID")
secret_key = os.getenv("TENCENTCLOUD_SECRET_KEY")
cred = credential.Credential(secret_id, secret_key)

# Configure the endpoint and create the client.
httpProfile = HttpProfile()
httpProfile.endpoint = "lkeap.test.tencentcloudapi.com"
clientProfile = ClientProfile()
clientProfile.httpProfile = httpProfile
client = lkeap_client.LkeapClient(cred, "ap-guangzhou", clientProfile)

inputs = ["Regular exercise is the key to staying healthy."]
try:
    embeddings = encode(client, inputs, is_query=False)
    print(embeddings)
except TencentCloudSDKException as err:
    # Report SDK-level failures such as authentication or network errors.
    print(err)
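
Once embeddings are returned, a common next step is to rank passages against a query by similarity. The sketch below is illustrative and not part of the API: `cosine_similarity` and `rank_passages` are hypothetical helpers, cosine similarity is assumed as the comparison metric, and the three-dimensional vectors are made-up stand-ins for the lists that `encode` returns.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: List[float], passage_vecs: List[List[float]]
) -> List[Tuple[int, float]]:
    # Return (passage index, score) pairs, most similar first.
    scores = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; real embeddings are much longer.
query_embedding = [0.1, 0.9, 0.2]
passage_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
]
ranking = rank_passages(query_embedding, passage_embeddings)
print(ranking)
```

With the API, `query_embedding` would come from `encode(client, [query], is_query=True)[0]` and `passage_embeddings` from `encode(client, passages, is_query=False)`.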