Youtu-Embedding-V1

Introduction

Youtu-Embedding-V1 is a general-purpose text representation (embedding) model. It is available through an API and is distinguished by the following:

Innovative: Youtu-Embedding-V1 is trained with CoDi, a proprietary framework developed by Tencent. Our innovations in unified formatting, loss functions, and sampling strategies allow the model to converge on multiple tasks at once, with a smaller parameter size and at a lower training cost. For more details, please refer to our paper.
Versatile: Youtu-Embedding-V1 demonstrates strong generalization across a range of tasks, including information retrieval, semantic textual similarity, natural language inference, text classification, and text clustering. Furthermore, we support custom instructions to adapt the model to specific downstream scenarios.
Extensible: We plan to open-source the training framework used for Youtu-Embedding-V1 in the near future, which will allow you to adapt it to other types of tasks.
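
The custom-instruction support mentioned above can be sketched as follows. The helper `build_instruction` is hypothetical and not part of any SDK. It assumes that custom instructions follow the same "Instruction: ... Query: " layout as the retrieval prompt in the Usage section, and the second task description is an illustrative assumption rather than an officially recommended prompt.

```python
def build_instruction(task_description: str) -> str:
    # Assumption: a custom instruction uses the same prefix layout as the
    # retrieval prompt shown in the Usage section of this document.
    return f"Instruction: {task_description.strip()} \nQuery: "


# The retrieval prompt used in the Usage section.
retrieval_instruction = build_instruction(
    "Given a search query, retrieve web passages that answer the question"
)

# Illustrative only: a hypothetical prompt for a similarity task.
similarity_instruction = build_instruction("Retrieve semantically similar text")

print(repr(retrieval_instruction))
print(repr(similarity_instruction))
```

The resulting string would be passed as the `Instruction` field of the request, in place of the built-in retrieval prompt.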

Performance

C-MTEB
Model	Mean(Task)	Mean(Type)	Class.	Clust.	Pair Class.	Rerank.	Retr.	STS
gte-Qwen2-1.5B-instruct	67.12	67.79	72.53	54.61	79.50	68.21	71.86	60.05
bge-multilingual-gemma2	67.64	68.52	75.31	59.30	86.67	68.28	73.73	55.19
ritrieve_zh_v1	72.71	73.85	76.88	66.50	85.98	72.86	76.97	63.92
Qwen3-Embedding-4B	72.27	73.51	75.46	77.89	83.34	66.05	77.03	61.26
Qwen3-Embedding-8B	73.84	75.00	76.97	80.08	84.23	66.99	78.21	63.53
Conan-embedding-v2	74.24	75.99	76.47	68.84	92.44	74.41	78.31	65.48
Seed1.6-embedding	75.63	76.68	77.98	73.11	88.71	71.65	79.69	68.94
QZhou-Embedding	76.99	78.58	79.99	70.91	95.07	74.85	78.80	71.89
Youtu-Embedding-V1	77.46	78.74	78.04	79.67	89.69	73.85	80.95	70.28
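
As a reading aid for the table, the short sketch below recomputes Mean(Type) for two rows. It assumes that Mean(Type) is the unweighted average of the six per-type columns, which is how the MTEB leaderboard usually defines it; this document does not state the definition explicitly. Mean(Task) averages over individual tasks and cannot be recomputed from these columns alone.

```python
# Per-type C-MTEB scores copied from the table above, in column order:
# Class., Clust., Pair Class., Rerank., Retr., STS
per_type_scores = {
    "QZhou-Embedding": [79.99, 70.91, 95.07, 74.85, 78.80, 71.89],
    "Youtu-Embedding-V1": [78.04, 79.67, 89.69, 73.85, 80.95, 70.28],
}

# Mean(Type) values as reported in the table.
reported_mean_type = {
    "QZhou-Embedding": 78.58,
    "Youtu-Embedding-V1": 78.74,
}


def mean_type(scores):
    # Assumption: unweighted average over the six task types.
    return sum(scores) / len(scores)


for model, scores in per_type_scores.items():
    computed = mean_type(scores)
    gap = abs(computed - reported_mean_type[model])
    print(f"{model}: computed {computed:.3f}, reported {reported_mean_type[model]:.2f}, gap {gap:.3f}")
```

Under this assumption the recomputed values match the reported ones to within about 0.01, which is consistent with the per-type columns themselves being rounded to two decimals.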

Usage

API Integration Guide: Tencent Cloud API


import json
import os

from tencentcloud.common import credential
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.lkeap.v20240522 import lkeap_client, models

MODEL_NAME = "youtu-embedding-v1"
# Instruction sent with search queries; documents are encoded without one.
QUERY_INSTRUCTION = "Instruction: Given a search query, retrieve web passages that answer the question \nQuery: "


def encode(client, inputs, is_query=False, model_name=MODEL_NAME):
    """Return one embedding (a list of floats) per input string."""
    params = {
        "Model": model_name,
        "Inputs": inputs,
        "Instruction": QUERY_INSTRUCTION if is_query else "",
    }

    req = models.GetEmbeddingRequest()
    req.from_json_string(json.dumps(params))

    resp = json.loads(client.GetEmbedding(req).to_json_string())
    return [item["Embedding"] for item in resp["Data"]]


# Read credentials from the environment rather than hard-coding them.
secret_id = os.getenv("TENCENTCLOUD_SECRET_ID")
secret_key = os.getenv("TENCENTCLOUD_SECRET_KEY")
cred = credential.Credential(secret_id, secret_key)

http_profile = HttpProfile()
http_profile.endpoint = "lkeap.test.tencentcloudapi.com"

client_profile = ClientProfile()
client_profile.httpProfile = http_profile
client = lkeap_client.LkeapClient(cred, "ap-guangzhou", client_profile)

inputs = ["Regular exercise is the key to staying healthy."]
try:
    embeddings = encode(client, inputs, is_query=False)
    print(embeddings)
except TencentCloudSDKException as err:
    # Raised for authentication, network, and API-level errors.
    print(err)
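
Once queries and passages are embedded, a common next step for retrieval is to rank passages by cosine similarity to the query. The sketch below is a minimal illustration of that step, not part of the Tencent Cloud SDK: `cosine_similarity` and `rank_documents` are hypothetical helpers, and the 3-dimensional vectors are toy stand-ins for real embeddings. Whether the API returns normalized vectors is not stated here, so the helper normalizes explicitly.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_embedding, doc_embeddings):
    # Return (document index, score) pairs, best match first.
    scored = [
        (index, cosine_similarity(query_embedding, doc))
        for index, doc in enumerate(doc_embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
]

ranking = rank_documents(query_vec, doc_vecs)
print(ranking[0][0])  # → 1
```

With the API, the query vector would come from `encode(client, [query], is_query=True)[0]` and the document vectors from `encode(client, docs, is_query=False)`.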
