Use Pipelines in LlamaIndex

Zilliz Cloud Pipelines is a scalable API service for retrieval. You can use Zilliz Cloud Pipelines as a managed index in LlamaIndex. The service converts documents into embedding vectors and stores them in Zilliz Cloud for efficient semantic search.

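📘 Notes

Zilliz Cloud Pipelines is being phased out and will stop service at the end of Q2 2025. It is replaced by a new feature called "Data In, Data Out", which is designed to simplify vectorization in Milvus and Zilliz Cloud. As of January 10, 2025, Zilliz Cloud Pipelines no longer accepts new user registrations. Existing users can keep using the service within the monthly free-trial quota of RMB 100 until the shutdown date. The service comes with no SLA. We recommend generating vectors with a model provider's embedding API or an open-source model instead.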

Prerequisites

Before you start:

Install the LlamaIndex Python SDK
```
pip install llama-index
```

Configure credentials for your OpenAI and Zilliz Cloud accounts

from getpass import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key:")

ZILLIZ_PROJECT_ID = getpass("Enter your Zilliz Project ID:")
ZILLIZ_CLUSTER_ID = getpass("Enter your Zilliz Cluster ID:")
ZILLIZ_TOKEN = getpass("Enter your Zilliz API Key:")
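
For non-interactive runs (CI jobs, scheduled scripts), prompting with getpass is inconvenient. The following is a minimal sketch that reads a credential from an environment variable first and only falls back to a prompt when the variable is unset. The helper name `read_secret` and the environment variable names you pass to it are illustrative assumptions, not part of any SDK.

```python
import os
from getpass import getpass


def read_secret(env_name: str, prompt: str) -> str:
    """Return the value of env_name if it is set, otherwise prompt for it."""
    value = os.environ.get(env_name, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt that does not echo the input.
    return getpass(prompt)
```

For example, `ZILLIZ_TOKEN = read_secret("ZILLIZ_TOKEN", "Enter your Zilliz API Key:")` keeps the interactive behavior above while letting automated jobs supply the value through the environment.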

📘 Notes

Get an OpenAI API key

Get Zilliz Cloud credentials

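Index documents

Zilliz Cloud Pipelines supports files stored in Alibaba Cloud OSS and Tencent Cloud COS object storage. Generate a pre-signed URL from the object storage and upload the file with from_document_url() or insert_doc_url(). The service then indexes the document automatically and stores its chunks as vectors in Zilliz Cloud.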

from llama_index.indices import ZillizCloudPipelineIndex

zcp_index = ZillizCloudPipelineIndex.from_document_url(
    # a public or pre-signed url of a file stored on cloud object storage
    url="https://publicdataset.cloud.zilliz.com.cn/milvus_doc.md",
    project_id=ZILLIZ_PROJECT_ID,
    cluster_id=ZILLIZ_CLUSTER_ID,
    token=ZILLIZ_TOKEN,
    # optional
    metadata={"version": "2.3"},  # used for filtering
    collection_name="zcp_llamalection",  # change this value to use a custom collection name
)

# Insert more docs, e.g. a Milvus v2.2 document
zcp_index.insert_doc_url(
    url="https://publicdataset.cloud.zilliz.com.cn/milvus_doc_22.md",
    metadata={"version": "2.2"},
)

# Output
# {'token_usage': 984, 'doc_name': 'milvus_doc_22.md', 'num_chunks': 7}

# # Delete docs by doc name
# zcp_index.delete_by_doc_name(doc_name="milvus_doc_22.md")
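
The sample output above reports `doc_name` as `milvus_doc_22.md`, which matches the last path segment of the ingested URL. Assuming that convention holds (it is inferred from the sample output, not from a documented guarantee), a small helper can derive the name to pass to `delete_by_doc_name()`, including for pre-signed URLs that carry query parameters. `doc_name_from_url` is a hypothetical helper introduced here for illustration.

```python
from posixpath import basename
from urllib.parse import unquote, urlparse


def doc_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlparse(url).path
    name = unquote(basename(path))
    if not name:
        raise ValueError(f"URL has no file name: {url!r}")
    return name


print(doc_name_from_url("https://publicdataset.cloud.zilliz.com.cn/milvus_doc_22.md"))
```

Verify the returned name against the `doc_name` in the ingestion output before using it for deletion.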

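📘 Notes

If no Zilliz Cloud Pipelines exist yet, the code above creates them automatically.
Adding metadata to each document is optional. Metadata can be used to filter document chunks at retrieval time.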

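Use the Pipeline as a query engine

For semantic search with ZillizCloudPipelineIndex, call as_query_engine() to use the index as a query engine. It accepts the following parameters:

search_top_k: The number of text nodes/chunks to retrieve. Defaults to DEFAULT_SIMILARITY_TOP_K (2).
filters: The metadata filters. Defaults to None.
output_metadata: The names of the metadata fields to return with the retrieved text nodes. Defaults to [].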

from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

query_engine_milvus23 = zcp_index.as_query_engine(
    search_top_k=3,
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="version", value="2.3")
        ]  # version == "2.3"
    ),
    output_metadata=["version"],
)

The semantic search or retrieval-augmented generation (RAG) engine for the Milvus 2.3 documentation is now ready.
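
To make the interplay of `filters` and `search_top_k` concrete, here is a plain-Python sketch of the intended semantics: keep only the chunks whose metadata matches exactly, then return the k highest-scoring ones. This is a conceptual illustration with made-up texts and scores, not a description of how Zilliz Cloud executes the search internally; `filtered_top_k` is a hypothetical function.

```python
from typing import Any, Dict, List, Tuple

Chunk = Tuple[str, Dict[str, Any], float]  # (text, metadata, similarity score)


def filtered_top_k(chunks: List[Chunk], key: str, value: Any, k: int) -> List[Chunk]:
    """Keep chunks whose metadata[key] == value, then return the k best by score."""
    matching = [c for c in chunks if c[1].get(key) == value]
    return sorted(matching, key=lambda c: c[2], reverse=True)[:k]


# Made-up data for illustration only.
chunks = [
    ("Delete entities (2.3)", {"version": "2.3"}, 0.73),
    ("Delete entities (2.2)", {"version": "2.2"}, 0.80),
    ("Boolean expressions (2.3)", {"version": "2.3"}, 0.69),
]
top = filtered_top_k(chunks, key="version", value="2.3", k=3)
print([text for text, _, _ in top])
```

Note that the 2.2 chunk is excluded even though it has the highest score, because the exact-match filter is applied regardless of similarity.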

Retrieval

The following snippet shows how to run a semantic search with Zilliz Cloud Pipelines.

question = "Can users delete entities by filtering non-primary fields?"
retrieved_nodes = query_engine_milvus23.retrieve(question)
print(retrieved_nodes)

# Output
# [NodeWithScore(node=TextNode(id_='447198459513870883', embedding=None, metadata={'version': '2.3'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\nThis topic describes how to delete entities in Milvus.  \nMilvus supports deleting entities by primary key or complex boolean expressions. Deleting entities by primary key is much faster and lighter than deleting them by complex boolean expressions. This is because Milvus executes queries first when deleting data by complex boolean expressions.  \nDeleted entities can still be retrieved immediately after the deletion if the consistency level is set lower than Strong.\nEntities deleted beyond the pre-specified span of time for Time Travel cannot be retrieved again.\nFrequent deletion operations will impact the system performance.  \nBefore deleting entities by comlpex boolean expressions, make sure the collection has been loaded.\nDeleting entities by complex boolean expressions is not an atomic operation. Therefore, if it fails halfway through, some data may still be deleted.\nDeleting entities by complex boolean expressions is supported only when the consistency is set to Bounded. For details, see Consistency.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.728226900100708), NodeWithScore(node=TextNode(id_='447198459513870886', embedding=None, metadata={'version': '2.3'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\n## Prepare boolean expression\n### Complex boolean expression\nTo filter entities that meet specific conditions, define complex boolean expressions.  
\nFilter entities whose word_count is greater than or equal to 11000:  \n```python\nexpr = "word_count >= 11000"\n```  \nFilter entities whose book_name is not Unknown:  \n```python\nexpr = "book_name != Unknown"\n```  \nFilter entities whose primary key values are greater than 5 and word_count is smaller than or equal to 9999:  \n```python\nexpr = "book_id > 5 && word_count <= 9999"\n```', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.687866747379303), NodeWithScore(node=TextNode(id_='447198459513870884', embedding=None, metadata={'version': '2.3'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\n## Prepare boolean expression\nPrepare the boolean expression that filters the entities to delete.  \nMilvus supports deleting entities by primary key or complex boolean expressions. For more information on expression rules and supported operators, see Boolean Expression Rules.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.6814976334571838)]

Because of the filter, the query engine retrieves only text nodes whose version metadata is "2.3".
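
The raw list is hard to scan. The sketch below prints one line per result and relies only on the attributes visible in the output above (`score`, `node.text`, `node.metadata`). `summarize_nodes` is a hypothetical helper, and the stand-in object only mimics that attribute layout so the snippet runs without a cluster.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def summarize_nodes(nodes: Iterable[Any], width: int = 60) -> List[str]:
    """Format each retrieved node as 'score | metadata | first line of text'."""
    lines = []
    for item in nodes:
        text = item.node.text.strip()
        first_line = text.splitlines()[0] if text else ""
        lines.append(f"{item.score:.3f} | {item.node.metadata} | {first_line[:width]}")
    return lines


# Stand-in with the same attribute layout as NodeWithScore in the output above.
fake = SimpleNamespace(
    score=0.728226900100708,
    node=SimpleNamespace(
        text="# Delete Entities\nThis topic describes how to delete entities in Milvus.",
        metadata={"version": "2.3"},
    ),
)
print("\n".join(summarize_nodes([fake])))
```

In practice you would pass `retrieved_nodes` from the retrieval call instead of the stand-in.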

Query

The following snippet shows how to use the query engine as a RAG agent backed by Zilliz Cloud Pipelines and an OpenAI large language model.

response = query_engine_milvus23.query(question)
print(response.response)

# Output
# Yes, users can delete entities by filtering non-primary fields using complex boolean expressions in Milvus. The complex boolean expressions allow users to define specific conditions to filter entities based on non-primary fields, such as word_count or book_name. By specifying the desired conditions in the boolean expression, users can delete entities that meet those conditions. However, it is important to note that deleting entities by complex boolean expressions is not an atomic operation, and if it fails halfway through, some data may still be deleted.

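Advanced use cases

You can obtain a managed index without ingesting any data. To connect to existing Zilliz Cloud Pipelines, provide either the pipeline IDs or the associated collection name:

Pipeline IDs

A dictionary containing the IDs of the INGESTION, SEARCH, and DELETION pipelines. For example: {"INGESTION": "pipe-xx1", "SEARCH": "pipe-xx2", "DELETION": "pipe-xx3"}
Collection name

The collection name defaults to zcp_llamalection. If no pipeline IDs are provided, the index tries to find the pipelines associated with that collection name.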
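
Because the index expects all three pipeline types, it can help to validate the dictionary before handing it over. The following is a minimal sketch; `validate_pipeline_ids` is a hypothetical helper, and the `pipe-` prefix check is an assumption based on the example IDs shown above rather than a documented format.

```python
from typing import Dict

REQUIRED_PIPELINE_TYPES = ("INGESTION", "SEARCH", "DELETION")


def validate_pipeline_ids(pipeline_ids: Dict[str, str]) -> Dict[str, str]:
    """Raise ValueError if a required pipeline type is missing or malformed."""
    missing = [t for t in REQUIRED_PIPELINE_TYPES if t not in pipeline_ids]
    if missing:
        raise ValueError(f"Missing pipeline IDs for: {', '.join(missing)}")
    for pipeline_type in REQUIRED_PIPELINE_TYPES:
        pipeline_id = pipeline_ids[pipeline_type]
        # Assumption: IDs look like "pipe-..." as in the example above.
        if not isinstance(pipeline_id, str) or not pipeline_id.startswith("pipe-"):
            raise ValueError(f"Unexpected {pipeline_type} pipeline ID: {pipeline_id!r}")
    return pipeline_ids


ids = validate_pipeline_ids(
    {"INGESTION": "pipe-xx1", "SEARCH": "pipe-xx2", "DELETION": "pipe-xx3"}
)
print(ids["SEARCH"])
```

Failing early with a clear message is usually easier to debug than an error returned by the service at query time.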

from llama_index.indices import ZillizCloudPipelineIndex

advanced_zcp_index = ZillizCloudPipelineIndex(
    project_id=ZILLIZ_PROJECT_ID,
    cluster_id=ZILLIZ_CLUSTER_ID,
    token=ZILLIZ_TOKEN,
    collection_name="zcp_llamalection_advanced",
)

# Output
# No available pipelines. Please create pipelines first.

Customize Pipelines

If no pipelines are provided or found, you can create and customize them manually with the following optional parameters:

metadata_schema: A dictionary that defines the metadata schema, with field names as keys and data types as values. For example: {"user_id": "VarChar"}
chunkSize: An integer chunk size, measured in tokens. If it is not specified, Zilliz Cloud Pipelines splits documents using its built-in default chunk size of 500 tokens.

For other available pipeline parameters, refer to Zilliz Cloud Pipelines.

advanced_zcp_index.create_pipelines(
    metadata_schema={"user_id": "VarChar"},
    chunkSize=350,
    # other pipeline params
)

# Output
# {'INGESTION': 'pipe-***********************',
#  'SEARCH': 'pipe-***********************',
#  'DELETION': 'pipe-***********************'}
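
To build intuition for what `chunkSize` controls, the sketch below splits text into chunks of at most N tokens. It uses whitespace-separated words as a stand-in for tokens. The actual tokenizer and splitting strategy used by Zilliz Cloud Pipelines are not described in this guide and will likely differ, so treat the chunk counts as illustrative only. `chunk_by_tokens` is a hypothetical function.

```python
from typing import List


def chunk_by_tokens(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# A synthetic 1000-token document: compare chunk sizes 500 and 350.
sample = " ".join(f"w{i}" for i in range(1000))
print(len(chunk_by_tokens(sample, 500)), len(chunk_by_tokens(sample, 350)))
```

A smaller chunk size yields more, shorter chunks, which tends to make retrieval more fine-grained at the cost of less context per chunk.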

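Multi-tenancy

By storing a tenant-specific value (for example, a user ID) as metadata, the managed index can support multi-tenancy through metadata filters.

Specify the metadata value at ingestion time so that each document is tagged with the tenant-specific field.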

advanced_zcp_index.insert_doc_url(
    url="https://publicdataset.cloud.zilliz.com.cn/milvus_doc.md",
    metadata={"user_id": "user_001"},
)

# Output
# {'token_usage': 1247, 'doc_name': 'milvus_doc.md', 'num_chunks': 10}

The managed index can then build a query engine for each tenant by filtering on the tenant-specific field.

from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

query_engine_for_user_001 = advanced_zcp_index.as_query_engine(
    search_top_k=3,
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="user_id", value="user_001")]
    ),
    output_metadata=["user_id"],  # optional, display user_id in outputs
)

Change the filters to build query engines with different conditions.

question = "Can I delete entities by filtering non-primary fields?"

# search_results = query_engine_for_user_001.retrieve(question)
response = query_engine_for_user_001.query(question)
print(response.response)

# Output
# Yes, you can delete entities by filtering non-primary fields. Milvus supports deleting entities by complex boolean expressions, which allows you to filter entities based on specific conditions on non-primary fields. You can define complex boolean expressions using operators such as greater than or equal to, not equal to, and logical operators like AND and OR. By using these expressions, you can filter entities based on the values of non-primary fields and delete them accordingly.
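
When serving many tenants, rebuilding a query engine on every request is unnecessary. The following is a minimal sketch of a per-tenant cache. `TenantEngineCache` is a hypothetical helper, and the builder is injected as a callable so that the snippet runs without a cluster; the stand-in builder below just returns a string.

```python
from typing import Any, Callable, Dict, List


class TenantEngineCache:
    """Lazily builds and reuses one query engine per tenant."""

    def __init__(self, build_engine: Callable[[str], Any]) -> None:
        self._build_engine = build_engine
        self._engines: Dict[str, Any] = {}

    def get(self, user_id: str) -> Any:
        # Build the engine on first use, then reuse it for later requests.
        if user_id not in self._engines:
            self._engines[user_id] = self._build_engine(user_id)
        return self._engines[user_id]


# Stand-in builder that records which tenants triggered a build.
built: List[str] = []


def fake_builder(user_id: str) -> str:
    built.append(user_id)
    return f"engine-for-{user_id}"


cache = TenantEngineCache(fake_builder)
cache.get("user_001")
cache.get("user_001")
print(built)
```

With the index from this guide, the builder could be `lambda uid: advanced_zcp_index.as_query_engine(search_top_k=3, filters=MetadataFilters(filters=[ExactMatchFilter(key="user_id", value=uid)]))`, so that each tenant's engine only sees documents tagged with its own `user_id`.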
