
Struct Array
Public Preview

A Struct Array is an Array that stores, in order, elements of the Struct data type. Every Struct in the Array shares the same schema, which can contain multiple vector and scalar fields.

Below is an entity record extracted from a collection that contains a Struct Array field.

{
    'id': 0,
    'title': 'Walden',
    'title_vector': [0.1, 0.2, 0.3, 0.4, 0.5],
    'author': 'Henry David Thoreau',
    'year_of_publication': 1845,
    // highlight-start
    'chunks': [
        {
            'text': 'When I wrote the following pages, or rather the bulk of them...',
            'text_vector': [0.3, 0.2, 0.3, 0.2, 0.5],
            'chapter': 'Economy',
        },
        {
            'text': 'I would fain say something, not so much concerning the Chinese and...',
            'text_vector': [0.7, 0.4, 0.2, 0.7, 0.8],
            'chapter': 'Economy'
        }
    ]
    // highlight-end
}

In this example, chunks is a Struct Array field. Each Struct element in it consists of the same fields: text, text_vector, and chapter.

Limits

  • Data types

    When creating a collection, you can set the element data type of an Array field to Struct. However, you cannot add a Struct Array field to an existing collection. In addition, Zilliz Cloud does not support defining a collection field directly as the Struct type.

    All Struct elements in the Array share the same schema, which you need to define before creating the Array field.

    A Struct can contain both vector and scalar fields. Currently, the following field types can be used in a Struct schema definition:

    Field type    Data type
    Vector        FLOAT_VECTOR
    Scalar        VARCHAR, INT8/16/32/64, FLOAT, DOUBLE, BOOLEAN

    Make sure that the total number of vector fields defined in the collection schema and the Struct schema does not exceed the maximum supported by your cluster. For more information, refer to Limits.

  • Nullable and default values

    A Struct Array field cannot be null and does not accept default values.

  • Function

    Using a Function to derive vector fields from scalar fields is not supported.

  • Index and metric types

    All vector fields in a collection must be indexed. For vector fields inside a Struct Array field, Zilliz Cloud uses an EmbeddingList to group the vectors of the same field across the Struct elements and builds an index for each EmbeddingList.

    You can use HNSW as the index type and pair it with one of the metric types listed below when indexing an EmbeddingList (see the sketch after this list):

    • MAX_SIM_COSINE, MAX_SIM_IP, MAX_SIM_L2: for EmbeddingLists of the FLOAT_VECTOR, FLOAT16_VECTOR, BFLOAT16_VECTOR, and INT8_VECTOR types.

    • MAX_SIM_HAMMING, MAX_SIM_JACCARD: for EmbeddingLists of the BINARY_VECTOR type.

    Scalar fields in a Struct Array do not yet support indexing.

  • Upsert

    Updating individual fields of Struct elements with Upsert in merge mode is not supported. You can still update a Struct Array field as a whole in overwrite mode.

  • Scalar filtering

    Filter expressions on scalar fields inside Struct elements are not supported in Search and Query requests.
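
As referenced in the index bullet above, here is a minimal sketch of index parameters for an EmbeddingList that explicitly uses HNSW with MAX_SIM_COSINE instead of AUTOINDEX. The field name chunks[text_vector] matches the example collection defined later on this page, and the M and efConstruction values are illustrative assumptions, not recommended settings.

from pymilvus import MilvusClient

# A sketch only: index the EmbeddingList built from the `text_vector`
# subfield of the `chunks` Struct Array field with HNSW + MAX_SIM_COSINE.
index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    field_name="chunks[text_vector]",        # Struct Array field + vector subfield
    index_type="HNSW",                       # index type from the list above
    metric_type="MAX_SIM_COSINE",            # metric type for EmbeddingList similarity
    params={"M": 16, "efConstruction": 200}  # assumed HNSW build parameters
)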

Add a Struct Array field

To create a Struct Array field in a Zilliz Cloud cluster, you first create an Array-type field in the collection and set its element data type to Struct. The overall procedure is as follows:

  1. When adding the Array field to the collection schema, set the field's data type to DataType.ARRAY.

  2. Set the field's element_type attribute to DataType.STRUCT.

  3. Create a schema for the Struct elements and add the required fields to it. Then reference that schema in the struct_schema attribute of the Struct Array field.

  4. Set the max_capacity attribute of the Struct Array field to specify the maximum number of Struct elements allowed in each entity.

  5. (Optional) You can also set the mmap.enabled attribute on fields inside the Struct to balance hot and cold data storage.

The following example shows how to define a collection schema that contains a Struct Array field.

from pymilvus import MilvusClient, DataType

schema = MilvusClient.create_schema()

# add the primary field to the collection
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)

# add some scalar fields to the collection
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="author", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="year_of_publication", datatype=DataType.INT64)

# add a vector field to the collection
schema.add_field(field_name="title_vector", datatype=DataType.FLOAT_VECTOR, dim=5)

# highlight-start
# Create a struct schema
struct_schema = MilvusClient.create_struct_field_schema()

# add a scalar field to the struct
struct_schema.add_field("text", DataType.VARCHAR, max_length=65535)
struct_schema.add_field("chapter", DataType.VARCHAR, max_length=512)

# add a vector field to the struct with mmap enabled
struct_schema.add_field("text_vector", DataType.FLOAT_VECTOR, mmap_enabled=True, dim=5)

# reference the struct schema in an Array field with its
# element type set to `DataType.STRUCT`
schema.add_field("chunks", datatype=DataType.ARRAY, element_type=DataType.STRUCT,
struct_schema=struct_schema, max_capacity=1000)
# highlight-end

The highlighted part above shows how to add a Struct Array field to a collection schema.

Set index parameters

You must create an index for every vector field, whether it is defined in the collection schema or inside the Struct elements of a Struct Array field.

To build an index on an EmbeddingList, set its index type to HNSW (the example below uses AUTOINDEX to let Zilliz Cloud pick a suitable index automatically) and use MAX_SIM_COSINE as the metric type so that Zilliz Cloud clusters can measure the similarity between two EmbeddingLists.

# Create index parameters
index_params = MilvusClient.prepare_index_params()

# Create an index for the vector field in the collection
index_params.add_index(
    field_name="title_vector",
    index_type="AUTOINDEX",
    metric_type="L2",
)

# highlight-start
# Create an index for the vector field in the element Struct
index_params.add_index(
    field_name="chunks[text_vector]",
    index_type="AUTOINDEX",
    metric_type="MAX_SIM_COSINE",
)
# highlight-end

Create a collection

Once the schema and index parameters are ready, you can create a collection with a Struct Array field.

client = MilvusClient(
    uri="YOUR_CLUSTER_ENDPOINT",
    token="YOUR_CLUSTER_TOKEN"
)

client.create_collection(
    collection_name="my_collection",
    schema=schema,
    index_params=index_params
)
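
The collection now contains the chunks Struct Array field. If you want to double-check the schema, the following is a minimal sketch using describe_collection(); the exact layout of the returned dictionary may vary between releases.

# A sketch only: inspect the collection and confirm that the `chunks` field exists.
desc = client.describe_collection(collection_name="my_collection")
for field in desc.get("fields", []):
    print(field.get("name"), field.get("type"))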

Insert data

After the collection is created, you can insert data into it with code similar to the following.

# Sample data
data = {
    'title': 'Walden',
    'title_vector': [0.1, 0.2, 0.3, 0.4, 0.5],
    'author': 'Henry David Thoreau',
    'year_of_publication': 1845,
    'chunks': [
        {
            'text': 'When I wrote the following pages, or rather the bulk of them...',
            'text_vector': [0.3, 0.2, 0.3, 0.2, 0.5],
            'chapter': 'Economy',
        },
        {
            'text': 'I would fain say something, not so much concerning the Chinese and...',
            'text_vector': [0.7, 0.4, 0.2, 0.7, 0.8],
            'chapter': 'Economy'
        }
    ]
}

# insert data
client.insert(
    collection_name="my_collection",
    data=[data]
)
Need more data?
import json
import random
from typing import List, Dict, Any

# Real classic books (title, author, year)
BOOKS = [
    ("Pride and Prejudice", "Jane Austen", 1813),
    ("Moby Dick", "Herman Melville", 1851),
    ("Frankenstein", "Mary Shelley", 1818),
    ("The Picture of Dorian Gray", "Oscar Wilde", 1890),
    ("Dracula", "Bram Stoker", 1897),
    ("The Adventures of Sherlock Holmes", "Arthur Conan Doyle", 1892),
    ("Alice's Adventures in Wonderland", "Lewis Carroll", 1865),
    ("The Time Machine", "H.G. Wells", 1895),
    ("The Scarlet Letter", "Nathaniel Hawthorne", 1850),
    ("Leaves of Grass", "Walt Whitman", 1855),
    ("The Brothers Karamazov", "Fyodor Dostoevsky", 1880),
    ("Crime and Punishment", "Fyodor Dostoevsky", 1866),
    ("Anna Karenina", "Leo Tolstoy", 1877),
    ("War and Peace", "Leo Tolstoy", 1869),
    ("Great Expectations", "Charles Dickens", 1861),
    ("Oliver Twist", "Charles Dickens", 1837),
    ("Wuthering Heights", "Emily Brontë", 1847),
    ("Jane Eyre", "Charlotte Brontë", 1847),
    ("The Call of the Wild", "Jack London", 1903),
    ("The Jungle Book", "Rudyard Kipling", 1894),
]

# Common chapter names for classics
CHAPTERS = [
    "Introduction", "Prologue", "Chapter I", "Chapter II", "Chapter III",
    "Chapter IV", "Chapter V", "Chapter VI", "Chapter VII", "Chapter VIII",
    "Chapter IX", "Chapter X", "Epilogue", "Conclusion", "Afterword",
    "Economy", "Where I Lived", "Reading", "Sounds", "Solitude",
    "Visitors", "The Bean-Field", "The Village", "The Ponds", "Baker Farm"
]

# Placeholder text snippets (mimicking 19th-century prose)
TEXT_SNIPPETS = [
    "When I wrote the following pages, or rather the bulk of them...",
    "I would fain say something, not so much concerning the Chinese and...",
    "It is a truth universally acknowledged, that a single man in possession...",
    "Call me Ishmael. Some years ago—never mind how long precisely...",
    "It was the best of times, it was the worst of times...",
    "All happy families are alike; each unhappy family is unhappy in its own way.",
    "Whether I shall turn out to be the hero of my own life, or whether that station...",
    "You will rejoice to hear that no disaster has accompanied the commencement...",
    "The world is too much with us; late and soon, getting and spending...",
    "He was an old man who fished alone in a skiff in the Gulf Stream..."
]

def random_vector() -> List[float]:
    return [round(random.random(), 1) for _ in range(5)]

def generate_chunk() -> Dict[str, Any]:
    return {
        "text": random.choice(TEXT_SNIPPETS),
        "text_vector": random_vector(),
        "chapter": random.choice(CHAPTERS)
    }

def generate_record(record_id: int) -> Dict[str, Any]:
    title, author, year = random.choice(BOOKS)
    num_chunks = random.randint(1, 5)  # 1 to 5 chunks per book
    chunks = [generate_chunk() for _ in range(num_chunks)]
    return {
        "title": title,
        "title_vector": random_vector(),
        "author": author,
        "year_of_publication": year,
        "chunks": chunks
    }

# Generate 1000 records
data = [generate_record(i) for i in range(1000)]

# Insert the generated data
client.insert(collection_name="my_collection", data=data)
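
If you want to confirm how many entities were inserted, a quick sketch using get_collection_stats() is shown below. Note that the reported row count may lag behind recent inserts until the data is fully persisted.

# A sketch only: check the approximate number of entities in the collection
stats = client.get_collection_stats(collection_name="my_collection")
print(stats)  # e.g. {'row_count': ...}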

Run a vector search on a Struct Array field

You can run a similarity search on a vector field inside a Struct Array field in the same way as on a vector field defined directly in the collection.

Note that when specifying the vector field name (anns_field) in a search request, you need to concatenate the name of the Struct Array field and the name of the target vector field inside the Struct elements, as shown in the code below, and organize the query vectors with EmbeddingList objects.

📘 Note

Zilliz Cloud provides the EmbeddingList object to help you organize query vectors when running a similarity search against the EmbeddingLists of a Struct Array field. Each EmbeddingList must contain at least one vector and returns one set of results.

EmbeddingList can be used only in search() requests. It does not support Range Search or Grouping Search, nor can it be used with search_iterator().

from pymilvus.client.embedding_list import EmbeddingList

# each query embedding list triggers a single search
embeddingList1 = EmbeddingList()
embeddingList1.add([0.2, 0.9, 0.4, -0.3, 0.2])

embeddingList2 = EmbeddingList()
embeddingList2.add([-0.2, -0.2, 0.5, 0.6, 0.9])
embeddingList2.add([-0.4, 0.3, 0.5, 0.8, 0.2])

# a search with a single embedding list
results = client.search(
    collection_name="my_collection",
    data=[embeddingList1],
    anns_field="chunks[text_vector]",
    search_params={"metric_type": "MAX_SIM_COSINE"},
    limit=3,
    output_fields=["chunks[text]"]
)

In the code above, chunks[text_vector] refers to the text_vector field inside the Struct elements. You can use this format for both anns_field and output_fields.
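
As an aside, the sketch below varies the request above to mix a Struct subfield with top-level collection fields in output_fields; this assumes the two kinds of fields can be combined freely, which has not been verified here.

# A sketch only: return a Struct subfield together with top-level scalar fields
results = client.search(
    collection_name="my_collection",
    data=[embeddingList1],
    anns_field="chunks[text_vector]",
    search_params={"metric_type": "MAX_SIM_COSINE"},
    limit=3,
    output_fields=["chunks[text]", "title", "author"]  # assumed to be combinable
)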

When the single-EmbeddingList search above succeeds, it returns the three entities most similar to the query EmbeddingList.

Search results
# [
#     [
#         {
#             'id': 461417939772144945,
#             'distance': 0.9675756096839905,
#             'entity': {
#                 'chunks': [
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'},
#                     {'text': 'All happy families are alike; each unhappy family is unhappy in its own way.'}
#                 ]
#             }
#         },
#         {
#             'id': 461417939772144965,
#             'distance': 0.9555778503417969,
#             'entity': {
#                 'chunks': [
#                     {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
#                     {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
#                     {'text': 'When I wrote the following pages, or rather the bulk of them...'},
#                     {'text': 'It was the best of times, it was the worst of times...'},
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'}
#                 ]
#             }
#         },
#         {
#             'id': 461417939772144962,
#             'distance': 0.9469035863876343,
#             'entity': {
#                 'chunks': [
#                     {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'},
#                     {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
#                     {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'}
#                 ]
#             }
#         }
#     ]
# ]

You can also include multiple query EmbeddingLists in the data parameter of a search request to run multiple similarity searches in a single request.

# a search with multiple embedding lists
results = client.search(
    collection_name="my_collection",
    data=[embeddingList1, embeddingList2],
    anns_field="chunks[text_vector]",
    search_params={"metric_type": "MAX_SIM_COSINE"},
    limit=3,
    output_fields=["chunks[text]"]
)

print(results)

In this case, the search results contain the three entities most similar to each query EmbeddingList.

Search results
# [
#     [
#         {
#             'id': 461417939772144945,
#             'distance': 0.9675756096839905,
#             'entity': {
#                 'chunks': [
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'},
#                     {'text': 'All happy families are alike; each unhappy family is unhappy in its own way.'}
#                 ]
#             }
#         },
#         {
#             'id': 461417939772144965,
#             'distance': 0.9555778503417969,
#             'entity': {
#                 'chunks': [
#                     {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
#                     {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
#                     {'text': 'When I wrote the following pages, or rather the bulk of them...'},
#                     {'text': 'It was the best of times, it was the worst of times...'},
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'}
#                 ]
#             }
#         },
#         {
#             'id': 461417939772144962,
#             'distance': 0.9469035863876343,
#             'entity': {
#                 'chunks': [
#                     {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'},
#                     {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'},
#                     {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'},
#                     {'text': 'The world is too much with us; late and soon, getting and spending...'}
#                 ]
#             }
#         }
#     ],
#     [
#         {
#             'id': 461417939772144663,
#             'distance': 1.9761409759521484,
#             'entity': {
#                 'chunks': [
#                     {'text': 'It was the best of times, it was the worst of times...'},
#                     {'text': 'It is a truth universally acknowledged, that a single man in possession...'},
#                     {'text': 'Whether I shall turn out to be the hero of my own life, or whether that station...'},
#                     {'text': 'He was an old man who fished alone in a skiff in the Gulf Stream...'}
#                 ]
#             }
#         },
#         {
#             'id': 461417939772144692,
#             'distance': 1.974656581878662,
#             'entity': {
#                 'chunks': [
#                     {'text': 'It is a truth universally acknowledged, that a single man in possession...'},
#                     {'text': 'Call me Ishmael. Some years ago—never mind how long precisely...'}
#                 ]
#             }
#         },
#         {
#             'id': 461417939772144662,
#             'distance': 1.9406685829162598,
#             'entity': {
#                 'chunks': [
#                     {'text': 'It is a truth universally acknowledged, that a single man in possession...'}
#                 ]
#             }
#         }
#     ]
# ]

In the example above, embeddingList1 contains one vector and embeddingList2 contains two. Each EmbeddingList triggers an independent similarity search and returns the topK entities most similar to it.
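
Because the results contain one list of hits per query EmbeddingList, you can post-process them per query. Below is a minimal sketch that iterates over the structure shown in the sample output above.

# A sketch only: iterate over the hits returned for each query EmbeddingList
for i, hits in enumerate(results):
    print(f"Top hits for query EmbeddingList {i}:")
    for hit in hits:
        print(f"  id={hit['id']}, distance={hit['distance']}")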

What's next

Support for Struct Array is a major step forward for Zilliz Cloud, improving its ability to handle complex data structures. To better understand the use cases for Struct Array and get the most out of this feature, we recommend reading Schema Design with Struct Array.