
Random Sampling

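When working with large datasets, you often don't need to process all of the data to gain insights or test filtering logic. Random sampling lets you work with a statistically representative subset of the data, which can significantly reduce query time and resource consumption.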

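Random sampling operates at the segment level, which keeps performance efficient across the collection's data distribution while preserving the randomness of the sample.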

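Key use cases:

  • Data exploration: quickly preview a collection's structure and content with minimal resource usage

  • Development testing: test complex filtering logic on a manageable data sample before full deployment

  • Resource optimization: reduce the computational cost of exploratory queries and statistical analysis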

Syntax

filter = "RANDOM_SAMPLE(sampling_factor)"

Parameters:

  • sampling_factor: a sampling factor in the range (0, 1), excluding the boundaries. For example, RANDOM_SAMPLE(0.001) selects roughly 0.1% of the results.
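As a rough aid for picking a factor, the arithmetic behind it can be sketched in plain Python. The helper names `expected_sample_size` and `factor_for_target` are hypothetical and not part of pymilvus. Because the sample is random, the number of rows actually returned is only approximately the value computed here.

```python
def expected_sample_size(matching_rows: int, sampling_factor: float) -> float:
    """Approximate number of rows a given sampling factor would keep."""
    if not 0 < sampling_factor < 1:
        raise ValueError("sampling_factor must be strictly between 0 and 1")
    return matching_rows * sampling_factor


def factor_for_target(matching_rows: int, target_rows: int) -> float:
    """Sampling factor that keeps roughly target_rows out of matching_rows."""
    if not 0 < target_rows < matching_rows:
        raise ValueError("target_rows must be positive and below matching_rows")
    return target_rows / matching_rows


# Hypothetical numbers: one million matching rows.
print(expected_sample_size(1_000_000, 0.001))
print(factor_for_target(1_000_000, 1_000))
```

Working backwards from a target sample size like this is one way to arrive at a factor instead of guessing.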

Important rules:

  • The expression is case-insensitive (RANDOM_SAMPLE or random_sample)

  • The sampling factor must be in the range (0, 1), excluding the boundaries
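The range rule can be checked on the client before a query is sent. The following is a minimal sketch, assuming a hypothetical helper named `random_sample_clause` (it is not part of pymilvus). It renders the factor as a plain decimal so that very small values are not written in scientific notation; whether the server accepts scientific notation is not stated here, so the sketch avoids it.

```python
from decimal import Decimal


def random_sample_clause(sampling_factor: float) -> str:
    """Return a RANDOM_SAMPLE(...) clause after checking the factor locally."""
    # The documented range is (0, 1), exclusive on both ends.
    if not 0 < sampling_factor < 1:
        raise ValueError(
            f"sampling_factor must be in (0, 1), got {sampling_factor!r}"
        )
    # Render as a plain decimal, e.g. 0.00001 rather than 1e-05.
    literal = format(Decimal(repr(float(sampling_factor))), "f")
    return f"RANDOM_SAMPLE({literal})"


print(random_sample_clause(0.001))
print(random_sample_clause(0.00001))
```

Failing fast in Python gives a clearer error than a rejected filter expression would.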

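Combining with other filters

The random sampling operator must be combined with other filter expressions using logical AND. When filters are combined, Milvus applies the other conditions first and then performs random sampling on the result set.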

# Correct: Filter first, then sample
filter = 'color == "red" AND RANDOM_SAMPLE(0.001)'
# Processing: Find all red items → Sample 0.1% of those red items

# Incorrect: OR doesn't make logical sense
filter = 'color == "red" OR RANDOM_SAMPLE(0.001)' # ❌ Invalid logic
# This would mean: "Either red items OR sample everything" - which is meaningless
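When filter strings are assembled in code, the AND rule can be applied in one place. This is a sketch under assumptions: `sample_of` is a hypothetical helper, and it assumes the filter grammar accepts a parenthesized expression on the left side of AND. The parentheses matter when the base filter itself contains OR, since AND would otherwise bind to only part of it.

```python
def sample_of(base_filter: str, sampling_factor: float) -> str:
    """AND a RANDOM_SAMPLE clause onto an existing filter expression."""
    if not 0 < sampling_factor < 1:
        raise ValueError("sampling_factor must be in (0, 1)")
    base = base_filter.strip()
    if not base:
        # No other conditions: sample the whole collection.
        return f"RANDOM_SAMPLE({sampling_factor})"
    # Wrap the base filter so an inner OR cannot capture the sampling clause.
    return f"({base}) AND RANDOM_SAMPLE({sampling_factor})"


print(sample_of('color == "red" OR color == "blue"', 0.001))
```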

Examples

Example 1: Data exploration

Quickly preview your collection's structure:

from pymilvus import MilvusClient

client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")

# Sample approximately 1% of the entire collection
result = client.query(
    collection_name="product_catalog",
    filter="RANDOM_SAMPLE(0.01)",
    output_fields=["id", "product_name"],
    limit=10
)

print(f"Sampled {len(result)} products from collection")

Example 2: Combining filtering with random sampling

Test filtering logic on a manageable subset:

# First filter by category and price, then sample 0.5% of results
filter_expression = 'category == "electronics" AND price > 100 AND RANDOM_SAMPLE(0.005)'

result = client.query(
    collection_name="product_catalog",
    filter=filter_expression,
    output_fields=["product_name", "price", "rating"],
    limit=10
)

print(f"Found {len(result)} electronics products in sample")

Example 3: Quick analysis

Run a quick statistical analysis on filtered data:

# Get insights from ~0.1% of premium customer data
filter_expression = 'customer_tier == "premium" AND region == "North America" AND RANDOM_SAMPLE(0.001)'

result = client.query(
    collection_name="customer_profiles",
    filter=filter_expression,
    output_fields=["purchase_amount", "satisfaction_score", "last_purchase_date"],
    limit=10
)

# Analyze sample for quick insights
if result:
    average_purchase = sum(r["purchase_amount"] for r in result) / len(result)
    average_satisfaction = sum(r["satisfaction_score"] for r in result) / len(result)

    print(f"Sample size: {len(result)}")
    print(f"Average purchase amount: ${average_purchase:.2f}")
    print(f"Average satisfaction score: {average_satisfaction:.2f}")
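Averages computed from a small sample are estimates, so it can help to report how uncertain they are. The sketch below uses only the Python standard library; `mean_with_standard_error` is a hypothetical helper, and the purchase amounts are made-up values standing in for the `purchase_amount` field of the query results.

```python
import math
import statistics


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate spread")
    mean = statistics.fmean(values)
    # Standard error: sample standard deviation divided by sqrt(n).
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, std_err


# Made-up purchase amounts for illustration only.
purchase_amounts = [120.0, 80.0, 100.0, 140.0, 60.0]
mean, std_err = mean_with_standard_error(purchase_amounts)
print(f"Average purchase amount: ${mean:.2f} ± {std_err:.2f}")
```

The standard error shrinks with the square root of the sample size, so a query capped at limit=10 yields a fairly coarse estimate.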

Example 4: Combining with vector search

Use random sampling in a filtered search scenario:

# Search for similar products within a sampled subset
search_results = client.search(
    collection_name="product_catalog",
    data=[[0.1, 0.2, 0.3, 0.4, 0.5]],  # query vector
    filter='category == "books" AND RANDOM_SAMPLE(0.01)',
    search_params={"metric_type": "L2", "params": {}},
    output_fields=["title", "author", "price"],
    limit=10
)

print(f"Found {len(search_results[0])} similar books in sample")

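Best practices

  • Start small: begin with small sampling factors (0.001 to 0.01) for initial exploration

  • Development workflow: use sampling during development and remove it from production queries

  • Statistical validity: larger samples are more statistically representative

  • Performance testing: monitor query performance and adjust the sampling factor as needed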
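One way to keep sampling out of production queries without editing each filter by hand is to gate it behind a flag. This is only a sketch: `dev_filter` and the `SAMPLE_QUERIES` environment variable are hypothetical names chosen for illustration, not Milvus or pymilvus settings, and the helper assumes a non-empty base filter.

```python
import os


def dev_filter(base_filter: str, sampling_factor: float = 0.01) -> str:
    """Append RANDOM_SAMPLE only when the SAMPLE_QUERIES flag is set to 1."""
    # SAMPLE_QUERIES is an illustrative environment variable.
    if os.environ.get("SAMPLE_QUERIES") != "1":
        # Production path: return the filter unchanged and query all data.
        return base_filter
    if not 0 < sampling_factor < 1:
        raise ValueError("sampling_factor must be in (0, 1)")
    return f"({base_filter}) AND RANDOM_SAMPLE({sampling_factor})"


print(dev_filter('category == "electronics" AND price > 100'))
```

With the flag unset the filter passes through untouched, so the same call site serves both development and production.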