create_collection()
This operation supports creating a collection in two distinct ways: quick setup or custom setup.
Request syntax
create_collection(
collection_name: str,
dimension: int,
primary_field_name: str = "id",
id_type: str = DataType,
vector_field_name: str = "vector",
metric_type: str = "COSINE",
auto_id: bool = False,
timeout: Optional[float] = None,
schema: Optional[CollectionSchema] = None, # Used for custom setup
index_params: Optional[IndexParams] = None, # Used for custom setup
**kwargs,
) -> None
PARAMETERS:
-
collection_name (str) -
[REQUIRED]
The name of the collection to create.
-
dimension (int) -
The dimension of the collection field to hold vector embeddings.
The value is usually determined by the model you use to generate vector embeddings and should be an integer greater than 1.
This parameter is designed for the quick setup of a collection and will be ignored if schema is not None and a field in the schema has its dim set to a positive integer.
-
primary_field_name (str) -
The name of the primary field in this collection.
The value defaults to id. You can use another name you see fit. Skip this parameter if you need to set up a collection with a customized schema.
This parameter is designed for the quick setup of a collection and will ignored if the schema is not None and a field in the schema has its is_primary set to True.
-
id_type (DataType) -
The data type of the primary field in this collection.
The value defaults to DataType.INT64. Possible values are DataType.INT64 and DataType.VARCHAR.
This parameter is designed for the quick setup of a collection and will be ignored if schema is not None.
-
vector_field_name (str) -
The name of the collection field to hold vector embeddings.
The value defaults to vector. You can use another name you see fit.
This parameter is designed for the quick setup of a collection and will be ignored if schema is not None.
-
metric_type (str) -
The algorithm used for this collection to measure similarities between vector embeddings.
The value defaults to COSINE. Possible values are L2, IP, and COSINE. For details on these metric types, refer to Similarity Metrics Explained.
This parameter is designed for the quick setup of a collection and will be ignored if schema is not None.
-
auto_id (bool) -
Whether the primary field automatically increments upon data insertions into this collection.
The value defaults to False. Setting this to True makes the primary field automatically increment. In this case, the primary field should not be included in the data to insert to avoid errors. The auto-generated IDs have a fixed length and cannot be altered.
This parameter is designed for the quick setup of a collection and will be ignored if schema is not None.
-
timeout (float | None) -
The timeout duration for this operation. Setting this to None indicates that this operation timeouts when any response returns or error occurs.
-
schema (CollectionSchema | None)
The schema of this collection.
Setting this to None indicates this collection will be created in a quick setup manner.
To set up a collection with a customized schema, you need to create a CollectionSchema object and reference it here. In this case, Zilliz Cloud ignores all other schema-related settings carried in the request.
-
index_params (IndexParams | None)
The parameters for building the index on the vector field in this collection. To set up a collection with a customized schema and automatically load the collection to memory, you need to create an IndexParams object and reference it here.
You should at least add an index for the vector field in this collection. You can also skip this parameter if you prefer to set up the index parameters later on.
-
kwargs -
-
enable_dynamic_field (bool) -
Whether to use a reserved JSON field named $meta to store undefined fields and their values in key-value pairs.
The value defaults to True, indicating that the $meta field is used.
This parameter is ignored if schema is not None.
-
num_shards (int) -
The number of shards to create along with the creation of this collection.
The value defaults to 1, indicating that one shard is to be created along with this collection.
📘What is sharding?Sharding refers to distributing write operations to different nodes to make the most of the parallel computing potential of a Milvus cluster for writing data.
By default, a collection contains one shard.
-
partition_key_field (str) -
The name of the field that serves as the partition key. Each collection can have one partition key.
This parameter is ignored if schema is not None and a field in the schema has its is_parition_key set to True.
📘What is the partition key?To facilitate partition-oriented multi-tenancy, you can set a field as the partition key field so that Zilliz Cloud hashes the field values and distributes entities among the specified number of partitions accordingly.
When retrieving entities, ensure that the partition key field is used in the boolean expression to filter out entities of a specific field value.
For details, refer to Use Partition Key and Multi-tenancy.
-
num_partitions (int) -
The number of partitions to create for the partition key feature.
The value defaults to 64, indicating that 64 partitions are to be created along with this collection. This parameter applies when partition_key_field is set to the name of a field.
-
consistency_level (int | str)
The consistency level of the target collection.
The value defaults to Bounded (2) with options of Strong (0), Session (1), Bounded (2), and Eventually (3).
📘What is the consistency level?Consistency in a distributed database specifically refers to the property that ensures every node or replica has the same view of data when writing or reading data at a given time.
Zilliz Cloud provides three consistency levels: Strong, Bounded Staleness, and Eventually, with Bounded Staleness set as the default.
You can easily tune the consistency level when conducting a vector similarity search or query to make it best suit your application.
-
properties (dict) -
Additional properties in key-value pairs.
-
collection.ttl.seconds (int)
The time-to-live (TTL) of a collection in seconds.
-
-
RETURN TYPE:
NoneType
RETURNS:
None
EXCEPTIONS:
-
PrimaryKeyException
This exception will be raised if the data type of the primary field is not an integer or a string.
-
MilvusException
This exception will be raised when any error occurs during this operation.
Examples
Set up a Milvus client
from pymilvus import MilvusClient
client = MilvusClient(
uri="https://inxx-xxxxxxxxxxxx.api.gcp-us-west1.zillizcloud.com:19530",
token="user:password"
)
Create a collection
You can choose between a quick setup or a customized setup as follows:
-
Quick setup
The quick setup collection has two mandatory fields: the primary and vector fields. It also allows the insertion of undefined fields and their values in key-value pairs in a dynamic field.
client.create_collection(
collection_name="test_collection",
dimension=5
)In the above setup,
-
The primary and vector fields use their default names (id and vector).
-
The metric type is also set to its default value (COSINE).
-
The primary field accepts integers and does not automatically increments.
-
The reserved JSON field named $meta is used to store non-schema-defined fields and their values.
You can modify the names of the primary and vector fields and change the metric type. Additionally, the primary field can be set to increment automatically.
client.create_collection(
collection_name="quick_setup",
dimension=5,
primary_field_name="my_id",
id_type="string",
vector_field_name="my_vector",
metric_type="L2",
auto_id=True,
max_length=512
)In the above code, the collection will be created, indexed, and loaded into memory.
-
-
Customized setup with index parameters
For a customized setup, create the schema and index parameters beforehand.
from pymilvus import MilvusClient, DataType
# 1. Create schema
schema = MilvusClient.create_schema(
auto_id=False,
enable_dynamic_field=False,
)
# 2. Add fields to schema
schema.add_field(field_name="my_id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="my_vector", datatype=DataType.FLOAT_VECTOR, dim=5)
# 3. Prepare index parameters
index_params = client.prepare_index_params()
# 4. Add indexes
index_params.add_index(
field_name="my_id",
index_type="STL_SORT"
)
index_params.add_index(
field_name="my_vector",
index_type="AUTOINDEX",
metric_type="L2",
params={"nlist": 1024}
)
# 5. Create a collection
client.create_collection(
collection_name="customized_setup",
schema=schema,
index_params=index_params
)In the above code, the collection will be created, indexed, and loaded into memory.
-
Customized setup without index parameters
from pymilvus import MilvusClient, DataType
# 1. Create schema
schema = MilvusClient.create_schema(
auto_id=False,
enable_dynamic_field=False,
)
# 2. Add fields to schema
schema.add_field(field_name="my_id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="my_vector", datatype=DataType.FLOAT_VECTOR, dim=5)
# 3. Create a collection
client.create_collection(
collection_name="customized_setup",
schema=schema
)In the above code, the collection will also be created. However, without
index_param
, data in the collection will not be indexed and loaded into memory.