lamindb.Collection¶
- class lamindb.Collection(artifacts: list[Artifact], key: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, revises: Collection | None = None)¶
Bases: Record, IsVersioned, TracksRun, TracksUpdates
Collections of artifacts.
Collections provide a simple way of versioning collections of artifacts.
- Parameters:
artifacts (list[Artifact]) – A list of artifacts.
key (str) – A file-path like key, analogous to the key parameter of Artifact and Transform.
description (str | None, default: None) – A description.
revises (Collection | None, default: None) – An old version of the collection.
run (Run | None, default: None) – The run that creates the collection.
meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.
reference (str | None, default: None) – A simple reference, e.g. an external ID or a URL.
reference_type (str | None, default: None) – The type of the simple reference, e.g. "url".
Examples
Create a collection from a list of Artifact objects:
>>> collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection")
Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):
>>> collection = ln.Collection(data_artifact, key="my_project/my_collection", meta=metadata_artifact)
Attributes¶
- property data_artifact: Artifact | None¶
Access to a single data artifact.
If the collection has a single data & metadata artifact, this allows access via:
collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata
- property name: str¶
Name of the collection.
Splits key on "/" and returns the last element.
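The behavior described above can be sketched in plain Python. This is an illustration only, not lamindb's source; `name_from_key` is a hypothetical helper introduced for this example.

```python
def name_from_key(key: str) -> str:
    """Return the last "/"-separated element of a path-like key."""
    # Hypothetical helper mirroring the documented behavior of `Collection.name`.
    return key.split("/")[-1]


print(name_from_key("my_project/my_collection"))  # my_collection
```

A key without any "/" is returned unchanged, since splitting then yields a single element.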
- property ordered_artifacts: QuerySet¶
Ordered QuerySet of .artifacts.
Accessing the many-to-many field collection.artifacts directly gives you a non-deterministic order.
Using the property .ordered_artifacts lets you iterate through a set that is ordered by creation.
- property stem_uid: str¶
Universal id characterizing the version family.
The full uid of a record is obtained via concatenating the stem uid and version information:
stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
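The scheme above can be made concrete with a small runnable sketch. It is written under assumptions: the alphabet order, the helper names `encode_base62` and `split_uid`, and the way the version counter is incremented are all illustrative and not taken from lamindb's implementation.

```python
import secrets
import string

# Base62 alphabet; the character order is an assumption of this sketch.
BASE62 = string.digits + string.ascii_letters


def random_base62(n_char: int) -> str:
    """Return a random base62 string of length n_char."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def encode_base62(number: int, width: int = 4) -> str:
    """Encode a non-negative integer as a zero-padded base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62[remainder])
    encoded = "".join(reversed(digits))
    if len(encoded) > width:
        raise OverflowError("number does not fit into the requested width")
    return encoded.rjust(width, BASE62[0])


def split_uid(uid: str, version_width: int = 4) -> tuple[str, str]:
    """Split a full uid into (stem_uid, version_uid)."""
    return uid[:-version_width], uid[-version_width:]


stem_uid = random_base62(16)              # length 16 for artifacts and collections
uid_v1 = f"{stem_uid}{encode_base62(0)}"  # first version ends in "0000"
uid_v2 = f"{stem_uid}{encode_base62(1)}"  # next version shares the stem
print(split_uid(uid_v2)[1])               # 0001
```

Because all versions share the same stem, records of one version family can be grouped by comparing everything except the last four characters of the uid.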
Simple fields¶
- uid: str¶
Universal id, valid across DB instances.
- key: str¶
Name or path-like key.
- description: str | None¶
A description or title.
- hash: str | None¶
Hash of collection content.
- reference: str | None¶
A reference like URL or external ID.
- reference_type: str | None¶
Type of reference, e.g., cellxgene Census collection_id.
- meta_artifact: Artifact | None¶
An artifact that stores metadata that indexes a collection.
It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.
- version: str | None¶
Version (default None).
Defines version of a family of records characterized by the same stem_uid.
Consider using semantic versioning with Python versioning.
- is_latest: bool¶
Boolean flag that indicates whether a record is the latest in its version family.
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
Relational fields¶
Class methods¶
- classmethod df(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
QuerySet
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
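To illustrate how a Django-style query expression such as `name__startswith="my"` is read, here is an in-memory analogue that works on plain dictionaries. It is a sketch only: the `matches` function and the `LOOKUPS` table are hypothetical, cover just a few of Django's lookups, and do not follow relations the way real query expressions can.

```python
from typing import Any, Callable

# A small subset of Django's field lookups, re-implemented for plain dicts.
LOOKUPS: dict[str, Callable[[Any, Any], bool]] = {
    "exact": lambda value, target: value == target,
    "startswith": lambda value, target: str(value).startswith(target),
    "contains": lambda value, target: target in str(value),
    "in": lambda value, target: value in target,
}


def matches(record: dict[str, Any], **expressions: Any) -> bool:
    """Return True if the record satisfies every `field__lookup=value` expression."""
    for expression, target in expressions.items():
        field, _, lookup = expression.partition("__")
        # A bare field name means an exact match, as in Django.
        if not LOOKUPS[lookup or "exact"](record[field], target):
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
selected = [r for r in records if matches(r, name__startswith="my")]
print(selected)  # [{'name': 'my label'}]
```

Multiple keyword expressions are combined with a logical AND, which matches how several expressions passed to one filter call narrow the result.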
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples:
ulabel = ln.ULabel.get("FvtpPJLJ")
ulabel = ln.ULabel.get(name="my-label")
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
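The idea of an auto-complete object with a dictionary converter can be sketched with `collections.namedtuple`. This is an assumption-laden illustration, not lamindb's implementation: `to_field_name` and `make_lookup` are hypothetical helpers, the records are plain dictionaries, and the second gene symbol is only sample data.

```python
import re
from collections import namedtuple


def to_field_name(value: str) -> str:
    """Lower-case a value and replace non-word characters with underscores."""
    return re.sub(r"\W", "_", value.lower())


def make_lookup(records: dict):
    """Return a namedtuple-like object with one attribute per record key."""
    # Assumes every key maps to a distinct, valid Python identifier.
    base = namedtuple("Lookup", [to_field_name(key) for key in records])

    class Lookup(base):
        def dict(self):
            # Converter back to a plain dictionary keyed by the original values.
            return records.copy()

    return Lookup(*records.values())


genes = {"ADGB-DT": {"symbol": "ADGB-DT"}, "TP53": {"symbol": "TP53"}}
lookup = make_lookup(genes)
print(lookup.adgb_dt)            # {'symbol': 'ADGB-DT'}
print(lookup.dict()["ADGB-DT"])  # {'symbol': 'ADGB-DT'}
```

Attribute access requires normalized names such as `adgb_dt`, which is why the dictionary form is useful when the original value, here "ADGB-DT", is not a valid identifier.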
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum number of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form "account_handle/instance_name".
- Return type:
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
Methods¶
- async adelete(using=None, keep_parents=False)¶
- append(artifact, run=None)¶
Append an artifact to the collection.
This does not modify the original collection in-place, but returns a new version of the original collection with the appended artifact.
- Parameters:
- Return type:
Examples:
collection_v1 = ln.Collection(artifact, key="My collection").save()
collection_v2 = collection_v1.append(another_artifact)  # returns a new version of the collection
collection_v2.save()  # save the new version
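The append-returns-a-new-version semantics can be illustrated without a database. The sketch below is a simplified model under assumptions: `VersionedCollection` is a hypothetical class, artifacts are represented by strings, and the integer version counter is illustrative (the real version field is an optional string).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedCollection:
    """Immutable stand-in for a collection; appending creates a new object."""

    key: str
    artifacts: tuple[str, ...]
    version: int = 1
    revises: VersionedCollection | None = None

    def append(self, artifact: str) -> VersionedCollection:
        # The original object is left untouched; the new one points back to it.
        return VersionedCollection(
            key=self.key,
            artifacts=self.artifacts + (artifact,),
            version=self.version + 1,
            revises=self,
        )


collection_v1 = VersionedCollection("My collection", ("artifact",))
collection_v2 = collection_v1.append("another_artifact")
print(collection_v1.artifacts)  # ('artifact',)
print(collection_v2.artifacts)  # ('artifact', 'another_artifact')
```

Keeping the earlier version intact means anything that referenced it still sees the same set of artifacts after the append.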
- async arefresh_from_db(using=None, fields=None, from_queryset=None)¶
- async asave(*args, force_insert=False, force_update=False, using=None, update_fields=None)¶
- cache(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows syncing logic: only caches outdated artifacts.
Returns paths to locally cached on-disk artifacts.
- Parameters:
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
list[UPath]
- clean()¶
Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.
- clean_fields(exclude=None)¶
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.
- date_error_message(lookup_type, field_name, unique_for)¶
- delete(permanent=None)¶
Delete collection.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.delete()
- describe()¶
Describe relations of record.
- Return type:
None
Examples
>>> artifact.describe()
- get_constraints()¶
- get_deferred_fields()¶
Return a set containing names of deferred fields for this instance.
- load(join='outer', is_run_input=None, **kwargs)¶
Stage and load to memory.
Returns in-memory representation if possible such as a concatenated DataFrame or AnnData object.
- Return type:
Any
- mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶
Return a map-style dataset.
Returns a PyTorch map-style dataset by virtually concatenating AnnData arrays.
If your AnnData collection is in the cloud, move its artifacts into a local cache first via cache().
__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
Note
For a guide, see Train a machine learning model on a collection.
This method currently only works for collections of AnnData artifacts.
- Parameters:
layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
Examples
>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
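To show what virtual concatenation means for a map-style dataset, here is a minimal sketch that uses plain Python lists instead of AnnData objects. It is not the MappedCollection implementation: the `VirtualConcat` class and the dictionary-based stores are assumptions made for illustration, and only the "X", obs key, and "_store_idx" entries of the returned dictionary are modeled.

```python
from bisect import bisect_right
from itertools import accumulate


class VirtualConcat:
    """Map-style dataset over several in-memory stores, without copying them."""

    def __init__(self, stores: list[dict[str, list]], obs_keys: list[str]):
        self.stores = stores
        self.obs_keys = obs_keys
        # offsets[i] is the global index one past the last sample of store i.
        self.offsets = list(accumulate(len(store["X"]) for store in stores))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx: int) -> dict:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the store that contains the global index, then the local index.
        store_idx = bisect_right(self.offsets, idx)
        start = self.offsets[store_idx - 1] if store_idx else 0
        store = self.stores[store_idx]
        local_idx = idx - start
        sample = {"X": store["X"][local_idx], "_store_idx": store_idx}
        for key in self.obs_keys:
            sample[key] = store[key][local_idx]
        return sample


stores = [
    {"X": [[1.0], [2.0]], "cell_type": ["T", "B"]},
    {"X": [[3.0]], "cell_type": ["NK"]},
]
dataset = VirtualConcat(stores, obs_keys=["cell_type"])
print(len(dataset))  # 3
print(dataset[2])    # {'X': [3.0], '_store_idx': 1, 'cell_type': 'NK'}
```

Only the cumulative offsets are stored, so no data is copied; each lookup costs one binary search over the number of stores.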
- open(is_run_input=None)¶
Return a cloud-backed pyarrow Dataset.
Works for pyarrow compatible formats.
- Return type:
Dataset
Notes
For more info, see tutorial: Slice arrays.
- prepare_database_save(field)¶
- refresh_from_db(using=None, fields=None, from_queryset=None)¶
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
- restore()¶
Restore collection record from trash.
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.restore()
- save(using=None)¶
Save the collection and underlying artifacts to database & storage.
- Parameters:
using (str | None, default: None) – The database to which you want to save.
- Return type:
Examples
>>> collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection").save()
- save_base(raw=False, force_insert=False, force_update=False, using=None, update_fields=None)¶
Handle the parts of saving which should be done only once per save, yet need to be done in raw saves, too. This includes some sanity checks and signal sending.
The ‘raw’ argument is telling save_base not to save any parent models and not to do any changes to the values before save. This is used by fixture loading.
- serializable_value(field_name)¶
Return the value of the field name for this instance. If the field is a foreign key, return the id value instead of the object. If there’s no Field object with this name on the model, return the model attribute’s value.
Used to serialize a field’s value (in the serializer, or form output, for example). Normally, you would just access the attribute directly and not use this method.
- unique_error_message(model_class, unique_check)¶
- validate_constraints(exclude=None)¶
- validate_unique(exclude=None)¶
Check unique constraints on the model and raise ValidationError if any failed.
- view_lineage(with_children=True)¶
Graph of data flow.
- Return type:
None
Notes
For more info, see use cases: Data lineage.
Examples
>>> collection.view_lineage()
>>> artifact.view_lineage()