User Metadata API
This documents the qiime2.Metadata API.
This may be used by QIIME 2 plugin developers or users of the QIIME 2 Python 3 API.
The qiime2.Metadata class
- class qiime2.Metadata(dataframe, column_missing_schemes=None, default_missing_scheme='blank')
Store metadata associated with identifiers in a study.
Metadata is tabular in nature, mapping study identifiers (e.g. sample or feature IDs) to columns of metadata associated with each ID.
For more details about metadata in QIIME 2, including the TSV metadata file format, see the Metadata Tutorial at https://docs.qiime2.org.
The following text focuses on design and considerations when working with Metadata objects at the API level.
A Metadata object is composed of zero or more MetadataColumn objects. A Metadata object always contains at least one ID, regardless of the number of columns. Each column in the Metadata object has an associated column type representing either categorical or numeric data. Each metadata column is represented by an object corresponding to the column’s type: CategoricalMetadataColumn or NumericMetadataColumn, respectively.
A Metadata object is closely linked to its corresponding TSV metadata file format described at https://docs.qiime2.org. Therefore, certain requirements present in the file format are also enforced on the in-memory object in order to make serialized Metadata objects roundtrippable when loaded from disk again. For example, IDs cannot begin with a pound character (#) because those IDs would be interpreted as comment rows when written to disk as TSV. See the metadata file format spec for more details about data formatting requirements.
In addition to being loaded from or saved to disk, a Metadata object can be constructed from a pandas.DataFrame object. See the Parameters section below for details on how to construct Metadata objects from dataframes.
Metadata objects have various methods to access, filter, and merge data. A dataframe can be retrieved from the Metadata object for further data manipulation using the pandas API. Individual MetadataColumn objects can be retrieved to gain access to APIs applicable to a single metadata column.
Missing values may be encoded in one of the following schemes:
- ‘blank’
The default, which treats None/NaN as the only valid missing values.
- ‘no-missing’
Indicates there are no missing values in a column; any None/NaN values should be considered an error. If a scheme other than ‘blank’ is used by default, this scheme can be provided to preserve strings as categorical terms.
- ‘INSDC:missing’
The INSDC vocabulary for missing values. The current implementation supports only lower-case terms which match exactly: ‘not applicable’, ‘missing’, ‘not provided’, ‘not collected’, and ‘restricted access’.
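As a rough illustration of the three schemes, the following plain-Python sketch interprets raw column values under each one. It does not call qiime2 and is not how qiime2 implements the schemes; the helper name interpret_missing, the sample values, and the use of None as the missing marker are assumptions made for this example only.

```python
# Illustrative only: emulates the scheme descriptions above.
# `interpret_missing` is a hypothetical helper, not part of qiime2.
INSDC_TERMS = {
    "not applicable",
    "missing",
    "not provided",
    "not collected",
    "restricted access",
}


def interpret_missing(values, scheme="blank"):
    """Return a copy of `values` with missing entries represented as None."""
    if scheme == "blank":
        # Only None counts as missing; every string is kept as-is.
        return list(values)
    if scheme == "no-missing":
        # Any missing value is treated as an error under this scheme.
        if any(v is None for v in values):
            raise ValueError("missing values are not allowed")
        return list(values)
    if scheme == "INSDC:missing":
        # Exact, lower-case matches of the INSDC terms become missing.
        return [None if v in INSDC_TERMS else v for v in values]
    raise ValueError(f"unknown scheme: {scheme!r}")


raw = ["gut", "not collected", "Not Collected", None]
as_blank = interpret_missing(raw, "blank")
as_insdc = interpret_missing(raw, "INSDC:missing")
```

Under ‘blank’ the string "not collected" stays an ordinary categorical value, while under ‘INSDC:missing’ it is treated as missing. The capitalized "Not Collected" is kept in both cases because only exact lower-case terms match.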
- Parameters:
dataframe (pandas.DataFrame) – Dataframe containing metadata. The dataframe’s index defines the IDs, and the index name (Index.name) must match one of the required ID headers described in the metadata file format spec. Each column in the dataframe defines a metadata column, and the metadata column’s type (i.e. categorical or numeric) is determined based on the column’s dtype. If a column has dtype=object, it may contain strings or pandas missing values (e.g. np.nan, None). Columns matching this requirement are assumed to be categorical. If a column in the dataframe has dtype=float or dtype=int, it may contain floating point numbers or integers, as well as pandas missing values (e.g. np.nan). Columns matching this requirement are assumed to be numeric. Regardless of column type (categorical vs numeric), the dataframe stored within the Metadata object will have any missing values normalized to np.nan. Columns with dtype=int will be cast to dtype=float. To obtain a dataframe from the Metadata object containing these normalized data types and values, use Metadata.to_dataframe().
column_missing_schemes (dict, optional) – Describe the metadata column handling for missing values described in the dataframe. This is a dict mapping column names (str) to missing-value schemes (str). Valid values are ‘blank’, ‘no-missing’, and ‘INSDC:missing’. Column names may be omitted.
default_missing_scheme (str, optional) – The missing scheme to use when none has been provided in the file or in column_missing_schemes.
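A minimal sketch of a dataframe in the shape the dataframe parameter describes, using only pandas and NumPy. The sample IDs and column names are invented, and the last few lines emulate the documented normalization (integer columns cast to float, None normalized to np.nan) with plain pandas rather than calling qiime2.

```python
import numpy as np
import pandas as pd

# Hypothetical study data. The index holds the IDs and is named 'id'.
df = pd.DataFrame(
    {
        "body_site": ["gut", "tongue", None],  # strings -> categorical
        "ph": [7, 6, 8],                       # integers -> numeric
    },
    index=pd.Index(["sample-1", "sample-2", "sample-3"], name="id"),
)
# Make the object dtype explicit so the column reads as categorical.
df["body_site"] = df["body_site"].astype(object)

# Emulation (an assumption, not qiime2 code) of the normalization that the
# Metadata object is documented to apply to its stored dataframe.
normalized = df.copy()
normalized["ph"] = normalized["ph"].astype(float)
normalized["body_site"] = normalized["body_site"].where(
    normalized["body_site"].notna(), np.nan
)
```

With QIIME 2 installed, a dataframe like df would be passed directly as qiime2.Metadata(df), and Metadata.to_dataframe() would return the normalized form.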
- classmethod load(filepath, column_types=None, column_missing_schemes=None, default_missing_scheme='blank')
Load a TSV metadata file.
The TSV metadata file format is described at https://docs.qiime2.org in the Metadata Tutorial.
- Parameters:
filepath (str) – Path to TSV metadata file to be loaded.
column_types (dict, optional) – Override metadata column types specified or inferred in the file. This is a dict mapping column names (str) to column types (str). Valid column types are ‘categorical’ and ‘numeric’. Column names may be omitted from this dict to use the column types read from the file.
column_missing_schemes (dict, optional) – Override the metadata column handling for missing values described in the file. This is a dict mapping column names (str) to missing-value schemes (str). Valid values are ‘blank’, ‘no-missing’, and ‘INSDC:missing’. Column names may be omitted.
default_missing_scheme (str, optional) – The missing scheme to use when none has been provided in the file or in column_missing_schemes.
- Returns:
Metadata object loaded from filepath.
- Return type:
Metadata
- Raises:
MetadataFileError – If the metadata file is invalid in any way (e.g. doesn’t meet the file format’s requirements).
See also
save
- property columns
Ordered mapping of column names to ColumnProperties.
The mapping that is returned is read-only. This property is also read-only.
- Returns:
Ordered mapping of column names to ColumnProperties.
- Return type:
types.MappingProxyType
- property column_count
Number of metadata columns.
This property is read-only.
- Returns:
Number of metadata columns.
- Return type:
int
Notes
Zero metadata columns are allowed.
See also
id_count
- to_dataframe(encode_missing=False)
Create a pandas dataframe from the metadata.
The dataframe’s index name (Index.name) will match this metadata object’s id_header, and the index will contain this metadata object’s IDs. The dataframe’s column names will match the column names in this metadata. Categorical columns will be stored as dtype=object (containing strings), and numeric columns will be stored as dtype=float.
- Parameters:
encode_missing (bool, optional) – Whether to convert missing values (NaNs) back into their original vocabulary (strings) if a missing scheme was used.
- Returns:
Dataframe constructed from the metadata.
- Return type:
pandas.DataFrame
- get_column(name)
Retrieve metadata column based on column name.
- Parameters:
name (str) – Name of the metadata column to retrieve.
- Returns:
Requested metadata column (CategoricalMetadataColumn or NumericMetadataColumn).
- Return type:
MetadataColumn
- get_ids(where=None)
Retrieve IDs matching search criteria.
- Parameters:
where (str, optional) – SQLite WHERE clause specifying criteria IDs must meet to be included in the results. All IDs are included by default.
- Returns:
IDs matching search criteria specified in where.
- Return type:
set[str]
See also
ids, filter_ids, get_column
Notes
The ID header (Metadata.id_header) may be used in the where clause to query the table’s ID column.
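Because where is a SQLite WHERE clause, its semantics can be explored with the standard-library sqlite3 module. The sketch below builds a throwaway in-memory table and runs such a clause against it. The table name, column names, and data are invented for illustration, and this is not a description of how qiime2 executes the query internally.

```python
import sqlite3

# Hypothetical metadata: an ID column named 'id' plus two metadata columns.
rows = [
    ("sample-1", "gut", 7.0),
    ("sample-2", "tongue", 6.0),
    ("sample-3", "gut", 8.5),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE metadata ("id" TEXT, "body_site" TEXT, "ph" REAL)')
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

# The kind of clause that could be passed as the `where` argument.
where = "[body_site]='gut' AND [ph] > 7.5"

# Collect the IDs of all rows that satisfy the clause.
matching_ids = {
    row[0] for row in conn.execute(f'SELECT "id" FROM metadata WHERE {where}')
}
conn.close()
```

Here only sample-3 satisfies both conditions. Referring to the ID column by its header (for example [id] IN ('sample-1', 'sample-2')) works the same way in this sketch.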
- merge(*others)
Merge this Metadata object with other Metadata objects.
Returns a new Metadata object containing the merged contents of this Metadata object and others. The merge is not in-place and will always return a new merged Metadata object.
The merge will include only those IDs that are shared across all Metadata objects being merged (i.e. the merge is an inner join).
Each metadata column being merged must have a unique name; merging metadata with overlapping column names will result in an error.
- Parameters:
others (tuple) – One or more Metadata objects to merge with this Metadata object.
- Returns:
New object containing merged metadata. The merged IDs will be in the same relative order as the IDs in this Metadata object after performing the inner join. The merged column order will match the column order of Metadata objects being merged from left to right.
- Return type:
Metadata
- Raises:
ValueError – If zero Metadata objects are provided in others (there is nothing to merge in this case).
Notes
The merged Metadata object will always have its id_header property set to 'id', regardless of the id_header values on the Metadata objects being merged.
The merged Metadata object tracks all source artifacts that it was built from to preserve provenance (i.e. the .artifacts property on all Metadata objects is merged).
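The inner-join behavior of merge can be sketched with plain dictionaries. This is an emulation for illustration, not qiime2 code: merge_tables and the sample data are hypothetical, and each table is modeled as a dict mapping IDs to a dict of column values.

```python
def merge_tables(first, *others):
    """Inner-join dict-based tables on their IDs (illustrative helper)."""
    if not others:
        raise ValueError("at least one other table is required")

    tables = (first, *others)

    # Column names must be unique across all tables being merged.
    seen_columns = set()
    for table in tables:
        columns = set().union(*(row.keys() for row in table.values()))
        overlap = seen_columns & columns
        if overlap:
            raise ValueError(f"overlapping columns: {sorted(overlap)}")
        seen_columns |= columns

    # Keep only IDs present in every table, in the first table's order.
    # Columns are added table by table, i.e. from left to right.
    merged = {}
    for id_ in first:
        if all(id_ in table for table in others):
            merged[id_] = {}
            for table in tables:
                merged[id_].update(table[id_])
    return merged


site = {
    "s1": {"body_site": "gut"},
    "s2": {"body_site": "tongue"},
    "s3": {"body_site": "gut"},
}
ph = {"s3": {"ph": 8.5}, "s1": {"ph": 7.0}, "s4": {"ph": 6.0}}

merged = merge_tables(site, ph)
```

Only s1 and s3 appear in both tables, so only they survive, and they keep the order they had in the first table even though the second table lists s3 first.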
- filter_ids(ids_to_keep)
Filter metadata by IDs.
- Parameters:
ids_to_keep (iterable of str) – IDs that should be retained in the filtered Metadata object. If any IDs in ids_to_keep are not contained in this Metadata object, a ValueError will be raised. The filtered Metadata object will retain the same relative ordering of IDs in this Metadata object. Thus, the ordering of IDs in ids_to_keep does not determine the ordering of IDs in the filtered Metadata object.
- Returns:
The metadata filtered by IDs.
- Return type:
Metadata
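The ordering rule for filter_ids is easy to misread, so here is a small plain-Python sketch of it. The function below is a hypothetical stand-in that operates on a list of IDs. It only mirrors the documented behavior (original order preserved, ValueError on unknown IDs) and is not the qiime2 implementation.

```python
def filter_ids_sketch(ids, ids_to_keep):
    """Keep `ids_to_keep`, preserving the order of `ids` (illustrative)."""
    keep = set(ids_to_keep)
    unknown = keep - set(ids)
    if unknown:
        # Asking for an ID that does not exist is an error.
        raise ValueError(f"IDs not found: {sorted(unknown)}")
    # Iterate over the original IDs so their relative order is retained.
    return [id_ for id_ in ids if id_ in keep]


ids = ["s1", "s2", "s3", "s4"]
kept = filter_ids_sketch(ids, ["s4", "s1"])
```

Although the request lists s4 before s1, the result follows the original ordering, with s1 first.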
- filter_columns(*, column_type=None, drop_all_unique=False, drop_zero_variance=False, drop_all_missing=False)
Filter metadata by columns.
- Parameters:
column_type (str, optional) – If supplied, will retain only columns of this type. The currently supported column types are ‘numeric’ and ‘categorical’.
drop_all_unique (bool, optional) – If True, columns that contain a unique value for every ID will be dropped. Missing data (np.nan) are ignored when determining unique values. If a column consists solely of missing data, it will be dropped.
drop_zero_variance (bool, optional) – If True, columns that contain the same value for every ID will be dropped. Missing data (np.nan) are ignored when determining variance. If a column consists solely of missing data, it will be dropped.
drop_all_missing (bool, optional) – If True, columns that have a missing value (np.nan) for every ID will be dropped.
- Returns:
The metadata filtered by columns.
- Return type:
Metadata
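The three drop_* options of filter_columns interact with missing values in ways that are easier to see in code. The sketch below is a plain-Python reading of the parameter descriptions, not the qiime2 implementation. The helper names is_missing and should_drop are hypothetical, and None or NaN stands in for a missing value.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def should_drop(values, *, drop_all_unique=False, drop_zero_variance=False,
                drop_all_missing=False):
    """Decide whether a column with these values would be dropped."""
    present = [v for v in values if not is_missing(v)]
    n_unique = len(set(present))

    if drop_all_missing and not present:
        return True
    # Every non-missing value is distinct (also true for an empty column).
    if drop_all_unique and n_unique == len(present):
        return True
    # At most one distinct non-missing value (also true for an empty column).
    if drop_zero_variance and n_unique <= 1:
        return True
    return False


barcode = ["AAC", "GTT", None]       # unique for every non-missing ID
site = ["gut", "gut", "tongue"]      # repeated values
ph = [7.0, 7.0, math.nan]            # constant apart from a missing value
empty = [None, None, None]           # solely missing data
```

Note that a column consisting solely of missing data is dropped by any of the three options, matching the parameter descriptions.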
Metadata columns
- class qiime2.MetadataColumn(series, missing_scheme='blank')
Abstract base class representing a single metadata column.
Concrete subclasses represent specific metadata column types, e.g. CategoricalMetadataColumn and NumericMetadataColumn.
See the Metadata class docstring for details about Metadata and MetadataColumn objects, including a description of column types.
The main difference in constructing MetadataColumn vs Metadata objects is that MetadataColumn objects are constructed from a pandas.Series object instead of a pandas.DataFrame. Otherwise, the same restrictions, considerations, and data normalization are applied as with Metadata objects.
- Parameters:
series (pd.Series) – The series to construct a column from.
missing_scheme ("blank", "no-missing", "INSDC:missing") – How to interpret terms for missing values. These will be converted to NaN. The default treatment is to take no action.
- property name
Metadata column name.
This property is read-only.
- Returns:
Metadata column name.
- Return type:
str
- property missing_scheme
The vocabulary used to encode missing values.
This property is read-only.
- Returns:
“blank”, “no-missing”, or “INSDC:missing”
- Return type:
str
- to_series(encode_missing=False)
Create a pandas series from the metadata column.
The series index name (Index.name) will match this metadata column’s id_header, and the index will contain this metadata column’s IDs. The series name will match this metadata column’s name.
- Parameters:
encode_missing (bool, optional) – Whether to convert missing values (NaNs) back into their original vocabulary (strings) if a missing scheme was used.
- Returns:
Series constructed from the metadata column.
- Return type:
pandas.Series
- to_dataframe(encode_missing=False)
Create a pandas dataframe from the metadata column.
The dataframe will contain exactly one column. The dataframe’s index name (Index.name) will match this metadata column’s id_header, and the index will contain this metadata column’s IDs. The dataframe’s column name will match this metadata column’s name.
- Parameters:
encode_missing (bool, optional) – Whether to convert missing values (NaNs) back into their original vocabulary (strings) if a missing scheme was used.
- Returns:
Dataframe constructed from the metadata column.
- Return type:
pandas.DataFrame
- get_missing()
Return a series containing only missing values (with an index).
If the column was constructed with a missing scheme, then the values of the series will be the original terms instead of NaN.
- has_missing_values()
Determine if the metadata column has one or more missing values.
- Returns:
True if the metadata column has one or more missing values (np.nan), False otherwise.
- Return type:
bool
- drop_missing_values()
Filter out missing values from the metadata column.
- Returns:
Metadata column with missing values removed.
- Return type:
MetadataColumn
- get_ids(where_values_missing=False)
Retrieve IDs matching search criteria.
- Parameters:
where_values_missing (bool, optional) – If True, only return IDs that are associated with missing values (np.nan). If False (the default), return all IDs in the metadata column.
- Returns:
IDs matching search criteria.
- Return type:
set[str]
- filter_ids(ids_to_keep)
Filter metadata column by IDs.
- Parameters:
ids_to_keep (iterable of str) – IDs that should be retained in the filtered MetadataColumn object. If any IDs in ids_to_keep are not contained in this MetadataColumn object, a ValueError will be raised. The filtered MetadataColumn object will retain the same relative ordering of IDs in this MetadataColumn object. Thus, the ordering of IDs in ids_to_keep does not determine the ordering of IDs in the filtered MetadataColumn object.
- Returns:
The metadata column filtered by IDs.
- Return type:
MetadataColumn