qiime2.Metadata API

This page documents the qiime2.Metadata API, which is intended for QIIME 2 plugin developers and users of the QIIME 2 Python 3 API.

The qiime2.Metadata class

Metadata(dataframe, column_missing_schemes=None, default_missing_scheme='blank')

Store metadata associated with identifiers in a study.

Metadata is tabular in nature, mapping study identifiers (e.g. sample or feature IDs) to columns of metadata associated with each ID.

For more details about metadata in QIIME 2, including the TSV metadata file format, see the Metadata Tutorial at https://docs.qiime2.org.

The following text focuses on design and considerations when working with Metadata objects at the API level.

A Metadata object is composed of zero or more MetadataColumn objects. A Metadata object always contains at least one ID, regardless of the number of columns. Each column in the Metadata object has an associated column type representing either categorical or numeric data. Each metadata column is represented by an object corresponding to the column’s type: CategoricalMetadataColumn or NumericMetadataColumn, respectively.

A Metadata object is closely linked to its corresponding TSV metadata file format described at https://docs.qiime2.org. Therefore, certain requirements present in the file format are also enforced on the in-memory object in order to make serialized Metadata objects roundtrippable when loaded from disk again. For example, IDs cannot begin with a pound character (#) because those IDs would be interpreted as comment rows when written to disk as TSV. See the metadata file format spec for more details about data formatting requirements.
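The round-trip concern can be illustrated with a small standard-library sketch. It is not QIIME 2's reader or writer: `write_tsv` and `read_tsv` are hypothetical helpers, the `id` header and the sample data are made up, and the reader implements only a simplified version of the comment-row rule (skip any row whose first cell starts with `#`).

```python
import csv
import io


def write_tsv(rows):
    """Serialize (id, value) rows as TSV under a minimal 'id' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "body-site"])
    writer.writerows(rows)
    return buffer.getvalue()


def read_tsv(text):
    """Parse the TSV, treating rows whose first cell starts with '#' as comments.

    Simplified assumption: the real file format has more rules than this.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    return [tuple(row) for row in reader if row and not row[0].startswith("#")]


rows = [("sample-1", "gut"), ("#sample-2", "tongue")]
loaded = read_tsv(write_tsv(rows))
print(loaded)  # the row for '#sample-2' is read back as a comment and dropped
```

Because the second ID would be lost on reload, rejecting such IDs in memory keeps what is written and what is read back identical.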

In addition to being loaded from or saved to disk, a Metadata object can be constructed from a pandas.DataFrame object. See the Parameters section below for details on how to construct Metadata objects from dataframes.

Metadata objects have various methods to access, filter, and merge data. A dataframe can be retrieved from the Metadata object for further data manipulation using the pandas API. Individual MetadataColumn objects can be retrieved to gain access to APIs applicable to a single metadata column.

Missing values may be encoded in one of the following schemes:

‘blank’

The default, which treats None/NaN as the only valid missing values.

‘no-missing’

Indicates that there are no missing values in a column; any None/NaN values should be considered an error. If a scheme other than ‘blank’ is used by default, this scheme can be provided to preserve strings as categorical terms.

‘INSDC:missing’

The INSDC vocabulary for missing values. The current implementation supports only lower-case terms which match exactly: ‘not applicable’, ‘missing’, ‘not provided’, ‘not collected’, and ‘restricted access’.
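The three schemes can be compared with a pure-Python sketch. This is an illustration of the rules listed above, not QIIME 2's implementation: `normalize_missing` and `is_missing` are hypothetical helpers, and keeping existing None/NaN values as missing under ‘INSDC:missing’ is an assumption of the sketch.

```python
import math

# Terms listed in the text above; matching is exact and lower-case only.
INSDC_MISSING_TERMS = frozenset({
    "not applicable",
    "missing",
    "not provided",
    "not collected",
    "restricted access",
})


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def normalize_missing(values, scheme="blank"):
    """Return a copy of `values` with missing entries replaced by NaN."""
    if scheme == "blank":
        # Only None/NaN count as missing.
        return [math.nan if is_missing(v) else v for v in values]
    if scheme == "no-missing":
        # Any None/NaN is an error; every string is kept as a term.
        if any(is_missing(v) for v in values):
            raise ValueError("missing value found in a 'no-missing' column")
        return list(values)
    if scheme == "INSDC:missing":
        # Exact INSDC terms become NaN (assumption: None/NaN stay missing).
        return [
            math.nan if is_missing(v) or v in INSDC_MISSING_TERMS else v
            for v in values
        ]
    raise ValueError(f"unknown missing scheme: {scheme!r}")
```

Under this sketch, `normalize_missing(["gut", "not collected", "Not Collected"], "INSDC:missing")` replaces only the second entry, since the capitalized variant does not match exactly, while `normalize_missing(["missing", "gut"], "no-missing")` keeps the string "missing" as an ordinary categorical term.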

Parameters:
  • dataframe (pandas.DataFrame) – Dataframe containing metadata. The dataframe’s index defines the IDs, and the index name (Index.name) must match one of the required ID headers described in the metadata file format spec. Each column in the dataframe defines a metadata column, and the metadata column’s type (i.e. categorical or numeric) is determined based on the column’s dtype. If a column has dtype=object, it may contain strings or pandas missing values (e.g. np.nan, None). Columns matching this requirement are assumed to be categorical. If a column in the dataframe has dtype=float or dtype=int, it may contain floating point numbers or integers, as well as pandas missing values (e.g. np.nan). Columns matching this requirement are assumed to be numeric. Regardless of column type (categorical vs numeric), the dataframe stored within the Metadata object will have any missing values normalized to np.nan. Columns with dtype=int will be cast to dtype=float. To obtain a dataframe from the Metadata object containing these normalized data types and values, use Metadata.to_dataframe().

  • column_missing_schemes (dict, optional) – Specifies how missing values are handled in each metadata column of the dataframe. This is a dict mapping column names (str) to missing-value schemes (str). Valid values are ‘blank’, ‘no-missing’, and ‘INSDC:missing’. Column names may be omitted, in which case default_missing_scheme applies to those columns.

  • default_missing_scheme (str, optional) – The missing scheme to use when none has been provided in the file or in column_missing_schemes.
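As a minimal sketch of the dataframe rules above, the following builds a Metadata object with the default missing scheme. The function name `build_example_metadata`, the column names, and the sample data are made up for illustration; `id` is assumed to be one of the accepted ID headers. The imports sit inside the function only so that the definition can be loaded in an environment where QIIME 2 is not installed.

```python
def build_example_metadata():
    """Build a small Metadata object from a dataframe (illustrative data)."""
    # Deferred imports: pandas and QIIME 2 are needed only when called.
    import pandas as pd
    import qiime2

    df = pd.DataFrame(
        {
            # dtype=object with strings and None: treated as categorical.
            "body-site": pd.Series(["gut", "tongue", None], dtype=object),
            # dtype=float with NaN: treated as numeric.
            "ph": [6.5, float("nan"), 7.1],
            # dtype=int: treated as numeric and cast to float.
            "reads": [1200, 845, 990],
        }
    )
    # The index supplies the IDs; its name must be an accepted ID header.
    df.index = pd.Index(["s1", "s2", "s3"], name="id", dtype=object)
    return qiime2.Metadata(df)
```

With QIIME 2 installed, `build_example_metadata().to_dataframe()` returns the normalized dataframe described above, and `get_column("ph")` on the result retrieves the individual column object.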

Metadata columns

MetadataColumn(series, missing_scheme='blank')

Abstract base class representing a single metadata column.

Concrete subclasses represent specific metadata column types, e.g. CategoricalMetadataColumn and NumericMetadataColumn.

See the Metadata class docstring for details about Metadata and MetadataColumn objects, including a description of column types.

The main difference in constructing MetadataColumn vs Metadata objects is that MetadataColumn objects are constructed from a pandas.Series object instead of a pandas.DataFrame. Otherwise, the same restrictions, considerations, and data normalization are applied as with Metadata objects.

Parameters:
  • series (pandas.Series) – The series to construct a column from.

  • missing_scheme ("blank", "no-missing", "INSDC:missing") – How to interpret terms for missing values. Terms recognized by the scheme will be converted to NaN. The default, "blank", takes no action.

NumericMetadataColumn(series, missing_scheme='blank')

A single metadata column containing numeric data.

See the Metadata class docstring for details about Metadata and MetadataColumn objects, including a description of column types and supported data formats.

CategoricalMetadataColumn(series, missing_scheme='blank')

A single metadata column containing categorical data.

See the Metadata class docstring for details about Metadata and MetadataColumn objects, including a description of column types and supported data formats.

Exceptions

MetadataFileError(message, include_suffix=True)

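Exception raised for errors encountered in a metadata file. It subclasses Python's built-in Exception, the common base class for all non-exit exceptions.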