API Reference

This page provides a comprehensive reference for all functions in the Mievformer package, organized by functionality.

Core Model Functions

Functions for training the Mievformer model and computing embeddings.

mievformer.optimize_nicheformer(adata, model_path, ngpu=1, batch_size=512, max_epochs=1000, neighbor_num=100, latent_dim=20, kld_ld=0.05, pent_ld=0.05, dist_space='latent', cellrep_key='X_pca', batch_key=None, batch_correct=False)

Optimize the Mievformer model using masked self-supervised learning.

Mievformer learns microenvironmental representations by encoding the cellular states and spatial configurations of neighboring cells using a Transformer-based architecture. It masks the central cell position and maximizes the likelihood that the observed central cell state would be generated from the inferred microenvironmental embedding.

The training objective corresponds to the InfoNCE loss, maximizing the mutual information between microenvironmental embeddings and their corresponding central cell states.
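As an illustration of the objective described above, the following is a minimal NumPy sketch of an InfoNCE loss over a square score matrix. It assumes in-batch negatives, with matched microenvironment and cell-state pairs on the diagonal; the function name info_nce_loss is hypothetical, and the package's actual loss may be constructed differently.

```python
import numpy as np


def info_nce_loss(scores):
    """InfoNCE over a square score matrix (illustrative sketch).

    scores[i, j] is the score between microenvironment i and cell state j;
    the diagonal holds the matched (positive) pairs, and all other entries
    in a row act as negatives (an assumption of this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    # Numerically stable row-wise log-softmax.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative mean log-probability assigned to the matched pairs.
    return -np.mean(np.diag(log_probs))
```

With uninformative (constant) scores the loss equals the log of the batch size, and it approaches zero as the matched pairs dominate their rows.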

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix containing spatial transcriptomics data. Must contain spatial coordinates in adata.obsm['spatial'] and cell representations (e.g., PCA) in adata.obsm[cellrep_key].

  • model_path (str) – Path to save the trained model checkpoint (.pt file).

  • ngpu (int, optional) – Number of GPUs to use for training. Default is 1.

  • batch_size (int, optional) – Batch size for training. Default is 512.

  • max_epochs (int, optional) – Maximum number of epochs for training. Default is 1000.

  • neighbor_num (int, optional) – Number of neighbors to consider for the microenvironmental context. Default is 100.

  • latent_dim (int, optional) – Dimensionality of the latent microenvironmental embedding. Default is 20.

  • kld_ld (float, optional) – Weight for the KL divergence loss term (if applicable). Default is 0.05.

  • pent_ld (float, optional) – Weight for the entropy regularization term. Default is 0.05.

  • dist_space (str, optional) – Space in which to compute distances ('latent' or other). Default is 'latent'.

  • cellrep_key (str, optional) – Key in adata.obsm containing the cell state representations (e.g., 'X_pca'). Default is 'X_pca'.

  • batch_key (str, optional) – Key in adata.obs indicating batch information for batch correction/splitting. Default is None.

  • batch_correct (bool, optional) – Whether to perform batch correction during training. Default is False.

Returns:

The input AnnData object updated with the following fields:
  • obsm['e']: Microenvironmental embeddings.

  • obs['leiden_e']: Leiden clusters of the microenvironmental embeddings.

Return type:

anndata.AnnData
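The input requirements on adata listed above can be checked before a long training run. The helper below is a small sketch, not part of the package: check_training_inputs is a hypothetical name, and the path "model.pt" in the commented call is a placeholder.

```python
from types import SimpleNamespace


def check_training_inputs(adata, cellrep_key="X_pca"):
    """Raise KeyError if a required adata.obsm entry is missing."""
    missing = [key for key in ("spatial", cellrep_key) if key not in adata.obsm]
    if missing:
        raise KeyError(f"adata.obsm is missing required keys: {missing}")
    return True


# Stand-in object with an obsm mapping, used here instead of a real AnnData.
demo = SimpleNamespace(obsm={"spatial": [[0.0, 0.0]], "X_pca": [[0.1, 0.2]]})
check_training_inputs(demo)

# With a real AnnData and the package installed, training would then be:
# adata = mievformer.optimize_nicheformer(adata, "model.pt", cellrep_key="X_pca")
```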

mievformer.calculate_wb_ez(adata, model_path, batch_key=None, neighbor_num=100, latent_dim=20, cellrep_key='X_pca')

Calculate the embeddings required for the score function and add them to the AnnData object.

The score function is defined as:

\[s_{\theta}(e_i, z_j) = w_e(e_i)^\top w_z(z_j) + b_z(z_j)\]

where \(w_e\) and \(w_z\) are neural networks mapping microenvironmental and cell-state embeddings to a shared hidden dimension, and \(b_z\) provides a cell-state-dependent bias.
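Given projected embeddings, the score function above reduces to a matrix product plus a per-column bias. The sketch below assumes the projections are available as plain arrays (for example from obsm['w_e'], obsm['w_z'], and obsm['b_z']) and that the bias holds one scalar per cell; score_matrix is a hypothetical helper name.

```python
import numpy as np


def score_matrix(w_e, w_z, b_z):
    """Return S with S[i, j] = w_e[i] . w_z[j] + b_z[j]."""
    w_e = np.asarray(w_e, dtype=float)              # (n_env, hidden)
    w_z = np.asarray(w_z, dtype=float)              # (n_state, hidden)
    b_z = np.asarray(b_z, dtype=float).reshape(-1)  # (n_state,), assumed scalar bias per cell
    # The bias depends only on the cell state, so it is broadcast across rows.
    return w_e @ w_z.T + b_z[None, :]


S = score_matrix([[1, 0], [0, 1]], [[2, 3], [4, 5], [1, 1]], [0.5, -1.0, 0.0])
```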

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix.

  • model_path (str) – Path to the trained model checkpoint.

  • batch_key (str, optional) – Key in adata.obs indicating batch information. Default is None.

  • neighbor_num (int, optional) – Number of neighbors used during training. Default is 100.

  • latent_dim (int, optional) – Latent dimension of the model. Default is 20.

  • cellrep_key (str, optional) – Key in adata.obsm containing the cell state representations. Default is 'X_pca'.

Returns:

The input AnnData object updated with:
  • obsm['w_e']: Projected microenvironmental embeddings.

  • obsm['w_z']: Projected cell state embeddings.

  • obsm['b_z']: Cell state bias terms.

  • obsm['e']: Microenvironmental embeddings (if not already present).

Return type:

anndata.AnnData

Niche Density Ratio and Membership

Functions for computing the per-cell niche density ratio p(e|z)/p(e) and aggregating it into a per-cell soft membership over niche clusters.

mievformer.calculate_niche_density_ratio(adata, ref_num=1000, stratify_key='leiden_e', min_ratio=0.01, ref_adata=None)

Compute per-cell density ratios over a panel of reference niches.

For each cell \(i\) and reference niche \(j\) drawn by stratified sampling on stratify_key, the log density ratio is

\[\log r_{ij} = \log p(e_j \mid z_i) - \log p(e_j) = (w_z(z_i)^\top w_e(e_j) + b_z(z_i)) - \log \sum_{k \in \mathrm{ref}} \exp(w_z(z_k)^\top w_e(e_j) + b_z(z_k)).\]

The matrix is then softmax-normalized per cell over reference niches, so each row of adata.obsm['dist_e'] is a probability distribution over the sampled reference niches that emphasizes niches whose environment becomes more likely under the cell’s state than under the marginal.
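The two steps above (log density ratio, then per-cell softmax over reference niches) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the package's implementation: the function names are hypothetical, the inputs are plain arrays, and the reference cells are assumed to be already sampled.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))


def niche_density_ratio(w_z, b_z, w_e_ref, w_z_ref, b_z_ref):
    """Softmax-normalized density ratios of shape (n_cells, n_ref)."""
    # joint[i, j] = w_z(z_i) . w_e(e_j) + b_z(z_i)
    joint = w_z @ w_e_ref.T + b_z.reshape(-1, 1)
    # ref_scores[k, j] = w_z(z_k) . w_e(e_j) + b_z(z_k) for reference cells k
    ref_scores = w_z_ref @ w_e_ref.T + b_z_ref.reshape(-1, 1)
    # Log marginal term for each reference niche j: log-sum-exp over k.
    log_marginal = _logsumexp(ref_scores, axis=0)
    log_ratio = joint - log_marginal[None, :]
    # Per-cell softmax over reference niches.
    shifted = log_ratio - log_ratio.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)
```

Each row of the result sums to one, matching the description of adata.obsm['dist_e'] as a per-cell distribution over the sampled reference niches.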

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix containing w_e, w_z, and b_z in obsm (produced by calculate_wb_ez()).

  • ref_num (int, optional) – Number of reference niches to sample. Default is 1000.

  • stratify_key (str, optional) – Key in adata.obs to use for stratified sampling of reference niches. Default is 'leiden_e'.

  • min_ratio (float, optional) – Clusters with frequency below this fraction are dropped from stratified sampling. Default is 0.01.

  • ref_adata (anndata.AnnData, optional) – External reference. If None, a subset of adata is used.

Returns:

The input AnnData object updated with obsm['dist_e'] (softmax-normalized density ratios of shape (n_cells, ref_num)) and uns['dist_e']['ref_obs'] (obs names of the sampled reference niches). The dist_e key name is preserved for backward compatibility with existing h5ad artifacts.

Return type:

anndata.AnnData

mievformer.calculate_niche_cluster_membership(adata, cluster_key='leiden_e')

Aggregate per-cell density ratios into a soft membership over niche clusters.

Averages the columns of adata.obsm['dist_e'] within each value of adata.obs[cluster_key] (typically leiden_e niche clusters), yielding adata.obsm['dist_e_agg'] of shape (n_cells, n_niche_clusters): entry [i, c] is the mean density ratio \(p(e \mid z_i)/p(e)\) evaluated at reference cells in cluster c, interpretable as a soft assignment of cell i to niche cluster c.
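The column averaging described above can be illustrated with a short NumPy sketch. It assumes that each column of dist_e corresponds to one reference niche and that the cluster label of each reference niche is supplied as an array; aggregate_by_cluster is a hypothetical name.

```python
import numpy as np


def aggregate_by_cluster(dist_e, ref_labels):
    """Average the columns of dist_e within each reference-niche cluster.

    Returns the sorted unique cluster labels and a matrix of shape
    (n_cells, n_clusters) whose column order follows those labels.
    """
    dist_e = np.asarray(dist_e, dtype=float)
    ref_labels = np.asarray(ref_labels)
    clusters = np.unique(ref_labels)
    agg = np.column_stack(
        [dist_e[:, ref_labels == c].mean(axis=1) for c in clusters]
    )
    return clusters, agg


clusters, agg = aggregate_by_cluster(
    [[0.1, 0.3, 0.6], [0.5, 0.3, 0.2]], ["a", "b", "a"]
)
```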

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix containing obsm['dist_e'] (see calculate_niche_density_ratio()). If absent, it is computed with defaults.

  • cluster_key (str, optional) – Key in adata.obs containing niche cluster labels. Default is 'leiden_e'.

Returns:

The input AnnData object updated with obsm['dist_e_agg']: per-cell niche-cluster membership (columns are niche cluster labels). The dist_e_agg key name is preserved for backward compatibility with existing h5ad artifacts used by figure scripts.

Return type:

anndata.AnnData

Downstream Analysis

Functions for biological interpretation and visualization.

mievformer.estimate_population_density(adata, group, cluster_key, max_cell_num=1000)

Estimate the density (existence probability) of a specific cell population in each microenvironment.

By integrating \(p(z \mid e)\) over all cell states belonging to a given cell population, this function estimates the density of that population in microenvironment \(e\).
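One plausible reading of this integration is sketched below. It is an assumption, not the documented implementation: the conditional over cell states is approximated by a softmax of scores over a finite pool of candidate cell states, and the probability mass falling on members of the population is summed. The name population_density and all argument names are hypothetical.

```python
import numpy as np


def population_density(w_e, w_z_pool, b_z_pool, in_group):
    """Probability mass on a population for each microenvironment (sketch).

    w_e:       (n_env, hidden) projected microenvironment embeddings
    w_z_pool:  (n_pool, hidden) projected cell states of the candidate pool
    b_z_pool:  (n_pool,) cell-state bias terms of the candidate pool
    in_group:  (n_pool,) boolean mask marking pool cells in the population
    """
    scores = w_e @ w_z_pool.T + np.asarray(b_z_pool, dtype=float).reshape(1, -1)
    # Softmax over the pool approximates p(z | e) on the sampled cell states.
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    # Sum the mass assigned to cell states belonging to the population.
    return probs[:, np.asarray(in_group, dtype=bool)].sum(axis=1)
```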

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix.

  • group (str) – The label of the cell population (e.g., a specific cell type) to estimate density for.

  • cluster_key (str) – Key in adata.obs containing the cell type/cluster labels.

  • max_cell_num (int, optional) – Maximum number of cells to sample from the group for density estimation. Default is 1000.

Returns:

The input AnnData object updated with a new column in obs (e.g., {group}_density) representing the estimated density of the specified population for each cell’s microenvironment.

Return type:

anndata.AnnData

mievformer.analyze_density_correlation(adata, density_col, gene_list=None, file_path=None)

Analyze the correlation between estimated cell population density and gene expression.

This analysis helps identify gene expression signatures associated with colocalization with specific cell populations. For example, it can identify genes that are upregulated in tumor cells when they colocalize with endothelial cells.
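The kind of per-gene correlation involved can be sketched with NumPy. The correlation measure used by the package is not stated on this page, so Pearson correlation is an assumption here; the helper name is hypothetical and it returns a plain dict rather than a pandas.Series.

```python
import numpy as np


def density_gene_correlation(density, expression, gene_names):
    """Pearson correlation between a density vector and each gene column."""
    density = np.asarray(density, dtype=float)        # (n_cells,)
    expression = np.asarray(expression, dtype=float)  # (n_cells, n_genes)
    d = density - density.mean()
    x = expression - expression.mean(axis=0)
    denom = np.sqrt((d ** 2).sum() * (x ** 2).sum(axis=0))
    # Genes with constant expression have zero variance and yield NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        r = (d @ x) / denom
    return dict(zip(gene_names, r))


corr = density_gene_correlation(
    [1.0, 2.0, 3.0, 4.0],
    [[2.0, 4.0], [4.0, 3.0], [6.0, 2.0], [8.0, 1.0]],
    ["up", "down"],
)
```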

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix containing expression data and the density column.

  • density_col (str) – Name of the column in adata.obs containing the estimated density values.

  • gene_list (list of str, optional) – List of genes to include in the correlation analysis. If None, uses all genes in adata.var_names.

  • file_path (str, optional) – Path to save the visualization plot (bar plot of top/bottom correlated genes). If None, the plot is not saved.

Returns:

A Series containing the correlation coefficients for each gene, indexed by gene name.

Return type:

pandas.Series

mievformer.analyze_niche_membership(adata, n_clusters=15, file_path=None)

Cluster cells by their niche-cluster membership vectors and visualize the result.

Uses adata.obsm['dist_e_agg'] (per-cell soft membership over niche clusters produced by calculate_niche_cluster_membership()) as the feature space, performs Ward hierarchical clustering to partition cells into n_clusters groups, and draws a clustermap of the membership matrix with row-color annotations.
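The clustering step alone can be sketched with SciPy's hierarchical clustering routines. This is an illustration that assumes the membership matrix is a plain array and that the tree is cut into at most n_clusters groups; ward_cell_clusters is a hypothetical name, and the plotting step is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ward_cell_clusters(membership, n_clusters):
    """Ward hierarchical clustering of cells by their membership vectors."""
    features = np.asarray(membership, dtype=float)
    # Build the Ward linkage tree over cells (rows).
    tree = linkage(features, method="ward")
    # Cut the tree so that at most n_clusters flat clusters are formed.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


labels = ward_cell_clusters(
    [[0.9, 0.1], [0.88, 0.12], [0.1, 0.9], [0.12, 0.88]], n_clusters=2
)
```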

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix containing obsm['dist_e_agg'].

  • n_clusters (int, optional) – Number of cell clusters to form. Default is 15.

  • file_path (str, optional) – Path to save the resulting clustermap image. If None, the plot is not saved.

Returns:

The input AnnData with obs['niche_composition_cluster'] added (cell cluster labels). The niche_composition_cluster key name is preserved for backward compatibility with existing h5ad artifacts.

Return type:

anndata.AnnData
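The input requirements documented for each function imply a call order: training, then score embeddings, then density ratios, then cluster membership, then membership clustering. The wrapper below is a sketch of that order using the signatures listed on this page with their default arguments. The name run_mievformer_pipeline and the mv parameter (which allows substituting a stand-in module) are hypothetical.

```python
def run_mievformer_pipeline(adata, model_path, mv=None, n_clusters=15):
    """Run the documented steps in dependency order (illustrative sketch)."""
    if mv is None:
        # Imported lazily so the sketch can be read without the package installed.
        import mievformer as mv
    # 1. Train the model; adds obsm['e'] and obs['leiden_e'].
    adata = mv.optimize_nicheformer(adata, model_path)
    # 2. Compute w_e, w_z, and b_z from the trained checkpoint.
    adata = mv.calculate_wb_ez(adata, model_path)
    # 3. Per-cell density ratios over sampled reference niches (obsm['dist_e']).
    adata = mv.calculate_niche_density_ratio(adata)
    # 4. Aggregate into niche-cluster membership (obsm['dist_e_agg']).
    adata = mv.calculate_niche_cluster_membership(adata)
    # 5. Cluster cells by their membership vectors.
    adata = mv.analyze_niche_membership(adata, n_clusters=n_clusters)
    return adata
```

If non-default values of neighbor_num, latent_dim, or cellrep_key are used for training, the same values would presumably need to be passed to calculate_wb_ez, since that function describes them as the settings used during training.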