API Reference
This page provides a comprehensive reference for all functions in the Mievformer package, organized by functionality.
Core Model Functions
Functions for training the Mievformer model and computing embeddings.
- mievformer.optimize_nicheformer(adata, model_path, ngpu=1, batch_size=512, max_epochs=1000, neighbor_num=100, latent_dim=20, kld_ld=0.05, pent_ld=0.05, dist_space='latent', cellrep_key='X_pca', batch_key=None, batch_correct=False)
Optimize the NicheFormer model using masked self-supervised learning.
Mievformer learns microenvironmental representations by encoding the cellular states and spatial configurations of neighboring cells using a Transformer-based architecture. It masks the central cell position and maximizes the likelihood that the observed central cell state would be generated from the inferred microenvironmental embedding.
The training objective corresponds to the InfoNCE loss, maximizing the mutual information between microenvironmental embeddings and their corresponding central cell states.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix containing spatial transcriptomics data. Must contain spatial coordinates in adata.obsm[‘spatial’] and cell representations (e.g., PCA) in adata.obsm[cellrep_key].
model_path (str) – Path to save the trained model checkpoint (.pt file).
ngpu (int, optional) – Number of GPUs to use for training. Default is 1.
batch_size (int, optional) – Batch size for training. Default is 512.
max_epochs (int, optional) – Maximum number of epochs for training. Default is 1000.
neighbor_num (int, optional) – Number of neighbors to consider for the microenvironmental context. Default is 100.
latent_dim (int, optional) – Dimensionality of the latent microenvironmental embedding. Default is 20.
kld_ld (float, optional) – Weight for the KL divergence loss term (if applicable). Default is 0.05.
pent_ld (float, optional) – Weight for the entropy regularization term. Default is 0.05.
dist_space (str, optional) – Space in which to compute distances (‘latent’ or other). Default is ‘latent’.
cellrep_key (str, optional) – Key in adata.obsm containing the cell state representations (e.g., ‘X_pca’). Default is ‘X_pca’.
batch_key (str, optional) – Key in adata.obs indicating batch information for batch correction/splitting. Default is None.
batch_correct (bool, optional) – Whether to perform batch correction during training. Default is False.
- Returns:
The input AnnData object updated with the following fields:
  - obsm[‘e’]: Microenvironmental embeddings.
  - obs[‘leiden_e’]: Leiden clusters of the microenvironmental embeddings.
- Return type:
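The InfoNCE objective named in the description of optimize_nicheformer can be illustrated with a minimal NumPy sketch. This is not the package's training code. It assumes a square score matrix whose entry [i, j] scores microenvironment i against the cell state of cell j, so the diagonal holds the observed pairs; the name info_nce_loss is hypothetical.

```python
import numpy as np


def info_nce_loss(scores):
    """InfoNCE loss for a square score matrix (hypothetical helper).

    scores[i, j] is the score of microenvironment i paired with the cell
    state of cell j; the diagonal entries are the observed pairs.
    """
    scores = np.asarray(scores, dtype=float)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative mean log-probability assigned to the observed pairs.
    return float(-np.mean(np.diag(log_softmax)))


uninformative = info_nce_loss(np.zeros((4, 4)))  # all pairs score the same
informative = info_nce_loss(10.0 * np.eye(4))    # observed pairs score highest
```

With identical scores the loss equals the log of the number of candidates, and it approaches zero as the observed pairs come to dominate each row.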
- mievformer.calculate_wb_ez(adata, model_path, batch_key=None, neighbor_num=100, latent_dim=20, cellrep_key='X_pca')
Calculate the embeddings required for the score function and add them to the AnnData object.
The score function is defined as:
\[s_{\theta}(e_i, z_j) = w_e(e_i)^\top w_z(z_j) + b_z(z_j)\]
where \(w_e\) and \(w_z\) are neural networks mapping microenvironmental and cell-state embeddings to a shared hidden dimension, and \(b_z\) provides a cell-state-dependent bias.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix.
model_path (str) – Path to the trained model checkpoint.
batch_key (str, optional) – Key in adata.obs indicating batch information. Default is None.
neighbor_num (int, optional) – Number of neighbors used during training. Default is 100.
latent_dim (int, optional) – Latent dimension of the model. Default is 20.
cellrep_key (str, optional) – Key in adata.obsm containing the cell state representations. Default is ‘X_pca’.
- Returns:
The input AnnData object updated with:
  - obsm[‘w_e’]: Projected microenvironmental embeddings.
  - obsm[‘w_z’]: Projected cell state embeddings.
  - obsm[‘b_z’]: Cell state bias terms.
  - obsm[‘e’]: Microenvironmental embeddings (if not already present).
- Return type:
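As a rough illustration of how the three arrays written by calculate_wb_ez combine into the score function, the sketch below evaluates \(s_{\theta}(e_i, z_j)\) for all pairs at once. The random arrays are stand-ins for the obsm entries, the hidden dimension of 8 is arbitrary, the bias is assumed to be stored as one value per cell, and score_matrix is a hypothetical helper rather than part of the package.

```python
import numpy as np


def score_matrix(w_e, w_z, b_z):
    """Return S with S[i, j] = w_e[i] . w_z[j] + b_z[j] (hypothetical helper)."""
    w_e = np.asarray(w_e, dtype=float)
    w_z = np.asarray(w_z, dtype=float)
    # Flatten so both (n_cells,) and (n_cells, 1) bias layouts are accepted.
    b_z = np.asarray(b_z, dtype=float).reshape(-1)
    # The bias depends on the cell state j, so it is broadcast along columns.
    return w_e @ w_z.T + b_z[np.newaxis, :]


rng = np.random.default_rng(0)
w_e = rng.normal(size=(5, 8))  # stand-in for adata.obsm['w_e']
w_z = rng.normal(size=(5, 8))  # stand-in for adata.obsm['w_z']
b_z = rng.normal(size=(5, 1))  # stand-in for adata.obsm['b_z']
scores = score_matrix(w_e, w_z, b_z)  # shape (5, 5)
```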
Niche Density Ratio and Membership
Functions for computing the per-cell niche density ratio p(e|z)/p(e) and aggregating it into a per-cell soft membership over niche clusters.
- mievformer.calculate_niche_density_ratio(adata, ref_num=1000, stratify_key='leiden_e', min_ratio=0.01, ref_adata=None)
Compute per-cell density ratios over a panel of reference niches.
For each cell \(i\) and reference niche \(j\) drawn by stratified sampling on stratify_key, the log density ratio is
\[\log r_{ij} = \log p(e_j \mid z_i) - \log p(e_j) = (w_z(z_i)^\top w_e(e_j) + b_z(z_i)) - \log \sum_{k \in \mathrm{ref}} \exp(w_z(z_k)^\top w_e(e_j) + b_z(z_k)).\]
The matrix is then softmax-normalized per cell over reference niches, so each row of adata.obsm['dist_e'] is a probability distribution over the sampled reference niches that emphasizes niches whose environment becomes more likely under the cell’s state than under the marginal.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix containing w_e, w_z, and b_z in obsm (produced by calculate_wb_ez()).
ref_num (int, optional) – Number of reference niches to sample. Default is 1000.
stratify_key (str, optional) – Key in adata.obs to use for stratified sampling of reference niches. Default is ‘leiden_e’.
min_ratio (float, optional) – Clusters with frequency below this fraction are dropped from stratified sampling. Default is 0.01.
ref_adata (anndata.AnnData, optional) – External reference. If None, a subset of adata is used.
- Returns:
The input AnnData object, updated with obsm['dist_e'] (softmax-normalized density ratios of shape (n_cells, ref_num)) and uns['dist_e']['ref_obs'] (obs names of the sampled reference niches). The dist_e key name is preserved for backward compatibility with existing h5ad artifacts.
- Return type:
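The log density ratio and the per-cell softmax described for calculate_niche_density_ratio can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the package's implementation: the reference niches are simply the first few cells instead of a stratified sample, the arrays are random stand-ins for the obsm entries, and softmax_density_ratio is a hypothetical name.

```python
import numpy as np


def softmax_density_ratio(w_z, b_z, w_e_ref, w_z_ref, b_z_ref):
    """Row-normalized density ratios over reference niches (hypothetical helper)."""
    w_z = np.asarray(w_z, dtype=float)
    w_e_ref = np.asarray(w_e_ref, dtype=float)
    w_z_ref = np.asarray(w_z_ref, dtype=float)
    b_z = np.asarray(b_z, dtype=float).reshape(-1)
    b_z_ref = np.asarray(b_z_ref, dtype=float).reshape(-1)

    # First term: score of every cell state i against every reference niche j.
    cell_scores = w_z @ w_e_ref.T + b_z[:, np.newaxis]         # (n_cells, n_ref)
    # Second term: log-sum-exp over reference cell states k, one value per niche j.
    ref_scores = w_z_ref @ w_e_ref.T + b_z_ref[:, np.newaxis]  # (n_ref, n_ref)
    col_max = ref_scores.max(axis=0)
    log_marginal = col_max + np.log(np.exp(ref_scores - col_max).sum(axis=0))
    log_ratio = cell_scores - log_marginal[np.newaxis, :]

    # Softmax over reference niches, computed stably for each cell.
    log_ratio -= log_ratio.max(axis=1, keepdims=True)
    weights = np.exp(log_ratio)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_cells, n_ref, hidden = 6, 4, 3
w_e = rng.normal(size=(n_cells, hidden))
w_z = rng.normal(size=(n_cells, hidden))
b_z = rng.normal(size=n_cells)
ref_idx = np.arange(n_ref)  # assumption: fixed indices instead of stratified sampling
dist_e = softmax_density_ratio(w_z, b_z, w_e[ref_idx], w_z[ref_idx], b_z[ref_idx])
```

Each row of the result is positive and sums to one, matching the description of obsm['dist_e'] as a distribution over the sampled reference niches.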
- mievformer.calculate_niche_cluster_membership(adata, cluster_key='leiden_e')
Aggregate per-cell density ratios into a soft membership over niche clusters.
Averages the columns of adata.obsm['dist_e'] within each value of adata.obs[cluster_key] (typically leiden_e niche clusters), yielding adata.obsm['dist_e_agg'] of shape (n_cells, n_niche_clusters): entry [i, c] is the mean density ratio \(p(e \mid z_i)/p(e)\) evaluated at reference cells in cluster c, interpretable as a soft assignment of cell i to niche cluster c.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix containing obsm['dist_e'] (see calculate_niche_density_ratio()). If absent, it is computed with defaults.
cluster_key (str, optional) – Key in adata.obs containing niche cluster labels. Default is ‘leiden_e’.
- Returns:
The input AnnData object, updated with obsm['dist_e_agg']: per-cell niche-cluster membership (columns are niche cluster labels). The dist_e_agg key name is preserved for backward compatibility with existing h5ad artifacts used by figure scripts.
- Return type:
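The column averaging performed by calculate_niche_cluster_membership can be sketched on a toy matrix. The helper name aggregate_by_cluster and the small arrays are assumptions for illustration; in the package the cluster labels of the reference niches would come from adata.obs[cluster_key].

```python
import numpy as np


def aggregate_by_cluster(dist_e, ref_labels):
    """Average reference-niche columns within each cluster (hypothetical helper)."""
    dist_e = np.asarray(dist_e, dtype=float)
    ref_labels = np.asarray(ref_labels)
    clusters = np.unique(ref_labels)  # sorted unique cluster labels
    # One output column per cluster: the mean over that cluster's reference columns.
    membership = np.stack(
        [dist_e[:, ref_labels == c].mean(axis=1) for c in clusters], axis=1
    )
    return membership, clusters


# Two cells, four reference niches belonging to clusters "0" and "1".
dist_e = np.array([[0.1, 0.3, 0.2, 0.4],
                   [0.4, 0.4, 0.1, 0.1]])
ref_labels = np.array(["0", "0", "1", "1"])
membership, clusters = aggregate_by_cluster(dist_e, ref_labels)
```

Here the first cell leans toward cluster "1" and the second toward cluster "0", which is the soft assignment the aggregated matrix is meant to express.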
Downstream Analysis
Functions for biological interpretation and visualization.
- mievformer.estimate_population_density(adata, group, cluster_key, max_cell_num=1000)
Estimate the density (existence probability) of a specific cell population in each microenvironment.
By integrating \(P(z|e)\) over all cell states belonging to a specific cell population, this function obtains the density of that population in microenvironment \(e\).
- Parameters:
adata (anndata.AnnData) – Annotated data matrix.
group (str) – The label of the cell population (e.g., a specific cell type) to estimate density for.
cluster_key (str) – Key in adata.obs containing the cell type/cluster labels.
max_cell_num (int, optional) – Maximum number of cells to sample from the group for density estimation. Default is 1000.
- Returns:
The input AnnData object updated with a new column in obs (e.g., {group}_density) representing the estimated density of the specified population for each cell’s microenvironment.
- Return type:
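One way to read the integration of \(P(z|e)\) in estimate_population_density is as a sum of normalized scores over the cells of the chosen population. The sketch below follows that reading and is an assumption, not the package's code: it normalizes the scores with a softmax over a finite set of candidate cell states, and population_density is a hypothetical name.

```python
import numpy as np


def population_density(scores, in_group):
    """Sum P(z_j | e_i) over candidate cells j in the group (hypothetical helper).

    scores[i, j] is the score of microenvironment i for candidate cell state j;
    in_group[j] is True when candidate j belongs to the population of interest.
    """
    scores = np.asarray(scores, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    # Assumption: P(z_j | e_i) is a softmax of the scores over the candidates.
    shifted = scores - scores.max(axis=1, keepdims=True)
    p_z_given_e = np.exp(shifted)
    p_z_given_e /= p_z_given_e.sum(axis=1, keepdims=True)
    return p_z_given_e[:, in_group].sum(axis=1)


scores = np.array([[2.0, 0.0, 0.0],   # microenvironment favouring candidate 0
                   [0.0, 0.0, 0.0]])  # microenvironment with no preference
in_group = np.array([True, False, False])
density = population_density(scores, in_group)
```

The microenvironment with no preference assigns the single group member a density of one third, while the one that favours it assigns a larger value.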
- mievformer.analyze_density_correlation(adata, density_col, gene_list=None, file_path=None)
Analyze the correlation between estimated cell population density and gene expression.
This analysis helps identify gene expression signatures associated with colocalization with specific cell populations. For example, it can identify genes that are upregulated in tumor cells when they colocalize with endothelial cells.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix containing expression data and the density column.
density_col (str) – Name of the column in adata.obs containing the estimated density values.
gene_list (list of str, optional) – List of genes to include in the correlation analysis. If None, uses all genes in adata.var_names.
file_path (str, optional) – Path to save the visualization plot (bar plot of top/bottom correlated genes). If None, the plot is not saved.
- Returns:
A Series containing the correlation coefficients for each gene, indexed by gene name.
- Return type:
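The per-gene correlation reported by analyze_density_correlation can be illustrated with a small NumPy sketch. The documentation does not state which correlation coefficient is used, so Pearson correlation is an assumption here, as are the helper name density_gene_correlation and the toy data.

```python
import numpy as np


def density_gene_correlation(density, expression):
    """Pearson correlation of a density vector with each gene (hypothetical helper).

    density has shape (n_cells,), expression has shape (n_cells, n_genes).
    Genes with constant expression have zero variance and would yield NaN.
    """
    density = np.asarray(density, dtype=float)
    expression = np.asarray(expression, dtype=float)
    d = density - density.mean()
    x = expression - expression.mean(axis=0, keepdims=True)
    denom = np.sqrt((d ** 2).sum() * (x ** 2).sum(axis=0))
    return (d @ x) / denom


density = np.array([0.1, 0.2, 0.3, 0.4])
expression = np.array([[1.0, 8.0],   # gene 0 rises with density,
                       [2.0, 6.0],   # gene 1 falls with density
                       [3.0, 4.0],
                       [4.0, 2.0]])
corr = density_gene_correlation(density, expression)
```

Sorting such coefficients gives the top and bottom correlated genes that the bar plot is described as showing.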
- mievformer.analyze_niche_membership(adata, n_clusters=15, file_path=None)
Cluster cells by their niche-cluster membership vectors and visualize the result.
Uses adata.obsm['dist_e_agg'] (per-cell soft membership over niche clusters produced by calculate_niche_cluster_membership()) as the feature space, performs Ward hierarchical clustering to partition cells into n_clusters groups, and draws a clustermap of the membership matrix with row-color annotations.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix containing obsm['dist_e_agg'].
n_clusters (int, optional) – Number of cell clusters to form. Default is 15.
file_path (str, optional) – Path to save the resulting clustermap image. If None, the plot is not saved.
- Returns:
The input AnnData with obs['niche_composition_cluster'] added (cell cluster labels). The niche_composition_cluster key name is preserved for backward compatibility with existing h5ad artifacts.
- Return type:
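The Ward clustering step of analyze_niche_membership can be sketched with SciPy on a toy membership matrix. Whether the package itself calls SciPy is an assumption, the helper name ward_cluster is hypothetical, and the clustermap drawing is left out.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ward_cluster(membership, n_clusters):
    """Cut a Ward linkage tree into at most n_clusters groups (hypothetical helper)."""
    linkage_matrix = linkage(np.asarray(membership, dtype=float), method="ward")
    # 'maxclust' cuts the tree so that no more than n_clusters groups remain.
    return fcluster(linkage_matrix, t=n_clusters, criterion="maxclust")


# Four cells with soft membership over two niche clusters.
membership = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.1, 0.9],
                       [0.2, 0.8]])
labels = ward_cluster(membership, n_clusters=2)
```

Cells with similar membership vectors receive the same label, which corresponds to what obs['niche_composition_cluster'] is described as storing.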