pygot.tools.traj.determine_source_state

pygot.tools.traj.determine_source_state(adata, embedding_key, graph_dist=True, n_neighbors=30, split_m=30, kernel='dpt', n_comps=15, down_sampling=True, n_obs=3000, cytotrace=True, alpha=0.1, smooth_k=5, connect_anchor=False)

Determine the source cell for snapshot data.

In most developmental scenarios, source cells differentiate into multiple distinct cell types.

Taking cell \(r\) as the start cell, the pseudotime \(\hat{t}(x_i)\) of every cell can be computed, and the empirical distribution can be divided into \(m\) portions \(X_1, X_2, \ldots, X_m\) ordered by \(\hat{t}(x_i)\). The transport cost of this time-varying distribution \(p_t(x|r)\) can then be quantified by optimal transport with a graph-based metric:

\[W_2^2(r)=\sum_{i=1}^{m-1}\inf_{\pi}\sum_{x \in X_i}\sum_{y \in X_{i+1}}c(x,y | G)\pi(x,y)\]

where \(c(x,y|G)\) is the shortest-path distance between two cells \(x, y\) in graph \(G\), and the infimum is taken over transport plans \(\pi\) between \(X_i\) and \(X_{i+1}\). According to the energy-saving hypothesis, the true source cell has the smallest transport cost, that is,

\[r^* = \arg\min_{r} W_2^2(r)\]
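The cost above can be illustrated with a minimal sketch. This is not pygot's implementation: the helper names `knn_graph_distances`, `ot_cost`, and `transport_cost` are hypothetical, the pseudotime is approximated by graph distance from the candidate root (a simple stand-in for DPT or Palantir), and each portion is assumed to carry uniform weights.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist


def knn_graph_distances(X, n_neighbors):
    """Shortest-path distances on an undirected kNN graph built from X."""
    d = cdist(X, X)
    n = d.shape[0]
    graph = np.zeros((n, n))  # zero entries are treated as "no edge"
    # Column 0 of the argsort is the cell itself (assuming no duplicate points).
    nearest = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        graph[i, nearest[i]] = d[i, nearest[i]]
    return shortest_path(graph, method="D", directed=False)


def ot_cost(C):
    """Exact OT cost between two uniform distributions with cost matrix C."""
    n, m = C.shape
    # The plan is flattened row-major: rows sum to 1/n, columns sum to 1/m.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    A_eq = np.vstack([row_sums, col_sums])
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun


def transport_cost(dist, root, split_m):
    """Sum of OT costs between consecutive pseudotime portions for one root."""
    # Assumption: graph distance from the root stands in for the pseudotime.
    pseudotime = dist[root]
    order = np.argsort(pseudotime, kind="stable")
    portions = np.array_split(order, split_m)
    return sum(
        ot_cost(dist[np.ix_(a, b)]) for a, b in zip(portions[:-1], portions[1:])
    )


# Toy data: six cells on a line, so graph distances equal |i - j|.
X = np.arange(6, dtype=float)[:, None]
dist = knn_graph_distances(X, n_neighbors=2)
costs = [transport_cost(dist, r, split_m=3) for r in range(len(X))]
best_root = int(np.argmin(costs))
```

Evaluating `transport_cost` for every candidate root and taking the `argmin` mirrors the definition of \(r^*\). The linear program is exact but scales poorly, so a dedicated OT solver would be a more practical choice on real data.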

Note

This assumption may fail in the case of linear progression, where the source cell develops in only one direction. In that case, the transport costs of the true source cell and the terminal cell will be very close. The function therefore detects linear progression and adds the cytotrace score with a very low weight (default 0.1) to choose the optimal source cell.

To accelerate the computation, we suggest down-sampling the dataset to 3000 cells (default) and computing the transport cost on the down-sampled data.

Arguments:

adata: AnnData: Annotated data matrix.
embedding_key: str: Name of latent space, in adata.obsm
graph_dist: bool (default: True): Use shortest-path distance (True) or Euclidean distance (False)
n_neighbors: int (default: 30): Number of neighbors in the kNN graph used to compute shortest-path distances
split_m: int (default: 30): Number of pseudotime portions \(m\). This number should NOT be too small
kernel: ‘dpt’ or ‘palantir’ or ‘euclidean’ (default: ‘dpt’): Pseudotime method, ‘dpt’ is recommended
n_comps: int (default: 15): Number of diffmap components, which is used for DPT computation
down_sampling: bool (default: True): Down-sample the dataset to accelerate computation
n_obs: int (default: 3000): Number of cells kept after down-sampling
cytotrace: bool (default: True): Use the cytotrace score to help. Note that cytotrace is implemented by CellRank 2
alpha: float (default: 0.1): Weight of the cytotrace score. We do NOT suggest increasing this weight
smooth_k: int (default: 5): Number of neighbors used to smooth the final score
time_key: str (default: None): Name of time label, in adata.obs, use if the model input contains time label

Returns:

ot_root (.uns) (int) – best source cell index using transport cost only
ot_ct_root (.uns) (int) – best source cell index using both transport cost and cytotrace
root_score (.obs) (np.ndarray) – source cell score (a higher score means a higher probability of being the source)
ot_root_score (.obs) (np.ndarray) – source cell score + alpha * cytotrace score (a higher score means a higher probability of being the source)
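The relationship between the two scores and the two root indices can be sketched with plain NumPy. The values below are made up for illustration, and the variable names simply echo the returned keys; how pygot normalizes or smooths the scores internally is not shown here.

```python
import numpy as np

alpha = 0.1  # weight of the cytotrace score (the documented default)

# Hypothetical scores for five cells along a linear progression: both ends
# look like plausible sources when only the transport cost is considered.
root_score = np.array([0.88, 0.20, 0.10, 0.20, 0.90])
cytotrace_score = np.array([1.0, 0.6, 0.5, 0.3, 0.0])

# Combined score as described above: source score + alpha * cytotrace score.
combined_score = root_score + alpha * cytotrace_score

ot_root = int(np.argmax(root_score))         # transport cost only
ot_ct_root = int(np.argmax(combined_score))  # transport cost and cytotrace
```

In this toy case the transport-only score narrowly prefers the last cell, while the small cytotrace term tips the choice toward the first cell, which is the tie-breaking behavior the note on linear progression describes.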
