pygot.tools.traj.determine_source_state

pygot.tools.traj.determine_source_state(adata, embedding_key, graph_dist=True, n_neighbors=30, split_m=30, kernel='dpt', n_comps=15, down_sampling=True, n_obs=3000, cytotrace=True, alpha=0.1, smooth_k=5, connect_anchor=False)

Determine the source cell for snapshot data.

In most developmental biological scenarios, source cells develop into multiple different cells.

By setting cell \(r\) as the start cell, the pseudotime \(\hat{t}(x_i)\) of each cell \(x_i\) can be computed, and the empirical distribution can be divided into \(m\) portions \(X_1, X_2, \ldots, X_m\) according to \(\hat{t}(x_i)\). The transport cost of this time-varying distribution \(p_t(x|r)\) can be quantified by optimal transport with a graph-based metric:

\[W_2^2(r)=\sum_{i=1}^{m-1}\inf_{\pi}\sum_{x \in X_i}\sum_{y \in X_{i+1}}c(x,y | G)\pi(x,y)\]

where \(c(x,y|G)\) is the shortest path distance between two cells \(x,y\) in graph \(G\). According to the energy-saving hypothesis, the transport cost defined above is smallest for the real source cell, that is,

\[r^* = \arg\min_{r} W_2^2(r)\]
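The cost above can be illustrated on a toy graph. The following is a minimal pure-Python sketch, not the pygot implementation. It rests on several assumptions: the hop count from \(r\) stands in for the pseudotime \(\hat{t}\), the portions are equal-sized with uniform mass (so an optimal plan between two portions is a permutation and can be found by brute force), and the names `bfs_dist` and `transport_cost` are hypothetical.

```python
from collections import deque
from itertools import permutations


def bfs_dist(adj, src):
    """Unweighted shortest path distances from src on the graph adj."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def transport_cost(adj, r, m):
    """Toy W(r): sum of OT costs between consecutive pseudotime portions."""
    nodes = sorted(adj)
    if len(nodes) % m:
        raise ValueError("this sketch needs the cell count to be divisible by m")
    cost = {u: bfs_dist(adj, u) for u in nodes}  # c(x, y | G)
    # Assumption: hop count from r plays the role of the pseudotime.
    order = sorted(nodes, key=lambda u: (cost[r][u], u))
    k = len(order) // m
    parts = [order[i * k:(i + 1) * k] for i in range(m)]
    total = 0.0
    for X, Y in zip(parts[:-1], parts[1:]):
        # Uniform mass 1/k per cell: search all one-to-one matchings.
        best = min(
            sum(cost[x][y] for x, y in zip(X, perm))
            for perm in permutations(Y)
        )
        total += best / k
    return total


# A six-cell chain 0-1-2-3-4-5, i.e. a purely linear progression.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
cost_first = transport_cost(chain, 0, m=3)
cost_last = transport_cost(chain, 5, m=3)
print(cost_first, cost_last)  # both are 4.0
```

The two ends of the chain receive the same cost, which is the ambiguity for linear progression described in the note below. Brute-force matching is only feasible for tiny portions; a real computation would use an optimal transport solver.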

Note

This assumption may fail in the case of linear progression, where the source cell develops in only one direction. In that case, the transport costs of the real source cell and the terminal cell will be very close. The function therefore detects linear progression and adds the cytotrace score with a very low weight (default 0.1) to choose the optimal source cell.

To accelerate the computation, we suggest downsampling the dataset to 3000 cells (default) and using the downsampled data to compute the transport cost.
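As a rough illustration of how a low-weight cytotrace term can break a tie between two candidates, here is a small pure-Python sketch. The weighted sum follows the description of `ot_root_score` (source cell score + alpha * cytotrace score). The smoothing step is only an assumption about what `smooth_k` might do (a mean over each cell and its k nearest neighbors), and the helper names `combine_scores` and `knn_smooth` are hypothetical.

```python
def combine_scores(root_score, cytotrace_score, alpha=0.1):
    """Weighted sum: source cell score + alpha * cytotrace score."""
    return [s + alpha * c for s, c in zip(root_score, cytotrace_score)]


def knn_smooth(scores, coords, k):
    """Assumed smoothing: mean score over each cell and its k nearest neighbors."""
    smoothed = []
    for p in coords:
        by_distance = sorted(
            range(len(coords)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, coords[j])),
        )
        neighbors = by_distance[:k + 1]  # the cell itself comes first (distance 0)
        smoothed.append(sum(scores[j] for j in neighbors) / len(neighbors))
    return smoothed


# Two candidate cells with identical transport-based scores (linear progression);
# the cytotrace values are made up for illustration.
combined = combine_scores([1.0, 1.0], [0.9, 0.1], alpha=0.1)
best = max(range(len(combined)), key=combined.__getitem__)
print(best)  # 0: the candidate with the higher cytotrace score wins the tie
```

Because alpha is small, the cytotrace term only matters when the transport-based scores are nearly equal, which matches the advice not to increase the weight.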

Arguments:

adata: AnnData

Annotated data matrix.

embedding_key: str

Name of latent space, in adata.obsm

graph_dist: bool (default: True)

Whether to use shortest path distance (True) or Euclidean distance (False)

n_neighbors: int (default: 30)

Number of neighbors in the kNN graph used to compute shortest path distances

split_m: int (default: 30)

Number of splits, i.e. the number \(m\) of portions the pseudotime-ordered cells are divided into. This number should NOT be too small

kernel: ‘dpt’ or ‘palantir’ or ‘euclidean’ (default: ‘dpt’)

Pseudotime method, ‘dpt’ is recommended

n_comps: int (default: 15)

Number of diffusion map components used for DPT computation

down_sampling: bool (default: True)

Whether to downsample the dataset to accelerate computation

n_obs: int (default: 3000)

Number of cells kept after downsampling

cytotrace: bool (default: True)

Whether to use the cytotrace score to help choose the source cell. Note that cytotrace is computed with the CellRank 2 implementation

alpha: float (default: 0.1)

Weight of the cytotrace score. We do NOT suggest increasing this weight

smooth_k: int (default: 5)

Number of neighbors used to smooth the final score

time_key: str (default: None)

Name of the time label in adata.obs; use it if the input data contains time labels

returns:
  • ot_root (.uns) (int) – best source cell index using transport cost only

  • ot_ct_root (.uns) (int) – best source cell index using both transport cost and cytotrace

  • root_score (.obs) (np.ndarray) – source cell score (a higher score means a higher probability of being the source)

  • ot_root_score (.obs) (np.ndarray) – source cell score + alpha * cytotrace score (a higher score means a higher probability of being the source)
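A possible way to call the function and read back its results is sketched below. It assumes that the outputs are written into `adata.uns` and `adata.obs` under the keys listed above, and that `ot_ct_root` is the relevant key when `cytotrace=True`. The wrapper name `pick_source` and the embedding key `"X_pca"` are hypothetical and only serve as an illustration.

```python
def pick_source(adata, embedding_key, use_cytotrace=True):
    """Run determine_source_state and return the selected source cell index.

    Hypothetical convenience wrapper; it assumes the results are stored
    in adata.uns under the keys documented in the return list.
    """
    # Imported lazily so this helper can be defined without pygot installed.
    from pygot.tools.traj import determine_source_state

    determine_source_state(
        adata,
        embedding_key,
        split_m=30,        # documented default; should not be too small
        down_sampling=True,
        n_obs=3000,        # documented default downsampling size
        cytotrace=use_cytotrace,
        alpha=0.1,         # documented default cytotrace weight
    )
    key = "ot_ct_root" if use_cytotrace else "ot_root"
    return adata.uns[key]


# Usage, given an AnnData object `adata` with an embedding in adata.obsm["X_pca"]:
# root_index = pick_source(adata, "X_pca")
# scores = adata.obs["ot_root_score"]
```

Keeping the documented defaults explicit in the call makes it easier to see which settings affect runtime (`n_obs`, `split_m`) and which affect the choice of source cell (`alpha`).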