pygot.tools.traj.determine_source_state
- pygot.tools.traj.determine_source_state(adata, embedding_key, graph_dist=True, n_neighbors=30, split_m=30, kernel='dpt', n_comps=15, down_sampling=True, n_obs=3000, cytotrace=True, alpha=0.1, smooth_k=5, connect_anchor=False)
Determine the source cell for snapshot data
In most developmental scenarios, source cells develop into multiple different cell types.
Setting a cell \(r\) as the start cell yields a pseudotime \(\hat{t}(x_i)\) for every cell, and the empirical distribution can then be divided by \(\hat{t}(x_i)\) into \(m\) portions \(X_1, X_2, \ldots, X_m\). The transport cost of this time-varying distribution \(p_t(x|r)\) can be quantified by optimal transport with a graph-based metric.
\[W_2^2(r)=\sum_{i=1}^{m-1}\inf_{\pi}\sum_{x \in X_i}\sum_{y \in X_{i+1}}c(x,y | G)\pi(x,y)\]
where \(c(x,y|G)\) is the shortest-path distance between two cells \(x,y\) in graph \(G\). According to the energy-saving hypothesis, the true source cell has the smallest transport cost, that is,
\[{r}^* = \arg\min_{r} W_2^2(r)\]
Note
This assumption may fail for a linear progression, where the source cell develops in only one direction. In that case, the transport costs of the true source cell and the terminal cell will be very close. The function therefore detects linear progression and adds a cytotrace score with a very low weight (default 0.1) to choose the optimal source cell.
To accelerate the computation, we suggest down-sampling the dataset to 3000 cells (default) and using the down-sampled data to compute the transport cost.
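The transport-cost criterion described above can be illustrated on toy data. The following is a minimal sketch under stated assumptions, not pygot's implementation: the helper names `knn_graph_distances` and `transport_cost` are hypothetical, graph distance from the candidate root stands in for the DPT pseudotime, and the bins are forced to equal size with uniform weights so that the optimal transport problem reduces to an assignment problem (solved here with SciPy's `linear_sum_assignment`). The Y-shaped point cloud is made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist


def knn_graph_distances(X, n_neighbors):
    """All-pairs shortest-path distances on a symmetrized kNN graph."""
    d = cdist(X, X)
    n = len(X)
    adj = np.zeros((n, n))  # 0 means "no edge" for dense csgraph input
    nearest = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbors)
    adj[rows, nearest.ravel()] = d[rows, nearest.ravel()]
    adj = np.maximum(adj, adj.T)  # keep an edge if either endpoint selected it
    return shortest_path(adj, method="D", directed=False)


def transport_cost(D, root, m):
    """Sum of OT costs between consecutive pseudotime bins for one candidate root.

    Assumption: pseudotime is the graph distance from `root` (a stand-in for DPT).
    Assumption: equal-size bins with uniform weights, so an optimal coupling
    is a one-to-one matching and its cost is the mean matched distance.
    """
    if D.shape[0] % m != 0:
        raise ValueError("this sketch needs the number of cells to be divisible by m")
    order = np.argsort(D[root], kind="stable")
    bins = np.array_split(order, m)
    total = 0.0
    for a, b in zip(bins[:-1], bins[1:]):
        C = D[np.ix_(a, b)]  # c(x, y | G) for x in X_i, y in X_{i+1}
        r, c = linear_sum_assignment(C)
        total += C[r, c].mean()
    return total


# Toy Y-shaped data: one stem that splits into two branches (120 points).
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 40)
t = np.linspace(0.0, 1.0, 41)[1:]  # skip the junction point itself
stem = np.column_stack([np.zeros(40), s])
left = np.column_stack([-0.7 * t, 1.0 + 0.7 * t])
right = np.column_stack([0.7 * t, 1.0 + 0.7 * t])
X = np.vstack([stem, left, right]) + rng.normal(scale=0.005, size=(120, 2))

D = knn_graph_distances(X, n_neighbors=5)
costs = np.array([transport_cost(D, r, m=6) for r in range(len(X))])
best_root = int(np.argmin(costs))  # r* = argmin_r W(r) under this sketch
print(best_root, costs[best_root])
```

The sketch evaluates every cell as a candidate root, which is only practical for small inputs; this is the cost that down-sampling is meant to reduce.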
Arguments:
- adata: AnnData
Annotated data matrix.
- embedding_key: str
Name of latent space, in adata.obsm
- graph_dist: bool (default: True)
Whether to use shortest-path distance (True) or Euclidean distance (False)
- n_neighbors: int (default: 30)
Number of neighbors in the kNN graph used to compute shortest-path distances
- split_m: int (default: 30)
Number of splits (the \(m\) portions above). This number should NOT be too small
- kernel: ‘dpt’ or ‘palantir’ or ‘euclidean’ (default: ‘dpt’)
Pseudotime method, ‘dpt’ is recommended
- n_comps: int (default: 15)
Number of diffusion map components used for DPT computation
- down_sampling: bool (default: True)
Whether to down-sample the dataset to accelerate computation
- n_obs: int (default: 3000)
Number of cells to keep after down-sampling
- cytotrace: bool (default: True)
Whether to use the cytotrace score as an auxiliary signal. Note that cytotrace is implemented by CellRank 2
- alpha: float (default: 0.1)
Weight of the cytotrace score. We do NOT suggest increasing it
- smooth_k: int (default: 5)
Number of neighbors used to smooth the final score
- time_key: str (default: None)
Name of the time label in adata.obs; use it if the model input contains a time label
- returns:
ot_root (.uns) (int) – index of the best source cell using transport cost only
ot_ct_root (.uns) (int) – index of the best source cell using both transport cost and cytotrace
root_score (.obs) (np.ndarray) – source cell score (a higher score means a higher probability of being the source)
ot_root_score (.obs) (np.ndarray) – source cell score + alpha * cytotrace score (a higher score means a higher probability of being the source)
- adata:
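A call might look like the sketch below. It is only an illustration built from the signature and the returned fields listed above: the wrapper `find_source_cell` is a hypothetical helper, `"X_pca"` is an assumed embedding key (any key present in adata.obsm would do), and it is an assumption that all four fields are present after a default run. Because it is not stated here whether the function updates `adata` in place or returns it, the sketch handles both cases.

```python
def find_source_cell(adata, embedding_key="X_pca"):
    """Run source-state detection and collect the documented outputs.

    Hypothetical convenience wrapper; "X_pca" is an assumed key in adata.obsm.
    """
    # Imported lazily so this snippet can be loaded without pygot installed.
    from pygot.tools.traj import determine_source_state

    result = determine_source_state(
        adata,
        embedding_key,
        n_neighbors=30,      # documented default
        split_m=30,          # documented default; should not be too small
        kernel="dpt",        # recommended pseudotime method
        down_sampling=True,
        n_obs=3000,
        cytotrace=True,
        alpha=0.1,           # keep the cytotrace weight low
    )
    # Assumption: results live either on the returned object or on the input.
    out = result if result is not None else adata
    return {
        "ot_root": out.uns["ot_root"],
        "ot_ct_root": out.uns["ot_ct_root"],
        "root_score": out.obs["root_score"],
        "ot_root_score": out.obs["ot_root_score"],
    }
```

With an AnnData object whose embedding is already stored in adata.obsm, `find_source_cell(adata)["ot_ct_root"]` would then give the index of the source cell chosen from transport cost and cytotrace together.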