2.4. khipu model

Comprehensive construction of empCpds via generic tree structures (khipu). Each khipu = empCpd[“MS1_pseudo_Spectra”]. This module contains two classes: Weavor is constructed once and provides the main algorithms and functions; Khipu mainly holds the data structures and is built as one instance per empirical compound. Thus one experiment or feature table may yield on the order of 10**3 khipus.

class khipu.model.Khipu(subnetwork)[source]

2-tier tree representation of an empirical compound. An empTree follows the matrix notion of a family of peaks: a tree of 2 levels, isotopes and adducts. The true root of a khipu is the compound in neutral formula. The pseudo root of a khipu is the ion of lowest m/z. The representative adduct should be the most abundant one. The matrix notion of a khipu is a DataFrame: columns are adduct labels; rows are isotopes.

trunk : default for adducts
branch : default for isotopes
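As a rough illustration of the matrix notion, the sketch below lays out a made-up khipu with isotopes as rows and adducts as columns. The feature IDs, the labels, and the helper `grid_to_rows` are hypothetical and use plain Python containers; khipu itself stores the grid as a pandas DataFrame.

```python
# Hypothetical labels and feature IDs, for illustration only.
adducts = ["M+H+", "M+Na+", "M+K+"]          # trunk: one column per adduct
isotopes = ["M0", "13C/12C", "13C/12C*2"]    # branches: one row per isotope

# Sparse assignment: (isotope, adduct) -> feature ID; unassigned cells stay empty.
assigned = {
    ("M0", "M+H+"): "F101",
    ("13C/12C", "M+H+"): "F102",
    ("M0", "M+Na+"): "F230",
}

def grid_to_rows(isotopes, adducts, assigned):
    """Return the grid as a list of rows (one per isotope), '' for empty cells."""
    return [[assigned.get((iso, add), "") for add in adducts] for iso in isotopes]

rows = grid_to_rows(isotopes, adducts, assigned)
for iso, row in zip(isotopes, rows):
    print(f"{iso:<12}" + "".join(f"{cell:<8}" for cell in row))
```

Each feature occupies at most one cell, which is the uniqueness constraint that the grid is meant to enforce.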

branch_abstraction(isotopic_edges, adduct_edges)[source]

Abstract a group of connected isotopic features into a branch. Reduce the input network to a set of abstracted_adduct_edges (hence a new tree) of B-nodes.

Returns:

abstracted_adduct_edges – list of nonredundant directed edges with data tag
branch_dict – dictionary, branch ID to member features/nodes.

Note

Membership of B-nodes is returned as branch_dict, which is needed to realign to the khipu grid. Without the branch constraint, the grid realignment is error prone. This method does not check whether abstracted_adduct_edges are fully connected.
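The idea of collapsing isotopic groups into B-nodes can be sketched with a small union-find. This is not the library's implementation: the function name `abstract_branches`, the `B0`, `B1`, … naming, and the edge tuple shapes are assumptions made for illustration.

```python
def abstract_branches(isotopic_edges, adduct_edges):
    """Collapse isotopically connected features into B-nodes (illustrative only).

    isotopic_edges: [(u, v), ...]; adduct_edges: [(u, v, tag), ...].
    Returns (abstracted_adduct_edges, branch_dict).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in isotopic_edges:
        parent[find(u)] = find(v)

    # Group members by their union-find representative.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)

    branch_dict, node_to_branch = {}, {}
    for i, members in enumerate(sorted(groups.values(), key=min)):
        bid = f"B{i}"
        branch_dict[bid] = sorted(members)
        for m in members:
            node_to_branch[m] = bid

    # Rewrite adduct edges onto B-nodes; drop duplicates and self-loops.
    abstracted, seen = [], set()
    for u, v, tag in adduct_edges:
        edge = (node_to_branch.get(u, u), node_to_branch.get(v, v), tag)
        if edge[0] != edge[1] and edge not in seen:
            seen.add(edge)
            abstracted.append(edge)
    return abstracted, branch_dict


iso_edges = [("F1", "F2"), ("F2", "F3"), ("F4", "F5")]
add_edges = [("F1", "F4", "Na/H"), ("F2", "F5", "Na/H"), ("F1", "F6", "HCl")]
edges, branches = abstract_branches(iso_edges, add_edges)
print(edges)
print(branches)
```

Features without isotopic partners (F6 here) stay as plain nodes, matching the statement that a node can be either a feature or a branch.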

build_khipu(WeavorInstance, mz_tolerance_ppm=5, check_fitness=False)[source]

Convert a network of two types of edges, isotopic and adduct, to khipu instances of unique nodes. Use the grid to enforce a unique ion in each position, as the initial network can contain erroneously redundant leaves/nodes.

WeavorInstance : Weavor instance to solve the grid.
check_fitness : if True, check if replacing with any redundant nodes improves fitness. This is a placeholder, as a good fitness function is not easy to implement and it slows down computing. Future use can be: if check_fitness: self.select_by_fitness()

Updates:

self.khipu_grid (pd.DataFrame)
self.neutral_mass (inferred neutral mass for the khipu compound)

clean(WeavorInstance, mz_tolerance_ppm)[source]

Clean up the input subnetwork, only using unique features to build a khipu frame. Redundant features are kept aside. The leftover features will be sent off to build new khipus.

Updates:

self.feature_dict
self.mzstr_dict
self.median_rtime
self.nodes_to_use
self.redundant_nodes
self.sorted_mz_peak_ids
self.root # temporary

Note

mzstr_dict can be problematic in data that were not processed well, because even a minor shift can confuse ion relations. clean() may cut some initial edges.

down_size()[source]: If input_network is too big, down size by selected the features of highest abundance. The extra nodes are not kept, because they can be picked up by extended search later if they fit this Khipu.

export_json()[source]: Placeholder.

extended_search(mztree, adduct_search_patterns_extended, mz_tolerance_ppm=5, rt_tolerance=2)[source]

Find additional adducts from unassigned_peaks using adduct_search_patterns_extended. mztree here is the indexed unassigned_peaks.

Updates:

self.feature_map
self.khipu_grid

Returns:

added_peaks

Return type:

list of peak ids

Note

annotation_dict may have fewer nodes than nodes_to_use, but is preferred here for cleaner results.

format_to_epds(id='')[source]: Format khipu to empirical compound, with added ion notions. A small number of features do not map to the khipu grid (e.g. their edges violate DAG rules). They are kept in empCpd as “undetermined”.

get_feature_dict(peak_dict, mz_tolerance_ppm, has_parent_masstrack)[source]

Index all input features; establish a str identifier for features of the same or close m/z values. Based on asari mass track IDs; keeps unique m/z only. This is more efficient to use, since feature_dict is much smaller than peak_dict.

Parameters:

peak_dict – dict of peaks/features indexed by IDs. Must have fields ‘id’, ‘mz’, ‘rtime’, ‘representative_intensity’.
mz_tolerance_ppm – ppm tolerance in examining m/z groups.

Returns:

feature_dict – feature_dict with ‘parent_masstrack_id’.
mzstr_dict – dict indexed by parent_masstrack_id, e.g. {‘trackx’: [f1, f2], …}
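When no mass track IDs are available, grouping features by close m/z could look roughly like the sketch below. It is an assumption about the general approach, not khipu's code: the function `group_by_mz`, the `mz_…` key format, and the anchoring of each group on its lowest m/z are all hypothetical.

```python
def group_by_mz(peak_dict, mz_tolerance_ppm=5):
    """Give a shared string key to features whose m/z lie within a ppm tolerance."""
    ordered = sorted(peak_dict.values(), key=lambda p: p["mz"])
    mzstr_dict, group_key, anchor_mz = {}, None, None
    for peak in ordered:
        # ppm difference relative to the first (lowest) m/z of the current group
        if anchor_mz is None or (peak["mz"] - anchor_mz) / anchor_mz * 1e6 > mz_tolerance_ppm:
            anchor_mz = peak["mz"]
            group_key = f"mz_{anchor_mz:.4f}"
            mzstr_dict[group_key] = []
        mzstr_dict[group_key].append(peak["id"])
    return mzstr_dict


peak_dict = {
    "F1": {"id": "F1", "mz": 180.0634, "rtime": 30.1, "representative_intensity": 1e6},
    "F2": {"id": "F2", "mz": 180.0636, "rtime": 55.4, "representative_intensity": 2e5},
    "F3": {"id": "F3", "mz": 181.0668, "rtime": 30.2, "representative_intensity": 6e4},
}
mzstr_dict = group_by_mz(peak_dict)
print(mzstr_dict)
```

F1 and F2 differ by about 1.1 ppm and share a key, while F3 starts a new group. This also shows why a small m/z shift in poorly processed data can split or merge groups unexpectedly.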

get_khipu_intensities()[source]: Return abundance_matrix as DataFrame in same layout as self.khipu_grid

get_khipu_mzgrid_print()[source]: Return str m/z matrix as DataFrame in same layout as self.khipu_grid, for visual purpose.

get_pruned_network()[source]

Get extra features and edges that do not fit in this khipu.

Updates:

self.redundant_nodes
self.pruned_network
self.nodes_to_use

plot(savepdf=''): Plot the khipu grid as a diagram. Uses Matplotlib as the default engine.

plot_khipu_diagram(savepdf='')[source]: Plot the khipu grid as a diagram. Uses Matplotlib as the default engine.

print(): Print khipu using adducts as trunk and isotopes as branches (preferred)

print2(): Print khipu using isotopes as trunk and adducts as branches

print3(): Return str m/z matrix as DataFrame in same layout as self.khipu_grid, for visual purpose.

print_khipu()[source]: Print khipu using adducts as trunk and isotopes as branches (preferred)

print_khipu_rotated()[source]: Print khipu using isotopes as trunk and adducts as branches

sort_nodes_by_mz()[source]: sort the nodes by increasing m/z

class khipu.model.Weavor(peak_dict, isotope_search_patterns, adduct_search_patterns, mz_tolerance_ppm=5, mode='pos', charge=1, parent_masstrack=False)[source]

For each experiment, this class sets up a grid of isotopes and adducts, and provides functions to solve for matched features and the inferred neutral mass.

build_branch_only_grid(sorted_mz_peak_ids)[source]: When there’s only a single adduct, this builds a branch for adduct_index[0]. isotope_index is fixed based on the initial isotope_search_patterns. Returns neutral_mass, grid.

build_full_grid(abstracted_adduct_edges, branch_dict, nodes_to_use)[source]

Build a khipu grid, after the input network is cleaned to isotopic_edges and adduct_edges.

Get isotopic branches first, and treat each branch as a node. This converts (U, V) to (U, B). Done in Khipu.branch_abstraction().
Get optimal order of abstracted adduct_edges. Done in Weavor.trunk_solver(), by topology not involving m/z.
Generate grids starting from the smallest m/z in each branch. Get the grid of best feature map.
Use the optimal feature map to set grids.

Parameters:

abstracted_adduct_edges – list of nonredundant directed edges with data tag. A node here can be a feature or a branch, which is a list of isotopes.
branch_dict – dictionary, branch ID to member features/nodes.

Returns:

neutral_mass – inferred neutral mass by linear regression
grid – dataframe of features, coordinates as adduct/isotope
best_feature_map – optimal feature map as dict, {feature_id: (isotope_index, adduct_index), …}

Note

We don’t know the real root to start with, and can’t assume the lowest m/z is M+H+ or M-H-. The root should be inferred from best overall pattern match. A pseudo-root is not necessarily M0, which may not be detected in perfect labeling experiments.

Enforce unique node per grid, by best rtime with future option of a fitness function (Khipu.clean()). Extra nodes go to a new khipu.

build_simple_pair_grid(e)[source]: A khipu of only two nodes (one edge) does not need to go through the full grid process, and is formatted here. The full grid process would work for these simple cases, but would be less efficient. neutral_mass is taken as the average of the two fitted values.

build_trunk_only_grid(adduct_edges)[source]: When there are no isotopes, only the trunk is needed to describe adducts. Returns neutral_mass, grid.

fitness()[source]: Fitness function of the khipu. More of a placeholder for now. Unique assignment of a feature to a khipu can be based on a) closest retention time, and b) similar abundance patterns among adducts. Option a) depends on how well the samples were analyzed and the data were preprocessed. Option b) is not reliable, as a pair of ions from another compound can still get good correlation by disrupting the adduct patterns together. Defaulting to a) is good enough for now.

make_grid()[source]

Create a grid of m/z values as DataFrame. adduct_pattern is computed using neutral mass as 0 offset. The orders of isotope_search_patterns and adduct_search_patterns are kept in grid, to enable easy mapping of edges to the grid.

Updates:

self.isotope_index
self.isotope_dict
self.adduct_pattern
self.adduct_dict
self._size, _M, _N
self.mzgrid (pd.DataFrame with isotopes as rows and adducts as cols)
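The grid construction can be sketched as a sum of an isotope shift and an adduct offset per cell, with the neutral mass at 0. This assumes singly charged ions; the pattern lists, their numeric values, and the helper `make_mz_grid` are illustrative and not taken from khipu's search patterns, and nested lists stand in for the DataFrame.

```python
# Illustrative patterns (assumed values, charge = 1): mass shift in Da and label.
isotope_shifts = [("M0", 0.0), ("13C/12C", 1.003355), ("13C/12C*2", 2.006710)]
adduct_offsets = [("M+H+", 1.007276), ("M+Na+", 22.989218)]

def make_mz_grid(isotope_shifts, adduct_offsets):
    """Rows follow the isotope order, columns the adduct order; neutral mass = 0."""
    return [[round(iso + add, 6) for _, add in adduct_offsets]
            for _, iso in isotope_shifts]

mzgrid = make_mz_grid(isotope_shifts, adduct_offsets)
for (label, _), row in zip(isotope_shifts, mzgrid):
    print(f"{label:<12}", row)
```

Because row and column order mirror the input patterns, an edge tagged with an isotope or adduct relation maps directly to a step along a row or a column.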

regress_neutral_mass(feature_map)[source]: Get neutral mass by regression on mapped features. feature_map : {feature_id: (isotope_index, adduct_index)} Returns inferred neutral mass.
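The documentation only says the neutral mass is obtained by regression. As a hedged sketch, the code below assumes singly charged ions and a unit-slope least-squares fit of observed m/z against grid offsets, which reduces to averaging the per-feature residuals. The function `estimate_neutral_mass`, the `feature_mz` lookup, and the grid values are hypothetical.

```python
from statistics import mean

# Assumed zero-offset grid (rows: M0, 13C/12C; columns: M+H+, M+Na+).
mzgrid = [
    [1.007276, 22.989218],
    [2.010631, 23.992573],
]

def estimate_neutral_mass(feature_map, feature_mz, mzgrid):
    """Unit-slope least squares: intercept = mean(observed m/z - grid offset)."""
    estimates = [feature_mz[fid] - mzgrid[i][j] for fid, (i, j) in feature_map.items()]
    return mean(estimates)


# feature_map follows the documented shape {feature_id: (isotope_index, adduct_index)}.
feature_map = {"F1": (0, 0), "F2": (1, 0), "F3": (0, 1)}
feature_mz = {"F1": 181.070664, "F2": 182.074019, "F3": 203.052606}
neutral_mass = estimate_neutral_mass(feature_map, feature_mz, mzgrid)
print(round(neutral_mass, 4))
```

Using all mapped features, rather than a single ion, averages out small per-peak m/z errors.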

score_graph_on_grid(root_corrected_mz_features, mz_error=0.01)[source]: Check how many values in root_corrected_mzs match to self.mzgrid; count = score. returns score, feature_map
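A simplified sketch of grid scoring follows: each root-corrected m/z is matched to its closest grid cell and counted if it falls within mz_error. The (mz, feature_id) tuple format, the function name `score_on_grid`, and the grid values are assumptions; the real method may resolve ties or repeated cells differently.

```python
# Assumed zero-offset grid (rows: isotopes, columns: adducts).
mzgrid = [
    [1.007276, 22.989218],
    [2.010631, 23.992573],
]

def score_on_grid(root_corrected_mz_features, mzgrid, mz_error=0.01):
    """Count features whose corrected m/z matches a grid cell within mz_error."""
    score, feature_map = 0, {}
    for mz, fid in root_corrected_mz_features:
        # Closest cell as (absolute difference, (isotope_index, adduct_index)).
        diff, cell = min(
            (abs(mz - value), (i, j))
            for i, row in enumerate(mzgrid)
            for j, value in enumerate(row)
        )
        if diff <= mz_error:
            feature_map[fid] = cell
            score += 1
    return score, feature_map


candidates = [(1.0073, "F1"), (2.0107, "F2"), (22.9893, "F3"), (10.5, "F4")]
score, feature_map = score_on_grid(candidates, mzgrid)
print(score, feature_map)
```

F4 matches no cell and is left out of the map, so the score equals the number of mapped features.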

score_graph_on_trunk(root, adduct_edges, weights)[source]: Use a weighted algorithm on the reference adduct tree to calculate the match score. Returns score, root, selected_edges.

select_by_fitness()[source]: After a khipu frame is built, among redundant features, select the ones of best fitness score for the khipu.

trunk_solver(adduct_edges, branch_dict={})[source]

Find best solution to fit a set of edges on the trunk. This uses score_graph_on_trunk to score matched graph, which can be abstracted_adduct_edges or regular adduct edges.

adduct_edges : m/z-ordered edges tagged with the ion relationship, [(n1, n2, relation), …].
branch_dict : when abstracted branches are used, this is the dict of memberships.

Returns the selected best root and subset of edges.