Complete Function Documentation

h2m.genome_loader(path)

Load the reference genome file.

Parameters:
  • path (str): path of the genome file.

Return:
  • reference genome records and the index list of chromosomes.

Example:
>>> records_h, index_list_h = h2m.genome_loader(path_h_ref)

h2m.anno_loader(path)

Load the GENCODE annotation file.

Parameters:
  • path (str): path of the annotation file.

Return:
  • a FeatureDB

Example:
>>> db_h = h2m.anno_loader(path_h_anno)

h2m.cbio_reader(path, keep=False)

Generate h2m input from cBioPortal data.

Parameters:
  • path (str): the path of the mutation data in txt format.

  • keep (bool): if True, keep all the original columns in the dataframe; if False, keep only the columns necessary for h2m. Defaults to False.

Output:

An input dataframe for h2m modeling.

Example:
>>> h2m.cbio_reader('.../data_mutations.txt', keep=False)

h2m.clinvar_reader(path, list_of_ids=None, keep=False)

Generate h2m input from ClinVar data.

Parameters:
  • path (str): the path of the ClinVar reference vcf.gz data.

  • list_of_ids (list): the list of variation ids.

  • keep (bool): if True, keep all the original columns in the dataframe; if False, keep only the columns necessary for h2m. Defaults to False.

Output:

An input dataframe for h2m modeling.

Example:
>>> filepath = '.../GrCh37_clinvar_20230923.vcf.gz'
>>> variation_ids = [925574, 925434, 926695, 925707, 325626, 1191613, 308061, 361149, 1205375, 208043]
>>> df = h2m.clinvar_reader(filepath, variation_ids)
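
The list_of_ids argument narrows a ClinVar release down to the variants of interest. As a rough illustration of that kind of filtering (not h2m's own implementation), the sketch below scans a gzipped VCF stream with the standard library and keeps the records whose ID column matches. It assumes the ID column of the ClinVar VCF carries the variation id; filter_vcf_by_id and the two records are hypothetical placeholders.

```python
import gzip
import io

def filter_vcf_by_id(handle, wanted_ids):
    """Yield (chrom, pos, var_id, ref, alt) for records whose ID is in wanted_ids."""
    wanted = {str(i) for i in wanted_ids}
    for line in handle:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, var_id, ref, alt = fields[:5]
        if var_id in wanted:
            yield chrom, int(pos), var_id, ref, alt

# Tiny synthetic VCF with placeholder records (not real ClinVar data).
vcf_text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t11\tA\tG\t.\t.\t.\n"
    "1\t200\t22\tC\tT\t.\t.\t.\n"
)
compressed = gzip.compress(vcf_text.encode())

# gzip.open accepts a file object, so the same code works on an in-memory buffer.
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    selected = list(filter_vcf_by_id(handle, [22]))
```

Converting the ids to strings up front lets the caller pass integers, as in the example above, while the VCF fields stay as text.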

h2m.genomad_reader(path, gene_name, keep=False)

Generate h2m input from gnomAD data.

Parameters:
  • path (str): the path of the gnomAD csv data.

  • gene_name (str): the gene name.

  • keep (bool): if True, keep all the original columns in the dataframe; if False, keep only the columns necessary for h2m. Defaults to False.

Output:

An input dataframe for h2m modeling.

Example:
>>> filepath = '.../gnomAD_v4.0.0_ENSG00000141510_2024_02_07_11_36_03.csv'
>>> df = h2m.genomad_reader(filepath,'TP53')

h2m.get_tx_id(id, species, ver=None, ty='default', show=True)

Query a human or mouse gene for the coordinates and information of all its transcripts. Requires an internet connection.

Parameters:
  • id (str): identifier of a human or mouse gene. Multiple input forms are accepted, including the gene name and the stable Ensembl gene id, with or without a version number.

  • species (str): ‘h’ for human or ‘m’ for mouse.

  • ver (int): version of the human reference genome, 37 or 38. This parameter is required.

  • ty (str): OPTIONAL. Type of the input id, one of ‘name’/‘gene_id’.

  • show (bool): OPTIONAL. Whether to print a summary of the output.

Return:
  • A list: [chromosome, start location of the gene, end location of the gene, canonical transcript id, list of all transcript ids (the canonical one included and always in first place), list of additional information for each transcript].

Example:
>>> h2m.get_tx_id('TP53','h',ver=37)
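
Because the return value is a positional list, unpacking it into named variables keeps later code readable. The sketch below uses a hand-written placeholder list in the documented order rather than real query output, and it assumes the additional-information list is parallel to the transcript id list.

```python
# Placeholder values in the documented order; not real query output.
result = [
    "chrN",                              # chromosome
    1000,                                # start location of the gene
    2000,                                # end location of the gene
    "TX_A",                              # canonical transcript id
    ["TX_A", "TX_B"],                    # all transcript ids, canonical first
    ["info for TX_A", "info for TX_B"],  # additional information per transcript
]

chrom, gene_start, gene_end, canonical_tx, all_tx, tx_info = result

# Assumption: tx_info is ordered like all_tx, so the two can be zipped together.
info_by_tx = dict(zip(all_tx, tx_info))
```

The canonical transcript id can then be passed on as tx_id_h or tx_id_m when calling h2m.model().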

h2m.get_tx_batch(df, species, ver=None)

Batch query of canonical transcript IDs of human or mouse genes.

Parameters:
  • df (Pandas DataFrame): Must include a column of gene names named ‘gene_name_h’/‘gene_name_m’, depending on the species. An index column is recommended.

  • species (str): ‘h’ for human or ‘m’ for mouse.

  • ver (int): version of the human reference genome, 37 or 38.

Return:
  • Two dataframes. The first is the processed original dataframe with the canonical transcript id attached in the column named ‘tx_id_h’/‘tx_id_m’. The second contains all rows that were not successfully processed.

Example:
>>> h2m.get_tx_batch(df,'h',ver=37)
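
The batch functions return a table of processed rows together with a table of rows that failed. The pure-Python sketch below mimics that convention with lists of dicts so the pattern can be seen without network access; run_batch, lookup and the placeholder ids are hypothetical stand-ins, not part of h2m.

```python
# Hypothetical lookup table with placeholder ids; a real run would query h2m instead.
PLACEHOLDER_TX = {"GENE_A": "TX_A", "GENE_B": "TX_B"}

def lookup(gene_name):
    """Return the placeholder transcript id; raises KeyError for unknown genes."""
    return PLACEHOLDER_TX[gene_name]

def run_batch(rows, func):
    """Apply func to each row's gene name and return (processed, failed)."""
    processed, failed = [], []
    for row in rows:
        try:
            processed.append({**row, "tx_id_h": func(row["gene_name_h"])})
        except KeyError:
            failed.append(row)  # keep the untouched row for later inspection
    return processed, failed

rows = [
    {"index": 0, "gene_name_h": "GENE_A"},
    {"index": 1, "gene_name_h": "GENE_X"},  # not in the lookup table
]
processed, failed = run_batch(rows, lookup)
```

Keeping an index on every row, as recommended above, makes it straightforward to match failed rows back to the original input.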

h2m.query(id, db=None, direction='h2m', ty='default', show=True)

Query homologous mouse genes of human genes.

Parameters:
  • id (str): name/gene_id/tx_id of the human gene.

  • direction (str): OPTIONAL. Query from the human gene to the mouse gene (‘h2m’) or vice versa (‘m2h’).

  • db (FeatureDB): OPTIONAL. The transcript annotation database of a specific version.

  • ty (str): OPTIONAL. Specify the id type, one of ‘gene_id’/‘tx_id’/‘name’.

Return:
  • a list of human gene name, mouse gene name, mapping type and sequence similarity.

Example:
>>> h2m.query('TP53')

h2m.query_batch(df, direction='h2m')

Batch query of orthologous mouse genes of given human genes.

Parameters:
  • df (Pandas DataFrame): Must include a column of gene names named ‘gene_name_h’. An index column is recommended.

  • direction: OPTIONAL. Query from the human gene to the mouse gene (‘h2m’) or vice versa (‘m2h’).

Return:
  • Two dataframes. The first is the processed original dataframe with the matched mouse gene name attached in the column named ‘gene_name_m’. The second contains all rows that were not successfully processed.

Example:
>>> h2m.query_batch(df)

h2m.model(records_h, index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, start, end, ref_seq, alt_seq, ty_h=None, ver=None, direction='h2m', param='default', coor='nc', search_alternative=True, max_alternative=5, nonstop_size=300, flank_size=2, splicing_size=30, batch=False, show_sequence=False, align_input=None, memory_protect=True, memory_size=10000)

Model human variants in the mouse genome.

Parameters:
  • records_h, index_list_h, records_m, index_list_m: human and mouse reference genome.

  • db_h, db_m: human and mouse GENCODE annotation.

  • tx_id_h, tx_id_m: human and mouse transcript ids (obtainable with h2m.get_tx_id()). With direction = ‘h2h’ or direction = ‘m2m’, these are the transcript ids of the input and output variants.

  • start, end: int, start and end location of the mutation on the chromosome.

  • ref_seq: str, human mutation reference sequence. With direction = ‘m2h’/‘h2h’/‘m2m’, the reference sequence of the input variant.

  • alt_seq: str, human mutation alternate sequence. With direction = ‘m2h’/‘h2h’/‘m2m’, the alternate sequence of the input variant.

  • ty_h: str, human variation type. One of [‘SNP’, ‘DNP’, ‘ONP’, ‘INS’, ‘DEL’].

  • ver: int, human reference genome version, 37 or 38.

  • direction (optional): str, set the modeling direction to ‘h2m’ (default), ‘m2h’, ‘h2h’ or ‘m2m’.

  • param (optional): set param = ‘BE’ to output only results that can be modeled by base editing.

  • coor (optional): default = ‘nc’. Set coor = ‘aa’ to accept amino acid variants as input.

  • search_alternative (optional): set search_alternative = False to output only the original modeling results.

  • max_alternative (optional): the maximum number of alternatives output for one human variant.

  • nonstop_size (optional): the number of nucleotides included after the stop codon for alignment and translation in the case of nonstop or frame-shifting mutations.

  • flank_size (optional): the number of amino acids, or nucleotides for non-coding mutations, included on each side of the mutation site.

  • splicing_size (optional): the number of amino acids, or nucleotides for non-coding mutations, included after the stop codon when considering frame-shifting effects.

  • batch (optional): set batch = True to use the input align_input and save time in batch processing.

  • show_sequence (optional): set show_sequence = True to output the whole sequences.

  • align_input (optional): a prepared dictionary of alignment indexes, used to save time in batch processing.

  • memory_protect (optional): default True. Break long alignments that could otherwise kill the kernel.

  • memory_size (optional): maximum length of an aligned sequence when memory_protect == True.

Other rules:

1. If the mutation falls in both coding and non-coding regions, it is considered and processed as an ORIGINAL-MODELING ONLY mutation.

2. The alt_seq input should be on the positive strand, and the start coordinate should be smaller than or equal to the end coordinate.

3. If ref_seq or alt_seq has no length, it can be input as ‘’ or ‘-’.

Example:
>>> h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577120, 7577120, 'C', 'T', ty_h = 'SNP', ver = 37)
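
Two of the input rules above can be checked before any modeling call is made. The sketch below is a hypothetical pre-flight helper (normalize_variant is not part of h2m): it rejects coordinates where the start exceeds the end and maps the ‘-’ placeholder to an empty string. Strand orientation cannot be verified from the strings alone, so it is left to the caller.

```python
def normalize_variant(start, end, ref_seq, alt_seq):
    """Validate coordinates and normalise empty sequences written as '-'."""
    if start > end:
        raise ValueError(f"start ({start}) must be <= end ({end})")
    # Both '' and '-' denote a zero-length sequence.
    ref_seq = "" if ref_seq in ("", "-") else ref_seq
    alt_seq = "" if alt_seq in ("", "-") else alt_seq
    return start, end, ref_seq, alt_seq

# A deletion written with '-' as the alternate sequence.
deletion = normalize_variant(7577120, 7577120, "C", "-")
```

Running this on every row before a batch job surfaces malformed input early instead of partway through a long run.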

h2m.model_batch(df, records_h, index_list_h, records_m, index_list_m, db_h, db_m, ver, param='default', direction='h2m', coor='nc', search_alternative=True, max_alternative=5, nonstop_size=300, flank_size=2, splicing_size=30, show_sequence=False, align_input=None, memory_protect=True, memory_size=10000)

Batch modeling of human variants in the mouse genome.

Parameters:
  • df (Pandas DataFrame): Must include columns {‘start_h’, ‘end_h’, ‘type_h’, ‘ref_seq_h’, ‘alt_seq_h’, ‘tx_id_h’, ‘tx_id_m’, ‘index’}.

  • records_h, index_list_h, records_m, index_list_m: reference genome files.

  • db_h, db_m: genomic annotation files.

  • ver (int): version of the human reference genome, 37 or 38.

  • param (optional): set param = ‘BE’ to output only results that can be modeled by base editing.

  • direction (optional): str, set the modeling direction to ‘h2m’ (default) or ‘m2h’.

  • coor (optional): default = ‘nc’. Set coor = ‘aa’ to accept amino acid variants as input.

  • search_alternative (optional): set search_alternative = False to output only the original modeling results.

  • max_alternative (optional): the maximum number of alternatives output for one human variant.

  • nonstop_size (optional): the number of nucleotides included after the stop codon for alignment and translation in the case of nonstop or frame-shifting mutations.

  • flank_size (optional): the number of amino acids, or nucleotides for non-coding mutations, included on each side of the mutation site.

  • splicing_size (optional): see the description of the same parameter under h2m.model().

  • batch (optional): set batch = True to use the input align_input and save time in batch processing.

  • show_sequence (optional): set show_sequence = True to output the whole sequences.

  • align_input (optional): a prepared dictionary of alignment indexes, used to save time in batch processing.

  • memory_protect (optional): default True. Break long alignments that could otherwise kill the kernel.

  • memory_size (optional): maximum length of an aligned sequence when memory_protect == True.

Return:
  • Two dataframes. The first dataframe is the processed original dataframe. The second dataframe contains all rows that are not successfully processed.

Example:
>>> h2m.model_batch(df, records_h, index_list_h, records_m, index_list_m, db_h, db_m, ver = 37, param = 'BE')
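
Since the input dataframe must carry a fixed set of columns, a quick check before calling h2m.model_batch() can save a failed run. The helper below is a hypothetical sketch (missing_columns is not part of h2m) that works on any iterable of column names, so with pandas it could be called as missing_columns(df.columns).

```python
# Column names taken from the df parameter description above.
REQUIRED_COLUMNS = {
    "start_h", "end_h", "type_h", "ref_seq_h",
    "alt_seq_h", "tx_id_h", "tx_id_m", "index",
}

def missing_columns(columns):
    """Return the required column names absent from `columns`, sorted."""
    return sorted(REQUIRED_COLUMNS - set(columns))

# Example with a deliberately incomplete header.
found = missing_columns(["start_h", "end_h"])
```

An empty list means the dataframe has every column the batch function expects.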

h2m.visualization(model_result, flank_size=2, print_size=6)

Visualize h2m modeling results.

Parameters:
  • model_result (list): the output of h2m.model() called with show_sequence = True.

  • flank_size (int).

  • print_size (int): length of the nucleotide/peptide sequence included on both sides of the flank region.

Output:

A visualization plot.

Example:
>>> model_result = h2m.model(records_h,index_list_h, records_m, index_list_m, db_h, db_m, tx_id_h, tx_id_m, 7577120, 7577120, 'C','T', ty_h = 'SNP', ver = 37, show_sequence=True)
>>> h2m.visualization(model_result)
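
To make the relationship between flank_size and print_size concrete, the sketch below slices a sequence into the segments such a plot would span: the outer context, the flank, and the mutated site. It is only an illustration of the windowing idea under the parameter descriptions above, not h2m's plotting code; display_window and its 0-based, end-exclusive coordinates are assumptions.

```python
def display_window(seq, start, end, flank_size=2, print_size=6):
    """Split seq around the 0-based, end-exclusive site [start, end).

    Returns (left context, left flank, site, right flank, right context).
    """
    flank_l = max(0, start - flank_size)
    flank_r = min(len(seq), end + flank_size)
    outer_l = max(0, flank_l - print_size)
    outer_r = min(len(seq), flank_r + print_size)
    return (
        seq[outer_l:flank_l],
        seq[flank_l:start],
        seq[start:end],
        seq[end:flank_r],
        seq[flank_r:outer_r],
    )

# The alphabet stands in for a sequence so each segment is easy to read.
segments = display_window("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 12, 13)
```

Clamping the bounds to the sequence ends keeps the helper usable for sites near either terminus, where fewer than print_size residues are available.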