Metatranscriptomics bipype

Module functions

auto_tax_read(db_loc)[source]

Reads pickled {KEGG GENES number: set[KO identifiers]} dict.

config_from_file(_file)[source]

Reads parameters from configuration _file. Prepares target.txt and templates for SARTools.

Parameters:_file – configuration file for metatranscriptomic pipeline
Returns:
  • ref_cond: reference condition defined by user
  • all_conds: set of conditions (groups) from target.txt
  • fastqs: list of fastq files on which analysis will be done
Return type:(ref_cond, all_conds, fastqs)
connect_db(db)[source]

Connects to the database.

Parameters:db – Path to SQL database
Returns:Cursor object to database
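
A minimal sketch of such a helper, assuming the database is an SQLite3 file (as the functions below suggest) and using Python's standard sqlite3 module. The name connect_db_sketch is hypothetical and is not bipype's own code.

```python
import sqlite3


def connect_db_sketch(db):
    """Open the SQLite3 database at path `db` and return a cursor to it."""
    connection = sqlite3.connect(db)
    return connection.cursor()


# ':memory:' creates a throwaway in-memory database, handy for trying this out.
cursor = connect_db_sketch(':memory:')
cursor.execute('CREATE TABLE demo (id TEXT)')
cursor.execute("INSERT INTO demo VALUES ('K00161')")
rows = cursor.execute('SELECT id FROM demo').fetchall()
```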
dicto_reduce(present, oversized)[source]

Removes from both dictionaries every element whose key is not present in both of them.

Parameters:
  • present – dict
  • oversized – dict
Returns:

tuple of dicts

Return type:

(oversized, present)

Warning

The order of the parameters is the opposite of the order of the results.

Example

>>> dict_1={'a':1,'c':3,'d':4}
>>> dict_2={'a':3,'b':4,'c':4}
>>> dicto_reduce(dict_1, dict_2)
({'a': 3, 'c': 4}, {'a': 1, 'c': 3})
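
The behaviour in the example can be sketched as follows. This is an illustration only: it assumes new dictionaries are returned rather than the inputs being modified in place, and dicto_reduce_sketch is a hypothetical name.

```python
def dicto_reduce_sketch(present, oversized):
    """Keep only the keys shared by both dicts; return (oversized, present)."""
    common = present.keys() & oversized.keys()
    reduced_present = {key: present[key] for key in common}
    reduced_oversized = {key: oversized[key] for key in common}
    # The result order is swapped relative to the parameter order.
    return reduced_oversized, reduced_present


dict_1 = {'a': 1, 'c': 3, 'd': 4}
dict_2 = {'a': 3, 'b': 4, 'c': 4}
result = dicto_reduce_sketch(dict_1, dict_2)
```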
fastq_to_fasta(fastq)[source]

Runs fastq_to_fasta on fastq.

GLOBAL:
  • path to fastq_to_fasta program: PATH_FQ2FA
get_kegg_name(ko)[source]

Returns name assigned to given KO identifier (from kegg.jp)

Parameters:ko – KO identifier (string)
Returns:name assigned to ko (string)
get_ko_fc(ko_dict, ref_cond, filepath, deseq=False)[source]

From the given table file (SARTools output), adds the fold changes found there to ko_dict.

Parameters:
  • ko_dict – {KO_id:{cond1:value1, cond2:value2...}...} dict
  • ref_cond – reference condition (string)
  • filepath – filepath to output table file from edgeR or DESeq2
  • deseq – True if filepath points to a DESeq2 table file; False if filepath points to an edgeR table file
Returns:

ko_dict with added fold changes from table file

get_kopathways(database)[source]

Makes dictionaries from kopathways table from SQLite3 database.

Parameters:database – Cursor object to SQLite3 database.
Returns:id -> pathways:
{KO identifier: set[KEGG_Pathway_ids]}

For example:

{
    'K01194': set(['ko00500','ko00600',...]),
    'K04501': set(['ko04390',...])
}

pathway -> ids mappings:

{KEGG_Pathway_id: set[KO identifiers]}

For example:

{ko12345: set([K12345, K12346,...]),...}
Return type:Two dictionaries
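
A rough sketch of how the two mappings could be built from a cursor. The table name comes from the description above; the column names, the column order (KO identifier first, pathway identifier second) and the name get_kopathways_sketch are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict


def get_kopathways_sketch(cursor):
    """Build both mappings from a two-column (KO id, pathway id) table."""
    ko_to_paths = defaultdict(set)
    path_to_kos = defaultdict(set)
    # Assumed column order: KO identifier, then KEGG pathway identifier.
    for ko_id, path_id in cursor.execute('SELECT * FROM kopathways'):
        ko_to_paths[ko_id].add(path_id)
        path_to_kos[path_id].add(ko_id)
    return dict(ko_to_paths), dict(path_to_kos)


# Small in-memory table standing in for the real database.
cursor = sqlite3.connect(':memory:').cursor()
cursor.execute('CREATE TABLE kopathways (koid TEXT, pathid TEXT)')
cursor.executemany(
    'INSERT INTO kopathways VALUES (?, ?)',
    [('K01194', 'ko00500'), ('K01194', 'ko00600'), ('K04501', 'ko04390')],
)
ko_to_paths, path_to_kos = get_kopathways_sketch(cursor)
```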
get_pathways(database)[source]

Make dictionary from pathways table from SQLite3 database.

Parameters:database – Cursor object to SQLite3 database.
Returns:dictionary in following format:
{KEGG_Pathway_id:Name}

For example:

{
    'ko04060': 'Cytokine-cytokine receptor interaction',
    'ko00910': 'Nitrogen metabolism'
}
Return type:dict
get_tables(database)[source]

Prints all tables included in SQLite3 database.

Parameters:database – Cursor object to SQLite3 database.
low_change(ko_dict, all_conds)[source]

For every KO, adds a condition: 0 entry for each condition that is missing.

Parameters:
  • ko_dict – {KO_id:{cond1:value1, cond2:value2...}...} dict
  • all_conds – list of conditions (list of strings)
Returns:

supplemented ko_dict

For example:

low_change(
    {
        'K12345': {'pH5': 1.41, 'pH6': 1.73},
        'K23456': {'pH6': 2.0, 'pH8': 2.24}
    },
    ['pH5', 'pH6', 'pH8']
)

gives:

{
    'K12345': {'pH5': 1.41, 'pH6': 1.73, 'pH8': 0.0},
    'K23456': {'pH5': 0.0, 'pH6': 2.0, 'pH8': 2.24}
}
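
The same behaviour as a short sketch. It assumes the dictionary is supplemented in place and also returned; low_change_sketch is a hypothetical name.

```python
def low_change_sketch(ko_dict, all_conds):
    """Give every KO a 0.0 entry for each condition it does not have yet."""
    for conditions in ko_dict.values():
        for cond in all_conds:
            # setdefault leaves existing values untouched.
            conditions.setdefault(cond, 0.0)
    return ko_dict


supplemented = low_change_sketch(
    {
        'K12345': {'pH5': 1.41, 'pH6': 1.73},
        'K23456': {'pH6': 2.0, 'pH8': 2.24},
    },
    ['pH5', 'pH6', 'pH8'],
)
```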

m8_to_ko(file_, multi_id)[source]

Assigns and counts KEGG GENES identifiers from BLAST Tabular (flag: -m 8) output format file, for every KO from multi_id.

After mapping, writes data to output file.

Parameters:
  • file_ – Path to BLAST Tabular (flag: -m 8) format file
  • multi_id – Dict {KEGG GENES identifier : set[KO identifiers]}

Output file (outname) has following name:

outname = file_.replace('txt.m8', 'count')

and following format:

K00161  2
K00627  0
K00382  11
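
A sketch of the counting step under several assumptions: the KEGG GENES identifier is taken from the second (subject) column of the tabular file, lines starting with '#' are comments, every KO from multi_id is written even when its count is zero, and the two output columns are tab-separated. The gene identifiers and the name m8_to_ko_sketch are made up for illustration.

```python
import os
import tempfile
from collections import Counter


def m8_to_ko_sketch(file_, multi_id):
    """Count hits per KO in a tabular BLAST file and write them out."""
    # Start every known KO at zero so KOs without hits still get a line.
    counts = Counter({ko: 0 for kos in multi_id.values() for ko in kos})
    with open(file_) as handle:
        for line in handle:
            if line.startswith('#') or not line.strip():
                continue  # skip comment and empty lines
            subject_id = line.split('\t')[1]
            for ko in multi_id.get(subject_id, ()):
                counts[ko] += 1
    outname = file_.replace('txt.m8', 'count')
    with open(outname, 'w') as out:
        for ko in sorted(counts):
            out.write('{}\t{}\n'.format(ko, counts[ko]))
    return outname


# Tiny demonstration file with two hits to the same (made-up) gene.
m8_path = os.path.join(tempfile.mkdtemp(), 'sample.txt.m8')
with open(m8_path, 'w') as handle:
    handle.write('# comment line\n')
    handle.write('read1\torg:gene1\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-5\t80\n')
    handle.write('read2\torg:gene1\t85.0\t50\t7\t0\t1\t50\t1\t50\t1e-4\t70\n')
multi_id = {'org:gene1': {'K00161'}, 'org:gene2': {'K00627'}}
outname = m8_to_ko_sketch(m8_path, multi_id)
with open(outname) as handle:
    count_lines = handle.read().splitlines()
```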
mapper(ko_dict, ko_set)[source]

Assigns every KO_id from ko_dict to a KEGG_Pathway_id from ko_set.

Parameters:
  • ko_dict – {KO_id:{cond1:value1, cond2:value2...}...} dict
  • ko_set – {KEGG_Pathway_id:set[KO identifiers]} dict
Returns:

Dict with structure:

{KEGG_Pathway_id:{KO_id:{cond1:value1, cond2:value2...}...}...}

Return type:

dict
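
One way to picture the regrouping is the sketch below. It assumes that KOs absent from ko_dict are skipped and that pathways left without any KO are dropped; mapper_sketch is a hypothetical name and the identifiers are placeholders.

```python
def mapper_sketch(ko_dict, ko_set):
    """Group the per-KO condition values by the pathways the KOs belong to."""
    ko_path_dict = {}
    for path_id, kos in ko_set.items():
        found = {ko: ko_dict[ko] for ko in kos if ko in ko_dict}
        if found:  # assumption: pathways with no matching KO are left out
            ko_path_dict[path_id] = found
    return ko_path_dict


ko_dict = {'K01194': {'pH5': 1.0}, 'K04501': {'pH5': 2.0}}
ko_set = {
    'ko00500': {'K01194'},
    'ko04390': {'K04501', 'K99999'},
    'ko00010': {'K88888'},
}
ko_path_dict = mapper_sketch(ko_dict, ko_set)
```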

mapper_write(ko_path_dict, all_conds, out_dir)[source]

Writes file with KO and corresponding fold change, for every combination of condition & KEGG_Pathway_id.

Parameters:
  • ko_path_dict – {KEGG_Pathway_id:{KO_id:{cond1:value1, cond2:value2...}...}...}
  • all_conds – list of conditions (list of strings)
  • out_dir – relative output directory path

The output file has the following path:

out_dir/condX/

the following name:

KEGG_Pathway_id.txt

the following header:

# KO KEGG_Pathway_id

and the following format:

KO_id corresponding_fold_change
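
The file layout described above could be produced roughly as follows. The sketch assumes that the pathway identifier is substituted into the header line, that KO and fold change are separated by a single space, and that every KO already has a value for every condition (for example after low_change()). mapper_write_sketch is a hypothetical name.

```python
import os
import tempfile


def mapper_write_sketch(ko_path_dict, all_conds, out_dir):
    """Write one file per (condition, pathway) pair under out_dir/cond/."""
    for cond in all_conds:
        cond_dir = os.path.join(out_dir, cond)
        os.makedirs(cond_dir, exist_ok=True)
        for path_id, kos in ko_path_dict.items():
            with open(os.path.join(cond_dir, path_id + '.txt'), 'w') as out:
                out.write('# KO {}\n'.format(path_id))
                for ko_id in sorted(kos):
                    out.write('{} {}\n'.format(ko_id, kos[ko_id][cond]))


out_dir = tempfile.mkdtemp()
mapper_write_sketch(
    {'ko00500': {'K01194': {'pH5': 1.5, 'pH6': 0.0}}},
    ['pH5', 'pH6'],
    out_dir,
)
with open(os.path.join(out_dir, 'pH5', 'ko00500.txt')) as handle:
    written = handle.read()
```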
metatranscriptomics(opts)[source]

Performs analysis of metagenomic data.

out_content(filelist, kopath_values, path_names, method='DESeq2')[source]

For every item in ‘kopath_values’ dictionary and for every file in ‘filelist’, writes to output file line with KOs, which are common for item.value and the set of KOs obtained from file.

Parameters:
  • filelist – List of paths to tab-delimited .txt files, where first column is a KO identifier.
  • kopath_values

    {KEGG_Pathway_id:set[KO identifiers]} dict.

    For example:

    {ko12345:set([K12345, K12346,...]),...}
    
  • path_names

    Dictionary in {KEGG_Pathway_id:Name} format.

    For example:

    {
        'ko04060': 'Cytokine-cytokine receptor interaction',
        'ko00910': 'Nitrogen metabolism'
    }
    
  • method – Argument used only as a part of output file name

Output file has following name:

    (method+'_'+filename.replace('txt', 'path_counts.csv'))
where:
    filename = filepath.split('\\')[-1], if '\\' in filepath.
    filename = filepath.split('/')[-1],  if '/' in filepath.
    filename = filepath,                 in other cases.

and the following header line:

ko_path_id;ko_path_name;percent common;common KOs

Writes only lines with non-zero common KOs.
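
The core of the comparison can be sketched without any file handling. Two assumptions here are not confirmed by the description: "percent common" is computed as the share of a pathway's KOs that also occur in the file, and the common KOs are joined with spaces. The function name common_ko_rows and the KO identifiers are placeholders.

```python
def common_ko_rows(file_kos, kopath_values, path_names):
    """Return the header plus one ';'-separated row per pathway with hits."""
    rows = ['ko_path_id;ko_path_name;percent common;common KOs']
    for path_id in sorted(kopath_values):
        path_kos = kopath_values[path_id]
        common = sorted(path_kos & file_kos)
        if not common:
            continue  # only pathways with at least one shared KO are kept
        # Assumed definition: shared KOs as a percentage of the pathway's KOs.
        percent = 100.0 * len(common) / len(path_kos)
        rows.append('{};{};{:.1f};{}'.format(
            path_id, path_names.get(path_id, ''), percent, ' '.join(common)))
    return rows


kopath_values = {'ko00910': {'K90001', 'K90002'}, 'ko04060': {'K90003'}}
path_names = {
    'ko04060': 'Cytokine-cytokine receptor interaction',
    'ko00910': 'Nitrogen metabolism',
}
rows = common_ko_rows({'K90001', 'K99999'}, kopath_values, path_names)
```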

pickle_or_db(pickle, db)[source]

Reads a pickle or an SQL database, then makes a dict.

If the appropriate pickle (a dict) is available, it is read. Otherwise the function reads the ‘kogenes’ table from the SQL database and creates the missing pickle. Finally, it returns the dict.

Parameters:
  • pickle – Path to pickled dict in following format: {KEGG GENES identifier : set[KO identifiers]}
  • db – Cursor object to SQL database with ‘kogenes’ table (columns: KO identifier, KEGG GENES identifier)
Returns:

Dict in {KEGG GENES identifier: set[KO identifiers]} format.

Some information for Bipype’s developers (delete this before final version): Code from this function was not a function in the previous version, and ‘args’ was hardcoded to: ‘kogenes.pckl’ & c (variable with db’s cursor)
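
The read-or-rebuild pattern described for pickle_or_db can be sketched like this. The sketch assumes the ‘kogenes’ table has two columns in the order KO identifier, KEGG GENES identifier; the column names, the gene identifiers and the name pickle_or_db_sketch are hypothetical. The first parameter is called pickle_path here so that it does not shadow the pickle module.

```python
import os
import pickle
import sqlite3
import tempfile
from collections import defaultdict


def pickle_or_db_sketch(pickle_path, cursor):
    """Load the cached dict if present; otherwise build it and cache it."""
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as handle:
            return pickle.load(handle)
    multi_id = defaultdict(set)
    # Assumed column order: KO identifier, then KEGG GENES identifier.
    for ko_id, gene_id in cursor.execute('SELECT * FROM kogenes'):
        multi_id[gene_id].add(ko_id)
    multi_id = dict(multi_id)
    with open(pickle_path, 'wb') as handle:
        pickle.dump(multi_id, handle)
    return multi_id


cursor = sqlite3.connect(':memory:').cursor()
cursor.execute('CREATE TABLE kogenes (koid TEXT, geneid TEXT)')
cursor.executemany(
    'INSERT INTO kogenes VALUES (?, ?)',
    [('K00161', 'org:gene1'), ('K00162', 'org:gene1'), ('K00627', 'org:gene2')],
)
pickle_path = os.path.join(tempfile.mkdtemp(), 'kogenes.pckl')
first = pickle_or_db_sketch(pickle_path, cursor)   # built from the table
second = pickle_or_db_sketch(pickle_path, cursor)  # read back from the pickle
```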

progress(what, estimated_percentage=None, done=True)[source]

Prints specially formatted information about progress.

Parameters:
  • what – a string with the name of the operation which was just performed and should be reported to standard output as done or failed,
  • estimated_percentage

    (int)

    The percentage should be calculated as a part of the whole execution; the first and last 5 percent should be reserved for the programs which run ‘metatranscriptomics’, for pre- and postprocessing,

  • done – informs whether the operation from ‘what’ argument failed or was successfully done.
rapsearch2(input_file, threads)[source]

Runs rapsearch2 for input_file in fasta format.

Writes outputs in “m8/” directory.

GLOBALS:
  • path to RAPSearch2 program: PATH_RAPSEARCH
  • path to similarity search database: PATH_REF_PROT_KO
run_SARTools()[source]

Runs SARTools in R.

HARDCODED:
R templates:
  • edger: template_script_edgeR.r
  • deseq: template_script_DESeq2.r
run_cat_pairing()[source]

Merges fasta files with paired-end reads in cwd.

run_fastq_to_fasta(fastqs)[source]

Runs fastq_to_fasta() for every .fastq in fastqs.

run_ko_csv(ko_dict_deseq, ko_dict_edger, all_conds, kopath_keys, path_names, ref_cond)[source]

For given ko_dicts writes CSV files with pathways and foldchanges

Parameters:
  • ko_dict – {KO_id:{cond1:value1, cond2:value2...}...} dict
  • all_conds – list of conditions (list of strings)
  • kopath_keys – {KO identifier:set[KEGG_Pathway_ids]} dict
  • path_names – {KEGG_Pathway_id:Name} dict
  • filepath – output filepath
Output files have the following format (and header):
KO_id;Gene_name;paths ids;paths names;FC vs cond1;FC vs cond2;...;
HARDCODED:
Output files paths:
  • deseq: ‘deseq.csv’
  • edger: ‘edger.csv’
run_ko_map()[source]

Runs m8_to_ko() for every .m8 file in cwd.

GLOBALS:
  • path to KO database: PATH_KO_DB
  • pickle to dict from KO GENES table from KO database: PATH_KO_PCKL
run_ko_remap(deseq_files, edger_files, kopath_values, path_names)[source]

Runs out_content(files, kopath_values, path_names (,'edgeR')) for files from edger_files and deseq_files.

Parameters:
  • deseq_files – list of DESeq outputs paths
  • edger_files – list of edgeR outputs paths
  • kopath_values – {KEGG_Pathway_id: set[KO identifiers]} dict
  • path_names – {KEGG_Pathway_id: Name} dict
run_new_ko_remap(deseq_files, edger_files, kopath_values, all_conds, ref_cond)[source]

Runs get_ko_fc(), low_change(), mapper() and mapper_write() in appropriate way for files from deseq_files and edger_files.

Parameters:
  • deseq_files – list of DESeq outputs paths
  • edger_files – list of edgeR outputs paths
  • ref_cond – Reference condition (group) - string
  • kopath_values – {KEGG_Pathway_id:set[KO identifiers]} dict
  • all_conds – list of conditions (list of strings)
Returns:
  • ko_dict_deseq: {KO_id:{cond1:value1, cond2:value2...}...} dict
  • ko_dict_edger: {KO_id:{cond1:value1, cond2:value2...}...} dict
Return type:(ko_dict_deseq, ko_dict_edger)

HARDCODED:
Output directories paths:
  • deseq: ‘new_ko_remap/deseq/’
  • edger: ‘new_ko_remap/edger/’
run_pre_ko_remap()[source]

Prepares args for run_ko_remap() or run_new_ko_remap()

Returns:
  • path_names: {KEGG_Pathway_id:Name} dict
  • kopath_keys: {KO identifier:set[KEGG_Pathway_ids]} dict
  • kopath_values: {KEGG_Pathway_id:set[KO identifiers]} dict
  • edger_files: list of edgeR outputs paths
  • deseq_files: list of DESeq outputs paths
Return type:(path_names, kopath_keys, kopath_values, edger_files, deseq_files)
HARDCODED:
Paths to files from SARTools:
  • edger: ‘edger/*[pn].txt’
  • deseq: ‘deseq/*[pn].txt’
GLOBALS:
  • path to KO database: PATH_KO_DB
run_rapsearch(threads)[source]

Runs rapsearch2() for every .tmp.fasta in cwd.