Refseq bipype

Module functions

MH(mode, e, glob_out_dir, t, cat, pair, rap=False, presets='meta-large', megan=False)[source]

Runs Megahit (also rapsearch() and run_megan())

Parameters:
  • mode – if mode is other than ‘run’ program prints commands without executing them
  • e – if true, then program updates todo list with exist_check() function.
  • t – number of threads
  • cat – name of current folder
  • pair – tuple of paired_end reads
  • rap – if true, program runs rapsearch() function.
  • presets – argument for megahit
  • megan – if true, program runs megan() function.
GLOBAL:
PATH_MEGAHIT
MV(mode, e, global_out_dir, k_mers, cat, pair, ins_len, rap=False, megan=False)[source]

Runs Velvet. Runs gzip_MV() function on folder with Velvet results.

The name of the folder is derived from pair argument:

pair_uni_name(pair) + '_velvh_out'

The function can also run rapsearch() and megan() functions when called with appropriate arguments.

Parameters:
  • mode – if mode is ‘run’ program runs velveth or velvetg or metavelvetg
  • e – if true, then program changes todo list with exist_check(): function
  • k_mers – k-length nucleotides reads list
  • cat – name of current folder
  • pair – tuple of paired_end reads
  • ins_len – if 9999 given, then ins_len will be retrieved as an output of ins_len_read() function
  • rap – if true, then program runs rapsearch() function.
  • megan – if true, then program runs megan() function.
GLOBALS:
PATH_VELVETH PATH_VELVETG PATH_METAVELVETG
SSU_read(loc, headers_type='ITS')[source]

Extracts from specially formatted FASTA file taxonomic data and returns them as hierarchically organised dict.

Headers of FASTA files should follow schemas presented by following examples:

  • ITS format (used in UNITE database):

    >DQ233785|uncultured ectomycorrhizal fungus|Fungi|Thelephora terrestris|Fungi; Basidiomycota; Agaricomycotina; Agaricomycetes; Incertae sedis; Thelephorales; Thelephoraceae; Thelephora; Thelephora terrestris
    
  • 16S format (alternative):

    >AF093247.1.2007 Eukaryota;Amoebozoa;Mycetozoa;Myxogastria;;Hyperamoeba_sp._ATCC50750
    
Parameters:
  • loc – location - path to the FASTA file
  • headers_type – tells which format of header is present in given file. If specified, indicates use of alternative format.
Returns:

A dict with taxonomic data, where
  • keys are sequence identifiers;
  • values are lists of taxonomic terms, in order: from the most generic to the most specific one.

Example:

{
    'DQ482017':
    [
        'Fungi',
        'Basidiomycota',
        'Agaricomycotina',
        'Agaricomycetes',
        'Incertaesedis',
        'Thelephorales',
        'Thelephoraceae',
        'Tomentella',
        'Tomentellasublilacina'
    ],
    'EF031133':
    [
        'Fungi',
        'Basidiomycota',
        'Agaricomycotina',
        'Agaricomycetes',
        'Incertaesedis',
        'Thelephorales',
        'Thelephoraceae',
        'Thelephora',
        'Thelephoraterrestris'
    ]
}

adapter_read(filename)[source]
Replace ‘NNNNNN’ part of adp_2 by this part of the filename,
which contains only letters that are symbols of nucleotides.

Example

input:

Amp18_BFp_B_p_GACGAC_L001_R1_001.fastq

output:

(
    'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT',
    'AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGACGACATCTCGTATGCCGTCTTCTGCTTG'
)
HARDCODED:
Adapter sequences
adapter_read_bck(adapter_file, filename)[source]

Function reads all lines of given adapter_file and separates words in each line. If first word in a line is equal to filename, then second and third words are returned as a tuple.

Parameters:filename (adapter_file,) –
Returns:tuple (fin_adap_1, fin_adap_2)
auto_tax_read(db_loc)[source]

Read {GI:TaxID} & {TaxID:scientific_name} dicts from pickle.

Parameters:db_loc – path to pickle file
Returns:{GI:TaxID} Keys are integers, values are strings {TaxID:scientific_name} Keys and values are integers
Return type:Two dicts
bam_idxstating(mode, sorted_bam, idxstats)[source]

Launch samtools idxstats.

Parameters:
  • mode – if mode is ‘run’, launch the program.
  • sorted_bam – path to file in BAM format, which is the input to samtools.
  • idxstats – path to file, where the output from samtools will be written.
bam_indexing(mode, sorted_bam)[source]

If mode is ‘run’ creates an index file sorted_bam.bam.bai for the sorted_bam.bam file.

Parameters:
  • mode – string type
  • sorted_bam – BAM file
bam_make(mode, sam, bam)[source]

If mode is ‘run’ and @SQ lines are present in the header, the function converts SAM to BAM

Parameters:
  • mode – string type
  • sam – SAM file
  • bam – BAM file
bam_sorting(mode, bam, sorted_name)[source]

If mode is ‘run’ the function reads bam (file in BAM format), sort it by aligned read position and write it out to BAM file whose name is: sorted_name.

Parameters:
  • mode – string type
  • bam – BAM file
  • sorted_name – name of output file
bowtie2_run(mode, proc, ref, out, inp1, inp2=False)[source]

If mode is ‘run’, launches Bowtie 2.

Parameters:
  • mode – should bowtie be runned?
  • proc

    [-p in Bowtie 2]

    Number of alignment threads to launch.

  • ref

    [-x in Bowtie 2]

    The basename of the index for the reference genome.

  • out

    [-S in Bowtie 2]

    File to write SAM alignments to.

  • inp1

    [-1 in Bowtie 2]

    Comma-separated list of files containing mate 1s (filename usually includes _1), for example:

    -1 flyA_1.fq,flyB_1.fq
    
  • inp2

    [-2 in Bowtie 2]

    Comma-separated list of files containing mate 2s (filename usually includes _2), for example:

    -2 flyA_2.fq,flyB_2.fq
    

    If inp2 is False: -2 argument will not be passed to Bowtie 2.

cat_read(mode, fileext, paired_end=True)[source]

Returns a dict with paths to sequence files from current working directory.

The dict has following format:

{
    directory_path1: [
        file1_in_this_directory_path,
        file2_in_this_directory_path,
        file3_in_this_directory_path
    ],
    directory_path2: [
        file4_in_this_directory_path,
        file5_in_this_directory_path,
        file6_in_this_directory_path
    ]
}

Paths to files (values of dict) are relative.

Parameters:
  • mode – If mode is ‘run’, the gunzip command will be executed to extract compressed files.
  • fileext – Extension of sequence files, which will be putted into the dict.
  • paired_end – If True, only paired-end reads are included in the dict (default True).
create_krona_html(krona_html_name, krona_xml_name)[source]

Runs script which creates Krona html file from krona xml file

create_krona_xml(xml_string, out_xml)[source]

Writes xml_string into the file of name out_xml. If the file does not exist, creates a new file.

Parameters:
  • xml_string – a string with content to write
  • out_xml – a string with path to file
cutadapt(mode, e, cat, R1_file, R2_file, adapter_file, usearch_16S=False, usearch_ITS=False, threads=False)[source]

Performs preparation of reads to be used in usearch in three steps:

  1. cutadapt - searches for the adapters in reads from input files, removes them, and creates output files with .cutadapt.fastq extension.
  2. runs FLASH (software tool to merge paired-end reads) with those files and gets .amplicons.cutadapt.flash.merged.fastq files from output.
  3. these results are input for fastq_to_fasta which converts file format from fastq to fasta

Optionally usearch can be run on files prepared this way.

Parameters:
  • mode – if mode is ‘run’, function operate on data
  • e – if true, checks if a part of the workflow is actually done and omits these parts, to avoid duplicating this job.
  • cat – name of folder with R_1 and R_2 files
  • R2_file (R1_file,) – input files
  • adapter_file – If ‘use_filenames’, the function will use adapters returned by adapter_read(R1_file)`. Else, function will use adapters returned by adapter_read_bck(adapter_file, R1_file)
  • usearch_16S – if true runs usearch(mode, e, '16S', outname_uni_fasta, usearch_16S, threads) where outname_uni_fasta is cutadapt output.
  • usearch_ITS
    if true runs usearch(mode, e, 'ITS', outname_uni_fasta, usearch_ITS, threads) where outname_uni_fasta is cutadapt output.
    See Also:
    Follwoing functions may provide you with better background. - adapter_read_bck() - adapter_read()
GLOBALS:
PATH_FQ2FA
deunique(node)[source]

Extract name of a node from string representing taxonomic position of this node.

Example

input:

'lvl1_____lvl2_____lvl_3_____lvln'

output:

'lvln'
Parameters:node – a string with a list of ancestors of the node and the node itself, separated by ‘_____’.
Returns:A string with pure name of the node.

See also

linia_unique() function

dict_prepare(typ, indict, SSU)[source]

Runs files analysis and then, creates two dicts with summarized taxonomic data. The resulting dicts keep information about files from which given taxa comes.

Parameters:
  • typ – an “output type” - typically: ‘ITS’, ‘16S’ or ‘txt’.
  • indict
    A dict with file structure, where:
    • keys are directories,
    • values are lists of files.

    Example in description od output of cat_read() function.

  • SSU
    A dict with taxonomic data, where:
    • keys are sequence identifiers;
    • values are lists of taxonomic terms, in order: from the most generic to the most specific one.

    Example in description of output of SSU_read() function.

Returns:

(all_dicts, all_tax_dict)

all_dicts: A dict, where
  • keys are file identifiers,
  • values are taxonomic trees with numbers of occurrences of particular species placed inside leafs

Example:

{
    'filename_1.ext':
    {
        'Bacillaceae': {'Anoxybacillus': {'subsum': 15}}
    },
    'filename_2.ext':
    {
        'Listeriaceae':
        {
            'Listeria':
            {
                'Lgrayi': {'subsum': 22},
                'Linnocua': {'subsum': 11}
            }
        }
    }
}

all_tax_dict: A dict, where

  • keys are file identifiers,

  • values are dicts, where
    • keys are names of taxa,
    • values are numbers of occurrences of particular taxon.

Example:

{
    'filename_1.ext':
    {
        'Bacillaceae': 5, 'Anoxybacillus': 5
    },
    'filename_2.ext':
    {
        'Listeriaceae': 19, 'Listeria': 19, 'Lgrayi': 16, 'Linnocua': 3
    }
}

Return type:

A tuple

dict_purify(bac_dict)[source]

Removes from given dict values below the hardcoded threshold.

Parameters:bac_dict – a dict.
Returns:A new dict without entries, within values were lower than threshold.
HARDCODED:
threshold = 10
dict_sum(dicto, val)[source]

Sums the values in the given dict, using recurrence to handle nested dicts.

Parameters:
  • dicto – a dict, where values are of any type with defined + operation
  • val – an int - initial value
Returns:

an int - sum of all values in given dict plus value given as second argument

Return type:

val

exist_check(program, names, todo)[source]

Checks if a part of program is actually done.

Function tries to find output files from different parts of program and basing on the results, removes corresponding tokens from todo list.

In one special case (program == ‘usearch_0’) function checks size of the file specified by <names> argument in bits. If size == 0: todo=[] and PANIC MODE information is printed.

Parameters:
  • program

    One of the following strings representing programs (de facto functions):

    • ‘refseq’ (&)
    • ‘usearch’ (@)
    • ‘usearch_0’ (@)
    • ‘MV’ (^)
    • ‘rapsearch’ (@)
    • ‘cutadapt’ (^)
  • names

    There are three possibilities:

    • (&) A dict in following format:
      {file extension : path to file with corresponding extension}
      
    • (@) A string (one path)
    • (^) A list of paths

    Paths are hypothetical - if one really exists, appropriate token is removed from todo list.

  • todo – List of tokens representing different parts of function specified by ‘program’ argument.

Please, pay attention to mutual compatibility of arguments.

Result:
todo: Checked list of tokens.

See also

For more information please refer to descriptions of the following functions:

fastq_dict(seq_dict, root, file_)[source]

Adds file_ to list available under seq_dict[root].

If key root is not present, it will be created.

file_analysis(typ, name, SSU=None)[source]

Using given SSU as database, performs ‘statistical’ analysis of taxonomy from file name. It counts occurrences of different species in given file and returns result in form of two dicts.

The threshold hardcoded in dict_purify is used, to remove neglectable entities.

Parameters:
  • typ – an “output type” - typically: ‘ITS’, ‘16S’ or ‘txt’.
  • name – name of file to analise
  • SSU
    A dict with taxonomic data, where:
    • keys are sequence identifiers;
    • values are lists of taxonomic terms, in order: from the most generic to the most specific one.

    Explicit examples included in description of SSU_read() function.

Returns:

If file of name <name> was not found:

(‘NA’, ‘NA’)

If data of type ‘16S’ was successfully analysed:

(full_bac_dict, tax_bac_dict)

If data of type ‘ITS’ was successfully analysed:

(full_fun_dict, tax_fun_dict)

full_bac_dict / full_fun_dict: a dict of dicts.
It represents taxonomic tree. Generally in this dict:
  • keys are names of taxa,
  • values are nested dicts of the same type.

Values in the deepest levels (“leafs”) are counts of occurrences.

Example:

{
    'Bacillaceae':
    {
        'Anoxybacillus': {'subsum': 15}
    },
    'Listeriaceae':
    {
        'Listeria':
        {
            'Lgrayi': {'subsum': 22},
            'Linnocua': {'subsum': 11}
        }
    }
}
tax_bac_dict / tax_fun_dict: a dict, with taxonomic statistics:
  • keys are names of taxa
  • values are counts of occurrences

Example:

{
    'Bacillaceae': 15,
    'Anoxybacillus': 15,
    'Listeriaceae': 33,
    'Listeria': 33,
    'Lgrayi': 22,
    'Linnocua': 11
}

Return type:

A 2-tuple

HARDCODED:
In ‘16S’ analysis, if spec is ‘Phaseolus_acutifolius_(tepary_bean)’, then it is not counted neither as bacteria nor archaea.
TODO:
There was a suggestion, that there might be something wrong with handling of ‘name’ argument, when passed in only file_analysis call present in this file. The issue is linked to “full path or basename?” question. It should be investigated, in fully functional testing environment.
graphlan_to_krona(input_d)[source]

Modifies files given by inupt_d from Graphlan format to set of xml strings, which may be assembled into a input for Krona.

Parameters:input_d
A dict where:
  • keys are names of directories,
  • values are lists with filenames, which are in the directory.

Example:

{'dir_name_1': ['file_1','file_2'], 'dir_name_2': ['file_1']}

Explicit example in description of output of cat_read().

Returns:A tuple
xml_names: readable identifiers of files derived from filenames.
Filenames comes from input_d dict. Example:
[identifier_1, identifier_2, ..., identifier_x]
xml_dict: Contains lists of numbers of occurrences of nodes, grouped by node:
  • keys are names of nodes,
  • values are lists of constant length, where every item on position X means: number of occurrences of nodes with prefix equal to name of node in file X. file X is that file, which is on position X on list xml_names.

Note, that for all i: len(xml_dict[i]) is equal to len(xml_names).

Example:

{node: [count_1, count_2, ... count_x], node_2: [count_1, count_2, ... count_x]}
tax_tree: A dict of dicts, etc.
In form of nested dicts, it represents taxonomic tree. Example:
{a:{aa:{}, ab:{aba:{}, abb:{abba:{}}}}}

Explicit example in description of total_tax_tree in tax_tree_graphlan(),

name_total_count: A dict.
Contains summed numbers of occurrences of nodes by file.
  • keys are identifiers (derived from filenames - check xml_names description),
  • values are sums.

Example:

{identifier_1: count_1, identifier_2: count_2}
Return type:xml_names, xml_dict, tax_tree, name_total_count
gzip_MV(MV_dir)[source]

Return archive with catalogs Sequences, Roadmaps, PreGraph, Graph2, LastGraph and files with statistics.

Parameters:MV_dir – folder with Velvet results
humann(mode, e, m8_dict, typ='m8')[source]

Copies humann to the current directory, moves input (*.m8) files to the input directory, copies hmp_metadata.dat file to the input directory and runs humann.

GLOBALS:
  • path to humann program: PATH_HUMANN
  • path to data for humann: PATH_HUMANN_DATA
Parameters:
  • mode – if ‘run’, then humann will be copied to the current directory
  • e – if true, then function checks existence of humann-0.99 results folder
  • m8_dict – the similarty search results folders
  • typ – default typ=”m8”, in that case new catalog humann-0.99 will be created in rapsearch result folder; in other case humann analysis results will be added in rapsearch result folder.
idx_map(mode, file_, tax_name_dict, tax_id_dict, outfile)[source]

Parses and writes data from samtools idxstats output file.

Firstly, using idx_reader(file_), creates {GI:#_mapped_reads} (a dict).

Secondly, replaces every GI (key) with TaxID if appropriate one is available in tax_id_dict.

Than, replaces every GI /TaxID (key) with scientific name if appropriate one is available in tax_name_dict.

Finally, writes data (sorted by numbers of mapped reads in decreasing order) to outfile in key;value format, where:

  • key is GI/TaxID/scientific name
  • value is number of mapped reads
Parameters:
  • mode – If (mode == ‘run’), function do mentioned things. Elif (mode == ‘test’) function prints file_ and outfile arguments. Else function do nothing.
  • file_ – Path to samtools idxstats output file. For more information refer to idx_reader() function.
  • tax_name_dict – {TaxID:scientific_name} dict. For more information refer to tax_name_reader() function.
  • tax_id_dict – {GI:TaxID} dict.__weakrefoffset__ For more information refer to tax_id_reader() function.
  • outfile – Path to output file.
idx_reader(file_path)[source]

Reads file in samtools idxstats output format.

Parameters:file_path

path to samtools idxstats output file.

File is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads

Returns:a dict with {GI: #_mapped_reads} (keys and values are integers).
idxstat_perling(mode, idxstats, map_count)[source]
Counts and writes to map_count both:
  • sum of all (numbers of mapped reads) in idxstats,
  • sum of all (numbers of unmapped reads) in idxstats.

Function launches short “one-liner” in perl.

Parameters:
  • mode – If mode is ‘run’ the perl “one-liner” will be launched.
  • idxstats – Path to samtools idxstats output file. Please refer to idx_reader() for more information.
  • map_count – Path to file, where differences will be written.
Output:

Output has following format:

#mapped - #unmapped

Example:

123 - 456
TODO:
This should be rewritten to python (why to use perl here?)
input_locations(mode, out_types)[source]

Generates lists of locations where input files are located. Lists are grouped by directories and later, by “output types”.

As input files considered are only files, which meet all the following conditions:

  • they are located in current working directory or in subdirectories of current working directory,
  • have suffix of filename equal to out_type,
  • if out_type is ‘ITS’ or ‘16S’, filenames also have to contain ‘usearch_’ before the out_type in name.
Parameters:
  • mode – a parameter passed to cat_read() function. If mode is ‘run’, the gunzip command will be executed, to extract compressed files.
  • out_types – a list with “output types”, typically consisting of: ‘ITS’, ‘16S’ or ‘txt’.
Returns:

A dict where keys are “output types”, values are dicts describing file locations returned by cat_read() function.

Example:

{
    'ITS':
    {
        directory_path1:
            [file1_in_this_directory_path,
            file2_in_this_directory_path,
            file3_in_this_directory_path],
        directory_path2:
            [file4_in_this_directory_path,
            file5_in_this_directory_path,
            file6_in_this_directory_path]
    }
}

ins_len_read(pair, cat)[source]

Return 200 if the read length is less then 200 or 500 in otherwise.

Parameters:
  • pair – tuple of paired_end read
  • cat – name of folder with the sample files

Example

input:

('Amp15_BFk_B_p_CGTACG_L001_R1_001.fastq','Amp15_BFk_B_p_CGTACG_L001_R2_001.fastq'), 'catalog_name'

output:

500
linia_unique(linia)[source]

Creates list where every element includes information about previous elements, in order, separated by “____”.

Example

input:

[‘1’,‘2’,‘3’],

output:

[‘1’, ‘1_____2’, ‘1_____2_____3’]
Parameters:linia – a list.
Returns:A list.

See also

deunique() function

name_total_reduction(xml_names, file_total_count)[source]

Rewrites given <file_total_count> dict, to group total counts by names (datasets), instant of grouping by filenames.

Keep in mind, that this step - along with generation of xml_names in xml_name_parse - groups files referring to the same dataset under single name, and technically is vulnerable for some errors.

See also

description of xml_vals() function

Parameters:
  • xml_names

    readable identifiers of groups of files referring to common dataset; derived from filenames. Example:

    ['filename_1', 'filename_2']
    
  • file_total_count

    A dict within information about files are kept and values of every leaf are summed into values (usually representing total count of species inside particular file)

    Example:

    {
        'filename_1.ext': 15,
        'filename_2.ext': 33,
        'filename_2_part_2.ext': 33
    }
    
Returns:

A dict, where keys are names as present in xml_names and values are summed total counts (usually of occurrences of different species).

Example:

{'filename_1': 15, 'filename_2': 66}

out_namespace(curr, out_types)[source]

Generates paths to places, where krona xml and krona html files are or will be placed. Path composition varies, according to given arguments and always contains some string representation of out_type. Paths may start in current working directory or in directory specified by curr.

If ‘txt’ is one of output_types, the basename of output files will be ‘humann-graphlan’; otherwise, the basename will be concatenation of elements from ‘out_types’ list, with underscore ‘_’ as glue.

In case of multiple elements on <out_types> list, they will be sorted, so the output files will have consistent names regardless to parameters’ order.

Examples of generated filenames, basing on <out_types>:

  • for [‘ITS’]: ‘ITS.krona’, ‘ITS.html’
  • for [‘ITS’, ‘16S’]: ‘16S_ITS.krona’, ‘16S_ITS.html’
  • for [‘txt’]: ‘humann-graphlan.krona’, ‘humann-graphlan.html’
Parameters:
  • curr – A string - path to directory where the files are/will be placed. If curr is ‘in_situ’, the path will start in current working directory.
  • out_type – A list that determines the basename of output files. In practice, we expect list of elements: ‘ITS’, ‘16S’ or ‘txt’.
Returns:

A tuple

out_xml: a string with path to file.

The filename is derived from (out_type) and file has ‘.krona’ extension.

out_html: a string with path to file.

The filename is derived from (out_type) and file has ‘.html’ extension.

Return type:

out_xml, out_html

pair_uni_name(file_pair)[source]

Returns one name for given tuple containing pair of filenames.

Parameters:file_pair – tuple of paired_end read

Example

input:

('Amp15_BFk_B_p_CGTACG_L001_R1_001.fastq','Amp15_BFk_B_p_CGTACG_L001_R2_001.fastq')

output:

'Amp15_BFk_B_p_CGTACG_L001_001'
paired_end_match(seq_dict)[source]

Returns a dict with paths to paired-end reads only. Files will be matched as pair if their names differs only by parts:

  • R1 and R2, if the parts of filename are separated by underscores _
  • 1 and 2, if the parts of filename are separated by dots .
Parameters:seq_dict

A dict determining locations of paired-end sequences.

Example:

{directory_path:file_path}

Returns: a dict

The returned dict has following format:

{
    directory_path1: [
        (paired-end_read1_R1_path, paired-end_read1_R2_path),
        (paired-end_read2_R1_path, paired-end_read2_R2_path),
        (paired-end_read3_R1_path, paired-end_read3_R2_path)
    ],
    directory_path2: [
        (paired-end_read4_R1_path, paired-end_read4_R2_path)
    ]
}

Paths to files (both in input and output dict) are relative.

prepare_taxonomy_dicts(opts, input_dict)[source]

Prepares dicts with taxonomy stats.

See also

prepare_taxonomy_stats() for more information.

prepare_taxonomy_stats(opts)[source]

Performs statistical analysis of taxonomy from appropriate files from current working directory: counts occurrences of different taxa and prepares the results to be presented in HTML format. Results will be converted to HTML (with the Krona program), but only when opts.mode is set to ‘run’.

Parameters:opts

A namespace, where:

opts.output_type:
Allows to choice on which files the analysis will be performed and also determines basenames of output files.

One of: [‘ITS’, ‘16S’, ‘txt’]

opts.mode:
A mode in which the program will be run.

One of: [‘test’, ‘run’]

opts.out_dir:
Indicates, where the output files should be located. To specify current working directory, use ‘in_situ’.

One of: [‘in_situ’, a_string_with_path_to_dir]

opts.db_taxonomy_16S:
Path to 16S database used in taxonomy classification (fasta with specially formatted headers)
opts.db_taxonomy_ITS:
Path to ITS database used in taxonomy classification (fasta with specially formatted headers)

Input:

Analysis will be performed on files, meeting all the following conditions:
  • files are located in current working directory or in subdirectories of current working directory,
  • suffix of filename is equal to opts.output_type,
  • if opts.output_type is ‘ITS’ or ‘16S’, filenames contains ‘usearch_’ before the opts.output_type in name.

Output:

Output files will be placed in opts.out_dir directory, with basenames depending on opts.output_type.

Examples of filenames:
  • for ‘ITS’: (‘ITS.krona’, ‘ITS.html’)
  • for [‘ITS’, ‘16S’]: (‘ITS_16S.krona’, ‘ITS_16S.html’)
prettify(elem)[source]

Creates a string representation of an XML element by calling xml.etree.ElementTree.tostring() and then, parses this string again to obtain pretty-printed XML string, where indents are four spaces long.

Parameters:elem – instance of class Element from xml.etree.ElementTree
Returns:A string containing pretty-printed XML of <elem>
rapsearch(mode, e, contig_loc, rap_out, KEGG=None)[source]

Runs RAPSearch with using KEGG databases for similarity search.

Parameters:
  • mode – if mode is ‘run’, then program runs rapsearch
  • e – if e=True, then function runs exist_check function
  • contig_loc – a query file
  • rap_out – output file name
  • KEGG – default is None, if KEGG= KO, then ko.pep.rapsearch.db is chosen as protein database
HARDCODED:
if KEGG=’masl’
GLOBALS:
  • path to RAPSearch program: PATH_RAPSEARCH

  • paths to data for RAPSearch:

    PATH_REF_PROT_MASL28, PATH_REF_PROT_KO, PATH_REF_PROT_ELSE

reconstruct(mode, thr, e, pair, cat, prefix, rec_db_loc)[source]

Runs bwa. Find the SA coordinates of the input reads. Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.

Parameters:
  • mode – if ‘run’, then commands “bwa aln” and “bwa samse” were run
  • thr – Number of threads (multi-threading mode)
  • e – (!!!! Wasn’t used !!!!)
  • pair – tuple of paired_end read
  • cat – name of folder with the sample files
  • prefix – prefix for output files names
  • rec_db_loc – input database with fasta files
refseq_mapping(mode, e, directory, pair, postfix, refseq, tax_name_dict, tax_id_dict, threads, map_dir, refseq_2=False)[source]

Aligns reads to reference sequence(s) using Bowtie 2, than parses and writes data to files.

  • Firstly, Bowtie 2 align reads to reference sequence (or sequences, if refseq_2 is not False and merge output SAM files).

  • then, BAM files are made, sorted and indexed,

  • the function launches ‘samtools idxstats’,

  • in the next step, perl ‘one-liner’ counts and writes to file sums of mapped reads and unmapped reads,

  • finally, idx_map() function is called: it parses data from samtools idxstats output file and writes data to outfile in key;value format, where:

    • key is GI/TaxID/scientific name
    • value is number of mapped reads.
Parameters:
  • mode – If mode is ‘run’, the function operates on data and prints information about it. Otherwise, prints information, without operating on data (it is a kind of test).
  • e

    If true, checks if a part of workflow is actually done and doesn’t duplicate this jobs.

    For more information, please refer to e argument in exist_check() function.

  • directory – Path to directory, where files will be written.
  • pair – Tuple of paths to file (paired-end reads) OR a string with path to a sequence file.
  • postfix

    String added to the end of files basenames.

    Argument for refseq_ref_namespace().

  • refseq

    The basename of the index for the reference genome.

    Argument for bowtie2_run().

  • tax_name_dict

    A dict: {TaxID:scientific_name}

    For more information refer to tax_name_reader().

    Argument for idx_map().

  • tax_id_dict

    A dict: {GI:TaxID}

    For more information refer to tax_id_reader().

    Argument for idx_map().

  • threads – Number of threads for Bowtie 2 calculations.
  • map_dir – Directory where sums of mapped and unmapped reads will be written. For more information refer to idxstat_perling().
  • refseq_2

    The basename of the index for the reference genome.

    Argument for bowtie2_run().

    If selected, launches Bowtie 2 on it, then merges output with output from Bowtie 2 launched on ‘refseq’.

    If refseq_2 == False, Bowtie 2 is working only on ‘refseq’ argument.

refseq_ref_namespace(directory, seq, postfix, out_dir='in_situ', map_dir='in_situ')[source]
Returns a dict within:
  • keys are following file extensions: fastq, sam, sam2, bam, sorted, sorted.bam, idx_stats, tax_count, map_count
  • values are paths to file with corresponding extension.

Filenames have following format:

sample_name + '_' + postfix + extension

where sample_name is basename of seq (if seq is a file) or output of pair_uni_name(seq) (if seq is a tuple)

Parameters:
  • directory – Path to directory, where file with .fastq extension will be written
  • seq – Tuple of paired_end read or sequence file.
  • postfix – String added to the end of file basenames.
  • out_dir – Path to directory, where files with .sam, .sam2, .bam,.sorted, .sorted.bam, .idx_stats and .tax_count extensions will be written. If out_dir is ‘in_situ’ (default), out_dir=directory
  • map_dir – Path to directory, where file with .map_count extension will be written. If map_dir is ‘in_situ’(default),map_dir=out_dir.
run_megan(out_dir, m8, contigs, tax_gi_file='/home/pszczesny/workingdata/mapping/gi_taxid_prot.dmp')[source]

Makes and launches scripts for MEGAN

Parameters:
  • out_dir – output directory
  • m8 – list of m8 files
  • contigs – contigs file
  • tax_gi_file – taxGIFile
HARDCODED:
default path to taxGIFile
sam_merge(sam1, sam2)[source]

Merges two files in SAM format into one.

Parameters:
  • sam1 – SAM file
  • sam2 – SAM file
Returns:

sam1 file in SAM format which contains now lines from both files. Function also removes sam2 file.

sample(opts)[source]

Runs programs such as Velvet, bwa, Bowtie 2, humann; cuts adapters or run RAPSearch if these options were chosen.

Parameters:opts

A namespace, where:

opts.to_calculate:
Input sequences type (Plant or fungi). TODO (this description seems to be misleading)
opts.mode:
A mode in which the program will be run. One of: [‘test’, ‘run’].
opts.db_NCBI_taxonomy:
Path to pickle file.
opts.reconstruct:
Determines if reconstruct() function should be run.
opts.assembler:
Assembly program One of: [‘MH’,’MV’].
opts.MV:
k-length nucleotids reads list or None.
opts.db_reconstruct:
Input database with fasta files.
opts.threads:
Number of threads (multi-threading mode).
opts.e:
If true, check if file already exists.
opts.ins_len:
200 if read length is less than 200 and 500 in otherwsie.
opts.db_refseq_fungi:
A list of paths to refseq databases to use in fungi analysis. Up to two paths are allowed.

Warning: If there are multiple databases with filenames like <your_path><second_part_of_name>, all will be loaded.

opts.db_refseq_plant:
A list of paths to refseq databases to use in plants analysis. Up to two paths are allowed. Also check warning in opts.db_refseq_fungi description.
opts.our_dir:
Directory where sums of mapped and unmapped reads will be written
opts.postfix:
TODO
opts.cutadapt:
List of adapter types: ‘16S’, ‘ITS’ or ‘both’.
opts.db_16S:
database of 16S adapters, which will be -db parameter for Usearch program; if it is not empty, then cutadapt function`s parameter usearch_16S will be True.__truediv__
opts.db_ITS:
database of ITS adapters, which will be -db parameter for Usearch program; if it isn’t empty,then cutadapt function`s parameter usearch_ITS will be True.
GLOBALS:
PATH_FQ2FA
HARDCODED:
actions dependent on existence of ‘seq’ substring in some paths.
tax_id_reader()[source]

Returns a dict with {GI:TaxID} from file.

Keys and values are integers.

File has following format:

13  9913
15  9915
16  9771
17  9771

where first column is a GI, second is a TaxID.

GLOBLAS:
PATH_GI_TAX
tax_name_reader()[source]

Returns a dict with {TaxID:scientific_name} from file.

Lines without scientific names are omitted.

Keys are integers, values are strings.

Example

File:

2   |       prokaryotes     |       prokaryotes <Bacteria>  |       in-part |
6   |       Azorhizobium    |               |       scientific name |
6   |       Azorhizobium Dreyfus et al. 1988        |               |       synonym |
6   |       Azotirhizobium  |               |       equivalent name |
7   |       ATCC 43989      |               |       type material   |
7   |       Azorhizobium caulinodans        |               |       scientific name |

Output:

{6:'Azotirhizobium', 7:'Azorhizobium caulinodans'}
GLOBALS:
PATH_TAX_NAME
tax_tree_extend(tax_tree, linia)[source]

Recursively adds elements form list linia into the tree tax_tree.

Parameters:
  • tax_tree – tree represented as dict of dicts.
  • linia – a list in order: from the oldest ancestor to the youngest descendant. Elements should be formatted to contain information about ancestors (like it does linia_unique() function).
Returns:

Extended tree in form of dict of dicts.

tax_tree_graphlan(input_d)[source]

Reads and interprets information about taxonomic relations, from files in Graphlan-like format.

Parameters:input_d

A dict where keys are names of directories, values are lists with filenames, which are in the directory.

Example:

{'dir_name_1': ['file_1','file_2'], 'dir_name_2': ['file_1']}

Explicit example in description of output of cat_read().

Example of featured file:

Listeriaceae.Listeria.Lgrayi
Listeriaceae.Listeria.Linnocua
Returns:A tuple
total_tax_tree:
A dict of dicts, etc. In form of nested dicts, represents phylogenetic trees of entities described by files featured in the input. This is a total tree - nodes from all files are merged here.

Example:

{
    'Bacillaceae':
    {
        'Bacillaceae_____Anoxybacillus': {}
    },
    'Listeriaceae':
    {
        'Listeriaceae_____Listeria':
        {
            'Listeriaceae_____Listeria_____Lgrayi': {},
            'Listeriaceae_____Listeria_____Linnocua': {},
        }
    }
}
per_file_tax_tree:
A dict of dicts, etc. Keys are filenames. In form of nested dicts, represents phylogenetic trees of entities, described by files featured in the input.

Example:

{
    'annot_1.txt':
    {
        'Bacillaceae':
        {
            'Bacillaceae_____Anoxybacillus': {}
        }
    }
    'annot_2.txt':
    {
        'Listeriaceae':
        {
            'Listeriaceae_____Listeria':
            {
                'Listeriaceae_____Listeria_____Lgrayi': {},
                'Listeriaceae_____Listeria_____Linnocua': {},
            }
        }
    }
}
multi_flat_tax_tree: A dict of dicts, where:
  • keys are filenames,
  • values are dicts, where:
    • keys are names of nodes,
    • values are numbers of nodes, with names starting from “node_name”

Example:

{
    'annot_1.txt':
    {
        'Bacillaceae': 8,
        'Bacillaceae_____Anoxybacillus': 8
    },
    'annot_2.txt':
    {
        'Listeriaceae': 16,
        'Listeriaceae_____Listeria': 16,
        'Listeriaceae_____Listeria_____Lgrayi': 8,
        'Listeriaceae_____Listeria_____Linnocua': 8
    }
}
Return type:total_tax_tree, per_file_tax_tree, multi_flat_tax_tree
taxa_read(read_mode, db_loc=None)[source]

Returns {GI:TaxID} & {TaxID:scientific_name} dicts.

Parameters:
  • read_mode
    if read_mode is equal to ‘manual’, then:
    tax_id_reader() and tax_name_reader() functions will be used and data is read from text files. Refer to mentioned functions for more information.
    otherwise:
    auto_tax_read() function (with db_loc as an argument) will be used and data will be read from a pickle file. Refer to mentioned function for more information.
  • db_loc – path to pickle file, used when read_mode is other than ‘manual’
Returns:

  • {GI:TaxID} Keys are integers, values are strings.
  • {TaxID:scientific_name} Keys and values are integers.

Return type:

Two dicts

tree_of_life(full_dict)[source]

Rewrites the data from a dict containing information arranged by type and then by file into two dicts:

  • first represents full taxonomic tree
  • second presents how many species are inside particular files
Parameters:full_dict

a dict of dicts, with structure like:

{
    'output_type_1':
    {
        'filename_1.ext':
        {
            'Bacillaceae': {'Anoxybacillus': {'subsum': 15}}
        }
    }
}

Longer example available in description of xml_format().

Returns:A tuple (full_tree, file_total_count)
full_tree:
A dict without information about type and file.

Example:

{
    'Bacillaceae':
    {
        'Anoxybacillus': {}
    },
    'Listeriaceae':
    {
        'Listeria':
        {
            'Lgrayi': {}, 'Linnocua': {}
        }
    }
}
file_total_count:
A dict within information about files are kept and values of every leaf are summed into values representing total count of species inside particular file.

Example:

{'filename_1.ext': 15, 'filename_2.ext': 33}
tuple_to_dict(tuple_dict)[source]

Input dict has tuples as keys and int type values. In the dict returned by function:

  • keys are elements from input dict’s tuples,
  • values are dicts.
Parameters:tuple_dict

a dict of tuples

Example:

{('first','second','third'): 10, ('first','second'): 30, ...}

Example will show why this function can be useful:

input:

{
    ('Animalia','Mammalia','Canis lupus familiaris'): 10,
    ('Bacteria','Enterobacteriaceae','Escherichia coli'): 20,
    ('Animalia','Mammalia','Felis catus'): 30
}

output:

{
    'Animalia':
    {
        'Mammalia':
        {
            'Canis lupus familiaris': {'subsum': 10},
            'Felis catus': {'subsum': 30}
        }
    }
    'Bacteria':
    {
        'Enterobacteriaceae':
        {
            'Escherichia coli': {'subsum': 20}
        }
    }
}
tuple_to_xml_dict(tuple_dict)[source]

Input dict has tuples as keys and int type values. In dict returned by function all elements from input dict’s tuples are single keys but their current values are the sum of the values of all tuples, in which they were.

Parameters:tuple_dict

tuple to be modified

Example:

{
    ('first','second','third'): 10,
    ('first','third','fifth': 30)
    ...
}

Example

input:

{
    ('Animalia','Mammalia','Canis lupus familiaris'): 10,
    ('Bacteria','Enterobacteriaceae','Escherichia coli'): 20,
    ('Animalia','Mammalia','Felis catus'): 30
}

output:

{
    'Animalia': 40,
    'Felis catus': 30,
    'Enterobacteriaceae': 20,
    'Mammalia': 40,
    'Escherichia coli': 20,
    'Bacteria': 20,
    'Canis lupus familiaris': 10
}
txt_dict_clean(dicto)[source]

Cleans given file structure by removing files which do not contain words “graplan” or “tree”. Emptied directories are also removed.

Parameters:dicto
A dict where:
  • keys are names of directories,
  • values are lists with filenames, which are in the directory.

Example:

{'dir_name_1': ['file_1','file_2'], 'dir_name_2': ['file_1']}

Explicit example in description of output of cat_read().

Returns:A dict in the same format, as given one (without unwanted files)
update_dict(tax_tree, curr_tax)[source]

Updates the dict with keys form another one. If during walking the dict, the ‘subsum’ string is found, recurrence stops and nodes beneath are ignored. The dicts have to contain only other dicts or ‘subsum’!

Parameters:
  • tax_tree – a dict to update
  • curr_tax – a dict to be merged into the tax_tree
Returns:

updated dict

Return type:

tax_tree

Example

input (tax_tree, curr_tax):

(
    {
        'Listeriaceae':
        {
            'Listeria': { 'Lgrayi': {} }
        }
    },
    {
        'Listeriaceae':
        {
            'Listeria': { 'Linnocua': {'subsum': 11} }
        }
    }
)

output:

{
    'Listeriaceae':
    {
        'Listeria':
        {
            'Lgrayi': {},
            'Linnocua': {}
        }
    }
}
usearch(mode, e, search_type, infile, database, threads)[source]

Launches Usearch with -usearch_local command.

HARDCODED:
  • Usearch options:
    • evalue = 0.01
    • id = 0.9
    • strand = both
Parameters:
  • mode – if mode is ‘run’ Usearch will be launched.
  • e

    if true, checks if a part of workflow is actually done and doesn’t duplicate this jobs.

    For more information, please refer to e argument in exist_check() function.

  • search_type

    &

  • infile – a name of input file for usearch. Also determines name of [-blast6out in Usearch] output file.
  • database – [-db in Usearch]
  • threads – [-threads in Usearch] Number of threads used in calculations.
GLOBALS:
PATH_USEARCH
xml_counts_graphlan(per_file_tax_tree, xml_names, multi_flat_tax_tree)[source]
Groups and/or sums numbers of nodes (obtained from multi_flat_tax_tree):
  1. by node_name - grouping
  2. by identifier (obtained from filename) - summing
Parameters:
  • per_file_tax_tree – A dict of dicts, etc. Top-level keys are filenames. Taxonomic tree represented by nested dicts. Similar to tax_tree. Example included in description of tax_tree_graphlan().
  • xml_names

    readable identifiers of files derived from filenames. Example:

    ['filename_1', 'filename_2']
    
  • multi_flat_tax_tree
    A dict of dicts, where:
    • keys are filenames,
    • values are dicts, where:
      • keys are names of nodes,
      • values are numbers of nodes with names starting from “node_name”

    Explicit example in description of tax_tree_graphlan().

Returns:

A tuple – xml_dict: A dict. Contains lists of numbers of occurrences of nodes by node.

  • keys are names of nodes.
  • values are lists of constant length, where every item on position X means: number of occurrences of nodes with prefix equal to name of node in file X. file X is that file, which is on position X on list xml_names.

Example:

{node: [count_1, count_2], node_2: [count_1, count_2]}
name_total_count: A dict. Contains summed numbers of occurrences of nodes by file.
  • keys are identifiers (derived from filenames - look for xml_names),
  • values ale sums.

Example:

{identifier_1: count_1, identifier_2: count_1}

Return type:

xml_dict, name_total_count

xml_format(full_dict, tax_dict)[source]

Creates a set of dicts, which are useful to generate xml files from.

Parameters:
  • full_dict

    a dict of dicts with structure like:

    {
        'output_type_1':
        {
            'filename_1.ext':
            {
                'Bacillaceae': {'Anoxybacillus': {'subsum': 15}}
            },
            'filename_2.ext':
            {
                'Listeriaceae':
                {
                    'Listeria':
                    {
                        'Lgrayi': {'subsum': 22},
                        'Linnocua': {'subsum': 11}
                    }
                }
            }
        },
        'output_type_2':
        {
            'filename_2.ext':
            {
                'Bacillaceae': {'Anoxybacillus': {'subsum': 15}}
            }
        }
    }
    

    “output type” typically will be ‘ITS’, ‘16S’.

  • tax_dict
    A dict, where:
    • keys are file identifiers,
    • values are dicts, where:
      • keys are names of taxa,
      • values are numbers of occurrences of particular taxon.

    Example:

    {
        'filename_1.ext':
        {
            'Bacillaceae': 5, 'Anoxybacillus': 5
        },
        'filename_2.ext':
        {
            'Bacillaceae': 1, 'Anoxybacillus': 1,
            'Listeriaceae': 19, 'Listeria': 19,
            'Lgrayi': 16, 'Linnocua': 3
        }
    }
    
Returns:

(xml_names, xml_dict, tax_tree, name_total_count)

xml_names:

readable identifiers of groups of files referring to common dataset; derived from filenames.

Example:

['filename_1', 'filename_2']
xml_dict:
a dict, where:
  • keys are names of taxa,
  • values are lists with counters of occurences of particular taxon in subsequent groups of files. Order on this list is defined by order of names in xml_names list.

Example:

{
    'Anoxybacillus': [5, 1],
    'Bacillaceae': [5, 1],
    'Lgrayi': [0, 16],
    'Linnocua': [0, 3],
    'Listeria': [0, 19],
    'Listeriaceae': [0, 19]
}
tax_tree:

a dict without information about type and file.

Example:

{
    'Bacillaceae':
    {
        'Anoxybacillus': {}
    },
    'Listeriaceae':
    {
        'Listeria':
        {
            'Lgrayi': {}, 'Linnocua': {}
        }
    }
}
name_total_count:

a dict, where keys are names as present in xml_names and values are summed total counts (usually of occurrences of different species).

Example:

{'filename_1': 5, 'filename_2': 20}

Return type:

A tuple

xml_name_parse(full_dict)[source]

Creates a list of simplified filenames (full filenames comes from full_dict). The list is guaranteed to not have any duplicates.

If filename contains sequence of nucleotides from set {A, C, T, G}, separated from other parts of the name by underscore (_), then simplified filename will be substring of filename from beginning to this sequence (inclusive). This set of nucleotides in most cases will represent “index” from naming schema for fastq files, so since this step, many files will belong to one alias.

Otherwise, the first chunk of filename will be used. In this case, filename will be splitted with dot ‘.’ as delimiter. It also indicates grouping files with similar filenames since this step.

Keep in mind, that this step groups files referring to the same dataset under single name, and technically is vulnerable for errors: If filenames given to this functions do not follow appropriate naming schema or when these filenames are not named in prefix-code convention, then some false-positives might be generated in some of functions which use results of xml_name_parse().

Example

Following filenames:

'16S_ArchV3V4_M_BF_02_TAGCTT_L001_001.amplicons.cutadapt.flash.merged.fastq.extendedFrags.fasta.usearch_16S'
'16S_ArchV3V4_M_BF_02_TAGCTT_L001_001.cutadapt.amplicons.cutadapt.flash.merged.fastq.extendedFrags.fasta.usearch_16S'
'Amp45_BFp_B_CAAAAG_L001_R12.fasta.usearch_ITS'
'meta-velvetg.contigs.fa.usearch_ITS'

will become:

'16S_ArchV3V4_M_BF_02_TAGCTT' # note: this groups first two files
'Amp45_BFp_B_CAAAAG'
'meta-velvetg'
Parameters:full_dict

a dict of dicts, with structure like:

{
    'output_type_1':
    {
        'filename_1.ext':
        {
            'Bacillaceae': {'Anoxybacillus': {'subsum': 15}}
        }
    }
}

Longer example avaliable in description of xml_format().

Returns:A list with simplified filenames. The list is guaranteed to not have any duplicates.

Example:

['filename_1', 'filename_2']
xml_names_graphlan(input_d)[source]

Extracts values from given dict, puts the values on a list, and then removes from every value those parts that are prefixes and suffixes common for all items on the list.

Parameters:input_d

A dict where: keys are names of directories, values are lists with filenames, which are in the directory.

Example

{‘dir_name_1’: [‘file_1’,’file_2’], ‘dir_name_2’: [‘file_1’]}

Explicit example in description of output of cat_read().

Returns:A list with all filenames which were given on input, trimmed by common prefixes and suffixes.
xml_prepare(xml_names, xml_dict, tax_tree, name_total_count, unit='reads')[source]

Generates XML string, dedicated to use with Krona. XML is generated with use of default python xml module.

Parameters:
  • xml_names

    a list of readable identifiers of datasets.

    Example:

    ['filename_1', 'filename_2']
    
  • xml_dict

    a dict with lists of occurrences of nodes by node. Keys are names of nodes. Values are lists of constant length, where every item on position X means: number of occurrences of node in dataset on position X on list xml_names.

    Note, that for all i: len(xml_dict[i]) is equal to len(xml_names).

    Example:

    {node_1: [count_1, count_2, ... count_x],
    node_2: [count_1, count_2, ... count_x]}
    
  • tax_tree

    a dict of dicts, etc. In form of nested dicts, represents phylogenetic tree. Example:

    {a:{aa:{}, ab:{aba:{}, abb:{abba:{}}}}}
    
  • name_total_count

    a dict with numbers of occurrences of nodes by dataset. Keys are names of dataset from xml_names, values are counts. Example:

    {identifier_1: count_1, identifier_2: count_2}
    
  • unit – A string. Defines units to be used in the Krona.
Returns:

A single string with generated XML.

HARDCODED:
Number of levels to iterate in tax_tree is hardcoded to 9. As original comment states: ‘however it ignores all absent levels! which is actually kind of cool’
xml_vals(xml_names, tax_dict)[source]

Reformats information extracted from tax_dict into another dict, to allow use of this data in xml-generation process.

Parameters:
  • xml_names

    readable identifiers of groups of files referring to common dataset, derived from filenames.

    Example:

    ['filename_1', 'filename_2']
    
  • tax_dict
    A dict, where:
    • keys are file identifiers,
    • values are dicts, where:
      • keys are names of taxa,
      • values are numbers of occurrences of particular taxon.

    Example:

    {
        'filename_1.ext':
        {
            'Bacillaceae': 5, 'Anoxybacillus': 5
        },
        'filename_2.ext':
        {
            'Bacillaceae': 1, 'Anoxybacillus': 1
            'Listeriaceae': 19, 'Listeria': 19, 'Lgrayi': 16, 'Linnocua': 3
        }
    }
    
Returns: dict:
A dict, where:
  • keys are names of taxa,
  • values are lists with counters of occurrences of particular taxon in subsequent groups of files. Order on this list is defined by order of names in xml_names list.

Example:

{
    'Anoxybacillus': [5, 1],
    'Bacillaceae': [5, 1],
    'Lgrayi': [0, 16],
    'Linnocua': [0, 3],
    'Listeria': [0, 19],
    'Listeriaceae': [0, 19]
}

Warning

Keep in mind, that this step - along with generation of xml_names in xml_name_parse() - groups files referring to the same dataset under a single name, and technically is vulnerable for some errors: If filenames given to these functions do not follow appropriate naming schema or when these filenames are not named in prefix-code convention, then some false-positives might be generated when grouping results and this will influence results.