# ========================================================= #
#
# GUMPP: General Unified Microbiome Profiling Pipeline
#
# ========================================================= #
#
# Config Template for
# GUMPP version 1p, released on 2023-Mar-05
#
# Last modification of this document: 2023-Mar-15
#
# GUMPP's web page:
# http://gumpp.fe.uni-lj.si
#
# GUMPP users' manual:
# http://gumpp.fe.uni-lj.si/gumpp_manual.pdf
#
# GUMPP developers:
# Blaz Stres, blaz.stres@fgg.uni-lj.si
# Bostjan Murovec, bostjan.murovec@fe.uni-lj.si
#
# License:
# --------
# Creative Commons Attribution CC BY license
# https://creativecommons.org/licenses
#
#
#
# -------------------------------------------------------------
# If you use GUMPP or its derivatives, please cite:
#
# Bostjan Murovec, Leon Deutsch, Blaz Stres,
# General Unified Microbiome Profiling Pipeline (GUMPP)
# for Large Scale, Streamlined and Reproducible Analysis of
# Bacterial 16S rRNA Data to Predicted Microbial Metagenomes,
# Enzymatic Reactions and Metabolic Pathways.
# Metabolites 2021, 11, 336.
# https://doi.org/10.3390/metabo11060336
#
# Please also cite the software that GUMPP contains.
# Please see chapter Credits in the GUMPP Users' Manual.
# -------------------------------------------------------------
#
#
#
# IMPORTANT: GUMPP is developed and disseminated in good
# faith and with the desire to work according to expectations,
# but the authors DO NOT give any guarantees about its correctness.
#
# USE IT AT YOUR OWN RISK.
#
# The authors cannot be held legally or morally responsible for any
# consequences that may arise from using or misusing GUMPP.
#
# IMPORTANT: GUMPP is a skeleton application for the synergistic
# execution of several externally developed pieces of software.
# These are disseminated as integrated parts of GUMPP to
# provide a user-friendly out-of-the-box experience.
# Nonetheless, every included piece of software remains OWNED
# and COPYRIGHTED by its respective developers.
# Please see chapter Credits in the GUMPP Users' Manual.

#####################################################
#
# General instructions
#
#####################################################

# This config file describes the available parameters
# and is intended to serve as a template for making
# actual GUMPP configuration files. All parameters
# are accompanied by explanations, to make
# this document fairly self-contained.
#
# Nonetheless, this file is not a full substitute for
# the GUMPP users' manual, which presents the information
# in a didactically more suitable way:
# http://gumpp.fe.uni-lj.si/gumpp_manual.pdf
#
# This configuration template is available at:
# http://gumpp.fe.uni-lj.si/config_template.txt

# ----------------------------------------------------------
# Usage:
# ----------------------------------------------------------
#
# Fill in the parameters below according to your preferences.
#
# This document is fairly long because it contains complete
# instructions for setting GUMPP parameters. For production,
# a user will typically remove the majority of comments from
# this file to make it easier to navigate.
#
# Many parameters do not need to be set; in many cases the
# default values suffice. Please do not be overwhelmed by the
# number of available settings below. As the first step, it
# suffices to specify the directory with input files.
#
# --------------------------------------------------------------
#
# To perform the actual analysis, a few MANDATORY parameters
# need to be set.
# Their choice cannot be automated, since the settings
# require the expertise of the operator.
# Nonetheless, GUMPP can help. If at least one of the
# mandatory parameters is not set, the workflow enters
# a special PROBE mode of operation, where your input sequences
# are probed to obtain suggestions (!!!) for the values of the
# mandatory parameters.
#
# Hence, GUMPP is typically used in the following manner.
# First, an initial probing run is done on the input sequences.
# This step requires only the input directory to be
# specified with the parameter in_dir, which is introduced below.
#
# After the probing run is finished, GUMPP provides an
# on-screen report (which is also written to a screen dump
# file in the output directory) together with some
# suggestions for setting the mandatory parameters.
#
# Second, on the basis of this report, proper values
# of the mandatory parameters are set by the operator
# based on his or her judgement. Then GUMPP is re-run
# to execute the intended analysis and deliver the
# expected results.
#
# The probing in the first run takes some time.
# However, the steps taken are the same as for the actual analysis.
# Hence, the steps that are needed for probing need not be
# repeated during the second workflow run, if the history
# feature is enabled (as described further on and in the
# users' manual).
#
# --------------------------------------------------------------
#
# The following syntax applies to configuration files.
# Lines that begin with a hash sign (#) are comments and they
# are ignored by the workflow. Empty lines and lines that
# contain only spaces and tab characters are
# ignored as well.
#
# Each parameter is specified on a single line, which consists
# of a parameter's name, an equal sign (=), and an optional
# parameter value. Names of parameters are lowercase. Spaces
# may optionally be inserted to the left and to the right of the
# equal sign. Everything after the equal sign and any
# spaces that follow it constitutes the value of the parameter.
# The value may be empty, in which case the equal sign still
# needs to be present.
#
# Parameters may be set in any order. Default settings apply
# to the majority of undefined parameters, which eliminates
# the need to define everything. Typically, only a small subset
# of the known parameters is specified in any config file.
# This greatly simplifies the configuration process.

# --------------------------------------------
# The most straightforward way to LAUNCH GUMPP
# --------------------------------------------
#
# The First Step
# ..............
#
# A directory needs to be created and populated with an
# arbitrary number of paired input fastq reads
# (R1 and R2 files). Alternatively, GUMPP may take as an
# input one fasta file that is accompanied by a groups
# file; both are expected to be created by Mothur's command
# make.contigs from fastq reads (and possibly processed
# further with Mothur). Fasta files that are obtained by
# other means are not supported by GUMPP.
#
# Fastq files need to have either extension .fastq or
# .fq, which is a more tolerant requirement in comparison
# to e.g. Mothur, which strictly requires extension .fastq.
# Fasta files need to have either extension .fasta, .fa, or .fna.
#
# Furthermore, GUMPP is tolerant to variations in extension
# capitalization, so extensions may also be e.g. .FASTQ or .FQ.
# These variations may freely exist between input files;
# GUMPP does not require that all files have the same
# extension, since it normalizes file names before subjecting
# them to the analysis (i.e. to Mothur).
#
# Input files may optionally be gzipped, in which case
# they are required to have the extension .gz (like .fastq.gz),
# again in arbitrary capitalization (.FASTQ.GZ,
# .fastq.GZ, etc. are permitted).
#
# Gzipped and unzipped files may be freely intermixed as well.
#
# The input directory is typically located within a user's home
# directory, but it can also be some other directory that is
# mapped into Singularity's internal directory structure. On many
# systems suitable directories are also /scratch and /data. The
# availability depends on the actual Singularity installation.
# Please consult the system's administrator if you are not configuring
# the running system by yourself. HPC systems typically deploy
# some specific settings. Generally, it is necessary to follow
# the instructions that are provided by the HPC personnel.

# The Second Step
# ...............
#
# A configuration file needs to be prepared according to this
# very template file. Within it, it is necessary to specify the
# previously created and populated input directory by setting
# the parameter in_dir, which is described further on. In short,
# the line looks like one of the following possibilities
# (without the leading character #):
#
#in_dir = /absolute/path/to/directory_with_reads
#in_dir = /home/user/path/to/directory_with_reads
#in_dir = ~/path/to/directory_with_reads (the same as above)
#in_dir = ./relative_to_current_dir/directory_with_reads
#in_dir = relative_to_current_dir/directory_with_reads (the same as above)
#
# for example:
#in_dir = /home/john/samples_of_xy

# GUMPP integrates a few taxonomy databases (various Silva versions
# as well as GreenGenes_13_8_99), which are needed for the creation of a
# biom file and the associated fasta file.
# The database silva_v138 is selected by default. A different database
# may be specified with the following parameter.

#taxonomy_db = database name
#
# for example (silva_v138 is the default and need not be set)
#taxonomy_db = silva_v138_1
#taxonomy_db = silva_seed_v138_1
#taxonomy_db = silva_v138
#taxonomy_db = silva_seed_v138
#taxonomy_db = silva_v132
#taxonomy_db = silva_seed_v132
#taxonomy_db = green_genes_13_8_99
#
# The actual list of built-in databases is rendered on screen
# during GUMPP's initialization.
#
# NOTE: a custom database may substitute the built-in ones,
# which is described below in this document as well as in
# the GUMPP users' manual.
#
# NOTE: Based on users' suggestions, we would be happy to integrate
# additional databases that may be interesting for a broad audience.
# If you would like to suggest or contribute a database, please contact
# bostjan.murovec@fe.uni-lj.si.

# GUMPP enables three levels of taxonomic analysis: ASV, OTU and genus (GEN).
# There exist three configuration parameters for selecting the type of analysis.
#
#analysis_asv=yes
#analysis_otu=yes
#analysis_gen=yes

# Several choices may be enabled simultaneously to execute the associated
# analyses in parallel.
#
# The GEN analysis is selected by default, if no analysis is enabled explicitly.
# When several of the above parameters are set, GUMPP tries to execute
# all enabled analyses in parallel, which boosts performance.
#
# For each analysis type GUMPP provides an appropriate Mothur script
# for processing reads, as well as for constructing a biom file and
# its associated fasta file. These two files are then fed into PicRust2.
# Inputs for the Piphillin server are also prepared from these data. Relevant
# Mothur shared files that may be interesting for further downstream
# analysis are also delivered to the end user.
#
# Note: the initial Mothur steps for all three analyses are the same.
# GUMPP takes care to execute these only for one of the enabled analyses,
# and recycles the results for the others. This is the reason why, initially,
# the on-screen output does not show several enabled analyses being executed
# in parallel. Only when the analysis steps begin to differ between the analyses
# does their parallel execution take place.

# In order to use the provided Mothur scripts, it is necessary to specify
# some template parameters that are embedded into them, as described below.

# -------------------------------------------------------------------------------------------
# The parameters that are required to be supplied to carry out the entire analysis are:

#msp_screen_seq_start = start of a sequence in an alignment to a reference database
#msp_screen_seq_end = end of a sequence in an alignment to a reference database

# for example (do NOT use these numbers blindly, please provide values that apply to your data):
#msp_screen_seq_start = 6388
#msp_screen_seq_end = 25316

# If one or both of these parameters is not set, then GUMPP automatically enters the probing
# mode of operation to deliver a report, which aids in determining their proper values.
# -------------------------------------------------------------------------------------------

# Although not strictly necessary, you SHOULD also set the following parameters:

#msp_screen_seq_min_length = minimal valid sequence length (for filtering out unsuitable ones)
#msp_screen_seq_max_length = maximal valid sequence length (for filtering out unsuitable ones)
#
# for example (do NOT use these numbers blindly, please provide values that apply to your data):
#msp_screen_seq_min_length = 430
#msp_screen_seq_max_length = 465
#
# the default values are 0 and 100,000, respectively, so that sequences are not
# filtered by default based on their length. However, sequences of infeasible lengths
# worsen the quality of an input dataset.
#
# The previously mentioned report that is generated during the probing phase also
# suggests values for setting these two parameters.

# Depending on the use case, the following parameter may also be of value for an analysis.

#msp_sub_sample_size = equal number of sequences per sample when subsampling
#                      a large group of samples
#
# for example (do NOT use this number blindly, please provide a value that applies to your data):
#msp_sub_sample_size = 3000

# NOTE: If subsampling is not required, it can be disabled by not setting this parameter.
# This subsampling is based on a Mothur shared file, and is completely different
# from the initial subsampling described below.

# PLEASE NOTE: The values that are given above as examples are strongly
# specific to a certain use case. They are by no means transferable to
# generic situations. Unless reasonable figures that apply to YOUR
# particular case are supplied, the results of the analysis will be
# meaningless, or very likely the workflow will terminate with an error,
# since all input sequences will be recognized as inappropriate by the script.
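# ---------------------------------------------------------------------------
# ILLUSTRATION ONLY: a minimal sketch of the typical two-run usage described
# above. All paths and numbers below are hypothetical placeholders; take the
# actual values from the report of your own probing run.
#
# First (probing) run -- only the input directory is specified, so GUMPP
# enters the PROBE mode and prints suggestions for the mandatory parameters:
#
#   in_dir = /home/john/samples_of_xy
#
# Second (analysis) run -- the same config, completed with values chosen by
# the operator on the basis of the probing report:
#
#   in_dir = /home/john/samples_of_xy
#   msp_screen_seq_start      = 6388
#   msp_screen_seq_end        = 25316
#   msp_screen_seq_min_length = 430
#   msp_screen_seq_max_length = 465
#
# With the history feature enabled (the default), the second run reuses the
# steps already performed during probing.
# ---------------------------------------------------------------------------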
# -----------------------------------------
# Dealing with large datasets
# -----------------------------------------
#
# Sometimes the input dataset is too large to be processed on the
# actual hardware, or it exceeds the limitations of some piece of
# software that is integrated into GUMPP. In order to deal with
# situations like this, it is possible to reduce the
# set of input sequences with Mothur's command sub.sample.
#
# As a rule of thumb, with OTU analyses the limiting factor is the
# available amount of RAM (for Mothur's command cluster.split, which
# appears to freeze and saturates the system with disk swapping).
# The usual bottleneck with the ASV analysis is the inability of PicRust2
# to process the vast number of generated ASVs. Generally, the GENus
# analysis is able to process much larger datasets than the other
# two analyses, but certain limitations apply to it as well.
#
# When the capabilities of the workflow, its ingredients or the
# underlying hardware are exceeded, the GUMPP application either
# terminates with an error or appears to be frozen due to
# inefficient processing.
#
# If any of this happens, reducing the input set may help.
# Reduction with Mothur's sub.sample is added to the beginning of
# pre-processing; precisely, to the point where the fasta format of the
# input sequences is available (after making contigs from the input
# fastq files, or directly from a fasta input).
#
# Initial subsampling (in contrast to the above-mentioned subsampling,
# which happens only after certain preprocessing steps and
# screening of sequences) may be activated by the config parameter:
#
# msp_initial_sub_sample_size = ...value for parameter size of
#                               Mothur's command sub.sample
#
# It is also possible to specify:
#
# msp_initial_sub_sample_per_sample = ...parameter persample
#                                     of Mothur's sub.sample command;
#                                     default: false
#
# msp_initial_sub_sample_with_replacement = ...parameter withreplacement
#                                     of Mothur's sub.sample command;
#                                     default: false
#
# Please see Mothur's documentation for details:
# https://mothur.org/wiki/sub.sample/
#
# If feasible, it is recommended to use the previously
# mentioned subsampling that is activated by the parameter
# msp_sub_sample_size, since that subsampling operates on
# a suitably prepared subset of sequences and gives more
# stable or semantically meaningful results.
# Initial subsampling with the parameter
# msp_initial_sub_sample_size should be used only
# as a last resort, if the input set is so large that even
# the preprocessing steps fail, in which case the more
# appropriate subsampling cannot be utilized either.

#msp_initial_sub_sample_size = 100000
#msp_initial_sub_sample_size = 1000000
#msp_initial_sub_sample_size = 10000000

#msp_initial_sub_sample_per_sample = T
#msp_initial_sub_sample_per_sample = F

#msp_initial_sub_sample_with_replacement = T
#msp_initial_sub_sample_with_replacement = F

# Note 1: the config parameter msp_initial_sub_sample_size must be
# defined for the sub.sample command to be executed. The other two
# parameters described above are ignored otherwise.
#
# Note 2: usually, some trial & error experimenting with the
# parameter msp_initial_sub_sample_size is required. A value that is
# too small will cause too much information in the input dataset to
# be discarded. As a result, the entire workflow will terminate with
# an error at a certain point of processing.
# A value that is too large will result in
# too little reduction of the input dataset, which will still
# saturate the hardware or exceed other limitations of the
# software components, so the initial problem will persist.
#
# One possibility is to start with some rather small value, like
# 10000, and then increase it exponentially by a factor of 10,
# like 10000, 100000, 1000000 (possibly with finer-grained values
# in between), until the workflow manages to complete the analysis.
# Then, more gradually increase the value to probe for the limitations.
# The larger the number at which the workflow completes its processing,
# the less information gets lost with subsampling.
#
# Note 3: A value of msp_initial_sub_sample_size that is too large for a
# given input dataset may also result in an error of the subsampling
# itself. For smaller datasets, the values suggested above, like
# 10000, may be too large. However, in these cases there is probably
# no need for subsampling, since the input dataset is small
# enough to be processed without reduction.

# -----------------------------------------
# Other parameters
# -----------------------------------------

# There is another parameter that you may want to set from time to time.
# It is injected into PicRust2's command line.

#params_picrust2 = --min_align xxx

# (copied verbatim from PicRust2 help)
# Proportion of the total length of an input query
# sequence that must align with reference sequences. Any
# sequences with lengths below this value after making
# an alignment with reference sequences will be excluded
# from the placement and all subsequent steps.
#
# According to the PicRust2 help, the default value is 0, but it looks
# like the value 0.8 applies by default. If the input dataset is not of
# fairly good quality, no hits result unless the value of this parameter
# is lowered explicitly. However, lowering this value enables more
# noise to sneak into the results and trigger false conclusions. The use of
# this parameter requires great care and caution. Generally,
# if this parameter needs to be lowered, then the input data or the above
# mandatory settings are at least suspicious.
#
# for example (warning: values lower than 0.8 increase the possibility of false conclusions):
#params_picrust2 = --min_align 0.7
#
# NOTE: the value of the parameter params_picrust2 is injected verbatim into
# PicRust2's command line. With it you may set any PicRust2 parameter that
# you want, except the ones that concern file names and other settings
# that the GUMPP workflow provides by itself (like the number of processors, ...).

# GUMPP enables fasta representatives to be fed into PicRust2 in four different
# forms. The first one does not alter the fasta sequences that are produced by a
# Mothur script. The other possibilities are that sequences are reversed, complemented
# or reverse-complemented prior to being subjected to PicRust2.
# This may come in handy if the orientation produced by the
# sequencing process cannot be assured.

# All four possibilities may be executed within a single workflow run for
# easier comparison of results in the case of doubt or uncertainty regarding
# read orientation. However, this requires four PicRust2 runs (only PicRust2,
# not the entire workflow) and takes correspondingly more time and computing
# resources to complete.
#
# The following parameters determine the possibilities to be executed.
# for example (orientation_original is the default and need not be set)
#orientation_original = Yes
#orientation_reverse = Yes
#orientation_complement = Yes
#orientation_reverse_complement = Yes

# The Third Step
# ..............
#
# Execute GUMPP by invoking its Singularity image. The config file
# may be specified in one of the following ways.
#
# singularity run path_to/gumpp_1p.sif /abs/path_to/config_file.txt
# singularity run path_to/gumpp_1p.sif /home/user/path_to/config_file.txt
# singularity run path_to/gumpp_1p.sif ~/path_to/config_file.txt (the same as above)
# singularity run path_to/gumpp_1p.sif ./relative_to_current_dir/path/config_file.txt
# singularity run path_to/gumpp_1p.sif relative_to_current_dir/path/config_file.txt (the same as above)

# -----------------------------------------
# Changing bind paths for proper symbolic
# linking of files in the output directory
# -----------------------------------------
#
# By default, GUMPP may only access files that are located in
# a user's home directory (like /home/john_doe/some_data/...).
#
# Sometimes input or output files are located on an external disk
# that needs to be bound into Singularity's internal directory
# structure. Let us suppose that GUMPP's input files are located
# on a disk that is mounted on /mnt/large_disk.
#
# In order for GUMPP to be able to access files on the disk,
# the disk's mount point needs to be bound into Singularity's
# file system. Singularity provides a bind directive (-B) for
# this purpose. For example:
#
# singularity run -B /mnt/large_disk:/data /path_to/gumpp_1p.sif abs_or_relative/path_to/config_file.txt
#
# The bind directive consists of a specification of a physical mount
# point (/mnt/large_disk in this case) and a path where that
# disk is visible internally to Singularity (/data in this case).
#
# Note that a bind point needs to already exist within the container.
# GUMPP provides two paths, /data and /scratch, for this purpose.
# When GUMPP is invoked in the above manner, it sees files that are
# physically located under /mnt/large_disk/sub_dirs/... as /data/sub_dirs/...
# Consequently, the in_dir and out_dir config parameters need to be altered
# accordingly, to point to the rebound locations. GUMPP does not know
# anything about physical file paths. In accordance with the above binding,
# the in_dir directive may look something like this:
#
#in_dir = /data/some_input_directory

# Furthermore, since GUMPP knows only the internal file paths, but
# a user inspects the resulting files outside of the Singularity
# environment, the symbolic links that are created during
# GUMPP's run need to reflect the external directory structure.
# In order to create proper symbolic links, GUMPP needs a hint about
# the applied bind paths, which may be specified by the config
# directive bind_paths. For the example above, the config
# directive would look something like this:

#bind_paths = /mnt/large_disk:/data

# Several bind paths separated by spaces may be specified on the same line:

#bind_paths = /mnt/large_disk:/data /media/usb_key3:/scratch

# This config directive is needed only when Singularity's bind path
# feature is used. If all input and output files are located within
# the user's home directory, then their paths are trivially
# mapped within GUMPP's Singularity container.
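# ---------------------------------------------------------------------------
# ILLUSTRATION ONLY: a consolidated sketch of the bind-path setup described
# above, using the same hypothetical mount point /mnt/large_disk. Adjust the
# paths to your own system.
#
# Invocation with the bind directive:
#
#   singularity run -B /mnt/large_disk:/data path_to/gumpp_1p.sif ~/path_to/config_file.txt
#
# Matching lines in config_file.txt (internal, container-visible path as the
# value of in_dir; physical-to-internal mapping in bind_paths):
#
#   in_dir     = /data/some_input_directory
#   bind_paths = /mnt/large_disk:/data
# ---------------------------------------------------------------------------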
# -----------------------------------------
# Custom taxonomy database
# -----------------------------------------
#
# GUMPP's built-in taxonomy databases may be replaced by custom ones.
# The replacement is placed in a subdirectory resources within the
# directory with input reads. For example, if the previously
# described parameter in_dir is set as follows:

#in_dir = /home/john/test_sequences
#
# then the database replacement is placed within the directory
# /home/john/test_sequences/resources
#
# In order to replace the built-in taxonomy databases,
# the subdirectory resources needs to contain exactly
# one file with extension .align (fasta patterns)
# and one file with extension .tax (taxonomy classification).
# Their formats need to be compatible with the formats
# of Mothur's taxonomy databases.
# Please see: https://mothur.org/wiki/taxonomy_outline
#
# If neither of these two files exists, then the built-in
# database applies. If only one of the two files exists
# but not the other, then the workflow aborts with an error,
# in order to prevent an accidental mix-up of built-in and
# user-supplied resources.
#
# Note that as soon as the appropriate files exist within the
# subdirectory resources, they apply regardless
# of the value of the config parameter taxonomy_db.

#####################################################
#
# Customization of a general workflow behavior
#
#####################################################

# OPTIONAL: out_dir (default: the same as in_dir)
#
# Specification of the output directory, where the workflow
# places generated files. Everything that the workflow generates
# is placed into subfolders of out_dir, so the output
# location may be the same as the directory with input files.
# The same out_dir location may also be shared among
# several GUMPP runs.
#
# On the other hand, there are several reasons to
# select out_dir at a different location.
# It may be that the disk/partition with input files does
# not have enough space to hold the intermediate and
# resulting files. Also, the disk with in_dir may be a
# slow one (e.g. a USB key), from which it is prudent
# only to read input files.
#
#out_dir = /absolute/convenient_output_location
#out_dir = /home/some_user/convenient_output_location
#out_dir = ~/convenient_output_location (the same as above)
#out_dir = ./relative_to_current_dir/convenient_output_location
#out_dir = relative_to_current_dir/convenient_output_location (the same as above)

# OPTIONAL: verbose (default: No)
#
# Set to Yes for a more detailed on-screen description
# of the ongoing progress and performance tuning.
# These are primarily useful for workflow debugging.
#
# NOTE: if you report an issue to us, please set
#       this option to Yes, and send us the
#       resulting report directory.
#
# Enabling or disabling of the verbose mode may be
# overridden by the command-line options
# (+verbose, -verbose).
# Please consult the users' manual for details.

# for example (No is the default and need not be set)
#verbose=Yes
#verbose=No

# OPTIONAL: number_of_threads
#           (default: as many as there are CPUs)
#
# Number of threads to use for parallel execution.
#
# If the parameter is not specified, the number of
# available processors is determined by querying
# the underlying operating system.
#
# If your intention is not to consume all available
# resources (e.g. because the same hardware executes
# some other calculations in parallel), then this
# parameter may be set to a LOWER value than the number
# of available CPUs. Setting this number to a larger
# value than the number of CPUs DECREASES computational
# speed (but has no other adverse consequences).
#
# Another reason for lowering this number below the
# actual CPU count is to lower memory consumption or
# disk utilization. Some programs spawn too many disk-
# intensive threads when there are enough available
# CPUs, which leads to a saturation of the disk subsystem,
# noticeably lowers the overall performance, and
# makes the system unresponsive. If experience shows
# that the workflow (in fact, some of its external
# programs) consumes too much memory or becomes too disk-
# intensive, sometimes (but not always) the issue may be
# alleviated by reducing the number of threads that
# execute in parallel. The next parameter is built into
# the workflow exactly for this purpose.

#number_of_threads = 4

# OPTIONAL: force_picrust2_single_thread
#
# When this parameter is set to yes, it forces PicRust2
# to run as a single-threaded application.
#
# This is a workaround for certain PicRust2 crashes
# that were reported by some users.
#
# Some (but NOT all) of these cases could be resolved
# by forcing PicRust2 to run as a single-threaded
# application. Hence, this option is supposed to be
# used only if PicRust2 crashes are experienced.

#force_picrust2_single_thread = yes

#####################################################
#
# History feature; please see chapter
#
# Time machine and more
#
# in the GUMPP users' manual
#
#####################################################

# OPTIONAL: enable or disable the history feature.
#           The default is enabled.

#preserve_history = yes
#preserve_history = no

# OPTIONAL: re-execute past failed steps.
#           The default is Yes.
#           This option is ignored if the history
#           feature is disabled.

#rerun_failed_steps = Yes
#rerun_failed_steps = No

# OPTIONAL: deliver symbolic links to the actual
#           resulting files in the repository
#           of past results (default), or make
#           a full copy of each resulting file
#           in the output directory.
#           This option is ignored if the history
#           feature is disabled.

#symlink_results = Yes
#symlink_results = No
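# ---------------------------------------------------------------------------
# ILLUSTRATION ONLY: a complete minimal configuration, assembled from the
# parameters described in this template. All paths and numeric values are
# hypothetical placeholders; derive the real ones from your own data and
# from the report of a probing run.
#
#   in_dir                    = /home/john/samples_of_xy
#   out_dir                   = /home/john/samples_of_xy_results
#   taxonomy_db               = silva_v138
#   analysis_gen              = yes
#   analysis_asv              = yes
#   msp_screen_seq_start      = 6388
#   msp_screen_seq_end        = 25316
#   msp_screen_seq_min_length = 430
#   msp_screen_seq_max_length = 465
#   msp_sub_sample_size       = 3000
#   number_of_threads         = 4
#   verbose                   = No
#
# A corresponding invocation (see The Third Step above):
#
#   singularity run path_to/gumpp_1p.sif ~/path_to/config_file.txt
# ---------------------------------------------------------------------------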