======================================================================
Summary of changes between GUMPP version 1p (released on 5. 3. 2023)
and GUMPP version 1n (released on 24. 12. 2021)
======================================================================

1. Mothur upgraded from version 1.44.1 to 1.48.0.
2. Processing scripts adapted for Mothur changes.
3. PicRust2 upgraded from version 2.4.1 to 2.5.1.
4. Accelerated file hashing.
5. Parameter rerun_failed_steps now Yes by default.
6. Container built with Singularity 3.10.5 instead of 2.6.1.

======================================================================
Summary of changes between GUMPP version 1n (released on 24. 12. 2021)
and GUMPP version 1m (released on 2. 11. 2021)
======================================================================

1. Until now, GUMPP accepted only absolute paths for its config file
   and for the config directives in_dir and out_dir. Relative paths
   may now be specified in all three cases.

   Internally, GUMPP still uses only absolute paths, so it converts
   (expands) relative paths in the above-mentioned cases into absolute
   ones as early as possible.

   Specifically, in a path that BEGINS with ~ (like
   ~/some_dir_or_file), the leading ~ is expanded into the user's home
   directory (like /home/john, which forms the absolute path
   /home/john/some_dir_or_file).

   A path or file name that does not begin with / is prepended with
   the directory from which GUMPP was launched. This can be made
   explicit by prepending ./ to a relative path.

   The feature is convenient for running GUMPP on HPC facilities,
   where it is not possible to know the absolute paths of the simg and
   input files in advance.
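   The expansion rules above can be sketched in Python as follows.
   This is only an illustration of the described behavior, not GUMPP's
   actual code; the function name to_absolute and its parameters are
   hypothetical, and forms such as ~other_user are deliberately not
   handled.

```python
import os
import posixpath


def to_absolute(path, launch_dir=None, home_dir=None):
    """Expand a config path as described above (illustrative sketch only)."""
    launch_dir = launch_dir or os.getcwd()
    home_dir = home_dir or os.path.expanduser("~")
    if path == "~" or path.startswith("~/"):
        # A leading ~ stands for the user's home directory.
        path = home_dir + path[1:]
    elif not path.startswith("/"):
        # Anything that does not begin with / is taken relative
        # to the directory from which the program was launched.
        path = posixpath.join(launch_dir, path)
    # normpath drops the redundant "./" component.
    return posixpath.normpath(path)


for example in ("~/some_dir_or_file", "./some_dir_or_file", "some_dir_or_file"):
    print(example, "=>", to_absolute(example, "/tmp/unique_job_ID", "/home/john"))
```

   The last two calls yield the same absolute path, which mirrors the
   equivalence of the plain and the ./ form.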
   For example, some_dir_or_file may get expanded to

      /tmp/unique_job_ID/some_dir_or_file

   Alternatively, the same behavior may be requested explicitly with

      ./some_dir_or_file => /tmp/unique_job_ID/some_dir_or_file

   Hence, GUMPP may now be launched as:

      singularity run gumpp_1n.simg ~/relative_path_to_config_file.txt
      singularity run gumpp_1n.simg ./relative_path_to_config_file.txt
      singularity run gumpp_1n.simg relative_path_to_config_file.txt

   In the first case, the path to the config file is relative to the
   user's home directory. In the last two cases, which are equivalent,
   the path is relative to the current working directory of the
   invoking shell.

   The same logic applies to config directives in_dir and out_dir.

   All paths that GUMPP uses internally are displayed on screen as
   absolute ones to aid in diagnosing wrong configurations.

======================================================================
Summary of changes between GUMPP version 1m (released on 2. 11. 2021)
and GUMPP version 1k (released on 18. 9. 2021)
======================================================================

1. All three types of analysis (ASV, OTU and GEN) may be run at once.
   GUMPP tries to run their independent steps in parallel, which
   noticeably speeds up execution. All three config parameters

      analysis_asv = yes
      analysis_otu = yes
      analysis_gen = yes

   may now be set simultaneously. Of course, it is still possible to
   execute only one or two types of analysis, if that is all that is
   needed.

   Note: the initial Mothur steps are the same for all three analyses.
   GUMPP executes them only for one of the enabled analyses and
   recycles the results for the others. This is why the on-screen
   output initially does not show several enabled analyses running in
   parallel. Parallel execution takes place only once the steps of the
   analyses begin to differ.
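   The scheduling idea in the note above (shared steps once, then the
   analysis-specific steps side by side) can be sketched in Python as
   follows. All function names are hypothetical placeholders and the
   use of a thread pool is an assumption made for illustration; the
   notes do not say how GUMPP implements its parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

shared_runs = []  # records how often the shared steps were executed


def run_shared_steps():
    # Placeholder for the initial Mothur steps common to all analyses.
    shared_runs.append(1)
    return "shared-result"


def run_analysis(kind, shared):
    # Placeholder for the steps that are specific to one analysis type.
    return f"{kind} finished using {shared}"


def run_enabled(enabled):
    shared = run_shared_steps()         # executed once, then reused
    with ThreadPoolExecutor() as pool:  # specific parts run concurrently
        results = pool.map(lambda kind: run_analysis(kind, shared), enabled)
        return dict(zip(enabled, results))


print(run_enabled(["asv", "otu", "gen"]))
```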
=====================================================================
Summary of changes between GUMPP version 1k (released on 18. 9. 2021)
and GUMPP version 1j (released on 19. 8. 2021)
=====================================================================

1. A much needed feature: GUMPP now assists in determining the
   mandatory parameters of an analysis. Hence, there is no need to set
   up e.g. Mothur and taxonomy databases separately to deduce the
   parameters. GUMPP is now all that an end user needs for performing
   analyses.

2. Thanks to the above feature, there are now quick-run instructions,
   with which even a new GUMPP user may start analysing her or his
   samples quickly.

3. Integration of additional taxonomy databases. The list is as
   follows:

      Silva_v138.1
      Silva_seed_v138.1
      Silva_v138
      Silva_seed_v138
      Silva_v132
      Silva_seed_v132
      Green_Genes_13_8_99

4. Additional results of analysis are provided by the integrated
   Mothur command summary.single, which executes automatically.

5. Improved descriptions of disk IO errors.

6. Minor bugfixes and improvements of GUMPP and its documentation.

=====================================================================
Summary of changes between GUMPP version 1j (released on 19. 8. 2021)
and GUMPP version 1i (released on 12. 8. 2021)
=====================================================================

1. Bugfixes of the preprocessing Mothur script:
   a. Command classify.otu has been moved immediately after command
      make.shared (i.e. before filter.shared) for all three types of
      analyses.
   b. Removed unnecessary commands list.otulabels.

=====================================================================
Summary of changes between GUMPP version 1i (released on 12. 8. 2021)
and GUMPP version 1h (released on 10. 8. 2021)
=====================================================================

1. Bugfix: automatic collection of selected Mothur results did not
   work properly with external bindings.
=====================================================================
Summary of changes between GUMPP version 1h (released on 10. 8. 2021)
and GUMPP version 1g (released on 9. 8. 2021)
=====================================================================

1. New feature: automatic collection of selected Mothur results. This
   is documented in the config template under section "Automatic
   collection of selected Mothur results".

=====================================================================
Summary of changes between GUMPP version 1g (released on 9. 8. 2021)
and GUMPP version 1f (released on 8. 8. 2021)
=====================================================================

1. Bugfix: GUMPP did not print template substitutions from the config
   file on screen (and consequently not into the screen dump either).
   Hence, the documentation of an analysis was not complete.

=====================================================================
Summary of changes between GUMPP version 1f (released on 8. 8. 2021)
and GUMPP version 1e (released on 7. 8. 2021)
=====================================================================

1. Mothur 1.45.2 downgraded to 1.44.1 due to frequent crashes (e.g.
   corrupted double-linked list).

=====================================================================
Summary of changes between GUMPP version 1e (released on 7. 8. 2021)
and GUMPP version 1d (released on 14. 7. 2021)
=====================================================================

1. Parameters rarepercent (config file:
   msp_filter_shared_rare_percent) and keepties (config file:
   msp_filter_shared_keep_ties) are added to command filter.shared
   within the generic Mothur script.

2. Added the possibility to reduce the set of input sequences with
   Mothur's command sub.sample. This is vital for processing input
   datasets that are too large to be processed in their entirety on
   the given hardware, or that are beyond PicRust2's capabilities.
   As a rule of thumb, with OTU analyses the limiting factor is the
   available amount of RAM (for Mothur command cluster.split, which
   appears to freeze and saturates the system with disk swapping).
   The usual bottleneck with ASV analysis is the inability of PicRust2
   to process the vast amount of generated OTUs. Generally, GENus
   analysis is able to process much larger datasets than the other two
   analyses, but certain limitations apply to it as well.

   When the capabilities of the workflow, its ingredients or the
   underlying hardware are exceeded, GUMPP either terminates with an
   error or appears to be frozen due to inefficient processing. If
   any of this happens, reducing the input set may help.

   Reduction with Mothur's sub.sample is added to the beginning of the
   generic Mothur script with fasta input, or after making contigs
   from input fastq files.

   Subsampling may be activated by config parameter:

      msp_initial_sub_sample_size = ...value for parameter size of
                                    Mothur's command sub.sample

   It is also possible to specify:

      msp_initial_sub_sample_per_sample = ...parameter persample of
            Mothur's sub.sample command; default: false

      msp_initial_sub_sample_with_replacement = ...parameter
            withreplacement of Mothur's sub.sample command;
            default: false

   Please see Mothur's documentation for details:
   https://mothur.org/wiki/sub.sample/

   Note 1: config parameter msp_initial_sub_sample_size must be
   defined for the sub.sample command to be executed. Otherwise, the
   last two parameters described above are ignored.

   Note 2: finding a suitable value of msp_initial_sub_sample_size
   usually requires some trial and error. A value that is too small
   discards too much information from the input dataset. As a result,
   the entire workflow terminates with an error at a certain point of
   processing.
   A value that is too large reduces the input dataset too little, so
   the dataset still saturates the hardware or exceeds other
   limitations of the software components, and the initial problem
   persists. One possibility is to start with a rather small value,
   like 10000, and then increase it by a factor of 10 at a time, like
   10000, 100000, 1000000 (maybe with finer-grained values
   in-between), for as long as the workflow manages to complete the
   analysis. Then increase the value more gradually to probe for
   limitations. The larger the number at which the workflow completes
   its processing, the less information gets lost with subsampling.

   Note 3: a value of msp_initial_sub_sample_size that is too large
   for a given input dataset may also result in an error of the
   subsampling itself. For smaller datasets, the values suggested
   above, like 10000, may be too large. However, in these cases there
   is probably no need for subsampling, since the input dataset is
   small enough to be processed without reduction.

=====================================================================
Summary of changes between GUMPP version 1d (released on 14. 7. 2021)
and GUMPP version 1c (released on 12. 7. 2021)
=====================================================================

1. BUGFIX of a CRITICAL ERROR: sometimes GUMPP fed the wrong biom file
   to PicRust2. Consequently, this step crashed.

2. Added config parameter bind_paths for informing GUMPP about
   externally bound disks through Singularity directive -B (e.g.
   singularity run -B /physical_path:/internal_path). Without this
   config directive, symbolic links in the output directory pointed to
   the wrong location, and symlinked files could be properly accessed
   only from within GUMPP's Singularity container.

3. The number of output files is reduced. GUMPP 1c and earlier made a
   copy or symbolic link of the input and output files of each
   processing step. This cluttered the output directory unnecessarily.
   It also consumed far too much disk space when files were physically
   copied into the output directory instead of being symbolically
   linked there, since the same file was often an output of one step
   and at the same time an input of another step, so it was copied
   several times into the output directory structure.

=====================================================================
Summary of changes between GUMPP version 1c (released on 12. 7. 2021)
and GUMPP version 1b (released on 10. 7. 2021)
=====================================================================

This is a cosmetic release. Processing is not altered in any way.

A noticeable inconvenience of GUMPP version 1b (and earlier) is that
during processing of large datasets the workflow appears to freeze
while hashing or copying large files to the output directory. The
first three improvements from the list below aim to relieve an
operator from guessing whether the workflow is stalled.

1. Hashing of a file shows the file size on screen, which gives an end
   user some clue about the duration of the hashing process.

2. Copying a file to the output directory shows the file name and size
   on screen (previously this operation was not indicated on screen at
   all). Large file sizes give a clue that copying cannot complete
   immediately.

3. Each on-screen message is displayed instantly. Previously, the
   system sometimes cached messages in memory and displayed them
   later, which is rather inconvenient for an end user, especially
   when the workflow appears to stall.

The changes below aim to give incremental improvements to certain
aspects of GUMPP's use.

4. Hashing of a file no longer shows the non-informative top
   directory, which makes the screen less cluttered.

5. When results are symlinked, the output directory name changes from
   history_results to history_symlinks to give an operator a better
   clue that symlinks need to be properly handled.

6. Symlinking of results is enabled by default when history is
   enabled, since otherwise disk consumption more than doubles, which
   is a noticeable burden on disk use with large datasets. Also,
   copying of large files (several terabytes in total) noticeably
   slows down the workflow.

7. Some typos in the on-screen messages and in the internal Mothur
   script were corrected.

=====================================================================
Summary of changes between GUMPP version 1b (released on 10. 7. 2021)
and GUMPP version 1a (released on 7. 6. 2021)
=====================================================================

Please NOTE: the descriptions that follow may appear fairly advanced.
They are given primarily as information about the changes under the
hood. The majority of users need not delve into these features.
However, each feature is fully accessible to any user. GUMPP operators
who are interested in fine-tuning GUMPP's workflow execution may
explore the novelties below to tweak execution according to their
preferences.

Despite great effort and care in the past to make GUMPP execute as
efficiently as possible, there is still plenty of room for
improvement. Features of version 1b focus primarily on streamlining
execution.

1. THE MOST IMPORTANT NOVELTY
   --------------------------
   Every command in a Mothur script is now executed as a separate
   entity, and its results are separately deposited into a repository
   of results. This enables much more efficient workflow re-execution
   after an interruption or after changes of some parameters.

   For example, if the subsample size is changed, upon the workflow
   restart script processing jumps immediately to the subsampling step
   and the steps after it. No previous steps need to be re-executed.
   Similarly, if some filtering parameter changes, re-execution
   continues with the filter command, whereas everything before the
   first affected step is instantly retrieved from the repository of
   results.
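   One way to picture this behavior is a repository keyed by a hash
   that covers a command together with everything that precedes it.
   The Python sketch below is an assumption made for illustration
   only: these notes do not describe how GUMPP builds its repository
   keys, and the function run_script and the command strings are
   hypothetical.

```python
import hashlib


def run_script(commands, repository, execute):
    """Run commands in order, reusing stored results where nothing upstream changed."""
    executed = []
    key = ""
    for command in commands:
        # The key covers this command and, through the previous key,
        # every command before it.
        key = hashlib.sha256((key + "\n" + command).encode()).hexdigest()
        if key not in repository:
            repository[key] = execute(command)
            executed.append(command)
    return executed


repo = {}
first = ["make.contigs(...)", "sub.sample(size=10000)", "filter.shared(mintotal=10)"]
run_script(first, repo, lambda c: "result of " + c)

# Only the subsample size differs in the second run.
second = ["make.contigs(...)", "sub.sample(size=100000)", "filter.shared(mintotal=10)"]
print(run_script(second, repo, lambda c: "result of " + c))
```

   In the second run only the subsampling step and the step after it
   are executed again; the first step is taken from the repository.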
   (Such selective re-execution is possible only when the history
   feature is enabled; history was introduced in GUMPP 1.0 and is
   enabled by default.)

2. Separate performance parameters can be specified for each Mothur
   step by inserting appropriate directives into a Mothur script.

3. Processing of template parameters in Mothur scripts is vastly
   improved. Each numerical template value may now be accompanied by a
   lower and an upper limit. This feature was primarily introduced to
   limit the number of processors for individual Mothur commands.

   Experience reveals that different Mothur commands utilize computing
   resources in vastly different ways. Some of them work efficiently
   with a large number of CPUs, whereas others completely saturate the
   disk when running with too many allocated CPUs. There is simply no
   one-number-fits-all setting that would allow efficient execution of
   the entire workflow.

   The only way to improve the current state of affairs is to
   fine-tune performance (CPU, disk, ...) settings for each Mothur
   command separately. For that purpose, GUMPP version 1b noticeably
   expands the template capabilities of Mothur scripts.

   TODO: actual measurements have not been performed yet, so
   performance parameters are currently set to some generic values.
   Nonetheless, the infrastructure for fine-tuning is in place. If
   someone is willing to help us perform measurements on kinds of
   hardware that we do not have access to, we would be happy to
   include the tuning results in future GUMPP releases
   (bostjan.murovec@fe.uni-lj.si).

4. Instantiation of template parameters is more powerful. Each
   template definition may now consist of four fields, of which only
   the first one is mandatory:

      <<>>

   A default-value field already existed in previous versions of
   GUMPP, but it can now itself be templatized, i.e. it need not be a
   hard-wired constant in a script. Minimal and maximal values may be
   templatized as well.
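   The effect of a default value combined with lower and upper limits
   can be sketched in Python as below. The function resolve and its
   parameter names are hypothetical, and the order in which the two
   kinds of limits are applied is an assumption. The sketch already
   accepts several limits per field, a case that the following
   paragraph describes.

```python
def resolve(value, default=None, minimum=(), maximum=()):
    """Pick the given value or the default, then clamp it to the limits."""
    chosen = default if value is None else value
    if chosen is None:
        raise ValueError("no value and no default available")
    if maximum:
        chosen = min(chosen, min(maximum))  # the smallest upper limit wins
    if minimum:
        chosen = max(chosen, max(minimum))  # the largest lower limit wins
    return chosen


# A request for 128 processors, limited by a user-defined maximum of 16,
# 32 physically present CPUs and a hard-wired limit of 64.
print(resolve(128, maximum=(16, 32, 64)))
# No explicit value: the default applies and stays within the limits.
print(resolve(None, default=8, minimum=(1,), maximum=(64,)))
```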
   Furthermore, the minimal and maximal fields may contain an
   arbitrary number of constants or templatized values, like:

      <<>>

   This way, several limiting factors may be specified. The smallest
   of them imposes the actual upper limit. The same holds for the
   minimal field, except that the largest value in the list determines
   the lower limit.

5. Template values may now be defined by a newly introduced
   <<<#let>>> directive. For example:

      <<<#let MAX_NUMBER_OF_CPU = 16>>>

   ...and used further on...

      make.contigs(...., processors=<<>>

   or

      make.contigs(...., processors=<<>>

   The last example imposes a hard limit of 64 CPUs in addition to the
   one prescribed by the #let definition. This way, the #let value may
   be altered in the future, but regardless of its value, the number
   of processors for executing make.contigs never exceeds 64.

   In the following example, the physically present number of CPUs is
   a value that is automatically provided by GUMPP, and turns out to
   be handy in specifying CPU limits:

      make.contigs(...., processors=<<>>

   In this example, the actually applied number of CPUs is the minimum
   of the user-specified limit MAX_NUMBER_OF_CPU, the physically
   present number of CPUs and the hard-wired number 64.

6. Fragments of Mothur scripts may now be conditionally excluded from
   processing (similarly to conditional compilation of e.g. C/C++
   code). This makes it possible to combine several slightly different
   scripts into one compact version. By relying on this and the other
   features above, the entire diversity of previous GUMPP scripts was
   consolidated into just one generic script.
   Specifically, there is no need for separate versions of scripts for
   processing fastq and fasta files, since conditional exclusions
   eliminate the non-relevant parts:

--------------------------------------------------------------------
<<<#ifdef INPUT_MODE_FASTQ_PAIRED>>>
#
# the beginning of paired-fastq specific processing
#
make.file(inputdir=., type=fastq)
make.contigs(file=stability.paired.files, processors=<<>>)
summary.seqs(fasta=current, processors=<<>>)
screen.seqs(fasta=current, group=current, ...)
#
# the end of paired-fastq specific processing
#
<<<#endif>>>

<<<#ifdef INPUT_MODE_FASTA>>>
#
# the beginning of fasta + groups specific processing
#
summary.seqs(fasta=<<>>, processors=<<>>)
screen.seqs(fasta=current, group=<<>>, ...)
#
# the end of fasta + groups specific processing
#
<<<#endif>>>
--------------------------------------------------------------------

   Template values INPUT_MODE_FASTQ_PAIRED, INPUT_MODE_FASTA,
   INPUT_FASTA_FILE, INPUT_GROUPS_FILE, etc. are automatically
   supplied by GUMPP, as appropriate for the specific circumstances.

   Similarly, there is no need for separate scripts for each type of
   analysis:

--------------------------------------------------------------------
<<<#ifdef ANALYSIS_TYPE_ASV>>>
#
# the part (or possibly several parts) that is (are) specific to ASV analysis
#
...
<<<#endif>>>

<<<#ifdef ANALYSIS_TYPE_OTU>>>
#
# the part (or possibly several parts) that is (are) specific to OTU analysis
#
...
<<<#endif>>>

<<<#ifdef ANALYSIS_TYPE_GEN>>>
#
# the part (or possibly several parts) that is (are) specific to GEN analysis
#
...
<<<#endif>>>
--------------------------------------------------------------------

   Again, template definitions ANALYSIS_TYPE_ASV, ANALYSIS_TYPE_OTU,
   and ANALYSIS_TYPE_GEN are automatically supplied by GUMPP, as
   appropriate for the analysis in question.

   The same applies to subsampling and additional filtering, for which
   previous versions of GUMPP supplied separate scripts.
   Hence, the number of required scripts grew exponentially with the
   provided variations in processing routes. The new template
   capabilities and the resulting consolidation of scripts make the
   situation manageable and much less error prone. Previously, a
   modification that needed to be introduced had to be carefully
   applied to each variation of a script. Since there is now only one
   script, the modification needs to be applied only once.

   GUMPP still possesses a repository of scripts. Different Mothur
   workflows may still be provided if they differ from the generic one
   so much that it is more sensible to prescribe them with separate
   scripts.

7. Specifications of conditional exclusions may consist of an
   arbitrary number of AND or OR conditions. These are specified by &
   or |, respectively, in between template definitions.

      <<<#ifdef ANALYSIS_TYPE_ASV | INPUT_MODE_FASTQ_PAIRED>>>
      # ...some processing commands
      <<<#endif>>>

      <<<#ifdef ANALYSIS_TYPE_ASV & INPUT_MODE_FASTQ_PAIRED>>>
      # ...some processing commands
      <<<#endif>>>

   In the first case, the associated fragment applies if the analysis
   type is ASV or the input consists of fastq files, whereas in the
   second case both of these requirements need to be met for the
   associated script segment to be applied.

   A single #ifdef directive may contain only one type of these
   logical operators. By nesting #ifdef directives, arbitrarily
   complex conditions may be imposed.

8. Script processing is now aware of line continuation. If a line is
   terminated with a backslash character, it is joined with the next
   line AFTER conditional exclusions are applied.
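   The combination of conditional exclusion (items 6 and 7) with line
   continuation can be sketched in Python as follows. This is an
   illustrative reimplementation under stated assumptions, not GUMPP's
   own preprocessor: the functions condition_holds and preprocess are
   hypothetical, template instantiation is left out (literal values
   stand in for template fields), and a backslash is taken as the
   continuation character.

```python
import re

IFDEF = re.compile(r"^<<<#ifdef\s+(.+?)>>>")
ENDIF = re.compile(r"^<<<#endif>>>")


def condition_holds(expression, defined):
    """Evaluate NAME, NAME & NAME ... or NAME | NAME ... (one operator type only)."""
    if "&" in expression and "|" in expression:
        raise ValueError("mixing & and | in one #ifdef is not allowed")
    if "&" in expression:
        return all(name.strip() in defined for name in expression.split("&"))
    return any(name.strip() in defined for name in expression.split("|"))


def preprocess(lines, defined):
    kept, stack = [], []  # stack holds one truth value per open #ifdef
    for line in lines:
        stripped = line.strip()
        opened = IFDEF.match(stripped)
        if opened:
            stack.append(condition_holds(opened.group(1), defined))
        elif ENDIF.match(stripped):
            stack.pop()
        elif all(stack):  # keep a line only if every enclosing #ifdef holds
            kept.append(stripped)
    # Join continuation lines only after the exclusions have been applied.
    joined, pending = [], ""
    for line in kept:
        if line.endswith("\\"):
            pending += line[:-1]
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        joined.append(pending)
    return joined


script = [
    "<<<#ifdef FILTER_SHARED_MIN_ABUND | FILTER_SHARED_MIN_TOTAL>>>",
    "filter.shared(shared=current, \\",
    "<<<#ifdef FILTER_SHARED_MIN_ABUND>>>",
    "minabund=5, \\",
    "<<<#endif>>>",
    "<<<#ifdef FILTER_SHARED_MIN_TOTAL>>>",
    "mintotal=10, \\",
    "<<<#endif>>>",
    "makerare=F)",
    "<<<#endif>>>",
]
print(preprocess(script, {"FILTER_SHARED_MIN_TOTAL"}))
```

   Because joining happens last, an excluded parameter line simply
   disappears from the middle of the command without leaving a gap.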
   Line continuation enables intra-line conditional exclusions:

--------------------------------------------------------------------
<<<#ifdef FILTER_SHARED_MIN_ABUND | FILTER_SHARED_MIN_PERCENT | FILTER_SHARED_MIN_TOTAL>>>
#
# the beginning of filter.shared fragment
#
filter.shared(shared=current, \
<<<#ifdef FILTER_SHARED_MIN_ABUND>>>
minabund=<<>>, \
<<<#endif>>>
<<<#ifdef FILTER_SHARED_MIN_PERCENT>>>
minpercent=<<>>, \
<<<#endif>>>
<<<#ifdef FILTER_SHARED_MIN_TOTAL>>>
mintotal=<<>>, \
<<<#endif>>>
<<<#ifdef FILTER_SHARED_MIN_NUM_SAMPLES>>>
minnumsamples=<<>>, \
<<<#endif>>>
<<<#ifdef FILTER_SHARED_MIN_PERCENT_SAMPLES>>>
minpercentsamples=<<>>, \
<<<#endif>>>
makerare=<<>>)
#
# the end of filter.shared fragment
#
<<<#endif>>> # the entire filter OR condition
--------------------------------------------------------------------

   The first line in the above fragment ensures that Mothur command
   filter.shared is applied only if at least one of the templates
   FILTER_SHARED_MIN_ABUND, FILTER_SHARED_MIN_PERCENT or
   FILTER_SHARED_MIN_TOTAL is defined. The inner #ifdef directives
   then include only the actually defined parameters in the resulting
   command. For example, if only template FILTER_SHARED_MIN_TOTAL is
   defined, the resulting Mothur command is assembled as

      filter.shared(shared=current, mintotal=<<>>, makerare=<<>>)

   where, of course, fragments <<>> and <<>> are replaced by their
   instantiated values at the end of script preparation, like

      filter.shared(shared=current, mintotal=10, makerare=F)

9. Each Mothur command may be accompanied by various directives that
   give the workflow hints about which generated files need to be
   preserved and which can be deleted. This way, it is possible to
   save disk space efficiently. GUMPP provides semi-automatic deletion
   of files based on Mothur's current files feature. However, this
   behavior is disabled by default, since it is sometimes hard for a
   workflow to decide properly which files may be deleted.
   As an illustration, the script fragment below reproduces an actual
   part of GUMPP's generic built-in script. Directives are lines that
   begin with an exclamation mark (!).

      # remove a safe default option !out_globs = *, which is automatically
      # set by the workflow in order to enable smooth execution of plain
      # Mothur scripts without any need for directive hints.
      !global_out_globs

      # directive out_globs lists none, one or several linux wildcard
      # specifications. Files that are hit by any of these wildcards,
      # are not deleted by a workflow upon completion of the Mothur
      # command that the directive applies to. Directive global_out_globs
      # applies settings globally, i.e. until the next global redefinition.

      <<<#ifdef INPUT_MODE_FASTQ_PAIRED>>>
      # fastq files need to be preserved by make.file to be fed to make.contigs,
      !out_globs *.fastq
      make.file(inputdir=., type=fastq)

      # after make.contigs, we may get rid of the "file" entry in the current_files.summary
      !exclude_current file
      make.contigs(file=current, processors=<<>>)

A. Results of each Mothur command are presented separately in the
   output directory, which greatly eases inspection and debugging in
   case something goes wrong.

B. MINOR: comments and empty lines are preserved during template
   instantiation for documentation, clarity and aesthetic reasons.

C. MINOR: a new type of stylistic empty line (denoted by a line with a
   single minus character) may be used for styling final scripts.
   Without this feature, empty lines between templates typically
   accumulate excessively, which makes the resulting scripts look
   weird.

The newly introduced features that are related to Mothur scripts are
interesting primarily to advanced GUMPP users who intend to develop
their own Mothur scripts. Consequently, documentation of
script-related features is moved out of the users' manual into a
separate document.

=====================================================================
Summary of changes between GUMPP version 1a (released on 7. 6. 2021)
and GUMPP version 1.0 (released on 7. 4. 2021)
=====================================================================

1. GUMPP now also accepts input reads in fasta+groups format in
   addition to paired fastq reads.

2. It is possible to skip the subsampling step (Mothur command
   sub.sample).

3. Mothur taxonomy outputs are now also delivered to the end user
   instead of being deleted.