======================================================================
Summary of changes between GUMPP version 1p (released on  5.  3. 2023)
                       and GUMPP version 1n (released on 24. 12. 2021)
======================================================================

1. Mothur upgraded from version 1.44.1 to 1.48.0.

2. Processing scripts adapted for Mothur changes.

3. PicRust2 upgraded from version 2.4.1 to 2.5.1.

4. Accelerated file hashing.

5. Parameter rerun_failed_steps now Yes by default.

6. Container built with Singularity 3.10.5 instead of 2.6.1.


======================================================================
Summary of changes between GUMPP version 1n (released on 24. 12. 2021)
                       and GUMPP version 1m (released on  2. 11. 2021)
======================================================================

1. Until now, GUMPP accepted only absolute paths for its config
   file and for the config directives in_dir and out_dir. Relative
   paths may now be specified in all three cases.

   Internally, GUMPP still uses only absolute paths. Relative paths
   in the above-mentioned cases are therefore converted (expanded)
   into absolute ones as early as possible.

   Specifically, in a path that BEGINS with ~ (like ~/some_dir_or_file),
   the leading ~ gets expanded into a user's home directory
   (like /home/john to form absolute path /home/john/some_dir_or_file).

   A path or file name that begins with neither / nor ~ is prefixed
   with the directory from which GUMPP was launched. This can be made
   explicit by prepending ./ to a relative path.

   The feature is convenient for running GUMPP on HPC facilities,
   where it is not possible to know the absolute paths of the simg
   and input files in advance.

   For example, some_dir_or_file may get expanded to
   /tmp/unique_job_ID/some_dir_or_file

   Alternatively, the same behavior may be requested explicitly:
   ./some_dir_or_file => /tmp/unique_job_ID/some_dir_or_file

   Hence, GUMPP may now be launched as:
   singularity run gumpp_1n.simg ~/relative_path_to_config_file.txt
   singularity run gumpp_1n.simg ./relative_path_to_config_file.txt
   singularity run gumpp_1n.simg relative_path_to_config_file.txt

   In the first case, the path to the config file is given relative
   to the user's home directory.
   In the last two cases, which are equivalent, the path to the
   config file is given relative to the current working directory
   of the invoking shell.

   The same logic applies to config directives in_dir and out_dir.

   All paths that are used by GUMPP internally are displayed on
   screen as absolute ones to aid in diagnosing wrong configurations.
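
   The expansion rules above can be illustrated with a minimal Python
   sketch. It is not GUMPP's actual code: the function to_absolute and
   its parameters launch_dir and home are hypothetical names that are
   introduced here only for illustration.

```python
import os


def to_absolute(path, launch_dir=None, home=None):
    """Expand a leading ~ and turn a relative path into an absolute one.

    launch_dir and home are hypothetical parameters of this sketch; they
    default to the current working directory and the user's home directory.
    """
    launch_dir = launch_dir or os.getcwd()
    home = home or os.path.expanduser("~")
    if path == "~" or path.startswith("~/"):
        # A leading ~ stands for the user's home directory.
        path = home + path[1:]
    if not path.startswith("/"):
        # Everything else that does not begin with / is taken relative
        # to the directory from which the program was launched.
        path = os.path.join(launch_dir, path)
    # normpath removes a redundant ./ component.
    return os.path.normpath(path)


if __name__ == "__main__":
    for p in ("~/some_dir_or_file", "./some_dir_or_file", "some_dir_or_file"):
        print(p, "=>", to_absolute(p, "/tmp/unique_job_ID", "/home/john"))
```

   With the launch directory /tmp/unique_job_ID and the home directory
   /home/john, the three inputs map to the paths that are described
   in the text above.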


======================================================================
Summary of changes between GUMPP version 1m (released on  2. 11. 2021)
                       and GUMPP version 1k (released on 18.  9. 2021)
======================================================================

1. All three types of analysis (ASV, OTU and GEN) may be run at once.
   GUMPP tries to run their independent steps in parallel, which
   noticeably boosts the execution.

   Now, all three config parameters:
   analysis_asv = yes
   analysis_otu = yes
   analysis_gen = yes
   may be set simultaneously. Of course, it is still possible to
   execute only one (or two) types of analysis, if that is all
   that is needed.

   Note: the initial Mothur steps are the same for all three analyses.
   GUMPP executes them only for one of the enabled analyses and
   reuses the results for the others. For this reason, the on-screen
   output initially does not suggest that several enabled analyses
   are executed in parallel. Parallel execution takes place only
   once the processing steps begin to differ between the analyses.


=====================================================================
Summary of changes between GUMPP version 1k (released on 18. 9. 2021)
                       and GUMPP version 1j (released on 19. 8. 2021)
=====================================================================

1. A much needed feature: GUMPP now assists in determining
   mandatory parameters of analysis. Hence, there is no need
   to make a separate setup of e.g. Mothur and taxonomy databases
   to deduce the parameters. Now, GUMPP is all that an end-user needs
   for performing analyses.

2. Due to the above feature, there now exist quick-run instructions,
   according to which even a new GUMPP user may start analysing
   her or his samples quickly.

3. Integration of additional taxonomy databases. The list is as follows:
   Silva_v138.1
   Silva_seed_v138.1
   Silva_v138
   Silva_seed_v138
   Silva_v132
   Silva_seed_v132
   Green_Genes_13_8_99

4. Additional results of analysis are provided with integrated
   Mothur summary.single command, which executes automatically.

5. Improved descriptions of disk IO errors.

6. Minor bugfixes and improvements of GUMPP and its documentation.


=====================================================================
Summary of changes between GUMPP version 1j (released on 19. 8. 2021)
                       and GUMPP version 1i (released on 12. 8. 2021)
=====================================================================

1. bugfixes of preprocessing Mothur script:

   a. Command classify.otu has been moved immediately after command
      make.shared (i.e. before filter.shared) for all three
      types of analyses.

   b. Removed unnecessary commands list.otulabels.


=====================================================================
Summary of changes between GUMPP version 1i (released on 12. 8. 2021)
                       and GUMPP version 1h (released on 10. 8. 2021)
=====================================================================

1. bugfix: automatic collection of selected Mothur results
   did not work properly with external bindings.


=====================================================================
Summary of changes between GUMPP version 1h (released on 10. 8. 2021)
                       and GUMPP version 1g (released on  9. 8. 2021)
=====================================================================

1. new feature: automatic collection of selected Mothur results.
   This is documented in the config template under section
   "Automatic collection of selected Mothur results".


=====================================================================
Summary of changes between GUMPP version 1g (released on  9. 8. 2021)
                       and GUMPP version 1f (released on  8. 8. 2021)
=====================================================================

1. bugfix: GUMPP did not print on screen (and consequently into
           screen dump) template substitutions from config file.
           Hence, documentation of an analysis was not complete.


=====================================================================
Summary of changes between GUMPP version 1f (released on  8. 8. 2021)
                       and GUMPP version 1e (released on  7. 8. 2021)
=====================================================================

1. Mothur 1.45.2 downgraded to 1.44.1 due to frequent crashes
   (e.g. corrupted double-linked list).


=====================================================================
Summary of changes between GUMPP version 1e (released on  7. 8. 2021)
                       and GUMPP version 1d (released on 14. 7. 2021)
=====================================================================

1. parameters rarepercent (config file: msp_filter_shared_rare_percent)
   and keepties (config file: msp_filter_shared_keep_ties) are added
   to command filter.shared within the generic Mothur script.

2. Added possibility to reduce the set of input sequences with Mothur's
   command sub.sample. This is vital to be able to process input
   datasets that are too large to be processed in their entirety on a
   given hardware, or that are beyond the PicRust2's capabilities.

   As a rule of thumb, with OTU analyses the limiting factor is the
   available amount of RAM (for Mothur command cluster.split, which
   appears to freeze, and saturates the system with disk swapping).
   The usual bottleneck with ASV analysis is inability of PicRust2
   to process the vast amount of generated OTUs. Generally, GENus
   analysis is able to process much larger datasets than the other
   two analyses, but certain limitations apply to it as well.

   When the capabilities of the workflow, its components or the
   underlying hardware are exceeded, GUMPP either terminates
   with an error or appears to be frozen due to inefficient
   processing.

   If any of this happens, reducing the input set may help.
   Reduction with Mothur's sub.sample is added to the beginning of
   the generic Mothur script with fasta input, or after making
   contigs from input fastq files.

   Subsampling may be activated by config parameter:
   msp_initial_sub_sample_size = ...value for parameter size of
                                    Mothur's command sub.sample

   It is also possible to specify:
   msp_initial_sub_sample_per_sample = ...parameter persample
                          of the Mothur's sub.sample command;
                          default: false

   msp_initial_sub_sample_with_replacement = ...parameter
        withreplacement of the Mothur's sub.sample command;
        default: false

   Please, see Mothur's documentation for details:
                             https://mothur.org/wiki/sub.sample/

   Note 1: config parameter msp_initial_sub_sample_size must be
   defined for the sub.sample command to be executed. The last two
   parameters described above are ignored otherwise.

   Note 2: finding a suitable value of msp_initial_sub_sample_size
   usually requires some trial and error. Too small a value discards
   too much information from the input dataset, so the entire
   workflow terminates with an error at a certain point of
   processing. Too large a value reduces the input dataset too
   little, so it still saturates the hardware or exceeds other
   limitations of the software components, and the initial problem
   persists.

   One possibility is to start with some rather small value, like
   10000, and then increase it exponentially by a factor of 10,
   like 10000, 100000, 1000000 (maybe with finer-grained values
   in between), until the workflow manages to complete the analysis.
   Then, increase the value more gradually to probe for limitations.
   The larger the number at which the workflow completes its processing,
   the less information gets lost with subsampling.

   Note 3: Too large a value of msp_initial_sub_sample_size for a
   given input dataset may also result in an error of the subsampling
   itself. For smaller datasets, the above suggested values like
   10000 may be too large. However, in these cases there is probably
   no need for subsampling, since the input dataset is small
   enough to be processed without its reduction.
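
   The probing strategy from Note 2 can be sketched in Python as
   follows. This is only an illustration under stated assumptions:
   the callback workflow_completes is hypothetical and stands for
   "run GUMPP with this msp_initial_sub_sample_size and report whether
   it finished without error"; GUMPP itself does not provide such
   a function.

```python
def probe_sub_sample_size(workflow_completes, start=10000, factor=10,
                          max_size=10**9):
    """Search for a large sub-sample size at which the workflow completes.

    workflow_completes is a hypothetical callback: it receives a size and
    returns True if a run with that size finished without error.
    Returns the largest size that was seen to complete, or None.
    """
    size = start
    # Phase 1: increase exponentially until the workflow completes.
    while not workflow_completes(size):
        size *= factor
        if size > max_size:
            return None  # nothing in the probed range worked
    best = size
    # Phase 2: increase more gradually (in steps of the first working
    # size) until a run fails or the next exponential step is reached.
    candidate = best + size
    while candidate < size * factor and workflow_completes(candidate):
        best = candidate
        candidate += size
    return best


if __name__ == "__main__":
    # Simulated workflow: sizes that are too small or too large fail.
    print(probe_sub_sample_size(lambda n: 50000 <= n <= 450000))
```

   Each call of the callback corresponds to a complete workflow run,
   so in practice the number of probing steps is kept small.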


=====================================================================
Summary of changes between GUMPP version 1d (released on 14. 7. 2021)
                       and GUMPP version 1c (released on 12. 7. 2021)
=====================================================================

1. BUGFIX of a CRITICAL ERROR: sometimes GUMPP fed wrong biom
   file to PicRust2. Consequently, this step crashed.

2. Added config parameter bind_paths for informing GUMPP about
   externally bound disks through Singularity option -B
   (e.g. singularity run -B /physical_path:/internal_path).
   Without this parameter, symbolic links in the output
   directory pointed to the wrong location, and symlinked
   files could be properly accessed only within
   GUMPP's Singularity container.
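
   The reason why GUMPP needs to know the bindings can be illustrated
   with a small Python sketch that translates a path as seen inside the
   container into its location on the host. The function name and the
   list-of-pairs format are assumptions of this sketch; the actual
   syntax of bind_paths is documented in the config template.

```python
def to_physical_path(internal_path, bind_paths):
    """Map a path seen inside the container to its location on the host.

    bind_paths is a list of (physical_path, internal_path) pairs, in the
    same physical:internal order as Singularity's -B option. This list
    format is an assumption made for the sketch.
    """
    # Try the longest internal prefix first, so nested binds resolve correctly.
    ordered = sorted(bind_paths, key=lambda pair: len(pair[1]), reverse=True)
    for physical, internal in ordered:
        internal = internal.rstrip("/")
        if internal_path == internal or internal_path.startswith(internal + "/"):
            return physical.rstrip("/") + internal_path[len(internal):]
    # Not under any bind: the path is the same on both sides.
    return internal_path


if __name__ == "__main__":
    binds = [("/physical_path", "/internal_path")]
    print(to_physical_path("/internal_path/out/result.biom", binds))
```

   A symbolic link that is created with the translated target remains
   valid when the output directory is inspected outside the container.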

3. The number of output files is reduced. GUMPP 1c and earlier
   made a copy or symbolic link of the input and output files of
   each processing step. This cluttered the output directory
   unnecessarily. It also consumed far too much disk space when
   files were physically copied into the output directory instead
   of being symbolically linked there: the same file was often
   an output of one step and at the same time an input of another
   step, so it was copied several times into the output
   directory structure.


=====================================================================
Summary of changes between GUMPP version 1c (released on 12. 7. 2021)
                       and GUMPP version 1b (released on 10. 7. 2021)
=====================================================================

This is a cosmetic release. Processing is not altered in any way.

A noticeable inconvenience of GUMPP version 1b (and earlier) is
that during processing of large datasets the workflow appears
to freeze while hashing or copying large files to the output
directory. The first three improvements from the list below
aim to relieve an operator from guessing whether the workflow
has stalled.

1. Hashing of a file shows file size on screen, which gives an
   end user some clue about the duration of the hashing process.

2. Copying a file to the output directory shows the file name and
   size on screen (previously this operation was not indicated
   on screen at all). Large file sizes give a clue that
   copying cannot happen immediately.

3. Each on-screen message is instantly displayed. Previously,
   the system sometimes cached messages in memory to display
   them at a later time, which is rather inconvenient for
   an end user, especially when the workflow appears to stall.


The changes below aim to give incremental improvements to
certain aspects of GUMPP's use.

4. Hashing of a file does not show a non-informative top
   directory any more to make a screen less cluttered.

5. When results are symlinked, output directory name changes
   from history_results to history_symlinks to give an operator
   a better clue that symlinks need to be properly handled.

6. Symlinking of results is enabled by default when history is
   enabled, since otherwise disk consumption more than doubles,
   which is a noticeable burden on disk use with large datasets.
   Also, copying of large files (several terabytes in total)
   noticeably slows down the workflow.

7. Some typos in the on-screen messages and in the internal
   Mothur script are corrected.


=====================================================================
Summary of changes between GUMPP version 1b (released on 10. 7. 2021)
                       and GUMPP version 1a (released on  7. 6. 2021)
=====================================================================

Please NOTE: descriptions that follow may appear fairly advanced.
             They are given primarily as information about
             the changes under the hood. The majority of users
             need not delve into these features.
             However, each feature is fully accessible to any
             user. GUMPP operators who are interested in fine-tuning
             GUMPP's workflow execution may explore the novelties
             below to tweak execution according to their preferences.

Despite great effort and care in the past to make GUMPP execute
as efficiently as possible, there is still plenty of room for
improvement. Features of version 1b focus primarily on
execution streamlining.


1. THE MOST IMPORTANT NOVELTY
   --------------------------
   Now every command in a Mothur script is executed as a separate entity, and its
   results are separately deposited into a repository of results. This enables
   much more optimal workflow re-execution in the case of an interruption or due
   to changes of some parameters.

   For example, if the subsample size is changed, script processing jumps upon the
   workflow restart immediately to the subsampling step and the steps after it.
   No previous steps need to be re-executed. Similarly, if some filtering
   parameter changes, re-execution continues with the filter command, whereas
   everything before the first affected step is instantly retrieved from the
   repository of results. (This is possible only when the history feature is
   enabled; history was already introduced in GUMPP 1.0 and is enabled by default.)
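
   The idea of resuming at the first affected step can be sketched in Python.
   This is a simplified illustration, not GUMPP's implementation: the key
   scheme (a running hash over the command texts) and the names step_keys and
   first_step_to_run are assumptions of the sketch, and the repository of
   results is modelled as a plain set of keys.

```python
import hashlib


def step_keys(commands):
    """Return one key per command; each key covers the command and all before it."""
    keys = []
    digest = hashlib.sha256()
    for command in commands:
        digest.update(command.encode("utf-8"))
        digest.update(b"\n")
        # hexdigest() does not reset the running hash, so the next key
        # still depends on every earlier command.
        keys.append(digest.hexdigest())
    return keys


def first_step_to_run(commands, repository):
    """Index of the first command whose results are not in the repository."""
    for index, key in enumerate(step_keys(commands)):
        if key not in repository:
            return index
    return len(commands)  # everything can be retrieved from the repository


if __name__ == "__main__":
    old = ["make.contigs(file=current)",
           "sub.sample(fasta=current, size=10000)",
           "filter.shared(shared=current, mintotal=10)"]
    repository = set(step_keys(old))  # results of a completed earlier run
    new = [old[0], "sub.sample(fasta=current, size=100000)", old[2]]
    print(first_step_to_run(new, repository))
```

   Because each key depends on all preceding commands, a change of the
   subsample size invalidates the subsampling step and every step after it,
   while the steps before it are still found in the repository.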


2. Separate performance parameters can be specified for each Mothur step by inserting
   appropriate directives into a Mothur script.


3. Processing of Mothur's script template parameters is vastly improved. Each
   numerical template value may now be accompanied by a lower and an upper limit.
   This feature was primarily introduced to limit the number of processors for
   individual Mothur commands. Experience reveals that different Mothur commands
   utilize computing resources in a vastly different way. Some of them may work
   efficiently with large number of CPUs, whereas others completely saturate disk
   when running with too many allocated CPUs. There is simply no one-number-fits-all
   setting that would allow efficient execution of the entire workflow.

   The only possibility to improve the current state of affairs is to fine-tune
   performance (CPU, disk, ...) settings for each Mothur command separately. For that
   matter, GUMPP version 1b noticeably expands template capabilities of Mothur scripts.

   TODO: actual measurements have not been performed yet. So performance parameters are
         currently set to some generic values. Nonetheless, the infrastructure for
         fine-tuning is in place.

         If someone is willing to help us performing measurements on kinds of hardware
         that we do not have access to, we would be happy to include tuning results
         into future GUMPP releases (bostjan.murovec@fe.uni-lj.si).


4. Instantiation of template parameters is empowered. Each template definition may now
   consist of four fields, of which only the first one is mandatory:

   <<<PARAM_NAME; default_value; min:min_value; max:max_value>>>

   The default-value field already existed in previous versions of GUMPP, but now it
   can itself be templatized, i.e. it need not be a hard-wired constant in a
   script. Minimal and maximal values may be templatized as well.

   Furthermore, minimal and maximal fields may contain an arbitrary
   number of constants or templatized values, like:

        <<<PARAM_NAME; max:limit1, limit2, limit3, etc.>>>

   This way, several limiting factors may be specified. The smallest
   of them then imposes the actual upper limit. The same holds for a
   minimal value, except that the largest value of the entire list
   determines the lower limit.
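
   The limiting rule can be written down as a short Python sketch. The
   function names and the template names LIMIT_A and LIMIT_B are
   hypothetical, and the order in which the two limits are applied when
   they contradict each other is an assumption of this sketch.

```python
def resolve(value, definitions):
    """Turn a constant or a template name into a number."""
    value = value.strip()
    return definitions[value] if value in definitions else int(value)


def instantiate(requested, min_limits=(), max_limits=(), definitions=None):
    """Apply the lists of lower and upper limits to a requested value."""
    definitions = definitions or {}
    result = requested
    if max_limits:
        # The smallest entry of the max list is the effective upper limit.
        result = min(result, min(resolve(v, definitions) for v in max_limits))
    if min_limits:
        # The largest entry of the min list is the effective lower limit.
        # Assumption: it is applied last, so it wins in case of a conflict.
        result = max(result, max(resolve(v, definitions) for v in min_limits))
    return result


if __name__ == "__main__":
    names = {"LIMIT_A": 16, "LIMIT_B": 32}  # hypothetical template values
    print(instantiate(128, max_limits=["LIMIT_A", "LIMIT_B", "64"],
                      definitions=names))
```

   In the usage example, a requested value of 128 is reduced to 16,
   the smallest of the three upper limits.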


5. Template values may now be defined by a newly introduced <<<#let>>>
   directive. For example:

        <<<#let MAX_NUMBER_OF_CPU = 16>>>

        ...and used further on...

        make.contigs(...., processors=<<<NUM_CPU; max:MAX_NUMBER_OF_CPU>>>)

        or

        make.contigs(...., processors=<<<NUM_CPU; max:MAX_NUMBER_OF_CPU,64>>>)

   The last example imposes a hard limit of 64 CPUs aside from the one
   prescribed by the #let definition. This way, the #let value may be altered
   in the future, but regardless of its value, the number of processors for
   executing make.contigs never exceeds 64.

   In the following example, PHY_CPU is the physically present number of
   CPUs, a value that is automatically provided by GUMPP and that turns out
   to be handy in specifying CPU limits:

        make.contigs(...., processors=<<<NUM_CPU; max:MAX_NUMBER_OF_CPU,PHY_CPU,64>>>)

   In this example, the actually applied number of CPUs is limited by the
   minimal value of the user-specified limit MAX_NUMBER_OF_CPU, the
   physically present number of CPUs and the hard-wired number 64.


6. Fragments of Mothur's scripts may now be conditionally excluded from
   processing (similarly to conditional compilation of e.g. C/C++ code).
   This makes it possible to combine several slightly altered scripts into
   one compact version. By relying on this and the other features above,
   the entire variety of previous GUMPP scripts was consolidated into
   just one generic script. Specifically, there is no need for separate
   versions of scripts for processing fastq and fasta files, since conditional
   exclusions take care to eliminate the non-relevant parts:

        --------------------------------------------------------------------
        <<<#ifdef INPUT_MODE_FASTQ_PAIRED>>>
        #
        # the beginning of paired-fastq specific processing
        #
        make.file(inputdir=., type=fastq)
        make.contigs(file=stability.paired.files, processors=<<<NUM_CPU>>>)
        summary.seqs(fasta=current, processors=<<<NUM_CPU>>>)
        screen.seqs(fasta=current, group=current, ...)
        #
        # the end of paired-fastq specific processing
        #
        <<<#endif>>>

        <<<#ifdef INPUT_MODE_FASTA>>>
        #
        # the beginning of fasta + groups specific processing
        #
        summary.seqs(fasta=<<<INPUT_FASTA_FILE>>>, processors=<<<NUM_CPU>>>)
        screen.seqs(fasta=current, group=<<<INPUT_GROUPS_FILE>>>, ...)
        #
        # the end of fasta + groups specific processing
        #
        <<<#endif>>>
        --------------------------------------------------------------------

   Template values INPUT_MODE_FASTQ_PAIRED, INPUT_MODE_FASTA,
   INPUT_FASTA_FILE, INPUT_GROUPS_FILE, etc. are automatically
   supplied by GUMPP, as appropriate for specific circumstances.

   Similarly, there is no need for separate scripts for each type of
   analysis:

        --------------------------------------------------------------------
        <<<#ifdef ANALYSIS_TYPE_ASV>>>
        #
        # the part (or possibly several parts) that is (are) specific to ASV analysis
        #

        ...

        <<<#endif>>>


        <<<#ifdef ANALYSIS_TYPE_OTU>>>
        #
        # the part (or possibly several parts) that is (are) specific to OTU analysis
        #

        ...

        <<<#endif>>>


        <<<#ifdef ANALYSIS_TYPE_GEN>>>
        #
        # the part (or possibly several parts) that is (are) specific to GEN analysis
        #

        ...

        <<<#endif>>>
        --------------------------------------------------------------------

   Again, template definitions ANALYSIS_TYPE_ASV, ANALYSIS_TYPE_OTU,
   and ANALYSIS_TYPE_GEN are automatically supplied by GUMPP,
   as appropriate for the analysis in question.

   Similar conclusions apply to subsampling and additional filtering,
   for which previous versions of GUMPP supplied separate scripts.
   Hence, the number of required scripts was growing exponentially
   with the provided variations in processing routes. The new template
   capabilities and the resulting consolidation of scripts make
   the situation manageable and much less error prone. Previously,
   a modification that needed to be introduced had to be carefully
   applied to each variation of the script. Since there is now only
   one script, the modification needs to be applied only once.

   GUMPP still possesses a repository of scripts. Different Mothur
   workflows may still be provided if they differ from the generic
   one so vastly that it is more sensible to prescribe them with
   separate scripts.


7. Specifications of conditional exclusions may consist of an arbitrary
   number of AND or OR conditions. These are specified by & or |,
   respectively, in between template definitions.

        <<<#ifdef ANALYSIS_TYPE_ASV | INPUT_MODE_FASTQ_PAIRED>>>
        # ...some processing commands
        <<<#endif>>>

        <<<#ifdef ANALYSIS_TYPE_ASV & INPUT_MODE_FASTQ_PAIRED>>>
        # ...some processing commands
        <<<#endif>>>

   In the first case, the associated fragment applies, if analysis
   type is ASV or input consists of fastq files, whereas in the
   second case both of these requirements need to be met for the
   associated script segment to be applied.

   An isolated #ifdef directive may contain only one type of these
   logical operators. By nesting of #ifdef directives, arbitrarily
   complex conditions may be imposed.
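
   The evaluation of such conditions, including nesting, can be sketched
   in Python. This is an illustration only, not GUMPP's parser: the
   function names and the regular expressions are assumptions, and
   malformed scripts (e.g. an #endif without a matching #ifdef) are
   not handled.

```python
import re

IFDEF = re.compile(r"^\s*<<<#ifdef\s+(.+?)>>>")
ENDIF = re.compile(r"^\s*<<<#endif>>>")


def condition_holds(condition, defined):
    """Evaluate a condition whose names are joined by only & or only |."""
    if "&" in condition and "|" in condition:
        raise ValueError("an #ifdef may contain only one type of operator")
    if "&" in condition:
        return all(name.strip() in defined for name in condition.split("&"))
    # A single name is treated as an OR list with one element.
    return any(name.strip() in defined for name in condition.split("|"))


def apply_exclusions(lines, defined):
    """Drop lines inside #ifdef blocks whose own or enclosing condition fails."""
    kept, stack = [], []  # stack holds the truth value of each open #ifdef
    for line in lines:
        match = IFDEF.match(line)
        if match:
            stack.append(condition_holds(match.group(1), defined))
        elif ENDIF.match(line):
            stack.pop()
        elif all(stack):
            kept.append(line)
    return kept


if __name__ == "__main__":
    script = [
        "<<<#ifdef ANALYSIS_TYPE_ASV | ANALYSIS_TYPE_OTU>>>",
        "<<<#ifdef INPUT_MODE_FASTQ_PAIRED>>>",
        "# ...some processing commands",
        "<<<#endif>>>",
        "<<<#endif>>>",
        "summary.seqs(fasta=current)",
    ]
    print(apply_exclusions(script, {"ANALYSIS_TYPE_OTU", "INPUT_MODE_FASTQ_PAIRED"}))
```

   The nested directives in the usage example express the mixed condition
   (ASV or OTU) and paired fastq, which a single #ifdef cannot state.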


8. Script processing is now aware of line continuation. If a line
   is terminated with a backslash character (\), it is joined with the
   next line AFTER conditional exclusions are applied. This feature
   enables intra-line conditional exclusions:

        --------------------------------------------------------------------
        <<<#ifdef FILTER_SHARED_MIN_ABUND | FILTER_SHARED_MIN_PERCENT | FILTER_SHARED_MIN_TOTAL>>>
        #
        # the beginning of filter.shared fragment
        #

        filter.shared(shared=current, \
        <<<#ifdef FILTER_SHARED_MIN_ABUND>>>
        minabund=<<<FILTER_SHARED_MIN_ABUND>>>, \
        <<<#endif>>>
        <<<#ifdef FILTER_SHARED_MIN_PERCENT>>>
        minpercent=<<<FILTER_SHARED_MIN_PERCENT>>>, \
        <<<#endif>>>
        <<<#ifdef FILTER_SHARED_MIN_TOTAL>>>
        mintotal=<<<FILTER_SHARED_MIN_TOTAL>>>, \
        <<<#endif>>>
        <<<#ifdef FILTER_SHARED_MIN_NUM_SAMPLES>>>
        minnumsamples=<<<FILTER_SHARED_MIN_NUM_SAMPLES>>>, \
        <<<#endif>>>
        <<<#ifdef FILTER_SHARED_MIN_PERCENT_SAMPLES>>>
        minpercentsamples=<<<FILTER_SHARED_MIN_PERCENT_SAMPLES>>>, \
        <<<#endif>>>
        makerare=<<<FILTER_SHARED_MAKE_RARE;F>>>)

        #
        # the end of filter.shared fragment
        #
        <<<#endif>>>  # the entire filter OR condition
        --------------------------------------------------------------------

   The first line in the above fragment ensures that the Mothur command filter.shared
   is applied only if at least one of the templates FILTER_SHARED_MIN_ABUND,
   FILTER_SHARED_MIN_PERCENT or FILTER_SHARED_MIN_TOTAL is defined.
   The further #ifdef directives then include only the actually defined parameters
   in the command. For example, if only template FILTER_SHARED_MIN_TOTAL
   is defined, the resulting Mothur command is assembled as

        filter.shared(shared=current, mintotal=<<<FILTER_SHARED_MIN_TOTAL>>>, makerare=<<<FILTER_SHARED_MAKE_RARE;F>>>)

   where, of course, fragments <<<FILTER_SHARED_MIN_TOTAL>>> and
   <<<FILTER_SHARED_MAKE_RARE;F>>> are replaced by their
   instantiated values at the end of script preparation, like

        filter.shared(shared=current, mintotal=10, makerare=F)
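
   The joining step can be sketched in Python as follows. It is a minimal
   illustration that assumes conditional exclusions and template
   instantiation have already been applied to the input lines; the
   function name join_continuations and the handling of white space
   around the joint are assumptions of this sketch.

```python
def join_continuations(lines):
    """Join each line that ends with a backslash with the line after it."""
    joined, pending = [], ""
    for line in lines:
        # Continued parts lose their indentation; other lines keep it.
        text = line.strip() if pending else line.rstrip()
        if text.endswith("\\"):
            # Keep the text before the backslash and wait for the next line.
            pending += text[:-1].rstrip() + " "
        else:
            joined.append(pending + text)
            pending = ""
    if pending:
        # The last line ended with a backslash and nothing followed it.
        joined.append(pending.rstrip())
    return joined


if __name__ == "__main__":
    # Lines that remain when only FILTER_SHARED_MIN_TOTAL is defined.
    remaining = [
        "filter.shared(shared=current, \\",
        "mintotal=10, \\",
        "makerare=F)",
    ]
    print(join_continuations(remaining))
```

   For the three remaining lines of the usage example, the sketch
   produces the single filter.shared command that is shown above.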


9. Each Mothur command may be accompanied by various directives to give the
   workflow hints about which generated files need to be preserved or can be
   deleted. This way, it is possible to efficiently save disk space. GUMPP
   provides a semi-automatic deletion of files based on Mothur's current files
   feature. However, this behavior is disabled by default, since sometimes it is
   hard for a workflow to decide properly which files may be deleted.
   As an illustration, the script fragment below reproduces a part of the
   generic GUMPP built-in script. Directives are lines that begin with an
   exclamation mark (!).

       # remove a safe default option !out_globs = *, which is automatically
       # set by the workflow in order to enable smooth execution of plain
       # Mothur scripts without any need for directive hints.
       !global_out_globs

       # directive out_globs lists none, one or several linux wildcard
       # specifications. Files that are hit by any of these wildcards,
       # are not deleted by a workflow upon completion of the Mothur
       # command that the directive applies to. Directive global_out_globs
       # applies settings globally, i.e. until the next global redefinition.

       <<<#ifdef INPUT_MODE_FASTQ_PAIRED>>>
       # fastq files need to be preserved by make.file to be fed to make.contigs,
       !out_globs *.fastq
       make.file(inputdir=., type=fastq)

       # after make.contigs, we may get rid of the "file" entry in the current_files.summary
       !exclude_current file
       make.contigs(file=current, processors=<<<NUM_CPU;max:MAX_CPU_MAKE_CONTIGS,PHY_CPU>>>)
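
   The effect of the out_globs directive can be illustrated with a small
   Python sketch that uses shell-style wildcard matching. The function
   name and the way the produced files are listed are assumptions of the
   sketch; it does not reproduce GUMPP's actual deletion logic.

```python
from fnmatch import fnmatchcase


def files_to_delete(produced_files, out_globs):
    """Return the files that match none of the wildcard specifications."""
    return [name for name in produced_files
            if not any(fnmatchcase(name, pattern) for pattern in out_globs)]


if __name__ == "__main__":
    produced = ["sample1_R1.fastq", "sample1_R2.fastq", "stability.files"]
    # With !out_globs *.fastq only the fastq files are preserved.
    print(files_to_delete(produced, ["*.fastq"]))
```

   With the safe default * every file is preserved, whereas an empty
   list of wildcards marks all produced files for deletion.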


A. Results of each Mothur command are presented separately in the output directory,
   which greatly eases inspection and debugging in case something goes wrong.


B. MINOR: comments and empty lines are preserved during templates instantiation
   for documentation, clarity and esthetic reasons.


C. MINOR: new type of stylistic empty line (denoted by a line with a single minus
   character) may be used for stylization of final scripts. Without this
   feature, empty lines between templates typically accumulate excessively,
   which makes resulting scripts look weird.


The newly introduced features that are related to Mothur scripts are
interesting primarily to advanced GUMPP users, who intend to develop
their own Mothur scripts. Consequently, documentation of script-related
features is moved out of the users' manual into a separate document.


=====================================================================
Summary of changes between GUMPP version 1a  (released on 7. 6. 2021)
                       and GUMPP version 1.0 (released on 7. 4. 2021)
=====================================================================

1. GUMPP now also accepts input reads in fasta+groups format in
   addition to paired fastq reads.

2. It is possible to skip subsampling step (Mothur command sub.sample).

3. Mothur taxonomy outputs are also delivered to an end user instead
   of being deleted.