Task Overview

This page gives a brief overview of the tasks available by default in columnflow.

In general (at least for people working on the CMS experiment), using columnflow requires a grid proxy, as the dataset files are accessed through it. The grid proxy should be activated after the setup of the default environment. It is nonetheless possible to create a custom law config file for people without a grid certificate, although someone with a grid proxy must run the GetDatasetLFNs task (and potentially the CMS-specific CreatePileUpWeights) for them. Once these tasks are done, other users can work with the local task outputs without a grid certificate, provided they can access them through the storage location declared in the custom law config file. An example of such a custom config file can be found in the examples section of this documentation.

While the name of each task is fairly descriptive of its purpose, a short introduction of the most important facts and parameters for each task group is provided below. As some tasks require others to run, the arguments for a task higher in the tree are also required for tasks below it, sometimes in a slightly different form: a parameter may become pluralized (e.g. --datasets instead of --dataset) when a task accepts several values at once.

Note that in your analysis, the command line call for running the columnflow tasks described below carries an additional "cf." prefix before the task name, as these are columnflow tasks and not tasks created explicitly for your analysis. For example, running the SelectEvents task requires the following syntax:

law run cf.SelectEvents --your-parameters parameter_value
  • GetDatasetLFNs: This task looks for the logical file names of the datasets to be used and saves them in a json file. The --dataset argument, followed by the name of the dataset as defined in the analysis config, is required for this task to run (see the first example command after this list).

  • CalibrateEvents: Task to implement corrections to be applied to the datasets, e.g. jet-energy corrections. This task uses objects of the Calibrator class to apply the calibration. The --calibrator argument, followed by the name of the Calibrator object to be run, is required for this task to run; a default value for this argument can be set in the analysis config. Similarly, the --shift argument can be given to choose which variation of the corrections (e.g. up, down or nominal for the jet-energy corrections) is to be used. An example command is shown after this list.

  • SelectEvents: Task to implement selections to be applied to the datasets. This task uses objects of the Selector class to apply the selection. The outputs are masks for the events and objects to be selected, saved in a parquet file, and additional information stored in a dictionary format, such as the statistics of the selection (which are needed for the plotting tasks further down the task tree), saved in a json file. The masks are not applied to the columns during this task. The --selector argument, followed by the name of the Selector object to be run, is required for this task to run; a default value can be set in the analysis config. From this task on, the --calibrator argument is replaced by --calibrators. An example command is shown after this list.

  • ReduceEvents: Task to apply the masks created in SelectEvents to the datasets. All tasks below ReduceEvents in the task graph work on the parquet file resulting from ReduceEvents, not on the original dataset. The columns to be kept after ReduceEvents are to be given in the analysis config under config.x.keep_columns in a DotDict structure (from columnflow.util); a sketch of such an entry is shown after this list.

  • ProduceColumns: Task to produce additional columns for the reduced datasets, e.g. for new high-level variables. This task uses objects of the Producer class to create the new columns. The new columns are saved in a parquet file that can be used by the tasks below in the task graph. The --producer argument, followed by the name of the Producer object to be run, is required for this task to run; a default value can be set in the analysis config. An example command is shown after this list.

  • PrepareMLEvents, MLTraining, MLEvaluation: Tasks to prepare the inputs for, train and evaluate neural networks, and to plot their results (to be implemented).

  • CreateHistograms: Task to create histograms with the python package hist, which can be used by the tasks below in the task graph. From this task on, the --producer argument is replaced by --producers. The histograms are saved in a pickle file. An example command is shown after this list.

  • CreateDatacards: TODO

  • Merge tasks (e.g. MergeReducedEvents, MergeHistograms): Tasks to merge the local outputs from the various occurrences of the corresponding tasks.
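
For illustration, a GetDatasetLFNs call might look as follows. In this and the following examples, the dataset, calibrator, selector and producer names are placeholders; replace them with objects defined in your analysis config, and add any further parameters (e.g. --version) that your setup requires.

law run cf.GetDatasetLFNs --dataset st_tchannel_t_powheg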
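
A CalibrateEvents call with an explicit shift might then look like this (the calibrator name and the shift are again placeholders):

law run cf.CalibrateEvents --dataset st_tchannel_t_powheg --calibrator example --shift jec_up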
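
For SelectEvents, note the pluralized --calibrators parameter:

law run cf.SelectEvents --dataset st_tchannel_t_powheg --calibrators example --selector example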
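
As mentioned in the ReduceEvents item, the columns surviving the reduction are declared in the analysis config. A minimal sketch, assuming an existing config object named config and purely illustrative column names:

from columnflow.util import DotDict

# columns listed here are kept by ReduceEvents; all others are dropped
config.x.keep_columns = DotDict.wrap({
    "cf.ReduceEvents": {
        "run", "luminosityBlock", "event",
        "Jet.pt", "Jet.eta", "Jet.phi",
    },
})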
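
A ProduceColumns call might look like this:

law run cf.ProduceColumns --dataset st_tchannel_t_powheg --producer example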
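
Finally, a CreateHistograms call with the pluralized --producers parameter; depending on your setup, further parameters such as --variables may be needed:

law run cf.CreateHistograms --dataset st_tchannel_t_powheg --producers example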

There are also CMS-specialized tasks, like CreatePileUpWeights, which are described in the CMS specializations section. As a note, the CreatePileUpWeights task is interesting from a workflow point of view, as it is an example of a task that is required through an object of the Producer class. This behaviour can be observed in the pu_weight_requires() method; a sketch of the pattern is shown below.
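
The following minimal sketch shows how a producer can declare a task requirement through its requires hook. The producer name, the columns and the import path are illustrative, and the exact hook signature may differ between columnflow versions:

from columnflow.production import producer

@producer(uses={"Pileup.nTrueInt"}, produces={"pu_weight"})
def my_pu_weight(self, events, **kwargs):
    # compute the pileup weight column from the required task's output here
    ...

@my_pu_weight.requires
def my_pu_weight_requires(self, reqs):
    # declare CreatePileUpWeights as a requirement, so that law runs it
    # before any task that uses this producer
    if "pu_weights" in reqs:
        return
    # hypothetical import path, following columnflow's CMS-specific task layout
    from columnflow.tasks.cms.external import CreatePileUpWeights
    reqs["pu_weights"] = CreatePileUpWeights.req(self.task)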

TODO: it might be interesting to add examples, e.g. for the usage of the parameters for the 2d plots, either in the examples section or after the next subsection, such that all parameters are explained. If so, they should at least be mentioned here.