v0.2 → v0.3 Transition#

This document describes changes to columnflow introduced in version 0.3.0 that may affect existing code as well as already created output files. These changes were made in a refactoring campaign (see release v0.3) that was necessary to generalize some decisions made at an earlier stage of the project, and ultimately to support more analysis use cases that require a high degree of flexibility in many aspects of the framework.

The changes are grouped into the following categories:

Restructured Task Array Functions#

The internals of task array functions (TAF) like calibrators, selectors and producers received a major overhaul. Not all changes affect user code but some might.

Most notably, TAFs no longer have the attributes task, global_shift_inst and local_shift_inst. Instead, some of the configurable functions now receive a task argument through which task information and attributes like shifts can be accessed. In turn, the attributes analysis_inst, config_inst and dataset_inst are guaranteed to always be available, and there is no longer any need to check for their existence dynamically.

This change reflects the new state separation imposed by the order in which the underlying, customizable functions (or hooks) are called. A full overview of these hooks and the arguments they receive is given in the task array functions documentation. In short, there are three types of hooks:

  1. pre_init, init, post_init: Initialization hooks meant to dynamically update used and produced columns and TAF dependencies. post_init is the first hook to receive the task argument.

  2. requires, setup, teardown: Methods to define custom task requirements, setting up attributes of the task array function before event processing, and to clean up and free resources afterwards.

  3. __call__: The main callable that is invoked for each event chunk.

pre_init, post_init and teardown have been newly introduced. See the task array function interface for a full description of all hooks and the arguments they receive.

(Note that, as before, while the hooks to register custom functions are named as shown above, the functions stored internally have an additional suffix and are named <HOOK_NAME>_func.)
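The order of the hooks, and the point from which the task becomes available, can be pictured with a toy model. This is plain Python and not columnflow code: the class ToyTAF, the run helper and the string standing in for the task are hypothetical, and calling requires before setup is an assumption based on the descriptions above.

```python
class ToyTAF:
    """Toy stand-in for a task array function (not columnflow code)."""

    def __init__(self):
        self.calls = []       # (hook name, task seen by the hook)
        self.resource = None  # something expensive, loaded in setup

    def pre_init(self):
        self.calls.append(("pre_init", None))

    def init(self):
        self.calls.append(("init", None))

    def post_init(self, task):
        # first hook that can look at the task, e.g. at its shifts
        self.calls.append(("post_init", task))

    def requires(self, task):
        self.calls.append(("requires", task))

    def setup(self, task):
        # load whatever is needed for event processing
        self.calls.append(("setup", task))
        self.resource = {"scale": 2.0}

    def __call__(self, chunk, task):
        # invoked once per event chunk
        self.calls.append(("__call__", task))
        return [value * self.resource["scale"] for value in chunk]

    def teardown(self, task):
        # release what setup loaded
        self.calls.append(("teardown", task))
        self.resource = None


def run(taf, task, chunks):
    """Hypothetical driver calling the hooks in the assumed order."""
    taf.pre_init()
    taf.init()
    taf.post_init(task)
    taf.requires(task)
    taf.setup(task)
    results = [taf(chunk, task) for chunk in chunks]
    taf.teardown(task)
    return results


taf = ToyTAF()
results = run(taf, "toy_task", [[1.0, 2.0], [3.0]])
print([name for name, _ in taf.calls])
```

In this sketch, code that needs the task cannot live in pre_init or init, which is why such code has to move to post_init or a later hook.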

Example#

The example below shows a simple producer that calculates the invariant mass of the two leading jets per event. The task argument is now passed to the function, and the task.logger can be used to log messages in the scope of the task.

import law
import awkward as ak
from columnflow.production import Producer, producer
from columnflow.columnar_util import set_ak_column

@producer(
    uses={"Jet.{pt,eta,phi,mass}"},
    produces={"di_jet_mass"},
)
def di_jet_mass(self: Producer, events: ak.Array, task: law.Task) -> ak.Array:
    # issue a warning in case less than 2 jets are present
    if ak.any(ak.num(events.Jet, axis=1) < 2):
        task.logger.warning("encountered event with less than 2 jets")

    di_jet = events.Jet[:, :2].sum(axis=1)
    events = set_ak_column(events, "di_jet_mass", di_jet.mass, value_type="float32")

    return events

Update Instructions#

  1. Check out the TAF interface to learn about the arguments that the hooks receive. In particular, the task argument is now passed to all hooks after (and including) post_init.

  2. Make sure to no longer use the TAF attributes self.task, self.global_shift_inst, and self.local_shift_inst. Access them through the task argument instead.

  3. Depending on whether your custom TAF required access to these attributes, for instance in the init hook, you need to move your code to a different hook such as post_init.

  4. If your TAF blocked specific resources, such as a large object, ML model, etc. loaded during setup, think about releasing these resources in the teardown hook.

  5. Also, all TAF instances are now cached per combination of self.analysis_inst, self.config_inst and self.dataset_inst.

Multi-config Tasks#

Most of the tasks provided by columnflow operate on a single analysis configuration (usually representing self-contained data taking periods or eras). Examples are cf.CalibrateEvents and cf.SelectEvents, or cf.ProduceColumns and cf.CreateHistograms which do the heavy lifting in terms of event processing.

However, some tasks require access to data of multiple eras at a time, and therefore, access to multiple analysis configurations. We refer to these tasks as multi-config tasks.

In version 0.3, the following tasks are multi-config tasks:

  • Most plotting tasks: tasks like cf.PlotVariables1D need to be able to draw events simulated for / recorded in multiple eras into the same plot.

  • cf.MLTraining: For many ML training applications it is reasonable to train on data from multiple eras, given that detector conditions are not too different. It is now possible to request data from multiple eras to be loaded for a single training.

  • cf.CreateDatacards (CMS-specific): The inference model interface as well as the datacard export routines now support entries for multiple configurations. See the changes to the inference model interface below for details.

Update Instructions#

All instructions only apply to the CLI usage of tasks.

  1. Tasks listed above no longer have a --config parameter. Instead, they now have a --configs parameter that accepts multiple configuration names as a comma-separated sequence. To achieve the single-config behavior, just pass the name of a single configuration here.

  2. Specific other parameters of multi-config tasks changed as well. Most notably, the --datasets and --processes parameters, which previously accepted a single sequence of dataset or process names on the command line, now accept multiple comma-separated sequences. The number of sequences should be exactly one (applies to all configurations) or match the number of configurations given in --configs (one-to-one assignment). Sequences should be separated by colons.

    • Example: law run cf.PlotVariables1D --configs 22pre,22post --datasets tt_sl,st_tw:tt_sl,st_s
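The assignment rule for colon-separated sequences can be sketched in a few lines of plain Python. The helper assign_sequences is hypothetical and only mirrors the rule stated above; columnflow's own parameter parsing may differ in detail (for instance, in how it resolves patterns).

```python
def assign_sequences(configs: str, value: str) -> dict[str, list[str]]:
    """Map each configuration name to its sequence of dataset or process names."""
    config_names = configs.split(",")
    # colons separate sequences, commas separate names within one sequence
    sequences = [sequence.split(",") for sequence in value.split(":")]

    if len(sequences) == 1:
        # a single sequence applies to all configurations
        sequences = [list(sequences[0]) for _ in config_names]
    elif len(sequences) != len(config_names):
        raise ValueError(
            f"got {len(sequences)} sequences for {len(config_names)} configurations",
        )

    # one-to-one assignment
    return dict(zip(config_names, sequences))


per_config = assign_sequences("22pre,22post", "tt_sl,st_tw:tt_sl,st_s")
shared = assign_sequences("22pre,22post", "tt_sl,st_tw")
print(per_config)
print(shared)
```

With the values from the example command, 22pre is assigned tt_sl and st_tw, while 22post is assigned tt_sl and st_s.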

Reducers#

Reducers are a new type of task array function that are invoked by the cf.ReduceEvents task. They control how results of the event selection - event and object masks - are applied to the full event data. See the types of task array functions and the detailed documentation on reducers for details.

The reducer’s job is

  • to apply the event selection mask (booleans) to select only a subset of events,

  • to apply object selection masks (booleans or integer indices) to create new collections of objects (e.g. specific jets, or leptons), and

  • to drop columns that are not needed by any of the downstream tasks.

These three steps were previously part of the default implementation of the cf.ReduceEvents task but are now fully configurable through custom reducers. For compatibility with existing analyses, a default reducer called cf_default is provided by columnflow that implements exactly the previous behavior. In doing so, it even relies on the auxiliary entry keep_columns in the configuration to determine which columns should be kept after reduction.
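The three reduction steps listed above can be illustrated on a toy data set. This sketch uses plain Python lists instead of awkward arrays and is not how cf_default is implemented; the function reduce_events, the column names and the new collection name SelectedJet are made up for illustration.

```python
def reduce_events(events, event_mask, jet_indices, keep_columns):
    """Toy reduction on a dict of columns (one list entry per event)."""
    # 1. apply the boolean event mask
    selected = [i for i, keep in enumerate(event_mask) if keep]
    reduced = {name: [column[i] for i in selected] for name, column in events.items()}

    # 2. apply the object mask (integer indices) to build a new jet collection
    reduced["SelectedJet.pt"] = [
        [events["Jet.pt"][i][j] for j in jet_indices[i]]
        for i in selected
    ]

    # 3. drop columns that nothing downstream needs
    return {name: column for name, column in reduced.items() if name in keep_columns}


events = {
    "Jet.pt": [[50.0, 30.0, 20.0], [40.0], [80.0, 25.0]],
    "MET.pt": [12.0, 45.0, 30.0],
    "run": [1, 1, 2],
}
reduced = reduce_events(
    events,
    event_mask=[True, False, True],
    jet_indices=[[0, 1], [0], [0]],
    keep_columns={"SelectedJet.pt", "MET.pt"},
)
print(reduced)
```

Only the first and third events survive, the new collection holds the jets picked by index, and the original Jet.pt and run columns are gone from the output.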

Example#

The following example creates a custom reducer that invokes columnflow’s default reduction behavior and additionally creates a new column.

from columnflow.reduction import Reducer, reducer
from columnflow.reduction.default import cf_default
from columnflow.util import maybe_import
from columnflow.columnar_util import set_ak_column

ak = maybe_import("awkward")

@reducer(
    uses={cf_default, "Jet.hadronFlavour"},
    produces={cf_default, "Jet.from_b_hadron"},
)
def example(self: Reducer, events: ak.Array, selection: ak.Array, **kwargs) -> ak.Array:
    # run cf's default reduction which handles event selection and collection creation
    events = self[cf_default](events, selection, **kwargs)

    # compute and store additional columns after the default reduction
    # (so only on a subset of the events and objects which might be computationally lighter)
    col = abs(events.Jet.hadronFlavour) == 5
    events = set_ak_column(events, "Jet.from_b_hadron", col, value_type=bool)

    return events

Update Instructions#

  1. In general, there is no need to update your code. However, you will notice that output paths of all tasks downstream of (and including) cf.ReduceEvents will have an additional fragment like .../red__cf_default/... to reflect the choice of the reducer.

  2. The reduction behavior that was previously part of the cf.ReduceEvents task is now encapsulated by a default reducer called cf_default. To extend or alter its behavior, create your own implementation either from scratch or by inheriting from it and only overwriting some of its hooks.

  3. Invoke your reducer by adding --reducer MY_REDUCER_CLASS on the command line or by adding an auxiliary entry default_reducer to your configuration.

  4. If you decide to control the set of columns that should be available after reduction solely through your reducer, and no longer through the keep_columns auxiliary entry in your configuration, you can do so by redefining the produces set of your reducer.

Histogram Producers#

In release v0.2 and before, the amount of control users had over the creation of histograms within cf.CreateHistograms was limited to the selection of variables to use (through the --variables parameter) and the definition of event weights to be used during histogram filling. The latter was configured by specifying a so-called weight producer (through the --weight-producer parameter), which referred to the name of a task array function.

As of v0.3, we generalized this concept and renamed it to histogram producers. Use --hist-producer in the command line to specify the histogram producer you intend to use. See the full histogram producer documentation for more info.

In short, histogram producers continue to be task array functions, however, they provide additional hooks to control different aspects of the histogramming process:

  • create_hist(self, variables: list[od.Variable], task: law.Task) -> hist.Histogram: Given a list of variables, creates and returns a new histogram, with arbitrary axes, binning and weight storage.

  • fill_hist(self, h: hist.Histogram, data: dict[str, Any], task: law.Task) -> None: Provided columnar data to fill (with fields "category", "process", "shift" (a string) and "weight"), controls the way this data is filled into the histogram.

  • post_process_hist(self, h: hist.Histogram, task: law.Task) -> hist.Histogram: Invoked after all data was filled in cf.CreateHistograms, allows changing the histogram before it is saved to disk.

  • post_process_merged_hist(self, h: hist.Histogram, task: law.Task) -> hist.Histogram: Invoked by cf.MergeHistograms, allows changing the merged histogram before it is saved for subsequent processing.

The only requirement that columnflow imposes on histograms for plotting and export as part of statistical models is the existence of categorical (string) axes "category", "process" and "shift" after merging.

The main callable of a histogram producer continues to be responsible for returning (and potentially preprocessing) the event chunk to histogram, as well as a float array representing event weights in a 2-tuple, consistent with the previous behavior of weight producers.
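How the output of the main callable feeds into the hooks can be pictured with a toy pipeline in plain Python. Nothing here is columnflow or hist API: a Counter stands in for the histogram, the functions drop the self and task arguments, the variable jet1_pt with bins of width 50 and the scaling in post_process_hist are made up, and the call order is an assumption based on the hook descriptions above.

```python
from collections import Counter


def call(events):
    # main callable: return the events to histogram and one weight per event
    weights = [ev["normalization_weight"] * ev["pileup_weight"] for ev in events]
    return events, weights


def create_hist(variables):
    # toy histogram: weighted counts keyed by (category, process, shift, bin)
    return Counter()


def fill_hist(h, data):
    # data holds "category", "process", "shift", "weight" and the variable column
    rows = zip(
        data["category"], data["process"], data["shift"], data["jet1_pt"], data["weight"],
    )
    for category, process, shift, value, weight in rows:
        h[(category, process, shift, int(value // 50))] += weight


def post_process_hist(h):
    # example tweak before saving: scale all bin contents by a fixed factor
    return Counter({key: 2.0 * content for key, content in h.items()})


events = [
    {"jet1_pt": 60.0, "normalization_weight": 2.0, "pileup_weight": 0.5},
    {"jet1_pt": 75.0, "normalization_weight": 2.0, "pileup_weight": 1.0},
    {"jet1_pt": 120.0, "normalization_weight": 1.0, "pileup_weight": 1.0},
]

events, weights = call(events)
h = create_hist(["jet1_pt"])
fill_hist(h, {
    "category": ["incl"] * 3,
    "process": ["tt"] * 3,
    "shift": ["nominal"] * 3,
    "jet1_pt": [ev["jet1_pt"] for ev in events],
    "weight": weights,
})
h = post_process_hist(h)
print(dict(h))
```

Each step can be swapped out independently in this picture, which is the point of splitting histogram creation, filling and post-processing into separate hooks.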

Note that, unlike for most other task array functions, columnflow provides a default histogram producer named cf_default. It handles the histogram definition and filling in a backwards-compatible way, as well as a post-processing step that converts the category and shift axes from categorical integer to string types (for consistency across configuration objects when used in multi-config tasks). It is recommended to extend this default histogram producer in case you only need to change a single aspect of the histogramming process with respect to the default behavior. See the example below for how to do this.

Example#

from columnflow.histogramming import HistProducer
from columnflow.histogramming.default import cf_default
from columnflow.util import maybe_import

ak = maybe_import("awkward")

@cf_default.hist_producer(
    uses={"{normalization,pileup,btag}_weight"}
)
def example(self: HistProducer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    """
    Example histogram producer that inherits from columnflow's default and
    changes the event weight only.
    """
    # compute the event weight
    weight = events.normalization_weight * events.pileup_weight * events.btag_weight

    return events, weight

Update Instructions#

  1. In case you used a weight producer before, convert it to a HistProducer(). There should be no change necessary for the main event callable.

  2. On the command line, use --hist-producer instead of --weight-producer.

  3. Note that the weight__ prefix in the weight producer related fragment of output paths of all tasks downstream of (and including) cf.CreateHistograms was changed to hist__ accordingly.

  4. If you do not intend to alter the default histogram definition, filling and post-processing, make sure to inherit from cf_default as shown in the example above.

Inference Model Updates#

As stated above, multi-config tasks allow for the inclusion of multiple analysis configurations in a single task to be able to access event data that spans multiple eras. This is particularly useful for tasks that export statistical models like cf.CreateDatacards (CMS-specific), and all other tasks that inherit from the generalized SerializeInferenceModelBase task.

To support this new feature, the underlying InferenceModel, i.e., the container object able to configure statistical models for your analysis, was updated. Pointers to analysis-specific objects in category and process definitions are now to be stored per configuration (see example below). This info is picked up by (e.g.) cf.CreateDatacards to pull in information and data from multiple data taking eras and potentially fill their event data into the same inference category.

As for all multi-config tasks, pass a sequence of configuration names to the --configs parameter on the command line.

Example#

The following example demonstrates how to define an inference model that …

from columnflow.inference import InferenceModel, inference_model, FlowStrategy

@inference_model
def example_model(self: InferenceModel) -> None:
    """
    Initialization method for the inference model.
    Use instance methods to define categories, processes and parameters.
    """
    # add a category
    self.add_category(
        "example_category",
        # add config dependent settings
        config_data={
            config_inst.name: self.category_config_spec(
                # name of the analysis category in the config
                category=f"{ch}__{cat}__os__iso",
                # name of the variable
                variable="jet1_pt",
                # names (or patterns) of datasets with real data in the config
                data_datasets=["data_*"],
            )
            for config_inst in self.config_insts
        },
        # additional category settings
        mc_stats=10.0,
        flow_strategy=FlowStrategy.move,
    )

    # add processes
    self.add_process(
        name="TT",
        # add config dependent settings
        config_data={
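            config_inst.name: self.process_config_spec(
                # name of the (parent) process in the config
                process="tt",
                # names of MC datasets in the config
                mc_datasets=["tt_sl_powheg", "tt_dl_...", ...],
            )
            # as for the category above, create one entry per configuration
            for config_inst in self.config_insts
        },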
        # additional process settings
        is_signal=False,
    )
    # more processes here
    ...

Update Instructions#

  1. In definitions of categories, processes and parameters within your inference model, make sure that all pointers that refer to analysis-specific objects are stored in a dictionary with keys being configuration names.

  2. These dictionaries are stored in fields named config_data.

  3. Use the provided factory functions, such as the category_config_spec and process_config_spec methods used in the example above, to create these dictionary structures. They invoke some additional value validation.
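The benefit of building the per-configuration dictionaries through a factory rather than by hand can be shown with a toy version. The function toy_category_spec below is a hypothetical stand-in written for this sketch; it is not columnflow's implementation, and the checks it performs are assumptions about what such validation could look like.

```python
def toy_category_spec(category, variable, data_datasets=None):
    """Hypothetical stand-in for a spec factory with simple validation."""
    if not isinstance(category, str) or not isinstance(variable, str):
        raise TypeError("category and variable must be strings")
    # accept a single name or pattern as well as a sequence of them
    if isinstance(data_datasets, str):
        data_datasets = [data_datasets]
    return {
        "category": category,
        "variable": variable,
        "data_datasets": list(data_datasets or []),
    }


# one entry per configuration name, as stored in a config_data field
config_names = ["22pre", "22post"]
config_data = {
    name: toy_category_spec("incl", "jet1_pt", data_datasets="data_*")
    for name in config_names
}
print(config_data)
```

A hand-written dictionary would silently accept a mistyped field or a wrong type, whereas the factory call fails early.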

Changed Plotting Task Names#

The visualization of systematic uncertainties is updated as of v0.3. A new plot method was introduced to show not only the effect of the statistical uncertainty (due to the limited amount of simulated events) as a grey, hatched area, but also that of systematic uncertainties as a differently colored band.

The task that invokes this plot method by default is cf.PlotShiftedVariables1D. See the full task graph in our wiki for its dependencies on other tasks.

Note that this task is not new, but it has been changed to include the systematic uncertainty bands. In version v0.2 and below, this task was used to plot the effect of a single up or down variation of a single shift. This behavior is now covered by a task called cf.PlotShiftedVariablesPerShift1D.

Update Instructions#

  1. If you are interested in creating plots showing the effect of one or multiple shifts in the same graph, use the cf.PlotShiftedVariables1D task.

  2. If you want to plot the effect of a single up or down variation of a single shift, use the cf.PlotShiftedVariablesPerShift1D task (formerly known as cf.PlotShiftedVariables1D).

Miscellaneous smaller updates#

  • The SelectorStepsMixin was removed and its functionality was moved into the standard SelectorClassMixin and SelectorMixin classes.

  • columnflow.util.InsertableDict was removed in favor of law.util.InsertableDict.