v0.2 → v0.3 Transition#
This document describes changes to columnflow introduced in version 0.3.0 that may affect existing code as well as already created output files. These changes were made in a refactoring campaign (see release v0.3) that was necessary to generalize some decisions made at an earlier stage of the project, and ultimately to support more analysis use cases that require a high degree of flexibility in many aspects of the framework.
The changes are grouped into the following categories:
Restructured Task Array Functions#
The internals of task array functions (TAF) like calibrators, selectors and producers received a major overhaul. Not all changes affect user code but some might.
Most notably, TAFs no longer have the attributes task, global_shift_inst, local_shift_inst.
Instead, some of the configurable functions now receive a task argument through which task information and attributes like shifts can be accessed.
In turn, the attributes analysis_inst, config_inst and dataset_inst are guaranteed to always be available, and there is no longer the need to dynamically check their existence.
This change reflects the new state separation imposed by the order in which underlying, customizable functions (or hooks) are called. A full overview of these hooks and the arguments they receive is given in the task array functions documentation. In short, there are three types of hooks:
pre_init, init, post_init: Initialization hooks meant to dynamically update used and produced columns and TAF dependencies. post_init is the first hook to receive the task argument.
requires, setup, teardown: Methods to define custom task requirements, setting up attributes of the task array function before event processing, and to clean up and free resources afterwards.
__call__: The main callable that is invoked for each event chunk.
pre_init, post_init and teardown have been newly introduced.
See the task array function interface for a full description of all hooks and the arguments they receive.
(Note that, as before, while the hooks to register custom functions are named as shown above, the functions stored internally have an additional suffix and are named <HOOK_NAME>_func.)
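The call order described above can be pictured with a small, self-contained toy model. It does not use columnflow at all: ToyTask, ToyTAF and run are hypothetical stand-ins, and the real hooks receive more arguments than shown here. The sketch only illustrates that the task becomes available from post_init onwards and that teardown runs after all chunks were processed.

```python
from __future__ import annotations


class ToyTask:
    """Stand-in for a law task; only carries a name and a shift label."""

    def __init__(self, name: str, shift: str = "nominal") -> None:
        self.name = name
        self.shift = shift


class ToyTAF:
    """Toy object that records the order in which its hooks are called."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self.resource: dict | None = None

    # initialization hooks: no task is available before post_init
    def pre_init(self) -> None:
        self.calls.append("pre_init")

    def init(self) -> None:
        self.calls.append("init")

    def post_init(self, task: ToyTask) -> None:
        # first hook that can inspect the task, e.g. its shift
        self.calls.append(f"post_init[{task.shift}]")

    # requirement, setup and teardown hooks
    def requires(self, task: ToyTask) -> None:
        self.calls.append("requires")

    def setup(self, task: ToyTask) -> None:
        self.resource = {"loaded_for": task.name}
        self.calls.append("setup")

    def teardown(self, task: ToyTask) -> None:
        # release whatever setup allocated
        self.resource = None
        self.calls.append("teardown")

    # main callable, invoked once per event chunk
    def __call__(self, chunk: list[float], task: ToyTask) -> list[float]:
        self.calls.append("call")
        return [2.0 * x for x in chunk]


def run(taf: ToyTAF, task: ToyTask, chunks: list[list[float]]) -> list[list[float]]:
    """Drive all hooks in the order sketched in the text above."""
    taf.pre_init()
    taf.init()
    taf.post_init(task)
    taf.requires(task)
    taf.setup(task)
    try:
        return [taf(chunk, task) for chunk in chunks]
    finally:
        taf.teardown(task)


taf = ToyTAF()
outputs = run(taf, ToyTask("toy_task"), [[1.0], [2.0, 3.0]])
print(taf.calls)
```

In this toy, teardown is wrapped in a try/finally so that the resource is released even if processing a chunk fails. Whether columnflow guarantees the same is not stated here and should be checked in the task array function documentation.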
Example#
The example below shows a simple producer that calculates the invariant mass of the two leading jets per event.
The task argument is now passed to the function, and the task.logger can be used to log messages in the scope of the task.
import law
import awkward as ak

from columnflow.production import Producer, producer
from columnflow.columnar_util import set_ak_column


@producer(
    uses={"Jet.{pt,eta,phi,mass}"},
    produces={"di_jet_mass"},
)
def di_jet_mass(self: Producer, events: ak.Array, task: law.Task) -> ak.Array:
    # issue a warning in case less than 2 jets are present
    if ak.any(ak.num(events.Jet, axis=1) < 2):
        task.logger.warning("encountered event with less than 2 jets")

    di_jet = events.Jet[:, :2].sum(axis=1)
    events = set_ak_column(events, "di_jet_mass", di_jet.mass, value_type="float32")

    return events
Update Instructions#
Check out the TAF interface to learn about the arguments that the hooks receive. In particular, the task argument is now passed to all hooks after (and including) post_init.
Make sure to no longer use the TAF attributes self.task, self.global_shift_inst, and self.local_shift_inst. Access them through the task argument instead.
Depending on whether your custom TAF required access to these attributes, for instance in the init hook, you need to move your code to a different hook such as post_init.
If your TAF blocked specific resources, such as a large object, ML model, etc. loaded during setup, think about releasing these resources in the teardown hook.
Also, all TAF instances are cached from now on, given the combination of self.analysis_inst, self.config_inst and self.dataset_inst.
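One practical consequence of the last point is that state stored on a TAF instance can persist between uses that share the same analysis, config and dataset. The sketch below illustrates the general idea of caching instances by such a key. ToyTAF and its get method are hypothetical, and columnflow's actual cache may be implemented differently.

```python
from __future__ import annotations


class ToyTAF:
    """Toy task array function bound to an (analysis, config, dataset) triple."""

    # class-level cache: key -> instance
    _instances: dict[tuple[str, str, str], "ToyTAF"] = {}

    def __init__(self, analysis: str, config: str, dataset: str) -> None:
        self.analysis = analysis
        self.config = config
        self.dataset = dataset
        self.setup_count = 0

    @classmethod
    def get(cls, analysis: str, config: str, dataset: str) -> "ToyTAF":
        """Return the cached instance for this key, creating it on first use."""
        key = (analysis, config, dataset)
        if key not in cls._instances:
            cls._instances[key] = cls(analysis, config, dataset)
        return cls._instances[key]


# the names below are placeholders for illustration only
a = ToyTAF.get("my_analysis", "22pre", "tt_sl")
b = ToyTAF.get("my_analysis", "22pre", "tt_sl")
c = ToyTAF.get("my_analysis", "22pre", "st_tw")

a.setup_count += 1
print(a is b, a is c, b.setup_count)
```

Since a and b are the same object, the counter changed through a is also visible through b. Under this reading, attributes set in one hook should be treated as shared state, which is another reason to release large resources in teardown.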
Multi-config Tasks#
Most of the tasks provided by columnflow operate on a single analysis configuration (usually representing self-contained data taking periods or eras).
Examples are cf.CalibrateEvents and cf.SelectEvents, or cf.ProduceColumns and cf.CreateHistograms which do the heavy lifting in terms of event processing.
However, some tasks require access to data of multiple eras at a time, and therefore, access to multiple analysis configurations. We refer to these tasks as multi-config tasks.
In version 0.3, the following tasks are multi-config tasks:
Most plotting tasks: tasks like cf.PlotVariables1D need to be able to draw events simulated for / recorded in multiple eras into the same plot.
cf.MLTraining: For many ML training applications it is reasonable to train on data from multiple eras, given that detector conditions are not too different. It is now possible to request data from multiple eras to be loaded for a single training.
cf.CreateDatacards (CMS-specific): The inference model interface as well as the datacard export routines now support entries for multiple configurations. See the changes to the inference model interface below for details.
Update Instructions#
All instructions only apply to the CLI usage of tasks.
Tasks listed above no longer have a --config parameter. However, they now have a --configs parameter that accepts multiple configuration names as a comma-separated sequence. To obtain the single-config behavior, just pass the name of a single configuration here.
Specific other parameters of multi-config tasks changed as well. Most notably, the --datasets and --processes parameters, which previously allowed for defining sequences of dataset and process names on the command line, now accept multiple comma-separated sequences. The number of sequences should be exactly one (applies to all configurations) or match the number of configurations given in --configs (one-to-one assignment). Sequences should be separated by colons. Example:
law run cf.PlotVariables1D --configs 22pre,22post --datasets tt_sl,st_tw:tt_sl,st_s
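The assignment rule for colon-separated sequences can be sketched in a few lines of plain Python. The helper assign_sequences is hypothetical and is not columnflow's parameter parser. It only mirrors the rule stated above: a single sequence applies to all configurations, otherwise the number of sequences has to match the number of configurations.

```python
from __future__ import annotations


def assign_sequences(configs: str, value: str) -> dict[str, tuple[str, ...]]:
    """Map each configuration name to its comma-separated sequence of names."""
    config_names = [name for name in configs.split(",") if name]
    sequences = [
        tuple(name for name in seq.split(",") if name)
        for seq in value.split(":")
    ]

    if len(sequences) == 1:
        # a single sequence applies to all configurations
        sequences = sequences * len(config_names)
    elif len(sequences) != len(config_names):
        raise ValueError(
            f"got {len(sequences)} sequences for {len(config_names)} configurations",
        )

    # one-to-one assignment in the order given on the command line
    return dict(zip(config_names, sequences))


print(assign_sequences("22pre,22post", "tt_sl,st_tw:tt_sl,st_s"))
print(assign_sequences("22pre,22post", "tt_sl,st_tw"))
```

For the command shown above, the first call assigns tt_sl and st_tw to 22pre, and tt_sl and st_s to 22post. The second call assigns the same two datasets to both configurations.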
Reducers#
Reducers are a new type of task array function that are invoked by the cf.ReduceEvents task.
They control how results of the event selection - event and object masks - are applied to the full event data.
See the types of task array functions and the detailed documentation on reducers for details.
The reducer’s job is
to apply the event selection mask (booleans) to select only a subset of events,
to apply object selection masks (booleans or integer indices) to create new collections of objects (e.g. specific jets, or leptons), and
to drop columns that are not needed by any of the downstream tasks.
These three steps were previously part of the default implementation of the cf.ReduceEvents task but are now fully configurable through custom reducers.
For compatibility with existing analyses, a default reducer called cf_default is provided by columnflow that implements exactly the previous behavior.
In doing so, it even relies on the auxiliary entry keep_columns in the configuration to determine which columns should be kept after reduction.
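To make the three reduction steps more tangible, the following toy sketch applies them to plain Python lists instead of awkward arrays. It is not the cf_default implementation. The functions apply_object_mask and reduce_events, the layout of object_masks (keys are pairs of source and destination collection names) and all column names are assumptions made for illustration.

```python
from __future__ import annotations


def apply_object_mask(objects: list, mask: list) -> list:
    """Select objects of one event with a boolean mask or with integer indices."""
    if all(isinstance(m, bool) for m in mask):
        return [obj for obj, keep in zip(objects, mask) if keep]
    return [objects[i] for i in mask]


def reduce_events(
    columns: dict[str, list],
    event_mask: list[bool],
    object_masks: dict[tuple[str, str], list[list]],
    keep_columns: set[str],
) -> dict[str, list]:
    # step 1: keep only events passing the event selection mask
    def select(values: list) -> list:
        return [v for v, keep in zip(values, event_mask) if keep]

    reduced = {name: select(values) for name, values in columns.items()}

    # step 2: build new collections from per-event object masks
    for (src, dst), mask in object_masks.items():
        mask = select(mask)
        for name in list(reduced):
            if name.startswith(f"{src}."):
                field = name[len(src) + 1:]
                reduced[f"{dst}.{field}"] = [
                    apply_object_mask(objs, m)
                    for objs, m in zip(reduced[name], mask)
                ]

    # step 3: drop all columns that are not requested downstream
    return {name: values for name, values in reduced.items() if name in keep_columns}


columns = {
    "event": [1, 2, 3],
    "Jet.pt": [[50.0, 30.0, 10.0], [20.0], [80.0, 60.0]],
    "Muon.pt": [[25.0], [], [40.0]],
}
result = reduce_events(
    columns,
    event_mask=[True, False, True],
    object_masks={
        ("Jet", "LeadJet"): [[0], [0], [0]],  # integer indices
        ("Jet", "HardJet"): [[True, True, False], [False], [True, True]],  # booleans
    },
    keep_columns={"event", "LeadJet.pt", "HardJet.pt"},
)
print(result)
```

In this sketch the object masks are defined on the full set of events and are reduced with the same event mask as the columns, so that masks and objects stay aligned per event.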
Example#
The following example creates a custom reducer that invokes columnflow’s default reduction behavior and additionally creates a new column.
from columnflow.reduction import Reducer, reducer
from columnflow.reduction.default import cf_default
from columnflow.util import maybe_import
from columnflow.columnar_util import set_ak_column

ak = maybe_import("awkward")


@reducer(
    uses={cf_default, "Jet.hadronFlavour"},
    produces={cf_default, "Jet.from_b_hadron"},
)
def example(self: Reducer, events: ak.Array, selection: ak.Array, **kwargs) -> ak.Array:
    # run cf's default reduction which handles event selection and collection creation
    events = self[cf_default](events, selection, **kwargs)

    # compute and store additional columns after the default reduction
    # (so only on a subset of the events and objects which might be computationally lighter)
    col = abs(events.Jet.hadronFlavour) == 5
    events = set_ak_column(events, "Jet.from_b_hadron", col, value_type=bool)

    return events
Update Instructions#
In general, there is no need to update your code. However, you will notice that output paths of all tasks downstream of (and including) cf.ReduceEvents will have an additional fragment like .../red__cf_default/... to reflect the choice of the reducer.
The reduction behavior that was previously part of the cf.ReduceEvents task is now encapsulated by a default reducer called cf_default. To extend or alter its behavior, create your own implementation either from scratch or by inheriting from it and only overwriting some of its hooks.
Invoke your reducer by adding --reducer MY_REDUCER_CLASS on the command line or by adding an auxiliary entry default_reducer to your configuration.
If you decide to control the set of columns that should be available after reduction solely through your reducer, and no longer through the keep_columns auxiliary entry in your configuration, you can do so by redefining the produces set of your reducer.
Histogram Producers#
In release v0.2 and before, the amount of control users had over the creation of histograms within cf.CreateHistograms was limited to the selection of variables to use (through the --variables parameter) and the definition of event weights to be used during histogram filling.
The latter was configured by specifying a so-called weight producer (through the --weight-producer parameter), which referred to the name of a task array function.
As of v0.3, we generalized this concept and renamed it to histogram producers.
Use --hist-producer in the command line to specify the histogram producer you intend to use.
See the full histogram producer documentation for more info.
In short, histogram producers continue to be task array functions, but they provide additional hooks to control different aspects of the histogramming process:
create_hist(self, variables: list[od.Variable], task: law.Task) -> hist.Histogram: Given a list of variables, creates and returns a new histogram, with arbitrary axes, binning and weight storage.
fill_hist(self, h: hist.Histogram, data: dict[str, Any], task: law.Task) -> None: Provided columnar data to fill (with fields "category", "process", "shift" (a string) and "weight"), controls the way this data is filled into the histogram.
post_process_hist(self, h: hist.Histogram, task: law.Task) -> hist.Histogram: After all data was filled in cf.CreateHistograms, allows changing the histogram before it is saved to disk.
post_process_merged_hist(self, h: hist.Histogram, task: law.Task) -> hist.Histogram: Invoked by cf.MergeHistograms, allows changing the merged histogram before it is saved for subsequent processing.
The only requirement that columnflow imposes on histograms for plotting and export as part of statistical models is the existence of categorical (string) axes "category", "process" and "shift" after merging.
The main callable of a histogram producer continues to be responsible for returning (and potentially preprocessing) the event chunk to histogram, as well as a float array representing event weights in a 2-tuple, consistent with the previous behavior of weight producers.
Note that, unlike for most other task array functions, columnflow provides a default histogram producer named cf_default.
It handles the histogram definition and filling in a backwards-compatible way, as well as a post-processing step that converts the category and shift axes from categorical integer to string types (for consistency across configuration objects when used in multi-config tasks).
It is recommended to extend this default histogram producer in case you only need to change a single aspect of the histogramming process with respect to the default behavior.
See the example below for how to do this.
Example#
from columnflow.histogramming import HistProducer
from columnflow.histogramming.default import cf_default
from columnflow.util import maybe_import

ak = maybe_import("awkward")


@cf_default.hist_producer(
    uses={"{normalization,pileup,btag}_weight"},
)
def example(self: HistProducer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    """
    Example histogram producer that inherits from columnflow's default and
    changes the event weight only.
    """
    # compute the event weight
    weight = events.normalization_weight * events.pileup_weight * events.btag_weight

    # return the events to histogram and their weights as a 2-tuple
    return events, weight
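The interplay of the create_hist, fill_hist and post_process_hist hooks can also be pictured without any histogramming library. The toy below uses a plain dictionary as the histogram, drops the self and task arguments, and assumes that the variable values are already bin indices. CATEGORY_NAMES is a hypothetical id-to-name mapping. The sketch only mirrors the idea that categories are filled as integers and converted to strings in the post-processing step. It is not how cf_default is implemented.

```python
from __future__ import annotations

from collections import defaultdict

# hypothetical mapping from integer category ids to category names
CATEGORY_NAMES = {0: "incl", 1: "2j"}


def create_hist(variables: list[str]) -> dict:
    """Create an empty toy histogram with the three mandatory axes plus variables."""
    return {
        "axes": ["category", "process", "shift", *variables],
        "bins": defaultdict(float),
    }


def fill_hist(h: dict, data: dict[str, list]) -> None:
    """Add the weight of each entry to the bin identified by its axis values."""
    columns = [data[axis] for axis in h["axes"]]
    for *key, weight in zip(*columns, data["weight"]):
        h["bins"][tuple(key)] += weight


def post_process_hist(h: dict) -> dict:
    """Convert the integer category axis to string category names."""
    bins: defaultdict = defaultdict(float)
    for (category, *rest), value in h["bins"].items():
        bins[(CATEGORY_NAMES[category], *rest)] += value
    return {"axes": h["axes"], "bins": bins}


data = {
    "category": [0, 0, 1],
    "process": ["tt", "tt", "st"],
    "shift": ["nominal", "nominal", "nominal"],
    "n_jet": [2, 2, 3],  # already binned values
    "weight": [1.0, 0.5, 2.0],
}

h = create_hist(["n_jet"])
fill_hist(h, data)
h = post_process_hist(h)
print(dict(h["bins"]))
```

After post-processing, the toy histogram has string-valued "category", "process" and "shift" entries in every bin key, which corresponds to the requirement on categorical axes stated above.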
Update Instructions#
In case you used a weight producer before, convert it to a HistProducer(). There should be no change necessary for the main event callable.
On the command line, use --hist-producer instead of --weight-producer.
Note that the weight__ prefix in the weight producer related fragment of output paths of all tasks downstream of (and including) cf.CreateHistograms was changed to hist__ accordingly.
If you do not intend to alter the default histogram definition, filling and post-processing, make sure to inherit from cf_default as shown in the example above.
Inference Model Updates#
As stated above, multi-config tasks allow for the inclusion of multiple analysis configurations in a single task to be able to access event data that spans multiple eras.
This is particularly useful for tasks that export statistical models like cf.CreateDatacards (CMS-specific), and all other tasks that inherit from the generalized SerializeInferenceModelBase task.
To support this new feature, the underlying InferenceModel, i.e., the container object able to configure statistical models for your analysis, was updated.
Pointers to analysis-specific objects in category and process definitions are now to be stored per configuration (see example below).
This info is picked up by (e.g.) cf.CreateDatacards to pull in information and data from multiple data taking eras to potentially fill their event data into the same inference category.
As for all multi-config tasks, pass a sequence of configuration names to the --configs parameter on the command line.
Example#
The following example demonstrates how to define an inference model that …
from columnflow.inference import InferenceModel, FlowStrategy, inference_model


@inference_model
def example_model(self: InferenceModel) -> None:
    """
    Initialization method for the inference model.
    Use instance methods to define categories, processes and parameters.
    """
    # add a category
    self.add_category(
        "example_category",
        # add config dependent settings
        config_data={
            config_inst.name: self.category_config_spec(
                # name of the analysis category in the config
                # (ch and cat are placeholders that need to be defined beforehand)
                category=f"{ch}__{cat}__os__iso",
                # name of the variable
                variable="jet1_pt",
                # names (or patterns) of datasets with real data in the config
                data_datasets=["data_*"],
            )
            for config_inst in self.config_insts
        },
        # additional category settings
        mc_stats=10.0,
        flow_strategy=FlowStrategy.move,
    )

    # add processes
    self.add_process(
        name="TT",
        # add config dependent settings
        config_data={
            config_inst.name: self.process_config_spec(
                # name of the (parent) process in the config
                process="tt",
                # names of MC datasets in the config
                mc_datasets=["tt_sl_powheg", "tt_dl_...", ...],
            )
            for config_inst in self.config_insts
        },
        # additional process settings
        is_signal=False,
    )

    # more processes here
    ...
Update Instructions#
In definitions of categories, processes and parameters within your inference model, make sure that all pointers that refer to analysis-specific objects are stored in a dictionary with keys being configuration names. These dictionaries are stored in fields named config_data.
Use the provided factory functions to create these dictionary structures, which also invokes some additional value validation:
for categories: category_config_spec()
for processes: process_config_spec()
for parameters: parameter_config_spec()
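The shape of such a config_data dictionary, with one validated entry per configuration name, can be sketched in plain Python. The function toy_category_spec is a hypothetical stand-in for a factory like category_config_spec() and only checks a few obvious conditions. The dataset patterns are made up for illustration.

```python
from __future__ import annotations


def toy_category_spec(category: str, variable: str, data_datasets: list[str]) -> dict:
    """Toy factory: validate the entries for one configuration and return them as a dict."""
    if not isinstance(category, str) or not category:
        raise ValueError("category must be a non-empty string")
    if not isinstance(variable, str) or not variable:
        raise ValueError("variable must be a non-empty string")
    if not all(isinstance(d, str) and d for d in data_datasets):
        raise ValueError("data_datasets must contain non-empty strings")
    return {
        "category": category,
        "variable": variable,
        "data_datasets": list(data_datasets),
    }


# hypothetical data dataset patterns per configuration name
data_datasets_per_config = {
    "22pre": ["data_mu_*"],
    "22post": ["data_mu_*", "data_e_*"],
}

# one entry per configuration, keyed by the configuration name
config_data = {
    config_name: toy_category_spec("incl", "jet1_pt", datasets)
    for config_name, datasets in data_datasets_per_config.items()
}
print(config_data)
```

Going through a factory instead of writing the inner dictionaries by hand means that a misspelled or missing entry fails early, at model definition time, rather than later during the export.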
Changed Plotting Task Names#
The visualization of systematic uncertainties is updated as of v0.3. A new plot method was introduced to show not only the effect of the statistical uncertainty (due to the limited amount of simulated events) as a grey, hatched area, but also that of systematic uncertainties as a differently colored band.
The task that invokes this plot method by default is cf.PlotShiftedVariables1D.
See the full task graph in our wiki to see its dependencies on other tasks.
Note that this task is not new, but it has been changed to include the systematic uncertainty bands.
In version v0.2 and below, this task was used to plot the effect of a single up or down variation of a single shift.
This behavior is now covered by a task called cf.PlotShiftedVariablesPerShift1D.
Update Instructions#
If you are interested in creating plots showing the effect of one or multiple shifts in the same graph, use the cf.PlotShiftedVariables1D task.
If you want to plot the effect of a single up or down variation of a single shift, use the cf.PlotShiftedVariablesPerShift1D task (formerly known as cf.PlotShiftedVariables1D).
Miscellaneous smaller updates#
The SelectorStepsMixin was removed and its functionality was moved into the standard SelectorClassMixin and SelectorMixin classes.
columnflow.util.InsertableDict was removed in favor of law.util.InsertableDict.