Config Objects#

Generalities#

The order package defines several classes to implement the metadata of an Analysis. The order documentation and its Quickstart section provide an introduction to these classes. In this section we concentrate on using the order classes to define your analysis.

The three main classes needed to define your analysis are order.analysis.Analysis, order.config.Campaign and order.config.Config. Their purpose and definition can be found in the Analysis, Campaign and Config part of the Quickstart section of the order documentation.

After defining your Analysis object and your Campaign object(s), you can use the command

cfg = analysis.add_config(campaign, name=your_config_name, id=your_config_id)

to create the new Config object cfg, which will be associated with both the Analysis object and the Campaign object used to create it. As the Config object contains the analysis-dependent information related to a specific campaign, it should hold most of the information needed to run your analysis. Therefore, this section presents the Config parameters required by columnflow as well as some convenience parameters.

When starting your analysis, do not forget that an analysis template and its config already exist in the analysis_templates/cms_minimal directory of the repository.

The Config stores information in two general forms: objects from the order package, which are required for your analysis to run in columnflow, and additional parameters, which are saved in the auxiliary field, accessible through the "x" attribute. In principle, the auxiliary field can contain any parameter the user wants to save and reuse in parts of the analysis. However, several names in the auxiliary field already have a meaning in columnflow, and their values should respect the format columnflow expects. Both forms are presented below.

Additionally, please note that some columnflow objects, like some Calibrators and Producers, require specific information to be accessible under predefined keywords. As explained in the object-specific variables section, please check the documentation of these objects before using them.

It is generally advised to use functions to set up Config objects. This enables easy and reliable reuse of the parts of your analysis that are the same or similar between Campaigns (e.g. parts of the uncertainty model). Additionally, parts of the analysis that change frequently, e.g. the definition of variables, can be defined separately, improving the overall organization and readability of your code. An example of such a separation can be found in the existing hh2bbtautau analysis.
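As a minimal sketch, such a setup function could look like the following; the helper functions hinted at in the comments are purely illustrative.

import order as od

def add_config(
    analysis: od.Analysis,
    campaign: od.Campaign,
    config_name: str,
    config_id: int,
) -> od.Config:
    # create the config for this analysis-campaign combination
    cfg = analysis.add_config(campaign, name=config_name, id=config_id)

    # delegate reusable or frequently changing parts to separate helpers, e.g.
    # add_processes_and_datasets(cfg, campaign)
    # add_variables(cfg)
    # add_categories(cfg)

    return cfg

# called once per campaign, e.g.
# cfg_2017 = add_config(analysis, campaign_run2_2017, "run2_2017", config_id=1)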

Parameters from the order package (required)#

Processes#

The physical processes to be included in the analysis. These should be saved as objects of the order.process.Process class and added to the Config object using its order.config.Config.add_process() method. An example is given in the columnflow analysis template:

# get all root processes of the campaign
# (e.g. via columnflow.config_util.get_root_processes_from_campaign, see below)
procs = get_root_processes_from_campaign(campaign)

process_names = [
    "data",
    "tt",
    "st",
]
for process_name in process_names:
    # add the process to the config
    proc = cfg.add_process(procs.get(process_name))

Additionally, these processes must have corresponding datasets, which are to be added to the Config as well (see the next section). It is possible to get all root processes of a specific campaign using the get_root_processes_from_campaign() function from columnflow. An example of a Process definition is given in the Analysis, Campaign and Config section of the columnflow documentation, and more information about processes can be found in the order.process.Process and Quickstart sections of the order documentation. Examples of information carried by a process are its cross section, registered under the order.process.Process.xsecs attribute and further used for normalization_weights in columnflow, and a color for the plotting scripts, which can be set via the order.mixins.ColorMixin.color1 attribute of the process.
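For illustration, such metadata could be attached to a process as sketched below; the cross section value and the color are purely illustrative placeholders.

from scinum import Number

# illustrative sketch: attach a cross section (in pb, at 13 TeV) and a plotting color
proc = cfg.get_process("tt")
proc.xsecs[13] = Number(833.9, {"scale": 0.03j})  # placeholder value and relative uncertainty
proc.color1 = (244, 182, 66)  # RGB color used by the plotting tasks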

Datasets#

The actual datasets to be processed in the analysis. These should be saved as objects of the order.dataset.Dataset class and added to the Config object using its order.config.Config.add_dataset() method. The datasets added to the Config object must correspond to the datasets added to the Campaign object associated with the Config object; they are accessible through the order.config.Campaign.get_dataset() method of the Campaign class. An example is given in the columnflow analysis template:

dataset_names = [
    # data
    "data_mu_b",
    # backgrounds
    "tt_sl_powheg",
    # signals
    "st_tchannel_t_powheg",
]
for dataset_name in dataset_names:
    # add the dataset
    dataset = cfg.add_dataset(campaign.get_dataset(dataset_name))

The Dataset objects contain, for example, information about the number of files and the number of events in a dataset, its keys (the identifiers or origins of a dataset, used by the cfg.x.get_dataset_lfns parameter presented below in the section on custom retrieval of dataset files), and whether it contains observed or simulated data.
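A short sketch of how some of this information can be inspected, assuming the dataset names from the example above:

# sketch: inspecting information carried by a dataset
dataset = cfg.get_dataset("tt_sl_powheg")
nominal_info = dataset.info["nominal"]
print(nominal_info.n_files, nominal_info.n_events)  # file and event counts
print(nominal_info.keys)                            # dataset identifiers, e.g. DAS keys
print(dataset.is_data)                              # False for simulated datasets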

It is also possible to change information of a dataset in the config script. An example would be reducing the number of files to process for test purposes in a specific test config. This could be done with the following lines of code:

# for testing: limit the number of files per dataset (e.g. inside the dataset loop above)
n_files_max = 5
for info in dataset.info.values():
    info.n_files = min(info.n_files, n_files_max)

Once the processes and datasets have both been added to the config, one can check that the root process of each dataset is among the registered processes using the columnflow function verify_config_processes().
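Assuming the import path used in the cms_minimal template, this check is a one-liner:

from columnflow.config_util import verify_config_processes

# verify that the root process of each dataset is registered; warn instead of raising
verify_config_processes(cfg, warn=True)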

An example of a Dataset definition is given in the Analysis, Campaign and Config section of the columnflow documentation.

Variables#

In order to create histograms out of the processed datasets, columnflow uses objects of the order.variable.Variable class. These Variables need to be added to the config using the order.config.Config.add_variable() method. For the Config object cfg, the standard syntax would be as follows for the transverse momentum of the first jet:

# pt of the first jet in every event
cfg.add_variable(
    name="jet1_pt",  # variable name, to be given to the "--variables" argument for the plotting task
    expression="Jet.pt[:,0]",  # content of the variable
    null_value=EMPTY_FLOAT,  # value to be given if content not available for event
    binning=(40, 0.0, 400.0),  # (bins, lower edge, upper edge)
    unit="GeV",  # unit of the variable, if any
    x_title=r"Jet 1 $p_{T}$",  # x title of histogram when plotted
)

It is worth mentioning that you do not need to select a specific jet per event in the expression argument (here Jet.pt[:,0]); you can also get a flattened histogram over all jets in all events with expression="Jet.pt".
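For illustration, such a flattened variable could be defined analogously to the example above:

# pt of all jets in all events, flattened into a single histogram
cfg.add_variable(
    name="jet_pt",
    expression="Jet.pt",
    null_value=EMPTY_FLOAT,
    binning=(40, 0.0, 400.0),
    unit="GeV",
    x_title=r"$p_{T}$ of all jets",
)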

In histogramming tasks such as CreateHistograms, one histogram is created per Variable given via the --variables argument, accessing column information based on the expression of the Variable and storing it in histograms with the binning defined via the binning argument of the Variable.

The list of possible keyword arguments can be found in the order documentation for the order.variable.Variable class. The value of the expression argument can resolve to either a one-dimensional or a multi-dimensional array; in the latter case the information is flattened before being histogrammed. Note that EMPTY_FLOAT is a columnflow-internal null value.

Category#

Categories are built to investigate specific parts of the phase space, for example for plotting. These objects are described in the Channel and Category part of the Quickstart section of the order documentation. You can add such a category with the add_category() function. When adding this object to your Config instance, the selection argument is expected to take the name of a Categorizer instead of a boolean expression in string format. An example for an inclusive category using the Categorizer cat_incl defined in the cms_minimal analysis template is given below:

add_category(
    cfg,
    id=1,
    name="incl",
    selection="cat_incl",
    label="inclusive",
)

It is recommended to always add an inclusive category with id=1 or name="incl", as it is used in various places, e.g. for the inclusive cutflow plots and the "empty" selector.
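For reference, a fully inclusive Categorizer can be written roughly as in the cms_minimal template; this is a sketch, so check the template for the exact code.

from columnflow.categorization import Categorizer, categorizer
from columnflow.util import maybe_import

ak = maybe_import("awkward")


@categorizer(uses={"event"})
def cat_incl(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    # fully inclusive selection: keep all events
    return events, ak.ones_like(events.event) == 1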

A more detailed description of the usage of categories in columnflow is given in the Categories section of this documentation.

Channel#

Similarly to categories, Channels are built to investigate specific parts of the phase space and are described in the Channel and Category part of the Quickstart section of the order documentation. They can be added to the Config object using order.config.Config.add_channel().

Shift#

In order to implement systematic variations in the Config object, the order.shift.Shift class can be used. Implementing systematic variations using shifts can take different forms depending on the kind of variation involved; a complete section dedicated to these implementations can be found in (TODO: add link Shift section). A Shift object is added to the Config object through the order.config.Config.add_shift() method. An example is given in the columnflow analysis template:

cfg.add_shift(name="nominal", id=0)

Often, shifts are related to auxiliary parameters of the Config, like the name of the scale factors involved, or the paths of source files in case the shift requires external information.
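As a hedged sketch, a pair of up/down shape variations for a hypothetical muon scale factor weight could be registered and connected to column aliases as follows; the shift names and ids are illustrative, and add_shift_aliases is a helper from columnflow's config utilities.

from columnflow.config_util import add_shift_aliases

# illustrative up/down variations of a muon scale factor weight
cfg.add_shift(name="mu_up", id=10, type="shape")
cfg.add_shift(name="mu_down", id=11, type="shape")
add_shift_aliases(cfg, "mu", {"muon_weight": "muon_weight_{direction}"})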

Auxiliary Parameters (optional)#

In principle, the auxiliaries of the Config may contain any kind of variables. However, several keys have a special meaning for columnflow, and for those you need to respect the expected format. These are presented first, followed by a few examples of the kind of information you might want to save in the auxiliary part of the config beyond them. If you would like to use modules that ship with columnflow, it is generally a good idea to first check their documentation to understand what kind of information you need to specify in the auxiliaries of your Config object.

Keep_columns#

During the ReduceEvents task, new files containing all events and objects remaining after the selections are created in parquet format. If the auxiliary argument keep_columns, accessible through cfg.x.keep_columns, exists in the Config object, only the columns declared explicitly there are kept after the reduction. Several tasks can make use of such an argument in the Config object to reduce their output. Therefore, the keep_columns argument expects a DotDict whose keys are the names of the tasks (with the cf. prefix) for which such a reduction should be applied and whose values are the sets of columns to be kept in the output of the respective task.

For easier handling of the list of columns, the ColumnCollection class was created. It defines several enumerations of columns to be kept according to a certain category. For example, all columns created during the SelectEvents task can be kept with the enum ALL_FROM_SELECTOR. An example is given below:
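The following sketch is loosely based on the cms_minimal template; the concrete columns to keep are analysis-specific.

from columnflow.columnar_util import ColumnCollection
from columnflow.util import DotDict

cfg.x.keep_columns = DotDict.wrap({
    "cf.ReduceEvents": {
        # general event info
        "run", "luminosityBlock", "event",
        # object columns needed downstream (illustrative)
        "Jet.pt", "Jet.eta", "Jet.phi", "Jet.mass",
        # all columns added during the SelectEvents task
        ColumnCollection.ALL_FROM_SELECTOR,
    },
    "cf.MergeSelectionMasks": {
        # columns needed e.g. for cutflow plots (illustrative)
        "cutflow.*", "process_id", "category_ids",
    },
})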

Custom retrieval of dataset files#

By default, the columnflow task GetDatasetLFNs obtains the logical file names (LFNs) of the datasets based on the keys argument of the corresponding order Dataset, using the get_dataset_lfns_dasgoclient() function, which queries the CMS DAS. This default behaviour can be changed via the auxiliary parameter cfg.x.get_dataset_lfns, which can be set to a custom function taking the same keyword arguments as the default; for more information, please consult the documentation of get_dataset_lfns_dasgoclient(). Based on these arguments, the custom function should build and return the list of paths corresponding to the dataset (the paths should not include the path to the remote file system). Two other auxiliary parameters can be changed:

  • get_dataset_lfns_sandbox provides the sandbox in which the GetDatasetLFNs task will be run and therefore expects a law.sandbox.base.Sandbox object, which can for example be obtained through the dev_sandbox() function.

  • get_dataset_lfns_remote_fs provides the remote file system on which the LFNs for the specific dataset can be found. It expects a function taking the dataset_inst as a parameter and returning the name of the file system as defined in the law config file.

An example of such a function and the definition of the corresponding config parameters for a campaign where all datasets have been custom processed and stored on a single remote file system is given below.

# custom lfn retrieval method in case the underlying campaign is custom uhh
if cfg.campaign.x("custom", {}).get("creator") == "uhh":
    def get_dataset_lfns(
        dataset_inst: od.Dataset,
        shift_inst: od.Shift,
        dataset_key: str,
    ) -> list[str]:
        # destructure dataset_key into parts and create the lfn base directory
        dataset_id, full_campaign, tier = dataset_key.split("/")[1:]
        main_campaign, sub_campaign = full_campaign.split("-", 1)
        lfn_base = law.wlcg.WLCGDirectoryTarget(
            f"/store/{dataset_inst.data_source}/{main_campaign}/{dataset_id}/{tier}/{sub_campaign}/0",
            fs=f"wlcg_fs_{cfg.campaign.x.custom['name']}",
        )

        # loop through files and interpret paths as lfns
        return [
            lfn_base.child(basename, type="f").path
            for basename in lfn_base.listdir(pattern="*.root")
        ]

    # define the lfn retrieval function
    cfg.x.get_dataset_lfns = get_dataset_lfns

    # define a custom sandbox
    cfg.x.get_dataset_lfns_sandbox = dev_sandbox("bash::$CF_BASE/sandboxes/cf.sh")

    # define custom remote fs's to look at
    cfg.x.get_dataset_lfns_remote_fs = lambda dataset_inst: f"wlcg_fs_{cfg.campaign.x.custom['name']}"

External_files#

If files from outside columnflow are needed for an analysis, be they local files or files available online (and accessible through wget), they can be declared in the cfg.x.external_files auxiliary parameter. They can then be copied to the columnflow outputs using the BundleExternalFiles task and used by any object that requires them. The cfg.x.external_files parameter expects a (possibly nested) DotDict with a user-defined key, used to retrieve the target in columnflow, and the link/path as value. It is also possible to give a tuple as value, with the link/path as the first entry and a version as the second. As an example, the cfg.x.external_files parameter might look like this, where json_mirror is a local path to a mirror of a specific commit of the jsonPOG-integration GitLab repository (CMS-specific):

cfg.x.external_files = DotDict.wrap({
    # lumi files
    "lumi": {
        "golden": ("/afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions17/13TeV/Legacy_2017/Cert_294927-306462_13TeV_UL2017_Collisions17_GoldenJSON.txt", "v1"),  # noqa
        "normtag": ("/afs/cern.ch/user/l/lumipro/public/Normtags/normtag_PHYSICS.json", "v1"),
    },

    # muon scale factors
    "muon_sf": (f"{json_mirror}/POG/MUO/{year}_UL/muon_Z.json.gz", "v1"),
})

An example of the usage of muon_sf, including the requirement of the BundleExternalFiles task, is given in the muon_weights Producer.

How to require a task in a Producer?

The following shows how to require the BundleExternalFiles task to have run, in the example of the muon weights Producer linked above:

@muon_weights.requires
def muon_weights_requires(self: Producer, reqs: dict) -> None:
    if "external_files" in reqs:
        return

    from columnflow.tasks.external import BundleExternalFiles
    reqs["external_files"] = BundleExternalFiles.req(self.task)

Luminosity#

The luminosity, needed for some normalizations and for the labels in the standard columnflow plots, needs to be given in the auxiliary argument cfg.x.luminosity as an object of the scinum.Number class, such that e.g. its nominal value is accessible. An example for a CMS luminosity of 2017 with uncertainty sources given as relative errors is shown below.

from scinum import Number
cfg.x.luminosity = Number(41480, {
    "lumi_13TeV_2017": 0.02j,
    "lumi_13TeV_1718": 0.006j,
    "lumi_13TeV_correlated": 0.009j,
})
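The nominal value can then be accessed wherever needed, for example:

# e.g. for plot labels or normalization factors
lumi_nominal = cfg.x.luminosity.nominal  # 41480.0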

Defaults#

Default values can be given for several command line parameters in columnflow, using the cfg.x.default_{parameter} entry in the Config object. The expected format is either:

  • a single string containing the name of the object to be used as default for parameters accepting only one argument or

  • a tuple for parameters accepting several arguments.

The command-line arguments supporting a default value from the Config object are given in the cms_minimal example of the analysis_templates and shown again below:

cfg.x.default_calibrator = "example"
cfg.x.default_selector = "example"
cfg.x.default_producer = "example"
cfg.x.default_weight_producer = "example"
cfg.x.default_ml_model = None
cfg.x.default_inference_model = "example"
cfg.x.default_categories = ("incl",)
cfg.x.default_variables = ("n_jet", "jet1_pt")

Groups#

It is also possible to create groups, which allow you to conveniently loop over certain command-line parameters. This is done with the cfg.x.{parameter}_groups arguments. The expected format is a dictionary with the custom group names as keys and lists of parameter values as values. The group name can then be given as a command-line argument instead of the single values. An example with a selector_steps group is given below.

# selector step groups for conveniently looping over certain steps
# (used in cutflow tasks)
cfg.x.selector_step_groups = {
    "default": ["muon", "jet"],
}

With this group defined in the Config object, running over the "muon" and "jet" selector steps in this order in a cutflow task can be done with the argument --selector-steps default.

All parameters for which groups are possible are given below:

# process groups for conveniently looping over certain processs
# (used in wrapper_factory and during plotting)
cfg.x.process_groups = {}

# dataset groups for conveniently looping over certain datasets
# (used in wrapper_factory and during plotting)
cfg.x.dataset_groups = {}

# category groups for conveniently looping over certain categories
# (used during plotting)
cfg.x.category_groups = {}

# variable groups for conveniently looping over certain variables
# (used during plotting)
cfg.x.variable_groups = {}

# shift groups for conveniently looping over certain shifts
# (used during plotting)
cfg.x.shift_groups = {}

# general_settings groups for conveniently looping over different values for the general-settings parameter
# (used during plotting)
cfg.x.general_settings_groups = {}

# process_settings groups for conveniently looping over different values for the process-settings parameter
# (used during plotting)
cfg.x.process_settings_groups = {}

# variable_settings groups for conveniently looping over different values for the variable-settings parameter
# (used during plotting)
cfg.x.variable_settings_groups = {}

# custom_style_config groups for conveniently looping over certain style configs
# (used during plotting)
cfg.x.custom_style_config_groups = {}

# selector step groups for conveniently looping over certain steps
# (used in cutflow tasks)
cfg.x.selector_step_groups = {
    "default": ["muon", "jet"],
}

# calibrator groups for conveniently looping over certain calibrators
# (used during calibration)
cfg.x.calibrator_groups = {}

# producer groups for conveniently looping over certain producers
# (used during the ProduceColumns task)
cfg.x.producer_groups = {}

# ml_model groups for conveniently looping over certain ml_models
# (used during the machine learning tasks)
cfg.x.ml_model_groups = {}

Reduced File size#

The target size of the files in MB after the MergeReducedEvents task can be set in the Config object with the cfg.x.reduced_file_size argument. A float number corresponding to the size in MB is expected. This value can also be changed with the merged_size argument when running the task. If nothing is set, the default value implemented in columnflow will be used (defined in the resolve_param_values() method). An example is given below.

# target file size after MergeReducedEvents in MB
cfg.x.reduced_file_size = 512.0

Object-specific variables#

Beyond the parameters mentioned above, additional Config entries may be needed, for example by specific Producers. These are not discussed here as they are not general parameters. We therefore invite users to check which Config entries are needed by each Calibrator, Selector and Producer, and in general by each CMS-specific object (i.e. objects in the cms subfolders), they want to use. Since the muon_weights Producer was already mentioned above, note that the cfg.x.muon_sf_names Config entry is needed for this Producer to run, as indicated in its docstring.
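Following existing analyses, this entry is typically a tuple of a correction set name and a data-taking era; the values below are illustrative, so check the Producer docstring for the exact format expected by your columnflow version.

# illustrative values for a 2017 UL setup
cfg.x.muon_sf_names = ("NUM_TightRelIso_DEN_TightIDandIPCut", "2017_UL")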

Among the CMS-specific objects, an example is the CreatePileupWeights task, which requires, for instance, the minimum bias cross sections cfg.x.minbias_xs and the pu entry in the external files.

Examples of these entries in the Config object can be found in existing CMS analyses working with columnflow, for example the hh2bbtautau analysis or the hh2bbww analysis from UHH.

Other examples of auxiliaries in the Config object#

As mentioned above, any kind of Python variable can be stored in the auxiliary field of the Config object. To give an idea of the kind of variables an analysis might want to include in the Config object in addition to the ones needed by columnflow, a few examples of variables that do not receive any specific treatment in native columnflow are given:

  • Triggers

  • b-tag working points

  • MET filters

For applications of these examples, you can look at already existing columnflow analyses, for example the hh2bbtautau analysis from UHH.