Work in Progess! This guide has not been finalized and the presented code has not yet been tested.

TODO: make the code testable

Categories#

In columnflow, there are many tools to create a complex and flexible categorization of all analysed events. Generally, this categorization can be layered. We refer to the smallest building block of these layers as leaf categories, which can subsequently be either run individually or combined into more complex categories. This guide presents how to implement a set of categories in columnflow and shows how to use the resulting categories via the CreateYieldTable task.

Create a category#

If you used the analysis template to setup your analysis, there are already two categories included, named incl and 2j. To test that the categories are properly implemented, we can use the CreateYieldTable task:

law run cf.CreateYieldTable --version v1 \
    --calibrators example --selector example --producers example \
    --processes tt,st --categories incl,2j

When all tasks are finished, this should create a table presenting the event yield of each process for all categories. To create new categories, we essentially need three ingredients:

  1. We need to define our categories as order.category.Category instances in the config

add_category(
    cfg,
    id=1,
    name="incl",
    selection="cat_incl",
    label="inclusive",
)
  1. We need to write a Categorizer that defines which events to select for each category. The name of the Categorizer needs to be the same as the “selection” attribute of the category inst.

# coding: utf-8

"""
Exemplary selection methods.
"""

from columnflow.categorization import Categorizer, categorizer
from columnflow.util import maybe_import

ak = maybe_import("awkward")


#
# categorizer functions used by categories definitions
#

@categorizer(uses={"event"})
def cat_incl(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    # fully inclusive selection
    return events, ak.ones_like(events.event) == 1


@categorizer(uses={"Jet.pt"})
def cat_2j(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    # two or more jets
    return events, ak.num(events.Jet.pt, axis=1) >= 2

Keep in mind that the module in which your define your Categorizer needs to be included in the law config

  1. We need to run the category_ids Producer to create the category_ids column. This is directly included via the --producers category_ids, but you could also run the category_ids Producer as part of another Selector or Producer.

@producer(
    uses={category_ids},
    produces={category_ids},
)
def my_producer(self: Producer, events: ak.Array, **kwargs) -> ak.Array:
    # produce your category ids
    events = self[category_ids](events, **kwargs)

    # do whatever else your producer needs to do
    ...

    return events
But how does this actually work?

The category_ids Producer loads all category instances from your config. For each leaf category inst (which is the “smallest unit” of categories), it maps the category_inst.selection string to a Categorizer and adds it the to the uses and produces, meaning that columns required (produced) by the Categorizer will automatically be loaded (stored) when running the category_ids Producer.

During the event processing, the Categorizer of each leaf category is evaluated to generate a mask, which defines, whether the event is part of this category or not. The mask is then transformed to an array of ids (either the category_inst.id if True, and None for False entries).t

In the end, we return a jagged array of category ids, which allows us to categorize one event into multiple different types of categories.

You can also store a list of strings in the category_inst.selection field. In that case, the logical and of all masks obtained by the Categorizer is used to define the mask corresponding to this category.

And where are the category_ids actually used?

The category_ids column is primarily used when creating our histograms (e.g. in the CreateHistograms task). The created histograms always contain one axis for categories, using the values from the category_ids column. Since this column is a jagged array, it is possible to fill events either never or multiple times in a histogram.

Other tasks such as PlotVariables1D are then using this axis with categories to obtain all entries from the histogram corresponding to the category that is requested via the --categories parameter. When the given category is not a leaf category but contains categories itself, the ids of all its leaf categories combined are used for the category.

Creation of nested categories#

Note

The following section is mostly included for didactic purposes. To implement a set of nested categories, it is recommended to follow the section Categorization with multiple layers.

Often times, there are multiple layers of categorization. For example, we might want to categorize based on the number of leptons and the number of jets.

# do 0 lepton vs >= 1 lepton instead?
cat_1e = config.add_category(
    name="1e",
    id=10,
    selection="cat_1e",
    label="1 Electron, 0 Muons",
)
cat_1mu = config.add_category(
    name="1mu",
    id=20,
    selection="cat_1mu",
    label="1 Muon, 0 Electrons",
)
cat_2jet = config.add_category(
    name="2jet",
    id=200,
    selection="cat_2jet",
    label=r"$N_{jets} \leq 2$",
)
cat_3jet = config.add_category(
    name="3jet",
    id=300,
    selection="cat_3jet",
    label=r"$N_{jets} \geq 3$",
)
@categorizer(uses={"Jet.pt"})
def cat_2jet(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    return events, ak.num(events.Jet.pt, axis=1) <= 2

@categorizer(uses={"Jet.pt"})
def cat_3jet(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    return events, ak.num(events.Jet.pt, axis=1) >= 3

@categorizer(uses={"Muon.pt", "Electron.pt"})
def cat_1e(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    return events, (ak.num(events.Muon.pt, axis=1) == 0 & ak.num(events.Electron.pt, axis=1) == 1)

@categorizer(uses={"Muon.pt", "Electron.pt"})
def cat_1mu(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    return events, (ak.num(events.Muon.pt, axis=1) == 1 & ak.num(events.Electron.pt, axis=1) == 0)

This code snippet produces 4 new categories, which (after rerunning the category_ids producer and recreating all necessary histograms) can already be used:

law run cf.CreateYieldTable --version v1 \
    --calibrators example --selector example --producers example \
    --processes tt,st --categories "*"

As a next step, one might want to combine these categories to categorize based on the number of jets and leptons simutaneously. You could simply create all combinations by hand. The following code snippet shows how to combine the “2jet” and the “1lep” category:

cat_2jet = config.get_category("2jet")
cat_3jet = config.get_category("3jet")
cat_1e = config.get_category("1e")
cat_1mu = config.get_category("1mu")

# add combined category as child to the existing categories
cat_1e__2jet = cat_2jet.add_category(
    name="1e__2jet",
    id=cat_2jet.id + cat_1e.id,
    # our `category_ids` producer can automatically combine multiple selections via logical 'and'
    selection=[cat_1e.selection, cat_2jet.selection],
    label="1 electron, 0 muons, >2 jets",
)
cat_1mu__2jet = cat_2jet.add_category(
    name="1e__2jet",
    id=cat_2jet.id + cat_1mu.id,
    selection=[cat_1mu.selection, cat_2jet.selection],
    label="1 muon, 0 electrons, 2 jets",
)
cat_1e__3jet = cat_3jet.add_category(
    name="1e__3jet",
    id=cat_3jet.id + cat_1e.id,
    selection=[cat_1e.selection, cat_3jet.selection],
    label="1 electron, 0 muons, 3 jets",
)
cat_1mu__3jet = cat_3jet.add_category(
    name="1e__3jet",
    id=cat_3jet.id + cat_1mu.id,
    selection=[cat_1mu.selection, cat_3jet.selection],
    label="1 muon, 0 electrons, 3 jets",
)

# add children also to lepton categories
cat_1e.add_category(cat_1e__2jet)
cat_1e.add_category(cat_1e__3jet)
cat_1mu.add_category(cat_1mu__2jet)
cat_1mu.add_category(cat_1mu__3jet)

We now created categories with dependencies between each other. In columnflow, we always use the smallest units of categories (commonly called “leaf categories”) to build our parent categories. For example, the 2jet category consists of the leaf categories 1e__2jet and 1mu__2jet. To select events corresponding to 2jet, we will add all events with category ids corresponding to either 1e__2jet or 1mu__2jet.

Note

We might add unexpected selections by combining categorization.

In this example, the cat_1e and cat_1mu Categorizer only select events with exactly one lepton (either electron or muon). This means, that events with zero or multiple leptons will not be considered in our categorization, even when building e.g. the 2jet category.

Let’s test if our leaf categories are working as intended:

law run cf.CreateYieldTable --version v1 \
    --calibrators example --selector example --producers example \
    --processes tt,st --categories 1e,2jet,1e__2jet,1e__3jet,1mu__2jet,1mu__3jet

We should see that the yield of the 2jet category is the same as the 1e__2jet and 1mu__2jet categories combined. This is to be expected, since the 2jet category will be built by combining histograms from all of their leaf categories.

Note

Take care to define your categories orthogonal!

There is no mechanism that automatically prevents double counting. It is therefore essential to define categories such that there is no overlap between leaf categories of any defined category.

Example of what NOT to do

We might have two categories selecting either exactly two or at least two jets.

cat_2jet = config.add_category(
    name="2jet",
    id=200,
    selection="cat_2jet",
    label=r"$N_{jets} = 2$",
)
cat_geq2jet = config.add_category(
    name="geq2jet",
    id=300,
    selection="cat_geq2jet",
    label=r"$N_{jets} \geq 2$",
)

@categorizer(uses={"Jet.pt"})
def cat_2jet(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    return events, ak.num(events.Jet.pt, axis=1) == 2

@categorizer(uses={"Jet.pt"})
def cat_geq2jet(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
    return events, ak.num(events.Jet.pt, axis=1) >= 2

This can be done without any issues since these two categories are independent of each other (they will each produce their own category id). However, if we now add a parent category that contains both 2jet and geq2jet, this will result in couting all events twice that contain exactly two jets.

non_orthogonal_cat = config.add_category(
    name="non_orthogonal_cat",
    id=100,
    label="Exactly two and at least two jets, resulting in double-counting. You should not do this",
)
non_orthogonal_cat.add_category(cat_2jet)
non_orthogonal_cat.add_category(cat_geq2jet)

Categorization with multiple layers

But what should you do when the number of category combinations is getting large? For example, you might want to define categories based on the number of leptons, the number of jets, and based on a simple obervable such as HT:

jet_categories = (2, 3, 4, 5, 6)
lep_categories = (1, 2, 3)
ht_categories = ((0, 200), (200, 400), (400, 600))

If you want to build all combinations of this set of 5+3+3 categories, you end up with 5 x 3 x 3 = 45 leaf categories. To build all of them by hand is very tedious, unflexible, and error prone. Luckily, columnflow provides all necessary tools to build combinations of categories automatically. To use these tools, we first need to create our base categories and their corresponding Categorizers.

#
# add categories to config
#

for n_lep in lep_categories:
    config.add_category(
        id=10 * n_lep,
        name=f"{n_lep}lep",
        selection=f"cat_{n_lep}lep",
        label=f"{n_lep} leptons",
    )
for n_jet in jet_categories:
    config.add_category(
        id=100 * n_jet,
        name=f"{n_jet}jet",
        selection=f"cat_{n_jet}jet",
        label=f"{n_jet} jets",
    )
for i, (ht_down, ht_up) in enumerate(ht_categories):
    config.add_category(
        id=1000 * i,
        name=f"ht_{ht_down}to{ht_up}",
        selection=f"cat_ht{ht_down}to{ht_up}",
        label=f"HT in ({ht_down}, {ht_up}]",
    )

#
# define Categorizer modules
#

for n_jet in jet_categories:
    @categorizer(name=f"cat_{n_jet}jet", uses={"Jet.pt"})
    def cat_n_jet(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
        return events, ak.num(events.Jet.pt, axis=1) == n_jet

for n_lep in (1, 2, 3):
    @categorizer(name=f"cat_{n_lep}lep", uses={"Electron.pt", "Muon.pt"})
    def cat_n_lep(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
        return events, (ak.num(events.Electron.pt, axis=1) + ak.num(events.Muon.pt, axis=1)) == n_lep

for ht_down, ht_up in ht_categories:
    @categorizer(name=f"cat_ht{ht_down}to{ht_up}", uses={"Jet.pt"})
    def cat_ht(self: Categorizer, events: ak.Array, **kwargs) -> tuple[ak.Array, ak.Array]:
        ht = ak.sum(events.Jet.pt, axis=1)
        return events, ((ht > ht_down) & (ht <= ht_up))

This will only create the (5+3+3) parent categories. To create all possible combinations of these categories, we can use the create_category_combinations() function that will do most of the work for us:

def name_fn(root_cats):
    """ define how to combine names for category combinations """
    cat_name = "__".join(cat.name for cat in root_cats.values())
    return cat_name


def kwargs_fn(root_cats):
    """ define how to combine category attributes for category combinations """
    kwargs = {
        "id": sum([c.id for c in root_cats.values()]),
        "label": ", ".join([c.name for c in root_cats.values()]),
        # optional: store the information on how the combined category has been created
        "aux": {
            "root_cats": {key: value.name for key, value in root_cats.items()},
        },
    }
    return kwargs

#
# define how to combine categories
#


category_blocks = OrderedDict({
    "lep": [config.get_category(f"{n_jet}lep") for n_lep in lep_categories],
    "jet": [config.get_category(f"{n_jet}jet") for n_jet in jet_categories],
    "ht": [config.get_category(f"ht{ht_down}to{ht_up}") for ht_down, ht_up in ht_categories],
})

from columnflow.config_util import create_category_combinations

# this function creates all possible category combinations without producing duplicates
# (e.g. category '1lep__2jet__ht200to400' will be produced, but '2jet__1lep__ht200to400' will be skipped)
n_cats = create_category_combinations(
    config,
    category_blocks,
    name_fn=name_fn,
    kwargs_fn=kwargs_fn,
    skip_existing=False,
)

# let's check what categories have been produced
print(f"{n_cats} have been created by the create_category_combinations function")
all_cats = [cat.name for cat, _, _ in config.walk_categories()]
print(f"List of all cateogries in our config: \n{all_cats}")

To test our final set of categories, we can call our CreateYieldTable task again.

law run cf.CreateYieldTable --version v1 \
    --calibrators example --selector example --producers example \
    --processes tt,st --categories "*"

Note

This categorization is not inclusive!

Since we only store leaf category ids, all the events that are not considered by one of the category blocks will not be considered at all. For this set of categories it means that we implicitly added cuts on the number of jets (>= 2 and <= 6), the number of leptons (>= 1 and <= 3), and HT (<= 600) as soon as we use one of these categories.

TODO: it might be more instructive to build this example such that our categorization is inclusive.

Extend your categorization as part of a Producer#

Some of your categories might need some computationally expensive reconstructed observables and can therefore only be called after certain requirements are met. In these cases, it might be beneficial to first produce columns and then load these columns when necessary. To streamline this, you can set these dependencies as part of the Producer instance that creates the category ids (i.e. via the category_ids producer).

To properly set this up, add the category instances as part of the Producer.init. Columns from custom requirements can be added to a Producer via the Producer.requires in combination with the Producer.setup.

@producer(
    uses={category_ids},
    produces={category_ids},
    ml_model_name="my_ml_model",
)
def my_categorization_producer(self: Producer, events: ak.Array, **kwargs) -> ak.Array:
    # do whatever else this Producer needs to to
    ...

    # reproduce category ids
    events = self[category_ids](events, **kwargs)

    return events


# The *requires* function can be used to add custom requirements to your Producer task call.
# In this example, we add the MLEvaluation output to our requirements, e.g. to
# categorize events based on a DNN output score
@my_categorization_producer.requires
def my_categorization_producer_reqs(self: Producer, reqs: dict) -> None:
    if "ml" in reqs:
        return

    from columnflow.tasks.ml import MLEvaluation
    reqs["ml"] = MLEvaluation.req(self.task, ml_model=self.ml_model_name)


# To also load columns from our custom requirements, we need to tell our Producer, how to access
# columns from the inputs. This is achieved via this *setup* function
@my_categorization_producer.setup
def my_categorization_producer_setup(self: Producer, reqs: dict, inputs: dict, reader_targets: InsertableDict) -> None:
    reader_targets["mlcolumns"] = inputs["ml"]["mlcolumns"]


# This *init* function is used to add categories as part of the Producer init. This means that
# your config will always contain this category as long as you're running some task where this
# particular Producer has been requested.
@my_categorization_producer.init
def my_categorization_producer_init(self: Producer) -> None:
    # ensure that categories are only added to the config once
    tag = f"{my_categorization_producer_init.__name__}_called"
    if self.config_inst.has_tag(tag):
        return

    # add categories to config inst
    self.config_inst.add_category(
        name="my_category",
        id=12345,
        selection="my_categorizer",
    )

Helper functions for categories#

get_events_from_categories#

To obtain the events of a certain category (or categories) of an akward array events that contains the category_ids column, you can use the get_events_from_categories() function.

from columnflow.util import get_events_from_cateogries

# you can pass a string of a category in combination with the config inst
selected_events = get_events_from_cateogries(events, "my_category", config_inst)

# it is also possible to directly pass a category. In that case, no config inst is needed
selected_events = get_events_from_cateogries(events, config_inst.get_category("my_category"))

# it is also possible to pass a list of categories. In that case, all events
# that belong to one of the requested categories are selected
selected_events = get_events_from_cateogries(events, ["cat1", "cat2"], config_inst)

This function automatically consideres category ids of all leaf categories from the requested categories. This is especially helpful when your category contains multiple leaf categories.

Key points (TL; DR)#

  • Categories are order.category.Category instances defined in the config.

  • We can add categories to each other; the “smallest unit” of category is referred to as “leaf category”.

  • We build each category by combining all of it’s leaf categories (both in columns and in histograms).

  • Leaf categories need to define one or multiple Categorizer, which is a function that defines whether or not an event belongs in this category.

  • The category_ids column is produced via the category_ids Producer.

  • Make sure that for each category, all of its leaf categories are defined orthogonal to prevent double counting.

  • Groups of categories can be combined via create_category_combinations().

  • Combining categories can impact your parent categories; this can be prevented by defining your categories such that each group of categories is inclusive.