columnflow.ml

Definition of basic objects for describing and creating ML models.

Classes:

MLModel(analysis_inst, *[, parameters])

Minimal interface to ML models with connections to config objects (such as order.Config or order.Dataset) and, optionally, to tasks.

class MLModel(analysis_inst, *, parameters=None, **kwargs)[source]#

Bases: Derivable

Minimal interface to ML models with connections to config objects (such as order.Config or order.Dataset) and, optionally, to tasks.

Inheriting classes need to override eight methods:

sandbox()

datasets()

uses()

produces()

output()

open_model()

train()

evaluate()

See their documentation below for more info.

There are several optional hooks that allow for a custom setup after config objects have been assigned (setup()), a fine-grained configuration of additional training requirements (requires()), diverging training and evaluation phase spaces (training_configs(), training_calibrators(), training_selector(), training_producers()), or control over how hyperparameters are string-encoded for output declarations (parameter_pairs()).
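The shape of a subclass that overrides the eight methods listed above can be sketched as follows. To stay self-contained, the sketch derives from a hypothetical stand-in class, `MLModelStandIn`, instead of the real `MLModel`; the sandbox name, dataset name, column names, and all return values are placeholders chosen for illustration, not values taken from columnflow.

```python
from abc import ABC, abstractmethod


class MLModelStandIn(ABC):
    """Hypothetical stand-in that only mirrors the eight abstract methods."""

    def __init__(self, analysis_inst, *, parameters=None):
        self.analysis_inst = analysis_inst
        self.parameters = dict(parameters or {})

    @abstractmethod
    def sandbox(self, task): ...

    @abstractmethod
    def datasets(self, config_inst): ...

    @abstractmethod
    def uses(self, config_inst): ...

    @abstractmethod
    def produces(self, config_inst): ...

    @abstractmethod
    def output(self, task): ...

    @abstractmethod
    def open_model(self, target): ...

    @abstractmethod
    def train(self, task, input, output): ...

    @abstractmethod
    def evaluate(self, task, events, models, fold_indices,
                 events_used_in_training=False): ...


class ExampleModel(MLModelStandIn):
    """All names and return values below are placeholders."""

    def sandbox(self, task):
        return "placeholder_sandbox_name"

    def datasets(self, config_inst):
        # the real method returns order.Dataset instances, not strings
        return {"placeholder_dataset"}

    def uses(self, config_inst):
        return {"Jet.pt", "Jet.eta"}

    def produces(self, config_inst):
        return {"example_model.score"}

    def output(self, task):
        return {"model": "placeholder_target"}

    def open_model(self, target):
        return target  # a real model would be loaded from the target here

    def train(self, task, input, output):
        return None  # training and saving to `output` would happen here

    def evaluate(self, task, events, models, fold_indices,
                 events_used_in_training=False):
        return events  # a real model would attach the produced columns


model = ExampleModel(analysis_inst=None, parameters={"layers": 5, "units": 128})
```

Because the base class marks all eight methods as abstract, omitting any of them in `ExampleModel` would make instantiation fail with a `TypeError`.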

classattribute single_config#

type: bool

The default flag that marks whether this model only accepts a single config object in case no value is passed in the constructor. Converted into an instance attribute upon instantiation.

classattribute folds#

type: int

The default number of folds for the k-fold cross-validation in case no value is passed in the constructor. Converted into an instance attribute upon instantiation.

classattribute store_name#

type: str, None

The default name for storing input data in case no value is passed in the constructor. When None, the name of the model class is used instead. Converted into an instance attribute upon instantiation.

analysis_inst#

type: order.Analysis

Reference to the order.Analysis object.

parameters#

type: OrderedDict

A dictionary mapping parameter names to arbitrary values, such as {"layers": 5, "units": 128}.

used_datasets#

type: dict read-only

Sets of order.Dataset instances that are used by the model training, mapped to their corresponding order.Config instances.

used_columns#

type: set read-only

Column names or Route objects that are used by this model, mapped to the order.Config instances they belong to.

produced_columns#

type: set read-only

Column names or Route objects that are produced by this model, mapped to the order.Config instances they belong to.

Methods:

`parameter_pairs`([only_significant])	Returns a list of all parameter name-value tuples.
`get_scheduler_messages`(task)	Checks if the task obtained messages from a central luigi scheduler, parses them expecting key-value pairs, and returns them in an ordered `DotDict`.
`setup`()	Hook that is called after the model has been set up and its `config_insts` were assigned.
`requires`(task)	Returns tasks that are required for the training to run and whose outputs are needed.
`training_configs`(requested_configs)	Given a sequence of names of requested `order.Config` objects, requested_configs, this method can alter and/or replace them to define a different (set of) config(s) for the preprocessing and training pipeline.
`training_calibrators`(config_inst, ...)	Given a sequence of requested_calibrators for a config_inst, this method can alter and/or replace them to define a different set of calibrators for the preprocessing and training pipeline.
`training_selector`(config_inst, ...)	Given a requested_selector for a config_inst, this method can change it to define a different selector for the preprocessing and training pipeline.
`training_producers`(config_inst, ...)	Given a sequence of requested_producers for a config_inst, this method can alter and/or replace them to define a different set of producers for the preprocessing and training pipeline.
`sandbox`(task)	Given a task, returns the name of a sandbox that is needed to perform model training and evaluation.
`datasets`(config_inst)	Returns a set of all required datasets for a certain config_inst.
`uses`(config_inst)	Returns a set of all required columns for a certain config_inst.
`produces`(config_inst)	Returns a set of all produced columns for a certain config_inst.
`output`(task)	Returns a structure of output targets.
`open_model`(target)	Implements the opening of a trained model from target (corresponding to the structure returned by `output()`).
`train`(task, input, output)	Performs the creation and training of a model, being passed a task and its input and output.
`evaluate`(task, events, models, fold_indices)	Performs the model evaluation for a task on a chunk of events and returns them.

Attributes:

accepts_scheduler_messages

Whether the training or evaluation loop expects and works with messages sent from a central luigi scheduler through the active worker to the underlying task.

parameter_pairs(only_significant=False)[source]#

Returns a list of all parameter name-value tuples. In this context, significant parameters are those that potentially lead to different results (e.g. network architecture parameters as opposed to some log level).

Return type:: list[tuple[str, Any]]
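As a rough illustration of the idea, the sketch below filters a parameter dictionary down to its significant entries and encodes the pairs into a string. How significance is decided (here, a plain set of names) and the `__`-joined encoding are assumptions made for this example only; they are not taken from the columnflow implementation.

```python
def parameter_pairs(parameters, significant_names, only_significant=False):
    """Return (name, value) tuples, optionally restricted to significant ones."""
    pairs = list(parameters.items())
    if only_significant:
        pairs = [(name, value) for name, value in pairs if name in significant_names]
    return pairs


def encode_pairs(pairs):
    """Hypothetical string encoding, e.g. for use in an output path."""
    return "__".join(f"{name}_{value}" for name, value in pairs)


parameters = {"layers": 5, "units": 128, "log_level": "debug"}
significant = {"layers", "units"}  # the log level does not change results

all_pairs = parameter_pairs(parameters, significant)
significant_pairs = parameter_pairs(parameters, significant, only_significant=True)
encoded = encode_pairs(significant_pairs)
```

Encoding only the significant pairs means that two trainings differing solely in, say, a log level would map to the same output location.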

property accepts_scheduler_messages: bool#

Whether the training or evaluation loop expects and works with messages sent from a central luigi scheduler through the active worker to the underlying task. See get_scheduler_messages() for more info.

get_scheduler_messages(task)[source]#

Checks if the task obtained messages from a central luigi scheduler, parses them expecting key-value pairs, and returns them in an ordered DotDict. All values are KeyValueMessage objects (with key, value and respond() members).

Scheduler messages are only sent while the task is actively running, so it most likely only makes sense to expect and react to messages during training and evaluation loops.

Return type:: DotDict[str, KeyValueMessage]
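A training loop that reacts to such messages might look roughly like the sketch below. `KeyValueMessageStandIn` is a hypothetical stand-in exposing only the three members named above; the assumption that `respond()` takes a text argument, the message keys `learning_rate` and `stop`, and the `get_messages` callable that replaces `get_scheduler_messages(task)` are all invented for this example.

```python
class KeyValueMessageStandIn:
    """Hypothetical stand-in with the key, value and respond() members."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.responses = []

    def respond(self, text):
        # a real message would send the response back to the scheduler
        self.responses.append(text)


def training_loop(n_epochs, get_messages, learning_rate=0.1):
    """Run up to n_epochs, polling for messages once per epoch."""
    completed = 0
    for _ in range(n_epochs):
        messages = get_messages()
        if "learning_rate" in messages:
            msg = messages["learning_rate"]
            learning_rate = float(msg.value)
            msg.respond(f"learning rate set to {learning_rate}")
        if "stop" in messages:
            messages["stop"].respond("stopping after current state")
            break
        completed += 1  # the actual epoch would be trained here
    return completed, learning_rate


# simulate one batch of messages per epoch
lr_msg = KeyValueMessageStandIn("learning_rate", "0.01")
stop_msg = KeyValueMessageStandIn("stop", "1")
batches = iter([{}, {"learning_rate": lr_msg}, {"stop": stop_msg}, {}])

completed, final_lr = training_loop(4, lambda: next(batches))
```

Polling once per epoch keeps the loop responsive without checking for messages on every batch.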

setup()[source]#

Hook that is called after the model has been set up and its config_insts were assigned.

Return type:: None

requires(task)[source]#

Returns tasks that are required for the training to run and whose outputs are needed.

Return type:: Any

training_configs(requested_configs)[source]#

Given a sequence of names of requested order.Config objects, requested_configs, this method can alter and/or replace them to define a different (set of) config(s) for the preprocessing and training pipeline. This can be helpful in cases where training and evaluation phase spaces, as well as the required input datasets and/or columns are intended to diverge.

Return type:: list[str]
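One possible override is sketched below as a standalone function (in a subclass it would be a method taking `self` first). It assumes, purely for illustration, that reduced configs carry a `_limited` suffix and that training should always run on the corresponding full config; neither the suffix nor the config names come from columnflow.

```python
SUFFIX = "_limited"  # hypothetical naming convention for reduced configs


def training_configs(requested_configs):
    """Map each requested config name to the config used for training."""
    full_names = [
        name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name
        for name in requested_configs
    ]
    # drop duplicates while keeping the original order
    return list(dict.fromkeys(full_names))


training_names = training_configs(["run2_2017_limited", "run2_2017", "run2_2018"])
```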

training_calibrators(config_inst, requested_calibrators)[source]#

Given a sequence of requested_calibrators for a config_inst, this method can alter and/or replace them to define a different set of calibrators for the preprocessing and training pipeline. This can be helpful in cases where training and evaluation phase spaces, as well as the required input columns are intended to diverge.

Return type:: list[str]

training_selector(config_inst, requested_selector)[source]#

Given a requested_selector for a config_inst, this method can change it to define a different selector for the preprocessing and training pipeline. This can be helpful in cases where training and evaluation phase spaces, as well as the required input columns are intended to diverge.

Return type:: str

training_producers(config_inst, requested_producers)[source]#

Given a sequence of requested_producers for a config_inst, this method can alter and/or replace them to define a different set of producers for the preprocessing and training pipeline. This can be helpful in cases where training and evaluation phase spaces, as well as the required input columns are intended to diverge.

Return type:: list[str]
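A sketch of such an override, written as a standalone function for brevity: it removes a producer that should not run before training and makes sure a producer supplying the training inputs is present. The producer names `ml_inputs` and `ml_scores` are hypothetical.

```python
REQUIRED_FOR_TRAINING = ["ml_inputs"]  # hypothetical producer names
EXCLUDED_FROM_TRAINING = {"ml_scores"}


def training_producers(config_inst, requested_producers):
    """Return the producers to run in the preprocessing and training pipeline."""
    kept = [p for p in requested_producers if p not in EXCLUDED_FROM_TRAINING]
    missing = [p for p in REQUIRED_FOR_TRAINING if p not in kept]
    return kept + missing


producers = training_producers(None, ["weights", "ml_scores"])
```

The same pattern applies to training_calibrators(), which also receives and returns a sequence of names.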

abstract sandbox(task)[source]#

Given a task, returns the name of a sandbox that is needed to perform model training and evaluation.

Return type:: str

abstract datasets(config_inst)[source]#

Returns a set of all required datasets for a certain config_inst. To be implemented in subclasses.

Return type:: set[Dataset]

abstract uses(config_inst)[source]#

Returns a set of all required columns for a certain config_inst. To be implemented in subclasses.

Return type:: set[Route | str]

abstract produces(config_inst)[source]#

Returns a set of all produced columns for a certain config_inst. To be implemented in subclasses.

Return type:: set[Route | str]

abstract output(task)[source]#

Returns a structure of output targets. To be implemented in subclasses.

Return type:: Any

abstract open_model(target)[source]#

Implements the opening of a trained model from target (corresponding to the structure returned by output()). To be implemented in subclasses.

Return type:: Any

abstract train(task, input, output)[source]#

Performs the creation and training of a model, being passed a task and its input and output. To be implemented in subclasses.

Return type:: None

abstract evaluate(task, events, models, fold_indices, events_used_in_training=False)[source]#

Performs the model evaluation for a task on a chunk of events and returns them. The length of the list of models corresponds to the number of folds generated by this model. The already evaluated fold_indices for this event chunk might be used, depending on events_used_in_training. To be implemented in subclasses.

Return type:: Array
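How fold_indices and events_used_in_training could interact is sketched below in plain Python. The sketch assumes a common k-fold convention, namely that the model at index i was trained without the events of fold i, so events that took part in training are scored only by the model that never saw them, while all other events receive the average over all models. This convention, the callable models, and the list-based inputs are assumptions for illustration; an actual implementation would operate on event arrays.

```python
def evaluate_scores(features, models, fold_indices, events_used_in_training=False):
    """Score each event with the appropriate fold model(s)."""
    scores = []
    for x, fold in zip(features, fold_indices):
        if events_used_in_training:
            # assumed convention: models[fold] did not see events of this fold
            scores.append(models[fold](x))
        else:
            # events are unseen by every model, so average all of them
            scores.append(sum(m(x) for m in models) / len(models))
    return scores


# three toy "models", one per fold, that shift the input by their fold index
models = [lambda x, shift=i: x + shift for i in range(3)]
features = [1.0, 2.0, 3.0]
fold_indices = [0, 1, 2]

seen = evaluate_scores(features, models, fold_indices, events_used_in_training=True)
unseen = evaluate_scores(features, models, fold_indices)
```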
