API
GAMA
GamaClassifier
- class gama.GamaClassifier(search_space=None, scoring='neg_log_loss', *args, **kwargs)[source]
Gama with adaptations for (multi-class) classification.
- Parameters:
search_space (Dict) – Specifies available components and their valid hyperparameter settings. For more information, see GAMA Search Space Configuration.
scoring (str, Metric or Tuple) – Specifies the metric or metrics to optimize. A string is converted to a Metric. A tuple must specify every metric with the same type (e.g. all str). See Metrics for built-in metrics.
regularize_length (bool (default=True)) – If True, add pipeline length as an optimization metric. Short pipelines should then be preferred over long ones.
max_pipeline_length (int, optional (default=None)) – If set, limit the maximum number of steps in any evaluated pipeline. Encoding and imputation are excluded.
random_state (int, optional (default=None)) – Seed for the random number generators used in the process. However, with n_jobs > 1, there will be randomization introduced by multi-processing. For reproducible results, set this and use n_jobs=1.
max_total_time (positive int (default=3600)) – Time in seconds that can be used for the fit call.
max_eval_time (positive int, optional (default=None)) – Time in seconds that can be used to evaluate any one single individual. If None, set to 0.1 * max_total_time.
n_jobs (int, optional (default=None)) – The number of parallel processes that may be created to speed up fit. Accepted values are positive integers, -1 or None. If -1 is specified, multiprocessing.cpu_count() processes are created. If None is specified, multiprocessing.cpu_count() / 2 processes are created.
max_memory_mb (int, optional (default=None)) – Sets the total amount of memory GAMA is allowed to use (in megabytes). If not set, GAMA will use as much as it needs. GAMA is not guaranteed to respect this limit at all times, but it should never violate it for too long.
verbosity (int (default=logging.WARNING)) – Sets the level of log messages to be automatically output to terminal.
search (BaseSearch, optional) – Search method to use to find good pipelines. Should be instantiated. Default depends on goal.
post_processing (BasePostProcessing, optional) – Post-processing method to create a model after the search phase. Should be an instantiated subclass of BasePostProcessing. Default depends on goal.
output_directory (str, optional (default=None)) – Directory to use to save GAMA output. This includes both intermediate results during search and logs. This directory must be empty or not exist. If set to None, generate a unique name (“gama_HEXCODE”).
store (str (default='logs')) – Determines which data is stored after each run:
'nothing': keep nothing from this run
'models': keep only cache with models and predictions
'logs': keep only the logs
'all': keep logs and cache with models and predictions
preset (str (default='simple')) – Determines the steps of the AutoML pipeline when they are not provided explicitly, based on the given goal. One of:
simple: Create a simple pipeline with good performance.
performance: Try to get the best performing model.
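The defaults described for n_jobs and max_eval_time can be sketched with the standard library alone. This is an illustration of the documented rules, not GAMA's own code: the helper names resolve_n_jobs and resolve_max_eval_time are hypothetical, and rounding cpu_count() / 2 down with a floor of one process is an assumption.

```python
import multiprocessing
from typing import Optional


def resolve_n_jobs(n_jobs: Optional[int]) -> int:
    """Map the documented n_jobs values to a concrete process count."""
    cpu_count = multiprocessing.cpu_count()
    if n_jobs is None:
        # Documented as cpu_count() / 2. Integer division and the floor of
        # one process are assumptions made for this sketch.
        return max(1, cpu_count // 2)
    if n_jobs == -1:
        return cpu_count
    if n_jobs > 0:
        return n_jobs
    raise ValueError("n_jobs must be a positive integer, -1 or None.")


def resolve_max_eval_time(max_eval_time: Optional[int], max_total_time: int = 3600) -> float:
    """Fall back to 0.1 * max_total_time when no per-evaluation limit is given."""
    if max_eval_time is None:
        return 0.1 * max_total_time
    return float(max_eval_time)
```

With the default max_total_time of 3600 seconds, leaving max_eval_time unset gives each individual roughly 360 seconds.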
GamaRegressor
Metrics
If you have a custom scoring function, you can define your own Metric.
Metric
MetricType
Search Methods
AsynchronousSuccessiveHalving
- class gama.search_methods.AsynchronousSuccessiveHalving(reduction_factor: int | None = None, minimum_resource: Tuple[int, float] | None = None, maximum_resource: Tuple[int, float] | None = None, minimum_early_stopping_rate: int | None = None)[source]
Asynchronous Halving Algorithm by Li et al.
paper: https://arxiv.org/abs/1810.05934
- Parameters:
reduction_factor (int, optional (default=3)) – Reduction factor of candidates between each rung.
minimum_resource (int or float, optional (default=0.125)) – Number of samples to use in the lowest rung. If integer, it specifies the number of rows. If float, it specifies the fraction of the dataset.
maximum_resource (int or float, optional (default=1.0)) – Number of samples to use in the top rung. If integer, it specifies the number of rows. If float, it specifies the fraction of the dataset.
minimum_early_stopping_rate (int (default=0)) – Number of lowest rungs to skip.
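How the four parameters interact can be illustrated by computing the resource assigned to each rung. The sketch below follows the geometric schedule from the cited paper (each rung uses reduction_factor times the resource of the rung below, capped at maximum_resource). The function name rung_resources is hypothetical, and GAMA's internal bookkeeping may differ.

```python
import math
from typing import Dict, Union

Resource = Union[int, float]


def rung_resources(
    reduction_factor: int = 3,
    minimum_resource: Resource = 0.125,
    maximum_resource: Resource = 1.0,
    minimum_early_stopping_rate: int = 0,
) -> Dict[int, Resource]:
    """Return the resource (rows or dataset fraction) assigned to each rung."""
    # Highest rung: the first one whose resource reaches maximum_resource.
    max_rung = math.ceil(
        math.log(maximum_resource / minimum_resource, reduction_factor)
    )
    # Skipping the lowest rungs means starting at a higher rung index.
    rungs = range(minimum_early_stopping_rate, max_rung + 1)
    return {
        rung: min(minimum_resource * reduction_factor**rung, maximum_resource)
        for rung in rungs
    }


print(rung_resources())  # {0: 0.125, 1: 0.375, 2: 1.0}
```

With the defaults, candidates are first evaluated on 12.5% of the data, then 37.5%, then the full dataset. Setting minimum_early_stopping_rate=1 drops the 12.5% rung.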
AsyncEA
- class gama.search_methods.AsyncEA(population_size: int | None = None, max_n_evaluations: int | None = None, restart_callback: Callable[[], bool] | None = None)[source]
Perform asynchronous evolutionary optimization.
- Parameters:
population_size (int, optional (default=50)) – Maximum number of individuals in the population at any time.
max_n_evaluations (int, optional (default=None)) – If specified, only a maximum of max_n_evaluations individuals are evaluated. If None, the algorithm will be run until interrupted by the user or a timeout.
restart_callback (Callable[[], bool], optional (default=None)) – Function which takes no arguments and returns True if the search should be restarted.
RandomSearch
Post-Processing
NoPostProcessing
BestFitPostProcessing
EnsemblePostProcessing
- class gama.postprocessing.EnsemblePostProcessing(time_fraction: float = 0.3, ensemble_size: int | None = 25, hillclimb_size: int | None = 10000, max_models: int | None = 200)[source]
Ensemble construction per Caruana et al.
- Parameters:
time_fraction (float (default=0.3)) – Fraction of total time reserved for Ensemble building.
ensemble_size (int, optional (default=25)) – Total number of models in the ensemble. A model that is chosen more than once gains weight in the ensemble, and each selection counts towards this maximum.
hillclimb_size (int, optional (default=10_000)) – Number of predictions that are used to determine the ensemble score during hillclimbing. If None, use all.
max_models (int, optional (default=200)) – Only consider the best max_models number of models. If None, use all. Consequently also sets the max number of unique models in the ensemble.
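The roles of ensemble_size and max_models can be seen in a minimal sketch of greedy forward selection with replacement, the core idea of Caruana-style ensemble construction. It is an illustration under stated assumptions, not GAMA's implementation: select_ensemble and mean_squared_error are hypothetical helpers, predictions are plain lists of floats, the loss is mean squared error, and the hillclimb_size subsampling is left out.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def select_ensemble(
    predictions: Dict[str, List[float]],
    y_true: Sequence[float],
    ensemble_size: int = 25,
    max_models: Optional[int] = 200,
) -> Counter:
    """Greedy selection with replacement; returns model name -> weight (count)."""
    # Only consider the best `max_models` models, ranked by individual loss.
    ranked = sorted(
        predictions, key=lambda name: mean_squared_error(y_true, predictions[name])
    )
    candidates = ranked if max_models is None else ranked[:max_models]

    counts: Counter = Counter()
    running_sum = [0.0] * len(y_true)
    for _ in range(ensemble_size):
        n_after = sum(counts.values()) + 1

        def loss_if_added(name: str) -> float:
            # Loss of the averaged prediction if `name` were added once more.
            averaged = [
                (s + p) / n_after for s, p in zip(running_sum, predictions[name])
            ]
            return mean_squared_error(y_true, averaged)

        best = min(candidates, key=loss_if_added)
        counts[best] += 1  # a repeated pick raises the model's weight
        running_sum = [s + p for s, p in zip(running_sum, predictions[best])]
    return counts


y = [2.0, 2.0]
preds = {"a": [1.0, 1.0], "b": [4.0, 4.0], "c": [10.0, 10.0]}
weights = select_ensemble(preds, y, ensemble_size=3)
print(weights)  # Counter({'a': 2, 'b': 1})
```

Model "a" is picked twice, so it carries two thirds of the weight, and both picks count towards the ensemble_size of three.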
Genetic Programming
Components
Defines the building blocks for Individuals.
Individuals represent machine learning pipelines in a back-end agnostic way.
An Individual can be converted to its back-end specific representation
(e.g. a scikit-learn Pipeline) through its pipeline property,
as long as a function has been provided to convert the individual to it.
Individuals are built with:
- Terminals. Definition of a specific value for a specific hyperparameter. Immutable.
- Primitives. Definition of a specific algorithm. Immutable. Defined by Terminal input, output type and operation.
- PrimitiveNodes. An instantiated Primitive with specific Terminals. Mutable for easy operations (e.g. mutation).
- Fitness. Stores information about the evaluation of the individual.
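The split between immutable definitions and mutable nodes can be illustrated with simplified stand-ins. The classes below (ToyTerminal, ToyPrimitive, ToyPrimitiveNode) are hypothetical and are not GAMA's classes. They only mirror the relationships described above, using frozen dataclasses for the immutable parts.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class ToyTerminal:
    """A specific value for a specific hyperparameter (immutable)."""
    name: str
    value: object


@dataclass(frozen=True)
class ToyPrimitive:
    """A specific algorithm (immutable)."""
    identifier: str
    output: str  # e.g. "data" for preprocessing, "prediction" for estimators


@dataclass
class ToyPrimitiveNode:
    """A primitive instantiated with terminals; mutable so it can be edited."""
    primitive: ToyPrimitive
    data_node: Union["ToyPrimitiveNode", str]  # previous step, or a raw-data marker
    terminals: List[ToyTerminal] = field(default_factory=list)

    def __str__(self) -> str:
        parts = [str(self.data_node)]
        parts += [f"{t.name}={t.value!r}" for t in self.terminals]
        return f"{self.primitive.identifier}({', '.join(parts)})"


scaler = ToyPrimitiveNode(ToyPrimitive("StandardScaler", "data"), "data")
tree = ToyPrimitiveNode(
    ToyPrimitive("DecisionTreeClassifier", "prediction"),
    scaler,
    [ToyTerminal("max_depth", 3)],
)
print(tree)  # DecisionTreeClassifier(StandardScaler(data), max_depth=3)
```

A mutation then swaps a terminal on the mutable node (for example tree.terminals[0] = ToyTerminal("max_depth", 5)), while assigning to a field of a ToyTerminal raises FrozenInstanceError.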
Individual
- class gama.genetic_programming.components.Individual(main_node: PrimitiveNode, to_pipeline: Callable | None = None)[source]
Collection of PrimitiveNodes which together specify a machine learning pipeline.
- Parameters:
main_node (PrimitiveNode) – The first node of the individual (the estimator node).
to_pipeline (Callable, optional (default=None)) – A function which can convert this individual into a machine learning pipeline. If not provided, the pipeline property will be unavailable.
Primitive
PrimitiveNode
- class gama.genetic_programming.components.PrimitiveNode(primitive: Primitive, data_node: PrimitiveNode | str, terminals: List[Terminal])[source]
An instantiation for a Primitive with specific Terminals.
- Parameters:
primitive (Primitive) – The Primitive type of this PrimitiveNode.
data_node (PrimitiveNode or str) – The PrimitiveNode that specifies all preprocessing before this PrimitiveNode.
terminals (List[Terminal]) – A list of terminals matching the primitive.
Terminal
Mutation
Contains mutation functions for genetic programming. Each mutation function takes an individual and modifies it in-place.
- gama.genetic_programming.mutation.mut_insert(individual: Individual, primitive_set: dict) -> None [source]
Mutate an Individual in-place by inserting a PrimitiveNode at a random location.
The new PrimitiveNode will not be inserted as root node.
- Parameters:
individual (Individual) – Individual to mutate in-place.
primitive_set (dict)
- gama.genetic_programming.mutation.mut_replace_primitive(individual: Individual, primitive_set: dict) -> None [source]
Mutates an Individual in-place by replacing one of its Primitives.
- Parameters:
individual (Individual) – Individual to mutate in-place.
primitive_set (dict)
- gama.genetic_programming.mutation.mut_replace_terminal(individual: Individual, primitive_set: dict) -> None [source]
Mutates an Individual in-place by replacing one of its Terminals.
- Parameters:
individual (Individual) – Individual to mutate in-place.
primitive_set (dict)
- gama.genetic_programming.mutation.mut_shrink(individual: Individual, _primitive_set: dict | None = None, shrink_by: int | None = None) -> None [source]
Mutates an Individual in-place by removing any number of primitive nodes.
Primitive nodes are removed from the preprocessing end.
- Parameters:
individual (Individual) – Individual to mutate in-place.
_primitive_set (dict, optional) – Not used. Present to create a matching function signature with other mutations.
shrink_by (int, optional (default=None)) – Number of primitives to remove. Must be at least one less than the number of primitives in individual. If None, a random number of primitives is removed.
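The shrink operation can be sketched on a toy pipeline held as a plain list in execution order, so the estimator comes last. This is not GAMA's code: the function shrink is hypothetical, and the sketch assumes that "the preprocessing end" means the steps furthest from the estimator and that the estimator itself is never removed.

```python
import random
from typing import List, Optional


def shrink(
    steps: List[str],
    shrink_by: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> None:
    """Remove `shrink_by` steps in-place from the start of `steps`."""
    if len(steps) < 2:
        raise ValueError("Need at least two steps: the estimator is never removed.")
    if shrink_by is None:
        # Pick a random number of steps that still leaves the estimator.
        rng = rng or random.Random()
        shrink_by = rng.randint(1, len(steps) - 1)
    if not 1 <= shrink_by < len(steps):
        raise ValueError(f"Can't shrink by {shrink_by}, only {len(steps)} steps.")
    del steps[:shrink_by]


pipeline = ["imputer", "scaler", "pca", "classifier"]
shrink(pipeline, shrink_by=2)
print(pipeline)  # ['pca', 'classifier']
```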
- gama.genetic_programming.mutation.random_valid_mutation_in_place(individual: Individual, primitive_set: dict, max_length: int | None = None) -> Callable [source]
Apply a random valid mutation in place.
The random mutation can be one of:
mut_replace_primitive
mut_replace_terminal, if the individual has at least one Terminal
mut_shrink, if the individual has at least two primitives
mut_insert, if it would not exceed max_length when specified.
- Parameters:
individual (Individual) – An individual to be mutated in-place.
primitive_set (dict) – A dictionary defining the set of primitives and terminals.
max_length (int, optional (default=None)) – If specified, impose a maximum length on the new individual.
- Returns:
The mutation function used.
- Return type:
Callable
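The preconditions listed above can be expressed as a small filter over mutation names. The sketch is an assumption-laden illustration rather than GAMA's implementation: valid_mutations is a hypothetical helper, it works on counts instead of an Individual, and it uses the names of the mutation functions documented in this section.

```python
import random
from typing import List, Optional


def valid_mutations(
    n_primitives: int, n_terminals: int, max_length: Optional[int] = None
) -> List[str]:
    """List the names of the mutations that may be applied to an individual."""
    options = ["mut_replace_primitive"]  # listed without a precondition
    if n_terminals >= 1:
        options.append("mut_replace_terminal")
    if n_primitives >= 2:
        options.append("mut_shrink")
    # Inserting adds one primitive, which must not exceed max_length.
    if max_length is None or n_primitives + 1 <= max_length:
        options.append("mut_insert")
    return options


# A pipeline already at its maximum length cannot grow further.
print(valid_mutations(n_primitives=3, n_terminals=2, max_length=3))
# ['mut_replace_primitive', 'mut_replace_terminal', 'mut_shrink']

chosen = random.choice(valid_mutations(n_primitives=2, n_terminals=1))
```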
Crossover
Functions which take two Individuals and produce at least one new Individual.
- gama.genetic_programming.crossover.crossover_primitives(ind1: Individual, ind2: Individual) -> Tuple[Individual, Individual] [source]
Crossover two individuals by exchanging any number of preprocessing steps.
- Parameters:
ind1 (Individual) – The individual to crossover with ind2.
ind2 (Individual) – The individual to crossover with ind1.
- gama.genetic_programming.crossover.crossover_terminals(ind1: Individual, ind2: Individual) -> Tuple[Individual, Individual] [source]
Crossover two individuals in-place by exchanging two Terminals.
Terminals must share output type but have different values.
- Parameters:
ind1 (Individual) – The individual to crossover with ind2.
ind2 (Individual) – The individual to crossover with ind1.
- gama.genetic_programming.crossover.random_crossover(ind1: Individual, ind2: Individual, max_length: int | None = None) -> Tuple[Individual, Individual] [source]
Random valid crossover between two individuals in-place, if it can be done.
- Parameters:
ind1 (Individual) – The individual to crossover with ind2.
ind2 (Individual) – The individual to crossover with ind1.
max_length (int, optional (default=None)) – The first individual in the returned tuple has at most max_length primitives. Requires both provided individuals to contain at most max_length primitives.
- Raises:
If there is no valid crossover function for the two individuals.
If max_length is set and either ind1 or ind2 contains more primitives than max_length.
Utilities
Generic
Collection of generic components.
Pareto Front
- class gama.utilities.generic.paretofront.ParetoFront(start_list: List[Any] | None = None, get_values_fn: Callable[[Any], Tuple[Any, ...]] | None = None)[source]
A list of tuples in which no one tuple is dominated by another.
- Parameters:
start_list (list, optional (default=None)) – List of items of which to calculate the Pareto front.
get_values_fn (Callable, optional (default=None)) – Function that takes an item and returns a tuple of values, such that each should be maximized. If left None, it is assumed that items are already such tuples.
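The dominance rule behind a Pareto front can be sketched in a few lines. The functions dominates and pareto_front below are hypothetical helpers that illustrate the concept, assuming every value is to be maximized. They are not the ParetoFront class itself.

```python
from typing import Any, Callable, List, Optional, Tuple


def dominates(a: Tuple[Any, ...], b: Tuple[Any, ...]) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(
    items: List[Any],
    get_values_fn: Optional[Callable[[Any], Tuple[Any, ...]]] = None,
) -> List[Any]:
    """Keep only the items that no other item dominates."""
    # Without get_values_fn, the items are assumed to be value tuples already.
    values = (lambda item: item) if get_values_fn is None else get_values_fn
    return [
        item
        for item in items
        if not any(dominates(values(other), values(item)) for other in items)
    ]


print(pareto_front([(1, 3), (3, 1), (2, 2), (1, 1)]))
# [(1, 3), (3, 1), (2, 2)]
```

When a quantity should be minimized, get_values_fn can negate it, for example lambda m: (m["acc"], -m["size"]).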
Stopwatch
Timekeeper
- class gama.utilities.generic.timekeeper.TimeKeeper(total_time: int | None = None)[source]
Simple object that helps keep track of time over multiple activities.
- Parameters:
total_time (int, optional (default=None)) – The total time available across activities. If set to None, the total_time_remaining property will be unavailable.
AsyncEvaluator
Warning
I’m sure there are better tools out there, but I have yet to find a minimal easy multi-processing tool. I tried using the built-in ProcessPoolExecutor, but it had shortcomings such as not being able to cancel jobs while they were running.
- class gama.utilities.generic.async_evaluator.AsyncEvaluator(n_workers: int = 1, memory_limit_mb: int | None = None, logfile: str | None = None, wait_time_before_forced_shutdown: int = 10)[source]
Manages subprocesses on which arbitrary functions can be evaluated.
The function and all its arguments must be picklable. Using the same AsyncEvaluator in two different contexts raises a RuntimeError.
- defaults: Dict, optional (default=None)
Default parameter values shared between all submit calls. This allows these defaults to be transferred only once per process, instead of twice per call (to and from the subprocess). Only supports keyword arguments.
- Parameters:
n_workers (int (default=1)) – Maximum number of subprocesses to run for parallel evaluations.
memory_limit_mb (int, optional (default=None)) – The maximum number of megabytes that this process and its subprocesses may use in total. If None, no limit is enforced. There is no guarantee the limit is not violated.
logfile (str, optional (default=None)) – If set, recorded resource usage will be written to this file.
wait_time_before_forced_shutdown (int (default=10)) – Number of seconds to wait between asking the worker processes to shut down and terminating them forcefully if they failed to do so.