Advanced Guide
For a basic introduction to GAMA, read the User Guide first. This section covers more advanced usage of GAMA, in particular:
- Ways to configure GAMA, such as:
A description of non-default AutoML steps and how to configure them.
Configuring the search space.
- Interfacing with GAMA:
An introduction to optimization traces and visualizing them.
GAMA’s Events.
- Developers notes:
A project overview.
How to add a search or post processing step.
AutoML Pipeline
An AutoML system performs several operations in its search for a model, and each of them may have several options and hyperparameters. An important decision is picking the search algorithm, which performs search over machine learning pipelines for your data. Another choice would be how to construct a model after search, e.g. by training the best pipeline or constructing an ensemble. Similarly to how data processing algorithms can form a machine learning pipeline, we will refer to a configuration of these AutoML components as an AutoML Pipeline. In GAMA we currently support flexibility in the AutoML pipeline in two stages: search and post-processing. See Adding Your Own Search or Postprocessing for more information on how to add your own.
Search Algorithms
The following search algorithms are available in GAMA:
Random Search
: Randomly pick machine learning pipelines from the search space and evaluate them.
Asynchronous Evolutionary Algorithm
: Evolve a population of machine learning pipelines, drawing new machine learning pipelines from the best of the population.
Asynchronous Successive Halving Algorithm
: A bandit-based approach where many machine learning pipelines iteratively get evaluated and eliminated on bigger fractions of the data.
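To make the difference between these strategies concrete, here is a minimal, synchronous sketch of random search and successive halving on a toy problem. It is not GAMA's implementation (which is asynchronous and works on real pipelines); the function names, the `fractions` schedule and the toy scoring function are all assumptions made for illustration.

```python
import random


def random_search(sample, evaluate, n_iterations):
    """Evaluate independently sampled candidates and keep the best one."""
    best, best_score = None, float("-inf")
    for _ in range(n_iterations):
        candidate = sample()
        score = evaluate(candidate, 1.0)  # always use all of the data
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def successive_halving(candidates, evaluate, fractions=(0.25, 0.5, 1.0)):
    """Score all candidates on a small data fraction, keep the better half,
    and repeat on bigger fractions until the schedule is exhausted."""
    survivors = list(candidates)
    for fraction in fractions:
        ranked = sorted(survivors, key=lambda c: evaluate(c, fraction), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors


# Toy problem: a "pipeline" is a number, and it is better the closer it is to 0.7.
rng = random.Random(0)


def toy_evaluate(candidate, fraction):
    noise = rng.gauss(0, 0.05) * (1 - fraction)  # less data gives a noisier estimate
    return -abs(candidate - 0.7) + noise


pool = [i / 10 for i in range(11)]
print(successive_halving(pool, toy_evaluate))
print(random_search(lambda: rng.choice(pool), toy_evaluate, n_iterations=20))
```

Successive halving spends little effort on weak candidates because they are dropped after a cheap evaluation, whereas random search pays the full evaluation cost for every candidate it samples.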
Post-processing
The following post-processing steps are available:
None
: No post-processing will be done. This means no final pipeline will be trained, and predict and predict_proba will be unavailable. This can be interesting if you are only interested in the search procedure.
FitBest
: Fit the single best machine learning pipeline found during search.
Ensemble
: Create an ensemble out of evaluated machine learning pipelines. This requires more time but can lead to better results.
Configuring the AutoML pipeline
By default ‘prepend pipeline’, ‘Asynchronous EA’ and ‘FitBest’ are chosen for pre-processing, search and post-processing, respectively. However, it is easy to change this, or to change the hyperparameters with which each component is used. For example, searching with ‘Asynchronous Successive Halving’ and creating an ensemble during post-processing:
from gama import GamaClassifier
from gama.search_methods import AsynchronousSuccessiveHalving
from gama.postprocessing import EnsemblePostProcessing
custom_pipeline_gama = GamaClassifier(search=AsynchronousSuccessiveHalving(), post_processing=EnsemblePostProcessing())
or using ‘Asynchronous EA’ but with custom hyperparameters:
from gama import GamaClassifier
from gama.search_methods import AsyncEA
custom_pipeline_gama = GamaClassifier(search=AsyncEA(population_size=30))
GAMA Search Space Configuration
By default GAMA will build pipelines out of scikit-learn algorithms, both for preprocessing and learning models. It is possible to modify this search space, changing the algorithms or hyperparameter ranges to consider.
The search space is determined by the search_space dictionary passed upon initialization. The defaults are found in classification.py and regression.py for the GamaClassifier and GamaRegressor, respectively.
A sample of algorithms that GAMA uses by default:
The search space configuration is defined in a python dictionary. For reference, a minimal example search space configuration can look like this:
from sklearn.naive_bayes import BernoulliNB
search_space = {
    'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    BernoulliNB: {
        'alpha': [],
        'fit_prior': [True, False]
    }
}
At the top level, allowed key types are:
string
: With a list as value. It specifies the name of a hyperparameter with its possible values. By defining a hyperparameter at the top level, you can reference it as a hyperparameter for any specific algorithm. To do so, identify it with the same name and set its possible values to an empty list (see alpha in the example). The benefit of doing so is that multiple algorithms can share a hyperparameter space that is defined only once. Additionally, in evolution this makes it possible to know which hyperparameter values can be crossed over between different algorithms.
class
: With a dictionary as value. The key specifies the algorithm; calling it should instantiate the algorithm. The dictionary specifies the hyperparameters by name and their possible values as a list. All hyperparameters specified should be accepted as arguments by the algorithm's initialization. A hyperparameter specified at the top level of the dictionary can share a name with a hyperparameter of the algorithm. To use the values provided by the shared hyperparameter, set the possible values to an empty list. If a list of values is provided instead, the shared hyperparameter values will not be used.
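The sharing rule described above can be sketched as follows. This is an illustration only, not how GAMA resolves its search space internally: `ToyNB` is a hypothetical stand-in for an estimator such as BernoulliNB, and `resolve_search_space` is a helper invented for this example.

```python
class ToyNB:
    """Hypothetical stand-in for an estimator such as BernoulliNB."""

    def __init__(self, alpha=1.0, fit_prior=True):
        self.alpha = alpha
        self.fit_prior = fit_prior


search_space = {
    "alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0],  # shared hyperparameter
    ToyNB: {
        "alpha": [],                 # empty list: use the shared values
        "fit_prior": [True, False],  # algorithm-specific values
    },
}


def resolve_search_space(space):
    """Replace each empty value list by the top-level list of the same name."""
    shared = {key: values for key, values in space.items() if isinstance(key, str)}
    resolved = {}
    for algorithm, params in space.items():
        if isinstance(algorithm, str):
            continue  # top-level hyperparameters are not algorithms
        resolved[algorithm] = {
            name: (values if values else shared[name])
            for name, values in params.items()
        }
    return resolved


resolved = resolve_search_space(search_space)
# Calling the key instantiates the algorithm with values from its resolved space.
model = ToyNB(alpha=resolved[ToyNB]["alpha"][0], fit_prior=resolved[ToyNB]["fit_prior"][0])
```

Because `alpha` is declared once at the top level, any other algorithm that lists `'alpha': []` would draw from the same values.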
Logging
GAMA makes use of the standard Python logging module. This means logs can be captured at different levels, and handled by one or more handlers (for example a StreamHandler or a FileHandler).
The most common logging use cases are to write a comprehensive log to file, as well as print important messages to stdout.
Writing log messages to stdout is directly supported by GAMA through the verbosity hyperparameter (which defaults to logging.WARNING).
By default GAMA will also save several different logs.
This can be turned off by the store hyperparameter.
The store hyperparameter allows you to store the logs, as well as models and predictions.
By default logs are kept (which includes evaluation data), but models and predictions are discarded.
The output_directory hyperparameter determines where this data is stored; by default a unique name is generated.
In the output directory you will find three files and a subdirectory:
‘evaluations.log’: a csv file (with ‘;’ as separator) in which each evaluation is stored.
‘gama.log’: A loosely structured file with general (human readable) information of the GAMA run.
‘resources.log’: A record of the memory usage for each of GAMA’s processes over time.
cache directory: contains evaluated models and predictions, only if store is ‘all’ or ‘models’.
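Since ‘evaluations.log’ is a ‘;’-separated csv file, it can be read with the standard csv module. The sketch below parses an in-memory sample; the column names and rows are invented for illustration, and the actual columns depend on your GAMA version.

```python
import csv
import io

# Hypothetical sample content; real column names may differ.
sample_log = io.StringIO(
    "id;pipeline;score;error\n"
    "1;GaussianNB(data);0.81;None\n"
    "2;DecisionTreeClassifier(data);0.86;None\n"
)


def read_evaluations(handle):
    """Parse a ';'-separated evaluation log into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter=";"))


rows = read_evaluations(sample_log)
best = max(rows, key=lambda row: float(row["score"]))
print(best["pipeline"])
```

For a real run, pass a file handle opened with `open(path, newline="")` instead of the in-memory sample.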
If you want other behavior, the logging module offers you great flexibility on making your own variations.
The following script writes any log messages of level logging.DEBUG or higher to both file and console:
import logging
import sys
from gama import GamaClassifier
gama_log = logging.getLogger('gama')
gama_log.setLevel(logging.DEBUG)
fh_log = logging.FileHandler('logfile.txt')
fh_log.setLevel(logging.DEBUG)
gama_log.addHandler(fh_log)
# The verbosity hyperparameter sets up a StreamHandler to `stdout`.
automl = GamaClassifier(max_total_time=180, verbosity=logging.DEBUG, store="nothing")
Running the above script will create the ‘logfile.txt’ file with all log messages that could also be seen in the console. An overview of the log levels:
DEBUG
: Messages for developers.
INFO
: General information about the optimization process.
WARNING
: Serious errors that do not prohibit GAMA from running to completion (but results could be suboptimal).
ERROR
: Errors which prevent GAMA from running to completion.
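Both the logger and each handler have their own level, and a message is only emitted if it passes both. The sketch below uses only the standard logging module to show a handler that keeps WARNING and above; the log messages are made up, and an in-memory buffer stands in for a file or the console.

```python
import io
import logging

log = logging.getLogger("gama")
log.setLevel(logging.DEBUG)  # let the logger pass everything on to its handlers

buffer = io.StringIO()  # stands in for a file or the console
handler = logging.StreamHandler(buffer)
handler.setLevel(logging.WARNING)  # this handler keeps WARNING and above only
log.addHandler(handler)

log.debug("a message for developers")
log.info("general progress information")
log.warning("something went wrong, but the run continues")

captured = buffer.getvalue()
log.removeHandler(handler)  # clean up so later examples are unaffected
```

Only the warning ends up in `captured`; lowering the handler level to logging.DEBUG would keep all three messages.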
Events
It is also possible to programmatically receive updates of the optimization process through the events:
from gama import GamaClassifier
def print_evaluation(evaluation):
    print(f'{evaluation.individual.pipeline_str()} was evaluated. Fitness is {evaluation.score}.')
automl = GamaClassifier()
automl.evaluation_completed(print_evaluation)
automl.fit(X, y)
The function passed to evaluation_completed should take a gama.genetic_programming.utilities.evaluation_library.Evaluation as its single argument.
Any exceptions raised but not handled in the callback will be ignored but logged at logging.WARNING level.
During the callback a stopit.utils.TimeoutException may be raised.
This signal normally indicates to GAMA to move on to the next step in the AutoML pipeline.
If caught by the callback, GAMA may exceed its allotted time.
For this reason, it is advised to keep callbacks short after catching a stopit.utils.TimeoutException.
If the stopit.utils.TimeoutException is not caught, GAMA will correctly terminate its step in the AutoML pipeline and continue as normal.
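The behavior described above can be pictured with a small dispatcher. This is a sketch under assumptions, not GAMA's actual event code: `notify` is a hypothetical helper, and the local `TimeoutException` class is a stand-in for stopit.utils.TimeoutException so that the example needs no extra packages.

```python
import logging

log = logging.getLogger("callback_sketch")


class TimeoutException(Exception):
    """Stand-in for stopit.utils.TimeoutException."""


def notify(callbacks, evaluation):
    """Call every callback; log ordinary exceptions, but let a timeout propagate."""
    for callback in callbacks:
        try:
            callback(evaluation)
        except TimeoutException:
            raise  # the caller uses this signal to move on to the next step
        except Exception:
            log.warning("Exception in callback %s", callback.__name__, exc_info=True)


seen = []


def record(evaluation):
    seen.append(evaluation)


def broken(evaluation):
    raise ValueError("bug in user code")


# `record` still runs even though `broken` raised before it.
notify([broken, record], evaluation="pipeline-1")
```

A callback that catches the timeout itself would hide the signal from the caller, which is why such a callback should return quickly afterwards.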
Developers Notes
Adding Your Own Search or Postprocessing
Note
This is not set in stone. As more AutoML pipeline steps are added by more people, we expect to identify parts of the interface to be improved. We can’t do this without your feedback! Feel free to get in touch, preferably in the form of a public discussion on a GitHub issue, and let us know what difficulties you encounter, or what works well!
This section contains information about implementing your own Search or Postprocessing procedures.
To keep interfaces uniform across the different search or postprocessing implementations, each should derive from its respective base class (BaseSearch and BasePostProcessing).
They each have their own processing method (search for BaseSearch and post_process for BasePostProcessing) which should be implemented.
We will show example implementations further down.
Your Search or Postprocessing algorithm may feature hyperparameters, in which case care should be taken to provide good default values.
For some algorithms, hyperparameter default values are best specified based on characteristics of the dataset.
For instance, with big datasets it might be useful to perform (some of) the workload on a subset of the data.
We refer to these data-dependent non-static defaults as ‘dynamic defaults’.
Both BaseSearch and BasePostProcessing feature a dynamic_defaults method which is called before search and ..., respectively.
This allows you to overwrite default hyperparameter values based on the dataset properties.
The hyperparameter values with which your search or postprocessing will be called are determined in the following order:
User specified values are used if specified (e.g. EnsemblePostProcessing(n=25)).
Otherwise the values determined by dynamic_defaults are used.
If neither are specified, the static default values are used.
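The resolution order can be sketched with a small class. Everything here is hypothetical and only mirrors the rule described above: `ToySearch`, its `population_size` values and the `n_rows` argument are invented for illustration and do not match the real `dynamic_defaults` signature.

```python
class ToySearch:
    """Hypothetical search method with a single hyperparameter."""

    def __init__(self, population_size=None):
        # None means "not set by the user".
        self._user = {"population_size": population_size}
        self._dynamic = {"population_size": None}
        self._static = {"population_size": 50}

    def dynamic_defaults(self, n_rows):
        # Illustrative rule only: use a smaller population for big datasets.
        self._dynamic["population_size"] = 20 if n_rows > 100_000 else None

    @property
    def hyperparameters(self):
        """Resolve each value as: user value > dynamic default > static default."""
        resolved = {}
        for name, static in self._static.items():
            for value in (self._user[name], self._dynamic[name], static):
                if value is not None:
                    resolved[name] = value
                    break
        return resolved


user_set = ToySearch(population_size=30)
user_set.dynamic_defaults(n_rows=1_000_000)

dynamic = ToySearch()
dynamic.dynamic_defaults(n_rows=1_000_000)

static = ToySearch()
```

Here `user_set` keeps the user's value, `dynamic` picks up the data-dependent default, and `static` falls back to the static default because `dynamic_defaults` was never called.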
Search
To implement your own search procedure, create a class which derives from the base class:
- class gama.search_methods.base_search.BaseSearch
All search methods should be derived from this class. This class should not be directly used to configure GAMA.
- dynamic_defaults(x: DataFrame, y: DataFrame | Series, time_limit: float) -> None
Set hyperparameter defaults based on the dataset and time-constraints.
Should be called before search.
- Parameters:
x (pandas.DataFrame) – Features of the data.
y (pandas.DataFrame or pandas.Series) – Labels of the data.
time_limit (float) – Time in seconds available for search and selecting dynamic defaults. There is no need to adhere to this explicitly, a stopit.utils.TimeoutException will be raised. The time-limit might be an important factor in setting hyperparameter values.
- property hyperparameters: Dict[str, Any]
Hyperparameter (name, value) pairs as set/determined dynamically/default.
Values may have been set directly, through dynamic defaults or static defaults. This is also the order in which the value of a hyperparameter is checked, i.e. a user set value will overwrite any other value, and a dynamic default will overwrite a static one. Dynamic default values are only considered if dynamic_defaults has been called.
- search(operations: OperatorSet, start_candidates: List[Individual])
Execute search as configured.
Sets output field of this class to the best Individuals.
- Parameters:
operations (OperatorSet) – Has methods to create new individuals, evaluate individuals and more.
start_candidates (List[Individual]) – A list of individuals to be considered before all others.
You can use existing search implementations as a reference.
__init__
To allow us to identify which hyperparameters are set by the user, and which are defaults, the default values for each hyperparameter in the __init__ method should be None.
In search methods, each evaluation of a machine learning pipeline is logged automatically. Default data recorded includes:
a string representation of the pipeline
the scores of the pipeline according to the specified metrics
any errors that occurred during evaluation
It is possible to add additional fields to be recorded for each pipeline, as shown here.
The extra_fields of the EvaluationLogger expects a dictionary, which maps the name of a field to the method which extracts the information that should be recorded.
In this case we are interested to know the parent of each evaluated pipeline, so we might later inspect a pipeline’s “lineage”.
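The idea of mapping field names to extractor functions can be sketched without GAMA. The evaluation objects, the `parent` attribute and the `to_record` helper below are hypothetical; only the shape of the `extra_fields` dictionary (field name to extracting function) follows the description above.

```python
from types import SimpleNamespace

# Hypothetical evaluation records; real GAMA evaluations carry more information.
evaluations = [
    SimpleNamespace(pipeline="GaussianNB(data)", score=0.81, parent=None),
    SimpleNamespace(
        pipeline="GaussianNB(Scaler(data))", score=0.84, parent="GaussianNB(data)"
    ),
]

# Field name -> function that extracts the value to record.
extra_fields = {"parent": lambda evaluation: evaluation.parent}


def to_record(evaluation, extra_fields):
    """Build one log record: default fields plus every extra field."""
    record = {"pipeline": evaluation.pipeline, "score": evaluation.score}
    for name, extract in extra_fields.items():
        record[name] = extract(evaluation)
    return record


records = [to_record(evaluation, extra_fields) for evaluation in evaluations]
```

Following the `parent` field from record to record reconstructs a pipeline's lineage.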
dynamic_defaults
Hyperparameters such as population size might make for good candidates for dynamic defaults. However, it is not obvious what the relationship should be. For this reason, we choose not to work with dynamic defaults in this search strategy. Perhaps in the future, when we have adequate data to model the relationship, we can determine useful default values.
You can find an example usage of dynamic defaults in the Asynchronous Successive Halving Algorithm search.
search
This method should execute the search for a good machine learning pipeline.
The search should always take into account the start_candidates in some form.
This allows the search start point to be set by the user or a warm start step.
In this evolutionary optimization, they form the initial population.
The search algorithm should update the output field of the Search object, and behave nicely when interrupted with a TimeoutException.
This allows GAMA to control when to shut down search (and continue with post processing).
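These three requirements (use the start candidates, keep `output` up to date, stop cleanly on a timeout) can be sketched in a toy search loop. This is not a GAMA search method: the class, its simplified `search` signature and the local `TimeoutException` (a stand-in for stopit.utils.TimeoutException) are assumptions made so the example runs on its own.

```python
import random


class TimeoutException(Exception):
    """Stand-in for stopit.utils.TimeoutException."""


class ToyRandomSearch:
    """Hypothetical search: candidates are numbers, a higher score is better."""

    def __init__(self):
        self.output = []

    def search(self, evaluate, sample, start_candidates, max_evaluations=100):
        best_score = None
        try:
            # Start candidates are considered before anything else.
            queue = list(start_candidates)
            for _ in range(max_evaluations):
                candidate = queue.pop(0) if queue else sample()
                score = evaluate(candidate)  # may raise TimeoutException
                if best_score is None or score > best_score:
                    best_score = score
                    self.output = [candidate]  # keep `output` current at all times
        except TimeoutException:
            pass  # stop quietly; the best result so far is already in `output`


calls = {"n": 0}


def evaluate(candidate):
    calls["n"] += 1
    if calls["n"] > 5:
        raise TimeoutException()  # simulate the time budget running out
    return -abs(candidate - 3)


rng = random.Random(1)
toy_search = ToyRandomSearch()
toy_search.search(evaluate, lambda: rng.randint(0, 10), start_candidates=[3])
```

Because `output` is updated after every improvement rather than at the end, an interruption at any point still leaves a usable result behind.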
PostProcessing
PostProcessing follows a similar pattern, where the class should allow initialization with its hyperparameters, an implementation of dynamic defaults (optional), and a post_process function.
- class gama.postprocessing.BasePostProcessing(time_fraction: float)
All post-processing methods should be derived from this class. This class should not be directly used to configure GAMA.
- Parameters:
time_fraction (float) – Fraction of total time that is to be reserved for this post-processing step.
- dynamic_defaults(gama: Gama) -> None
Configure the post-processing technique based on GAMA properties.
- property hyperparameters: Dict[str, Any]
Hyperparameter (name, value) pairs.
Value determined by user > dynamic default > static default. Dynamic default values are only considered if dynamic_defaults has been called.
- post_process(x: DataFrame, y: DataFrame | Series, timeout: float, selection: List[Individual]) -> object
- Parameters:
x (pd.DataFrame) – all training features
y (Union[pd.DataFrame, pd.Series]) – all training labels
timeout (float) – allowed time in seconds for post-processing
selection (List[Individual]) – individuals selected by the search space, ordered best first
- Returns:
A model with predict and optionally predict_proba.
- Return type:
Any
- to_code(preprocessing: Sequence[Tuple[str, TransformerMixin]] | None = None) -> str
Generate Python code to reconstruct a pipeline that constructs the model.
- Parameters:
preprocessing (Sequence[TransformerMixin], optional (default=None)) – Preprocessing steps that need be executed before the model.
- Returns:
A string of Python code that sets a ‘pipeline’ variable to the pipeline that defines the final pipeline generated by post-processing.
- Return type:
str
Unlike the search methods, which are not required to have any hyperparameters, post processing is required to have a default value for time_fraction.
time_fraction is the fraction of the total time (as set on initialization through max_total_time) that should be reserved for the post processing method.
For instance, when a post-processing object’s time_fraction is 0.3 and GAMA is initiated with max_total_time=3600, then 3600*0.3=1080 seconds are reserved for the post-processing phase.
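The arithmetic above can be written as a small helper. `split_time_budget` is a hypothetical function for illustration, and it assumes that all time not reserved for post-processing is left for the earlier steps of the AutoML pipeline.

```python
def split_time_budget(max_total_time, time_fraction):
    """Return (remaining_seconds, post_processing_seconds) for a total budget."""
    if not 0 <= time_fraction <= 1:
        raise ValueError("time_fraction must be between 0 and 1")
    post_processing = max_total_time * time_fraction
    return max_total_time - post_processing, post_processing


# The example from the text: 30% of one hour is reserved for post-processing.
remaining_seconds, post_seconds = split_time_budget(3600, 0.3)
```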
Note
While hard, it is important to provide an accurate estimate for time_fraction. If you reserve too much time, the search procedure will have to be cut off unnecessarily early. If too little time is reserved, GAMA will interrupt the post-processing step and return control to the user.
It is generally hard to know how much time to reserve, and it is likely dependent on the dataset and the number of pipelines evaluated in search. We would like to implement ways in which post-processing methods have access to these statistics and can update their time estimate, so that less time is wasted on post-processing phases that are too long or too short.