User Guide

GAMA is an AutoML tool which aims to automatically find the right machine learning algorithms to create the best possible model for your data. This page introduces the basic components and concepts of GAMA.

GAMA performs a search over machine learning pipelines. An example of a machine learning pipeline would be to first perform data normalization and then use a nearest neighbor classifier to make a prediction on the normalized data. More formally, a machine learning pipeline is a sequence of one or more components. A component is an algorithm which performs either data transformation or a prediction. This means that components can be preprocessing algorithms such as PCA or standard scaling, or a predictor such as a decision tree or support vector machine. A machine learning pipeline then consists of zero or more preprocessing components followed by a predictor component.
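As a toy illustration of these concepts (this sketch is illustrative only and not GAMA code), the normalization-plus-nearest-neighbor pipeline described above could be written as:

```python
# Toy sketch of the pipeline described above: a preprocessing component
# (min-max normalization) followed by a predictor component (1-nearest neighbor).

def normalize(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def predict_1nn(train_x, train_y, x):
    """Predict the label of the training point closest to x."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

# "Pipeline": first normalize the feature, then predict on the normalized data.
train_x = normalize([10.0, 20.0, 30.0, 40.0])  # -> [0.0, 1/3, 2/3, 1.0]
train_y = ["a", "a", "b", "b"]
print(predict_1nn(train_x, train_y, 0.9))  # prints: b
```

In practice, GAMA searches over pipelines built from real scikit-learn components rather than toy functions like these, and also tunes their hyperparameters.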

Given some data, GAMA will start a search to try and find the best possible machine learning pipelines for it. After the search, the best model found can be used to make predictions. Alternatively, GAMA can combine several models into an ensemble to take into account more than one model when making predictions. For ease of use, GAMA provides a fit, predict and predict_proba function akin to scikit-learn.

Installation

For regular usage, you can install GAMA with pip:

pip install gama

GAMA features optional dependencies for development and for building its documentation. You can install them with:

pip install gama[OPTIONAL]

where OPTIONAL is one or more (comma separated):

  • dev: sets up all required dependencies for development of GAMA.

  • doc: sets up all required dependencies for building documentation of GAMA.

To see exactly what dependencies will be installed, see setup.py. If you plan on developing GAMA, cloning the repository and installing locally with test and doc dependencies is advised:

git clone https://github.com/PGijsbers/gama.git
cd gama
pip install -e ".[doc,test]"

This installation will refer to your local GAMA files. Changes to the code directly affect the installed GAMA package without requiring a reinstall.


Examples

Classification

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score
from gama import GamaClassifier

if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
    )

    automl = GamaClassifier(max_total_time=180, store="nothing", n_jobs=1)
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train)

    label_predictions = automl.predict(X_test)
    probability_predictions = automl.predict_proba(X_test)

    print("accuracy:", accuracy_score(y_test, label_predictions))
    print("log loss:", log_loss(y_test, probability_predictions))

This should take about 3 minutes to run and produce output similar to that below (exact performance might differ):

accuracy: 0.951048951048951
log loss: 0.1111237013184977

By default, GamaClassifier will optimize towards log loss.

Regression

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from gama import GamaRegressor

if __name__ == "__main__":
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    automl = GamaRegressor(max_total_time=180, store="nothing", n_jobs=1)
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train)

    predictions = automl.predict(X_test)

    print("MSE:", mean_squared_error(y_test, predictions))

This should take about 3 minutes to run and produce output similar to that below (exact performance might differ):

MSE: 19.238475470025886

By default, GamaRegressor will optimize towards mean squared error.

Using Files Directly

You can load data directly from csv and ARFF files. For ARFF files, GAMA can utilize the extra information they provide, such as which features are categorical. For csv files, GAMA will infer the column types, but this might lead to mistakes. In the example below, replace the file paths with paths to the files you want to use, e.g. breast_cancer_train.arff and breast_cancer_test.arff. The last column is assumed to be the target unless target_column is specified. Make sure to adjust the file path if the script is not executed from the examples directory.

from gama import GamaClassifier

if __name__ == "__main__":
    file_path = "../tests/data/breast_cancer_{}.arff"

    automl = GamaClassifier(max_total_time=180, store="nothing", n_jobs=1)
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit_from_file(file_path.format("train"))

    label_predictions = automl.predict_from_file(file_path.format("test"))
    probability_predictions = automl.predict_proba_from_file(file_path.format("test"))

The GamaRegressor also has csv and ARFF support.

The advantage of using an ARFF file over something like a numpy array or a csv file is that attribute types are specified. When supplying only numpy arrays (e.g. through fit(X, y)), GAMA cannot know whether a particular feature is nominal or numeric. This means that GAMA might apply a wrong feature transformation to the data (e.g. one-hot encoding a numeric feature or scaling a categorical feature). Note that this is not unique to GAMA, but applies to any framework which accepts numeric input without meta-data.
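To make the distinction concrete, here is a small standard-library sketch (not GAMA code): the same column of integer codes can be a nominal feature or a numeric one, and only the meta-data tells you which encoding is appropriate.

```python
# A column of integer codes is ambiguous: [0, 1, 2] could encode the nominal
# feature {"red", "green", "blue"} or a genuine numeric count. Scaling a nominal
# column imposes a meaningless ordering; one-hot encoding avoids this.

def one_hot(column):
    """Encode a nominal column as one-hot vectors (categories in sorted order)."""
    categories = sorted(set(column))
    return [[1 if value == c else 0 for c in categories] for value in column]

codes = [0, 1, 2, 1]   # nominal colors, or a numeric count? Only meta-data knows.
print(one_hot(codes))  # prints: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

An ARFF file resolves this ambiguity by declaring each attribute's type, which is exactly the information GAMA uses.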

Note

Unfortunately, the date and string formats allowed by ARFF files are not (fully) supported in GAMA yet. For the latest news, see issue#2.


Simple Features

This section describes a few easy-to-use features that might be interesting for a wide audience. For more advanced features, see the Advanced Guide.

Command Line Interface

GAMA may also be called from a terminal, but the command-line tool currently supports only part of the Python functionality. In particular, it can only load data from .csv or .arff files, and AutoML pipeline configuration is not available. The tool will produce a single pickled scikit-learn model (by default named ‘gama_model.pkl’); code export is also available. Please see gama -h for all options.

Code Export

It is possible to have GAMA export the final model definition as a Python file, see gama.Gama.export_script().


Important Hyperparameters

There are a lot of hyperparameters exposed in GAMA. In this section, you will find some hyperparameters you might want to set even if you otherwise use defaults. For more complete documentation on all hyperparameters, see API documentation.

Optimization

Perhaps the most important hyperparameters are the ones that specify what to optimize for. These are:

scoring: string (default=’neg_log_loss’ for classification and ‘mean_squared_error’ for regression)

Sets the metric to optimize for. Make sure to optimize towards the metric that reflects well what is important to you. Any string that can construct a scikit-learn scorer is accepted, see this page for more information. Valid options include roc_auc, accuracy and neg_log_loss for classification, and neg_mean_squared_error and r2 for regression.

regularize_length: bool (default=True)

If True, in addition to optimizing towards the metric set in scoring, also guide the search towards shorter pipelines. This setting currently has no effect for non-default search methods.

Example:

GamaClassifier(scoring='roc_auc', regularize_length=False)

Resources

n_jobs: int, optional (default=None)

Determines how many processes can be run in parallel during fit. This has the most influence on how many machine learning pipelines can be evaluated. If set to -1, all cores are used. If set to None (default), half the cores are used. Using a set (smaller) number of cores will decrease the number of pipelines evaluated, but is needed if you do not want GAMA to use all resources.
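As a sketch of how these values resolve to a process count (the exact rounding GAMA applies for "half the cores" is an assumption here, not taken from GAMA's source):

```python
import os

def resolve_n_jobs(n_jobs=None):
    """Sketch of resolving an n_jobs setting to a process count.
    Mirrors the documented behavior: -1 -> all cores, None -> half the cores
    (integer halving is an assumption), any other value is used as-is."""
    cores = os.cpu_count() or 1
    if n_jobs == -1:
        return cores
    if n_jobs is None:
        return max(1, cores // 2)
    return n_jobs

print(resolve_n_jobs(4))  # prints: 4
```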

max_total_time: int (default=3600)

The maximum time in seconds that GAMA should aim to use to construct a model from the data. By default GAMA uses one hour. For large datasets, more time may be needed to get useful results.

max_eval_time: int (default=300)

The maximum time in seconds that GAMA is allowed to use to evaluate a single machine learning pipeline. The default is set to five minutes. For large datasets, more time may be needed to get useful results.

Example:

GamaClassifier(n_jobs=2, max_total_time=7200, max_eval_time=600)