EstimatorInterface

class pybrush.EstimatorInterface.EstimatorInterface(mode='classification', pop_size=100, max_gens=100, max_time=-1, max_stall=0, verbosity=0, max_depth=3, max_size=20, num_islands=1, n_jobs=1, mig_prob=0.05, cx_prob=0.14285714285714285, mutation_probs={'delete': 0.16666666666666666, 'insert': 0.16666666666666666, 'point': 0.16666666666666666, 'subtree': 0.16666666666666666, 'toggle_weight_off': 0.16666666666666666, 'toggle_weight_on': 0.16666666666666666}, functions: list[str] | dict[str, float] = {}, initialization='uniform', algorithm='nsga2', objectives=['error', 'size'], random_state=None, logfile='', save_population='', load_population='', shuffle_split=False, weights_init=True, val_from_arch=True, use_arch=False, validation_size: float = 0.0, batch_size: float = 1.0)

Interface class for all estimators in pybrush.

Parameters:
mode: str, default ‘classification’

The mode of the estimator. Used by subclasses.

pop_size: int, default 100

Population size.

max_gens: int, default 100

Maximum number of generations (iterations) of the algorithm.

max_time: int, optional (default: -1)

Maximum run time, in seconds, used as a termination criterion. If -1, not used.

max_stall: int, optional (default: 0)

How many generations to continue after the validation loss has stalled. If 0, not used.
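The three stopping parameters above (max_gens, max_time, max_stall) can be read together as one check per generation. The following is a minimal sketch in plain Python, not the engine's actual logic: the helper `should_stop`, its arguments, and the way the criteria are combined are assumptions made for illustration.

```python
import time

def should_stop(generation, start_time, stalled_gens,
                max_gens=100, max_time=-1, max_stall=0, now=None):
    """Hypothetical helper: return True when any stopping criterion is met."""
    now = time.monotonic() if now is None else now
    # Generation budget exhausted.
    if generation >= max_gens:
        return True
    # Time limit, skipped when max_time is -1.
    if max_time != -1 and (now - start_time) >= max_time:
        return True
    # Stall limit, skipped when max_stall is 0.
    if max_stall != 0 and stalled_gens >= max_stall:
        return True
    return False

# Example: 5 generations in, 61 seconds elapsed, 60 second limit.
print(should_stop(5, start_time=0.0, stalled_gens=0, max_time=60, now=61.0))
```

Note how the sentinel values (-1 and 0) disable a criterion instead of acting as a limit.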

verbosity: int, default 0

Controls level of printouts.

max_depth: int, default 3

Maximum depth of GP trees in the GP program. Use 0 for no limit.

max_size: int, default 20

Maximum number of nodes in a tree. Use 0 for no limit.

num_islands: int, default 1

Number of independent islands to use in the evolutionary framework. This also corresponds to the number of parallel threads in the C++ engine.

mig_prob: float, default 0.05

Probability of a migration occurring between two random islands at the end of a generation. Must be between 0 and 1.

cx_prob: float, default 1/7

Probability of applying crossover when generating offspring. Must be between 0 and 1. Each offspring is produced by either crossover or a single mutation, never both. With n mutation types, the default aims for a uniform probability across crossover and every mutation: setting cx_prob=1/(n+1) and giving each mutation a probability of 1/n in mutation_probs makes every operator equally likely, since each mutation is then applied with probability (1-cx_prob) * (1/n) = 1/(n+1).
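The arithmetic behind the default can be checked directly. The sketch below is plain Python and does not call pybrush; the mutation names come from the default mutation_probs in the class signature, and the variable names are illustrative.

```python
from fractions import Fraction

# Mutation names taken from the default mutation_probs in the signature.
mutations = ["point", "insert", "delete", "subtree",
             "toggle_weight_on", "toggle_weight_off"]
n = len(mutations)

cx_prob = Fraction(1, n + 1)                       # probability of crossover
mutation_probs = {m: Fraction(1, n) for m in mutations}

# Overall probability of each variation operator for one offspring:
# crossover with cx_prob, otherwise one mutation drawn from mutation_probs.
overall = {"crossover": cx_prob}
for name, p in mutation_probs.items():
    overall[name] = (1 - cx_prob) * p

for name, p in overall.items():
    print(f"{name}: {p}")
```

With six mutations, every operator (crossover included) ends up with the same overall probability.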

mutation_probs: dict, default {“point”:1/6, “insert”:1/6, “delete”:1/6, “subtree”:1/6, “toggle_weight_on”:1/6, “toggle_weight_off”:1/6}

A dictionary with keys naming the types of mutation and floating point values specifying the fraction of total mutations to do with that method. The probability of applying a mutation is (1-cx_prob); when a mutation is applied, the specific operator is sampled according to the probabilities in mutation_probs. The probabilities should add up to 1.0.
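The two-stage choice described above (crossover versus mutation, then which mutation) can be sketched as follows. This is an illustration in plain Python, not the engine's implementation; `pick_variation` is a hypothetical helper and the non-uniform probabilities are made up for the example.

```python
import random

def pick_variation(rng, cx_prob, mutation_probs):
    """Hypothetical helper: return "crossover" or the name of one mutation."""
    total = sum(mutation_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"mutation_probs must sum to 1.0, got {total}")
    # Stage 1: crossover with probability cx_prob.
    if rng.random() < cx_prob:
        return "crossover"
    # Stage 2: otherwise sample one mutation according to its weight.
    names = list(mutation_probs)
    weights = [mutation_probs[k] for k in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
probs = {"point": 0.5, "insert": 0.1, "delete": 0.1, "subtree": 0.1,
         "toggle_weight_on": 0.1, "toggle_weight_off": 0.1}
counts = {}
for _ in range(10_000):
    op = pick_variation(rng, cx_prob=0.2, mutation_probs=probs)
    counts[op] = counts.get(op, 0) + 1
print(counts)
```

In this example a "point" mutation has an overall probability of (1 - 0.2) * 0.5 = 0.4, so it should account for roughly 40% of the sampled operators.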

functions: dict[str,float] or list[str], default {}

A dictionary with keys naming the function set and values giving the probability of sampling them, or a list of functions which will be weighted uniformly. If empty, all available functions are included in the search space.
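The two accepted forms can be thought of as the same mapping from function name to sampling weight. The sketch below is a hypothetical helper, not part of pybrush: the function names are placeholders, and the choice of 1.0 as the uniform weight is an assumption.

```python
def normalize_functions(functions):
    """Hypothetical helper: turn a list or dict into a {name: weight} dict.

    An empty result stands for "use every available function".
    """
    if isinstance(functions, dict):
        return dict(functions)
    # A list carries no weights, so every function gets the same one.
    # The value 1.0 is an assumption made for this sketch.
    return {name: 1.0 for name in functions}

print(normalize_functions(["Add", "Mul", "Sin"]))
print(normalize_functions({"Add": 0.7, "Sin": 0.3}))
```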

initialization: {“uniform”, “max_size”}, default “uniform”

Distribution of sizes in the initial population. If max_size, then every expression is created with max_size nodes. If uniform, sizes are uniformly distributed between 1 and max_size.
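The difference between the two options is easiest to see from the target sizes they produce. The following plain-Python sketch only draws sizes and builds no expressions; `initial_sizes` is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
import random

def initial_sizes(pop_size, max_size, initialization="uniform", rng=None):
    """Hypothetical helper: target node count for each initial individual."""
    rng = rng or random.Random()
    if initialization == "max_size":
        # Every expression is created with max_size nodes.
        return [max_size] * pop_size
    if initialization == "uniform":
        # Sizes drawn uniformly from 1..max_size (both ends included).
        return [rng.randint(1, max_size) for _ in range(pop_size)]
    raise ValueError(f"unknown initialization: {initialization!r}")

print(initial_sizes(5, 20, "max_size"))
print(initial_sizes(5, 20, "uniform", random.Random(0)))
```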

objectives: list[str], default [“error”, “size”]

List with one or more objectives to use. Options are “error”, “size”, and “complexity”. If “error” is used, it is the mean squared error for regression and the accuracy for classification.

algorithm: {“nsga2island”, “nsga2”, “gaisland”, “ga”}, default “nsga2”

Which Evolutionary Algorithm framework to use to evolve the population. This is used only in DeapEstimators.

weights_init: bool, default True

Whether the search space should initialize the sampling weights of terminal nodes based on the correlation with the output y. If False, then all terminal nodes will have the same probability of 1.0.

validation_size: float, default 0.0

Fraction of samples, between 0 and 1, to use as a hold-out partition. These samples are used to calculate statistics during evolution, but not to train the models. The best_estimator_ is selected using this partition. If zero, the data used for training is also used for validation.

val_from_arch: boolean, optional (default: True)

Validates the final model using the archive rather than the whole population.

use_arch: boolean, optional (default: False)

Determines whether to save the Pareto front of the entire evolution (True) or just the final population (False).

batch_size: float, default 1.0

Fraction of the training data to sample every generation. If 1.0, all data is used. Very small values can improve execution time, but can also lead to underfitting.

save_population: str, optional (default “”)

String containing the path to save the final population. Ignored if not provided.

load_population: str, optional (default “”)

String containing the path to load the initial population. Ignored if not provided.

shuffle_split: boolean, optional (default False)

Whether the engine should shuffle the data before splitting it into train and validation partitions. Ignored if validation_size is set to zero.
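How validation_size and shuffle_split interact can be sketched on sample indices alone. This is an illustration in plain Python, not the engine's code: `split_train_validation` is a hypothetical helper, and taking the validation partition from the end of the data, as well as the rounding rule, are assumptions.

```python
import random

def split_train_validation(n_samples, validation_size=0.0,
                           shuffle_split=False, seed=None):
    """Hypothetical helper: return (train_indices, validation_indices)."""
    indices = list(range(n_samples))
    if validation_size == 0.0:
        # No hold-out: the training data is also used for validation,
        # and shuffle_split has no effect.
        return indices, indices
    if shuffle_split:
        random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * validation_size))
    n_train = n_samples - n_val
    return indices[:n_train], indices[n_train:]

train, val = split_train_validation(10, validation_size=0.2)
print(train, val)
```

Without shuffling, the split follows the original row order, which matters when the data is sorted by target or by time.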

logfile: str, optional (default: “”)

If specified, statistics are written to this log file. An empty string disables logging.

random_state: int or None, default None

If int, the value is used to seed the C++ random generator; if None, a seed is generated using a non-deterministic generator. Note that even with a fixed random state, runs of brush that use multiple threads are unlikely to produce identical results. The operating system’s scheduler decides which thread runs at any given time, so reproducibility is not guaranteed.
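The int-or-None convention can be illustrated with Python's standard library. This sketch only mirrors the behaviour described above for a single thread. It does not touch the C++ generator; `make_rng` is a hypothetical helper and the 32-bit seed width is an assumption.

```python
import random
import secrets

def make_rng(random_state=None):
    """Hypothetical helper: return (generator, seed actually used)."""
    if random_state is None:
        # No seed given: draw one from a non-deterministic source.
        seed = secrets.randbits(32)
    else:
        seed = random_state
    return random.Random(seed), seed

rng_a, seed_a = make_rng(42)
rng_b, _ = make_rng(42)
# Same seed, same single-threaded sequence.
print([rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)])
```

Recording the seed that was actually used makes it possible to repeat a single-threaded run later, even when random_state was left as None.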