EstimatorInterface


class pybrush.EstimatorInterface.EstimatorInterface(mode: str = 'classification', pop_size: int = 100, max_gens: int = 100, max_time: int = -1, max_stall: int = 0, verbosity: int = 0, max_depth: int = 10, max_size: int = 100, num_islands: int = 5, n_jobs: int = 1, mig_prob: float = 0.05, cx_prob: float = 0.14285714285714285, mutation_probs: Dict[str, float] = {'delete': 0.16666666666666666, 'insert': 0.16666666666666666, 'point': 0.16666666666666666, 'subtree': 0.16666666666666666, 'toggle_weight_off': 0.16666666666666666, 'toggle_weight_on': 0.16666666666666666}, functions: List[str] | Dict[str, float] = {}, initialization: str = 'uniform', objectives: List[str] = ['scorer', 'linear_complexity'], scorer: str = None, algorithm: str = 'nsga2', weights_init: bool = True, validation_size: float = 0.0, use_arch: bool = False, val_from_arch: bool = True, constants_simplification=True, inexact_simplification=True, batch_size: float = 1.0, sel: str = 'lexicase', surv: str = 'nsga2', save_population: str = '', load_population: str = '', bandit: str = 'dynamic_thompson', shuffle_split: bool = False, logfile: str = '', random_state: int = None)

Interface class for all estimators in pybrush.

Parameters:
mode: str, default ‘classification’

The mode of the estimator. Used by subclasses.

pop_size: int, default 100

Population size.

max_gens: int, default 100

Maximum iterations of the algorithm.

max_time: int, optional (default: -1)

Maximum run time, in seconds, used as a termination criterion. If -1, not used.

max_stall: int, optional (default: 0)

How many generations to continue after the validation loss has stalled. If 0, not used.
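The three stopping parameters (max_gens, max_time, max_stall) can be read as independent checks. The sketch below is only an illustration of that reading: the function name `should_stop`, its arguments, and the use of `>=` comparisons are assumptions, not the engine's actual logic.

```python
def should_stop(gen, elapsed_seconds, stalled_gens,
                max_gens=100, max_time=-1, max_stall=0):
    """Hypothetical helper: True when any enabled stopping criterion is met."""
    if gen >= max_gens:
        return True
    # max_time == -1 disables the time criterion.
    if max_time != -1 and elapsed_seconds >= max_time:
        return True
    # max_stall == 0 disables the stall criterion.
    if max_stall != 0 and stalled_gens >= max_stall:
        return True
    return False
```

With the defaults, only the generation limit is active; the other two checks take effect once max_time or max_stall is given a non-sentinel value.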

verbosity: int, default 0

Controls level of printouts. Set 0 to disable all printouts, 1 for basic information, and 2 or more for detailed information.

max_depth: int, default 10

Maximum depth of GP trees in the GP program. Use 0 for no limit.

max_size: int, default 100

Maximum number of nodes in a tree. Use 0 for no limit.

num_islands: int, default 5

Number of independent islands to use in the evolutionary framework. This also corresponds to the number of parallel threads in the C++ engine.

mig_prob: float, default 0.05

Probability of a migration occurring between two random islands at the end of a generation. Must be between 0 and 1.

cx_prob: float, default 1/7

Probability of applying the crossover variation when generating the offspring. Must be between 0 and 1. Each offspring individual is generated by either crossover or mutation, never both. With n mutations available, the default aims for a uniform probability across crossover and every possible mutation: setting cx_prob=1/(n+1) and giving each mutation a probability of 1/n in mutation_probs achieves that uniform distribution.
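The arithmetic behind the default can be checked directly. This sketch uses exact fractions and the six mutation names from the default mutation_probs; the variable names are illustrative only.

```python
from fractions import Fraction

# The six mutations listed in the default mutation_probs.
mutations = ["point", "insert", "delete", "subtree",
             "toggle_weight_on", "toggle_weight_off"]
n = len(mutations)

cx_prob = Fraction(1, n + 1)                              # 1/7
mutation_probs = {m: Fraction(1, n) for m in mutations}   # 1/6 each

# Overall chance of each variation: crossover is chosen with cx_prob;
# otherwise (probability 1 - cx_prob) one mutation is drawn from mutation_probs.
overall = {"crossover": cx_prob}
overall.update({m: (1 - cx_prob) * p for m, p in mutation_probs.items()})
```

Every entry of `overall` equals 1/7, since (6/7) * (1/6) = 1/7, and the seven entries sum to 1.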

mutation_probs: dict, default {“point”:1/6, “insert”:1/6, “delete”:1/6, “subtree”:1/6, “toggle_weight_on”:1/6, “toggle_weight_off”:1/6}

A dictionary with keys naming the types of mutation and floating-point values specifying the fraction of total mutations to perform with that method. The probability of applying a mutation is (1-cx_prob); when a mutation is applied, the mutation type is sampled according to the probabilities defined in mutation_probs. The probabilities should add up to 1.0. A non-positive value disables that mutation, even if the multi-armed bandit strategy is turned on (the mutation is hidden from the bandit at initialization).
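The two-stage choice described above (crossover versus mutation, then which mutation) can be sketched in plain Python. `sample_variation` is a hypothetical helper, not part of pybrush, and the bandit is not modeled here.

```python
import random

def sample_variation(cx_prob, mutation_probs, rng):
    """Hypothetical helper: pick 'crossover' or one mutation name for one offspring."""
    if rng.random() < cx_prob:
        return "crossover"
    # Non-positive entries are treated as disabled and are never sampled.
    enabled = {name: p for name, p in mutation_probs.items() if p > 0}
    names = list(enabled)
    weights = [enabled[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
probs = {"point": 0.5, "insert": 0.5, "delete": 0.0}
picks = [sample_variation(1 / 3, probs, rng) for _ in range(1000)]
```

Because "delete" has probability 0.0, it never appears in `picks`. Note that `random.choices` rescales the weights, so this sketch would not flag a dictionary whose values fail to sum to 1.0.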

functions: dict[str,float] or list[str], default {}

A dictionary with keys naming the function set and values giving the probability of sampling them, or a list of functions which will be weighted uniformly. If empty, all available functions are included in the search space.
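The two accepted forms can be thought of as one mapping from function name to sampling weight. In the sketch below, `normalize_functions`, the weight of 1.0 used for list entries, and the function names "Add", "Mul" and "Sin" are all assumptions made for illustration.

```python
def normalize_functions(functions):
    """Hypothetical helper: return a name -> weight mapping from a list or a dict."""
    if isinstance(functions, dict):
        return dict(functions)
    # A list gives every function the same weight (1.0 is an assumed value).
    return {name: 1.0 for name in functions}

as_list = normalize_functions(["Add", "Mul", "Sin"])
as_dict = normalize_functions({"Add": 0.5, "Mul": 0.5})
empty = normalize_functions({})  # empty means "use all available functions"
```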

initialization: {“uniform”, “max_size”}, default “uniform”

Distribution of sizes on the initial population. If max_size, then every expression is created with max_size nodes. If uniform, size will be uniformly distributed between 1 and max_size.
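A minimal sketch of the two size distributions follows. `initial_sizes` is a hypothetical helper, and it only draws target sizes; building expressions of those sizes is not shown.

```python
import random

def initial_sizes(initialization, max_size, pop_size, rng):
    """Hypothetical helper: target node counts for the initial population."""
    if initialization == "max_size":
        return [max_size] * pop_size
    if initialization == "uniform":
        # randint is inclusive on both ends: sizes fall in [1, max_size].
        return [rng.randint(1, max_size) for _ in range(pop_size)]
    raise ValueError(f"unknown initialization: {initialization!r}")

rng = random.Random(42)
uniform_sizes = initial_sizes("uniform", max_size=100, pop_size=100, rng=rng)
fixed_sizes = initial_sizes("max_size", max_size=100, pop_size=100, rng=rng)
```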

objectives: list[str], default [“scorer”, “linear_complexity”]

List with one or more objectives to use. The first objective is the main one. If “scorer” is used, the metric given in scorer is used as the objective. Possible values are “scorer”, “size”, “complexity”, “linear_complexity”, and “depth”. The first objective is used as the criterion for Pareto sorting when creating the archive, and the recursive complexity is used as the secondary objective.
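Pareto sorting over two objectives can be illustrated with a generic non-dominated filter. The sketch assumes both objectives are minimized (for example an error and a complexity value); it is not the archive code used by the engine, and the candidate values are made up.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# (error, complexity) pairs for five hypothetical individuals.
candidates = [(0.10, 12), (0.20, 5), (0.15, 12), (0.30, 3), (0.25, 6)]
front = pareto_front(candidates)
```

Here (0.15, 12) is dominated by (0.10, 12) and (0.25, 6) by (0.20, 5), so `front` keeps the other three points.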

scorer: str, default None

The metric to use for the “scorer” objective. If None, it will be set to “mse” for regression and “log” for binary classification.

algorithm: {“nsga2island”, “nsga2”, “gaisland”, “ga”}, default “nsga2”

Which Evolutionary Algorithm framework to use to evolve the population. This is used only in DeapEstimators.

weights_init: bool, default True

Whether the search space should initialize the sampling weights of terminal nodes based on the correlation with the output y. If False, then all terminal nodes will have the same probability of 1.0. This parameter is ignored if the bandit strategy is used, and weights will be learned dynamically during the run.

validation_size: float, default 0.0

Fraction of samples to use as a hold-out partition. These samples are used to calculate statistics during evolution, but are not used to train the models. The best_estimator_ is selected using this partition. If zero, the same data used for training is also used for validation.

val_from_arch: boolean, optional (default: True)

Validates the final model using the archive rather than the whole population.

constants_simplification: boolean, optional (default: True)

Whether the program should check for constant sub-trees and replace each one with a single constant-valued terminal.

inexact_simplification: boolean, optional (default: True)

Whether the program should use the inexact simplification proposed in:

Guilherme Seidyo Imai Aldeia, Fabrício Olivetti de França, and William G. La Cava. 2024. Inexact Simplification of Symbolic Regression Expressions with Locality-sensitive Hashing. In Genetic and Evolutionary Computation Conference (GECCO ‘24), July 14-18, 2024, Melbourne, VIC, Australia. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3638529.3654147

The inexact simplification algorithm works by mapping similar expressions to the same hash, and retrieving the simplest one when doing the simplification of an expression.

use_arch: boolean, optional (default: False)

Determines whether to save the Pareto front of the entire evolution (True) or just the final population (False).

batch_size: float, default 1.0

Fraction of the training data to sample every generation. If 1.0, all data is used. Very small values can improve execution time, but can also lead to underfitting.
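Per-generation sampling can be sketched as drawing a fresh subset of row indices each generation. `sample_batch` is a hypothetical helper; sampling without replacement and truncating the batch size are assumptions.

```python
import random

def sample_batch(n_samples, batch_size, rng):
    """Hypothetical helper: row indices used for training in one generation."""
    if batch_size >= 1.0:
        return list(range(n_samples))
    # Truncate to a whole number of rows, but always keep at least one.
    k = max(1, int(n_samples * batch_size))
    return sorted(rng.sample(range(n_samples), k))

rng = random.Random(0)
full = sample_batch(100, 1.0, rng)
quarter = sample_batch(100, 0.25, rng)
```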

sel: str, default ‘lexicase’

The selection method used for parent selection. When using lexicase, the selection is done as if it were a single-objective problem, based on absolute error for regression and log loss for classification.
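For readers unfamiliar with lexicase selection, the following is a sketch of the plain textbook form on per-sample absolute errors. It is not the engine's implementation; regression variants commonly relax the exact-equality test with a tolerance, which is omitted here.

```python
import random

def lexicase_select(errors, rng):
    """Select one parent index; errors[i][j] is individual i's error on sample j."""
    candidates = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # each selection event sees the samples in a new order
    for case in cases:
        best = min(errors[i][case] for i in candidates)
        # Keep only the individuals that match the best error on this sample.
        candidates = [i for i in candidates if errors[i][case] == best]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)

rng = random.Random(1)
errors = [
    [0.0, 0.0, 0.0],  # individual 0: best or tied for best on every sample
    [1.0, 0.5, 2.0],
    [0.0, 3.0, 0.0],
]
parents = [lexicase_select(errors, rng) for _ in range(20)]
```

Individual 0 is never eliminated and always ends up alone once sample 1 is considered, so every entry of `parents` is 0.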

surv: str, default ‘nsga2’

The survival method for selecting the next generation from parents and offspring.

save_population: str, optional (default “”)

Path where the final population is saved. Ignored if not provided.

load_population: str, optional (default “”)

Path from which the initial population is loaded. Ignored if not provided.

bandit: str, optional (default: “dynamic_thompson”)

The bandit strategy to use for the estimator. Options are “dummy”, which does not change the probabilities; “thompson”, which uses static Thompson sampling to update the sampling probabilities of terminals in the search space; and “dynamic_thompson”, which implements a Thompson strategy that weights recent rewards more heavily and applies exponential decay to older observed rewards.
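The idea of Thompson sampling with exponentially decayed rewards can be sketched with a Beta-Bernoulli bandit. Everything in this block is an assumption made for illustration: the class name, the decay rate of 0.9, the reward range [0, 1], and the choice to decay all arms on every update. It does not reproduce the engine's implementation.

```python
import random

class DecayingThompson:
    """Illustrative Beta-Bernoulli bandit that forgets old rewards exponentially."""

    def __init__(self, arms, decay=0.9, rng=None):
        self.decay = decay
        self.rng = rng or random.Random()
        # Beta(1, 1) prior for every arm.
        self.alpha = {arm: 1.0 for arm in arms}
        self.beta = {arm: 1.0 for arm in arms}

    def choose(self):
        # Draw one sample per arm from its posterior and pick the largest.
        draws = {arm: self.rng.betavariate(self.alpha[arm], self.beta[arm])
                 for arm in self.alpha}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # Shrink accumulated evidence toward the prior, then add the new reward.
        for a in self.alpha:
            self.alpha[a] = 1.0 + self.decay * (self.alpha[a] - 1.0)
            self.beta[a] = 1.0 + self.decay * (self.beta[a] - 1.0)
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward

bandit = DecayingThompson(["x1", "x2"], decay=0.9, rng=random.Random(0))
for _ in range(50):
    bandit.update("x1", 1.0)  # x1 keeps being rewarded
    bandit.update("x2", 0.0)  # x2 keeps failing
choice = bandit.choose()
```

Because of the decay, the accumulated evidence for an arm stays bounded (below 1 + 1/(1 - decay)), so the bandit can still change its preferences if the rewards shift later in the run. With a decay of 1.0 the same class behaves like a static Thompson sampler.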

shuffle_split: boolean, optional (default False)

Whether the engine should shuffle the data before splitting it into train and validation partitions. Ignored if validation_size is set to zero.
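The interaction between validation_size and shuffle_split can be sketched on row indices. `split_indices` is a hypothetical helper; taking the validation rows from the end of the (possibly shuffled) index list is an assumption, and the engine may partition the data differently.

```python
import random

def split_indices(n_samples, validation_size, shuffle_split, rng):
    """Hypothetical helper: return (train_indices, validation_indices)."""
    indices = list(range(n_samples))
    if validation_size == 0.0:
        # No hold-out partition: the training data doubles as validation data.
        return indices, indices
    if shuffle_split:
        rng.shuffle(indices)
    n_val = int(round(n_samples * validation_size))
    return indices[:n_samples - n_val], indices[n_samples - n_val:]

rng = random.Random(7)
train, val = split_indices(10, 0.2, shuffle_split=False, rng=rng)
train_s, val_s = split_indices(10, 0.2, shuffle_split=True, rng=rng)
train_all, val_all = split_indices(10, 0.0, shuffle_split=True, rng=rng)
```

Without shuffling, the split depends on the row order of the input, which matters when the data is sorted by target or by time.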

class_weights: list of float, default []

List of weights to assign to each class in classification problems. The length of the list should match the number of classes. If empty, all classes are assumed to have equal weight. This can be useful to handle imbalanced datasets by assigning weights to underrepresented classes.

logfile: str, optional (default: “”)

If specified, statistics are written to this log file. “” disables logging.

random_state: int or None, default None

If int, the value is used to seed the C++ random generator; if None, a seed is generated using a non-deterministic generator. Note that even with a fixed random state, running Brush with multiple threads is unlikely to produce identical results: the operating system’s scheduler chooses which thread runs at any given time, so reproducibility is not guaranteed.