The Search Space#

The SearchSpace holds the terminals and operators used to define programs, and includes utilities for creating programs and modifying them. It has a few basic components:

  • node_map: this object maps function signatures to specific node types. It is a nested map that matches on return types first, then full signatures; it is structured this way to make mutation and crossover lookups fast.

  • terminal_map: same as node_map but for terminals.

Both of these maps have associated weights that determine the probability of each operator or terminal being sampled. Users can optionally provide these weights.
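The nested layout can be illustrated with a plain-Python sketch. The structure below is illustrative only, not pybrush's actual internals:

```python
# Illustrative sketch of the node_map layout (not pybrush's actual
# internals): return type -> argument signature -> node names.
node_map = {
    "ArrayF": {
        ("ArrayF", "ArrayF"): ["Add", "Sub", "Mul", "Div", "SplitBest"],
    },
    "ArrayI": {
        ("ArrayI", "ArrayI"): ["SplitBest"],
    },
}

# Matching on the return type first narrows the search immediately:
# a mutation that must produce an ArrayF never scans the ArrayI entries.
candidates = node_map["ArrayF"][("ArrayF", "ArrayF")]
```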

Initializing#

At a minimum, initializing the search space requires that a Dataset is already defined, so that SearchSpace knows how to define the terminals.

import pandas as pd
from pybrush import Dataset, SearchSpace

df = pd.read_csv('../examples/datasets/d_enc.csv')
X = df.drop(columns='label')
y = df['label']

data = Dataset(X,y)

search_space = SearchSpace(data)

By default, the search space includes all available operators that have at least one argument type matching a datatype in the Dataset. That can be quite large.

Instead, the user may specify a subset of operators, along with weights that determine each operator's probability of being sampled, e.g.

user_ops = {
    'Add': 0.5,
    'Sub': 0.5,
    'Mul': 1.0,
    'Div': 0.1,
    'SplitBest':0.2
}

search_space = SearchSpace(data, user_ops)
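These weights are relative: heavier operators are sampled more often. The normalization below is an illustrative computation, not part of the pybrush API:

```python
# The weights are relative; an operator's sampling probability is its
# weight divided by the total (illustrative, not the pybrush API).
user_ops = {
    'Add': 0.5,
    'Sub': 0.5,
    'Mul': 1.0,
    'Div': 0.1,
    'SplitBest': 0.2
}

total = sum(user_ops.values())
probs = {op: w / total for op, w in user_ops.items()}
# 'Mul' carries the largest share: 1.0 / 2.3, roughly 0.43.
```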

Inspecting#

We now have a much smaller search space. To view it, call its print() method:

search_space.print()
Search Space
===
terminal_map: {"ArrayB": ["1.00"], "ArrayI": ["x_5", "x_7", "1.00"], "ArrayF": ["x_0", "x_1", "x_2", "x_3", "x_4", "x_6", "1.00", "1.00*MeanLabel"]}
terminal_weights: {"ArrayB": [-nan], "ArrayI": [0.011619061, 0.03579926, 0.023709161], "ArrayF": [0.6343385, 0.67299956, 0.42711574, 0.8625447, 0.8957853, 0.20750472, 0.6167148, 0.6167148]}
node_map[ArrayI][["ArrayI", "ArrayI"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayF][["ArrayF", "ArrayF"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayF][["ArrayF", "ArrayF"]][Div] = Div, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF"]][Mul] = Mul, weight = 1
node_map[ArrayF][["ArrayF", "ArrayF"]][Sub] = Sub, weight = 0.5
node_map[ArrayF][["ArrayF", "ArrayF"]][Add] = Add, weight = 0.5
===

Note that the node_map includes two SplitBest operators: one with the signature ArrayI(ArrayI, ArrayI) and one with the signature ArrayF(ArrayF, ArrayF). This is because our dataset contains both integer and floating point data types. Note also that, by default, both of these nodes receive the same weight as specified by the user.

Sampling#

TODO. For now, see the mutation and crossover functions in the Program class.
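As a stopgap, the effect of the weights on sampling can be sketched in plain Python with the standard library. This is a hedged illustration of weighted sampling in general, not Brush's actual sampler:

```python
import random

# Hedged sketch of weighted operator sampling (illustrative only; the
# actual sampling logic lives inside pybrush). Each operator is drawn
# with probability proportional to its user-supplied weight.
ops = ['Add', 'Sub', 'Mul', 'Div', 'SplitBest']
weights = [0.5, 0.5, 1.0, 0.1, 0.2]

random.seed(0)  # fixed seed for a reproducible demonstration
sampled = random.choices(ops, weights=weights, k=1000)
# 'Mul' (weight 1.0) should appear far more often than 'Div' (weight 0.1).
```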