The Search Space#
The SearchSpace holds the terminals and operators used to define programs, and includes utilities for creating and modifying programs. It has a few basic components:
node_map
: this object maps function signatures to specific node types. It is a nested map that matches on return types first, then on full signatures; this structure keeps mutation and crossover lookups fast.

terminal_map
: the same as node_map, but for terminals.
Both of these maps have associated weights that are used to weight the probabilities of each operator/terminal being sampled. Users can optionally provide these weights.
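To make the layout concrete, here is a minimal Python sketch of the idea (a hypothetical illustration, not Brush's actual C++ data structure): the outer key is the return type, the inner key is the full argument signature, and each node carries a sampling weight. Keying on return type first matters because mutation and crossover must replace a subtree with one of the same return type.

```python
import random

# Hypothetical sketch of the nested node_map layout: return type first,
# then the argument signature, then node name -> sampling weight.
node_map = {
    "ArrayF": {                      # return type
        ("ArrayF", "ArrayF"): {      # argument signature
            "Add": 0.5,
            "Mul": 1.0,              # sampled twice as often as Add
        },
    },
}

def sample_node(return_type):
    """Pick a node with the given return type, weighted by its weight."""
    candidates = [
        (name, weight)
        for sig in node_map[return_type].values()
        for name, weight in sig.items()
    ]
    names, weights = zip(*candidates)
    return random.choices(names, weights=weights, k=1)[0]

print(sample_node("ArrayF"))  # 'Add' or 'Mul', with 'Mul' twice as likely
```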
Initializing#
At a minimum, initializing the search space requires that a Dataset is already defined, so that SearchSpace knows how to define the terminals.
import pandas as pd
from pybrush import Dataset, SearchSpace
df = pd.read_csv('../examples/datasets/d_example_patients.csv')
X = df.drop(columns='target')
y = df['target']
df.describe()
|       | id        | sex        | race       | target     |
|-------|-----------|------------|------------|------------|
| count | 993.00000 | 993.000000 | 993.000000 | 993.000000 |
| mean  | 496.00000 | 0.487412   | 2.625378   | 8.219092   |
| std   | 286.79871 | 0.500093   | 1.725240   | 1.101319   |
| min   | 0.00000   | 0.000000   | 0.000000   | 1.337280   |
| 25%   | 248.00000 | 0.000000   | 1.000000   | 7.836757   |
| 50%   | 496.00000 | 0.000000   | 3.000000   | 8.404038   |
| 75%   | 744.00000 | 1.000000   | 4.000000   | 8.810710   |
| max   | 992.00000 | 1.000000   | 5.000000   | 11.410597  |
data = Dataset(X,y)
search_space = SearchSpace(data)
By default, the search space includes all available operators that have at least one argument type matching a datatype in Data. That can be quite large. Instead, the user may specify a set of operators, along with weightings that determine the probability of each being sampled, i.e.
user_ops = {
    # 'Add': 0.5,
    # 'Sub': 0.5,
    # 'Mul': 1.0,
    # 'Div': 0.1,
    'SplitBest': 0.1,
    'SplitOn': 0.1,
}
search_space = SearchSpace(data, user_ops)
Inspecting#
We now have a much smaller search space. To view it, call print():
search_space.print()
=== Search space ===
terminal_map: {"ArrayI": ["x_2", "1.00"], "ArrayB": ["x_1", "1.00"], "ArrayF": ["x_0", "1.00"]}
terminal_weights: {"ArrayI": [0.01214596, 0.01214596], "ArrayB": [0.026419641, 0.026419641], "ArrayF": [0.056145623, 0.056145623]}
node_map[ArrayB][["ArrayI", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayF", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayB", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayB", "ArrayB"]][SplitBest] = SplitBest, weight = 0.1
node_map[ArrayI][["ArrayB", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayF", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayI", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayI", "ArrayI"]][SplitBest] = 1.00*SplitBest, weight = 0.1
node_map[ArrayF][["ArrayB", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayI", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF"]][SplitBest] = 1.00*SplitBest, weight = 0.1
Initializing sampling probabilities#
The search space creates one terminal for each input feature. When sampling terminals to create or mutate programs, Brush can either start with uniformly initialized weights or pre-calculate sampling probabilities based on each feature's correlation with the target variable. These weights affect how often each terminal occurs in the programs during the run. You can enable the pre-calculated probabilities by setting weights_init=True; this setting is true by default.
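The actual initialization happens in the C++ backend, but the idea can be sketched as follows (an illustrative approximation, not Brush's exact formula): seed each terminal's weight from the absolute correlation between that feature and the target, so informative features are sampled more often.

```python
import numpy as np

# Illustrative sketch only: seed each terminal's sampling weight from the
# absolute Pearson correlation between the feature and the target.
# Brush's C++ backend may use a different formula.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # y depends mostly on x_0

weights = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)
probs = weights / weights.sum()  # normalize into sampling probabilities
print(probs)  # x_0 gets most of the probability mass
```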
Below we show the search space with the weights off.
search_space_off = SearchSpace(data, user_ops, weights_init=False)
search_space_off.print()
=== Search space ===
terminal_map: {"ArrayI": ["x_2", "1.00"], "ArrayB": ["x_1", "1.00"], "ArrayF": ["x_0", "1.00"]}
terminal_weights: {"ArrayI": [1, 1], "ArrayB": [1, 1], "ArrayF": [1, 1]}
node_map[ArrayB][["ArrayI", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayF", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayB", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayB", "ArrayB"]][SplitBest] = SplitBest, weight = 0.1
node_map[ArrayI][["ArrayB", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayF", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayI", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayI", "ArrayI"]][SplitBest] = 1.00*SplitBest, weight = 0.1
node_map[ArrayF][["ArrayB", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayI", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF"]][SplitBest] = 1.00*SplitBest, weight = 0.1
Note that the node_map includes a SplitBest operator for each datatype: for example, one with the signature ArrayI(ArrayI, ArrayI) and one with the signature ArrayF(ArrayF, ArrayF). This is because our dataset contains both integer and floating point data types. Note also that the default behavior is to give each of these nodes the same weight, as specified by the user.
Loading datatypes#
If you pass a numpy array, Brush will try to infer datatypes from its values. If you pass a pandas dataframe instead, Brush will adopt the data types that pandas has already inferred for each column.
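A rough sketch of the distinction (a hypothetical value-based inference, not Brush's actual C++ logic): a dataframe already carries dtypes, while a raw array forces inference from the values themselves, e.g. treating 0/1 columns as boolean and integer-valued columns as integers.

```python
import numpy as np
import pandas as pd

def infer_brush_type(values):
    """Hypothetical sketch of value-based type inference for a raw array."""
    arr = np.asarray(values)
    unique = np.unique(arr)
    if set(unique) <= {0, 1}:
        return "ArrayB"                  # only 0/1 values: looks boolean
    if np.all(arr == arr.astype(int)):
        return "ArrayI"                  # all values integer-valued
    return "ArrayF"                      # otherwise treat as float

df = pd.DataFrame({"x_0": [0.1, 2.5], "x_1": [0, 1], "x_2": [1, 5]})
print(df.dtypes.tolist())                # pandas already knows the dtypes
print([infer_brush_type(df[c].values) for c in df.columns])
```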
data = Dataset(X.values, y.values)
search_space = SearchSpace(data, user_ops)
search_space.print()
=== Search space ===
terminal_map: {"ArrayI": ["x_2", "1.00"], "ArrayB": ["x_1", "1.00"], "ArrayF": ["x_0", "1.00"]}
terminal_weights: {"ArrayI": [0.01214596, 0.01214596], "ArrayB": [0.026419641, 0.026419641], "ArrayF": [0.056145623, 0.056145623]}
node_map[ArrayB][["ArrayI", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayF", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayB", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.1
node_map[ArrayB][["ArrayB", "ArrayB"]][SplitBest] = SplitBest, weight = 0.1
node_map[ArrayI][["ArrayB", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayF", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayI", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayI][["ArrayI", "ArrayI"]][SplitBest] = 1.00*SplitBest, weight = 0.1
node_map[ArrayF][["ArrayB", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayI", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF"]][SplitBest] = 1.00*SplitBest, weight = 0.1
The regressor contains an engine, responsible for running the search. The engine has its own instance of the search space, which you can access from a fitted regressor via est.engine_.search_space.
Automatically updating sampling probabilities with multi-armed bandits#
Brush has a reinforcement learning agent, implemented in the C++ backend, that observes which terminals make the programs better or worse with respect to the user-defined objectives and updates the sampling probabilities accordingly. To enable it, set bandit="thompson" for a single static probability estimate used during the entire search, or bandit="dynamic_thompson" for a policy that gives more importance to recently observed results and less to older generations.
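The bandit itself lives in the C++ backend; as a rough sketch of the dynamic Thompson sampling idea (an illustration, not Brush's implementation), each arm keeps Beta-distribution counts of how often choosing it improved a program, decays old evidence so recent generations matter more, and is selected by sampling from its posterior:

```python
import random

# Illustrative sketch of dynamic Thompson sampling, not Brush's actual code.
# Each arm tracks Beta(alpha, beta) evidence of "improved" vs "did not";
# the decay factor down-weights older observations.
class DynamicThompsonArm:
    def __init__(self, decay=0.99):
        self.alpha, self.beta, self.decay = 1.0, 1.0, decay

    def update(self, success):
        # decay old evidence, then record the new observation
        self.alpha *= self.decay
        self.beta *= self.decay
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def sample(self):
        return random.betavariate(self.alpha, self.beta)

random.seed(42)
arms = {"SplitBest": DynamicThompsonArm(), "SplitOn": DynamicThompsonArm()}
for _ in range(200):
    arms["SplitBest"].update(success=True)   # pretend SplitBest keeps helping
    arms["SplitOn"].update(success=False)

# pick the arm whose posterior sample is highest
chosen = max(arms, key=lambda name: arms[name].sample())
print(chosen)  # 'SplitBest', whose posterior is concentrated near 1
```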
from pybrush import BrushRegressor
est = BrushRegressor(
functions=user_ops,
max_gens=20,
objectives=["scorer", "linear_complexity"],
weights_init=True,
bandit="dynamic_thompson",
verbosity=1
)
est.fit(X.values,y.values)
print("Best model:", est.best_estimator_.get_model())
print('score:', est.score(X,y))
Completed 100% [====================]
saving final population as archive...
Best model: If(x_0>0.50,If(x_0>1.50,8.22,5.77),9.04)
score: 0.005547027354798839
est.engine_.search_space.print()
=== Search space ===
terminal_map: {"ArrayI": ["x_2", "1.00"], "ArrayB": ["x_1", "1.00"], "ArrayF": ["x_0", "1.00"]}
terminal_weights: {"ArrayI": [0.5988976, 0.39022964], "ArrayB": [0.31329083, 0.29768455], "ArrayF": [0.01, 0.01]}
node_map[ArrayB][["ArrayI", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.4727218
node_map[ArrayB][["ArrayF", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.13049549
node_map[ArrayB][["ArrayB", "ArrayB", "ArrayB"]][SplitOn] = SplitOn, weight = 0.2557611
node_map[ArrayB][["ArrayB", "ArrayB"]][SplitBest] = SplitBest, weight = 0.44342086
node_map[ArrayI][["ArrayB", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.079836436
node_map[ArrayI][["ArrayF", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.26834965
node_map[ArrayI][["ArrayI", "ArrayI", "ArrayI"]][SplitOn] = 1.00*SplitOn, weight = 0.978205
node_map[ArrayI][["ArrayI", "ArrayI"]][SplitBest] = 1.00*SplitBest, weight = 0.54717386
node_map[ArrayF][["ArrayB", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.13805643
node_map[ArrayF][["ArrayI", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.22573847
node_map[ArrayF][["ArrayF", "ArrayF", "ArrayF"]][SplitOn] = 1.00*SplitOn, weight = 0.09722547
node_map[ArrayF][["ArrayF", "ArrayF"]][SplitBest] = 1.00*SplitBest, weight = 0.01
print(est.best_estimator_.get_model("tree"))
If(x_0)
|- If(x_0)
| |- 8.22
| |- 5.77
|- 9.04
Sampling#
TODO. For now, see the mutation and crossover functions in the Program class.