Contribution Guide
We are happy to accept contributions of methods, as well as updates to the benchmarking framework. Below we specify minimal requirements for contributing a method to this benchmark.
Ground Rules
- In general you should submit pull requests to the dev branch.
- Make the PR detailed and reference specific issues if the PR is meant to address any.
- Please be kind and please be patient. We will be, too.
How to contribute an SR method
To contribute a symbolic regression method for benchmarking, fork the repo, make the changes listed below, and submit a pull request to the dev branch.
Once your method passes the basic tests and we’ve reviewed it, congrats!
We will plan to benchmark your method on hundreds of regression problems.
Please note that the schedule for updating benchmarks depends on many factors, including the availability of computing resources and of our contributors. If you are on a tight schedule, it is better to plan to benchmark your method yourself. You can leverage this code base and previous experimental results to do so.
Requirements
- An open-source method with a scikit-learn compatible API
- Your method should be compatible with Python 3.7 or higher to ensure compatibility with conda-forge.
- If your method uses a random seed, it should have a `random_state` attribute that can be set.
- Methods must have their own folders in the `algorithms` directory (e.g., `algorithms/feat`). This folder should contain:
  - `metadata.yml` (required): a file describing your submission, following the descriptions in [algorithms/feat/metadata.yml][metadata].
  - `regressor.py` (required): a Python file that defines your method, named appropriately. See [algorithms/feat/regressor.py][regressor] for complete documentation. It should contain:
    - `est`: a sklearn-compatible `Regressor` object.
    - `model(est, X=None)`: a function that returns a sympy-compatible string specifying the final model. It can optionally take the training data as an input argument. See guidance below.
    - `eval_kwargs` (optional): a dictionary that can specify method-specific arguments to `evaluate_model.py`.
    - We expect your algorithm to have a `max_time` parameter that lets us control the maximum execution time in seconds. When running the experiments on a cluster, we will allow extra time to compensate for the overhead of initializing everything; the time limit applies only to the fit process. A `signal.SIGALRM` signal will be sent to your process if `fit(X, y)` exceeds the maximum time, and you can implement strategies to handle this signal. One idea is to store a random initial solution as the best and update it during execution, so that the `evaluate_model.py` script always finds an equation to work on.
  - `LICENSE` (optional): a license file.
  - `environment.yml` (optional): a conda environment file that specifies dependencies for your submission. It will be used to update the baseline environment (`environment.yml` in the root directory). To the extent possible, conda should be used to specify the dependencies you need. If your method is part of conda, great! You can just put that in here and leave `install.sh` blank.
  - `requirements.txt` (optional): a PyPI requirements file. The script will run `pip install -r requirements.txt` if this file is found, before proceeding.
  - `install.sh` (optional): a bash script that installs your method. **Note:** scripts should not require sudo permissions. The library and include paths should point to the conda environment; the environment variable `$CONDA_PREFIX` specifies the path to the environment.
  - `Dockerfile` (optional): we will try to dockerize all algorithms. You can optionally include a `Dockerfile` inside your `algorithms/your-submission` folder to describe a specific image for running your algorithm. If no file is provided, `alg-Dockerfile` will be used for your container. You can specify the image as you like, as long as it includes, at a minimum, the Python packages described in `base_environment.yml`, since they are used to run the experiment scripts. See this example in case you want to use a custom image. Note that there is a workflow to build the Docker images and push them to Docker Hub.
- Do not include your source code. Use `install.sh` to pull it from a stable source repository.
Model compatibility with sympy
In order to check for exact solutions to problems with known, ground-truth models, each SR method must return a model string that can be manipulated in sympy. Ensure that the returned model meets these requirements:
- The variable names appearing in the model are identical to those in the training data, `X`, which is a `pd.DataFrame`. If your method names variables some other way, e.g. `[x_0 ... x_m]`, you can specify a mapping in the `model` function such as:

      def model(est, X=None):
          # Map the method's internal names (x_0, x_1, ...) to the column names of X.
          mapping = {'x_'+str(i): k for i, k in enumerate(X.columns)}
          new_model = est.model_
          # Replace the highest indices first so that x_1 does not clobber x_10.
          for k, v in reversed(list(mapping.items())):
              new_model = new_model.replace(k, v)
          return new_model

- The operators/functions in the model are available in sympy’s function set.
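
The string-replacement mapping above depends on the order in which names are replaced (`x_1` is a prefix of `x_10`). As an alternative, here is a minimal sketch that matches whole variable tokens with a regular expression, so the order no longer matters. `rename_variables` is a hypothetical helper, not part of the benchmark code, and it assumes the method names its variables `x_0 ... x_m`.

```python
import re


def rename_variables(model_str, columns):
    """Replace x_0, x_1, ... with the matching column names (hypothetical helper)."""
    # \b anchors ensure that x_1 never matches inside x_10 or inside max_1.
    pattern = re.compile(r"\bx_(\d+)\b")

    def substitute(match):
        index = int(match.group(1))
        # Leave unknown indices untouched rather than guessing a name.
        return columns[index] if index < len(columns) else match.group(0)

    return pattern.sub(substitute, model_str)


columns = [f"feature_{i}" for i in range(11)]
print(rename_variables("x_1 + 2.5*x_10", columns))
# prints: feature_1 + 2.5*feature_10
```

Inside a `model(est, X=None)` function, such a helper could be called as `rename_variables(est.model_, list(X.columns))`, assuming `est.model_` holds the raw model string.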