
Comments (26)

marcovirgolin avatar marcovirgolin commented on August 11, 2024 1

@lacava great that we use the same data splits for each algorithm. That'd basically mean we don't have an issue with variance due to n_samples being too small as every algorithm runs on the same samples. Maybe something to keep in mind if a reviewer asks.

About my suggestion to reach 1M evaluations:

as others have mentioned this would put us further towards benchmarking implementations rather than methods

My personal preference then would be to keep the evaluation budget smaller (even though my own code is C++) and hope most algorithms can consistently reach it, just because I personally find algorithmic search capability more interesting than implementation speed. Then we have both sets of results (evaluations-based, time-based) to present.

Update about using an eval budget: I just tried Standard GP against GP-GOMEA on my (slow) laptop w/ the same code base, with a limit of 500,000 evaluations, on Boston housing (~500 samples). Because GP-GOMEA's trees don't really grow over time (it uses a template-like encoding and performs a form of homologous crossover), GP-GOMEA takes ~10 times less runtime than Standard GP. So that goes back to our discussion about point evaluations.
If a time-based/point-evaluations-based comparison is not possible, I think we should say that some algorithms can be (severely) (dis)advantaged when using evaluations only, for a number of reasons.


lacava avatar lacava commented on August 11, 2024 1

Reflecting on this a bit: not many of our methods have asin or acos. It may make sense to drop those problems, of which there are only a few, rather than have everyone add those operators.


marcovirgolin avatar marcovirgolin commented on August 11, 2024

Hi,

Perhaps a partial solution to the fact that tree size plays a role and some methods use mini-batches is to have a budget that takes into account:

num_observations_used_to_evaluate_each_model * num_components_each_model

Or, perhaps even simpler to implement, set the budget to be a total number of component calls (i.e., exclude num_observations_used_to_evaluate), and then (a rough sketch follows below):

  1. When computing the loss (no matter if within the fitness function, in a local search step, etc.), the budget spent so far is incremented by the number of components of the model being evaluated (e.g., tree nodes, "complexity")
  2. If part of the model does not need re-evaluation (e.g., the model is a linear sum of subfunctions and only 1 subfunction is changed + caching is used for the output of the others), just add the number of components that are actually being re-evaluated.
  3. If the method uses a mini-batch instead of the full training set, multiply by the ratio |mini batch size|/|training set size| (this way, only those who use batches need to change the code).

This does not account for everything, of course, as some methods carry extra computational complexity in parts other than evaluation (e.g., I know mine do: GP-GOMEA builds a linkage model every generation to decide what swaps to perform at crossover time, and RDO parses a library of subfunctions to decide which to use as the mutating branch).
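To make the accounting concrete, here is a minimal sketch of such a component-call budget, assuming a hypothetical ComponentBudget helper (the names, default limit, and hook points are illustrative, not taken from any of the actual codebases):

```python
class ComponentBudget:
    """Budget expressed in component (node) calls rather than whole-model evaluations."""

    def __init__(self, limit=500_000 * 50):  # hypothetical: ~500k evaluations of ~50-node models
        self.limit = limit
        self.spent = 0.0

    def charge(self, n_components, batch_size, train_size):
        """Charge one (partial) model evaluation.

        n_components: number of nodes actually (re-)evaluated; pass only the
            changed part if caching avoids a full re-evaluation (point 2 above).
        batch_size / train_size: mini-batch correction (point 3 above), so methods
            evaluating on a subset of the data are charged proportionally less.
        """
        self.spent += n_components * (batch_size / train_size)

    @property
    def exhausted(self):
        return self.spent >= self.limit


# Usage inside a loss computation (fitness, local search step, ...):
# budget.charge(n_components=tree_size, batch_size=len(batch), train_size=n_train)
# and stop the run once budget.exhausted is True.
```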

Maybe C++-based implementations could be additionally compared in terms of total time?


lacava avatar lacava commented on August 11, 2024

Maybe C++-based implementations could be additionally compared in terms of total time?

Yes, I'm currently planning to compare all methods on process time.

I think you are right @marcovirgolin that having a total component evaluation budget (sometimes called "point evaluations") would be even more fair... but I'm skeptical we can get it implemented equivalently (and quickly) in each of the 10 or so methods we're benchmarking, most of which provide only no. of gens and population size as arguments. ellyn supports setting max point evals, but I'm not sure about others. And of course nodes with different computational complexity (e.g. exp versus +) would be counted as the same.

It might also be interesting to monitor the load average on the cluster, if you can isolate it on a per-method basis, maybe combine with other measurements like memory usage. This would give a more general measure of each method's computational requirements.

Let me look into this. For sure I can get RAM usage; not sure about load, especially since it is a multi-use, heterogeneous cluster.


folivetti avatar folivetti commented on August 11, 2024

If we think only about the evolutionary methods, how many of them have local search and/or mini-batches? If it's only those maintained by the people involved in this repo, we could go with @foolnotion's and @marcovirgolin's suggestions:

  • limit to 500k evaluations
  • count local search evaluations
  • multiply by the mini-batch ratio (|mini-batch| / |training set|)

In any case, it would also be interesting to have an estimate of running times for these algorithms. Maybe run evaluate_model.py once for each algorithm on one of the large datasets to see how different the running times are and whether we need to cut down the budget.


mkommend avatar mkommend commented on August 11, 2024

Just to add my 2 cents to the discussion as well. Unfortunately, I haven't come across the ultimate method for performance comparison of algorithms for symbolic regression. Every metric has some kind of shortcoming: wall clock time includes the efficiency of the implementation, the number of evaluated solutions does not take local search into account, and point evaluations omit the overhead of some more advanced calculations (for example, linkage calculations).

IMHO the fairest comparison is to estimate computational effort using point evaluations; if that's not feasible, the number of evaluated solutions (including local search evaluations) will do just fine (as @folivetti suggests).

I would neglect the fact that different symbols have different complexities (e.g. AQ or exp) and also ignore the model length while calculating the spent evaluation budget. However, as an additional measure, an estimate of the actual running time of the algorithms would be interesting and would correct some of the shortcomings of using the number of evaluations as the computational budget.


lacava avatar lacava commented on August 11, 2024

Based on the discussion it seems we're agreeing on a limit of 500,000 evaluations, including local search evaluations and accounting for mini-batch size.

Another practical issue is maximum run-time on the cluster. Our short priority queue has a max of 24 hours, which would be an ideal limit for us, but based on our last experiment I'm not sure some of these methods will finish individual runs on datasets on a single thread in that time. Important differences from last time: I've added the ability to sub-sample the datasets in PMLB that are large (currently set to 10,000 samples max), and we're using HalvingGridSearch this time which is a bit quicker.

So my questions are:

  • do your methods support setting a maximum time stop criterion, and if so, what is a reasonable limit? (My preference would be in the 24 - 72 hour range). I know MRGP, AFP, Eplex, FEAT support this option.
  • if your method does not support it, should we set up training sample limits for specific methods that are not finishing in time, so that they produce a result? We'd have to reconfigure n_samples to sub-sample the training data only, in order to have fair comparisons on the test set. If not, how should we handle methods that do not produce a final model?
  • what's a reasonable number for n_samples? (currently 10,000)


foolnotion avatar foolnotion commented on August 11, 2024

do your methods support setting a maximum time stop criterion, and if so, what is a reasonable limit? (My preference would be in the 24 - 72 hour range). I know MRGP, AFP, Eplex, FEAT support this option.

I have added a time limit for Operon. I was reluctant to do it before because I thought it would introduce non-deterministic behavior and impact reproducibility, but that's something the user must be aware of. However, like @mkommend mentioned above, a time limit might mean that you benchmark not only the method but also the implementation.
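For illustration only (this is not Operon's API), a wall-clock stop criterion in an evolutionary loop usually amounts to something like the sketch below; with a time limit, the number of generations completed depends on the hardware, which is exactly the reproducibility concern:

```python
import time

def run_with_budgets(step, max_evals=500_000, max_time_s=24 * 3600):
    """Generic loop with both an evaluation and a wall-clock budget.

    `step` is a placeholder for one generation of the method; it should return
    the number of model evaluations it performed.
    """
    start = time.perf_counter()
    evals = generations = 0
    while evals < max_evals and time.perf_counter() - start < max_time_s:
        evals += step()
        generations += 1
    # `generations` is machine-dependent when the time limit triggers first.
    return generations, evals
```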

if your method does not support it, should we set up training sample limits for specific methods that are not finishing in time, so that they produce a result? We'd have to reconfigure n_samples to sub-sample the training data only, in order to have fair comparisons on the test set. If not, how should we handle methods that do not produce a final model?

Maybe let them time out the first time around (it probably won't happen on all problems, only on a subset), then rerun each method with sub-sampled training data on the specific problems where it timed out.

what's a reasonable number for n_samples? (currently 10,000)

10,000 looks ok to me. I looked at https://epistasislab.github.io/pmlb/ and found 14 problems with > 10000 samples. You could also choose to sample a % of the observations (e.g. 10%) with an upper ceiling for the really large datasets.

Problem Observations Features Classes Endpoint Imbalance Type
1595_poker 1025010 10   continuous 0.37 regression
1191_BNG_pbc 1000000 18   continuous 0 regression
1196_BNG_pharynx 1000000 10   continuous 0 regression
1203_BNG_pwLinear 177147 10   continuous 0 regression
1201_BNG_breastTumor 116640 9   continuous 0.04 regression
215_2dplanes 40768 10   continuous 0 regression
344_mv 40768 10   continuous 0 regression
564_fried 40768 10   continuous 0 regression
1193_BNG_lowbwt 31104 9   continuous 0 regression
218_house_8L 22784 8   continuous 0.02 regression
574_house_16H 22784 16   continuous 0.02 regression
537_houses 20640 8   continuous 0 regression
1199_BNG_echoMonths 17496 9   continuous 0 regression
201_pol 15000 48   continuous 0.37 regression
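For illustration, the fraction-with-ceiling rule suggested above could be expressed as follows (hypothetical helper, not the srbench code):

```python
def sampled_size(n_rows, fraction=0.10, ceiling=10_000):
    """Take a fixed fraction of the observations, but never more than `ceiling` rows."""
    return min(int(fraction * n_rows), ceiling, n_rows)

# e.g. 1595_poker: 10% of 1,025,010 is 102,501 -> capped at 10,000;
#      201_pol:    10% of 15,000 is 1,500 -> no cap needed.
print(sampled_size(1_025_010), sampled_size(15_000))  # 10000 1500
```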


folivetti avatar folivetti commented on August 11, 2024

I would rather reduce the max number of samples for everyone and only then reduce it even more for the algorithms that timed out. But I would make sure everyone can run all of the 500k evaluations, for the same reasons pointed out by @foolnotion and @mkommend.

Could you make a single run of each algorithm (not the grid search, just a single hyperparameter setting) with the 201_pol dataset? Then we can use the slowest algorithm as a baseline to fine-tune the number of samples (and even the number of evaluations, if necessary). I think 10k samples are good enough, and we could use even less for the sake of benchmarking.

About changing the code to implement a time limit, in my case it could hurt the performance badly since I'm relying on lazy evaluation to optimize some steps, and this would require me to force strictness in some key points of my code. In any case, I think 24 hours should be enough for a single run on the largest dataset. But I haven't tested it yet to be sure.


marcovirgolin avatar marcovirgolin commented on August 11, 2024
  • I think 500K is nice but, mayyyybe, if time allows, we could also consider rounding up to 1 million? :)
  • My code can use a time limit already, and it is a good idea to use it as a hard deadline IMHO.
  • I think the n_samples that is sufficient may well depend on the dataset. The first thing that comes to my mind to estimate the right n_samples (of a dataset) is to use some relatively fast & robust fitting method (e.g., random forest?) with different (e.g., doubling every time) n_samples settings, and look at the variance/IQR on the test set. This across repeated runs/cv-splits. We stop doubling n_samples when the change in variance/IQR becomes small (e.g., within 1% R^2?). Even better, at that point bisection could be used to find n_samples_optimal, which should be somewhere between the value that satisfied the variance threshold and half of that. (A rough sketch of this procedure follows after this list.)
  • I assume the codebase already ensures that the stochastic search algorithms all see the same data splits, correct?
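A possible sketch of the doubling procedure described in the third bullet above; the random-forest proxy, the 1% R² threshold, and the IQR criterion are the examples given there, while the function name and defaults are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def estimate_n_samples(X, y, start=500, threshold=0.01, cv=5, seed=0):
    """Double n_samples until the spread (IQR) of CV test R^2 stops changing
    by more than `threshold`; a fast proxy model stands in for the SR methods."""
    rng = np.random.default_rng(seed)
    prev_iqr, n = None, start
    while n <= len(X):
        idx = rng.choice(len(X), size=n, replace=False)
        scores = cross_val_score(
            RandomForestRegressor(n_estimators=100, random_state=seed),
            X[idx], y[idx], cv=cv, scoring="r2",
        )
        iqr = float(np.subtract(*np.percentile(scores, [75, 25])))
        if prev_iqr is not None and abs(prev_iqr - iqr) < threshold:
            return n  # bisection between n // 2 and n could refine this further
        prev_iqr, n = iqr, n * 2
    return len(X)
```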


folivetti avatar folivetti commented on August 11, 2024

I've run ITEA for 564_fried dataset on my local machine using 500k evaluations. A single run of the grid search with 6 combos using HalvingGridSearch with 5 folds took about 10 hours.

Model name:          Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
16GB RAM
Ubuntu 18.04.5 LTS


lacava avatar lacava commented on August 11, 2024

I've run ITEA for 564_fried dataset on my local machine using 500k evaluations. A single run of the grid search with 6 combos using HalvingGridSearch with 5 folds took about 10 hours.

Model name:          Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
16GB RAM
Ubuntu 18.04.5 LTS

Did you restrict it to one thread?

* I think 500K is nice but, mayyyybe, if time allows, we could also consider rounding up to 1 million? :)

I would also like to do 1 million, but I know that this will make the experiment much more difficult. Then again, if we set time limits, it shouldn't be too much of a problem... though, as others have mentioned, this would put us further towards benchmarking implementations rather than methods. Personally I am ok with that, given that 24-72 hours is a reasonably generous time budget for training a model on a dataset with 10,000 samples.

* My code can use a time limit already, and it is a good idea to use it as a hard deadline IMHO.

Excellent.

* I think the `n_samples` that is sufficient may well depend on the dataset. The first thing that comes to my mind to estimate the right `n_samples` (of a dataset), is to use some relatively fast & robust fitting method (e.g., random forest?) with different (e.g., doubling every time) `n_samples` settings, and look at the variance/IQR on the test set. This across repeated runs/cv-splits. We stop doubling `n_samples` when the change in variance/IQR become small (e.g., within 1% R^2?). Even better, at that point bisection could be used to find `n_samples_optimal`, which should be somewhere between the value that allowed us to satisfy the variance threshold, and half of that.

This sounds like a cool approach. I would wonder if it would bias our results though, especially since we are comparing to random forests. I think I'm more comfortable just limiting the samples for all methods equivalently, so that, even if we are making generalization variance higher in doing so, it is the same restriction across the benchmarked methods.

* I assume the codebase already ensures that the stochastic search algorithms all see the same data splits, correct?

Yes.


folivetti avatar folivetti commented on August 11, 2024

Did you restrict it to one thread?

Forgot about that :-D It used two threads. I'll try to repeat this experiment tomorrow with a single thread.

Update about the execution time: I ran it again with the same settings as before but with a single thread and, this time, without any other CPU-hungry process (e.g. Chrome). It took 10 hours and 9 minutes.


lacava avatar lacava commented on August 11, 2024

Thanks everyone for your input. I'm working on finalizing the settings, adding an sklearn API for AI-Feynman, and testing runtime on 201_pol as @folivetti suggested. (It does seem like most methods finish under 8 hours, but not all have completed.) I've done some dry runs and everything is more or less working.

Here's what we have:

Experiment Settings

  • Max solution evaluations: 500,000
    • Including any constant updates that evaluate the models.
    • Adjusted for mini batch training
  • Hyperparameters: 6 settings per SR method.
    • Fed into HalvingGridSearchCV for tuning with 5 folds (a minimal sketch follows after this list).
    • I'm thinking of adding an option to skip tuning for methods (like AI-Feynman) that are very slow or otherwise don't want/need to tune. But this will affect method-to-method runtime comparisons.
  • Max total time per dataset: somewhere in the 24-72 hour range, depending on my runtime test.
    • Note this means max time settings for a single training instance will need to take into account multiple fits.
  • Datasets: looks like we'll have about 260 total (some surprises :) )
    • Note: I will be testing methods' abilities to find exact solutions to some problems that have known forms. Some of those problems may have sin and cos in them. So you may want to include those operators in your search space if you haven't.
  • Trials: 10 repeats
  • Total SR methods: we should have 14 or possibly 15 total, including Bayesian, Deep Learning, and other non-GP approaches
  • Control ML methods: Linear, Kernel Regression, MLP, XGBoost, LightGBM, RF, Adaboost
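For reference, the tuning stage can be wired up roughly like this with scikit-learn's successive-halving search; the stand-in estimator and parameter grid below are placeholders, not the actual srbench configuration:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required to enable the class)
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeRegressor  # stand-in for an SR estimator

# Hypothetical grid: 3 * 2 = 6 hyperparameter combinations per method.
param_grid = {"max_depth": [4, 8, 16], "min_samples_leaf": [1, 5]}

search = HalvingGridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,            # 5 folds, as in the settings above
    factor=3,        # successive halving keeps the best third each round
    scoring="r2",
    n_jobs=1,        # single thread, to keep runtimes comparable
    random_state=0,
)
# search.fit(X_train, y_train)  # the best setting is then refit on the full training split
```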

Todo:

  • @marcovirgolin can you confirm 500,000 evals per training in sembackprop and GP-GOMEA hyperparameters? Including constant optimization etc. I didn't want to make any wrong changes.
  • Complexity: I'm thinking it may be more robust to compare complexity by converting all equations to a symbolic form using https://github.com/sympy/sympy. For the datasets with exact solutions this will also make it easier to check for exact solutions.
  • sampling: For datasets with >10,000 samples, I am thinking it makes more sense to only sample the training sets, so that we are evaluating on the full test set. Let me know if there are any objections.
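A minimal sketch of that sampling policy (split first, then cap only the training portion; the 10,000 cap matches the current setting, the helper itself is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def subsample_train_only(X, y, max_train=10_000, test_size=0.25, seed=0):
    """Split into train/test, then cap only the training split so the test set stays intact."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    if len(X_tr) > max_train:
        idx = np.random.default_rng(seed).choice(len(X_tr), size=max_train, replace=False)
        X_tr, y_tr = X_tr[idx], y_tr[idx]
    return X_tr, X_te, y_tr, y_te
```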


marcovirgolin avatar marcovirgolin commented on August 11, 2024

Todo:

  • @marcovirgolin can you confirm 500,000 evals per training in sembackprop and GP-GOMEA hyperparameters? Including constant optimization etc. I didn't want to make any wrong changes.

Right now it's off; I will make a pull request with hyper-parameter settings so that there are 6 settings in total and the evolution stops at 500K evaluations. I will also add sin and cos as you suggest.

  • Complexity: I'm thinking it may be more robust to compare complexity by converting all equations to a symbolic form using https://github.com/sympy/sympy. For the datasets with exact solutions this will also make it easier to check for exact solutions.

I personally think that depends on what you want to claim. Simplification might cause misattribution of merit. For example, algorithm A that includes a smart complexity penalization term in the loss might be beaten by algorithm B that includes no penalization, solely out of luck, because B's giant models happen to simplify a lot. Is that a merit of the intended algorithmic design of B? So, I think potential merits are best evaluated without simplification. Perhaps the best way is to collect data on both settings (w/ and w/o simplification), see if they differ significantly, and then decide what to present, in what light.
However I agree that, for checking for solution exactness, simplification is the way to go.

  • sampling: For datasets with >10,000 samples, I am thinking it makes more sense to only sample the training sets, so that we are evaluating on the full test set. Let me know if there are any objections.

I agree 100%.


lacava avatar lacava commented on August 11, 2024

Perhaps the best way is to collect data on both settings (w/ and w/o simplification), see if they differ significantly, and then decide what to present, in what light.

That's a good point. I'll try to capture both separately.


marcovirgolin avatar marcovirgolin commented on August 11, 2024

I have a few questions:

  • Do we have a limit on the maximum solution size? (that can really impact how long an evaluation takes and fitting capability)
  • What is the minimum function set that allows to reconstruct the exact formulas? So far I get we should include sin cos (and, I guess, + - * /)


foolnotion avatar foolnotion commented on August 11, 2024

Complexity: I'm thinking it may be more robust to compare complexity by converting all equations to a symbolic form using https://github.com/sympy/sympy. For the datasets with exact solutions this will also make it easier to check for exact solutions.

I think it's ok to simplify, especially for GP where the encoding is inherently redundant (e.g. typically only binary symbols are used, so n-ary operations are emulated via nesting), not to mention the inevitable bloat. It seems perfectly fine to compare simplified models, as this is a routine step during post-processing anyway. As @marcovirgolin suggested, you could compare both versions (simplified and non-simplified) and see the differences.

Our infix exporter was designed with sympy in mind, although I couldn't find a full spec for their parser. It looks like the only significant difference is that we use ^ instead of ** for pow, but apart from that no changes should be needed.
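As a rough sketch of that post-processing step with sympy (the ^ → ** replacement mirrors the exporter note above; the node-count complexity measure and the example expressions are illustrative):

```python
import sympy as sp

def parse_model(infix_str):
    """Parse an exported infix model, mapping `^` to Python's `**` power operator."""
    return sp.sympify(infix_str.replace("^", "**"))

def complexity(expr):
    """Number of nodes in the expression tree, as a simple complexity proxy."""
    return sum(1 for _ in sp.preorder_traversal(expr))

raw = "x0*x1 + sin(x2)^2 + cos(x2)^2"
model = parse_model(raw)
simplified = sp.simplify(model)                 # -> x0*x1 + 1
print(complexity(model), complexity(simplified))

# Checking exactness against a known target form:
target = sp.sympify("x0*x1 + 1")
print(sp.simplify(model - target) == 0)         # True -> symbolically equivalent
```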

For datasets with >10,000 samples, I am thinking it makes more sense to only sample the training sets, so that we are evaluating on the full test set. Let me know if there are any objections.

Also agree. One practical thing to keep in mind is that if the test set is really huge, it might impact the runtime of methods that calculate the test error between iterations/generations (e.g. if they report any stats during the run).


lacava avatar lacava commented on August 11, 2024

Also agree. One practical thing to keep in mind is that if the test set is really huge, it might impact the runtime of methods that calculate the test error between iterations/generations (e.g. if they report any stats during the run).

None of the methods will have the (potentially large) test data passed to them, except through calls to predict(), so we shouldn't have any issues there.


lacava avatar lacava commented on August 11, 2024

@marcovirgolin

  • Do we have a limit on the maximum solution size? (that can really impact how long an evaluation takes and fitting capability)

At the moment, no. We are assessing generalization error, complexity and run-time, which means methods with bigger max sizes might be more accurate but also more complex / have longer running times. We could try to decide on a common setting, but I'd worry about differences in what defines solution "sizes" between methods as well. What do you think?

  • What is the minimum function set that allows to reconstruct the exact formulas? So far I get we should include sin cos (and, I guess, + - * /)

I'm wary of giving the exact minimal set, since I think it's easy to over-fit to the data/problem once you know. But I also don't want to prevent any method from finding a solution due to not having a basic function. What I will say is this: the list of operators below covers all the functions in the problems with exact solutions:

+, -, *, /, sin, cos, tanh, arcsin, arccos, exp, log, sqrt, pow(., c), c

This is not the minimal set and there are equivalences (e.g. tanh can be expressed with exp, +, and /), but everything is in there. You might consider using hyperparameter tuning to choose function sets if you think it will be useful.
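(For reference, one such equivalence is tanh(x) = 1 - 2/(exp(2*x) + 1).)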


folivetti avatar folivetti commented on August 11, 2024

BTW, one of my students added the functions sqrt, tanh, exp, log, sin, cos to the GSGP code you're using: https://github.com/gAldeia/SOFTX_2019_170

He hasn't changed anything beyond that. Unless you want to run the experiment only with code provided by the authors, feel free to use this version.


marcovirgolin avatar marcovirgolin commented on August 11, 2024

@marcovirgolin

  • Do we have a limit on the maximum solution size? (that can really impact how long an evaluation takes and fitting capability)

At the moment, no. We are assessing generalization error, complexity and run-time, which means methods with bigger max sizes might be more accurate but also more complex / have longer running times. We could try to decide on a common setting, but I'd worry about differences in what defines solution "sizes" between methods as well. What do you think?

Hm, surely the limit we set does make a difference runtime-wise, at least for the more "traditional" GP encodings. In my experience, given enough run time, the avg. model size of the pop tends to converge to the maximum allowed size. Given what you say, I guess we could consider this max size to be a hyperparam setting as well then, aimed at e.g. controlling for generalization. This way we don't have the issue of deciding a limit for everyone to adhere to (moreover, possibly complicated for less-traditional encodings).
I'd go on and set this as a hyper-param for my algs (the code is ready, I just need to change the hyper-param settings for the grid search).

  • What is the minimum function set that allows to reconstruct the exact formulas? So far I get we should include sin cos (and, I guess, + - * /)

I'm wary of giving the exact minimal set, since I think it's easy to over-fit to the data/problem once you know. But I also don't want to prevent any method from finding a solution due to not having a basic function. What I will say is this: the list of operators below covers all the functions in the problems with exact solutions:

+, -, *, /, sin, cos, tanh, arcsin, arccos, exp, log, sqrt, pow(., c), c

This is not the minimal set and there are equivalences (e.g. tanh can be expressed with exp, +, and /), but everything is in there. You might consider using hyperparameter tuning to choose function sets if you think it will be useful.

Thank you :)


lacava avatar lacava commented on August 11, 2024

OK, some updates:

I'm planning to set a single-train instance limit of 2 hours for each method (where possible). That translates to ~ 40-48 hours per dataset:

  • With hyperparameter tuning via HalvingGridSearchCV, each run amounts to roughly 20*T of training time, where T is the time limit for a single training instance. My goal is to limit runs to roughly 48 hours (40 hours + 8 hours of flex).

  • Based on the 201_pol problem (the biggest dataset we're training on), each method finishes under 2 hours per train (less than 48 hours total), aside from MRGP, which takes the full time budget I had set (24 hours per train). MRGP's authors designed it to run for a fixed time budget, so I think we're ok there. (Also, GP-GOMEA took ~180 minutes on some training settings, but it ran for 10 million evaluations in my test.)

  • For more interesting run-time comparisons, I will run each method with full parallelization on a workstation on the 201_pol dataset. The workstation has 28 cores and 2 graphics cards. If I have time, I'll try to run multiple trials. Sound good?

Some runs have started; they should all be submitted tonight. Thanks everyone!


folivetti avatar folivetti commented on August 11, 2024

Very nice! I'm looking forward to the results. It will certainly help us determine the current state of Symbolic Regression and what we need to improve in our own methods.

Just to be sure, have the experiments already started? I'm asking because I'm doing some optimizations in my own code that could help cut down the total runtime.


lacava avatar lacava commented on August 11, 2024

Just to be sure, have the experiments already started? I'm asking because I'm doing some optimizations in my own code that could help cut down the total runtime.

Yep, everything is running now. But feel free to update for the next round!


lacava avatar lacava commented on August 11, 2024

Seems like most of these issues are settled; closing for now.

