
jeds6391 / lgp

A robust Linear Genetic Programming implementation on the JVM using Kotlin.

License: MIT License

Kotlin 99.89% TeX 0.11%
ai genetic-programming gp jvm kotlin lgp linear-genetic-programming machine-learning

lgp's People

Contributors: hongyujerrywang, jeds6391, skalarproduktraum

lgp's Issues

Unit tests are missing.

As per the checklist in openjournals/joss-reviews#1337, a set of unit tests is required. I see that Travis is used to verify that the system builds, but you could also run your unit tests there. You could use JUnit to make sure that all the basic operations are covered.
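A minimal sketch of the kind of unit check that could seed such a suite. The function under test here (addOperation) is only a stand-in for one of the library's arithmetic operations, and plain check() calls are used so the sketch is self-contained; a real suite would exercise the actual operation classes through JUnit @Test methods.

```kotlin
// Hypothetical stand-in for one of the system's basic operations.
fun addOperation(a: Double, b: Double): Double = a + b

// Basic-operation checks; a real suite would use JUnit @Test methods
// against the library's actual operation classes instead of check().
fun testAddOperation() {
    check(addOperation(2.0, 3.0) == 5.0)
    check(addOperation(-1.0, 1.0) == 0.0)
}

fun main() {
    testAddOperation()
    println("all checks passed")
}
```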

Dependency Injection Framework

The ModuleContainer class in its current form is acting as a basic facilitator of dependency injection for the system -- implementations are registered with a module type so that whenever the container is queried for some module, it can provide the appropriate object.

This works well and is a good system design -- allowing for a great amount of flexibility -- but, it may be worthwhile to consider moving to a specific dependency injection framework. I anticipate this would require a fairly large and non-trivial change to the codebase, but it may be worth it as a framework would likely offer:

  • Better performance
  • More control and automation (e.g. object lifetimes, automatic dependency resolution)
  • A reduced chance of things breaking

I think an initial investigation into this would be beneficial -- but any changes will probably take a long time and may break API backwards compatibility (and also require a fairly substantial documentation overhaul).
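For context, the current ModuleContainer pattern can be sketched roughly as a type-to-factory registry (all names below are illustrative, not the library's actual API). A DI framework would replace this hand-rolled resolution with managed lifetimes and automatic dependency wiring:

```kotlin
// Illustrative names only -- not the library's actual API.
enum class ModuleType { SELECTION_OPERATOR, MUTATION_OPERATOR }

class ModuleContainerSketch {
    private val factories = mutableMapOf<ModuleType, () -> Any>()

    // Implementations are registered against a module type...
    fun register(type: ModuleType, factory: () -> Any) {
        factories[type] = factory
    }

    // ...and resolved on demand when the container is queried.
    fun resolve(type: ModuleType): Any =
        factories[type]?.invoke() ?: error("No module registered for $type")
}

fun main() {
    val container = ModuleContainerSketch()
    container.register(ModuleType.SELECTION_OPERATOR) { "TournamentSelection" }
    println(container.resolve(ModuleType.SELECTION_OPERATOR))
}
```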

Add configurable stopping criterion.

At the moment, the base evolution model uses a hard-coded stopping criterion to facilitate early exits. Because this is a commonly needed feature for models, it would be good to provide a built-in way to configure it.

A value in the configuration could set a fitness threshold that, when crossed, stops the system early. This lets the user decide how good a solution they want (provided that the system reaches such solutions in the first place, which is very dependent on the problem and its setup).
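A rough sketch of what such a threshold-based early exit might look like. The class and field names here are assumptions, not the library's API, and lower fitness is assumed to be better (as in error minimisation):

```kotlin
// Illustrative configuration carrying the proposed threshold.
data class ConfigSketch(val fitnessThreshold: Double, val generations: Int)

// Runs until the best fitness crosses the threshold or the
// generation budget is exhausted; returns the stopping generation.
fun runEvolutionSketch(config: ConfigSketch, fitnessOf: (Int) -> Double): Int {
    for (gen in 0 until config.generations) {
        val best = fitnessOf(gen)
        // Early exit once the threshold is crossed.
        if (best <= config.fitnessThreshold) return gen
    }
    return config.generations
}

fun main() {
    val config = ConfigSketch(fitnessThreshold = 0.1, generations = 100)
    // Simulated best-fitness curve that improves each generation.
    val stoppedAt = runEvolutionSketch(config) { gen -> 1.0 / (gen + 1) }
    println("stopped at generation $stoppedAt") // stopped at generation 9
}
```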

Resources for examples.

Currently there are problems when building a JAR of the lgp.examples package, as the resources aren't being correctly referenced.

It turns out you can't use the class loader's getResource() method; you must use getResourceAsStream(). This is fine, but none of the data set/config loaders support loading directly from a stream, which they would need to in order to facilitate this.

Alternatively, a data set/config loader that generates its instances or configuration in code could be provided, meaning no files would have to be included (which would be nice).
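The stream-based approach can be sketched as follows: a loader that accepts an InputStream works identically whether the stream comes from a file or from a JAR resource via getResourceAsStream(). Here the stream is simulated from an in-memory string so the sketch is self-contained:

```kotlin
import java.io.BufferedReader
import java.io.InputStream
import java.io.InputStreamReader

// A loader that accepts a stream rather than a file path; in the real
// case the stream would come from
// javaClass.getResourceAsStream("/some/resource.csv").
fun loadLines(stream: InputStream): List<String> =
    BufferedReader(InputStreamReader(stream)).readLines()

fun main() {
    // Simulated resource content, standing in for a CSV data set.
    val simulated = "a,b\n1,2".byteInputStream()
    println(loadLines(simulated))
}
```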

README.md needs small update

The command to run the example given in README.md is:
kotlin -cp LGP.jar MyProblemKt
I believe that it should be changed to:
kotlin -cp LGP.jar:. MyProblemKt
so that the MyProblemKt class can be found in the class path.

Base Problem Implementation

At the moment, it seems as though problem definition is kind of verbose and ends up being repeated a lot. It might be worthwhile to add a BaseProblem implementation that sets up defaults for the various modules but allows configurable parameters where available -- something like:

class BaseProblem(params: BaseProblemParameters) : Problem<Double>() {
    
    // Basic setup of default problem modules, environment, etc. using parameters given.
}

It would mainly be useful for those using the system where they want to just get started quickly by providing a data set and a bunch of parameters, rather than directly configuring and wiring up a bunch of different modules for custom functionality.
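A hypothetical sketch of how such a parameter object might look and be consumed. BaseProblemParameters and its fields here are assumptions for illustration, not a final API:

```kotlin
// Illustrative parameter object with sensible defaults, so a user
// only has to supply what differs from the defaults.
data class BaseProblemParametersSketch(
    val name: String,
    val populationSize: Int = 100,
    val generations: Int = 50
)

// Stand-in for a BaseProblem that wires up default modules from
// the given parameters.
class BaseProblemSketch(val params: BaseProblemParametersSketch) {
    fun describe(): String =
        "${params.name}: population=${params.populationSize}, generations=${params.generations}"
}

fun main() {
    val problem = BaseProblemSketch(BaseProblemParametersSketch(name = "Iris"))
    println(problem.describe()) // Iris: population=100, generations=50
}
```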

Add utility class for exporting results.

Currently, if you want to export results it is a manual process that will have to be done repeatedly for each experiment.

It would be good to add a base class that can be used to export results. I think the best approach would be to provide a generic results exporter that takes a provider that handles the logic of actually writing the results to a destination (e.g. file, database, whatever).

I will build a bunch of providers for common export situations, but will provide an API that can be built against.
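The provider-based design described above might be sketched as follows (interface and class names are illustrative): the exporter owns the formatting of results, while the provider owns the destination, so new destinations only require new providers.

```kotlin
// Illustrative provider abstraction: handles writing rows somewhere
// (file, database, whatever).
interface ExportProviderSketch {
    fun write(rows: List<String>)
}

// A provider that keeps rows in memory, useful for testing.
class InMemoryProvider : ExportProviderSketch {
    val written = mutableListOf<String>()
    override fun write(rows: List<String>) { written.addAll(rows) }
}

// Generic exporter: formats results and delegates output to a provider.
class ResultExporterSketch(private val provider: ExportProviderSketch) {
    fun export(results: Map<String, Double>) {
        val rows = results.map { (name, value) -> "$name,$value" }
        provider.write(rows)
    }
}

fun main() {
    val provider = InMemoryProvider()
    ResultExporterSketch(provider).export(mapOf("bestFitness" to 0.02))
    println(provider.written)
}
```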

Automatically encode categorical data

At the moment, there is no built-in way to encode categorical features/targets which means the burden is on the user to manually construct their data set in a way that will work with the system.

One option could be that presented in PR #47, where the labels (categories) are encoded into a vector. For example, the Iris data set is as follows:

sepal_length sepal_width petal_length petal_width species
5.1 3.5 1.4 0.2 Iris-setosa
7.0 3.2 4.7 1.4 Iris-versicolor
6.3 3.3 6.0 2.5 Iris-virginica

Which is transformed into the encoding:

sepal_length sepal_width petal_length petal_width species_being_Iris-setosa species_being_Iris-versicolor species_being_Iris-virginica
5.1 3.5 1.4 0.2 1.0 0.0 0.0
7.0 3.2 4.7 1.4 0.0 1.0 0.0
6.3 3.3 6.0 2.5 0.0 0.0 1.0
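The encoding shown above can be sketched as a simple one-hot helper (the function name is illustrative): each label becomes a vector with 1.0 at its own index and 0.0 elsewhere.

```kotlin
// Illustrative helper: encode a label as a one-hot vector over the
// known label set, matching the species columns shown above.
fun oneHotEncode(labels: List<String>, label: String): List<Double> =
    labels.map { if (it == label) 1.0 else 0.0 }

fun main() {
    val species = listOf("Iris-setosa", "Iris-versicolor", "Iris-virginica")
    println(oneHotEncode(species, "Iris-versicolor")) // [0.0, 1.0, 0.0]
}
```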

Another option would be to provide a parsing function that can do this automatically, similar to what is done in lgp.examples.Iris:

val targetLabels = setOf("Iris-setosa", "Iris-versicolor", "Iris-virginica")
val featureIndices = 0..3
val targetIndex = 4

val datasetLoader = CsvDatasetLoader(
    reader = BufferedReader(
        // Load from the resource file.
        InputStreamReader(this.datasetStream)
    ),
    featureParseFunction = { header: Header, row: Row ->
        val features = row.zip(header)
            .slice(featureIndices)
            .map { (featureValue, featureName) ->
                Feature(
                    name = featureName,
                    value = featureValue.toDouble()
                )
            }

        Sample(features)
    },
    targetParseFunction = { _: Header, row: Row ->
        val target = row[targetIndex]

        // ["Iris-setosa", "Iris-versicolor", "Iris-virginica"] -> [0.0, 1.0, 2.0]
        Targets.Single(targetLabels.indexOf(target).toDouble())
    }
)

Avoid usage of forEach on ranges.

There are a few places where the following syntax is used in the system:

(0..n).forEach {
    // Do something
}

According to this article, forEach is 300% slower than a for-loop and should be avoided.

A grep on the source code identified the following occurrences:

src/main/kotlin/lgp/core/evolution/operators/SelectionOperator.kt:    (0..tournamentSize - 2).forEach { _ ->
src/main/kotlin/lgp/core/evolution/model/Models.kt:                    (0 until this.environment.configuration.generations).forEach { gen ->
src/main/kotlin/lgp/core/evolution/model/Models.kt:                    (0 until this.environment.configuration.generations).forEach { gen ->
src/main/kotlin/lgp/core/evolution/model/Models.kt:                    (0 until numGenerations).forEach { _ ->
src/main/kotlin/lgp/core/evolution/model/Models.kt:                    (0 until [email protected]).forEach { _ ->
src/main/kotlin/lgp/core/program/registers/RegisterSet.kt:             (0 until this.totalRegisters).forEach { r ->
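The two forms are behaviourally equivalent, which makes the change mechanical; the plain for loop simply avoids the lambda and iterator machinery:

```kotlin
// forEach over a range invokes a lambda per element:
fun sumWithForEach(n: Int): Int {
    var total = 0
    (0 until n).forEach { i -> total += i }
    return total
}

// The equivalent plain for loop compiles to a simple counter:
fun sumWithForLoop(n: Int): Int {
    var total = 0
    for (i in 0 until n) total += i
    return total
}

fun main() {
    println(sumWithForEach(5)) // 10
    println(sumWithForLoop(5)) // 10
}
```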

Add release branch

This is not so much of a code issue but an infrastructure issue.

Currently releases come from whatever branch the latest tag is added to. I am tagging pre-releases on the development branch and final releases on the master branch, but it would be good to move pre-release versions onto a separate release branch and not produce any release artefacts on GitHub for the develop branch.

Fitness context modularity.

PR #4 mentions changes that would be worthwhile to improve the modularity of the fitness context implementation.

From the PR:

In further work, it may be worthwhile to consider modularising this to allow for custom fitness contexts, for cases where multiple program outputs are used as there is no way to currently handle this.

Also, a way should be given to determine which register is used as the output register (probably best done through a field in the config).

The changes, however, aren't entirely trivial, as they require defining a way to specify which module provides the fitness context (perhaps it could be a registered component?), so this is for further along in the project.

Update ResultAggregator to implement AutoCloseable.

It would be nice to not have to call ResultAggregator::close() explicitly when done with an aggregator instance.

If the AutoCloseable interface is implemented then the aggregator can be used with the use function by passing a block as a lambda. Then the aggregator will be automatically closed -- which is a much nicer API.

If use is not used, the close() function can still be called manually.
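A sketch of the proposed change. The aggregator body here is simplified to a flag and a list; the real class manages result outputs:

```kotlin
// Simplified stand-in for ResultAggregator implementing AutoCloseable.
class AggregatorSketch : AutoCloseable {
    private val results = mutableListOf<String>()
    var closed = false
        private set

    fun add(result: String) { results.add(result) }
    override fun close() { closed = true }
}

fun main() {
    val aggregator = AggregatorSketch()
    // The standard library's `use` closes the resource automatically,
    // even if the block throws.
    aggregator.use { it.add("run-1: fitness=0.05") }
    println("closed: ${aggregator.closed}") // closed: true
}
```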

Integrate Kotlin Flow API

The Kotlin Flow API looks interesting and like it could provide some benefits in the system, e.g. instruction generation, population generation.

Would be worthwhile to investigate where it might fit in and how it would work with the existing LGP API.

Review co-routine integration

In previous work, asynchronous training was implemented using the Kotlin co-routine APIs. These APIs have changed slightly since then, so it would be good to:

  • Review the implementation
  • Review current co-routine API
  • Update to latest Kotlin
  • Update implementation to match updated Kotlin API

Technical Debt: RegisterSet

As RegisterSet was one of the first classes implemented, it has some instances of non-idiomatic Kotlin code (e.g. for loops, busy constructors, other weird things).

It would be good to fix some of these issues to make the code align more with the other classes in the project -- as well as make it nicer to work with for any future changes.

Logging and output.

Currently, the system provides no information during runtime. It'd be nice to have optional logging that allows the system to output debug information where suitable so that users of the system can sanity-check the system's behaviour at runtime.

This would probably be turned on or off through Config so that the environment can provide access to the logging verbosity parameter throughout the system. The system would provide an ILogger as below to all modules, which they can use to log output, while leaving the logger implementation to decide where the log goes and whether it is output at all.

interface ILogger {
    val verbose: Boolean

    // For data that may be useful when debugging the system.
    fun debug(format: String, vararg args: Any?)

    // For data that describes the system's operation during runtime.
    fun info(format: String, vararg args: Any?)

    // For when something really bad happens.
    fun error(format: String, vararg args: Any?)
}

This logger may also be used to facilitate the issue relating to progress.

Secondary to this, the examples in lgp.examples currently don't output the best information so it might be good to refactor a bit to provide more information (e.g. the parameters for the problem).
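A hedged sketch of what a console-backed implementation of the proposed interface might look like (the interface is repeated here so the sketch is self-contained; the prefixes and destination are illustrative choices):

```kotlin
// Interface as proposed above, repeated for a self-contained sketch.
interface ILogger {
    val verbose: Boolean
    fun debug(format: String, vararg args: Any?)
    fun info(format: String, vararg args: Any?)
    fun error(format: String, vararg args: Any?)
}

// Console implementation: debug output is gated on the verbosity flag,
// info and error are always printed.
class ConsoleLogger(override val verbose: Boolean) : ILogger {
    override fun debug(format: String, vararg args: Any?) {
        if (verbose) println("DEBUG: " + format.format(*args))
    }
    override fun info(format: String, vararg args: Any?) {
        println("INFO: " + format.format(*args))
    }
    override fun error(format: String, vararg args: Any?) {
        println("ERROR: " + format.format(*args))
    }
}

fun main() {
    val logger = ConsoleLogger(verbose = false)
    logger.debug("generation %d complete", 5) // suppressed: verbose is off
    logger.info("trained %d of %d models", 3, 10)
}
```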

Model training progress output

As of the current implementation, when models are being trained using a Trainer, there is no progress output whatsoever. It might be nice to give some kind of status about how far along the process is.

One (albeit simple) implementation could take the number of runs (i.e. the number of models to be trained), keep track of each model's training, and output progress as (numberModelsTrained / totalNumberModels) * 100. This would give a basic indication of how much time is expected to be left.

An extension to this would be to keep a sort of running average of the time taken to train each model and then provide an estimated time left metric. This could also be a good idea for the model itself -- the average time for each generation as an estimate for how long it'll take until all generations are complete (I think gplearn implements something similar to this?).
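Both metrics described above can be sketched together (the class name is illustrative): the percentage comes from counting completed runs, and the ETA from extrapolating the running average of per-run durations over the remaining runs.

```kotlin
// Illustrative tracker for Trainer progress output.
class ProgressTrackerSketch(private val totalRuns: Int) {
    private val durationsMs = mutableListOf<Long>()

    fun recordRun(durationMs: Long) { durationsMs.add(durationMs) }

    // (numberModelsTrained / totalNumberModels) * 100
    fun percentComplete(): Double =
        (durationsMs.size.toDouble() / totalRuns) * 100.0

    // Running average per run, extrapolated over the remaining runs.
    fun estimatedMsRemaining(): Double {
        if (durationsMs.isEmpty()) return Double.NaN
        return durationsMs.average() * (totalRuns - durationsMs.size)
    }
}

fun main() {
    val tracker = ProgressTrackerSketch(totalRuns = 4)
    tracker.recordRun(100)
    tracker.recordRun(300)
    println("${tracker.percentComplete()}% complete")          // 50.0% complete
    println("~${tracker.estimatedMsRemaining()} ms remaining") // ~400.0 ms remaining
}
```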

Add the ability to specify numbers of runs in configuration.

Currently, the number of runs must be hard-coded in the problem definition when the Trainer instance is created. It would be good to provide a way to dynamically set this by exposing a parameter on the Configuration object.

This is an easy change, so it can probably be added as a hot-fix in the current release. It would also be a good issue for other contributors to start developing on the system with.

Installation issues

The required version of the JDK is not stated in the documentation. This can be resolved by simply stating the Java version in the documentation (I had to update to Java 1.8, as I had 1.7 installed).

I'm following the instructions in https://github.com/JedS6391/LGP/tree/release/4.2

However, kotlinc is not compiling MyProblem.kt. I get the following error:

$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
$ kotlinc -cp LGP.jar -no-jdk -no-stdlib MyProblem.kt
MyProblem.kt:1:36: error: unresolved reference: Configuration
import lgp.core.environment.config.Configuration
                                   ^
MyProblem.kt:11:22: error: unresolved reference: Config
            config = Config()
                     ^
$ ls
10.21105.joss.01337.pdf	LGP.jar			MyProblem.kt

Broken links in documentation
