henryzord / ardennes


An Estimation of Distribution Algorithm for Decision-Tree Induction.

Python 92.32% C 7.68%
eda classification decision-trees induction optimization optimization-algorithms evolutionary-computation evolutionary-algorithm genetic-algorithm estimation-distribution-algorithm

ardennes's Introduction

👋 Hi!

My name is Henry Cagnini. I have a PhD in Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), which included a six-month internship at the University of Kent. I also earned my Master's degree in Computer Science at PUCRS, and my bachelor's degree in Computer Science at the Federal University of Santa Maria (UFSM).

I was born and raised in Santa Maria, Rio Grande do Sul, Brazil.

💽 What do you do?

I'm currently an IT Analyst at Federal University of Santa Maria (UFSM). I mostly work with business intelligence, data mining, and machine learning.

I was also...

  • A Substitute Teacher at the Industrial Technical College of Santa Maria (CTISM) (High School and Undergraduate);
  • A teacher at various institutions, such as PUCRS, Uniritter, Feevale, TargetTrust (Postgraduate);
  • An intern at Motorola Mobility Inc;
  • An intern at Enovative Design and Technology.

You can check some of my current and previous work at these organizations:

Research interests

My research interests are evolutionary algorithms, machine learning, and the teaching of computer science topics.

🎨 What are your hobbies?

I like to...

  • Play videogames 🎮
  • Travel ✈️
  • Eat at cafés ☕
  • Be with the people I love 💑

๐ŸŒ Social networks

📫 How to reach me

E-mail me: [email protected]

💻 Competences

My competences, sorted from strongest to weakest, are*:

| Status | Programming languages | Markup languages | Technologies and topics |
|---|---|---|---|
| 5⭐ | Python, Java, C | LaTeX, Markdown | Evolutionary Algorithms, Machine Learning, Statistics, Data Visualization |
| 4⭐ | C++ | HTML, reStructuredText | |
| 3⭐ | Assembly, Android, Javascript | CSS | Frontend, Backend, Databases |
| 2⭐ | OpenCL, CUDA, Prolog, Haskell, PHP | | Deep Learning |
| 1⭐ | You're not here for my one-star competences, are you? | | |

* Non-exhaustive list; just the ones off the top of my head.

⚡ Fun fact

Did you know that humanity has already solved one of the seven Millennium Prize Problems? (The Poincaré conjecture, proved by Grigori Perelman.)


ardennes's People

Contributors: henryzord, jwehrmann

Forkers: chrisliu007

ardennes's Issues

Decrease class probability for inner nodes

Currently, the initial distribution over attributes is uniform, regardless of whether an attribute is predictive or the class attribute.

Change the probabilities of the first graphical model: set the probability of sampling the class attribute at inner nodes to zero, so the search starts from a more promising region.
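A minimal sketch of this idea, assuming the initial GM stores one categorical distribution over attribute names per node (the function name and attribute names below are hypothetical):

```python
def initial_attribute_distribution(predictive_attrs, class_attr):
    # Uniform distribution over predictive attributes only;
    # the class attribute gets zero probability at inner nodes,
    # steering the search away from degenerate splits.
    probs = {a: 1.0 / len(predictive_attrs) for a in predictive_attrs}
    probs[class_attr] = 0.0
    return probs

dist = initial_attribute_distribution(['sepal_length', 'sepal_width'], 'species')
```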

Smarter approach for categorical attributes

Currently, a very naïve (and time-consuming) approach is used to set the split for categorical attributes. Replace it with a more robust one, such as the approach used in the CART algorithm.

Optimize networkx.DiGraph nodes

Currently we are storing unnecessary information in each node's dictionary. The 'terminal' flag, for example, can be safely replaced by checking whether the node has successors, e.g. tree.out_degree(x) == 0 (note that in newer networkx versions, successors returns an iterator, so len(tree.successors(x)) would fail).
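A quick sketch of the equivalence, using out_degree, which works across networkx versions:

```python
import networkx as nx

tree = nx.DiGraph()
tree.add_edges_from([(0, 1), (0, 2)])  # node 0 is internal; 1 and 2 are leaves

def is_terminal(tree, node):
    # a node is a leaf iff it has no successors;
    # this replaces a stored 'terminal' flag entirely
    return tree.out_degree(node) == 0
```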

Fix plotting of folds

When running an N-fold cross-validation, in the plotting of final results, Ardennes actually plots the unweighted mean accuracy across folds, instead of the mean weighted by fold size. This is exemplified below:

  • Fold 1: size: 17 accuracy: 0.93
  • Fold 2: size: 17 accuracy: 0.87
  • Fold 3: size: 17 accuracy: 0.95
  • Fold 4: size: 17 accuracy: 0.88
  • Fold 5: size: 21 accuracy: 0.75

WRONG: (0.93 + 0.87 + 0.95 + 0.88 + 0.75) / 5 = 0.876
CORRECT: ((17 * 0.93) + (17 * 0.87) + (17 * 0.95) + (17 * 0.88) + (21 * 0.75)) / 89 = 0.870

Obviously, when all folds have the same size, this phenomenon does not occur.
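A sketch of the size-weighted mean, using the fold sizes and accuracies listed above (the helper name is hypothetical):

```python
def weighted_fold_accuracy(sizes, accuracies):
    # Weight each fold's accuracy by its number of instances,
    # instead of taking the plain mean across folds.
    total = sum(sizes)
    return sum(s * a for s, a in zip(sizes, accuracies)) / total

sizes = [17, 17, 17, 17, 21]
accs = [0.93, 0.87, 0.95, 0.88, 0.75]
acc = weighted_fold_accuracy(sizes, accs)  # ~0.870
```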

Change GM after some time

If it is verified that the mean, median, and maximum fitness have all converged to the same value, i.e.

fitness.mean == fitness.median == fitness.max

then increase the complexity of the graphical model.

One variable per level

Change the GM so that there is only one variable per level, as opposed to the current 2^h variables (where h is the depth of the level, starting from 0).

This should make the generated trees better (regarding their accuracy) and also speed up the sampling procedure.
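A back-of-the-envelope count of GM variables under both schemes (a sketch; it assumes a complete binary tree of height h):

```python
def n_gm_variables(height, one_per_level=False):
    # current scheme: one variable per node position, i.e. 2^d variables
    # at depth d, for a total of 2^(h+1) - 1 in a complete binary tree;
    # proposed scheme: a single variable per level, h + 1 in total
    if one_per_level:
        return height + 1
    return 2 ** (height + 1) - 1
```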

Prune tree

Currently something as counter-intuitive as this is occurring:
[figure_1: example of a generated tree]

Note that there are two kinds of bugs:

  • A predictive node is being coloured as a terminal one;
  • There are child nodes that lead to the same leaf value!

Fix this in order to decrease tree size.

Twin nodes

One way of solving the problem of splitting categorical attributes (and thus preventing the tree from growing too deep when handling multiple splits) is to perform an n-ary split (where n is the number of categories of a given predictive attribute), but set all child nodes to the same node label.

Obviously, the impact of this repetition of attributes on the GM update must be reduced.

Add support to folds and partition

Currently, ardennes only receives the full dataset as input and splits it itself (into folds or partitions). Make it also accept files in a given format that describe predetermined folds/partitions.

Better localization when dealing with GMs

The code currently relies heavily on the loc method of pandas.DataFrame for both the sampling and updating procedures:

grouped = self.weights.copy()  # type: pd.DataFrame
for p in self.parents:
    grouped = grouped.loc[grouped[p] == session[p]]

However, the values in each cell are already known, since they follow a predetermined distribution. Use this distribution to speed up the GM class.
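One way to avoid the repeated boolean filtering is to index the weights table by the parent columns once, turning each session lookup into a single indexed access. A sketch with hypothetical column names:

```python
import pandas as pd

weights = pd.DataFrame({
    'parent_a': ['x', 'x', 'y', 'y'],
    'parent_b': ['u', 'v', 'u', 'v'],
    'prob':     [0.1, 0.2, 0.3, 0.4],
})

# build the MultiIndex once; each lookup then becomes a hash access
# instead of a full boolean scan per parent
indexed = weights.set_index(['parent_a', 'parent_b']).sort_index()

session = {'parent_a': 'y', 'parent_b': 'v'}
row = indexed.loc[(session['parent_a'], session['parent_b'])]
```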

Fix threshold values

Fix threshold values for numerical attributes so that each candidate threshold is the midpoint between two adjacent values.
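The usual fix, sketched below: sort the distinct values and take the midpoints between consecutive pairs as candidate thresholds.

```python
def candidate_thresholds(values):
    # midpoints between consecutive distinct sorted values
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

thresholds = candidate_thresholds([3, 1, 2, 2])  # [1.5, 2.5]
```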

Implement two GMs

We should use two GMs in this EDA: one for initializing the population, and another for the subsequent populations.

The initial GM is currently implemented. It should only be modified to consider an increase in the probability of sampling terminal nodes as the depth of the tree increases.

The other GM, for the subsequent populations, is not yet implemented.
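For the depth-dependent leaf probability in the initial GM, a minimal sketch (the linear schedule below is an assumption, not the repository's actual rule):

```python
def p_terminal(depth, max_depth):
    # probability of sampling a terminal (leaf) node grows with depth,
    # reaching 1.0 at the maximum allowed depth
    return min(1.0, depth / float(max_depth))
```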

Top-down algorithm as heuristic

Use a top-down induction algorithm to predict a maximum depth for the trees in the population.

Guarantee that the top-down algorithm achieves 100% training accuracy, so that its depth serves as an upper bound.

Preserve threshold for nodes

Preserve thresholds for nodes in given positions, given that the parents are the same across different samplings.

For example: if in generation g_0 the threshold for node n_2 is already calculated for attribute a_2, given that its parents are nodes n_0 and n_1 with attributes a_0 and a_1, then this threshold doesn't need to be calculated again when this configuration repeats in any later generation.
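A memoisation sketch of this idea, keyed on the node's attribute plus its parents' (node, attribute) configuration (all names are hypothetical):

```python
threshold_cache = {}
computations = {'count': 0}

def expensive_threshold():
    # stand-in for the real threshold computation
    computations['count'] += 1
    return 2.5

def get_threshold(node_attr, parent_config):
    # parent_config: {parent_node: parent_attribute};
    # sorting makes the key independent of dict ordering
    key = (node_attr, tuple(sorted(parent_config.items())))
    if key not in threshold_cache:
        threshold_cache[key] = expensive_threshold()
    return threshold_cache[key]

t1 = get_threshold('a_2', {'n_0': 'a_0', 'n_1': 'a_1'})
t2 = get_threshold('a_2', {'n_1': 'a_1', 'n_0': 'a_0'})  # same configuration, cached
```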

Improve FinalGM model

Currently, FinalGM only uses univariate distributions. Adopt a bivariate distribution (in which each node's probability depends on the attribute chosen at the parent node) for sampling.

Make whole population vote

Test whether making the whole population vote in the last generation is better than simply using the best tree.
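A sketch of a plain (unweighted) majority vote over the final population's predictions:

```python
from collections import Counter

def population_vote(predictions):
    # predictions: one list of predicted labels per individual;
    # each test instance receives the most common label
    # across the whole population
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

voted = population_vote([['a', 'b'], ['a', 'a'], ['b', 'b']])  # ['a', 'b']
```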

Limit size of subsequent populations

Once a given tree height is reached in the population, trees must not grow any further.
Prevent the code from behaving like this:

iter: 000 mean: 0.684654 median: 0.679487 max: 0.910256 ET: 291.16sec height: 8 n_nodes: 25 test acc: 0.578947
iter: 001 mean: 0.682795 median: 0.692308 max: 0.910256 ET: 274.74sec height: 8 n_nodes: 25 test acc: 0.578947
iter: 002 mean: 0.690731 median: 0.705128 max: 0.910256 ET: 270.86sec height: 8 n_nodes: 25 test acc: 0.578947
iter: 003 mean: 0.701141 median: 0.717949 max: 0.948718 ET: 275.70sec height: 9 n_nodes: 39 test acc: 0.578947

JSON file for parameters

Create a JSON file for storing parameters of the algorithm. This must be an optional feature, and not the main method of passing parameters to the algorithm.

Fix non-converging evolution

Currently, the evolutionary process exhibits strange behavior:

iter: 000 mean: +0.785000 median: +0.800000 max: +1.000000
iter: 001 mean: +0.596667 median: +0.266667 max: +1.000000
iter: 002 mean: +0.596667 median: +0.266667 max: +1.000000

The mean and median accuracy should never drop as sharply as they currently do.

Make all population contribute

Currently, only the fittest individuals are used to update the probabilities in the GM. Make the whole population contribute, using each individual's fitness to moderate its influence.
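One simple form of this, sketched below: replace the hard truncation with fitness-proportional weights when estimating each node's distribution.

```python
def fitness_weighted_distribution(choices, fitnesses):
    # choices: the attribute each individual picked at a given node;
    # each individual's contribution is proportional to its fitness,
    # so weak trees still influence the GM, just less
    total = float(sum(fitnesses))
    probs = {}
    for choice, fit in zip(choices, fitnesses):
        probs[choice] = probs.get(choice, 0.0) + fit / total
    return probs

dist2 = fitness_weighted_distribution(['a', 'a', 'b'], [1.0, 1.0, 2.0])
```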

Capture metadata

When running an evolutionary procedure, capture some metadata for further comparison, such as running time, model accuracy, the model itself, and other related information.
