henryzord / ardennes


An Estimation of Distribution Algorithm for Decision-Tree Induction.

Python 92.32% C 7.68%
eda classification decision-trees induction optimization optimization-algorithms evolutionary-computation evolutionary-algorithm genetic-algorithm estimation-distribution-algorithm

ardennes's Introduction

👋 Hi!

My name is Henry Cagnini. I have a PhD in Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), which included a six-month internship at the University of Kent. I also earned my Master's degree in Computer Science at PUCRS, and my bachelor's degree in Computer Science at the Federal University of Santa Maria (UFSM).

I was born and raised in Santa Maria, Rio Grande do Sul, Brazil.

💽 What do you do?

I'm currently an IT Analyst at Federal University of Santa Maria (UFSM). I mostly work with business intelligence, data mining, and machine learning.

I was also...

  • A Substitute Teacher at the Industrial Technical College of Santa Maria (CTISM) (High School and Undergraduate);
  • A teacher at various institutions, such as PUCRS, Uniritter, Feevale, TargetTrust (Postgraduate);
  • An intern at Motorola Mobility Inc;
  • An intern at Enovative Design and Technology.

You can check some of my current and previous work at these organizations:

Research interests

My research interests are evolutionary algorithms, machine learning, and the teaching of computer science topics.

🎨 What are your hobbies?

I like to...

  • Play videogames 🎮
  • Travel ✈️
  • Eat at cafés ☕
  • Be with the people I love 💑

๐ŸŒ Social networks

📫 How to reach me

E-mail me: [email protected]

💻 Competences

My competences, sorted from strongest to weakest, are*:

| Status | Programming languages | Markup languages | Technologies and topics |
|---|---|---|---|
| 5⭐ | Python, Java, C | LaTeX, Markdown | Evolutionary Algorithms, Machine Learning, Statistics, Data Visualization |
| 4⭐ | C++ | HTML, reStructuredText | |
| 3⭐ | Assembly, Android, Javascript | CSS | Frontend, Backend, Databases |
| 2⭐ | OpenCL, CUDA, Prolog, Haskell, PHP | | Deep Learning |
| 1⭐ | You're not here for my one-star competences, are you? | | |

* Non-exhaustive list; just the ones off the top of my head.

⚡ Fun fact

Did you know that humanity has already solved one of the seven Millennium Prize Problems? (The Poincaré conjecture, proved by Grigori Perelman.)


ardennes's People

Contributors: henryzord, jwehrmann

Forkers: chrisliu007

ardennes's Issues

Decrease class probability for inner nodes

Currently, the initial distribution over attributes is uniform, regardless of whether an attribute is predictive or the class attribute.

Change the probabilities of the first graphical model: set the probability of sampling the class attribute at inner nodes to zero, so the search starts from a more promising region.
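A minimal sketch of this idea, assuming the initial GM stores one categorical distribution over attribute names per node (the function name and attribute names below are hypothetical):

```python
def initial_attribute_distribution(predictive_attrs, class_attr):
    # Uniform distribution over predictive attributes only;
    # the class attribute gets zero probability at inner nodes,
    # steering the search away from degenerate splits.
    probs = {a: 1.0 / len(predictive_attrs) for a in predictive_attrs}
    probs[class_attr] = 0.0
    return probs

dist = initial_attribute_distribution(['sepal_length', 'sepal_width'], 'species')
```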

Smarter approach for categorical attributes

Currently, a very naïve (and time-consuming) approach is used to set the split for categorical attributes. Replace it with a more robust one, such as the approach used in the CART algorithm.

Optimize networkx.DiGraph nodes

Currently we are storing unnecessary information in each node's dictionary. The 'terminal' flag, for example, can be safely replaced by checking whether the node has successors, e.g. tree.out_degree(x) == 0 (note that in newer networkx versions, successors returns an iterator, so len(tree.successors(x)) would fail).
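A quick sketch of the equivalence, using out_degree, which works across networkx versions:

```python
import networkx as nx

tree = nx.DiGraph()
tree.add_edges_from([(0, 1), (0, 2)])  # node 0 is internal; 1 and 2 are leaves

def is_terminal(tree, node):
    # a node is a leaf iff it has no successors;
    # this replaces a stored 'terminal' flag entirely
    return tree.out_degree(node) == 0
```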

Fix plotting of folds

When running an N-fold cross-validation, in the plotting of final results, Ardennes actually plots the unweighted mean accuracy across folds, instead of the mean weighted by fold size. This is exemplified below:

  • Fold 1: size: 17 accuracy: 0.93
  • Fold 2: size: 17 accuracy: 0.87
  • Fold 3: size: 17 accuracy: 0.95
  • Fold 4: size: 17 accuracy: 0.88
  • Fold 5: size: 21 accuracy: 0.75

WRONG: (0.93 + 0.87 + 0.95 + 0.88 + 0.75) / 5 = 0.876
CORRECT: ((17 * 0.93) + (17 * 0.87) + (17 * 0.95) + (17 * 0.88) + (21 * 0.75)) / 89 = 0.870

Obviously, when all folds have the same size, this phenomenon does not occur.
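A sketch of the size-weighted mean, using the fold sizes and accuracies listed above (the helper name is hypothetical):

```python
def weighted_fold_accuracy(sizes, accuracies):
    # Weight each fold's accuracy by its number of instances,
    # instead of taking the plain mean across folds.
    total = sum(sizes)
    return sum(s * a for s, a in zip(sizes, accuracies)) / total

sizes = [17, 17, 17, 17, 21]
accs = [0.93, 0.87, 0.95, 0.88, 0.75]
acc = weighted_fold_accuracy(sizes, accs)  # ~0.870
```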

Change GM after some time

If it is verified that the mean, median, and maximum fitness have all converged to the same value, i.e.

fitness.mean == fitness.median == fitness.max

then increase the complexity of the graphical model.

One variable per level

Change the GM so that there is only one variable per level, as opposed to the current 2^h variables (where h is the depth of the level, starting from 0).

This should make the generated trees better (regarding their accuracy) and also speed up the sampling procedure.
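A back-of-the-envelope count of GM variables under both schemes (a sketch; it assumes a complete binary tree of height h):

```python
def n_gm_variables(height, one_per_level=False):
    # current scheme: one variable per node position, i.e. 2^d variables
    # at depth d, for a total of 2^(h+1) - 1 in a complete binary tree;
    # proposed scheme: a single variable per level, h + 1 in total
    if one_per_level:
        return height + 1
    return 2 ** (height + 1) - 1
```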

Prune tree

Currently something as counter-intuitive as this is occurring:
[figure_1: example of a generated tree]

Note that there are two kinds of bugs:

  • A predictive node is being coloured as a terminal one;
  • There are child nodes that lead to the same leaf value!

Fix this in order to decrease tree size.

Twin nodes

One way of solving the problem of splitting categorical attributes (and thus preventing the tree from growing too deep when handling multiple splits) is to perform an n-ary split (where n is the number of categories of a given predictive attribute), but set all child nodes to the same node label.

Obviously, the impact of this repetition of attributes on the GM update must be reduced.

Add support to folds and partition

Currently, ardennes only receives the full dataset as input and splits it itself (into folds or partitions). Make it also accept files in a given format that describe predetermined folds/partitions.

Better localization when dealing with GMs

The code currently relies heavily on the loc method of pandas.DataFrame for both the sampling and updating procedures:

grouped = self.weights.copy()  # type: pd.DataFrame
for p in self.parents:
    grouped = grouped.loc[grouped[p] == session[p]]

However, the values in each cell are already known, since they follow a predetermined distribution. Use this distribution to speed up the GM class.
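One way to avoid the repeated boolean filtering is to index the weights table by the parent columns once, turning each session lookup into a single indexed access. A sketch with hypothetical column names:

```python
import pandas as pd

weights = pd.DataFrame({
    'parent_a': ['x', 'x', 'y', 'y'],
    'parent_b': ['u', 'v', 'u', 'v'],
    'prob':     [0.1, 0.2, 0.3, 0.4],
})

# build the MultiIndex once; each lookup then becomes a hash access
# instead of a full boolean scan per parent
indexed = weights.set_index(['parent_a', 'parent_b']).sort_index()

session = {'parent_a': 'y', 'parent_b': 'v'}
row = indexed.loc[(session['parent_a'], session['parent_b'])]
```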

Fix threshold values

Fix threshold values for numerical attributes so that each candidate threshold is the midpoint between two adjacent values.
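The usual fix, sketched below: sort the distinct values and take the midpoints between consecutive pairs as candidate thresholds.

```python
def candidate_thresholds(values):
    # midpoints between consecutive distinct sorted values
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

thresholds = candidate_thresholds([3, 1, 2, 2])  # [1.5, 2.5]
```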

Implement two GMs

We should use two GMs in this EDA: one for initializing the population, and another for the subsequent populations.

The initial GM is currently implemented. It should only be modified to consider an increase in the probability of sampling terminal nodes as the depth of the tree increases.

The other GM, for the subsequent populations, is not yet implemented.
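For the depth-dependent leaf probability in the initial GM, a minimal sketch (the linear schedule below is an assumption, not the repository's actual rule):

```python
def p_terminal(depth, max_depth):
    # probability of sampling a terminal (leaf) node grows with depth,
    # reaching 1.0 at the maximum allowed depth
    return min(1.0, depth / float(max_depth))
```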

Top-down algorithm as heuristic

Use a top-down induction algorithm to predict a maximum depth for the trees in the population.

Guarantee that the top-down algorithm achieves 100% training accuracy, so that its depth serves as an upper bound.

Preserve threshold for nodes

Preserve thresholds for nodes in given positions, given that the parents are the same across different samplings.

For example: if in generation g_0 the threshold for node n_2 is already calculated for attribute a_2, given that its parents are nodes n_0 and n_1 with attributes a_0 and a_1, then this threshold doesn't need to be calculated again when this configuration repeats in any later generation.
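A memoisation sketch of this idea, keyed on the node's attribute plus its parents' (node, attribute) configuration (all names are hypothetical):

```python
threshold_cache = {}
computations = {'count': 0}

def expensive_threshold():
    # stand-in for the real threshold computation
    computations['count'] += 1
    return 2.5

def get_threshold(node_attr, parent_config):
    # parent_config: {parent_node: parent_attribute};
    # sorting makes the key independent of dict ordering
    key = (node_attr, tuple(sorted(parent_config.items())))
    if key not in threshold_cache:
        threshold_cache[key] = expensive_threshold()
    return threshold_cache[key]

t1 = get_threshold('a_2', {'n_0': 'a_0', 'n_1': 'a_1'})
t2 = get_threshold('a_2', {'n_1': 'a_1', 'n_0': 'a_0'})  # same configuration, cached
```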

Improve FinalGM model

Currently, FinalGM only uses univariate distributions. Adopt a bivariate distribution (in which each node's probability depends on the attribute chosen at the parent node) for sampling.

Make whole population vote

Test whether making the whole population vote in the last generation is better than simply using the best tree.
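A sketch of a plain (unweighted) majority vote over the final population's predictions:

```python
from collections import Counter

def population_vote(predictions):
    # predictions: one list of predicted labels per individual;
    # each test instance receives the most common label
    # across the whole population
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

voted = population_vote([['a', 'b'], ['a', 'a'], ['b', 'b']])  # ['a', 'b']
```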

Limit size of subsequent populations

Once a given tree height is reached in the population, trees must not grow any further.
Prevent the code from behaving like this:

iter: 000 mean: 0.684654 median: 0.679487 max: 0.910256 ET: 291.16sec height: 8 n_nodes: 25 test acc: 0.578947
iter: 001 mean: 0.682795 median: 0.692308 max: 0.910256 ET: 274.74sec height: 8 n_nodes: 25 test acc: 0.578947
iter: 002 mean: 0.690731 median: 0.705128 max: 0.910256 ET: 270.86sec height: 8 n_nodes: 25 test acc: 0.578947
iter: 003 mean: 0.701141 median: 0.717949 max: 0.948718 ET: 275.70sec height: 9 n_nodes: 39 test acc: 0.578947

JSON file for parameters

Create a JSON file for storing parameters of the algorithm. This must be an optional feature, and not the main method of passing parameters to the algorithm.

Fix non-converging evolution

Currently, the evolutionary process exhibits strange behavior:

iter: 000 mean: +0.785000 median: +0.800000 max: +1.000000
iter: 001 mean: +0.596667 median: +0.266667 max: +1.000000
iter: 002 mean: +0.596667 median: +0.266667 max: +1.000000

The mean and median accuracy should never drop as sharply as they currently do.

Make all population contribute

Currently, only the fittest individuals are used to update the probabilities in the GM. Make the whole population contribute, using each individual's fitness to moderate its influence.
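One simple form of this, sketched below: replace the hard truncation with fitness-proportional weights when estimating each node's distribution.

```python
def fitness_weighted_distribution(choices, fitnesses):
    # choices: the attribute each individual picked at a given node;
    # each individual's contribution is proportional to its fitness,
    # so weak trees still influence the GM, just less
    total = float(sum(fitnesses))
    probs = {}
    for choice, fit in zip(choices, fitnesses):
        probs[choice] = probs.get(choice, 0.0) + fit / total
    return probs

dist2 = fitness_weighted_distribution(['a', 'a', 'b'], [1.0, 1.0, 2.0])
```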

Capture metadata

When running an evolutionary procedure, capture some metadata for further comparison, such as running time, model accuracy, the model itself, and other related information.
