mast-ml's Issues

Add keyword for input features to read-in from csv using pandas?

Right now the user manually enters the names of the features. We could add a keyword, e.g. "Auto", that is passed to a (future) modified data_parser class, which would use pandas to automatically extract all features from the column names, excluding whichever column is listed as the y_values.
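
A minimal sketch of how such a keyword could behave, assuming the data is already in a pandas DataFrame. The function name resolve_x_features and the exact spelling of the "Auto" keyword are hypothetical and not taken from the MAST-ML code:

```python
import io

import pandas as pd


def resolve_x_features(dataframe, x_features, y_feature):
    """Return the list of input feature names.

    If x_features is the (hypothetical) keyword "Auto", use every column
    except the target column; otherwise return the user's list unchanged.
    """
    if isinstance(x_features, str) and x_features.strip().lower() == "auto":
        return [col for col in dataframe.columns if col != y_feature]
    return list(x_features)


# Small in-memory CSV standing in for a real data file.
csv_text = "a,b,target\n1,2,3\n4,5,6\n"
df = pd.read_csv(io.StringIO(csv_text))

auto_features = resolve_x_features(df, "Auto", "target")
manual_features = resolve_x_features(df, ["a"], "target")
```

An explicit list still takes precedence, so existing input files would keep working unchanged.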

array to dataframe reindexing causes other issues

@rjacobs914
Question for Ryan:
Line 322 in DataParser.py

def _array_to_dataframe(cls, array):
    # The index is explicitly set to start at 1 instead of pandas' default of 0.
    dataframe = pd.DataFrame(data=array, index=range(1, len(array)+1))
    return dataframe

If I read in an array from a CSV, the index defaults to starting at 0.
If I then normalize the dataframe's x_features with FeatureNormalization, the index is set to start at 1 by line 322.
When I then try to recombine the normalized x_features with the original data (which might have other columns in addition to the x_features), the indices no longer line up. Reindexing doesn't help, because index 0 of the normalized data is simply null.

Was there compelling reasoning behind starting the indexing explicitly at 1? Would it work to not set the index to anything and let pandas index automatically?
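
The mismatch can be reproduced outside MAST-ML with a few lines of pandas. This is only an illustration with made-up column names and a toy normalization, not the FeatureNormalization code itself:

```python
import pandas as pd

# Data as read from a CSV: pandas assigns the default index 0, 1, 2.
original = pd.DataFrame({"x": [10.0, 20.0, 30.0], "label": ["a", "b", "c"]})

# Toy normalization that returns a bare array, so the index is lost.
normalized_values = (original[["x"]].to_numpy() - 10.0) / 20.0

# Rebuilding the DataFrame with an explicit 1-based index, as in line 322.
one_based = pd.DataFrame(
    normalized_values,
    columns=["x_norm"],
    index=range(1, len(normalized_values) + 1),
)
# join aligns on index: row 0 gets NaN and every value is shifted by one row.
misaligned = original.join(one_based)

# Letting pandas index automatically (or passing index=original.index)
# keeps the rows aligned with the original data.
default_index = pd.DataFrame(normalized_values, columns=["x_norm"])
aligned = original.join(default_index)
```

Passing index=original.index would be the more robust choice if the original data ever carries a non-default index.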

fix extrapolation plotting

Give user ability to modify colors, choose between line/no line for lots of points, modify legend labels, and plot arbitrary number of test lines (no more topredict vs standard)

favorites or summary directory

A directory where favorite plots are pulled out.
Which plots are favorites could be difficult to pre-determine.
Maybe we would want some kind of input file section for this, or we determine ourselves which plots are "favorite" plots.
The directory should also hold key statistics and a readme file.

Permission error when save path not set as current directory

If save_path is ./"anything" (i.e. not the current directory), I am seeing a permission error when writing MASTMLlog.log:

File "C:\ProgramData\Anaconda2\envs\ML\lib\shutil.py", line 544, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\MASTMLlog.log' -> 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\test_paper_normalization\MASTMLlog.log'

This may be related to the fact that it looks like the code is writing MASTMLlog.log to both the specified save directory and working directory that the code is run from.
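
One possible cause, which is an assumption and not confirmed from the MAST-ML source, is that the log file is still held open by a logging.FileHandler when it is moved; Windows refuses to rename an open file. A sketch of closing the handler first (the helper name move_log_file and the logger name are hypothetical):

```python
import logging
import os
import shutil
import tempfile


def move_log_file(logger, src, dst_dir):
    """Close the logger's file handlers that write to src, then move src into dst_dir."""
    src = os.path.abspath(src)
    for handler in list(logger.handlers):
        is_file_handler = isinstance(handler, logging.FileHandler)
        if is_file_handler and os.path.abspath(handler.baseFilename) == src:
            handler.close()  # release the OS-level file handle
            logger.removeHandler(handler)
    os.makedirs(dst_dir, exist_ok=True)
    return shutil.move(src, os.path.join(dst_dir, os.path.basename(src)))


# Demonstration in a temporary directory.
work_dir = tempfile.mkdtemp()
log_path = os.path.join(work_dir, "MASTMLlog.log")
logger = logging.getLogger("mastml_move_demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler(log_path))
logger.info("run finished")

moved_path = move_log_file(logger, log_path, os.path.join(work_dir, "results"))
```

An alternative would be to create the log directly in the save directory so that no move is needed at all.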

Input file model list

Move list of all models available and input params out of input file and into documentation.

X features, y features

If X features and y feature are not set in the parameters of a particular test, look for them in the Data Setup section; if they are not there either, error out.
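
The fallback could look roughly like the following sketch, which treats both sections as plain dictionaries. The key names "X" and "y" and the function name resolve_features are placeholders, not the actual input file keywords:

```python
def resolve_features(test_params, data_setup, test_name="test"):
    """Look up features in the test's own parameters, then in Data Setup."""
    resolved = {}
    for key in ("X", "y"):  # placeholder key names
        if test_params.get(key):
            resolved[key] = test_params[key]
        elif data_setup.get(key):
            resolved[key] = data_setup[key]
        else:
            raise ValueError(
                f"{test_name}: '{key}' is not set in the test parameters or in Data Setup"
            )
    return resolved


data_setup = {"X": ["a", "b"], "y": "target"}
# The test overrides X only; y falls back to Data Setup.
features = resolve_features({"X": ["a"]}, data_setup, "KFold_CV")
```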

ipython and docker?

Encapsulation ability - look into docker for environment and ipython notebook for display

look at how we are passing information through the workflow

For example, hyperparameter optimizations are not passed through to other tests.
Maybe have a controlling class be able to run hyperparameter test and then subsequent tests, and the controlling class would have optimized parameter info? (which could then be updated with subsequent hyperparameter optimizations)
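
A rough sketch of the controlling-class idea, under the assumption that each test can be called with the current parameters and may report optimized ones back. The class name WorkflowController, the "optimized_params" key, and the two stand-in tests are all invented for illustration:

```python
class WorkflowController:
    """Runs tests in order and carries optimized hyperparameters forward."""

    def __init__(self, model_params=None):
        self.model_params = dict(model_params or {})
        self.results = []

    def run(self, tests):
        for test in tests:
            # Each test receives a copy of the current best-known parameters.
            outcome = test(dict(self.model_params))
            # A later optimization overwrites values from an earlier one.
            self.model_params.update(outcome.get("optimized_params", {}))
            self.results.append(outcome)
        return self.results


def fake_hyperparameter_test(params):
    """Stand-in for a hyperparameter optimization; returns a fixed result."""
    return {"name": "hyperopt", "optimized_params": {"alpha": 0.1}}


def fake_cv_test(params):
    """Stand-in for a later test that simply records the parameters it was given."""
    return {"name": "cv", "params_used": params}


controller = WorkflowController({"alpha": 1.0, "gamma": 1.0})
results = controller.run([fake_hyperparameter_test, fake_cv_test])
```

Here the second test sees alpha = 0.1 from the optimization while gamma keeps its initial value.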

consider class structure for tests

Each test typically requires:
data read-in
*which datasets to use
*which features to use for fitting
*which feature to predict on
data division
*into test/train groups
*also sometimes by additional grouping like by category
*also sometimes filtering out of certain test and/or train data according to separate criteria
fits
*how many fits and predictions may depend on data division
*which model
*which hyperparameters
*sometimes may want additional features to be calculated on the fly, which weren't in the original data
predictions
*how many predictions may depend on data division
statistics collected on predictions
*sometimes validation (RMSE, R2)
*sometimes the testing set has no measured data to validate with
printing
*printing of essential statistics and model outputs
plotting
*can be complex
*often requires specific annotations
*plots of measured versus predicted
*plots of predictions versus some numeric data (not necessarily a fitting data column), split out by groups
*plots with certain data filtered out, even though that data WAS included in training and/or in testing and testing statistics

Tests could inherit from some basic class
Class structure would prevent having to pass a lot of variables.
Class structure may encourage modular programming and uniform extension
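
As a sketch of what such a base class might look like: the shared steps (fit, predict, statistics) live in the base class, and each test only defines how the data is divided. The names BaseTest, LeaveOneOutTest, and MeanModel are hypothetical, and the toy model stands in for a real regressor:

```python
from abc import ABC, abstractmethod


class MeanModel:
    """Toy stand-in model: predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class BaseTest(ABC):
    """Holds the data, features, and model so they need not be passed around."""

    def __init__(self, rows, x_features, y_feature, model):
        self.rows = rows  # list of dicts, one per data point
        self.x_features = x_features
        self.y_feature = y_feature
        self.model = model

    @abstractmethod
    def divide_data(self):
        """Return a list of (train_rows, test_rows) pairs."""

    def run(self):
        """Fit and predict once per data division; return one RMSE per division."""
        rmses = []
        for train, test in self.divide_data():
            self.model.fit(self._x(train), self._y(train))
            predicted = self.model.predict(self._x(test))
            measured = self._y(test)
            squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
            rmses.append((sum(squared) / len(squared)) ** 0.5)
        return rmses

    def _x(self, rows):
        return [[row[name] for name in self.x_features] for row in rows]

    def _y(self, rows):
        return [row[self.y_feature] for row in rows]


class LeaveOneOutTest(BaseTest):
    """Only the data division differs from the base behavior."""

    def divide_data(self):
        return [
            (self.rows[:i] + self.rows[i + 1:], [self.rows[i]])
            for i in range(len(self.rows))
        ]


rows = [{"a": 0.0, "target": 1.0}, {"a": 1.0, "target": 2.0}, {"a": 2.0, "target": 3.0}]
rmses = LeaveOneOutTest(rows, ["a"], "target", MeanModel()).run()
```

Printing and plotting could be added as further overridable methods in the same way.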

dataset dictionary in MASTML

Store the data as a dictionary, with keys given by the csv names or chosen some other way.
Add some additional input parameter keys for each test_case, so that different datasets can be used for different test cases.
But do the X, y feature setup and the data parsing ahead of time in MASTML, the way Ryan has it, for each dataset.

GA

Report the mean and std dev if multiple GA runs are done.
Multiple GA runs would just run in serial, one after another, the way Josh had it; that way MASTML can control them.
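
A minimal sketch of the serial loop with summary statistics. The function run_once stands in for a single GA run and here just returns canned scores; the sample standard deviation is used, which is an assumption about which definition is wanted:

```python
import statistics


def summarize_serial_runs(run_once, n_runs):
    """Run the GA n_runs times in serial and summarize the final scores."""
    scores = [run_once(seed) for seed in range(n_runs)]
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0  # sample std dev
    return {"scores": scores, "mean": statistics.mean(scores), "std": std}


def fake_ga_run(seed):
    """Stand-in for one GA run; returns a made-up best RMSE for each seed."""
    return [0.30, 0.34, 0.32][seed]


summary = summarize_serial_runs(fake_ga_run, 3)
```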

Add check that all test_cases exist in [Test Parameters] before running tests

I've had a few runs where I add a new test or change an existing test name but forget to change the name in test_cases under [models and tests to run]. Currently there doesn't appear to be a check that each test_case has a corresponding entry in [Test Parameters], so the run crashes only when it finally tries to run the test and can't look up the parameters.

Having a check would make this error show up sooner and avoid wasting time running tests when the workflow won't end up completing.
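
Such a check could be as simple as the sketch below, which assumes the parsed input file behaves like a nested dictionary keyed by section name. The function name find_missing_test_cases and the example test names are hypothetical:

```python
def find_missing_test_cases(config):
    """Return the requested test_cases that have no [Test Parameters] entry."""
    requested = config["models and tests to run"]["test_cases"]
    defined = config.get("Test Parameters", {})
    return [name for name in requested if name not in defined]


config = {
    "models and tests to run": {"test_cases": ["KFold_CV", "LeaveOutGroup_CV"]},
    "Test Parameters": {"KFold_CV": {"num_folds": 5}},
}

missing = find_missing_test_cases(config)
if missing:
    # In a real run this would be raised before any test starts.
    error_message = "No [Test Parameters] section for: " + ", ".join(missing)
```

Running this right after the input file is parsed would surface the mistake before any fitting starts.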

doc note multiprocessing and ParamOptGA

Make sure virtual memory (and memory) are large enough for ParamOptGA.

Note that morgan3 apparently works without setting the pvmem tag (setting it actually limits pvmem to 1000 MB, which is not enough), while morgan2 with the pvmem tag gives unlimited pvmem.

Use qstat -f | grep used to see resource usage.
