uw-cmg / mast-ml Goto Github PK



MAterials Simulation Toolkit for Machine Learning (MAST-ML)

License: MIT License

Python 29.99% Jupyter Notebook 70.01%


mast-ml's Issues

Add keyword for input features to read-in from csv using pandas?

Right now the user manually enters the name of each feature. We could add a keyword, e.g. "Auto", that is passed to a (future) modified data_parser class, which would use pandas to extract all features automatically from the column names, excluding whichever column is listed as the y_values.
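A minimal sketch of what the "Auto" behaviour could look like, assuming the data is a CSV with a header row. The function name `auto_select_features`, the `y_feature` argument, and the sample columns are hypothetical and are not part of the existing data_parser class.

```python
import io

import pandas as pd


def auto_select_features(csv_source, y_feature):
    """Read a CSV and treat every column except y_feature as an input feature."""
    df = pd.read_csv(csv_source)
    if y_feature not in df.columns:
        raise KeyError(f"y_feature {y_feature!r} not found in columns {list(df.columns)}")
    # Keep the original column order so downstream output stays predictable
    x_features = [col for col in df.columns if col != y_feature]
    return df, x_features


# Made-up data, for illustration only
example_csv = io.StringIO(
    "temperature,flux,delta_sigma\n"
    "290,1.0e12,35.2\n"
    "300,2.0e12,41.7\n"
)
df, x_features = auto_select_features(example_csv, y_feature="delta_sigma")
print(x_features)
```

Columns that are neither features nor the target (for example group labels) would still need an explicit exclusion list, which this sketch does not handle.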

html summary file, add smaller size pictures

Add small pictures to the html summary file so results can be seen at a glance without having to click on each link. Maybe 1/4 or 1/8 size images, each linking to the full-size version.

add remaining DBTT_skunkworks master code

Add Josh C's code and models from DBTT_skunkworks master, and
Josh P's composition-dependent p value function to GA.
Preserve file history!

refine configobj validation to handle specific datatypes and keywords

(in progress)

LeaveoutGroupCV not writing Group Prediction column in csv files

In the csv files written for each group ("group"_test_data.csv), every file except the first is missing the "Group Prediction" column.

array to dataframe reindexing causes other issues

@rjacobs914
Question for Ryan:
Line 322 in DataParser.py

def _array_to_dataframe(cls, array):
    dataframe = pd.DataFrame(data=array, index=range(1, len(array)+1))
    return dataframe

If I read in an array from a CSV, the index defaults to starting at 0.
If I then normalize the dataframe's x_features with FeatureNormalization, the index is reset to start at 1 by line 322.
When I try to recombine the normalized x_features with the previous array (which might have other columns in addition to the x_features), the rows no longer line up. Reindexing doesn't work, because index 0 of the normalized data is just null.

Was there compelling reasoning behind starting the indexing explicitly at 1? Would it work to not set the index to anything and let pandas index automatically?
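The mismatch and one possible fix can be sketched as follows. This is an illustration under assumptions, not the DataParser code: `array_to_dataframe` is a hypothetical standalone variant that reuses the caller's index (or lets pandas assign a default one starting at 0), and the z-score line is only a stand-in for FeatureNormalization.

```python
import numpy as np
import pandas as pd


def array_to_dataframe(array, index=None, columns=None):
    """Wrap an array in a DataFrame; index=None lets pandas assign 0..n-1."""
    return pd.DataFrame(data=array, index=index, columns=columns)


# Made-up data, for illustration only
original = pd.DataFrame(
    {"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 30.0], "label": ["a", "b", "c"]}
)
x_cols = ["x1", "x2"]

# Stand-in for FeatureNormalization: column-wise z-score on the raw values
values = original[x_cols].to_numpy()
normalized_values = (values - values.mean(axis=0)) / values.std(axis=0)

# Reusing the original index keeps the rows aligned when recombining
normalized = array_to_dataframe(normalized_values, index=original.index, columns=x_cols)
recombined = pd.concat([normalized, original.drop(columns=x_cols)], axis=1)
print(recombined)
```

With an index that starts at 1 instead, `pd.concat(..., axis=1)` aligns on labels and produces a row of nulls at each end, which matches the symptom described above.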

fix extrapolation plotting

Give user ability to modify colors, choose between line/no line for lots of points, modify legend labels, and plot arbitrary number of test lines (no more topredict vs standard)

write in switch to automatically normalize input features

Add LO% to GA

remove execute functions

Just use classes (init, and run)

timex broken for PlotNoAnalysis

Look at compatibility of matminer (python 2.7) vs MASTML (python 3+)

Compatibility, whether to copy/modify their framework

generalize eony e900 columns

favorites or summary directory

Where favorite plots are pulled out
Could be difficult to pre-determine;
Maybe would want some kind of input file section, or we determine which plots are "favorite" plots
also key statistics and readme file

add ability to fit and predict on composition-dependent newly created features

add ability for new features on the fly that aren't part of the dataset, for example, creating the optimized composition-dependent p for each composition.

Permission error when save path not set as current directory

if save_path is ./"anything" I am seeing a permission error when writing MASTMLlog.log

File "C:\ProgramData\Anaconda2\envs\ML\lib\shutil.py", line 544, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\MASTMLlog.log' -> 'C:\Users\Ben\Documents\GitHub\mast-ml-data\DATA_solute_diffusion_class_update\test_paper_normalization\MASTMLlog.log'

This may be related to the fact that it looks like the code is writing MASTMLlog.log to both the specified save directory and working directory that the code is run from.

creating group directories with numbers as the name is throwing errors on windows

The specific error is in os.path.join(X, Y), where Y is read from the csv. The error says it expects a string and is getting an int64 in my case.
The error was coming from fullfit.py in singlefit.
It may be as simple as converting the int to a string before passing it to os.path.join?
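A minimal sketch of the suggested fix, assuming the group label can simply be converted with `str()` before the path is built. The helper name `group_directory` is hypothetical and does not exist in fullfit.py.

```python
import os


def group_directory(save_path, group_value):
    """Build a per-group directory path from a label of any type."""
    # str() handles int, numpy.int64, float and str labels alike
    return os.path.join(save_path, str(group_value))


print(group_directory("results", 3))
```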

code documentation

pull docstrings
see structopt?

Add jupyter notebook for each plot so can customize plots

meta-readme at top level as html

Link into each test directory?

add R2

Document output for LeaveOutGroup not writing predicted values for each of the different groups

allow precision of printing statistics to change

Currently RMSEs and other statistics are printed with different precisions in different places, all hardcoded.
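One way to centralize this is a single formatting helper that takes the precision as a parameter. The sketch below is an assumption about how that might look; `format_statistic` and its default of 3 decimal places are hypothetical.

```python
def format_statistic(name, value, precision=3):
    """Format one statistic with a caller-chosen number of decimal places."""
    return f"{name}: {value:.{precision}f}"


print(format_statistic("RMSE", 0.123456))
print(format_statistic("RMSE", 0.123456, precision=2))
```

The precision could then be read once from the input file and passed to every routine that prints statistics.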

Add readme files for contents of test folders

Write in to the rest of tests. template and fullfit done.
dad734e
dc6b0c5
038d583

allow change of logger level

Output comments or not; verbosity level - allow logger to print out more or less data
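A possible shape for this, using the standard `logging` module. The function name `configure_logger`, the logger name "MASTML", and the idea of passing the level as a string from the input file are assumptions for illustration.

```python
import logging


def configure_logger(verbosity="INFO", name="MASTML"):
    """Set the logger level from a string such as 'DEBUG', 'INFO' or 'WARNING'."""
    level = getattr(logging, str(verbosity).upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown logging level: {verbosity!r}")
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


logger = configure_logger("debug")
logger.debug("Shown only when the level is DEBUG or lower")
```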

Input file model list

Move list of all models available and input params out of input file and into documentation.

Use logger (.log) file to provide summary of each routine

dbtt - composition dependent features for optimization

add to GA for optimization: composition-dependent p, composition-dependent ref flux

dbtt - composition dependent p

don't lose this feature in the new code

move dbtt specific to other classes and folders

make dbtt specific class on GA class?
move dbtt specific models like eony
move dbtt specific columns and conversions

clean up dependencies

some dependencies left over from MAST workflow; some not used (deap?)
add pandas

X features, y features

If X features and the y feature are not set in the parameters for a particular test, look for them in Data Setup; if they are not there either, error out.
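The lookup order described above can be sketched with plain dictionaries standing in for the parsed input file. The key names `X_features` and `y_feature` and the function `resolve_features` are hypothetical.

```python
def resolve_features(test_params, data_setup):
    """Prefer per-test settings, fall back to Data Setup, otherwise raise."""
    resolved = {}
    for key in ("X_features", "y_feature"):
        if key in test_params:
            resolved[key] = test_params[key]
        elif key in data_setup:
            resolved[key] = data_setup[key]
        else:
            raise KeyError(f"{key} is set neither in the test parameters nor in Data Setup")
    return resolved


# Made-up settings, for illustration only
data_setup = {"X_features": ["temperature", "flux"], "y_feature": "delta_sigma"}
test_params = {"y_feature": "hardness"}
print(resolve_features(test_params, data_setup))
```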

readme for each test's csv and png output

Code should create readme for each test folder

Add hyperopt

Probably serial for now

save path currently changes between savepath and save_path in different files

This is causing issues correctly forwarding the save path around. Not sure if this is just in flux with what someone is changing, but revisit to make sure it gets cleaned up once major changes settle down

ipython and docker?

Encapsulation ability - look into docker for environment and ipython notebook for display

Enable storage and updates of previous models

From meeting with Dane 5-25-2017

Idea is being able to have an existing model be updated with new data and persist through user sessions

Check Matminer Compatibility with Python 3

Ben - check compatibility and make adjustments

Grid search heatmap for GKRR currently not useful as lower values all get washed out

look at how we are passing information through the workflow

For example, hyperparameter optimizations are not passed through to other tests.
Maybe have a controlling class be able to run hyperparameter test and then subsequent tests, and the controlling class would have optimized parameter info? (which could then be updated with subsequent hyperparameter optimizations)

consider class structure for tests

Each test typically requires:
data read-in
*which datasets to use
*which features to use for fitting
*which feature to predict on
data division
*into test/train groups
*also sometimes by additional grouping like by category
*also sometimes filtering out of certain test and/or train data according to separate criteria
fits
*how many fits and predictions may depend on data division
*which model
*which hyperparameters
*sometimes may want additional features to be calculated on the fly, which weren't in the original data
predictions
*how many predictions may depend on data division
statistics collected on predictions
*sometimes validation (RMSE, R2)
*sometimes the testing set has no measured data to validate with
printing
*printing of essential statistics and model outputs
plotting
*can be complex
*often requires specific annotations
*plots of measured versus predicted
*plots of predictions versus some numeric data (not necessarily a fitting data column), split out by groups
*plots with certain data filtered out, even though that data WAS included in training and/or in testing and testing statistics

Tests could inherit from some basic class
Class structure would prevent having to pass a lot of variables.
Class structure may encourage modular programming and uniform extension
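As a rough sketch of the idea, a base class could hold the shared data and feature settings and define the run sequence, while each test overrides only the steps that differ. All names here (`BaseTest`, `split`, `fit_predict`, `LeaveOneOutMeanTest`) are hypothetical, and the mean predictor is a stand-in for a real model.

```python
from abc import ABC, abstractmethod


class BaseTest(ABC):
    """Holds shared state so individual steps need few arguments."""

    def __init__(self, data, x_features, y_feature, save_path="."):
        self.data = data
        self.x_features = x_features
        self.y_feature = y_feature
        self.save_path = save_path

    @abstractmethod
    def split(self):
        """Yield (train, test) pairs."""

    @abstractmethod
    def fit_predict(self, train, test):
        """Fit on train and return (measured, predicted) pairs for test."""

    def run(self):
        # One fit and prediction per data division
        return [self.fit_predict(train, test) for train, test in self.split()]


class LeaveOneOutMeanTest(BaseTest):
    """Toy test: predict each held-out row with the mean of the others."""

    def split(self):
        for i in range(len(self.data)):
            yield self.data[:i] + self.data[i + 1:], [self.data[i]]

    def fit_predict(self, train, test):
        mean = sum(row[self.y_feature] for row in train) / len(train)
        return [(row[self.y_feature], mean) for row in test]


# Made-up data, for illustration only
rows = [{"x": 1, "y": 2.0}, {"x": 2, "y": 4.0}, {"x": 3, "y": 6.0}]
results = LeaveOneOutMeanTest(rows, ["x"], "y").run()
print(results)
```

Statistics, printing, and plotting could be added as further overridable methods called from `run`.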

Consider configobj module for input file

configobj may be worth using over built-in ConfigParser. More info here: https://pypi.python.org/pypi/configobj/5.0.6

dataset dictionary in MASTML

data as a dictionary, with keys as csv names or some other way
some additional input parameter keys for each test_case in order to allow different sets to be used for different tests cases;
but do the X, y feature setup ahead of time, and the data parsing, in MASTML the way Ryan has it, for each dataset.

check matplotlib on cluster

May need matplotlib.use('Agg') or similar if no display.

remove references to lwr_data_path

GA

mean and std dev if do multiples
Multiple GA runs just execute in serial, one after another, the way Josh had it; that way MASTML can control them.

Add check that all test_cases exist in [Test Parameters] before running tests

I've had a few runs where I added a new test or changed an existing test name but forgot to change the name in test_cases under [models and tests to run]. Currently there doesn't appear to be a check that each test_case has a corresponding entry in [Test Parameters], so the run crashes only when it finally tries to run the test and can't look up the parameters.

Having a check would make this error show up sooner and prevent wasting time running tests when the workflow won't end up completing
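Such a check could run right after the input file is parsed. The sketch below uses plain Python containers in place of the parsed configuration; the function names and the example test names are hypothetical.

```python
def missing_test_parameters(test_cases, test_parameters):
    """Return the test cases that have no matching [Test Parameters] entry."""
    return [case for case in test_cases if case not in test_parameters]


def validate_test_cases(test_cases, test_parameters):
    """Fail early, before any test is run, if a test case has no parameters."""
    missing = missing_test_parameters(test_cases, test_parameters)
    if missing:
        raise ValueError(f"No [Test Parameters] section for: {', '.join(missing)}")


# Made-up configuration, for illustration only
test_cases = ["SingleFit", "KFoldCV_5fold"]
test_parameters = {"SingleFit": {"xlabel": "Measured"}}
print(missing_test_parameters(test_cases, test_parameters))
```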

qstat -f | grep used to see resource usage
