Input data: https://bit.ly/2YfxTAi
Claat: https://codelabs-preview.appspot.com/?file_id=17iO3pKXC5vyRMQFh5PLCRtJOvvmEDHGLeW02zzXlzBw#0
Task 1:
-Irs the professor is going to need the whole data set for the study. The professor would like to know the trend and the factors affecting the interest rate of the loan as the students have more data(features) to explore. This strengthens the point that in the industry most of the efforts goes in improving and understanding data. It is more important to have more data when compared to complex algorithms.
-The data had many colors which were empty or had values which can not be used.
-Analysis of the graphs generated from the lending club data. The graphs have been studied keeping in mind the need of the professor.
Task 2:
The data was cleaned using panda library. The columns with more than 80% of empty data were dropped. The columns with functions not related to prediction were dropped too. Converting the grade from string to numeric value. Changing the type of data and filling the empty cells with the median of the columns.
Used manual feature engineering and auto feature tools (https://github.com/featuretools/featuretools/ )
In manual feature tools, we could change the things according to our needs while in auto we are restricted
Task 3:
Building 3 manual ML models:
-Linear regression
-Random Forest
-Neural Network
We have taken a MAPE of the output to choose the most reliable model. Random Forest has the least MAPE.
Task 4: Hypertuning
Using Hypertuning to increase the efficiency.
a. Regression: Try L1, L2, Elasticnet regularization
b. Neural networks: Change epochs, optimizers, the learning rate
c. Random forest: No of trees, Tree depth
. https://scikitlearn.org/stable/modules/grid_search.html
Libraries Used:
-
MLPRegressor
-
numpy
-
pandas
-
tensorflow
-
StandardScaler
-
Building 3 AutoML models:
-
TPOT
-
AutoSklearn
-
H2O.ai
-
LinearRegression
-
RidgeCV
-
LassoCV
-
ElasticNet
-
RandomizedSearchCV
-
KFold
-
learning_curve, GridSearchCV
-
train_test_split
Using AutoML makes it more Interactive and Interpretable. AutoML makes it easy to change the input files without many changes.
It can be used on any input data to get the desired output
Task 5:
Building test cases to check our Random Forest model and Analysing the output.
Citations:
-
James Max Kanter, Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. IEEE DSAA 2015. [https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf]
-
https://cmdlinetips.com/2018/01/7-tips-to-read-a-csv-file-as-pandas-data-frame/
-
Team 4 (Presentation given in the class)