Data analysis part of a big data project.
The project involves analyzing data from an insurance agency to help it automatically predict the risks associated with opening new customer accounts.
The project as a whole uses the following tools:
- Hortonworks Sandbox by Cloudera
- Amazon AWS
- Jupyter Notebook
- Python 3.7
- MongoDB
EDA.ipynb
analyzes the data distribution.
data processing and modeling.ipynb
processes and models the data, reaching a baseline accuracy of 64% on the full training set.
bigdata_project_final.ipynb
covers processing and modeling and additionally makes the predictions.
big_data_project.py
is a notebook exported to .py format so that it can be run directly with Python on the EC2 instance.
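
As a rough illustration of the notebook-to-script conversion, the code cells of a notebook can be extracted with the standard library alone. This is a minimal sketch and not necessarily how big_data_project.py was produced: the helper name `notebook_to_script` and the file names in the usage note are hypothetical. In practice, `jupyter nbconvert --to script <notebook>` does the same job more thoroughly.

```python
import json
from pathlib import Path


def notebook_to_script(ipynb_path, py_path):
    """Write the code cells of a Jupyter notebook to a plain .py file.

    Hypothetical helper for illustration only.
    Returns the number of code cells written.
    """
    # A .ipynb file is JSON; in nbformat 4 the cells sit under the "cells" key.
    notebook = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))

    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells

        source = cell.get("source", "")
        # The source is stored either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)

        # Assumption: lines starting with % or ! are IPython magics or shell
        # escapes. They are not valid Python, so they are commented out.
        lines = [
            "# " + line if line.lstrip().startswith(("%", "!")) else line
            for line in source.splitlines()
        ]
        chunks.append("\n".join(lines))

    # Separate cells with a blank line and end the file with a newline.
    Path(py_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)
```

With a notebook named example.ipynb (a placeholder), calling `notebook_to_script("example.ipynb", "example.py")` writes the script, which can then be started on the instance with `python example.py`.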
- Xuanlong YU
- Yueming YANG
- Chaymae El Abbadi