Forest Fires Regression

Introduction

Forest fires have somewhat of a major impact on the environment. That's why having a model that predicts the burned area of a forest in advance will help prevents such terrible impacts. These predictions shall trigger the appropriate level of precautions to maintain or even prevent them from happening.

This regression project uses the forest fires dataset from UCI Machine Learning Repository. This project is aimed to analyse the dataset and create regression models based on the best R2 scores. Throughout the three notebooks of the project, we will implement data cleansing, transformation, and modeling as appoved in the literature. See References at the end of this README file to find out more.

Dataset Description

Attributes

Features

Spatial Attributes (S)

X : x-axis coordinate (from 1 to 9)
Y : y-axis coordinate (from 1 to 9)

Temporal Attributes (T)

month : Month of the year (January to December)
day : Day of the week (Monday to Sunday)

Fire Weather Index Attributes (FWI)

FFMC : Fine Fuel Moisture Code
DMC : Duff Moisture Code
DC : Drought Code
ISI : Initial Spread Index

Weather/Meteorological Attributes (M)

temp : Outside temperature (in Celsius)
RH : Outside relative humidity (in percentage)
wind : Outside wind speed (in kilometer per hour)
rain : Outside rain (in millimeter per square meter)

Target Variable

area : Total burned area (in ha)

Link to the Dataset

UCI ML Repository, Forest Fires Data Set

Notebook 01 Summary

In this notebook we have explored different regression models trying to come up with good R2 scores but the data distributions and coorelations among features themselves and with the target vaiable are not satisfying the linear regression models.
Also, as suggested from literature, we have tried using different subsets of the full data and tried to regress on only the nonzero observations but still no good results show up.
For the sake of finding different interactions among data features, we tried different techniques including adding polynomial and spline features. Yet, nothing produced good results but the spline transform proved to have good significance on the R2 score. For that reason, we only included the spline transformations and obmitted the polynomial transformations.
In addition, we defined a function to print out the maximum R2 score when selecting subset features of the full data attributes scoringfn. we used two techniques: the first was using different combinations using the combinations class from the itertools library, and the second was using automatic feature selection from feature_selection module. When using automatic feature selection, we used only the RFE (recursive feature elemination) class for having the best selection process.
Lastly, we reied different transormation techniques on the features including boxcox transformation, but it gave errors. So, we kept only the np.sin and np.log1p.

Notebook 02 Summary

In this notebook we used tree based regression models to fit the data. Of course, after somewhat overfitting the data, we managed to produce better R2 scores. The best three models came out to be DecisionTreeRegressor, ExtraTreeRegressor, and ExtraTreesRegressor. The R2 scores were respectively as follows: 0.996112, 0.996112, 0.996112. For the sake of producing these results without repeating any steps, we produced a python class to fit and print the scores on the training and testing datasets Evaluate_Model.

Notebook 03 Summary

In this notebook we used the deep learning approach to predict the burned area. We have applied different Sequential models that have different layers from the tensorflow software using keras library. The results weren't that much perfect but loss and val_loss values converged to their minimum. We also defined some utility_functions to produce the results and plot the propagation of loss and val_loss values.

e-hossam96 / forest-fires-regression Goto Github PK