There are six CSV files: 'train.csv', 'building_metadata.csv', 'weather_test.csv', 'weather_train.csv', 'test.csv', 'sample_submission.csv'.
-
train.csv
building_id - Foreign key for the building metadata.
meter - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
timestamp - When the measurement was taken.
meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
-
building_metadata.csv
site_id - Foreign key for the weather files.
building_id - Foreign key for train.csv.
primary_use - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
-
weather_train.csv / weather_test.csv
site_id
air_temperature - Degrees Celsius
cloud_coverage - Portion of the sky covered in clouds, in oktas
dew_temperature - Degrees Celsius
-
test.csv
The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.
row_id - Row id for your submission file
building_id - Building id code
meter - The meter id code
timestamp - Timestamps for the test data period
-
sample_submission.csv
A valid sample submission.
-
Convert the timestamp columns to datetime using the format '%Y-%m-%d %H:%M:%S' so values can be compared and merged.
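A minimal sketch of the conversion with pandas (file paths are assumed; only two of the frames are shown):

```python
import pandas as pd

# Load the raw files (paths assumed) and parse the timestamp strings into datetime64,
# so rows can be sorted, compared, and merged on time.
train = pd.read_csv("train.csv")
weather_train = pd.read_csv("weather_train.csv")

for df in (train, weather_train):
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")
```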
-
Reduce memory usage by downcasting int columns to int16 and float columns to float16.
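A sketch of the downcasting helper (a simplified version of the common reduce_mem_usage pattern; it assumes the integer columns fit into int16):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns: int -> int16 where the values fit, float -> float16."""
    for col in df.columns:
        dtype = df[col].dtype
        if np.issubdtype(dtype, np.integer):
            if df[col].min() >= np.iinfo(np.int16).min and df[col].max() <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
        elif np.issubdtype(dtype, np.floating):
            # float16 trades some precision for roughly a 4x memory saving vs. float64
            df[col] = df[col].astype(np.float16)
    return df
```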
-
The target variable 'meter_reading' has a right-skewed distribution, so a log transform is used to normalize it.
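For example (reusing the train frame from the snippet above), the model is trained on the log scale and predictions are inverted with expm1:

```python
import numpy as np

# log1p compresses the long right tail; RMSE is then computed on the log scale.
train["meter_reading"] = np.log1p(train["meter_reading"])
# At prediction time: submission["meter_reading"] = np.expm1(model_predictions)
```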
-
Merge the CSVs on building_id, site_id, and timestamp.
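A sketch of the merge, reusing train and weather_train (with parsed timestamps) from above and loading the building metadata:

```python
import pandas as pd

building = pd.read_csv("building_metadata.csv")

# train -> building metadata on building_id, then -> weather on (site_id, timestamp)
merged = (train
          .merge(building, on="building_id", how="left")
          .merge(weather_train, on=["site_id", "timestamp"], how="left"))
```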
- Check the factors (meter and meter_reading) in train.csv, and compare a power transform vs. a log transform of meter_reading.
- Check the factors in building_metadata.csv: ('year_built', 'floor_count', 'primary_use', 'square_feet').
- Compare weather_train.csv and weather_test.csv across the weather factors: ('air_temperature', 'cloud_coverage', 'dew_temperature', 'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction', 'wind_speed').
- Get the correlation heatmap for all potential factors (a short EDA sketch follows the plot list below).
- Check how meter_reading changes over different timescales for different buildings:
(The figures are too large to show here.)
-
A random building from certain primary_use types over the whole period: 2016
-
A random building from certain primary_use types in a single month: August
-
A random building from certain primary_use types on a single day: 10-01
-
Mean meter_reading for certain primary_use types over the whole period: 2016
-
Mean meter_reading for certain primary_use types in a single month: August
-
Mean meter_reading for certain primary_use types on a single day: 10-01
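A short EDA sketch covering the distribution check, the train/test weather comparison, and the correlation heatmap (plot settings are illustrative; train, weather_train, and merged come from the sketches above):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Right-tailed target before vs. after log1p.
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].hist(train["meter_reading"], bins=100)
axes[1].hist(np.log1p(train["meter_reading"]), bins=100)

# Train vs. test distribution of one weather factor.
weather_test = pd.read_csv("weather_test.csv")
plt.figure()
sns.kdeplot(weather_train["air_temperature"].dropna(), label="train")
sns.kdeplot(weather_test["air_temperature"].dropna(), label="test")
plt.legend()

# Correlation heatmap over the numeric candidate factors.
plt.figure(figsize=(10, 8))
sns.heatmap(merged.select_dtypes("number").corr(), cmap="coolwarm", center=0)
plt.show()
```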
- Scraping the website (https://www.oeis.ucf.edu/buildings).
-
weather_(train/test).csv cleaning:
(Not recommended based on the RMSE and leaderboard score)
- Use three different kinds of interpolation to fill in the missing data (polynomial, linear, and a mixture of both).
(Best filling method based on the RMSE and leaderboard score)
- Group the weather by day, month, and site_id and fill the missing values with the group mean, since the weather does not change strongly within a single day and air_temperature has only a small number of missing values (sketch below).
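A sketch of the group-mean fill (the same idea applies to weather_test.csv; the interpolation variants mentioned above are not shown):

```python
import pandas as pd

weather_train = pd.read_csv("weather_train.csv")
weather_train["timestamp"] = pd.to_datetime(weather_train["timestamp"])
weather_train["month"] = weather_train["timestamp"].dt.month
weather_train["day"] = weather_train["timestamp"].dt.day

fill_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr",
             "sea_level_pressure", "wind_direction", "wind_speed"]

# Fill each gap with the mean of its (site_id, month, day) group: weather changes
# little within one day, so the daily site mean is a reasonable stand-in.
group_means = weather_train.groupby(["site_id", "month", "day"])[fill_cols].transform("mean")
weather_train[fill_cols] = weather_train[fill_cols].fillna(group_means)
```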
-
Drop the outliers
(a key part of reducing the RMSE and leaderboard score)
-
141 days of electricity readings for site 0: most would be covered by the previous filters, but there are a few stray non-zero values that we ignore because they don't fit the overall pattern.
-
Abnormally high readings from building 1099: These values are just absurdly high and don't fit an established pattern.
https://www.kaggle.com/kernels/scriptcontent/24052407/download
-
Runs of more than 48 hours of zero readings that do not occur during the typical seasons.
-
There is no reason for a building to ever have zero electricity usage.
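A combined sketch of these outlier rules. It assumes meter 0 is electricity (per the data description), a cutoff date of 2016-05-21 for the 141-day rule, and that building 1099's bad readings are on meter 2; all three specifics should be checked against the data, and the seasonal exception for the 48-hour rule is not implemented here:

```python
import pandas as pd

train = pd.read_csv("train.csv")
building = pd.read_csv("building_metadata.csv")
train["timestamp"] = pd.to_datetime(train["timestamp"])
train = train.merge(building[["building_id", "site_id"]], on="building_id", how="left")
train = train.sort_values(["building_id", "meter", "timestamp"]).reset_index(drop=True)

def long_zero_run(s: pd.Series, min_len: int = 48) -> pd.Series:
    """Flag readings that sit inside a run of at least min_len consecutive zeros."""
    is_zero = s.eq(0)
    run_id = (~is_zero).cumsum()                      # constant within each zero run
    run_len = is_zero.groupby(run_id).transform("sum")
    return is_zero & (run_len >= min_len)

# First ~141 days of site 0 electricity readings (cutoff date assumed).
bad_site0 = (train["site_id"] == 0) & (train["meter"] == 0) & (train["timestamp"] < "2016-05-21")
# Building 1099's absurdly high readings (affected meter assumed).
bad_1099 = (train["building_id"] == 1099) & (train["meter"] == 2)
# Zero electricity readings, plus long zero runs for any meter.
bad_zero_elec = (train["meter"] == 0) & (train["meter_reading"] == 0)
bad_zero_runs = (train.groupby(["building_id", "meter"])["meter_reading"]
                      .transform(long_zero_run).astype(bool))

train = train[~(bad_site0 | bad_1099 | bad_zero_elec | bad_zero_runs)].copy()
```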
- feature engineering
-
The units of site 0, meter 1 readings need to be converted, and converted back before submission.
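A hedged sketch of the unit fix, reusing the merged train/building frame from above. It assumes the affected readings are stored in kBTU and need to be in kWh for training; the 0.2931 conversion factor is an assumption:

```python
KBTU_TO_KWH = 0.2931  # assumed conversion factor (kBTU -> kWh)

site0_mask = (train["site_id"] == 0) & (train["meter"] == 1)
train.loc[site0_mask, "meter_reading"] *= KBTU_TO_KWH

# Before writing the submission, convert the corresponding test predictions back:
# submission.loc[test_site0_mask, "meter_reading"] /= KBTU_TO_KWH
```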
-
Add month, hour, weekday, relative_humidity, and feels_like features (sketch below).
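A sketch of the added features, assuming the merged frame from above. Relative humidity is derived from air and dew-point temperature via the Magnus approximation; feels_like (which typically combines temperature, humidity, and wind) is only noted in a comment:

```python
import numpy as np

merged["month"] = merged["timestamp"].dt.month
merged["hour"] = merged["timestamp"].dt.hour
merged["weekday"] = merged["timestamp"].dt.dayofweek
merged["weekend"] = (merged["weekday"] >= 5).astype(np.int8)

# Relative humidity (%) from air temperature and dew point, Magnus approximation.
t, td = merged["air_temperature"], merged["dew_temperature"]
merged["relative_humidity"] = 100.0 * (np.exp(17.625 * td / (243.04 + td)) /
                                       np.exp(17.625 * t / (243.04 + t)))

# feels_like: e.g. a heat-index / wind-chill style formula or a helper library; omitted here.
```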
-
Take the logarithm of meter_reading and square_feet.
-
Encode the categorical column primary_use.
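For example, encoding primary_use once on the building metadata so that train and test share the same mapping:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

building = pd.read_csv("building_metadata.csv")
# Label-encode before merging so every split sees identical integer codes.
building["primary_use"] = LabelEncoder().fit_transform(building["primary_use"].astype(str))
```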
-
Categorical features: building_id, site_id, primary_use, meter, dayofweek (chosen by RMSE and leaderboard score).
- feature usage
Two feature sets were used:
-
Initial-cleanup features: building_id, site_id, air_temperature, dew_temperature, hour, square_feet, cloud_coverage, meter, weekend, precip_depth_1_hr, and primary_use.
-
In the final (best) cleanup data, the following features are added: sea_level_pressure, relative_humidity, wind_direction, feels_like, year_built, floor_count.
-
Light GBM
LightGBM is a gradient boosting framework that uses tree based learning algorithms.
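A minimal sketch of the LightGBM setup used here (hyperparameters are illustrative, not the tuned values; the merged, cleaned frame and the feature lists come from the sections above):

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

features = ["building_id", "site_id", "air_temperature", "dew_temperature", "hour",
            "square_feet", "cloud_coverage", "meter", "weekend", "precip_depth_1_hr",
            "primary_use"]
categorical = ["building_id", "site_id", "primary_use", "meter"]

X = merged[features]
y = np.log1p(merged["meter_reading"])   # train on the log scale

params = {"objective": "regression", "metric": "rmse",
          "learning_rate": 0.05, "num_leaves": 1024, "feature_fraction": 0.8}

models = []
for train_idx, valid_idx in KFold(n_splits=3, shuffle=True, random_state=42).split(X):
    dtrain = lgb.Dataset(X.iloc[train_idx], y.iloc[train_idx], categorical_feature=categorical)
    dvalid = lgb.Dataset(X.iloc[valid_idx], y.iloc[valid_idx], categorical_feature=categorical)
    model = lgb.train(params, dtrain, num_boost_round=1000, valid_sets=[dvalid],
                      callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])
    models.append(model)

# Test predictions are averaged across folds and inverted with np.expm1 before submission.
```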
-
initial cleanup without site_0 change, kfold = 3, lb is 1.086
-
initial cleanup with initial features, kfold = 3, lb is 1.089
-
initial cleanup with initial features, kfold = 2, lb is 1.089
-
final cleanup with initial features, kfold = 3, lb is 1.086
-
final cleanup with initial features, kfold = 2, lb is 1.092
-
final cleanup with final features, kfold = 3, lb is 1.082
-
final cleanup with final features, kfold = 5, lb is 1.074
-
final cleanup with final features plus month, kfold = 5, lb is 1.078
-
XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
-
initial cleanup with initial features, kfold = 3, lb is 1.140
-
final cleanup with initial features with bayes opt, kfold = 3, lb is 1.141
-
final cleanup with final features, kfold = 5, lb is 1.084
-
final cleanup with final features plus month, kfold = 5, lb is 1.087
-
CatBoost
CatBoost is a high-performance open source library for gradient boosting on decision trees.
-
initial cleanup with initial features, kfold = 3, lb is 1.100
-
final cleanup with initial features with bayes opt, kfold = 3, lb is 1.099 (held back by the limited number of iterations run)
-
final cleanup with initial features, kfold = 3, lb is 1.082
-
final cleanup with final features, kfold = 5, lb is 1.074
-
Neural network
Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns.
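A minimal Keras sketch of such a network (layer sizes, dropout, and epochs are illustrative; X and y are the scaled feature matrix and log1p target from above, and categorical columns would normally go through embeddings, which are omitted here):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),                     # predicts log1p(meter_reading)
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
model.fit(X.values.astype("float32"), y.values,
          batch_size=1024, epochs=5, validation_split=0.2)
```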
-
initial cleanup with initial features, lb is 1.138
-
final cleanup with initial features, lb is 1.100
Six submissions were used to fit the leaked data for the final blend: 3 LightGBM, 1 NN, 1 CatBoost, 1 XGBoost.
-
Gradient-based ensembling with TensorFlow to fit the leaked data (not a good approach, because the weights can become negative); lb = 0.966 with the leaked data.
-
Lists of six random floats in [0.0, 0.5] were generated to weight the submissions against the leaked data (did not fit well); lb = 0.967.
-
PSO (particle swarm optimization) worked best: best lb = 0.946 (top 2%, 62/3669). A sketch follows.
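A minimal, hand-rolled PSO sketch for finding the blend weights against the leaked data. The swarm size, inertia, and cognitive/social coefficients are assumptions; preds stacks the six submissions' log1p predictions on the leaked rows and target is log1p of the leaked readings:

```python
import numpy as np

def pso_weights(preds: np.ndarray, target: np.ndarray,
                n_particles: int = 50, n_iter: int = 200, seed: int = 0) -> np.ndarray:
    """Particle swarm search for per-submission weights in [0, 1] minimising RMSE vs. the leak."""
    rng = np.random.default_rng(seed)
    n_models = preds.shape[0]

    def rmse(w):
        return np.sqrt(np.mean((w @ preds - target) ** 2))

    pos = rng.uniform(0.0, 0.5, size=(n_particles, n_models))    # candidate weight vectors
    vel = np.zeros_like(pos)
    pbest, pbest_cost = pos.copy(), np.array([rmse(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()

    inertia, c1, c2 = 0.7, 1.5, 1.5
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)                        # weights stay non-negative
        cost = np.array([rmse(p) for p in pos])
        improved = cost < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    return gbest

# weights = pso_weights(preds_on_leak, leak_target)
# final_blend = weights @ all_test_preds   # optionally renormalise the weights to sum to 1
```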
-
Averaging all six of these submissions gives top 1% (8th/3595).