To-dos to settle down

Create a Kaggle team under your own account and name it after your DSG team.
Decide which method to use: deep learning or another approach?
Decide which framework to use, e.g. TensorFlow, Theano, etc.
Do exploratory data analysis on the dataset. Some examples are at https://www.kaggle.com/kernels

Shinho

Create 10 samples in CSV format.
Write up the options for using listen_type.
Refactor the preprocessing modules into functions.
Add another column with a binary weekday/weekend flag.
Correct the release date preprocessing.
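
The weekday/weekend column could be derived along these lines. This is only a sketch: it assumes the flag is computed from the listening timestamp, that the timestamp is in Unix seconds, and that UTC is an acceptable time zone. The function name `is_weekend` is made up for illustration.

```python
from datetime import datetime, timezone

def is_weekend(unix_ts):
    """Return 1 if the timestamp falls on a Saturday or Sunday (UTC), else 0."""
    # weekday() counts Monday as 0 and Sunday as 6
    day = datetime.fromtimestamp(unix_ts, tz=timezone.utc).weekday()
    return 1 if day >= 5 else 0

# Example timestamps (assumed data): 2016-11-05 was a Saturday, 2016-11-07 a Monday
saturday = int(datetime(2016, 11, 5, 12, tzinfo=timezone.utc).timestamp())
monday = int(datetime(2016, 11, 7, 12, tzinfo=timezone.utc).timestamp())
flags = [is_weekend(saturday), is_weekend(monday)]
```

If users are spread over several time zones, the UTC assumption may mislabel listens close to midnight.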

Baldo

Plot age against release year.
Plot the relation between age, platform, and the is_listened column.

Omar

Submit a naive solution with random numbers.
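
A random baseline could look roughly like this. The header names `sample_id` and `is_listened` are assumptions about the submission format, and both helper functions are hypothetical; check the competition's sample submission before using it.

```python
import csv
import io
import random

def random_submission(sample_ids, seed=0):
    """Pair every sample id with a random score in [0, 1)."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return [(sid, rng.random()) for sid in sample_ids]

def write_submission(rows, fileobj):
    """Write the rows as CSV with an assumed two-column header."""
    writer = csv.writer(fileobj)
    writer.writerow(["sample_id", "is_listened"])
    writer.writerows(rows)

rows = random_submission(range(5))
buf = io.StringIO()  # stands in for a real file opened with open(path, "w", newline="")
write_submission(rows, buf)
```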

All

Clustering or kriging with columns: album, media, genre, media_duration, and artist

Modeling strategy

Model A: train a model without user_id to learn how user_gender, user_age, and the other features relate to the song's cluster.
Model B: train a model on each user's history, using Model A, and predict the result for the test sample.
Model B should be temporary and wrap Model A.

Ensemble models

keyword: Gradient Boosting Decision Trees

Get a representative sample from the dataset

The dataset has more than 7 million rows. I think we can work with a sample of around 100,000 rows (still a lot of data).

My first idea is to read the whole dataset and draw the sample at random. Of course, we could do this in a better way, but how?

Stratified sampling is an option: https://en.wikipedia.org/wiki/Stratified_sampling
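
A minimal sketch of stratified sampling with the standard library only. It assumes the rows fit in memory as dictionaries and that is_listened is a sensible stratification key; both are assumptions, and `stratified_sample` is a hypothetical helper name.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # keep at least one row per stratum so rare groups are not lost
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Toy data (made up): 700 listened rows and 300 skipped rows
rows = [{"is_listened": 1}] * 700 + [{"is_listened": 0}] * 300
sample = stratified_sample(rows, key=lambda r: r["is_listened"], fraction=0.1)
listened = sum(r["is_listened"] for r in sample)
```

The sample keeps the 70/30 proportion of the toy data. A plain random sample would only match that proportion on average.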

Usage of listen_type

Since the test set has no rows with listen_type=0, how to use listen_type is an open question.

Option 1. Ignore listen_type.
Option 2. Use only listen_type=1 rows for training.
Option 3. Use listen_type=0 as the training set and listen_type=1 as the validation set.
Option 4. Just use it as-is.
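
The four options could be expressed as one split function, sketched here under the assumption that each row is a dictionary with a listen_type key. The function name and the (train, validation) return shape are made up for illustration.

```python
def split_by_listen_type(rows, option):
    """Return (train, validation) lists for one of the four options."""
    if option == 1:
        # Ignore listen_type: drop the column from every row
        train = [{k: v for k, v in r.items() if k != "listen_type"} for r in rows]
        return train, []
    if option == 2:
        # Train only on rows that look like the test set
        return [r for r in rows if r["listen_type"] == 1], []
    if option == 3:
        # listen_type=0 for training, listen_type=1 for validation
        train = [r for r in rows if r["listen_type"] == 0]
        valid = [r for r in rows if r["listen_type"] == 1]
        return train, valid
    if option == 4:
        # Keep everything, listen_type included
        return list(rows), []
    raise ValueError(f"unknown option: {option}")

# Toy rows (made up)
rows = [
    {"listen_type": 0, "is_listened": 1},
    {"listen_type": 1, "is_listened": 0},
    {"listen_type": 1, "is_listened": 1},
]
train3, valid3 = split_by_listen_type(rows, 3)
```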

Translate Unix timestamp to date.

This is a nice observation. Since the dataset covers one month, maybe we can keep only the hour, plus another column indicating the day.
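
A small sketch of that conversion, assuming the timestamps are Unix seconds and that UTC is acceptable. The function name `hour_and_day` is hypothetical.

```python
from datetime import datetime, timezone

def hour_and_day(unix_ts):
    """Split a Unix timestamp into (hour of day, day of month), in UTC."""
    dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return dt.hour, dt.day

# Example timestamp (assumed data): 2016-11-05 13:30 UTC
ts = int(datetime(2016, 11, 5, 13, 30, tzinfo=timezone.utc).timestamp())
hour, day = hour_and_day(ts)
```

Day of month is only unambiguous if the data really stays within a single calendar month. Otherwise the month would need its own column.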

omartrinidad / challenge

challenge's Issues

To-dos to settle down

Database sharing

Get more information using Deezer API

Do preprocessing by columns
