

House Prices Data Analysis

Read this section before committing and pushing to GitHub on a collaborative project.

Pre-requisites

Since Python Notebooks save outputs, it is important to clear all outputs before committing and pushing to GitHub. This can be done by selecting Cell > All Output > Clear from the menu bar.

There is also an automated way to do this. In the terminal, run the following commands:

pip install nbstripout
nbstripout --install

git config filter.nbstripout.clean 'nbstripout'
git config filter.nbstripout.smudge cat
git config filter.nbstripout.required true

This will install the nbstripout package and configure Git to automatically strip out the outputs of all notebooks before committing them. This will make the diffs of the notebooks much easier to read and review.

house-prices-analysis's People

Contributors

erik-lance, johnwhoyou, ubergonmx, clylarfnn


house-prices-analysis's Issues

Geospatial Analysis

  • Plot houses on a map using the latitude and longitude coordinates. Color-code them based on their condition to identify any geographical patterns.
  • Create a geographical heat map to explore whether certain zip codes have houses in specific conditions.
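A minimal sketch of the map plot, using a tiny stand-in frame; the column names ('lat', 'long', 'condition') are assumptions based on the King County house-sales dataset and should be adjusted to the actual file:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for the real dataset; column names are assumed.
df = pd.DataFrame({
    "lat":  [47.51, 47.72, 47.62, 47.55],
    "long": [-122.26, -122.32, -122.05, -122.39],
    "condition": [3, 5, 4, 3],
})

fig, ax = plt.subplots()
# Color each point by its condition score.
points = ax.scatter(df["long"], df["lat"], c=df["condition"], cmap="viridis")
fig.colorbar(points, ax=ax, label="condition")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("House condition by location")
fig.savefig("condition_map.png")
```

For a proper basemap underneath the points, a library such as folium or geopandas could be layered in later.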

Random Forest Model

  • Data Preprocessing
    • Similar to Logistic Regression, preprocess the dataset by handling missing values, encoding categorical features, and scaling numerical features if necessary.
  • Model Training:
    • Train a Random Forest model using the preprocessed data, treating 'condition' as the target variable.
  • Hyperparameter Tuning
    • Experiment with the number of trees, maximum depth, and other hyperparameters. Mention the chosen hyperparameters.
  • Model Evaluation
    • Evaluate the Random Forest model's performance using classification metrics. Pay attention to its robustness and its ability to generalize to unseen data.
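The training and evaluation steps above can be sketched as follows; the synthetic features and the chosen hyperparameters are placeholders, not the project's final pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and 'condition' target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Example hyperparameters; tune n_estimators and max_depth on the real data.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
```

Holding out a test split, as above, is what lets the evaluation speak to generalization rather than training-set fit.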

Naïve Bayes Model

  • Data Preprocessing
    • Prepare the dataset as needed, noting that some Naive Bayes variants (e.g., Multinomial, Bernoulli, or Categorical) handle discrete data naturally, while the Gaussian variant expects continuous features.
  • Model Training
    • Train a Naive Bayes model, choosing the appropriate variant for your problem (e.g., Gaussian, Multinomial, or Bernoulli).
  • Hyperparameter Tuning
    • If applicable, fine-tune the hyperparameters associated with the chosen Naive Bayes variant. Mention the selected variant.
  • Model Evaluation
    • Assess the Naive Bayes model's performance using classification metrics. Evaluate its ability to handle your classification task effectively.
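A sketch of the Gaussian variant on synthetic continuous features; the data is a stand-in, and a Multinomial or Categorical variant would be swapped in for count or categorical inputs:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic continuous features; GaussianNB assumes per-class normal features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

nb = GaussianNB()
nb.fit(X_tr, y_tr)
acc = accuracy_score(y_te, nb.predict(X_te))
```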

Correlation Analysis

  • Calculate and visualize the correlation between the 'condition' feature and other numerical attributes (e.g., 'bedrooms,' 'bathrooms,' 'grade').
  • Use correlation heatmaps to identify potential relationships between variables.

Note: this is different from #7 because this is focused on pairwise relationships with condition.
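A minimal sketch of the correlation heatmap with a stand-in frame; the column names are assumed from the dataset description:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in frame; real column names from the dataset are assumed.
df = pd.DataFrame({
    "bedrooms":  [3, 4, 2, 5, 3, 4],
    "bathrooms": [1.5, 2.5, 1.0, 3.0, 2.0, 2.5],
    "grade":     [7, 9, 6, 10, 7, 8],
    "condition": [3, 4, 3, 5, 3, 4],
})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson r")
fig.savefig("correlation_heatmap.png")
```

The 'condition' row/column of `corr` is the part most relevant to this issue.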

Logistic Regression Model

  • Data Preprocessing
    • Prepare the dataset by handling missing values, encoding categorical features, and scaling numerical features if necessary.
  • Model Training
    • Train a Logistic Regression model using the preprocessed data. Use the 'condition' as the target variable and all relevant features as inputs.
  • Hyperparameter Tuning
    • Experiment with different regularization strengths (e.g., L1 or L2 regularization) and solver types (e.g., 'liblinear' or 'lbfgs'). Mention the chosen hyperparameters.
  • Model Evaluation
    • Evaluate the Logistic Regression model's performance using appropriate classification metrics like accuracy, precision, recall, F1-score, and ROC AUC. Ensure that the model does not overfit and generalizes well to new data.
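The preprocessing and training steps above can be sketched as a scikit-learn pipeline; the data, C value, and solver here are illustrative placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features and target.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

# Scaling + L2-regularized logistic regression; C and solver are examples.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000))
pipe.fit(X_tr, y_tr)
f1 = f1_score(y_te, pipe.predict(X_te))
```

Putting the scaler inside the pipeline ensures scaling statistics are fit on the training split only, which guards against leakage into the test metrics.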

Outlier Detection

  • Identify potential outliers in numerical features related to house attributes (e.g., 'sqft_living,' 'sqft_lot'). Analyze whether these outliers are associated with specific condition categories.
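One common approach is the 1.5 × IQR fence, sketched below on stand-in values (the 12000 entry is a deliberate outlier; column names are assumed):

```python
import pandas as pd

# Stand-in values for 'sqft_living'; the 12000 entry is a planted outlier.
df = pd.DataFrame({
    "sqft_living": [1200, 1300, 1400, 1500, 1600, 12000],
    "condition":   [3, 3, 4, 4, 5, 2],
})

# Classic 1.5 * IQR fence.
q1, q3 = df["sqft_living"].quantile([0.25, 0.75])
iqr = q3 - q1
fence_lo, fence_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["sqft_living"] < fence_lo) | (df["sqft_living"] > fence_hi)]

# Which condition categories do the outliers fall into?
outlier_conditions = outliers["condition"].tolist()
```

Cross-tabulating `outliers` against 'condition' then answers whether extreme sizes cluster in particular condition categories.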

Hyperparameter Tuning (All)

For each model everyone worked on:

  • Hyperparameter Tuning Method
    • Explain the method you used for hyperparameter tuning, whether it's grid search, random search, or another technique.
    • Provide details about how you conducted the tuning process.
  • Hyperparameters and Ranges
    • List the specific hyperparameters relevant to each model.
    • Specify the range of values or options you explored for each hyperparameter.
  • Performance Evaluation
    • Report the performance of different hyperparameter configurations using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score, ROC AUC).
    • Utilize tables and visualizations to present the performance results for each configuration.
  • Interpretation
    • Interpret the results based on the evaluation metrics. Explain which hyperparameter configurations performed best and why.
    • Discuss if overfitting or underfitting was observed and how hyperparameters influenced these aspects.
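A grid-search sketch covering the method, the ranges, and the per-configuration report described above; the grid values and synthetic data are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 3))
y = (X[:, 0] > 0).astype(int)

# Example hyperparameter ranges; widen these on the real data.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)

# Tabulate per-configuration scores for the report.
report = pd.DataFrame(search.cv_results_)[["params", "mean_test_score"]]
best = search.best_params_
```

`RandomizedSearchCV` is the drop-in alternative when the grid becomes too large to search exhaustively.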

Cross-Feature Relationships

  • Explore interactions between different features, such as 'bedrooms' and 'bathrooms,' to understand how they may collectively affect house condition.
  • Create scatterplots or pair plots to visualize relationships between pairs of features.

Note: This is different from #5 because correlation focuses on pairwise relationships while cross-feature analysis is multi-feature. Here we want to understand how a combination of factors affects an outcome.
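A sketch of a pair plot plus a simple combined feature, on a stand-in frame with assumed column names:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import pandas as pd
from pandas.plotting import scatter_matrix

# Stand-in frame; column names are assumed.
df = pd.DataFrame({
    "bedrooms":  [3, 4, 2, 5, 3, 4],
    "bathrooms": [1.5, 2.5, 1.0, 3.0, 2.0, 2.5],
    "condition": [3, 4, 3, 5, 3, 4],
})

# Pair plot of the two features.
axes = scatter_matrix(df[["bedrooms", "bathrooms"]], figsize=(6, 6))

# A combined feature to probe how features jointly relate to condition.
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
mean_rooms_by_condition = df.groupby("condition")["total_rooms"].mean()
```

seaborn's `pairplot` with `hue="condition"` is a richer alternative if seaborn is available.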

Data Summary and Visualization

  • Generate a dataset summary, including each feature's number of records, data types, and basic statistics.
  • Create visualizations such as histograms, bar plots, or box plots for key features like 'condition' to understand their distributions.
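The summary and a first distribution plot can be sketched like this, with a stand-in frame in place of the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import pandas as pd

# Stand-in frame; real column names are assumed.
df = pd.DataFrame({
    "condition": [3, 4, 3, 5, 3, 4],
    "price":     [300000, 450000, 280000, 600000, 310000, 420000],
})

n_records = len(df)
dtypes = df.dtypes
summary = df.describe()  # count, mean, std, min, quartiles, max

# Histogram of a key feature.
ax = df["condition"].plot(kind="hist", bins=5, title="condition distribution")
ax.figure.savefig("condition_hist.png")
```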

Dataset Description and Structure Documentation

  • Brief Dataset Description
    • Provide a concise overview of the dataset. Include key details such as its title, source, and a high-level summary of its content.
  • Collection Process Description
    • Investigate and describe the process used to collect data for the dataset. Mention any relevant sources or references used during data collection. Discuss how the data collection method may influence the conclusions and insights drawn from the dataset.
  • Dataset Structure
    • Describe the structure of the dataset file, including the file format (e.g., CSV, Excel), the delimiter used (if applicable), and any header information.
    • Clarify what each row and column represents. Explain the meaning of rows and columns in the context of the data.
    • State the dataset's total number of instances (rows) and features (columns).
  • Features Description
    • Provide a detailed description of each feature in the dataset, even if some features are not used in the study.
    • Explain the purpose and meaning of each feature. Readers should understand the role of each feature without needing to consult external sources.
    • Include information such as data types (e.g., numerical, categorical), units of measurement, and any special considerations.

Class Distribution Analysis

  • Plot a bar chart to visualize the distribution of houses across different condition classes.
  • Calculate and display the percentage of houses in each condition category.
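Both tasks reduce to a value count; a minimal sketch on stand-in data (the 'condition' column name is assumed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import pandas as pd

# Stand-in data: three class-3 houses, two class-4, one class-5.
df = pd.DataFrame({"condition": [3, 3, 3, 4, 4, 5]})

counts = df["condition"].value_counts().sort_index()
percent = (counts / counts.sum() * 100).round(1)

ax = counts.plot(kind="bar", title="Houses per condition class")
ax.set_xlabel("condition")
ax.set_ylabel("count")
ax.figure.savefig("class_distribution.png")
```

The `percent` series doubles as the table of per-class percentages for the write-up.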

Handling Duplicate House Sales Records

We have identified that the dataset contains multiple records for the same houses due to repeat sales. To ensure our analysis of house conditions is based on the most recent data and to avoid potential bias, your task is to keep only the most recent sale record for each house. This will involve removing or marking duplicates in the dataset.

  1. Identify duplicate records in the dataset based on a unique identifier, such as the house ID.
  2. Select and retain only the record with the most recent sale date for each group of duplicate records.
  3. Ensure the final dataset contains only one entry for each unique house.
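The three steps above can be sketched in pandas; the 'id' and 'date' column names are assumptions based on the dataset description:

```python
import pandas as pd

# Stand-in sales records; house id 1 was sold twice.
sales = pd.DataFrame({
    "id":    [1, 1, 2, 3],
    "date":  pd.to_datetime(["2014-05-01", "2015-03-10",
                             "2014-07-20", "2015-01-05"]),
    "price": [300000, 320000, 450000, 500000],
})

# Sort by sale date, then keep only the last (most recent) row per house id.
latest = (sales.sort_values("date")
               .drop_duplicates(subset="id", keep="last")
               .reset_index(drop=True))
```

After this step, `latest` has exactly one row per unique house, with the earlier sale of house 1 dropped.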
