orie4741 / projectsfall2017
Repository for Fall 2017 ORIE 4741 projects.
The goal of this project is to use the CBOE Volatility Index (VIX) and other related information surrounding the index to, first, predict the movements and fluctuations of the VIX, and second, use those predictions to create a trading strategy on VIX ETFs.
Some highlights of the project:
Some room for future improvements:
Overall, this is a very impressive report and very well done.
This project aims to predict whether a person makes over $50k a year using demographic variables such as race, age, and working hours per week.
The things I like about the proposal:
Areas of improvement:
Generally speaking, I have some serious doubts about this project. It felt more like an individual homework assignment than a team project. I think the team should spend more time thinking about the implications of the project and what they can learn from it.
This project is impressive. Medical cost is discussed more and more frequently, and how to reduce that cost has become an important problem. In detail, this project has several merits. First, the dataset is messy, containing various types of data and some outliers; the team used typical feature engineering methods such as one-hot encoding to properly incorporate the categorical variables into the model. Second, the innovative method of measuring error really impresses me. However, although this method can properly measure the accuracy of the model, I think a comparison between this new method and traditional metrics is necessary. Third, the use of quantile regression pushed the project one step further. This part of the analysis shows that features influence different quantiles of cost differently, giving us new insight into how to solve the initial problem: finding proper ways to cut down medical cost.
This project also has some drawbacks. First, we studied many types of loss functions and regularizers in class, and I think more loss functions, regularizers, and their combinations are worth trying. Second, the correlation between some variables may be high; in that case, some of the features can be dropped. Third, the team could analyze why random forest gives the best result while L2-regularized regression is not satisfying. This kind of analysis can help us better understand the pros and cons of different models.
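To make the first suggestion concrete, here is a minimal sketch of how a few loss/regularizer combinations could be compared under cross-validation. The data is a synthetic stand-in, not the team's actual features and medical-cost target, and the specific models and parameters are only illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, HuberRegressor, QuantileRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the team's engineered features and medical-cost target.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# A few loss/regularizer combinations from class, compared on the same splits.
models = {
    "quadratic loss + l2 (Ridge)": Ridge(alpha=1.0),
    "quadratic loss + l1 (Lasso)": Lasso(alpha=0.1),
    "Huber loss": HuberRegressor(),
    "quantile loss (median)": QuantileRegressor(quantile=0.5, alpha=0.0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: mean absolute error = {-scores.mean():.2f}")
```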
This study plans to use datasets containing information about specific movies, their first week reviews, and their box office performance. The researchers are hoping to train a model which is able to predict box office performance based on those input features. Three separate datasets will be used.
The proposal explores and identifies an interesting hypothesis: that the first reviews of a movie have a significant impact on its box office performance. The explanation of the situation and the hypothesis is clear, and the researchers have identified specific characteristics of the movies and reviews that will be taken into account. The questions proposed are quite interesting and show that the model can be used not only to answer one narrow question but also to provide further insight into behavior over time.
There are a couple of things that should be improved in the proposal. There are typos that should be fixed. In addition, the list organization can work, but it doesn't seem to be well integrated into the remainder of the document. The researchers use the term "box office performance" but never explain what it means, so readers don't have a concrete sense of what the model will be predicting. In the questions section, questions 1 and 2 are effectively the same thing; it would be best to remove one of the duplicates.
The goal of this project is to build a model that predicts whether or not a consumer will be a repeat buyer with an online merchant. This will help merchants target their advertising toward customers who are more likely to return.
Strengths:
You use a wide variety of models -- I can see you really experimented and did not just assume one might work better than another. I think the breadth of models you tried helped you gain a sense of which models you should pursue further.
Your report is broken up into clear, manageable subsections. The organization of information makes it really easy to grasp the main ideas you are trying to convey. This is especially helpful in such a technical report :)
Areas of improvement:
My biggest critique is that your report appears unfinished. You briefly touch on several models, several of which had promising error results; however, at no point do you select a "best model" and really evaluate its implications for shopping trends. There is no "improvements" or "further steps" section.
I think some of the explanation of how the models work is unnecessary. Instead, I would have dedicated more space to explaining your methods. For example, the picture of random forests from the textbook is unnecessary within the text itself and should perhaps have been listed as a reference instead. The same can be said for many of the bold equations. To touch on what I would have liked to learn more about: what were the weights associated with the perceptron algorithm? What do they reveal about which features affect a consumer's likelihood to return? (A sketch of how such weights could be inspected appears after these points.)
I think the application of this model is a little backwards. You state that the benefit of a model like this is so that buyers who are more likely to return can be sent promotions, but it doesn't really make sense for merchants to be spending more resources on these customers, as they are already more likely to return. Wouldn't a better use of this information be to help merchants figure out which customers aren't coming back so they can develop strategies to change that? Though this does not have as much to do with the technical aspect of the project, I think having a solid grasp of where the value lies in a model like this is fundamental to conducting a study like this.
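On the perceptron-weights question raised above, here is a minimal sketch of how the learned weights could be extracted and ranked to see which features push a customer toward returning. The data and feature names are synthetic stand-ins, not the team's actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the team's customer features and repeat-buyer label.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardize so the learned weights are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
clf = Perceptron(random_state=0).fit(X_scaled, y)

# Rank features by the magnitude of their weight.
for name, w in sorted(zip(feature_names, clf.coef_[0]), key=lambda t: -abs(t[1])):
    print(f"{name}: {w:+.3f}")
```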
The purpose of the project is to predict how likely a crime in Chicago is to lead to an arrest. They then use the prediction to provide suggestions for better allocation of police resources.
The dataset used is the "CLEAR" database maintained by the Chicago Police Department. It contains over 6 million crime incidents and 22 features covering basic information about the crimes. The dataset is complete and well recorded.
One thing I really like about this project is its data visualization. It not only gives us an intuitive understanding of the dataset; the plots were also used for selecting and dropping features. This project also uses one-hot encoding to transform nominal features into binary ones. We were only taught to use it on the output variable, so I think it is a smart idea. I am also amazed by how accurate the models turn out to be. It seems like a very good choice of topic.
One concern I have with this project is that the dataset might have an inherent bias: there might be crimes that took place but were never detected by the police. This could be a problem if the project's objective is to predict whether a crime is likely to lead to an arrest. Another thing I am concerned about is that although the overall accuracy is high, the accuracy for correctly predicting arrests is only around 50%. This can be problematic if we care more about predicting arrests than predicting no arrests. A final issue I want to mention is that there are other classification models this project could try; for example, the perceptron algorithm is a relevant and easy model to try out.
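To illustrate both the class-imbalance concern and the perceptron suggestion, here is a minimal sketch that reports per-class precision and recall instead of overall accuracy. The data is synthetic, with an 80/20 split standing in for the arrest/no-arrest imbalance rather than the actual CLEAR records.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crime features and arrest label; weights=[0.8, 0.2]
# mimics an imbalanced outcome (most incidents do not end in an arrest).
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = Perceptron(random_state=0).fit(X_train, y_train)

# Per-class precision/recall exposes the "only ~50% on arrests" problem
# that a single overall accuracy number can hide.
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["no arrest", "arrest"]))
```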
Overall, it is a well written project and the topic is very interesting. Good to see this kind of effort from my peers.
The aim of the project is to predict Airbnb price and occupancy rate. They are training the model with data on property location, characteristics, and reviews of 3,818 Airbnb listings in Seattle. The predictions will help new owners set prices and optimize revenue for both the property owner and Airbnb.
Three good things:
Three things to improve:
I am looking forward to seeing more on your future work!
SUMMARY:
This project examines datasets about Airbnb habits. The goal is to estimate the next Airbnb booked.
GOOD:
The project's application is something I'm directly interested in.
Strong justification for the project.
The context of the project is set very effectively.
IMPROVE:
I don't know how remerging the data will improve model validity.
I foresee issues where the data might be skewed toward Airbnb's initial crop of users, from before it became more widely accepted to rent out your home on the app.
Make sure the model doesn't accidentally statistically discriminate.
The main goal of the project is to analyze the historical price of bitcoin, its mining difficulty, and its transaction volume, among other metrics, to predict the future value of the cryptocurrency. This project is clearly interesting because of the changes that cryptocurrencies could bring to financial markets.
The three things I like most about the project are:
Possible areas of improvement:
Sorry - wrong repository
This group's idea is to predict the cost of medical bills based on the patient's illness, length of stay, and other factors. They are going to use the SPARCS dataset along with algorithms that we have used in this class to come up with a solid approach to the problem. I think this would be really helpful data to use since the cost of medical bills is increasing by a lot in today's world. I think another thing they could potentially add to their solution set is to figure out which hospitals have the cheapest cost as well as the lowest mortality rate.
What I like:
Areas of Improvement:
This project aims to predict which features of a film are most important to get a high IMDB rating and uncover trends and patterns in movie data. The group states that the results may be of interest to moviemakers in a saturated industry.
The good:
It appears the team did a great job in data cleaning, ignoring some irrelevant features.
The preliminary analysis with a linear model seems reasonable but shows severe underfitting; however, the group is aware of it and gives some thought to it, such as selecting certain features rather than using all of them.
The visualization is very insightful. It uncovers the correlation between features and average score.
The bad:
The team may consider engineering the features to create a model which better fits the data.
Just in terms of "selling" the project, I wonder if there's a better audience than moviemakers/studios. It seems that most of the features are independent of things that moviemakers necessarily have control over.
There is no indication that the team is separating their training data from testing data.
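On the last point, a minimal sketch of a held-out train/test split is below; the data is a synthetic stand-in for the team's movie features and IMDB scores, and a plain linear model is used only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the movie features and IMDB scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 2.0]) + rng.normal(scale=0.5, size=1000)

# Hold out 20% of the data; fit on the training split only and report test error.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, model.predict(X_test)))
```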
Overall, it's a solid project, and I look forward to seeing the results.
This project uses horse races in Hong Kong from 2014 to 2016 to predict each horse's ranking at the end of the race based on specific features. They want to know the ranking of each horse, not just which horse will finish first.
I really like the idea of being able to predict horse race outcomes. I think this is a fun dataset to play with. I would like to see the dataset itself, so try to provide a link to it somehow. I also liked the way you explained your model selection. It seems this team is aware of how to move on to the next steps, and it was very easy to understand where you were taking this project.
What I think could be improved is to be more specific about how you cleaned the data. What types of transformations did you do? For example, did you use one-hot encoding for your nominal features? You mentioned that different types of models would need different types of data selection/cleaning. I think this is good since you are trying to get the best outcome; however, you did not go into detail about what you did, so I'm still left wondering what your dataset looks like. You should also run some tests to see whether the features you take out are actually significant or not. I agree with the other peer reviews that taking features out because it's intuitive is not the best approach; you should provide evidence for why certain features won't make an impact. I also think you could make your visuals a bit more visually appealing. The histogram categories are very small and I can barely read them, and there are titles missing from the graphs.
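On the point about testing whether dropped features actually matter, here is a minimal sketch using permutation importance; the data, the random forest model, and the classification setup are illustrative assumptions rather than the team's actual racing dataset or pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the horse-racing features and outcome label.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Features whose shuffling barely hurts the held-out score are candidates for
# dropping, backed by evidence rather than intuition.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```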
The team wants to predict the winning result of LOL games (binary classification) using in-game features (e.g., first kill).
Things I like:
Some suggestions for improvements:
This study will dig into the main causes of attrition by studying a simulated dataset created by IBM. The main goal is to find the most important features that lead to employees leaving their respective companies. This information can then be used to create talent-retention programs that could save companies money (in hiring costs) and retain their key employees.
The aspects that I like most about the project are:
Some areas for improvement could be:
The project studies the effect of each in-game event on the result of a LOL game, as well as the relative importance of those factors in determining victory. The study is aimed at helping LOL players design winning strategies by picking the in-game events most likely to result in victory. The data is obtained from Kaggle.
What I like about the project is that it is such an interesting idea that the results could be a hit among young people. Also, the data file size is manageable, with enough columns detailing the in-game occurrences.
I like the idea of studying the optimal combination of in-game events to better inform players of the strategy. It would be better if it could suggest the order of the events.
As much as the project will attract a large group of young people, the results very likely pertain only to players of the game, which is a tiny part of the entire population.
One concern is whether the records of each in-game event during each match are easy to include in the model. Do any transformations of the data need to be done? Do any dummy variables need to be added?
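On the dummy-variable question, a minimal sketch of one-hot encoding categorical in-game events is below; the column names and values are made up for illustration and are not taken from the actual Kaggle dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the LOL match records.
matches = pd.DataFrame({
    "first_blood": ["blue", "red", "blue"],
    "first_tower": ["red", "red", "blue"],
    "winner":      [1, 0, 1],
})

# get_dummies turns each categorical event into 0/1 columns that can be fed
# directly into a classification model.
X = pd.get_dummies(matches.drop(columns="winner"))
y = matches["winner"]
print(X)
```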
The purpose of this project is to create a model that can help restaurants understand their current and future state so that they can make data-driven decisions to move toward success, where, in this case, success is measured by restaurant rating. The data will be created by a program that parses the source code of Trip Advisor pages for New York and will include geographic data as well as historical rating data.
I really like the fact that the data itself will be parsed from user input, expanding the project's scope to gathering data rather than starting at pre-processing an old dataset. I also like the direction of the question about predicting the future rating of a restaurant, as data will be parsed from comments in a focus group. I'm also interested in how this group plans on helping restaurants improve their rating.
I think this group does a great job of outlining an early plan, and it's hard to avoid being vague, but the first question about understanding the industry is a bit unclear when it comes to what data will be used and what exactly "industry" means (type of food? restaurants in general? location?).
Another possible improvement could be testing the model on another city's input data and comments to see how different people feel when it comes to food, geographically.
A final improvement suggestion would be to similarly test the model against reviews in say, Yelp, to see if there is any success there.
Good luck!
The project I am reviewing aims to predict a movie's box office based on its basic profile and its first week's online reviews. They also want to check whether first-week reviews had a similar effect ten years ago as they have now. They are planning to use datasets from Kaggle, Rotten Tomatoes, the numbers.com, and some other open movie databases.
Things I liked about the proposal:
The proposal is about predicting whether the police will eventually arrest someone for a crime. An accurate prediction would allow people to gain insight into biases and improve the dispatching of police and procedures for handling crime. Here are some positive points:
Here are some things you can improve on:
The project is about predicting the number of traffic accidents that will happen in each area based on the features of the area. They will be using data collected from the UK over 5 years to make these predictions.
I like the idea of the study, and I think the results would be really useful. It could, as you described, help towns deploy emergency resources more efficiently, which would be a great use of your data set. Great idea! It would also be awesome if you could predict the road conditions correlated with (and possibly causing) accidents. The features include possible causes of accidents -- like junctions, two-way vs one-way roads, weather conditions, etc, so it looks like such a prediction would be possible. Guesses as to the contributing causes of accidents would be great for fixing the problems. That way, if a town had a particularly bad intersection, they could fix it.
I would have liked to know what your features are without having to download a giant dataset and look myself. Also, there's no key, so I don't know what some of the features stand for or what the 0s or 1s mean for urban vs. rural. I wish you had described the features that you were going to use and described the data better.
It looks like you have a second data set about traffic flow in 10 regions. If you can line that up with the 18,000 areas in your other dataset, you can use the traffic flow features as another way to predict accidents. I like that you are able to incorporate multiple data sets in your project. That seems like a challenge! However, I'm concerned that that might be difficult, and I want to know how you plan to match up the locations. Once again, I wish that you had explained this in your proposal.
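A minimal sketch of how the two datasets could be joined on a shared location key is below; the frames and column names (region, area_id, and so on) are hypothetical placeholders, since the actual file headers and region coding are not described in the proposal.

```python
import pandas as pd

# Hypothetical stand-ins for the accident-area dataset and the traffic-flow dataset.
accidents = pd.DataFrame({
    "area_id":     [101, 102, 103],
    "region":      ["North East", "London", "Wales"],
    "n_accidents": [40, 120, 25],
})
traffic_flow = pd.DataFrame({
    "region":         ["North East", "London", "Wales"],
    "avg_daily_flow": [18000, 55000, 9000],
})

# A left join on the shared region key attaches the coarser traffic-flow
# features to each finer-grained accident area.
merged = accidents.merge(traffic_flow, on="region", how="left")
print(merged)
```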
This project is trying to predict the loan status of every loan application based on features of the loan terms. The dataset they used is from Lending Club, the peer-to-peer lending company.
Three things I like:
Two things to improve:
Overall, it is a very meaningful project! Good job!
This project aims to predict whether a customer will repurchase from an online store using the customer's activity log data from the online shopping platform of interest. Such a prediction model would be really helpful in practice, since it can help merchants push targeted promotions to long-term buyers and therefore increase sales.
The things I love about this project:
Things to improve:
Overall, I think it's an exciting project. I can't wait to see the results.
The project intends to predict the next destinations of travelers by studying historical Airbnb data. The data has many important features, such as the travelers' previous destinations, their genders, and so on. This is an interesting topic to explore, as people are traveling more often than ever and travel agencies are eager to have the most efficient models to predict their customers' behaviors.
After evaluating this report, I have summarized the following three things I like and three other things that could be improved.
Pros:
Cons:
Overall, the midterm report is pretty well written.
This project is about predicting the box office performance of movies in their opening weekend. The group collects data from four different movie sites and aims to assist marketers and inform decisions about how many theaters a movie should be released in.
Some highlights of the project:
Some room for future improvements:
Overall, I can tell a lot of work has been put into this project, and it was definitely one of the more interesting ones to read. Great work!
This project aims to predict the result of the League of Legends game using the in-game events like the first blood, the first tower, etc. And the conclusions from this project can be used to improve the players’ strategies in competitions.
This is an interesting idea, and I believe people who play this game cannot wait to see the results. Besides, although the results of the game may be influenced by a lot of factors, they specified the factors they are concerned about very clearly in the question part. And finding the possible optimal combination of several factors is also a good method for prediction.
One thing that concerns me is the 'stationarity' of the winning strategy. For example, will the important factors that affected victory two years ago still be the important factors two years later? And since it seems that there are many dummy variables in the dataset, I am not sure whether that will influence the models' results. If so, are there any transformations that need to be made to these dummy variables, or are there any other factors that need to be considered in the model?
Hi,
I accidentally modified this file, and my commit was pushed. Can you roll back the changes? Sorry, I'm new to Github
What is nice about the project is that the data seem comprehensive, and you already start looking into the structure of the data. Also you have a clear evaluation plan.
The first thing that concerns me is the scope of the data. You stated that the data consist of all transactions before Oct 2016; does that mean they include all transactions since the founding of the corresponding states? Probably not, so I wonder exactly how many years' worth of data / how many entries there are, whether there are a lot of missing entries, and whether you actually want to use all of the data.
The second thing that concerns me is that there is no data about the location of the real estate, such as whether it's close to a subway station and when a subway station was built. Is there any mall or entertainment facility near the real estate? I think that information can be crucial in determining the price of the estate, so maybe you would like to find other complementary datasets.
Another thing I like about the project is that it has practical value. At the same time, it's worth asking whether your model will generalize well to real estate sales in other states and countries. That would also be an interesting area to explore.
This report is well-written. It has a very clear structure. I am especially amazed by the data cleaning part. The team took a lot of effort in picking the reasonable features to use. The pairwise comparison and correlation plots are outstanding. I really liked the idea (how they chose the variables). They also included detailed explanations of what each variable means and that is very helpful. Overall, the outcome of their project is also going to be useful in the real world setting. The land price of New York is something that is definitely important in the market.
I believe at this point the team can focus on improving the prediction accuracy. Also, the model is a little too complex at this point; some variables might have effects on the prediction that the team does not necessarily understand. So, I would suggest the team focus on an even smaller number of variables first, observe how that goes, and then later expand the set of independent variables used.
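One way to act on the "start small, then expand" suggestion is forward feature selection; the sketch below uses synthetic data and a plain linear model as stand-ins, since the team's actual variables and model are not reproduced here.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the land-price features and target.
X, y = make_regression(n_samples=500, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

# Forward selection: grow the model from a small set of variables, adding a new
# one only if it improves cross-validated performance.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```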
In general, I really liked the project and I think the group is definitely on the right track.