orie4741 / projectsfall2017
Repository for Fall 2017 ORIE 4741 projects.
The goal of this project is to use the CBOE Volatility Index (VIX) and other related information surrounding the index to, first, predict the movements and fluctuations of the VIX, and second, use those predictions to create a trading strategy on VIX ETFs.
Some highlights of the project:
Some room for future improvements:
Overall, this is a very impressive report and very well done.
This project aims to predict whether a person makes over $50k a year using demographic variables such as race, age, and working hours per week.
The things I like about the proposal:
Areas of improvement:
Generally speaking, I have some serious doubts about this project. It felt more like an individual homework assignment than a team project. I think the team should spend more time thinking about the implications of the project and what they can learn from it.
This project is impressive. Medical cost is discussed more and more frequently, and how to reduce that cost has become an important problem. In detail, this project has several merits. First, the dataset is messy, containing various types of data and some outliers; the team used typical feature engineering methods such as one-hot encoding to properly incorporate the categorical variables into the model. Second, the innovative method of measuring error really impresses me. However, although this method can properly measure the accuracy of the model, I think a comparison between this new method and traditional metrics is necessary. Third, the use of quantile regression pushed the project one step further. This part of the analysis shows that features influence different quantiles of cost differently, giving us new insight into how to solve the initial problem: finding proper ways to cut down medical cost.
This project also has some drawbacks. First, we studied many types of loss functions and regularizers in class, and I think more loss functions, regularizers, and their combinations are worth trying. Second, the correlation between some variables may be high; in that case, some of the features can be dropped. Third, the team could analyze why random forest gives the best result while L2-regularized regression is not satisfying. This kind of analysis can help us better understand the pros and cons of different models.
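To make the first suggestion concrete, here is a minimal sketch of how a few loss/regularizer combinations could be compared under cross-validation. The data is a synthetic stand-in, not the team's actual features and medical-cost target, and the specific models and parameters are only illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, HuberRegressor, QuantileRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the team's engineered features and medical-cost target.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# A few loss/regularizer combinations from class, compared on the same splits.
models = {
    "quadratic loss + l2 (Ridge)": Ridge(alpha=1.0),
    "quadratic loss + l1 (Lasso)": Lasso(alpha=0.1),
    "Huber loss": HuberRegressor(),
    "quantile loss (median)": QuantileRegressor(quantile=0.5, alpha=0.0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: mean absolute error = {-scores.mean():.2f}")
```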
This study plans to use datasets containing information about specific movies, their first week reviews, and their box office performance. The researchers are hoping to train a model which is able to predict box office performance based on those input features. Three separate datasets will be used.
The proposal explores and identifies an interesting hypothesis: that the first reviews of a movie have a significant impact on its box office performance. The explanation of the situation and the hypothesis is clear, and the researchers have identified specific characteristics of the movies and reviews that will be taken into account. The questions proposed are quite interesting and show that the model can be used not only to answer one narrow question but also to provide further insight into behavior over time.
There are a couple of things that should be improved in the proposal. There are typos that should be fixed. In addition, the list organization can work, but it doesn't seem to be well integrated into the remainder of the document. The researchers use the term "box office performance" but never explain what it means, so readers don't have a concrete sense of what the model will be predicting. In the questions section, questions 1 and 2 are effectively the same thing; it would be best to remove one of the duplicates.
The goal of this project is to build a model that predicts whether or not a consumer will be a repeat buyer with an online merchant. This will help merchants target their advertising toward customers who are more likely to return.
Strengths:
You use a wide variety of models -- I can see you really experimented and did not just assume one might work better than another. I think the breadth of models you tried helped you gain a sense of which models you should pursue further.
Your report is broken up into clear, manageable subsections. The organization of information makes it really easy to grasp the main ideas you are trying to convey. This is especially helpful in such a technical report :)
Areas of improvement:
My biggest critique is that your report appears unfinished. You briefly touch on several models, several of which had promising error results; however, at no point do you select a "best model" and really evaluate its implications for shopping trends. There is no "improvements" or "further steps" section.
I think some of the explanation of how the models work is unnecessary. Instead, I would have dedicated more space to explaining your methods. For example, the picture of random forests from the textbook is unnecessary within the text itself and should perhaps have been listed as a reference instead. The same can be said for many of the bold equations. To touch on what I would have liked to learn more about: what were the weights associated with the perceptron algorithm? What do they reveal about which features affect a consumer's likelihood to return? (A sketch of how such weights could be inspected appears after these points.)
I think the application of this model is a little backwards. You state that the benefit of a model like this is so that buyers who are more likely to return can be sent promotions, but it doesn't really make sense for merchants to be spending more resources on these customers, as they are already more likely to return. Wouldn't a better use of this information be to help merchants figure out which customers aren't coming back so they can develop strategies to change that? Though this does not have as much to do with the technical aspect of the project, I think having a solid grasp of where the value lies in a model like this is fundamental to conducting a study like this.
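On the perceptron-weights question raised above, here is a minimal sketch of how the learned weights could be extracted and ranked to see which features push a customer toward returning. The data and feature names are synthetic stand-ins, not the team's actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the team's customer features and repeat-buyer label.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardize so the learned weights are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
clf = Perceptron(random_state=0).fit(X_scaled, y)

# Rank features by the magnitude of their weight.
for name, w in sorted(zip(feature_names, clf.coef_[0]), key=lambda t: -abs(t[1])):
    print(f"{name}: {w:+.3f}")
```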
The purpose of the project is to predict how likely a crime in Chicago is to lead to an arrest. They then use the prediction to provide suggestions for better allocation of police resources.
The dataset used is the "CLEAR" database maintained by the Chicago Police Department. It contains over 6 million crime incidents and 22 features covering basic information about the crimes. The dataset is complete and well recorded.
One thing I really like about this project is its data visualization. It not only gives us an intuitive understanding of the dataset; the plots were also used for selecting and dropping features. This project also uses one-hot encoding to transform nominal features into binary ones. We were only taught to use it on the output variable, so I think it is a smart idea. I am also amazed by how accurate the models turn out to be. It seems like a very good choice of topic.
One concern I have with this project is that the dataset might have an inherent bias: there might be crimes that took place but were never detected by the police. This could be a problem if the project's objective is to predict whether a crime is likely to lead to an arrest. Another thing I am concerned about is that although the overall accuracy is high, the accuracy for correctly predicting arrests is only around 50%. This can be problematic if we care more about predicting arrests than predicting no arrests. A final issue I want to mention is that there are other classification models this project could try; for example, the perceptron algorithm is a relevant and easy model to try out.
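To illustrate both the class-imbalance concern and the perceptron suggestion, here is a minimal sketch that reports per-class precision and recall instead of overall accuracy. The data is synthetic, with an 80/20 split standing in for the arrest/no-arrest imbalance rather than the actual CLEAR records.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crime features and arrest label; weights=[0.8, 0.2]
# mimics an imbalanced outcome (most incidents do not end in an arrest).
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = Perceptron(random_state=0).fit(X_train, y_train)

# Per-class precision/recall exposes the "only ~50% on arrests" problem
# that a single overall accuracy number can hide.
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["no arrest", "arrest"]))
```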
Overall, it is a well written project and the topic is very interesting. Good to see this kind of effort from my peers.
The aim of the project is to predict Airbnb price and occupancy rate. They are training the model with data on property location, characteristics, and reviews of 3,818 Airbnb listings in Seattle. The predictions will help new owners set prices and optimize revenue for both the property owner and Airbnb.
Three good things:
Three things to improve:
I am looking forward to seeing more on your future work!
SUMMARY:
This project examines datasets about Airbnb habits. The goal is to estimate the next Airbnb booked.
GOOD:
The project's application is something I'm directly interested in.
Strong justification for the project.
The context of the project is set very effectively.
IMPROVE:
I don't know how remerging the data will improve model validity.
I foresee issues where the data might be skewed toward Airbnb's initial crop of users, from before it became more widely accepted to rent out your home on the app.
Make sure the model doesn't accidentally statistically discriminate.
The main goal of the project is to analyze the historical price of bitcoin, its mining difficulty, and its transaction volume, among other metrics, to predict the future value of the cryptocurrency. This project is clearly interesting because of the changes that cryptocurrencies could bring to financial markets.
The three things I like most about the project are:
Possible areas of improvement:
Sorry - wrong repository
This group's idea is to predict the cost of medical bills based on the patient's illness, length of stay, and other factors. They are going to use the SPARCS dataset along with algorithms that we have used in this class to come up with a solid approach to the problem. I think this would be really helpful data to use since the cost of medical bills is increasing by a lot in today's world. I think another thing they could potentially add to their solution set is to figure out which hospitals have the cheapest cost as well as the lowest mortality rate.
What I like:
Areas of Improvement:
This project aims to predict which features of a film are most important to get a high IMDB rating and uncover trends and patterns in movie data. The group states that the results may be of interest to moviemakers in a saturated industry.
The good:
It appears the team did a great job in data cleaning, ignoring some irrelevant features.
The preliminary analysis with a linear model seems reasonable but shows severe underfitting; however, the group is aware of it and gives some thought to it, such as selecting certain features rather than using all of them.
The visualization is very insightful. It uncovers the correlation between features and average score.
The bad:
The team may consider engineering the features to create a model which better fits the data.
Just in terms of "selling" the project, I wonder if there's a better audience than moviemakers/studios. It seems that most of the features are independent of things that moviemakers necessarily have control over.
There is no indication that the team is separating their training data from testing data.
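On the last point, a minimal sketch of a held-out train/test split is below; the data is a synthetic stand-in for the team's movie features and IMDB scores, and a plain linear model is used only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the movie features and IMDB scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 2.0]) + rng.normal(scale=0.5, size=1000)

# Hold out 20% of the data; fit on the training split only and report test error.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, model.predict(X_test)))
```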
Overall, it's a solid project, and I look forward to seeing the results.
This project uses horse races in Hong Kong from 2014 to 2016 to predict each horse's ranking at the end of the race based on specific features. They want to know the ranking of each horse, not just which horse will finish first.
I really like the idea of being able to predict horse race outcomes. I think this is a fun dataset to play with. I would like to see the dataset itself, so try to provide a link to it somehow. I also liked the way you explained your model selection. It seems this team is aware of how to move on to the next steps, and it was very easy to understand where you were taking this project.
What I think could be improved is to be more specific about how you cleaned the data. What types of transformations did you do? For example, did you use one-hot encoding for your nominal features? You mentioned that different types of models would need different types of data selection/cleaning. I think this is good since you are trying to get the best outcome; however, you did not go into detail about what you did, so I'm still left wondering what your dataset looks like. You should also run some tests to see whether the features you take out are actually significant or not. I agree with the other peer reviews that taking features out because it's intuitive is not the best approach; you should provide evidence for why certain features won't make an impact. I also think you could make your visuals a bit more visually appealing. The histogram categories are very small and I can barely read them, and there are titles missing from the graphs.
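On the point about testing whether dropped features actually matter, here is a minimal sketch using permutation importance; the data, the random forest model, and the classification setup are illustrative assumptions rather than the team's actual racing dataset or pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the horse-racing features and outcome label.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Features whose shuffling barely hurts the held-out score are candidates for
# dropping, backed by evidence rather than intuition.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```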
The team wants to predict the winning result of LOL games (binary classification) using in-game features (e.g., first kill).
Things I like:
Some suggestions for improvements:
This study will dig into the main causes of attrition by studying a simulated dataset created by IBM. The main goal is to find the most important features that lead to employees leaving their respective companies. This information can then be used to create talent-retention programs that could save companies money (in hiring costs) and retain their key employees.
The aspects that I like most about the project are:
Some areas for improvement could be:
The project studies the effect of each in-game event on the result of a LOL game, as well as the relative importance of those factors in determining victory. The study is aimed at helping LOL players design winning strategies by picking the in-game events most likely to result in victory. The data is obtained from Kaggle.
What I like about the project is that it is such an interesting idea that the results could be a hit among young people. Also, the data file size is manageable, with enough columns detailing the in-game occurrences.
I like the idea of studying the optimal combination of in-game events to better inform players of the strategy. It would be better if it could suggest the order of the events.
As much as the project will attract a large group of young people, the results very likely pertain only to players of the game, which is a tiny part of the entire population.
One concern is whether the records of each in-game event during each match are easy to include in the model. Do any transformations of the data need to be done? Do any dummy variables need to be added?
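On the dummy-variable question, a minimal sketch of one-hot encoding categorical in-game events is below; the column names and values are made up for illustration and are not taken from the actual Kaggle dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the LOL match records.
matches = pd.DataFrame({
    "first_blood": ["blue", "red", "blue"],
    "first_tower": ["red", "red", "blue"],
    "winner":      [1, 0, 1],
})

# get_dummies turns each categorical event into 0/1 columns that can be fed
# directly into a classification model.
X = pd.get_dummies(matches.drop(columns="winner"))
y = matches["winner"]
print(X)
```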
The purpose of this project is to create a model that can help restaurants understand their current and future state so that they can make data-driven decisions to move toward success, where, in this case, success is measured by restaurant rating. The data will be created by a program that parses the source code of Trip Advisor pages for New York and will include geographic data as well as historical rating data.
I really like the fact that the data itself will be parsed from user input, expanding the project's scope to gathering data rather than starting at pre-processing an old dataset. I also like the direction of the question about predicting the future rating of a restaurant, as data will be parsed from comments in a focus group. I'm also interested in how this group plans on helping restaurants improve their rating.
I think this group does a great job of outlining an early plan, and it's hard to avoid being vague, but the first question about understanding the industry is a bit unclear when it comes to what data will be used and what exactly "industry" means (type of food? restaurants in general? location?).
Another possible improvement could be testing the model on another city's input data and comments to see how different people feel when it comes to food, geographically.
A final improvement suggestion would be to similarly test the model against reviews in say, Yelp, to see if there is any success there.
Good luck!
The project I am reviewing aims to predict a movie's box office based on its basic profile and its first week's online reviews. They also want to check whether first-week reviews had a similar effect ten years ago as they have now. They are planning to use datasets from Kaggle, Rotten Tomatoes, the numbers.com, and some other open movie databases.
Things I liked about the proposal:
The proposal is about predicting whether the police will eventually arrest someone for a crime. An accurate prediction would allow people to gain insight into biases and improve the dispatching of police and procedures for handling crime. Here are some positive points:
Here are some things you can improve on:
The project is about predicting the number of traffic accidents that will happen in each area based on the features of the area. They will be using data collected from the UK over 5 years to make these predictions.
I like the idea of the study, and I think the results would be really useful. It could, as you described, help towns deploy emergency resources more efficiently, which would be a great use of your data set. Great idea! It would also be awesome if you could predict the road conditions correlated with (and possibly causing) accidents. The features include possible causes of accidents -- like junctions, two-way vs one-way roads, weather conditions, etc, so it looks like such a prediction would be possible. Guesses as to the contributing causes of accidents would be great for fixing the problems. That way, if a town had a particularly bad intersection, they could fix it.
I would have liked to know what your features are without having to download a giant dataset and look myself. Also, there's no key, so I don't know what some of the features stand for or what the 0s or 1s mean for urban vs. rural. I wish you had described the features that you were going to use and described the data better.
It looks like you have a second data set about traffic flow in 10 regions. If you can line that up with the 18,000 areas in your other dataset, you can use the traffic flow features as another way to predict accidents. I like that you are able to incorporate multiple data sets in your project. That seems like a challenge! However, I'm concerned that that might be difficult, and I want to know how you plan to match up the locations. Once again, I wish that you had explained this in your proposal.
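A minimal sketch of how the two datasets could be joined on a shared location key is below; the frames and column names (region, area_id, and so on) are hypothetical placeholders, since the actual file headers and region coding are not described in the proposal.

```python
import pandas as pd

# Hypothetical stand-ins for the accident-area dataset and the traffic-flow dataset.
accidents = pd.DataFrame({
    "area_id":     [101, 102, 103],
    "region":      ["North East", "London", "Wales"],
    "n_accidents": [40, 120, 25],
})
traffic_flow = pd.DataFrame({
    "region":         ["North East", "London", "Wales"],
    "avg_daily_flow": [18000, 55000, 9000],
})

# A left join on the shared region key attaches the coarser traffic-flow
# features to each finer-grained accident area.
merged = accidents.merge(traffic_flow, on="region", how="left")
print(merged)
```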
This project is trying to predict the loan status of every loan application based on features of the loan terms. The dataset they used is from Lending Club, the peer-to-peer lending company.
Three things I like:
Two things to improve:
Overall, it is a very meaningful project! Good job!
This project aims to predict whether a customer will repurchase from an online store using the customer's activity log data from the online shopping platform of interest. Such a prediction model would be really helpful in practice, since it can help merchants push targeted promotions to long-term buyers and therefore increase sales.
The things I love about this project:
Things to improve:
Overall, I think it's an exciting project. I can't wait to see the results.
The project intends to predict the next destinations of travelers by studying historical Airbnb data. The data has many important features, such as the travelers' previous destinations, their genders, and so on. This is an interesting topic to explore, as people are traveling more often than ever and travel agencies are eager to have the most efficient models to predict their customers' behaviors.
After evaluating this report, I have summarized the following three things I like and three other things that could be improved.
Pros:
Cons:
Overall, the midterm report is pretty well written.
This project is about predicting the box office performance of movies in their opening weekend. The group collects data from four different movie sites and aims to assist marketers and inform decisions about how many theaters a movie should be released in.
Some highlights of the project:
Some room for future improvements:
Overall, I can tell a lot of work has been put into this project, and it was definitely one of the more interesting ones to read. Great work!
This project aims to predict the result of the League of Legends game using the in-game events like the first blood, the first tower, etc. And the conclusions from this project can be used to improve the players’ strategies in competitions.
This is an interesting idea, and I believe people who play this game cannot wait to see the results. Besides, although the results of the game may be influenced by a lot of factors, they specified the factors they are concerned about very clearly in the question part. And finding the possible optimal combination of several factors is also a good method for prediction.
One thing that concerns me is the 'stationarity' of the winning strategy. For example, will the important factors that affected victory two years ago still be the important factors two years later? And since it seems that there are many dummy variables in the dataset, I am not sure whether that will influence the models' results. If so, are there any transformations that need to be made to these dummy variables, or are there any other factors that need to be considered in the model?
Hi,
I accidentally modified this file, and my commit was pushed. Can you roll back the changes? Sorry, I'm new to Github
What is nice about the project is that the data seem comprehensive, and you already start looking into the structure of the data. Also you have a clear evaluation plan.
The first thing that concerns me is the scope of the data. You stated that the data consist of all transactions before Oct 2016; does that mean they include all transactions since the founding of the corresponding states? Probably not, so I wonder exactly how many years' worth of data / how many entries there are, whether there are a lot of missing entries, and whether you actually want to use all of the data.
The second thing that concerns me is that there is no data about the location of the real estate, such as whether it's close to a subway station and when a subway station was built. Is there any mall or entertainment facility near the real estate? I think that information can be crucial in determining the price of the estate, so maybe you would like to find other complementary datasets.
Another thing I like about the project is that it has practical value. At the same time, it's worth asking whether your model will generalize well to real estate sales in other states and countries. That would also be an interesting area to explore.
This report is well-written. It has a very clear structure. I am especially amazed by the data cleaning part. The team took a lot of effort in picking the reasonable features to use. The pairwise comparison and correlation plots are outstanding. I really liked the idea (how they chose the variables). They also included detailed explanations of what each variable means and that is very helpful. Overall, the outcome of their project is also going to be useful in the real world setting. The land price of New York is something that is definitely important in the market.
I believe at this point the team can focus on improving the prediction accuracy. Also, the model is a little too complex at this point; some variables might have effects on the prediction that the team does not necessarily understand. So, I would suggest the team focus on an even smaller number of variables first, observe how that goes, and then later expand the set of independent variables used.
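One way to act on the "start small, then expand" suggestion is forward feature selection; the sketch below uses synthetic data and a plain linear model as stand-ins, since the team's actual variables and model are not reproduced here.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the land-price features and target.
X, y = make_regression(n_samples=500, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

# Forward selection: grow the model from a small set of variables, adding a new
# one only if it improves cross-validated performance.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```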
In general, I really liked the project and I think the group is definitely on the right track.