eric-w-h / orie4741_project Goto Github PK

Eric, Enya, Varun. Beating the bookies!

TeX 0.06% Jupyter Notebook 99.82% Python 0.12%

orie4741_project's Introduction

ORIE4741_Project

Eric (ewh73), Enya (eez25), and Varun (vm324)

orie4741_project's Issues

Final report peer review

This team aimed to use machine learning to choose the best betting strategy. They tried SVMs with various kernels and various tree ensembles, including XGBoost and CatBoost. All of their models outperform the simplest betting strategy by over half. However, they were not able to use their models to turn a profit.

Three things I liked

Very helpful explanation of how betting works. I would have gotten lost if I hadn't read the background explanation.
I love how you discuss the variance of your predictions. This is something that is very important but that very few other teams touched on. Way to go.
Great performance on improving your winnings by twice as much as just always choosing the best team! I think you also chose a great baseline, others would have just compared their results with random guessing but of course the goal is not to beat guessing but to improve the easiest strategy.
You all tried a wide variety of models!
I like where you do grid search and then check random perturbations around the best-performing parameter values. That's a very smart way of getting the best of both grid search and random search! I generally really like the graphs in your section about model parameter search.
Cool use of XGBoost and CatBoost, using interesting models from outside of class.
Creative to try the different betting strategies (proportional, quadratic, risky)
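
To make the grid-then-perturb idea praised above concrete, here is a minimal sketch. It is not the team's code: `validation_score` is a hypothetical stand-in for training a model and scoring it on validation data, and the parameter names `c` and `gamma`, the grid values, and the perturbation radius are all assumptions chosen for illustration.

```python
import itertools
import random


def validation_score(c, gamma):
    """Hypothetical stand-in for a validation score (higher is better).

    A real project would train a model with these hyperparameters and
    return its validation accuracy or profit instead.
    """
    return -((c - 3.7) ** 2) - ((gamma - 0.42) ** 2)


def grid_then_random(score_fn, grid, n_random=200, radius=0.5, seed=0):
    """Coarse grid search, then random perturbations around the grid winner."""
    rng = random.Random(seed)
    names = list(grid)
    # Stage 1: exhaustive search over the coarse grid.
    grid_best = max(
        (dict(zip(names, values)) for values in itertools.product(*grid.values())),
        key=lambda params: score_fn(**params),
    )
    best, best_score = dict(grid_best), score_fn(**grid_best)
    # Stage 2: sample uniformly in a box around the grid winner and keep
    # any candidate that improves on the best score seen so far.
    for _ in range(n_random):
        candidate = {k: v + rng.uniform(-radius, radius) for k, v in grid_best.items()}
        score = score_fn(**candidate)
        if score > best_score:
            best, best_score = candidate, score
    return grid_best, best, best_score


coarse_grid = {"c": [1.0, 2.0, 3.0, 4.0, 5.0], "gamma": [0.0, 0.5, 1.0]}
grid_best, refined, refined_score = grid_then_random(validation_score, coarse_grid)
```

The refined score can never be worse than the grid winner's, because the grid winner is the starting point of the second stage.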

Couple things to improve next time

I had some trouble understanding a lot of the graphs. Some axis labels and titles had terms that weren't explained in the paper.
The graph of the network of teams playing against each other doesn't really add anything. It is also weirdly formatted compared to the graph next to it.
Some of the formatting is a bit messy.

Really great job! Good luck with finals and have a good break!

Midterm Peer Review

The goal of this group's project is to discover factors that contribute to a country's GDP and how to leverage them for growth. They plan to go about this by using a dataset from the World Bank, which includes 1443 variables over a 60-year span. The group intends to use these variables to find the factors that hold the most potential for a country's growth.

Some things I liked:

I liked that you made sure to look thoroughly at a few of the variables involved before doing any regression/ML models. I agree it's definitely a good idea to get to know the data you're working with before trying to learn from it.
I like that you tried two different ways of selecting features and tried to explain intuitively why the handpicked set might actually be worse at predicting than the features with the most data available. I think it's good to think critically about your models without just simply comparing their errors!
I really liked the discussion of how you plan to avoid underfitting and how you mentioned using many different categories of uncorrelated data, it is definitely important that your features can provide new information to you.

Some things to add:

I think some of the graphs could have been labeled/explained a bit more clearly. For example, on the first graph, I was a bit confused about what the values on the legend meant.
For features that had data missing, instead of throwing those out altogether, it may be interesting to see if there is any reason/pattern for this data to be missing that might actually be helpful in predicting GDP.
To add onto my last point in the previous section, if you wanted to be sure that you were using uncorrelated/the "best" features for their category, you could always try using ControlBurn to extract those.
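
One way to act on the missing-data suggestion above is to keep each incomplete feature and add a 0/1 flag recording where it was missing, so a model can learn from the missingness pattern itself. The sketch below is illustrative only: the record layout, the `health_exp` feature name, the toy values, and mean imputation are assumptions, not the team's approach.

```python
def add_missing_indicators(rows, features):
    """Return copies of rows with a 0/1 '<feature>_missing' flag per feature.

    Missing values are represented as None. They are filled with the mean
    of the observed values so the original column stays usable.
    """
    means = {}
    for f in features:
        observed = [r[f] for r in rows if r[f] is not None]
        means[f] = sum(observed) / len(observed) if observed else 0.0
    out = []
    for r in rows:
        new = dict(r)  # copy, so the input rows are left untouched
        for f in features:
            new[f + "_missing"] = 1 if r[f] is None else 0
            if r[f] is None:
                new[f] = means[f]
        out.append(new)
    return out


# Hypothetical toy records; names and values are illustrative only.
rows = [
    {"country": "A", "health_exp": 100.0},
    {"country": "B", "health_exp": None},
    {"country": "C", "health_exp": 300.0},
]
flagged = add_missing_indicators(rows, ["health_exp"])
```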

ORIE 4741 Peer Review

Summary

The purpose of this project is to analyze data from the World Bank World Development Indicators to help countries optimize growth. They will use features including debt, imports by percent of GDP, health expenditure per capita, and the GINI Index. Using this data, they hope to predict rankings of most popular imports and exports and GDP.

Things I like

There is a lot to work with in this dataset. There are numerical, nominal, and ordinal features, which allow for lots of techniques to experiment with for feature engineering.
If successful, the project is very useful. If you can classify the least and most important factors for country GDP, you can help countries optimize growth.
Finding economic trends can be very difficult, so I like that they're challenging themselves.

Areas of Improvement

It seems that there are lots of things to predict currently, from rankings of most popular imports to GDP to other indicators of wealth. I think narrowing down the specific label you are trying to predict will be helpful.
There may be too many features currently. Even though there is a lot of data, you may not want to use all of it, since overfitting is a concern.
The "Methods" section could use more detail. I was hoping to see which methods they were planning to use and what challenges they anticipated with those methods, but the section was too general.

Nice job overall, good luck!

Final report review

Using data from NFL teams and betting sites, this project seeks to predict the winning team of an NFL match in order to develop profitable betting strategies.

Things I liked:

Grid search and model selection plots
Seems like you tried a number of different models
I liked the explorative data analysis and the visualisations in that section

Things for improvement

While I can appreciate a captivating and fun introduction, I think the problem statement was too unclear. A more professional/scientific statement could have been made (in addition to the current intro). What are you predicting? What data are you using? This could have been explained in one sentence.
Since you were using a data set that was perhaps a bit harder to understand (to me at least) and included more features that were described throughout the report, a more thorough description of the features would have been nice. It was not completely clear to me what the data (X and y) looked like.
Did not understand the plot in the conclusion
Did you consider any other performance measures aside from accuracy?
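
As an illustration of the last question, two classifiers can have identical accuracy and still differ in how well calibrated their probabilities are, which matters when the probabilities drive bet sizes. The sketch below compares accuracy with the Brier score and log loss; the labels and probabilities are made-up toy values, not results from the report.

```python
import math


def accuracy(y_true, p_win):
    """Fraction of games where the thresholded prediction matches the outcome."""
    return sum((p >= 0.5) == bool(y) for y, p in zip(y_true, p_win)) / len(y_true)


def brier_score(y_true, p_win):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_win)) / len(y_true)


def log_loss(y_true, p_win, eps=1e-12):
    """Average negative log-likelihood of the outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_win):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Toy data (assumed): both models get the last game wrong.
y = [1, 0, 1, 1]
cautious = [0.6, 0.4, 0.6, 0.4]
overconfident = [0.99, 0.01, 0.99, 0.01]
```

Both toy models score the same accuracy, but the overconfident one is penalized more heavily by the Brier score and by log loss for its one confident miss.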

ORIE Peer-review

Hi group,

This project sets out to find hidden indicators of GDP growth and international commerce trends. It uses economic indicator data from the World Bank, and the data comes as continuous variables. Their current plan is to use mainly regression to identify the most influential factors contributing to a country’s growth potential.

3 things I liked:

Their goal of predicting economic growth using a large variety of indicators is very interesting, as some indicators, like public health investment, may contribute positively to the economy in the long run but are typically hard to observe in the short run.
Finding economic trends is hard without a large amount of data. The World Bank dataset has 266 nations and 60 years of data, making it more likely that useful trends can be extracted. Also, we can trust the World Bank for unbiased data.
This project is very meaningful. If we can better understand the hidden relationships between various socio-economic indicators and economic growth, it can help direct future policies and national investments in certain fields.

Area of Improvement:

Not sure how far into the future they are trying to predict GDP. Current GDP alone (plus some trend variable) could be a good enough estimate for next year's GDP, but the further into the future we look, the harder and more uncertain the prediction becomes. Could try framing the objective of the project more concretely.
The current task sounds general. It may be hard to come up with a high-accuracy model for a nation’s economy (especially for developing nations) over a longer time span, as the political and economic landscape can change drastically (without being reflected in the 1400+ numeric features in this dataset). Maybe focus on predicting something more specific (trade volume, % of education) and use GDP as one of the features.
It is highly unlikely that all these 1400+ features contribute linearly to the dependent variable, so some amount of non-linearity is desired. Running a heavy polynomial regression with many interactions could create too many features that overfit the training sample. Running lasso regression or feature selection could help with the issue. Also, could try out decision trees.
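
To show why lasso helps with feature selection, here is a small coordinate-descent sketch. It minimizes (1/(2n))·||y − Xw||² + alpha·||w||₁ with no intercept, so it assumes centered features and target. The two-feature toy data and the value of `alpha` are assumptions for illustration; a real project would more likely use a library implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Lasso via cyclic coordinate descent on centered data (no intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


# Toy data (assumed): y depends only on the first feature; the second is noise
# that is centered and orthogonal to the first.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [1.0, -2.0, 2.0, -2.0, 1.0]
X = [[a, b] for a, b in zip(x1, x2)]
y = [3.0 * a for a in x1]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
```

In this toy case the irrelevant feature's weight is driven to exactly zero, while the relevant weight is shrunk slightly below its true value of 3. That exact zero is what makes lasso usable as a feature selector.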

Good luck!

Final Review

The group tries to predict football game results and win money in sports betting. They have successfully preprocessed a long time series of teams and matches and used such data to build several models including SVM and tree-based models. Although the final model still could not make a profit, these models performed significantly better than the baseline model of betting the favored team.

3 things that interested me:

The preprocessing was done beautifully, with a rather complicated process of transforming team-based records to match-based records, dealing with new/vanishing teams, balancing older records from later ones. It is great to see how the team managed to finish all these and explained their reasons for making any major design decision.
The team identified a unique characteristic of this problem: predicting future data. Instead of a normal train-test-validate split, the team also sets aside data in the final years as a "future test set", and assesses the final model's performance on it. This aligns with what the model would have been used in reality and avoids any "future bias."
The team cleverly used the confidence generated by classification models as the amount to bet for each game. This utilization of confidence fits this specific task nicely, and the idea is well explained.
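
A minimal sketch of confidence-based bet sizing follows. The report's exact definitions of its betting strategies are not given in this review, so the "proportional" and "quadratic" rules here, the `max_stake` cap, the decimal-odds convention, and the toy games are all assumptions.

```python
def stake(confidence, strategy="proportional", max_stake=10.0):
    """Map the picked team's predicted win probability to a bet size.

    The probability is rescaled so 0.5 (a coin flip) means no bet and
    1.0 means the maximum stake.
    """
    edge = max(0.0, min(1.0, 2.0 * confidence - 1.0))
    if strategy == "proportional":
        return max_stake * edge
    if strategy == "quadratic":
        return max_stake * edge ** 2  # bets less unless confidence is high
    raise ValueError(f"unknown strategy: {strategy}")


def profit(stake_amount, decimal_odds, won):
    """Net profit of one bet at decimal odds (payout includes the stake)."""
    return stake_amount * (decimal_odds - 1.0) if won else -stake_amount


# Toy games (assumed): (model confidence, decimal odds, did the pick win?)
games = [(0.9, 1.5, True), (0.6, 2.0, False), (0.75, 1.8, True)]
total = sum(profit(stake(p), odds, won) for p, odds, won in games)
```

The quadratic rule concentrates money on the most confident picks, which lowers exposure on near coin-flip games at the cost of betting less overall.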

Things to work on:

The final report is not named as required, so it took me a while to figure out which PDF is the actual report. The final report also has some noticeable grammatical errors.
Some plots in the report are not very intuitive. For example, the linearSVC coefficients look very messy, and the initial network of teams is a good idea but the plot was too dense and hard to interpret.

Peer Review

This project is about analyzing country GDP and identifying features indicative of stimulating/restricting growth. The objective of the project is to find opportunities for growth and create informed advisory/decision for better budgeting. The data that will be used comes from the World Bank World Development Indicators.

I like the project first at a broad level because it feels very relevant/applicable to an actual real life project and something that economic consultants may be tasked with. The project idea also feels very grounded in purpose and the "why" part seems very fulfilled/answered.

Three areas for improvement:

Area of focus. What country GDP will you be considering? How many countries are present within the dataset? I think you may need to clarify your objective more precisely to a specific country perhaps. Will your goal be to find opportunities for growth for all countries that are present in the dataset, or just a select few? I think this distinction will drastically change your model approaches and would be important to determine.
What type of learning problem is this? Is it a forecasting one or an analysis? From reading the proposal it is actually a bit unclear what exact problem you intend to tackle. The initial project objective seems to indicate that you are looking to identify highly correlated GDP features, but the dataset section with how you intend to split the data seems like a forecasting problem.
What do you hope to be proving on the test set? This goes back to being a bit unclear on what type of learning problem this is. If you are hoping to find opportunities for GDP growth, what does the test set offer in terms of proving/disproving your hypothesized important features?

Peer Review by kxc4

This project is about economic growth in different countries. The goal is to gain information on how countries can manage resources to promote economic growth. The group will use data from the World Bank World Development Indicators including each country’s debt, health expenditure per capita, GINI index and more over a 60 year span from 1960-2020.

I like the idea of using this data to learn how different countries might be able to better manage their resources to improve their economic growth. This seems like a very applicable idea, but it may be difficult to implement in practice. The idea of handling missing data on a case-by-case basis for features is good, since it is likely that some features with many missing entries are unnecessary and can just be dropped. I also like your idea of looking at each individual country and attempting to predict the most important factors in its development.

With the missing data, I think it is important to consider how you will impute missing values for features that seem very relevant, as well as how you will handle a missing entry for a given year for a country. There may not be very many places where this occurs, but if it is not handled carefully it could drastically impact your models. I also think it may be a good idea to split your data into training and validation sets based on year, with a range of the most recent years (like 2005-2020) as your validation data. This will allow you to better evaluate how your model works for ranges of time it has not been trained on, giving a better idea of how it will generalize to predicting future economic growth.

Lastly, I think it may be important to consider how different countries' import/export needs vary, especially when attempting to predict the rankings of the most popular imports and exports. Predicting the specific goods or types of goods which countries import and export the most may be less important than the overall quantities of all their imports and exports, so maybe consider combining features on specific imports/exports into features for overall imports/exports. Overall, I think this will be an interesting project and look forward to seeing the final product. Good luck!
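
The year-based split suggested in this review can be sketched in a few lines. The record layout (a `year` key on each row) and the placeholder countries are assumptions; the 2005 cutoff is the example range mentioned above.

```python
def split_by_year(rows, first_validation_year):
    """Train on years before the cutoff and validate on the cutoff year onward."""
    train = [r for r in rows if r["year"] < first_validation_year]
    validation = [r for r in rows if r["year"] >= first_validation_year]
    return train, validation


# Hypothetical country-year records covering 1960-2020 for two countries.
rows = [{"country": c, "year": y} for c in ("A", "B") for y in range(1960, 2021)]
train, validation = split_by_year(rows, 2005)
```

Unlike a random split, no validation row is ever earlier than a training row, so the validation error reflects performance on years the model has not seen.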

Project Midterm Review

The project is about analyzing GDP growth and finding factors that may contribute to changes in GDP. The project uses a vast dataset covering the last 60 years, with appropriate features that may help achieve the group's goals.

The things I liked:

Like I mentioned, I think the dataset is massive and contains a lot of information that the group could do a lot of work on. I think this could help avoid neglecting any factors, which might otherwise lead to unwanted results or overfitting.
The visualizations are clear and easy to take in for someone viewing them for the first time. Also, the explanation is adequate, so we can understand the direction of the group's attempts.
Detailed explanation on prevention of overfitting/underfitting is an essential part of the data analysis and the group does this well.

Things for improvement:

I think there could be more visualizations, especially ones on what we have learned on the future. I think some of the points mentioned in the project could feel quite intuitive.
For the feature selection, I think a heat map showing the relationships and correlations between features would have been a better tool to use.
Hopefully, a more clear explanation on the direction the group is headed? Kinda seems a bit too simple and similar to the project itself.

Overall, I liked it a lot and see a lot of potential! Good luck

ORIE 4741 Peer Review

Dear Valued Employees,

I have read your proposal and find your request very intriguing. From my understanding you want to analyze particular countries' growth opportunities in order to discover ways that these institutions can optimize their yearly budgeting. By looking at GDP statistics from a range of countries that go back several decades you hope to gain insight into what covariates contribute the most to a country's growth and then once these features are known you can suggest changes to policy makers so that the budget can be more heavily weighted in those particular areas. The data you will be using to make this analysis is from The World Bank, an international financial institution that provides loans and grants to a range of different countries.

There are 3 aspects of your proposal that really inspired me. The first is the source of your data. The World Bank is the foremost authority when gathering information on the GDP of the nations of the world and if you do manage to extract information from the dataset, you know it is not because of initial data manipulation. In addition, I have always been a firm believer in the notion that if you're going to do something, do something that matters. The topic you choose to talk about is a very pressing one that if you could make a breakthrough in, would positively affect the lives of millions of individuals. The last aspect of this proposal that caught my eye was your method plan. It appears to me that you have a general idea of the path you want to take.

However, there are 3 facets of this project that give me hesitation. The first is the general outline of the dataset. This dataset is extremely big. Narrowing down covariates that will help you answer the proposed research question will not be an easy feat. You need to find a balance between covariates that are specific to certain areas of a budget, but broad enough so that they apply to many different nations. My second concern is the size of the problem you are trying to answer. Each country has very unique geographical, populational, political, etc. concerns. Many countries are extremely different! Trying to find optimizations of a country's GDP based on other countries may not work as well as one might expect. It will be very hard to train a classifier/ML algorithm that can generally suggest budgetary moves for a range of different countries. Finally, my last concern is the format of your proposal. I got the sense that you were telling me rather than asking my permission, which greatly offends me as your boss.

Overall there are a few aspects of this project that need more fleshing out but I could see myself in the future approving this proposal.

Sincerely,
Your Boss

Final Paper Peer Review

This group uses data from the Pro Football Archives and Odds Portal to predict score betting. They use SVCs with different kernels and three different values of C, and a gradient-boosted decision tree ensemble. Although their model is not yet a weapon of math destruction, as it doesn’t yet turn a profit, the analysis and model fitting are thorough and solid.
I like how the report illustrates what the odds and betting ideas are, since it is not a very familiar topic to me. The data processing graphs are easy to read. Also, they explain why the specific models were chosen.
However, there is still some room for improvement: it would be better if you could make the model results graphs easier to read and emphasize the information we are meant to draw from them. Also, the bias-variance tradeoff for each model could be analyzed a bit more from a model-building perspective. Each section could be numbered so that the paper is easier to read.

Midterm Peer Review

This project aims to develop a model that determines the influence of various factors on a country's GDP.

What I liked:

I liked how you analyzed how many feature vectors had missing values for each feature.
Subsetting the features to analyze how they might affect the relationship between the actual and predicted GDP values will be helpful in developing your final model.
I like that you are considering how much of the data is missing in determining which features to keep using. I also think your conclusion to use a smaller subset of features is an important step towards reducing overfitting.

Areas of Improvement:

I think it might be a good idea to try some feature engineering–perhaps the features you are deeming to be less important can be preprocessed in a way that makes them more influential on the model's predictions.
In addition to creating the graphs for your preliminary analyses, I suggest using a metric to evaluate exactly how much better one feature was at predicting GDP compared to the other feature.
Lastly, I would recommend trying some more complex models that will allow you to incorporate more features. This will allow you to make more precise predictions, but be careful of overfitting the data here.
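
As one possible metric for the second suggestion above, each feature can be scored by the R² of a one-variable least-squares fit against GDP, which for a single feature with an intercept equals the squared correlation. The function name and the toy numbers below are assumptions for illustration, not values from the report.

```python
def r_squared_single_feature(x, y):
    """R^2 of a simple least-squares fit y ~ a + b*x (equals squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature or target explains nothing
    return (sxy ** 2) / (sxx * syy)


# Toy data (assumed): feature_a tracks the target exactly, feature_b barely does.
gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_a = [2.0, 4.0, 6.0, 8.0, 10.0]
feature_b = [3.0, 1.0, 4.0, 1.0, 5.0]
r2_a = r_squared_single_feature(feature_a, gdp)
r2_b = r_squared_single_feature(feature_b, gdp)
```

Comparing the two numbers gives a concrete answer to "how much better" one feature is, though it only captures linear, one-at-a-time relationships.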

Peer Review from gz252

The project is about using a dataset to analyze the patterns of international trade and global economic development. A dataset from the World Bank is selected, including features such as Economic Policies, Financial Sector Performance, Gender, Health, Infrastructure, etc. The project aims to determine opportunities for growth for a country so a country can make better budgeting decisions.

Things I Like:

The dataset is from an official source, with the World Bank being one of the most trusted and global sources of world economic data
The project, if completed with high accuracy, would be extremely helpful for organizations and governments around the world to make predictions regarding their economic policies, thereby increasing the standard of living
The size of the dataset is very large, meaning that the produced model would be very generalized. This level of big messy data is very similar to a real-world data analysis project

Areas for Improvement:

The project aim was not stated very clearly in the proposal. It was stated in the opening paragraph that the project aims to help predict the development opportunities of countries but also said at the conclusion that it was to analyze world economic development patterns
Will it be better to train models for individual countries, since each country has a unique geographical and demographical environment, and economic development is highly associated with these factors?
It might be better for the machine to take an additional feature into account: human predictors' predictions about the economic development

Final Report Review

This project aims to predict the winner of an NFL match in order to develop profitable betting strategies. They tried various machine learning methods and developed a final model, although it doesn't make a profit.
Things I liked:

They discussed the variance of predictions clearly, it is an important part of model verification.
They tried different models, including SVM, XGBoost, CatBoost.
The visualization part is pretty good, it helps us to understand the performance of models.

Things to improve:

The explanation of the problem needs to be clearer, since the betting system is not so easy to understand.
Some formatting problems need to be fixed, such as the position of the plots.
You may want to consider other performance measures aside from accuracy.

Midterm Peer Review

Projecting GDP
==============

Summary
-------

This project is about predicting increases and 
decreases in the GDP of countries around the world. 
The team seeks to analyze factors from trading 
patterns to education in order to develop a metric 
by which a country's future GDP can be measured. 
The end goal of their analysis is to potentially 
help a country understand its assets and methods 
to improve its growth.

Things I liked
--------------

1. I liked that the introduction was concise, and 
described not only the analysis that was to be done 
but also its purpose. The importance of the questions 
being answered by the analysis was very evident in 
the language used, although it avoided being overly 
flowery so as to prevent the reader's eyes from 
glazing over.

2. I also liked the use of least squares along with 
the visualization that accompanied it. It clearly 
demonstrated to the reader the degree of inefficacy 
the features had in demonstrating trends in the data.

3. The separation of categories from each other was 
good to see, and made it easy to parse through exactly 
where the requirements for the midterm report were met.


Areas for improvement
---------------------

1. The report mentioned the fact that some of the data 
were missing, but didn't describe how the missing data 
were being dealt with outside of choosing variables with 
low percentages of missing data for the analysis.

2. It seemed like there was room for more elaboration. 
Although the report did seem thorough, a change in the 
overall scaling of the project could have made it seem 
more complete in order to fit the page limit.

3. There were no histograms of the dataset made on the 
report, despite the fact that they were suggested in the 
task description: "Include a few histograms or other 
descriptive statistics about the data."

Overall great job!

Project midterm report review

The goal of the project is to analyse which factors contribute to the GDP of a country. This is useful in order to identify patterns in economic growth and opportunities for growth. The data is from the World Bank and contains features such as trading patterns, infrastructure, education etc.

Good:

I like that you consider using subsets of your features to avoid overfitting
interesting data set
Nice plots that are not just basic

Questions/improvements:

Maybe it's just me, but I couldn't see the relationship you were describing in the first graph. Aren't there also high-population countries with a high rural percent and low GDP? In other words, I don't see the trend of lower rural population leading to lower GDP for high-population countries.
How did you choose the "relevance" for your feature subset 1 in the first preliminary analysis? Did you consider using regularization to choose features (e.g. Lasso)?
Some of your argumentation could be a bit more detailed and based on data. You could show some correlation plots and describe the features in a bit more detail.

Overall, good job
