meiqi-wu / vaccine-discovery

vaccine-discovery's Issues

Project Proposal Review

This project is about Covid-19 and the proteins within it that cause an immune response in the body. The goal of this project is to predict the protein sequence of B cell epitopes, which could be very helpful in the production of a vaccine. To do this, the group is using a dataset containing all combinations of over 14,000 epitopes and 757 proteins, along with several features for each, to aid in predicting a person’s epitope performance.

I think this is a very well thought out proposal. I think this is a phenomenal idea as I have not had much knowledge about the production of the Covid vaccine so it was very fascinating to read. I also appreciated the list of features that are in the dataset, which provided me with some sort of idea about what could possibly be used to form an accurate model. They also provided a solid overview of the effects that Covid has had on the world in the past year which shows how important it actually is for there to be an effective vaccine.

I do think there is some room for improvement. For example, this proposal is touching upon some complex ideas in terms of the vocabulary and overall chemistry that it is investigating, so it could be written in a way that may be easier for the audience to understand. I also think it may be helpful to have more detailed descriptions of the features, and a more elaborate explanation of how medical researchers could use the data to create a vaccine which targets the epitope regions.

Peer Review 1

The proposal is well written and the goal of the project is to identify the B Cell epitopes to further predict their protein sequence for the design and development of vaccines. The dataset they found at https://www.kaggle.com/futurecorporation/epitope-prediction?select=input_covid.csv is well-explained and appropriate to solve the problem. Their objectives are to uncover the features of COVID-19’s antigen protein and discover the significant regions that a COVID-19 vaccine can target.

I find the topic very interesting as 2020 is such an unexpected year because of COVID-19. It is great that the proposal provides a detailed explanation of the data columns in the dataset and describes how it helps answer their questions. Moreover, they have done an excellent job of presenting the proposal to an audience with no background knowledge.

In terms of areas of improvement, it would be useful to consider which features are most important for the model to make a prediction. It is also crucial to ensure that the model is not overfitting and think about ways to deal with noisy data. Furthermore, it is very important to make sure that the rest of the project is easily understood by all audiences including outsiders of the field.

Final project peer review (alh323)

This project seeks to analyze the protein sequence of COVID-19 so as to be able to then classify what could give off an immune response, which scientists could then use to develop a vaccine. They applied exploratory data analysis, k-fold cross validation, logistic regression, random forest, gradient boosted trees, and dense neural networks. This project was very well thought out, was well written and explained, applied a lot of different methods and analysis that we learned in class, and was a great job.

First of all, I really appreciated the cell biology introduction. As someone with absolutely no experience/knowledge of any cell biology, it was very helpful to have the glossary to turn to. This was incredibly useful. I also really liked how they listed and explained what each feature they used was; this increased my understanding of your visualizations and analysis a ton. And also, in your conclusion on logistic regression, you include the phrase "from a biological standpoint does not make sense" which I thought was amazing! I feel like so often in data analysis/data science/statistics, people focus too much on the raw numbers and p-values and graphs, and don't actually think about the significance and meaning of the numbers in application. The fact you were able to figure out that the underperformance of the logistic regression may have something to do with how it was using the data, because it doesn't make sense biologically, is outstanding!

Some of the labels on your graphs and text in tables were a bit small, and were hard to read at times. Just keeping some consistency in font size would make a huge difference. Additionally, half of the report was written in full justified text and half was left justified, so just keep an eye out for that in the future. Last, this group explained each of the methods in depth, which is important and would definitely be necessary in a publication, but for the purposes of this project with an 8 page limit, I think a lot of the text spent in just explaining the method itself could've been spent on a deeper discussion on their findings.

Overall, this project was incredibly well done though.

Peer Review | zp45

Development of a vaccine for COVID-19 is an exhausting endeavor that many medical scientists around the world are currently undertaking. Development of said vaccine is vital in the battle against the pandemic, and any efforts to speed up this process are worth attempting. This project examines a data set of COVID’s protein sequences, which they will parse through to find epitopes (sites in the protein sequence that could trigger an immune response). Their goal is to predict whether a certain epitope will trigger this response. The data set in question contains all combinations of a total of 14362 epitopes and 757 proteins, as well as detailed features of both the epitopes and proteins to examine.

What I like: The project proposal is very well structured, and answers each of the questions thoroughly. Second, I appreciate how relevant their project is. Because of its connection to current events, I really hope that they are successful in their project and come across some ground breaking insights. Finally, I appreciate how they explained technical terms. After reading their proposal, I had a really solid understanding of the mechanisms through which a vaccine could be developed as well as the biological indicators that vaccine developers look for.

What could be improved: One thing that can be improved upon is the occasional grammar mistakes throughout the proposal that sometimes detract from the flow and intent. Secondly, when they went into details about what specific traits the dataset catalogued, I wish they provided some insights into what exactly the features entailed. I felt like a lot of these explanations were very surface-level (what is antigenicity?). Finally, it would have been cool if they had an idea/talked about what specific learning techniques they were planning to employ in analyzing the data set.

Project Proposal Peer Review (sah346)

This project is about COVID's antigen properties. The objective is to determine a whole antigen and a target sequence of a protein that may be used to create a vaccine for COVID-19. The data set used is from Kaggle, and it includes peptides injected into a cell, the parent's protein, and whether an antibody was produced, among other factors.

I like that this training data set is large. There are over 14,000 training rows, which should allow for high accuracy and predictability of the model. I also like the importance of the project: it could be used to create a vaccine for COVID-19, which is very relevant and important to the world right now. I also like how this proposal explains the necessary biology, allowing me to understand the proposal even though it is not my chosen field.

One area of improvement I would suggest is explaining how training on your training data will extrapolate to COVID data, as the training data is not from COVID patients. Can this be extrapolated, or is using a different data set important? It is not clear to me from this proposal that the data set given will help you answer your question. Another suggestion I have is that, here, whether an antibody was produced is a boolean. However, the number of antibodies produced is important to vaccine creation. I would endeavor to find a data set where antibody production is an ordinal. My last suggestion is that this proposal suggests trying to predict two things. I worry that may be too ambitious. If so, you can instead try to predict just the target sequence or just the parent protein that will create antibodies.

Midterm Report Peer Review

The project aims to predict the presence of a certain antibody for combinations of different features (namely peptides combined with parent proteins). The overall process followed and the steps taken to treat the data are clearly explained and shown in the report. It is a great idea with a clear amount of work behind it, and the report demonstrates that.

Three positive aspects of the project are:

Steps taken in data preprocessing are explained in a concise and clear way. It helps with following the technical procedure and thought process that went on behind the scenes.
Very clear and insightful full EDA. Plots help understand the relevance of each feature regarding the target label and if I were to be more knowledgeable on the topic, I am sure it would provide great amounts of information.
It is clear that a lot of time has been invested into the project, as demonstrated by the fact that they have tried various models and listed why they think they are not suitable for the current problem and dataset.

Three possible areas of improvement are:

I would make sure to include an introductory paragraph with an explanation of variables, what they mean, and how they relate to each other. I found that in the project proposal but I think it would also be helpful for people unfamiliar with the subject (such as myself) to have a 1 paragraph introduction to the project terminology and goals.
I missed conclusions drawn from the provided plots. What do they tell us and what do the different data distributions imply? Some of these questions were addressed for the first group of plots but not for the second one (or at least not fully). It might be that I am misunderstanding it but I think some more explanation regarding the plots' insights would be helpful.
One final thing that could be improved is some perspective on the findings. I think that the model building and information obtained are there and it was done exhaustively and with good results. However, if in Future Steps, there were some explanations on how this might be used or the possible applications of this model I think it could be helpful.

In conclusion, I think this project was tackled in a very profound manner, and the method that was followed helps reach the desired conclusions. Although it might be hard to understand for people not familiar with the field, I struggled to find any technical or model-related problems with the approach taken by this project group.

Good luck!

Nico

Midterm report peer review

The goal of this project is to find the parts of antigen protein in COVID-19. The authors use two datasets (input_bcell.csv and input_sars.csv) to train their model, in which the rows are peptides, and columns are features of peptides. The output of the model is the target column. They aim to use this model to predict the label for peptides in file input_covid.csv.
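
The data flow described above (two labeled training files, one unlabeled COVID file) can be sketched as follows. This is only an illustration under stated assumptions: the inline CSV strings are made-up stand-ins for input_bcell.csv, input_sars.csv and input_covid.csv, only two feature columns are shown, and the helper names read_rows and split_features_target are hypothetical rather than taken from the authors' notebook.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts, one per peptide."""
    return list(csv.DictReader(io.StringIO(text)))


def split_features_target(rows, target="target"):
    """Separate numeric feature dicts from the 0/1 label column."""
    features = [
        {name: float(value) for name, value in row.items() if name != target}
        for row in rows
    ]
    labels = [int(row[target]) for row in rows]
    return features, labels


# Made-up stand-ins for input_bcell.csv and input_sars.csv (labeled).
bcell_csv = "parker,chou_fasman,target\n1.2,0.9,1\n0.4,1.1,0\n"
sars_csv = "parker,chou_fasman,target\n2.0,1.0,1\n"
# Made-up stand-in for input_covid.csv: same features, no target column.
covid_csv = "parker,chou_fasman\n1.5,0.95\n"

# Train on the union of both labeled files.
train_rows = read_rows(bcell_csv) + read_rows(sars_csv)
X_train, y_train = split_features_target(train_rows)

# The COVID rows carry features only; the model would predict their labels.
X_covid = [
    {name: float(value) for name, value in row.items()}
    for row in read_rows(covid_csv)
]
```

With real files, the inline strings would be replaced by the file contents; the split between labeled training rows and unlabeled COVID rows stays the same.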

What I love about this project

I love that you have neat and beautiful plots of your data. In addition, you also incorporated some statistical analysis of what's significantly different between features of two labels.
You did a brilliant job of using the correlation matrix of your features to check their independence.
I also like that you used the regularizer and plotted the error by the coefficient of the regularizer.

What I think could be improved about this project

You did a great job on describing your dataset. How does it help you to build your model?
Apart from the fact that the task is 1/0 labeling, it is not clear to me why you chose logistic regression and random forest, or the models you plan to try in the future. What about the dataset and/or your problem makes the models of your choice a good fit?
It would be helpful if you could use less jargon in your report, given that your audience (e.g., me) may not have a background in biology and may find it difficult to follow.

Midterm Peer Review

Covid-19 Vaccine Discovery
The goal of the project is to predict whether peptides taken from parent proteins and applied to a B cell culture will introduce an antibody on the surface of the B cell.
The team used the COVID-19/SARS B-cell prediction dataset. The dataset contains features of the peptide and a label indicating whether an antibody is introduced on the surface of the B cell.

Things I like
The team used multiple models including logistic regression and random forests and compared the performance of the models. They explained why they chose the models instead of the linear regression model.
The description of the dataset and goal is detailed and easy to understand. They explained the complex dataset by listing all the features and with enough detail including the existence of missing value.
Data visualizations of distributions of features and comparison between different features helped the readers understand the model and dataset.

Ways to improve
The data visualizations labels are difficult to understand. Including more details on the labels and scale units will help readers understand the graphs better.
It is difficult to understand what they were explaining by “We used $l2$ regularization with regularizer values in the set {$0.00001\times4^i : i=0,1,2,...,15$}, and then plot the training error and test error concerning the regularizer value as follows.” Including more details on this sentence would help the readers understand the regularizer values.
Why are you planning to use CNN? Including the reason for the choice of the model will help clarify.
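
To make the quoted regularizer set more concrete, here is a minimal sketch of what the values 0.00001 × 4^i for i = 0, ..., 15 look like and how an l2 penalty scales with them. It assumes the common convention where the penalty is the regularizer value times the sum of squared weights; the report may use a different convention (for example an inverse strength), and the weights below are hypothetical.

```python
def regularizer_grid(base=1e-5, ratio=4, count=16):
    """Geometric grid: base * ratio**i for i = 0 .. count - 1."""
    return [base * ratio**i for i in range(count)]


def l2_penalty(weights, lam):
    """l2 penalty term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


grid = regularizer_grid()
weights = [0.5, -2.0, 1.0]  # hypothetical model coefficients

# Show the smallest, a middle, and the largest regularizer value.
for lam in (grid[0], grid[8], grid[-1]):
    print(f"lambda={lam:.5g}  penalty={l2_penalty(weights, lam):.5g}")
```

Each value is four times the previous one, so the sixteen values span roughly nine orders of magnitude, from 0.00001 to about 10737. Plotting training and test error against such a grid is usually done on a logarithmic x-axis.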

Final Report Peer Review - oj33

This project is about identifying antigen proteins in COVID and predicting their protein sequences that would trigger an immune response, thus being useful for vaccine development. They use two datasets on antibody activity to gain a more complete understanding of antibody activity. They also have a dataset on covid which contains only features which they will predict antibody affinity from.

I liked that you used a correlation matrix to inform the type of classification that you used. Your analysis of which tree-based method to use was extremely easy to follow and very interesting. Finally, I appreciated that you mentioned an applicable drawback of black box analysis tools and how they permeate through topics as serious as COVID.

One improvement I would suggest is to better explain why you used Cross Validation. Additionally I think that the whole project was well written, but it was dense at times. This could be remedied by spending less time on explaining the equations used and more on analysis on visual outcomes of different models you used. Finally, some of the figures were very small, especially the random forest and gradient boosted trees figures, and it made it seem a bit hard to relate the writing to the visuals because they were small.
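
For readers wondering what cross validation does mechanically, the following is a minimal sketch of k-fold splitting, not the authors' implementation. The function name k_fold_indices and the fixed seed are illustrative assumptions. Each sample lands in the validation set exactly once, so every model is scored on data it was not trained on.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print(sorted(validation), "held out;", len(train), "used for training")
```

Averaging the validation error over the k folds gives a less noisy estimate of out-of-sample error than a single train/test split, which is the usual reason for using it when comparing models.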

Overall, I thought this report was extremely high quality. Thank you for shedding light on the applications of what we've learned in class for the world's most pressing issue.

Final Peer Review - rz98

The vaccine discovery project attempted to determine whether they could predict COVID-19 antibody valence to help biologists find segments of the virus that creates antibodies in humans. The group used antibody valence data from B cell proteins and SARS virus proteins, including factors such as amino acid sequences, accessible surface areas, etc.
The group analyzed the data to create comparisons of isoelectric point distributions and related features, which influenced the types of models they decided to build. The group used models such as logistic regression, random forest, and dense neural network to predict valid protein antibody valences. By doing so, the group was able to calculate 2700 COVID protein sequences to investigate that could result in a possible vaccine.

Three things I liked about the project are:

The project is both relevant to current events and very much humanitarian. The group was able to take a real life problem and produce results that could contribute to solving the current COVID crisis, namely identifying potential protein segments and determining important protein features related to determining antibody valence.
The very in depth explanation of each of the methods we did not cover in class was very much appreciated. The paper walks through the different biology terms we needed to know as background knowledge, mathematical formulas used in each of the models for clarification, as well as detailed explanations of models such as random forest, gradient boosting, and dense neural networks.
Clearly acknowledging the extent of the modeling done and covering potential future directions to take the research. The paper notes that the models could have been improved by using more complex features and avoiding the black box nature of certain models, but also explains the difficulty in producing complex features and modeling may be beyond the scope of the course.

Three areas I think could be improved:

The project avoided looking into bias and fairness in the models. There could have been discussion of whether it was fair to train the models using the data from B cell proteins and SARS virus proteins, or whether doing so introduced bias in some way.
Given that determining important features was a project goal, what is the group consensus on the result? The logistic regression model clearly valued start and end position way above all other features, while the other models valued other features more highly. Should the most accurate model (Random Forest) be taken as truth, even though it is more black-box?
The paper could have gone more in depth about how they settled on their hyperparameters in models such as Random Forest, Gradient Boosting, and Neural Networks. The paper mentioned that grid search was used to determine tree height in Random Forest, but doesn't mention any other hyperparameters such as splitting conditions, number of samples, hidden layer size, etc.
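
As an illustration of the hyperparameter point, grid search extends naturally from one hyperparameter (tree height) to several at once. The sketch below is generic: grid_search and toy_score are hypothetical names, and toy_score is a placeholder that stands in for a real cross-validated accuracy, which the paper would have to supply.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder score that peaks at max_depth=8, min_samples_split=4.

    In a real experiment this would train a model with `params` and
    return its cross-validated accuracy.
    """
    return -abs(params["max_depth"] - 8) - abs(params["min_samples_split"] - 4)


param_grid = {"max_depth": [2, 4, 8, 16], "min_samples_split": [2, 4, 8]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

Reporting the searched grid alongside the chosen values would answer most of the question raised above about how the hyperparameters were settled.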

Final Project Review (jc2473)

This project attempts to discover the parts of the virus most likely to lend itself towards vaccination trials using statistical learning models.

Three things I liked about the project:

First of all, I really like the general outline of the proposal. From introduction to conclusion, it is a one cohesive piece with logical reasonings behind each step of the experiment. As a reader, it is very easy to follow through the work and understand the various approaches and analysis the group has taken.
I really liked how the group explored four different models for the analysis and came up with an 85% accurate model. It really speaks to the in-depth work they have done.
Lastly, intertwining the project with COVID data was extremely relatable. Since the pandemic is such an impending crisis to everyone currently, analyzing the COVID data was brilliant.

Three areas to improve:

Personally, I think this project is perfect and does not have any room for improvement. However, since the COVID vaccine is starting, it would be great if you could test out data from the COVID vaccine to see whether your model can make a true impact.
As mentioned in the conclusion, since the model is a black box by nature, it would be great if there were a way for the model to explain its calculation to biologists.
Lastly, to improve model accuracy, adding more features and experimenting with various feature transformations might have been helpful.

Peer Review (zl285)

This project aims to identify the B cell epitopes in COVID’s antigen protein that stimulate an immune response in the body and to predict their protein sequence, which is useful in the design and development of vaccines. The researchers want to use machine learning tools instead of the traditional medical perspectives to uncover the features of COVID-19’s antigen protein that stimulate an immune response in the body, and therefore to discover the significant regions that a COVID-19 vaccine can target. Their data, acquired from Kaggle, contains features of the peptide and its parent protein, with the label of whether the antibody is introduced on the surface of the B cell as the response variable.

There are many things I really like about this proposal. First of all, it is elaborately structured and the project’s goal and methodology are clearly demonstrated. Secondly, the question the project is focusing on is highly related to what is going on all over the world at this moment, the pandemic. This project aims to help contain COVID-19 through the researchers’ effort, which I really appreciate. Thirdly, the target dataset is highly related to the topic, and has enough features and observations to support a presumably indicative result.

There are also a few aspects that I am a bit concerned about after reading the proposal. First, the machine learning tools that the researchers are going to utilize are not specified; considering the size of the data and the abundance of features, the researchers should be careful in picking algorithms and controlling the running complexity. Secondly, the question is how to prove that the prediction is correct. Even if we can test the model’s performance with machine learning evaluation methods, the effect of the epitopes, in reality, is still to be tested. Up till now, there have not been any COVID-19 vaccine projects approved to be put into use, but only 10 still in Phase 3. The third issue is somewhat related to the second one: we are conducting research from a completely data-driven perspective, so will it provide valid results? Should this project be co-oriented by medical knowledge as well, considering the seriousness and ethics of medical research?

Final Project Peer Review

The objective of this project is to analyze the protein sequence of covid 19 and attempt to find which could possibly be used in a successful vaccine. This project is written extremely well.

First, I really liked and also appreciated the beginning section on cell biology. As I do not have much knowledge in this area, this was very helpful in understanding the foundation that this project is based on and would be very helpful to any reader who does not know much cell biology. Another part of this project that I really like is the intuition used in their analysis of the random forest/gradient boosting models. They comment on the false-positive rates and how, for this specific example, a false positive is more costly in trying to detect useful peptides for a vaccine. I think this kind of analysis displays this group’s great understanding of what they are actually working with. This group also provided a thorough discussion on whether or not they could have produced a “Weapon of Math Destruction”, which I enjoyed.
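
To make the false-positive discussion concrete, the rate in question can be computed from the four confusion-matrix counts. The labels and predictions below are made up for illustration and are not the project's results; the function names are likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def false_positive_rate(y_true, y_pred):
    """Fraction of true negatives that the model wrongly flags as positive."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn)


# Made-up labels: 1 = peptide induces an antibody, 0 = it does not.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))     # (2, 1, 4, 1)
print(false_positive_rate(y_true, y_pred))  # 0.2
```

Here one of the five non-inducing peptides is flagged as a candidate, so a lab following these predictions would spend effort on one peptide that cannot work. That is the cost the report weighs against missed candidates.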

In my opinion this is a really great project; however, I do see some room for improvement. For example, this group seemed to put a lot of effort into providing descriptions of the different models they used and how these models work. Of course this was helpful in understanding the project, but I do think they could have toned the generic descriptions back and put slightly more time into providing more analysis of the models on their data. Similarly, a portion of the visuals and equations that are used aid in understanding the algorithms, e.g., the visuals for the random forest and gradient boosting tree models, which again provide helpful insight. However, I think it could improve the project if they included more visuals of their models and the results.

Midterm Peer Review

Things I like:

The visualizations in the Exploratory Data Analysis section are clear and easy to read and give good insight into what the data looks like and why various features are relevant to the goal at hand.
I like that the group considered the random forest model, which shows that various models have been considered.
The section detailing data processing is clear and it makes sense why certain features were dropped.

Things to Improve:

It would be helpful to understand why the group chose to use logistic regression as opposed to other models.
The report is very heavy on vaccine-related jargon, and it would be helpful to understand what the various terms mean (e.g., chou_fasman and kolaskar_tongaonkar, since these are important features in the model).
It would be helpful to have a section in the report detailing the project goals and ultimately why the project is useful/interesting.

Midterm Report Peer Review

I found the overall report to be satisfactory and a great start to the project. There are some conclusions that I found questionable, but overall the analyses appear to be logical and consistent with the classification problem you are facing. The content was not too sparse and not too overwhelming either.

Praise

You gave a clear picture of the data and what your main objective is. I liked how you presented the features in a clear and concise manner as well as justification for why certain features were left out.
Your choice of models seems to be on the right track and make sense given the context of your problem (classification). Your out of sample and in sample errors are not too big and do not vary much in magnitude, so I think you are using the right tools.
I thought highlighting each distribution relative to the features was quite illuminating, especially comparing between target and non-target distributions.

Critiques

If I am understanding your correlation matrix correctly, you conclude that your features are not linearly correlated. However, we can see that kolaskar_tongaonkar is moderately negatively correlated with parker, with $$r$$ around $$-0.6$$. Furthermore, we can see that parker and chou_fasman are moderately correlated as well. Is the claim that all of your variables are not linearly correlated necessarily accurate?
Is it correct to disregard the logistic model completely due to negative variance? Your negative variance is quite small in magnitude, so it may be easily fixable. See this post for an explanation on what may be going on!
Better explanation of the scales for your graphs would be beneficial. This may or may not be an issue if we are to assume your target audience is well versed in the topic you are reviewing. However, for lay people, it may be helpful to have more illuminating bar graphs as well as other types of graphs as well.
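
On the correlation critique, it may help to recall that the entries of a correlation matrix are Pearson coefficients, which lie between -1 and 1, whereas their squares are never negative. The sketch below computes the coefficient from its definition on made-up numbers; the two lists are hypothetical stand-ins for the parker and kolaskar_tongaonkar columns, not values from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up feature values with a moderate negative relationship.
parker = [1.0, 2.0, 3.0, 4.0, 5.0]
kolaskar_tongaonkar = [4.0, 3.0, 2.0, 4.0, 0.0]

r = pearson_r(parker, kolaskar_tongaonkar)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

A coefficient of this size is usually described as moderate, which is why a blanket claim of no linear correlation between features deserves a second look.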