Comments (5)
I agree we should keep track of the features from the beginning. I only included it as
"Decide on the format and categorisation (e.g. by year) to save our new dataset and what features to include"
but deciding it earlier makes more sense and is more in line with what we discussed in the last meeting. We could keep it in the other issue, as it is already there, but maybe add a reference to the problems we want to use them for, or group them accordingly.
from cs-insights-crawler.
Project Plan - WIP
Step 0: Preparation
- Read all 4 NLP Scholar blog posts
- Read the relevant papers
- Recall NLP lecture
- Create a project plan
Step 1: Complete the dataset
The NLP Scholar dataset only seems to contain the titles and no other textual content from the papers (i.e. abstracts). We want to complete the dataset by adding the abstract of each paper.
- Check the author's GitHub for a larger dataset and code; he might have useful information there.
- Download the offline AA dataset (https://www.aclweb.org/anthology/info/development/)
- Analyse the offline AA dataset for differences compared to the web version
- Analyse how we can access the abstracts for papers of different venues and how they are referenced
- Decide on the format and categorisation (e.g. by year) to save our new dataset
- Expand the dataset with abstracts
- Write the code to extract the abstracts and create the new dataset
- Save the PDFs of the accessed papers in a structured way (year/venue/...)
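The structured layout in the last point could be sketched like this. A minimal sketch: `save_paper`, the JSON metadata file, and the Anthology-style example ID are our assumptions, not part of the plan.

```python
import json
from pathlib import Path
from typing import Optional

def paper_dir(root: Path, year: int, venue: str, paper_id: str) -> Path:
    """Build the year/venue/paper_id directory for one paper."""
    return root / str(year) / venue / paper_id

def save_paper(root: Path, year: int, venue: str, paper_id: str,
               metadata: dict, pdf_bytes: Optional[bytes] = None) -> Path:
    """Write the metadata (and optionally the PDF) into the structured layout."""
    target = paper_dir(root, year, venue, paper_id)
    target.mkdir(parents=True, exist_ok=True)
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    if pdf_bytes is not None:
        (target / "paper.pdf").write_bytes(pdf_bytes)
    return target
```

Keeping metadata next to the PDF makes the dataset browsable by eye and easy to re-index later.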
Step 2: First look into the data
First, we should take a look at the data we have by analysing keywords and using tf-idf.
- Determine the top 20 words (unigrams) per conference
- Determine the top 20 bigrams per conference
- Do the same with tf-idf
- Compare our results with those of NLP Scholar
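The keyword counts and tf-idf scores above need nothing beyond the standard library. A minimal sketch (the tokenizer and function names are our choices):

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list:
    """Naive lowercase word tokenizer; real abstracts may need more care."""
    return re.findall(r"[a-z]+", text.lower())

def top_ngrams(docs: list, n: int = 1, k: int = 20) -> list:
    """Most frequent n-grams across a list of documents (e.g. one conference)."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

def tf_idf(docs: list) -> list:
    """Per-document tf-idf scores over unigrams."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency counts each word once per doc
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({w: (c / len(tokens)) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return scores
```

Words that appear in every document (like common stopwords) get a tf-idf of zero, which is exactly why tf-idf complements the raw top-20 counts.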
Step 3: Apply topic modelling
We want to gain further insights regarding the topics and how they change using topic modelling.
- Get topic modelling running on our data (https://radimrehurek.com/gensim/)
- Hyperparameter tuning (number of topics, words per topic)
  - Minimum 1 topic per conference
  - Play around with the granularity
- Do this for multiple years, maybe try to visualize it
Step 4: Core analysis (might change over time)
These are the things we want to analyse using the expanded dataset. They might change over time.
In general, we might want to look into the things NLP Scholar looked into, for comparison.
- Most used words (see Step 2)
  - By conference
  - By author (first author or in author list)
  - By institution
  - Over time (10 years, percentage-based plot)
  - Also check for bigrams
- Most studied topics (see Step 3)
  - By conference
  - By author (first author or in author list)
  - By institution
  - Over time (10 years)
  - Measured by citations
- Number of publications of top-k authors
  - (first author/in author list) comparison
- Map different identifiers for conferences together
- Compare our results with those of NLP Scholar
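For mapping different conference identifiers together, one simple approach is a hand-maintained alias table; the aliases below are illustrative, not the dataset's actual spellings:

```python
# Hypothetical alias table: every known spelling maps to one canonical ID.
VENUE_ALIASES = {
    "acl": "ACL",
    "annual meeting of the association for computational linguistics": "ACL",
    "emnlp": "EMNLP",
    "conference on empirical methods in natural language processing": "EMNLP",
}

def normalize_venue(name: str) -> str:
    """Map different spellings of a conference name to one identifier.

    Unknown names pass through unchanged so they can be reviewed later.
    """
    key = name.strip().lower()
    return VENUE_ALIASES.get(key, name.strip())
```

Unmatched names passing through unchanged makes it easy to log them and grow the table incrementally.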
Semantic aspects
Step 5: Create the first embeddings
Now that we have the dataset we want to create the first embeddings for some papers.
- Decide on the model for our first analysis
  - fastText (GloVe or word2vec)
- Train the model
- Decide how we want to save the embeddings (in the dataset)
- Create the embeddings for the papers (of one venue in one year, only using abstracts)
- Save the embeddings
Step 6: Create visualization methods for the embeddings
With the embeddings we can now compare and visualize them.
- Write a function that determines the most similar and dissimilar papers
  - Check if the results make sense
- Write a function that visualizes the embeddings in a 2D space using UMAP
  - Check if the results make sense
  - Check if there are clusters and, if yes, determine manually what they could mean
  - Figure out a way to name those clusters automatically (topic modelling?)
- Determine how long one visualization takes and decide further steps based on this
- Save the visualizations
- Determine whether the results are usable and how to proceed (is the model good/bad?)
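The most-similar/most-dissimilar function could start as a brute-force cosine comparison over all pairs; a sketch (for the full dataset a vectorised version would be needed):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_and_least_similar(embeddings: dict):
    """Return the (most similar, most dissimilar) paper-ID pairs.

    embeddings maps paper ID -> embedding vector.
    """
    ids = list(embeddings)
    pairs = [(i, j) for idx, i in enumerate(ids) for j in ids[idx + 1:]]
    scored = sorted(pairs, key=lambda p: cosine(embeddings[p[0]], embeddings[p[1]]))
    return scored[-1], scored[0]
```

Eyeballing the top pairs this returns is a quick sanity check on whether the embeddings capture anything meaningful before investing in visualizations.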
Step 7: Analyse the papers
Now that we have usable results from the previous step, we want to analyse the papers and compare them.
- Create the remaining embeddings for all top-tier conference papers published (2010-2020)
- Run the visualizations for each top-tier conference for each year (2010-2020)
- Compare how the topics/clusters shift over time and between venues
  - Do the clusters differ between venues?
  - Are there trend-setters among the venues?
  - How do they shift over time?
- Expand the function using UMAP to color code different venues (all venues in one visualization)
- Run the visualizations again for 2010-2020
- Compare results with previous analysis and NLP scholar
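One way to quantify how topic distributions shift over time or between venues is the Jensen-Shannon divergence; this metric is our suggestion, not something the plan prescribes:

```python
import math

def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2) between two topic distributions.

    p and q map topic -> probability; 0 means identical, 1 means disjoint.
    """
    topics = set(p) | set(q)
    m = {t: (p.get(t, 0) + q.get(t, 0)) / 2 for t in topics}

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a.get(t, 0) * math.log2(a.get(t, 0) / b[t])
                   for t in topics if a.get(t, 0) > 0)

    return (kl(p, m) + kl(q, m)) / 2
```

Computing this between consecutive years per venue would turn the "how do they shift" question into a single trend line per conference.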
Step X: Additional analysis (if we have time)
We can also run the same analysis again, but switch some things up
- Create embeddings with different features: only from titles, abstracts, or both (whatever we did not do yet)
- Use a different model to create the embeddings (step 3B)
- Visualize papers in a 2D space differently (t-SNE, PCA, ...) and implement corresponding functions
- Visualize/analyse additional things
Notes: Further ideas (if we have time)
- Determine the most prominent topic per venue/year (automatically?)
  - Compare the results with the analysis of the most occurring title words
  - Do the same for the top-k authors
- Use embeddings to see research focus of certain countries/institutions
- Check if the embedding for "foreign language" is biased (depending on the model)
- Feed GPT-3 the titles/abstracts of the papers per year/venue and see what it generates
I liked Terry's idea. I just want to add that it makes sense to think in terms of problems first and then decide features that we need to solve the problem.
For example, we want to find out which authors published what kind of papers at which conference (I formulated this as generally as possible so you can come up with more ideas and narrow it down to your preferences). To address this question, you would need to derive essential features about the author (e.g., name, affiliation, institution, area), the paper (e.g., the topic derived from abstracts/tags/keywords/title), and possibly the venue.
We just want to make sure not to come up with features that do not serve a need.
I would suggest we keep this list of features together with their application (here or in the wiki) so we can review and adjust it when needed. What do you think?
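The "features together with their application" list could also live in code as a lightweight schema; a sketch with illustrative field names (the actual feature set is still to be decided):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Paper:
    """Candidate features for one paper; field names are illustrative."""
    title: str
    authors: List[str]
    venue: str
    year: int
    abstract: Optional[str] = None      # to be filled in Step 1
    affiliations: List[str] = field(default_factory=list)
    area: Optional[str] = None          # still marked "(?)" in the list below
```

Optional fields with defaults let the dataset grow feature by feature without breaking code that only needs the core ones.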
Great progress Lennart! 🚀
If you want us to go over the points and give comments, let me know. I liked what I have seen so far. 🤗
PS: Don't forget to also rest on the weekends.
Great first step, we are on the right track.
I would also add/keep track of all the feature characteristics we are looking for.
- paper title (ok)
- authors
- institutions
- affiliation
- abstract
- venue
- year
- area (?)
- ... (I just saw in the other issue we have an initial list)
Related Issues (20)
- Automatic upload to Zenodo
- Expand use of --s2_filter_pubmed, --s2_filter_arxiv, --s2_filter_pubmedcentral
- Add test configuration
- Add CSO annotations to release
- Link Scopus and Web of Science to D3
- Total number of works is not equivalent to count of papers.
- Extract call for papers from venue page
- DBLP Client, Processor, Backend Client
- Implement DBLP Client
- Implement automated storing to db/backend
- Implement Processor class
- Add automatic documentation and hosting on GitHub pages
- Add Dockerfile and docker-compose for grobid and project
- Umlauts in author and conference names
- Expand use of --s2_use_tldrs, --s2_use_citations, --s2_use_embeddings
- Match venue names
- Add pep8-naming
- Dataset Release v2.0
- Fix using all entries in export
- Remove paperAbstracts from non open access papers in zenodo