Comments (5)
I agree we should keep track of the features from the beginning. I only included it as
"Decide on the format and categorisation (e.g. by year) to save our new dataset and what features to include"
but deciding it earlier makes more sense and is more in line with what we discussed in the last meeting. We could keep it in the other issue, as it is already there, but maybe add a reference to the problems we want to use them for, or group them accordingly.
from cs-insights-crawler.
Project Plan - WIP
Step 0: Preparation
- Read all 4 NLP Scholar blog posts
- Read the relevant papers
- Recall NLP lecture
- Create a project plan
Step 1: Complete the dataset
The NLP Scholar dataset only seems to contain the titles and no other textual content from the papers (i.e. abstracts). We want to complete the dataset by adding the abstract of each paper.
- Check the author's GitHub for a larger dataset and code; he might have useful information there.
- Download the offline AA dataset (https://www.aclweb.org/anthology/info/development/)
- Analyse the offline AA dataset for differences compared to the web version
- Analyse how we can access the abstracts for papers of different venues and how they are referenced
- Decide on the format and categorisation (e.g. by year) to save our new dataset
- Expand the dataset with abstracts
- Write the code to extract the abstracts and create the new dataset
- Save the PDFs of the accessed papers in a structured way (year/venue/...)
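The structured layout in the last point could be sketched like this. A minimal sketch: `save_paper`, the JSON metadata file, and the Anthology-style example ID are our assumptions, not part of the plan.

```python
import json
from pathlib import Path
from typing import Optional

def paper_dir(root: Path, year: int, venue: str, paper_id: str) -> Path:
    """Build the year/venue/paper_id directory for one paper."""
    return root / str(year) / venue / paper_id

def save_paper(root: Path, year: int, venue: str, paper_id: str,
               metadata: dict, pdf_bytes: Optional[bytes] = None) -> Path:
    """Write the metadata (and optionally the PDF) into the structured layout."""
    target = paper_dir(root, year, venue, paper_id)
    target.mkdir(parents=True, exist_ok=True)
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    if pdf_bytes is not None:
        (target / "paper.pdf").write_bytes(pdf_bytes)
    return target
```

Keeping metadata next to the PDF makes the dataset browsable by eye and easy to re-index later.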
Step 2: First look into the data
First, we should take a look at the data we have by analysing keywords and using tf-idf.
- Determine the top 20 words (unigrams) per conference
- Determine the top 20 bigrams per conference
- Do the same with tf-idf
- Compare our results with those of NLP Scholar
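The keyword counts and tf-idf scores above need nothing beyond the standard library. A minimal sketch (the tokenizer and function names are our choices):

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list:
    """Naive lowercase word tokenizer; real abstracts may need more care."""
    return re.findall(r"[a-z]+", text.lower())

def top_ngrams(docs: list, n: int = 1, k: int = 20) -> list:
    """Most frequent n-grams across a list of documents (e.g. one conference)."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

def tf_idf(docs: list) -> list:
    """Per-document tf-idf scores over unigrams."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency counts each word once per doc
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({w: (c / len(tokens)) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return scores
```

Words that appear in every document (like common stopwords) get a tf-idf of zero, which is exactly why tf-idf complements the raw top-20 counts.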
Step 3: Apply topic modelling
We want to gain further insights regarding the topics and how they change using topic modelling.
- Get topic modelling running on our data (https://radimrehurek.com/gensim/)
- Hyperparameter tuning (number of topics, words per topic)
  - Minimum 1 topic per conference
  - Play around with the granularity
- Do this for multiple years, maybe try to visualize it
Step 4: Core analysis (might change over time)
These are the things we want to analyse using the expanded dataset. They might change over time.
In general, we might want to look into the things NLP Scholar looked into, for comparison.
- Most used words (see Step 2)
  - By conference
  - By author (first author or in author list)
  - By institution
  - Over time (10 years, percentage-based plot)
  - Also check for bigrams
- Most studied topics (see Step 3)
  - By conference
  - By author (first author or in author list)
  - By institution
  - Over time (10 years)
  - Measured by citations
- Number of publications of top-k authors
  - (first author/in author list) comparison
- Map different identifiers for conferences together
- Compare our results with those of NLP Scholar
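For mapping different conference identifiers together, one simple approach is a hand-maintained alias table; the aliases below are illustrative, not the dataset's actual spellings:

```python
# Hypothetical alias table: every known spelling maps to one canonical ID.
VENUE_ALIASES = {
    "acl": "ACL",
    "annual meeting of the association for computational linguistics": "ACL",
    "emnlp": "EMNLP",
    "conference on empirical methods in natural language processing": "EMNLP",
}

def normalize_venue(name: str) -> str:
    """Map different spellings of a conference name to one identifier.

    Unknown names pass through unchanged so they can be reviewed later.
    """
    key = name.strip().lower()
    return VENUE_ALIASES.get(key, name.strip())
```

Unmatched names passing through unchanged makes it easy to log them and grow the table incrementally.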
Semantic aspects
Step 5: Create the first embeddings
Now that we have the dataset we want to create the first embeddings for some papers.
- Decide on the model for our first analysis
  - fastText (GloVe or word2vec)
- Train the model
- Decide how we want to save the embeddings (in the dataset)
- Create the embeddings for the papers (of one venue in one year, only using abstracts)
- Save the embeddings
Step 6: Create visualization methods for the embeddings
With the embeddings we can now compare and visualize them.
- Write a function that determines the most similar and dissimilar papers
  - Check if the results make sense
- Write a function that visualizes the embeddings in a 2D space using UMAP
  - Check if the results make sense
  - Check if there are clusters and, if yes, determine manually what they could mean
  - Figure out a way to name those clusters automatically (topic modelling?)
- Determine how long one visualization takes and decide further steps based on this
- Save the visualizations
- Determine whether the results are usable and how to proceed (is the model good/bad?)
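The most-similar/most-dissimilar function could start as a brute-force cosine comparison over all pairs; a sketch (for the full dataset a vectorised version would be needed):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_and_least_similar(embeddings: dict):
    """Return the (most similar, most dissimilar) paper-ID pairs.

    embeddings maps paper ID -> embedding vector.
    """
    ids = list(embeddings)
    pairs = [(i, j) for idx, i in enumerate(ids) for j in ids[idx + 1:]]
    scored = sorted(pairs, key=lambda p: cosine(embeddings[p[0]], embeddings[p[1]]))
    return scored[-1], scored[0]
```

Eyeballing the top pairs this returns is a quick sanity check on whether the embeddings capture anything meaningful before investing in visualizations.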
Step 7: Analyse the papers
Now that we have usable results from the previous step, we want to analyse the papers and compare them.
- Create the remaining embeddings for all top-tier conference papers published (2010-2020)
- Run the visualizations for each top-tier conference for each year (2010-2020)
- Compare how the topics/clusters shift over time and between venues
  - Do the clusters differ between venues?
  - Are there trend-setters among the venues?
  - How do they shift over time?
- Expand the function using UMAP to color code different venues (all venues in one visualization)
- Run the visualizations again for 2010-2020
- Compare results with previous analysis and NLP scholar
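One way to quantify how topic distributions shift over time or between venues is the Jensen-Shannon divergence; this metric is our suggestion, not something the plan prescribes:

```python
import math

def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2) between two topic distributions.

    p and q map topic -> probability; 0 means identical, 1 means disjoint.
    """
    topics = set(p) | set(q)
    m = {t: (p.get(t, 0) + q.get(t, 0)) / 2 for t in topics}

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a.get(t, 0) * math.log2(a.get(t, 0) / b[t])
                   for t in topics if a.get(t, 0) > 0)

    return (kl(p, m) + kl(q, m)) / 2
```

Computing this between consecutive years per venue would turn the "how do they shift" question into a single trend line per conference.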
Step X: Additional analysis (if we have time)
We can also run the same analysis again, but switch some things up
- Create embeddings with different features: only from titles, abstracts, or both (whatever we did not do yet)
- Use a different model to create the embeddings (step 3B)
- Visualize papers in a 2D space differently (t-SNE, PCA, ...) and implement corresponding functions
- Visualize/analyse additional things
Notes: Further ideas (if we have time)
- Determine the most prominent topic per venue/year (automatically?)
  - Compare the results with the analysis of the most occurring title words
  - Do the same for the top-k authors
- Use embeddings to see research focus of certain countries/institutions
- Check if the embedding for "foreign language" is biased (depending on the model)
- Feed GPT-3 the titles/abstracts of the papers per year/venue and see what it generates
I liked Terry's idea. I just want to add that it makes sense to think in terms of problems first and then decide features that we need to solve the problem.
For example, we want to find out which authors published what kind of papers at which conference (I formulated this as generally as possible so you can come up with more ideas and narrow it down to your preferences). To address this question, you would need to derive essential features about the author (e.g., name, affiliation, institution, area), the paper (e.g., the topic derived from abstracts/tags/keywords/title), and possibly the venue.
We just want to make sure not to come up with features that do not serve a need.
I would suggest we keep this list of features together with their application (here or in the wiki) so we can review and adjust it when needed. What do you think?
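The "features together with their application" list could also live in code as a lightweight schema; a sketch with illustrative field names (the actual feature set is still to be decided):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Paper:
    """Candidate features for one paper; field names are illustrative."""
    title: str
    authors: List[str]
    venue: str
    year: int
    abstract: Optional[str] = None      # to be filled in Step 1
    affiliations: List[str] = field(default_factory=list)
    area: Optional[str] = None          # still marked "(?)" in the list below
```

Optional fields with defaults let the dataset grow feature by feature without breaking code that only needs the core ones.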
Great progress Lennart! 🚀
If you want us to go over the points and give comments, let me know. I liked what I have seen so far. 🤗
PS: Don't forget to also rest on the weekends.
Great first step, we are on the right track.
I would also add/keep track of all the feature characteristics we are looking for.
- paper title (ok)
- authors
- institutions
- affiliation
- abstract
- venue
- year
- area (?)
- ... (I just saw in the other issue we have an initial list)
Related Issues (20)
- Automatic upload to Zenodo
- Expand use of --s2_filter_pubmed, --s2_filter_arxiv, --s2_filter_pubmedcentral
- Add test configuration
- Add CSO annotations to release
- Link Scopus and Web of Science to D3
- Total number of works is not equivalent to count of papers.
- Extract call for papers from venue page
- DBLP Client, Processor, Backend Client
- Implement DBLP Client
- Implement automated storing to db/backend
- Implement Processor class
- Add automatic documentation and hosting on GitHub pages
- Add Dockerfile and docker-compose for grobid and project
- Umlauts in author and conference names
- Expand use of --s2_use_tldrs, --s2_use_citations, --s2_use_embeddings
- Match venue names
- Add pep8-naming
- Dataset Release v2.0
- Fix using all entries in export
- Remove paperAbstracts from non open access papers in zenodo