
Comments (5)

trannel commented on September 23, 2024

I agree we should keep track of the features from the beginning. I only included it as

Decide on the format and categorisation (e.g. by year) to save our new dataset and what features to include

but deciding it earlier makes more sense and is more in line with what we discussed in the last meeting. We could keep it in the other issue, as it is already there, but maybe add a reference to the problems we want to use them for, or group them accordingly.


trannel commented on September 23, 2024

Project Plan - WIP

Step 0: Preparation

  • Read all 4 NLP Scholar blog posts
  • Read the relevant papers
  • Recall NLP lecture
  • Create a project plan

Step 1: Complete the dataset

The NLP Scholar dataset only seems to contain the paper titles and no other textual content (e.g. abstracts). We want to complete the dataset by adding the abstract for each paper; a rough sketch of the extraction is given after the list below.

  • Check the author's GitHub for a bigger dataset and code. He might have useful information there.
  • Download the offline AA dataset (https://www.aclweb.org/anthology/info/development/)
  • Analyse the offline AA dataset for differences compared to the web version
  • Analyse how we can access the abstracts for papers of different venues and how they are referenced
  • Decide on the format and categorisation (e.g. by year) to save our new dataset
    • Expand the dataset with abstracts
  • Write the code to extract the abstracts and create the new dataset
  • Save the PDFs of the accessed papers in a structured way (year/venue/...)
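
A minimal sketch of how the abstract extraction could look, assuming we go through the anthology's BibTeX export with abstracts; the file name and field names are assumptions that still need to be checked against the offline AA dataset:

```python
# Sketch: pull titles/abstracts out of an ACL Anthology BibTeX export.
# Assumes a local file "anthology+abstracts.bib" with an "abstract" field;
# verify the actual file and field names against the offline dataset.
import bibtexparser

with open("anthology+abstracts.bib", encoding="utf-8") as f:
    bib = bibtexparser.load(f)

papers = []
for entry in bib.entries:
    papers.append({
        "id": entry.get("ID"),
        "title": entry.get("title", ""),
        "abstract": entry.get("abstract", ""),  # may be missing for older papers
        "venue": entry.get("booktitle", entry.get("journal", "")),
        "year": entry.get("year", ""),
    })

print(f"{len(papers)} entries, "
      f"{sum(1 for p in papers if p['abstract'])} with abstracts")
```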

Step 2: First look into the data

First, we should take a look at the data we have by analysing keywords and using tf-idf; a quick sketch follows the list.

  • Determine the top 20 words (unigrams) per conference
  • Determine the top 20 bigrams per conference
  • Do the same with tf-idf
  • Compare our results with those of NLP Scholar
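
A rough sketch of the top-term counts with scikit-learn; the `texts_by_conf` dict is a placeholder for whatever per-conference text collection we end up with:

```python
# Sketch: top-20 unigrams/bigrams per conference, raw counts vs. tf-idf.
# `texts_by_conf` is a placeholder dict {conference: [abstract, ...]}.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def top_terms(texts, ngram=(1, 1), k=20, use_tfidf=False):
    Vectorizer = TfidfVectorizer if use_tfidf else CountVectorizer
    vec = Vectorizer(ngram_range=ngram, stop_words="english")
    X = vec.fit_transform(texts)                 # docs x terms
    scores = np.asarray(X.sum(axis=0)).ravel()   # aggregate score per term
    terms = np.array(vec.get_feature_names_out())
    order = scores.argsort()[::-1][:k]
    return list(zip(terms[order], scores[order]))

for conf, texts in texts_by_conf.items():
    print(conf, top_terms(texts, ngram=(2, 2), use_tfidf=True)[:5])
```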

Step 3: Apply topic modelling

We want to use topic modelling to gain further insights into the topics and how they change over time; a minimal sketch is given after the list.

  • Get topic modelling running on our data (https://radimrehurek.com/gensim/)
  • Hyperparameter tuning (number of topics, words per topic)
    • Minimum 1 topic per conference
    • Play around with the granularity
  • Do this for multiple years, maybe try to visualize it
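
A minimal gensim sketch, assuming the abstracts are already tokenized (`tokenized_docs` is a placeholder; the hyperparameters are just starting values to tune):

```python
# Sketch: LDA topic modelling with gensim on tokenized abstracts.
# `tokenized_docs` is a placeholder: one list of tokens per paper.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # prune rare/common terms
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,      # tune: at least one topic per conference
    passes=10,
    random_state=42,
)

for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```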

Step 4: Core analysis (might change over time)

These are the things we want to analyse using the expanded dataset; they might change over time.
In general, we might want to look into the things NLP Scholar looked into, for comparison. A small sketch of the author counts follows the list.

  • Most used words (see Step 2)
    • By conference
    • By author (first author or in author list)
    • By institution
    • Over time (10 years, percentage based plot)
    • Also check for bigrams
  • Most studied topics (see Step 3)
    • By conference
    • By author (first author or in author list)
    • By institution
    • Over time (10 years)
    • Measured by citations
  • Number of publications of the top-k authors
    • Comparison of first author vs. anywhere in the author list
  • Map different identifiers for conferences together
  • Compare our results with those of NLP Scholar
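
A small pandas sketch for the top-k author counts; the `df` schema with an `authors` list column is an assumption, since we have not fixed the dataset format yet:

```python
# Sketch: publication counts for the top-k authors.
# Assumes a DataFrame `df` with an "authors" column holding a list of names
# per paper; the column names are placeholders for our future schema.
import pandas as pd

k = 20
exploded = df.explode("authors")                    # one row per (paper, author)
any_position = exploded["authors"].value_counts().head(k)

first_authors = df["authors"].str[0]                # first element of each author list
first_position = first_authors.value_counts().head(k)

comparison = pd.DataFrame({
    "in_author_list": any_position,
    "first_author": first_position,
}).fillna(0).astype(int)
print(comparison)
```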

Semantic aspects

Step 5: Create the first embeddings

Now that we have the dataset, we want to create the first embeddings for some papers; a training sketch follows the list.

  • Decide on model for our first analysis
    • fastText (alternatively GloVe or word2vec)
  • Train the model
  • Decide how we want to save the embeddings (in the dataset)
  • Create the embeddings for the papers (of one venue in one year, only using abstracts)
  • Save the embeddings
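
A minimal sketch with gensim's fastText implementation, assuming `abstracts` is the list of abstract strings for one venue/year; the hyperparameters and file name are placeholders:

```python
# Sketch: train fastText on the abstracts of one venue/year and embed each
# paper as the average of its word vectors. Tokenization and file layout
# are assumptions, not final decisions.
import numpy as np
from gensim.models import FastText
from gensim.utils import simple_preprocess

tokenized = [simple_preprocess(a) for a in abstracts]  # `abstracts`: list of strings

model = FastText(sentences=tokenized, vector_size=100, window=5,
                 min_count=2, epochs=10)

def paper_embedding(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

embeddings = np.vstack([paper_embedding(t, model) for t in tokenized])
np.save("embeddings_one_venue_one_year.npy", embeddings)  # placeholder file name
```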

Step 6: Create visualization methods for the embeddings

With the embeddings in place, we can now compare and visualize the papers; a sketch of the similarity function and the UMAP projection follows the list.

  • Write a function that determines the most similar and dissimilar papers
    • Check if the results make sense
  • Write a function that visualizes the embeddings in a 2D space using UMAP
    • Check if the results make sense
    • Check if there are clusters and if yes, determine what they could mean manually
    • Figure out a way to name those clusters automatically (topic modelling?)
    • Determine how long one visualization takes and decide further steps based on this
    • Save the visualizations
  • Determine whether the results are usable and how to proceed (is the model good/bad?)
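
A sketch of the similarity lookup and the 2D projection, assuming `embeddings` and `titles` come from the previous step (uses umap-learn and matplotlib; the parameter values are just defaults to play with):

```python
# Sketch: most similar/dissimilar papers via cosine similarity, plus a 2D UMAP plot.
# `embeddings` (n_papers x dim) and `titles` are placeholders from the previous step.
import numpy as np
import umap
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

def similar_and_dissimilar(idx, embeddings, titles, k=5):
    sims = cosine_similarity(embeddings[idx:idx + 1], embeddings).ravel()
    order = sims.argsort()[::-1]
    order = order[order != idx]                      # drop the query paper itself
    most = [(titles[i], float(sims[i])) for i in order[:k]]
    least = [(titles[i], float(sims[i])) for i in order[-k:]]
    return most, least

coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.savefig("umap_papers.png", dpi=200)              # file name is a placeholder
```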

Step 7: Analyse the papers

Now that we have usable results from the previous step, we want to analyse the papers and compare them; a sketch of the venue-coloured visualization follows the list.

  • Create the remaining embeddings for all top-tier conference papers published (2010-2020)
  • Run the visualizations for each top-tier conference for each year (2010-2020)
  • Compare how the topics/clusters shift over time and in between venues
    • Do the clusters differ between venues?
    • Are there trend-setters among the venues?
    • How do they shift over time?
  • Expand the function using UMAP to color code different venues (all venues in one visualization)
    • Run the visualizations again for 2010-2020
  • Compare results with previous analysis and NLP scholar
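
A possible extension of the UMAP function for the colour coding; the variable and file names are placeholders:

```python
# Sketch: project all papers into one 2D space and colour-code points by venue.
# `all_embeddings` (n_papers x dim) and `venues` (one label per paper) are placeholders.
import numpy as np
import umap
import matplotlib.pyplot as plt

coords = umap.UMAP(random_state=42).fit_transform(all_embeddings)

fig, ax = plt.subplots(figsize=(8, 6))
for venue in sorted(set(venues)):
    mask = np.array(venues) == venue
    ax.scatter(coords[mask, 0], coords[mask, 1], s=5, label=venue)
ax.legend(markerscale=3)
fig.savefig("umap_all_venues_2010_2020.png", dpi=200)  # placeholder file name
```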

Step X: Additional analysis (if we have time)

We can also run the same analysis again, but switch some things up; a sketch of the alternative projections follows the list.

  • Create embeddings from different features: only titles, only abstracts, or both (whatever we did not do yet)
  • Use a different model to create the embeddings (step 3B)
  • Visualize papers in a 2D space differently (t-SNE, PCA, ...) and implement corresponding functions
  • Visualize/analyse additional things
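
A short sketch of the alternative 2D projections with scikit-learn; the `embeddings` array is the placeholder from Step 5:

```python
# Sketch: alternative 2D projections (PCA, t-SNE) for the same embeddings.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

coords_pca = PCA(n_components=2).fit_transform(embeddings)
coords_tsne = TSNE(n_components=2, perplexity=30, random_state=42,
                   init="pca").fit_transform(embeddings)
```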

Notes: Further ideas (if we have time)

  • Determine most prominent topic per venue/year (automatically?)
    • Compare the results with the analysis of the most frequently occurring title words
  • Do the same for the top-k authors
  • Use embeddings to see research focus of certain countries/institutions
  • Check if the embedding for "foreign language" is biased (depending on the model)
  • Feed GPT-3 the titles/abstracts of the papers per year/venue and see what it generates


jpwahle commented on September 23, 2024

I liked Terry's idea. I just want to add that it makes sense to think in terms of problems first and then decide on the features we need to solve them.

For example, we want to find out which authors published what kind of papers in which conference (I formulated this as generally as possible so you can come up with more ideas and narrow it down to your preferences). To address this question, you would need to derive essential features about the author (e.g., name, affiliation, institution, area), the paper (e.g., the topic derived from abstracts/tags/keywords/title), and possibly the venue.

We just want to make sure not to come up with features that do not serve a need.
I would suggest we keep this list of features together with their application (here or in the wiki) so we can review and adjust it when needed. What do you think?


jpwahle commented on September 23, 2024

Great progress, Lennart! 🚀
If you want us to go over the points and give comments, let me know. I liked what I have seen so far. 🤗

PS: Don't forget to also rest on the weekends.


truas commented on September 23, 2024

Great first step, we are on the right track.

I would also add/keep track of all the feature characteristics we are looking for.

  • paper title (ok)
  • authors
  • institutions
  • affiliation
  • abstract
  • venue
  • year
  • area (?)
  • ... (I just saw in the other issue we have an initial list)

