dustinstoltz / cartography_poetics

Reproduction Repository for "Cultural Cartography with Word Embeddings"

Home Page: https://doi.org/10.1016/j.poetic.2021.101567

License: GNU General Public License v3.0

R 100.00%
embeddings natural-language-processing concept-movers-distance word-movers-distance cultural-sociology replication-materials

Updated on 2024-02-19

Marshall A. Taylor and Dustin S. Stoltz

This repository contains all R code and data necessary to reproduce the plots in "Cultural Cartography with Word Embeddings," published in Poetics. You can read the article at Poetics [PDF][DOI][preprint on SocArxiv].

Data

To reproduce the figures, you will need two sets of pretrained word embeddings: English fastText embeddings and the Historical Word2Vec embeddings trained on the Corpus of Historical American English. You can get them from the linked sites. We have also prepared an R package (text2map.pretrained) for downloading and loading the embeddings:

# install the package
remotes::install_gitlab("culturalcartography/text2map.pretrained")

# load the package
library(text2map.pretrained)

You only need to download the embeddings models once per machine:

# download the fasttext embeddings
download_pretrained("vecs_fasttext300_commoncrawl")
# download the histwords embeddings
download_pretrained("vecs_sgns300_coha_histwords")
# Load them into the session
data("vecs_fasttext300_commoncrawl")
data("vecs_sgns300_coha_histwords")

# assign to new (shorter) object 
ft.wv <- vecs_fasttext300_commoncrawl
hi_wv <- vecs_sgns300_coha_histwords
# then remove the original
rm(vecs_fasttext300_commoncrawl)
rm(vecs_sgns300_coha_histwords)

Next, we use roughly 200,000 news articles from the All The News (ATN) dataset covering 2013-2018. We convert the article texts into a Document-Term Matrix (see the paper for our preprocessing procedure) with 197,814 documents, 11,271 unique terms, and 79,483,124 total terms. It is a smidge too big for GitHub, so we've hosted it on Dropbox.

You can also download it directly using R:

temp <- tempfile(fileext = ".Rds")
# mode = "wb" keeps the binary Rds file intact on Windows
download.file("https://www.dropbox.com/scl/fi/34ic6nw4bw8ku3tdodrf7/dtm_news.Rds?rlkey=8mwuaedqpiqe6zut11ryhvg3o&raw=1",
  destfile = temp,
  mode = "wb"
)
news_dtm_99 <- readRDS(temp)

nrow(news_dtm_99) == 197814
ncol(news_dtm_99) == 11271
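
The three numbers quoted for the Document-Term Matrix correspond to its rows (documents), its columns (unique terms), and the sum of its cells (total terms). Here is a minimal base R sketch on a made-up three-document corpus. The documents and the object names `docs` and `toy_dtm` are hypothetical and are not part of the ATN data or of our preprocessing:

```r
# Hypothetical toy corpus (NOT the ATN data)
docs <- c(
  d1 = "immigration policy debate",
  d2 = "school policy policy",
  d3 = "crime debate"
)

# Split each document into tokens on single spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is the sorted set of unique terms
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document; vapply returns terms x documents,
# so transpose to get the usual documents x terms layout
toy_dtm <- t(vapply(
  tokens,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab

nrow(toy_dtm) # number of documents
ncol(toy_dtm) # number of unique terms
sum(toy_dtm)  # total number of terms
```

The same three calls applied to `news_dtm_99` should recover the dimensions and total reported above, assuming the object supports `sum()` the way standard dense and sparse matrices do.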

Then load the metadata for the ATN corpus:

  news_data   <- readRDS("data/news_metadata.Rds")

Finally, below are a few extra pieces of data: our anchor lists for building semantic directions and some event data for figure 5.

  df_anchors <- readRDS("data/anchor_lists.Rds")
  df_events <- read.csv("data/events.csv")
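
To give a rough idea of what an anchor list is for, here is a minimal base R sketch of one common way to turn anchor pairs into a semantic direction: average the offsets between the paired word vectors, then compare words to that average by cosine similarity. Everything in it is an assumption made for illustration: the embeddings are random 4-dimensional toys, and the names `toy_wv`, `toy_anchors`, `add`, and `subtract` are hypothetical. It is not necessarily how the scripts in this repository build directions; text2map provides `get_direction()` for that.

```r
set.seed(1)

# Hypothetical toy embeddings: 6 words x 4 dimensions
# (the pretrained embeddings used here have 300 dimensions)
words <- c("rich", "poor", "wealthy", "impoverished", "yacht", "shelter")
toy_wv <- matrix(
  rnorm(length(words) * 4),
  nrow = length(words),
  dimnames = list(words, NULL)
)

# Hypothetical anchor pairs: one pole in "add", the opposite pole in "subtract"
toy_anchors <- data.frame(
  add = c("rich", "wealthy"),
  subtract = c("poor", "impoverished")
)

# Offset for each pair, then the average offset as the direction
offsets <- toy_wv[toy_anchors$add, , drop = FALSE] -
  toy_wv[toy_anchors$subtract, , drop = FALSE]
direction <- colMeans(offsets)

# Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of every word to the direction (values fall in [-1, 1])
proj <- apply(toy_wv, 1, cosine, b = direction)
proj
```

With real embeddings, words closer to the "add" pole would be expected to score higher than words closer to the "subtract" pole. With the random toy vectors above the scores carry no meaning.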

Packages

We use the following packages:

  library(tidyverse)
  library(reshape2)
  library(ggpubr)
  library(text2vec)
  library(text2map)

For the ggplot aesthetics, we use another package we've developed:

  # this will change the colorscheme to viridis
  # and tweak the rest of the ggplot2 aesthetics
  remotes::install_gitlab("culturalcartography/text2map.theme")
  text2map.theme::set_theme()
  

Figures

Provided the above dataframes, matrices, and packages are loaded, the R scripts in the Scripts folder will reproduce figures 1-6 in the paper. For a more detailed guide for using Concept Mover's Distance (used for figures 4-6) see the vignette in text2map.

February 2024 Update

We updated this repository because some of the links to the data were broken, and we revised some of the code to account for dependency changes.

cartography_poetics's Issues

Cannot reproduce "#run CMD" step

Thanks a lot for the great paper and the CMDist package! I would like to use your word embedding technique in my own research, so I am trying to reproduce your R code in this file. I have managed to reproduce the first three figures exactly. But when I try to produce the fourth figure, in the "# get anchor lists and build directions" section, I cannot get past the following call:

  immigr.CMD <- CMDist::CMDist(dtm = news.dtm.99,
    cw = c("immigration", "immigration job",
           "immigration school", "immigration crime",
           "immigration family"),
    cv = cd,
    wv = ft.wv,
    scale = TRUE)

I tried several times, but each time RStudio reports this error:

  Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 197814

I cannot figure out what went wrong. Could you please tell me how to fix this problem? Thank you very much!

Links to data are broken.

Hello,

The links to Google Drive to download the RDS files are not working. So, your experiments cannot be reproduced.

Thanks for your great paper.
