
Police topic modeling (LDA analysis)

About The Project

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. By modeling the topics discussed in a large corpus of police-related tweets, we can begin to understand how public perceptions of the police are influenced by different events and circumstances.
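As a minimal, self-contained sketch of that idea (the tiny document-term matrix and its term labels below are invented purely for illustration), fitting a two-topic LDA model shows how each document comes out as a mixture of topics:

```r
# Toy illustration only: a 4-document, 4-term count matrix with two
# obvious themes. The real input is the MEH document-term matrix built below.
library(topicmodels)

toy_dtm <- matrix(
  c(5, 4, 0, 0,
    4, 5, 1, 0,
    0, 1, 5, 4,
    0, 0, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("officer", "arrest", "protest", "reform"))
)

toy_model <- LDA(toy_dtm, k = 2, method = "Gibbs",
                 control = list(alpha = 0.1, seed = 42))

# Each row is one document's topic mixture and sums to 1
theta_toy <- posterior(toy_model)$topics
round(theta_toy, 2)
```

With clearly separated toy counts like these, the first two documents typically load on one topic and the last two on the other, mirroring how the full model assigns tweets to police-related themes.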

This is part of the Technology, Relationships, and Language lab's research on police-citizen interactions.

Dr. Ta-Johnson is the author and owner of this script. For more info, visit https://www.vivianpta.com

Prerequisites

Required packages (install them first with install.packages() if they are not already on your system)

library(topicmodels)
library(lda)        # provides top.topic.words(), used below
library(doParallel)
library(ggplot2)
library(scales)
library(slam)
library(tictoc)

You will also need the Meaning Extraction Helper (MEH) here: https://www.ryanboyd.io/software/meh/download/

Once downloaded, open MEH and select the following parameters (if I don't mention something, leave it on the default selection):

  1. Input File Settings: upload "tweets.csv"; set "id" as the column to be used as the row identifier and "cleaned_tweets" as the column containing the text
  2. Output Generation: select the folder where you want the output to be saved
  3. Text Segmentation: select "No segmentation"
  4. Conversion List: use the default selections
  5. Stop List: use the default selections
  6. Token Handling Options: use the default selections
  7. Choose Output Types: adjust the number in "Prune Frequency List after X Docs" as needed; de-select "Binary Document by Term Matrix" and "Verbose Document by Term Matrix"; select "Raw Count Document by Term Matrix"
  8. N-gram Settings: insert a number in "Ignore Documents with a Word Count less than"; set A (min) = 1 and B (max) = 1; select "Retain the X most Frequent N-grams (by raw frequency)" and input your selected threshold parameter (X). I've used 500 in the past, but it may need adjusting
  9. Once you finish selecting the parameters, click "Start!" under Begin Analysis

Import the MEH_DTM_RawCount file into R from the output folder

Make it a data frame named "DF_DTMatrix" (replace 'pathname' below with the path to the exported file)

RawCounttable <- read.csv('pathname')

DF_DTMatrix <- as.data.frame(RawCounttable)

Determine how many topics you want to extract and how large each one should be

Replace with the number of topics you want to extract

NumberOfTopics <- 6

Indicate the number of "most likely" terms that you want to show for any given topic. I've used 20 previously, but you can change this to whatever you want

MaximumTermsPerTopic <- 20

Save your results

dir.create("Results", showWarnings = FALSE)

Clean your text some more

Strip away the extra info from our document term matrix (Segment, WC, RawTokenCount)

DF_DTMatrix <- DF_DTMatrix[, !names(DF_DTMatrix) %in% c('Segment', 'WC', 'RawTokenCount')]

Only include rows that don't sum to 0 (i.e., that have at least one of our target words)

rowTotals <- rowSums(DF_DTMatrix[2:length(DF_DTMatrix)])   # skip column 1 (Filename)
DF_DTMatrix <- DF_DTMatrix[rowTotals > 0,]
remove(rowTotals)

Set aside our filenames, then drop the column before modeling

Filenames <- DF_DTMatrix$Filename
DF_DTMatrix <- DF_DTMatrix[, !names(DF_DTMatrix) %in% c("Filename")]

Run LDA model

Our lab typically uses "Gibbs" and alpha = 0.1

lda.model <- LDA(DF_DTMatrix, k=NumberOfTopics, method="Gibbs", control = list(alpha = 0.1))
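The packages doParallel and tictoc are loaded in the prerequisites but not used above. One hedged way to put them to work is choosing NumberOfTopics empirically: fit one model per candidate k in parallel and compare log-likelihoods (higher is better). The stand-in count matrix and candidate values below are illustrative only; in practice you would pass DF_DTMatrix instead.

```r
library(topicmodels)
library(doParallel)   # also attaches foreach
library(tictoc)

# Stand-in count matrix so this sketch runs on its own;
# substitute DF_DTMatrix for real use.
set.seed(1)
stand_in_dtm <- matrix(rpois(20 * 10, lambda = 2), nrow = 20,
                       dimnames = list(NULL, paste0("term", 1:10)))

candidate_k <- c(2, 3, 4)   # illustrative candidates, not a recommendation

cl <- makeCluster(2)
registerDoParallel(cl)
tic("comparing candidate k values")
loglik_by_k <- foreach(k = candidate_k, .combine = c,
                       .packages = "topicmodels") %dopar% {
  m <- LDA(stand_in_dtm, k = k, method = "Gibbs",
           control = list(alpha = 0.1, seed = 1234))
  as.numeric(logLik(m))   # log-likelihood of the fitted model
}
toc()
stopCluster(cl)

comparison <- data.frame(k = candidate_k, logLik = loglik_by_k)
comparison
```

Raw log-likelihood tends to keep improving as k grows, so treat this as a rough guide and weigh the interpretability of the resulting topics as well.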

Dump our term results to a .csv file in the "Results" folder

DataToPrint <- terms(lda.model, MaximumTermsPerTopic)
write.csv(DataToPrint, paste("Results/", Sys.Date(), "_-_LDA_Topic_Terms.csv", sep=""), na="", row.names=FALSE)

This prints the most likely topic for each document to a file in the "Results" folder

DataToPrint <- data.frame(Filenames)
DataToPrint$Topics <- topics(lda.model)
write.csv(DataToPrint, paste("Results/", Sys.Date(), "_-_LDA_Document_Categorization.csv", sep=""), na="", row.names=FALSE)

Rank topic terms for each topic and save

tmResult <- posterior(lda.model)
attributes(tmResult)     # contains $terms (beta, topic-term probabilities) and $topics (theta, document-topic probabilities)
beta <- tmResult$terms
dim(beta)                # NumberOfTopics rows, one column per term
topictermsranked <- apply(lda::top.topic.words(beta, MaximumTermsPerTopic, by.score = TRUE), 2, paste, collapse = " ")
write.csv(topictermsranked, paste("Results/", Sys.Date(), "_-_Topic_Terms_Ranked.csv", sep=""), na="", row.names=FALSE)
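ggplot2 and scales are loaded in the prerequisites but never used in the script. One natural use is plotting overall topic prevalence from theta, the document-by-topic probability matrix in tmResult$topics: the column means of theta give each topic's average share across tweets. The small theta below is a hypothetical stand-in so the sketch runs on its own.

```r
library(ggplot2)
library(scales)

# In the script above, theta comes from the fitted model:
#   theta <- tmResult$topics
# A tiny hypothetical stand-in is used here instead.
theta <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4),
                nrow = 3, byrow = TRUE)

prevalence <- data.frame(Topic = factor(seq_len(ncol(theta))),
                         Share = colMeans(theta))

p <- ggplot(prevalence, aes(x = Topic, y = Share)) +
  geom_col() +
  scale_y_continuous(labels = percent_format()) +
  labs(y = "Mean topic probability")

# To save alongside the other outputs:
# ggsave(paste0("Results/", Sys.Date(), "_-_Topic_Prevalence.png"), p)
```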
