katehret / compinion

Supplementary materials and replication data for: Ehret, Katharina and Maite Taboada (2020). The interplay of complexity and subjectivity in opinionated discourse. (version 1.0)

License: GNU General Public License v3.0

R 46.23% Python 53.77%
text-complexity subjectivity argumentation opinion discourse-analysis

compinion's Introduction

Compinion: Analysing complexity and subjectivity

Replication data and scripts for: The interplay of complexity and subjectivity in opinionated discourse. (version 1.0)

DOI: https://zenodo.org/badge/latestdoi/189996444

Description

This repository comprises the original data, scripts and extensive statistics for the analysis of text complexity and subjectivity described in the related publication.

This publication is a large-scale, quantitative analysis of text complexity and various markers of subjectivity in opinionated discourse. Specifically, the authors investigate how text complexity interacts with markers of subjectivity to characterise (i) opinion articles, (ii) reader comments, and (iii) news articles. Methodologically, conditional inference trees and random forests (as implemented in the R package partykit) are used to unravel the interactions between text complexity and subjectivity. Text complexity is defined in terms of Kolmogorov complexity, i.e., the complexity of a text is measured as the length of the shortest possible description necessary to regenerate the original text. Subjectivity is operationalised as the frequency of lexico-grammatical markers of subjectivity and argumentation which have been well-established in research on sentiment, evaluation, stance and Appraisal.
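Kolmogorov complexity is not directly computable; in practice it is commonly approximated by the output size of an off-the-shelf compression program. The snippet below is a minimal R sketch of that general idea only, not the measure implemented by the scripts in this repository:

    # Rough sketch: approximating Kolmogorov complexity via gzip compression.
    # Illustration only; the publication's morphological, syntactic and overall
    # complexity scores are computed by the repository scripts.
    approx_complexity <- function(text) {
      raw_text   <- charToRaw(text)
      compressed <- memCompress(raw_text, type = "gzip")
      # Less compressible text yields a higher ratio, i.e. is more complex.
      length(compressed) / length(raw_text)
    }

    approx_complexity("the cat sat on the mat the cat sat on the mat")      # repetitive text compresses well
    approx_complexity("colourless green ideas sleep furiously, allegedly")  # less repetitive text compresses less well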

The data published in this repository were retrieved from the Simon Fraser University opinion and comments corpus (SOCC) and a custom-made corpus of general news articles from the Canadian online newspaper The Globe and Mail.

Overview and description of folders and files

This repository contains the following resources:

Data

This folder contains the original dataset.

  • aggregate_totals_normalised.csv: The feature matrix with the individual file names as rows and textType, year, tokens, the raw and normalised feature frequencies, and the complexity scores as columns. The normalised frequencies of the subjectivity and argumentation markers were obtained by dividing the raw frequencies by the number of tokens per file and multiplying by 1,000 (a minimal normalisation sketch follows this file list).

  • markerDistributions.csv: The raw frequencies of the individual subjectivity and argumentation markers per text type.
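For reference, a minimal sketch of the per-1,000-token normalisation described above; tokens is one of the columns listed above, whereas raw_modals is an illustrative assumption, not necessarily a column name in the actual file:

    # Sketch: normalise a raw marker frequency to occurrences per 1,000 tokens.
    # "raw_modals" is an assumed column name used for illustration only.
    d <- read.csv("Data/aggregate_totals_normalised.csv")
    d$modals_per_1000 <- d$raw_modals / d$tokens * 1000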

Subjectivity

This folder comprises the complete lists of subjectivity and argumentation markers described in the related publication.

  • other_features: A folder containing the lists of argumentation markers: adverbials, connectives and modals.

  • socal_features: A folder with two subdirectories containing reduced feature lists of subjectivity markers sampled from the Semantic Orientation CALculator (SO-CAL). Only subjectivity features with a valency of 4 or 5 are included.

    • socal_invariant: negative and positive adverbs.
    • socal_variant: negative and positive adjectives, nouns and verbs.

Scripts

This folder contains the scripts for data analysis and the retrieval of the subjectivity markers.

  • compinion.r: R commands for the visualisation and implementation of the statistics, conditional inference trees and forests presented in the related publication (a hedged usage sketch follows this list). Only tested on Debian GNU/Linux, using R version 3.6.2.

  • countFeat.py: A Python script for retrieving the subjectivity and argumentation markers (see Subjectivity).

  • countFeat.md: Readme with instructions on how to run countFeat.py.
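As a hedged illustration of how conditional inference trees of the kind grown by compinion.r can be fitted with partykit (the object name train is an assumption about the data layout; textType is the response column listed under Data; this is not an excerpt from compinion.r):

    # Sketch: grow a conditional inference tree predicting text type from the
    # complexity and subjectivity predictors.
    library(partykit)

    # The response must be a factor for classification.
    train$textType <- factor(train$textType)

    ctrl <- ctree_control(mincriterion = 0.95,  # significance threshold for splitting
                          minbucket    = 7,     # minimum observations per terminal node
                          maxsurrogate = 0)     # number of surrogate splits
    fit <- ctree(textType ~ ., data = train, control = ctrl)
    plot(fit)

    # Training accuracy from a confusion matrix
    conf <- table(predicted = predict(fit, newdata = train),
                  observed  = train$textType)
    sum(diag(conf)) / sum(conf)

The three parameters passed to ctree_control are the ones reported in tunegridTree.csv (see Statistics); the values shown here are partykit's defaults, not the settings used in the publication.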

Statistics

This folder contains all statistics described in the related publication and additional statistics.

  • The confusion matrices of the training and test datasets for conditional inference forests with N = 500, 1000, 2000 trees, respectively. Confusion matrices are used to calculate model performance, i.e. prediction accuracy.

    • confMat_500.csv and confMatTest_500.csv
    • confMat_1000.csv and confMatTest_1000.csv
    • confMat_2000.csv and confMatTest_2000.csv
  • correlations.csv: The Pearson correlation coefficients for correlations between all predictor variables described in the related publication, i.e. year, morphological complexity, syntactic complexity, overall complexity, subjective negative markers, subjective positive markers, modals, connectives, adverbials.

  • tunegridTree.csv: A CSV file reporting the training and test accuracy for conditional inference trees grown with varying parameter settings. More precisely, the following three parameters were used to tune the tree: mincriterion, minbucket and maxsurrogate (for a detailed description of the parameters, see https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf).

  • The rankings of the nine predictor variables according to the conditional permutation-importance measure, which indicates the importance of individual predictor variables and was calculated for three differently sized conditional inference forests, i.e. forests with N = 500, 1000, 2000 trees, respectively (a hedged partykit sketch follows this list).

    • varimp500.csv
    • varimp1000.csv
    • varimp2000.csv
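The following hedged sketch shows how such forests, confusion matrices, accuracies and importance rankings can be produced with partykit; train, test and the factor response textType are assumptions about the data layout, and none of this is taken from compinion.r:

    # Sketch: conditional inference forest, prediction accuracy and
    # permutation importance with partykit.
    library(partykit)

    set.seed(42)
    forest <- cforest(textType ~ ., data = train, ntree = 500)

    # Confusion matrix and prediction accuracy on the held-out test set
    conf_test <- table(predicted = predict(forest, newdata = test),
                       observed  = test$textType)
    sum(diag(conf_test)) / sum(conf_test)

    # Permutation importance of the predictors (the publication reports the
    # *conditional* variant; its exact invocation depends on the partykit version,
    # so this unconditional call is only an illustration)
    vi <- varimp(forest)
    sort(vi, decreasing = TRUE)

    # Pearson correlations between the numeric predictors (cf. correlations.csv)
    cor(train[sapply(train, is.numeric)], method = "pearson")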

compinion's People

Contributors: katehret, maitetaboada

compinion's Issues

An error when running the R script

    for (i in 1:27){
      mycontrols = ctree_control(mincriterion = tunegrid[i, 1],
                                 minbucket    = tunegrid[i, 2],
                                 maxsurrogate = tunegrid[1, 3])  # is there any problem with [1, 3]? or [i, 3]?
      tree.list[[i]] = ctree(formula, data = train, control = mycontrols)
    }

    Error in .y2infl(mfyx, response = d$variables$y, ytrafo = ytrafo) :
      unknown response class
