dynamicgenetics / epicosm_legacy

Please ignore this repo. It's just hanging around for quick access to some resources I needed. Will delete soon :) 14/9/21

License: GNU General Public License v3.0

Python 99.12% Shell 0.88%
social-media epidemiology cohort-studies mongodb python3 sentiment-analysis

epicosm_legacy's Introduction


Overview

Epicosm: Epidemiology of Cohort Social Media. Epicosm is a suite of tools for working with social media data in the context of epidemiological research. It is intended for epidemiologists who wish to gather, analyse and integrate social media data with existing longitudinal and cohort-study research. The tools can:

  • Harvest ongoing and retrospective Tweets from a list of users.
  • Listen to the real-time Twitter stream from geographic locations, and collate the results into a database.
  • Apply sentiment analysis to Tweets using labMT, VADER and LIWC (dictionary required for LIWC).
  • [in development] Validate sentiment analysis algorithms against groundtruth.

Instructions in a nutshell

1. Download the Epicosm executable for your system (epicosm_mac or epicosm_linux) from this repository.

2. Install MongoDB version 4 or higher:

  • In a Mac terminal brew install mongodb
  • In a Linux terminal apt install mongodb

3. Put these three files into a folder:

  • epicosm_mac OR epicosm_linux, as downloaded from the repository in step 1,
  • credentials.txt file (provided here, but complete with your own Twitter access keys),
  • and your user_list (supplied by you: one screen name per line, plain text file).

4. Run Epicosm from your command line, including your run flags

  • Epicosm will provide some help if it doesn't understand you: just type ./epicosm_linux or ./epicosm_mac. See below for more details; for example, a typical harvest can be started with ./epicosm_linux --user_harvest
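For reference, the folder described in step 3 might end up looking like this (the folder name is up to you; only the three file names matter):

my_epicosm_run/
    epicosm_linux   (or epicosm_mac)
    credentials.txt
    user_list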

•••

More detail

1 What does it do?

2 Running Epicosm from compiled python executable

3 Optional parameters

4 Natural Language Processing (Sentiment analysis)

5 Geoharvester

6 Data and other outputs

7 Running the python script manually

8 Licence

•••

1 What does it do?

Epicosm is a social media harvester, data manager and sentiment analyser. Currently, the platform uses Twitter as the data source, and the sentiment analysis methods available are VADER, labMT and LIWC (you will need an LIWC dictionary for this). You provide a list of users, and it will gather and store all tweets and metadata (going back a maximum of around 3,200 tweets per user, the Twitter API timeline limit) for each user. Images, videos and other attachments are stored as URLs. All information is stored in MongoDB. Harvesting can be iterated, for example once a week it can gather new tweets and add them to the database. As well as the full database, output includes a comma-separated values (.csv) file, with the default fields being the user id number, the tweet id number, the time and date, and the tweet content. Epicosm can also harvest the friends of users (i.e. the accounts that each user is following, not the followers of the user).

Epicosm uses MongoDB for data management, and this must be installed before running Epicosm. This can be done by downloading and installing from the MongoDB website, or in a Terminal window with the command brew install mongodb on a Mac, or apt install mongodb on Linux (Debian-based systems like Ubuntu).

Epicosm can be run in two ways. It can be run using the compiled python executables provided, epicosm_mac or epicosm_linux. If there are any issues with your input files (your user_list and your credentials.txt), Epicosm will try to help you. Alternatively, Epicosm can be run with Python version 3+; details are in section 7.

You will need Twitter API credentials, which means having a developer account authorised by Twitter. Please see our guide to getting an authorised account, and there are further details in Twitter's documentation on how to do this. As of August 2020, Twitter are usually rapid in authorising for academic purposes, although this can of course change. Be aware that many of the guides to getting authorisation that you will find are out of date!

•••

2 Running Epicosm from compiled python executable

This is the usual way of running Epicosm (see section 7 for running with Python directly).

You must provide 2 further files in the folder with the Epicosm executable:

  1. a list of user screen names in a file called user_list. The user list must be a plain text file, with a single username (twitter screen name) per line.
  2. Twitter API credentials. Please see the file in this repository for a template. This file must be called credentials.txt.

Then you can run the executable, for example ./epicosm_linux [your run flags] or ./epicosm_mac [your run flags].
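As an illustration of the user_list format (these screen names are placeholders, not real accounts), the file is simply one Twitter screen name per line:

jane_doe_example
another_account_02
a_third_account

For credentials.txt, use the template provided in this repository and fill in your own keys.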

•••

3 Optional parameters

When running the harvester, please specify what you want Epicosm to do:

--user_harvest Harvest tweets from all users in a file called user_list (provided by you) with a single user per line. The database will be backed up on every harvest, with a rotating backup of the last three harvests. These can be imported into another instance of MongoDB with mongoimport; see the MongoDB documentation for details.

--get_friends Create a database of the users that are being followed by the accounts in your user_list. (This process can be very slow, especially if your users follow many accounts.) You will also get a CSV of users and who they are following, in /output/csv. If used with --repeat, friends will only be gathered once.

--repeat Iterate the user harvest every 3 days. You will need to put this process in the background to free your terminal prompt, or to keep it running while logged out.

--refresh If you have a new user_list, this tells Epicosm to use that file as your updated user list.

--csv_snapshots Make a CSV-formatted snapshot of selected fields on every harvest. See the documentation for the format and fields of this CSV. Be aware that this may take up disk space - see ./output/csv.

Example of single harvest: ./epicosm --user_harvest

Example iterated harvest in background, with a renewed user_list and taking CSV snapshots: nohup ./epicosm --user_harvest --refresh --csv_snapshots --repeat &

4 Natural Language Processing (Sentiment analysis)

Once you have a database with tweets, you can apply sentiment analysis to each document and insert the results into MongoDB. You will need to run epicosm_nlp.py (if you have dependency errors, please install them with pip3 install -r requirements.txt).

To run, specify one or more of the following flags:

--insert_groundtruth Provide a file of groundtruth values called 'groundtruth.csv' and insert these into the local database.

--liwc Apply LIWC (Pennebaker et al 2015) analysis and append values to the local database. You must have a LIWC dictionary in the run folder, named "LIWC.dic". LIWC has around 70 categories (including posemo and negemo), but many of these will return no value because tweets are too short to provide information. Empty categories are not appended to the database. Note: the LIWC package cannot deal with its own dictionary; if it comes across phrasal entries it throws a key error. In LIWC 2015, most of these are variations on the word 'like' ('we like', 'they like', 'not like'), but the words 'like', 'not' and 'we' are already in categories, and the phrasal entries have the same metrics anyway. You will need to clean your dictionary with the script in src called cleanLIWC.sh.

--labmt Apply labMT (Dodds & Danforth 2011) analysis and append values to the local database. LabMT provides a single positive - negative metric, ranging from -1 to 1 (1 being positive sentiment, 0 being neutral, -1 being negative).

--vader Apply VADER (Hutto & Gilbert 2014) analysis and append values to the local database. VADER returns 4 metrics: positive, neutral, negative and compound. See their documentation for details.

--textblob Apply TextBlob (github: @sloria) analysis and append values to the local database. TextBlob provides a single positive - negative metric, ranging from -1 to 1 (1 being positive sentiment, 0 being neutral, -1 being negative).

The results of these analyses will be appended to each tweet's record, under the field "epicosm", and stored in MongoDB.
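As a rough illustration of what this looks like in practice, the sketch below scores each stored tweet with VADER and appends the result in MongoDB. This is a minimal sketch, not the code that epicosm_nlp.py actually runs: the tweet text field ("full_text") and the sub-field name ("epicosm.vader") are assumptions and may differ from Epicosm's exact schema.

# Minimal sketch: score each tweet with VADER and append the result in MongoDB.
# Assumes MongoDB is running locally with the twitter_db database described in section 6;
# the "full_text" field and "epicosm.vader" sub-field are assumptions, not Epicosm's exact schema.
from pymongo import MongoClient
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

db = MongoClient()["twitter_db"]
analyzer = SentimentIntensityAnalyzer()

for tweet in db.tweets.find({}, {"full_text": 1}):
    scores = analyzer.polarity_scores(tweet.get("full_text", ""))  # pos, neu, neg, compound
    db.tweets.update_one({"_id": tweet["_id"]}, {"$set": {"epicosm.vader": scores}})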

•••

5 Geoharvester

The python script geoharvester.py can launch a Twitter stream listener by geographic location, as defined by one or more latitude/longitude boxes. Please see the example geoboxes.py for the format of this file. As above, you will need to provide your credentials.txt to gain access to the Twitter streaming API. All tweets are stored in MongoDB under the database geotweets and the collection geotweets_collection. To sentiment analyse these, please see section 4 on NLP above. Few Tweets (historically, less than 2%) have geotags, but Twitter will try to assign a rough location based on city or country. As of 2020, Twitter is reporting that it will phase out geotagging, since few people authorise Twitter to geotag their tweets.
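The real format of geoboxes.py is defined by the example file in this repository; purely as an illustration (the variable names and coordinates here are made up), a single bounding box in the order the Twitter streaming API expects - south-west longitude, south-west latitude, north-east longitude, north-east latitude - might look like:

# geoboxes.py - illustrative sketch only; see the example file in this repository for the real format.
# Twitter's streaming API takes each box as [sw_lng, sw_lat, ne_lng, ne_lat].
bristol_box = [-2.73, 51.38, -2.45, 51.55]
geoboxes = bristol_box  # concatenate further boxes onto this list to listen to several areas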

6 Data and other outputs

The processed output is a database of tweets from the users in your user_list, and a CSV file in the folder ./output/csv/, which by default has the fields: [1] the ID of the tweeter, [2] the ID of the tweet, [3] the time and date of the tweet, and [4] the tweet content.

Log files detailing what Epicosm has done are in /epicosm_logs/.

Full tweet content and metadata of all tweets are stored in MongoDB in a format closely aligned with JSON. To work with the full raw data, you will need MongoDB installed. The tweet database is named twitter_db, with two collections: tweets, and friends, which contains a list of all users that each user in your list is following. The friends collection will only be made if you ask for friends lists to be gathered. Currently, gathering friends lists causes the process to be heavily rate-limited by Twitter! [solution in progress]

A backup of the entire database is stored in /output/twitter_db/. If you have MongoDB installed, this can be restored with the command

mongorestore -d [your name for the database] [the path to the mongodump .bson file]

for example:

mongorestore -d twitter_db ./output/twitter_db/tweets.bson

(However, please check the MongoDB documentation, as commands can change.)

To view and interact with the database using a GUI, you will need MongoDB installed, and a database viewer. Of open source options, we find that Robo 3T works very well.
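If you would rather poke at the data from Python than through a GUI, a minimal pymongo sketch (assuming the twitter_db database and tweets collection described above; the "full_text" field follows Twitter's JSON and may differ in your data) is:

# Minimal sketch: count the stored tweets and print the text of the five most recently inserted.
from pymongo import MongoClient

db = MongoClient()["twitter_db"]
print(db.tweets.count_documents({}))
for tweet in db.tweets.find().sort("_id", -1).limit(5):
    print(tweet.get("full_text"))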

•••

7 Running the python script manually

See the source file in /src and run it with

python3 epicosm.py [your run flag]

You must provide 2 files:

  1. a list of user screen names in a file called user_list. The user list must be a plain text file, with a single username (twitter screen name) per line.
  2. Twitter API credentials will need to be supplied, by editing the file credentials.py (further instructions are inside the file). You will need your own Twitter API credentials from a developer account authorised by Twitter, and to generate the required keys. Please see our guide, and there are further details in Twitter's documentation on how to do this.

Please also note these further requirements.

  1. Put all repository files and your user list into their own folder. The python script must be run from the folder it is in.
  2. MongoDB version 4 or higher will need to be installed. It does not need to be running; the script will check MongoDB's status and start it if it is not running. The working database will be stored in the folder where you place your local copy of this repository (not the default location of /data/db). For Linux and MacOS, use your package manager (e.g. apt, yum, yast), for example:

apt install mongodb (or yum, brew or other package manager as appropriate)

  3. Python 3 dependencies will need to be installed from the src/requirements.txt file, which you can do by running:

pip3 install -r requirements.txt

•••

8 Licence

DynamicGenetics/Epicosm is licensed under the GNU General Public License v3.0. For full details, please see our license file.

Epicosm is written and maintained by Alastair Tanner, University of Bristol, Integrative Epidemiology Unit.

epicosm_legacy's People

Contributors: altanner, dependabot[bot], ninadicara

epicosm_legacy's Issues

check for path of mongodb would be nice to include

Currently it goes "is mongodb running?" - ok. What we want is "if mongodb is running, where?" and "if the running db is not in this folder, give a warning". Doing crazy stuff like

existing_mongodb_dbpath = subprocess.check_output(["ps", "ax", "|",
                                "grep", "-v", "awk", "|",
                                "awk", "'{for(i=1;", "i<=NF;", "i++)",
                                "if($i~/mongod/)", "print", "$(i+2)}'"])

doesn't work, because without a shell the | characters are passed to ps as literal arguments rather than being interpreted as pipes.
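One way around this might be to skip the shell pipeline altogether and parse the mongod command line in Python (a sketch; it assumes mongod was started with an explicit --dbpath argument rather than a config file). Alternatively, the original pipeline should work if passed as a single string with shell=True.

# Sketch: ask pgrep for the full mongod command line, then take the token after --dbpath.
# Returns None if no mongod is running, or if it was started without an explicit --dbpath.
import subprocess

out = subprocess.run(["pgrep", "-a", "mongod"], capture_output=True, text=True).stdout
tokens = out.split()
existing_mongodb_dbpath = (tokens[tokens.index("--dbpath") + 1]
                           if "--dbpath" in tokens else None)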

v2 API - tweet harvest count discrepancy

With the v2 API, we can now harvest complete timelines. However, there is usually a discrepancy between the total tweet count retrieved and the total that the Twitter website claims someone has posted. This is usually between 1 and 10% of tweets.

This is discussed in the community and is a known issue; I think the cause is that some tweets have been deleted, or are retweets from accounts that have since been made private, or other edge cases. So for now I am leaving it, because it seems to be a common issue and is only minor.

groundtruth reports

We need two reports:

  • "got groundtruth but user not in db"
  • "groundtruth added: this many users were not in the groundtruth", plus a file listing those users.

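A rough sketch of how the second report could be generated (hedged: it assumes groundtruth.csv carries the user identifier in its first column and that stored tweets have a user.id_str field, which may not match the real layouts):

# Sketch: list groundtruth users with no tweets in the local database, and write them to a file.
# The groundtruth.csv column layout and the user.id_str field are assumptions.
import csv
from pymongo import MongoClient

db = MongoClient()["twitter_db"]
db_users = set(db.tweets.distinct("user.id_str"))

with open("groundtruth.csv") as infile:
    groundtruth_users = {row[0] for row in csv.reader(infile) if row}

missing = sorted(groundtruth_users - db_users)
with open("groundtruth_users_not_in_db.txt", "w") as outfile:
    outfile.write("\n".join(missing))
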
get_tweets is messy

There is lots of repetition in src/modules/twitter_ops > get_tweets()

It kept breaking when I moved the api call line into its own function, and I was too dim to work out why.

Needs cleaning, but for now it works at least. But yes, ugly as hell :/

create pyinstaller executable without baking in credentials

How can we leave out credentials and user_list?

The user list seems fine, but credentials is not. Is this because credentials is a module while user_list is a file? If so, should we go back to bringing credentials in as a file, or is there a smarter way to deal with this?

Are retweets truncated?

Retweets are recovered truncated - this is designed behaviour, because a retweet's full text is stored in a different field. This is quite well documented, but requires messing with code that I don't fully understand. So instead we are going to give mongoexport a conditional:

  • if the record does not have the field "retweeted_status", it is not a retweet, so just get the field "full_text";
  • if the record DOES have "retweeted_status", the actual full text of the tweet is in the field retweeted_status.full_text.

Putting a conditional query into mongoexport is proving complicated and is not working. Just exporting the retweeted_status.full_text field leaves blank the tweets which were not retweets, as expected.

Possible solutions:

  • get the conditional working
  • get MongoDB to move the retweet full_text field to the full_text field (feels dangerous, and breaks the true tweet format)
  • make two output files and merge them
  • many other ways too.
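If exporting through pymongo rather than mongoexport is acceptable, the conditional is straightforward in Python (a sketch; field names follow Twitter's v1.1 JSON, and the CSV-writing step is left out):

# Sketch: pick the untruncated text, preferring retweeted_status.full_text when present.
from pymongo import MongoClient

db = MongoClient()["twitter_db"]
for tweet in db.tweets.find({}, {"full_text": 1, "retweeted_status.full_text": 1}):
    retweeted = tweet.get("retweeted_status")
    text = retweeted["full_text"] if retweeted else tweet.get("full_text")
    # append (tweet["_id"], text) to the CSV output here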

Is timeout reasonable for big datasets?

stop_mongodb includes a one-minute timeout, in case it gets stuck in an infinite loop. Is there a better way of doing that? Might closing mongod take a long time if the db is very large, and go over this limit?
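One possible alternative (a sketch only): after asking mongod to shut down, poll until the process has actually gone rather than sleeping for a fixed period, perhaps with a much larger ceiling for very big databases:

# Sketch: wait for mongod to exit by polling pgrep, instead of a fixed 60-second timeout.
import subprocess
import time

while subprocess.run(["pgrep", "mongod"], capture_output=True).returncode == 0:
    time.sleep(2)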
