dynamicgenetics / epicosm_legacy

Please ignore this repo. It's just hanging around for quick access to some resources I needed. Will delete soon :) 14/9/21

License: GNU General Public License v3.0

Python 99.12% Shell 0.88%
social-media epidemiology cohort-studies mongodb python3 sentiment-analysis

epicosm_legacy's Introduction


Overview

Epicosm: Epidemiology of Cohort Social Media. Epicosm is a suite of tools for working with social media data in the context of epidemiological research. It is intended for epidemiologists who wish to gather, analyse and integrate social media data with existing longitudinal and cohort-study research. The tools can:

  • Harvest ongoing and retrospective Tweets from a list of users.
  • Listen to the real-time Twitter stream from geographic locations, and collate the results into a database.
  • Apply sentiment analysis to Tweets using labMT, VADER and LIWC (dictionary required for LIWC).
  • [in development] Validate sentiment analysis algorithms against groundtruth.

Instructions in a nutshell

1. Download the Epicosm executable for your system (epicosm_mac or epicosm_linux) from this repository.

2. Install MongoDB version 4 or higher:

  • In a Mac terminal brew install mongodb
  • In a Linux terminal apt install mongodb

3. Put these three files into a folder:

  • epicosm_mac OR epicosm_linux, as downloaded from the repository in step 1,
  • credentials.txt file (provided here, but complete with your own Twitter access keys),
  • and your user_list (supplied by you: one screen name per line, plain text file).

4. Run Epicosm from your command line, including your run flags

  • Epicosm will provide some help if it doesn't understand you: just type ./epicosm_linux or ./epicosm_mac. See below for more details; for example, a typical harvest can be started with ./epicosm_linux --user_harvest
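For reference, the folder described in step 3 might end up looking like this (the folder name is up to you; only the three file names matter):

my_epicosm_run/
    epicosm_linux   (or epicosm_mac)
    credentials.txt
    user_list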

•••

More detail

1 What does it do?

2 Running Epicosm from compiled python executable

3 Optional parameters

4 Natural Language Processing (Sentiment analysis)

5 Geoharvester

6 Data and other outputs

7 Running the python script manually

8 Licence

•••

1 What does it do?

Epicosm is a social media harvester, data manager and sentiment analyser. Currently, the platform uses Twitter as the data source, and the sentiment analysis methods available are VADER, labMT and LIWC (you will need an LIWC dictionary for this). You provide a list of users, and it will gather and store all tweets and metadata (going back a maximum of around 3,200 tweets per user, the Twitter API timeline limit) for each user. Images, videos and other attachments are stored as URLs. All information is stored in MongoDB. Harvesting can be iterated, for example once a week it can gather new tweets and add them to the database. As well as the full database, output includes a comma-separated values (.csv) file, with the default fields being the user id number, the tweet id number, the time and date, and the tweet content. Epicosm can also harvest the friends of users (i.e. the accounts that each user is following, not the followers of the user).

Epicosm uses MongoDB for data management, and this must be installed before running Epicosm. This can be done by downloading and installing from the MongoDB website, or in a Terminal window with the command brew install mongodb on a Mac, or apt install mongodb on Linux (Debian-based systems like Ubuntu).

Epicosm can be run in two ways. It can be run using the compiled python executables provided, epicosm_mac or epicosm_linux. If there are any issues with your input files (your user_list and your credentials.txt), Epicosm will try to help you. Alternatively, Epicosm can be run with Python version 3+; details are in section 7.

You will need Twitter API credentials, which means having a developer account authorised by Twitter. Please see our guide to getting an authorised account, and there are further details in Twitter's documentation on how to do this. As of August 2020, Twitter are usually rapid in authorising for academic purposes, although this can of course change. Be aware that many of the guides to getting authorisation that you will find are out of date!

•••

2 Running Epicosm from compiled python executable

This is the usual way of running Epicosm (see section 7 for running with Python directly).

You must provide 2 further files in the folder with the Epicosm executable:

  1. a list of user screen names in a file called user_list. The user list must be a plain text file, with a single username (twitter screen name) per line.
  2. Twitter API credentials. Please see the file in this repository for a template. This file must be called credentials.txt.

Then you can run the executable, for example ./epicosm_linux [your run flags] or ./epicosm_mac [your run flags].
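As an illustration of the user_list format (these screen names are placeholders, not real accounts), the file is simply one Twitter screen name per line:

jane_doe_example
another_account_02
a_third_account

For credentials.txt, use the template provided in this repository and fill in your own keys.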

•••

3 Optional parameters

When running the harvester, please specify what you want Epicosm to do:

--user_harvest Harvest tweets from all users in a file called user_list (provided by you) with a single user per line. The database will be backed up on every harvest, with a rotating backup of the last three harvests. These can be imported into another instance of MongoDB with mongoimport; see the MongoDB documentation for details.

--get_friends Create a database of the users that are being followed by the accounts in your user_list. (This process can be very slow, especially if your users follow many accounts.) You will also get a CSV of users and who they are following, in /output/csv. If used with --repeat, friends will only be gathered once.

--repeat Iterate the user harvest every 3 days. You will need to put this process in the background to free your terminal prompt, or to keep it running while logged out.

--refresh If you have a new user_list, this tells Epicosm to use that file as your updated user list.

--csv_snapshots Make a CSV-formatted snapshot of selected fields on every harvest. See the documentation for the format and fields of this CSV. Be aware that this may take up disk space - see ./output/csv.

Example of single harvest: ./epicosm --user_harvest

Example iterated harvest in background, with a renewed user_list and taking CSV snapshots: nohup ./epicosm --user_harvest --refresh --csv_snapshots --repeat &

4 Natural Language Processing (Sentiment analysis)

Once you have a database with tweets, you can apply sentiment analysis to each document and insert the results into MongoDB. You will need to run epicosm_nlp.py (if you have dependency errors, please install them with pip3 install -r requirements.txt).

To run, specify one or more of the following flags:

--insert_groundtruth Provide a file of groundtruth values called 'groundtruth.csv' and insert these into the local database.

--liwc Apply LIWC (Pennebaker et al 2015) analysis and append values to the local database. You must have a LIWC dictionary in the run folder, named "LIWC.dic". LIWC has around 70 categories (including posemo and negemo), but many of these will return no value because tweets are too short to provide information. Empty categories are not appended to the database. Note: the LIWC package cannot deal with its own dictionary; if it comes across phrasal entries it throws a key error. In LIWC 2015, most of these are variations on the word 'like' ('we like', 'they like', 'not like'), but the words 'like', 'not' and 'we' are already in categories, and the phrasal entries have the same metrics anyway. You will need to clean your dictionary with the script in src called cleanLIWC.sh.

--labmt Apply labMT (Dodds & Danforth 2011) analysis and append values to the local database. LabMT provides a single positive - negative metric, ranging from -1 to 1 (1 being positive sentiment, 0 being neutral, -1 being negative).

--vader Apply VADER (Hutto & Gilbert 2014) analysis and append values to the local database. VADER returns 4 metrics: positive, neutral, negative and compound. See their documentation for details.

--textblob Apply TextBlob (github: @sloria) analysis and append values to the local database. TextBlob provides a single positive - negative metric, ranging from -1 to 1 (1 being positive sentiment, 0 being neutral, -1 being negative).

The results of these analyses will be appended to each tweet's record, under the field "epicosm", and stored in MongoDB.
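As a rough illustration of what this looks like in practice, the sketch below scores each stored tweet with VADER and appends the result in MongoDB. This is a minimal sketch, not the code that epicosm_nlp.py actually runs: the tweet text field ("full_text") and the sub-field name ("epicosm.vader") are assumptions and may differ from Epicosm's exact schema.

# Minimal sketch: score each tweet with VADER and append the result in MongoDB.
# Assumes MongoDB is running locally with the twitter_db database described in section 6;
# the "full_text" field and "epicosm.vader" sub-field are assumptions, not Epicosm's exact schema.
from pymongo import MongoClient
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

db = MongoClient()["twitter_db"]
analyzer = SentimentIntensityAnalyzer()

for tweet in db.tweets.find({}, {"full_text": 1}):
    scores = analyzer.polarity_scores(tweet.get("full_text", ""))  # pos, neu, neg, compound
    db.tweets.update_one({"_id": tweet["_id"]}, {"$set": {"epicosm.vader": scores}})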

•••

5 Geoharvester

The python script geoharvester.py can launch a Twitter stream listener by geographic location, as defined by one or more latitude/longitude boxes. Please see the example geoboxes.py for the format of this file. As above, you will need to provide your credentials.txt to gain access to the Twitter streaming API. All tweets are stored in MongoDB under the database geotweets and the collection geotweets_collection. To sentiment analyse these, please see section 4 on NLP above. Few Tweets (historically, less than 2%) have geotags, but Twitter will try to assign a rough location based on city or country. As of 2020, Twitter is reporting that it will phase out geotagging, since few people authorise Twitter to geotag their tweets.
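The real format of geoboxes.py is defined by the example file in this repository; purely as an illustration (the variable names and coordinates here are made up), a single bounding box in the order the Twitter streaming API expects - south-west longitude, south-west latitude, north-east longitude, north-east latitude - might look like:

# geoboxes.py - illustrative sketch only; see the example file in this repository for the real format.
# Twitter's streaming API takes each box as [sw_lng, sw_lat, ne_lng, ne_lat].
bristol_box = [-2.73, 51.38, -2.45, 51.55]
geoboxes = bristol_box  # concatenate further boxes onto this list to listen to several areas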

6 Data and other outputs

The processed output is a database of tweets from the users in your user_list, and a CSV file in the folder ./output/csv/, which by default has the fields: [1] the ID of the tweeter, [2] the ID of the tweet, [3] the time and date of the tweet, and [4] the tweet content.

Log files detailing what Epicosm has done are in /epicosm_logs/.

Full tweet content and metadata of all tweets are stored in MongoDB in a format closely aligned with JSON. To work with the full raw data, you will need MongoDB installed. The tweet database is named twitter_db, with two collections: tweets, and friends, which contains a list of all users that each user in your list is following. The friends collection will only be made if you ask for friends lists to be gathered. Currently, gathering friends lists causes the process to be heavily rate-limited by Twitter! [solution in progress]

A backup of the entire database is stored in /output/twitter_db/. If you have MongoDB installed, this can be restored with the command

mongorestore -d [your name for the database] [the path to the mongodump .bson file]

for example:

mongorestore -d twitter_db ./output/twitter_db/tweets.bson

(However, please check the MongoDB documentation, as commands can change.)

To view and interact with the database using a GUI, you will need MongoDB installed, and a database viewer. Of open source options, we find that Robo 3T works very well.
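If you would rather poke at the data from Python than through a GUI, a minimal pymongo sketch (assuming the twitter_db database and tweets collection described above; the "full_text" field follows Twitter's JSON and may differ in your data) is:

# Minimal sketch: count the stored tweets and print the text of the five most recently inserted.
from pymongo import MongoClient

db = MongoClient()["twitter_db"]
print(db.tweets.count_documents({}))
for tweet in db.tweets.find().sort("_id", -1).limit(5):
    print(tweet.get("full_text"))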

•••

7 Running the python script manually

See the source file in /src and run it with

python3 epicosm.py [your run flag]

You must provide 2 files:

  1. a list of user screen names in a file called user_list. The user list must be a plain text file, with a single username (twitter screen name) per line.
  2. Twitter API credentials will need to be supplied, by editing the file credentials.py (further instructions are inside the file). You will need your own Twitter API credentials from a developer account authorised by Twitter, and to generate the required keys. Please see our guide, and there are further details in Twitter's documentation on how to do this.

Please also note these further requirements.

  1. Put all repository files and your user list into their own folder. The python script must be run from the folder it is in.
  2. MongoDB version 4 or higher will need to be installed. It does not need to be running; the script will check MongoDB's status and start it if it is not running. The working database will be stored in the folder where you place your local copy of this repository (not the default location of /data/db). For Linux and MacOS, use your package manager (e.g. apt, yum, yast), for example:

apt install mongodb (or yum, brew or other package manager as appropriate)

  3. Python 3 dependencies will need to be installed from the src/requirements.txt file, which you can do by running:

pip3 install -r requirements.txt

•••

8 Licence

DynamicGenetics/Epicosm is licensed under the GNU General Public License v3.0. For full details, please see our license file.

Epicosm is written and maintained by Alastair Tanner, University of Bristol, Integrative Epidemiology Unit.

epicosm_legacy's People

Contributors: altanner, dependabot[bot], ninadicara

epicosm_legacy's Issues

check for path of mongodb would be nice to include

Currently it goes "is mongodb running?" - ok. What we want is "if mongodb is running, where?" and "if the running db is not in this folder, give a warning". Doing crazy stuff like

existing_mongodb_dbpath = subprocess.check_output(["ps", "ax", "|",
                                "grep", "-v", "awk", "|",
                                "awk", "'{for(i=1;", "i<=NF;", "i++)",
                                "if($i~/mongod/)", "print", "$(i+2)}'"])

doesn't work, because without a shell the | characters are passed to ps as literal arguments rather than being interpreted as pipes.
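One way around this might be to skip the shell pipeline altogether and parse the mongod command line in Python (a sketch; it assumes mongod was started with an explicit --dbpath argument rather than a config file). Alternatively, the original pipeline should work if passed as a single string with shell=True.

# Sketch: ask pgrep for the full mongod command line, then take the token after --dbpath.
# Returns None if no mongod is running, or if it was started without an explicit --dbpath.
import subprocess

out = subprocess.run(["pgrep", "-a", "mongod"], capture_output=True, text=True).stdout
tokens = out.split()
existing_mongodb_dbpath = (tokens[tokens.index("--dbpath") + 1]
                           if "--dbpath" in tokens else None)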

v2 API - tweet harvest count discrepancy

With the v2 API, we can now harvest complete timelines. However, there is usually a discrepancy between the total tweet count retrieved and the total that the Twitter website claims someone has posted. This is usually between 1 and 10% of tweets.

This is discussed in the community and is a known issue; I think the cause is that some tweets have been deleted, or are retweets from accounts that have since been made private, or other edge cases. So for now I am leaving it, because it seems to be a common issue and is only minor.

groundtruth reports

We need two reports:

  • "got groundtruth but user not in db"
  • "groundtruth added: this many users were not in the groundtruth", plus a file listing those users.

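A rough sketch of how the second report could be generated (hedged: it assumes groundtruth.csv carries the user identifier in its first column and that stored tweets have a user.id_str field, which may not match the real layouts):

# Sketch: list groundtruth users with no tweets in the local database, and write them to a file.
# The groundtruth.csv column layout and the user.id_str field are assumptions.
import csv
from pymongo import MongoClient

db = MongoClient()["twitter_db"]
db_users = set(db.tweets.distinct("user.id_str"))

with open("groundtruth.csv") as infile:
    groundtruth_users = {row[0] for row in csv.reader(infile) if row}

missing = sorted(groundtruth_users - db_users)
with open("groundtruth_users_not_in_db.txt", "w") as outfile:
    outfile.write("\n".join(missing))
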
get_tweets is messy

There is lots of repetition in src/modules/twitter_ops > get_tweets()

It kept breaking when I moved the api call line into its own function, and I was too dim to work out why.

Needs cleaning, but for now it works at least. But yes, ugly as hell :/

create pyinstaller executable without baking in credentials

How can we leave out credentials and user_list?

The user list seems fine, but credentials is not. Is this because credentials is a module while user_list is a file? If so, should we go back to bringing credentials in as a file, or is there a smarter way to deal with this?

Are retweets truncated?

Retweets are recovered truncated - this is designed behaviour, because a retweet's full text is stored in a different field. This is quite well documented, but requires messing with code that I don't fully understand. So instead we are going to give mongoexport a conditional:

  • if the record does not have the field "retweeted_status", it is not a retweet, so just get the field "full_text";
  • if the record DOES have "retweeted_status", the actual full text of the tweet is in the field retweeted_status.full_text.

Putting a conditional query into mongoexport is proving complicated and is not working. Just exporting the retweeted_status.full_text field leaves blank the tweets which were not retweets, as expected.

Possible solutions:

  • get the conditional working
  • get MongoDB to move the retweet full_text field to the full_text field (feels dangerous, and breaks the true tweet format)
  • make two output files and merge them
  • many other ways too.
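If exporting through pymongo rather than mongoexport is acceptable, the conditional is straightforward in Python (a sketch; field names follow Twitter's v1.1 JSON, and the CSV-writing step is left out):

# Sketch: pick the untruncated text, preferring retweeted_status.full_text when present.
from pymongo import MongoClient

db = MongoClient()["twitter_db"]
for tweet in db.tweets.find({}, {"full_text": 1, "retweeted_status.full_text": 1}):
    retweeted = tweet.get("retweeted_status")
    text = retweeted["full_text"] if retweeted else tweet.get("full_text")
    # append (tweet["_id"], text) to the CSV output here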

Is timeout reasonable for big datasets?

stop_mongodb includes a one-minute timeout, in case it gets stuck in an infinite loop. Is there a better way of doing that? Might closing mongod take a long time if the db is very large, and go over this limit?
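One possible alternative (a sketch only): after asking mongod to shut down, poll until the process has actually gone rather than sleeping for a fixed period, perhaps with a much larger ceiling for very big databases:

# Sketch: wait for mongod to exit by polling pgrep, instead of a fixed 60-second timeout.
import subprocess
import time

while subprocess.run(["pgrep", "mongod"], capture_output=True).returncode == 0:
    time.sleep(2)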
