
we1s-collector

Archived

With the retirement of the Web Services Kit (WSK) API by LexisNexis around 2019-07, core features of the collector are no longer functional. A new API -- along with a new document format -- requires new connection, query, and preprocessing code.

The collector is a system for searching, paging, and ingesting article data from news sources (with a current focus on sources from the LexisNexis Web Services Kit API). Its interface to LexisNexis is built on a fork of the Yale DH Lab "lexis-nexis-wsk," a Python package with convenience wrappers for accessing the LexisNexis WSK API.

It consists of wrappers and utilities for processing queries into collections of news article metadata, with a "bagify" feature to transform article content into word lists. Input may be a comma-delimited list of query jobs; output is written to WE1S schema JSON files (one per article), which may be batch-packaged into zip files. It comes with a command-line batch-query interface, searchcmd.py, and a single-query interface, search.py.
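
For illustration, the bagify transform amounts to a case-insensitive sort of an article's tokens, discarding word order; a minimal sketch based on the bagify logic in search.py:

    def bagify(text):
        """Reduce article text to an alphabetized word list, discarding order."""
        return ' '.join(sorted(text.split(' '), key=str.lower))

    bagify('The humanities are the arts')  # 'are arts humanities The the'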

This software was developed for WE1S (WhatEvery1Says) for studying collections of news articles.

Requires WSK API credentials issued by LexisNexis in order to submit queries.

Python 3-based. Uses the LexisNexis Web Services Kit (WSK) via the Yale DH Lab lexis-nexis-wsk. Other requirements are listed in requirements.txt.

local install

  1. Download or git clone https://github.com/whatevery1says/we1s-collector.git

  2. From the directory, install the required Python 3 packages with:

    pip install -r requirements.txt

  3. Edit config/config.py and add your WSK credentials.
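
The exact setting names are defined by the config template in the repository; as a purely hypothetical illustration, the file might contain entries such as:

    # config/config.py -- hypothetical field names for illustration only;
    # use the keys defined in the shipped config template.
    WSK_USERNAME = 'your-lexisnexis-username'
    WSK_PASSWORD = 'your-lexisnexis-password'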

run a query set

Run a query set with the built-in test file:

    ./searchcmd.py -q queries.csv

Here are the usage details from searchcmd.py --help:

WSK command line interface for WE1S (WhatEvery1Says)
usage examples:
    python searchcmd.py -o ../wskoutput -q queries.csv
    ./searchcmd.py -o ../wskoutput -q queries.csv

optional arguments:
  -h, --help            show this help message and exit
  -b, --bagify
  -o OUTPATH, --outpath OUTPATH
                        output path, e.g. "../output"
  -q QUERIES, --queries QUERIES
                        specify query file path, e.g. queries.csv
  -z, --zip             zip the json output

Query files are comma-separated-value (.csv) files with a header row and one query defined per row.

source_title,source_id,keyword_string,begin_date,end_date,result_filter
Chicago Daily Herald,163823,body(plural(humanities)) or hlead(plural(humanities)),2017-01-01,2017-01-31,humanities
BBC Monitoring: International Reports,10962,body(plural(humanities)) or hlead(plural(humanities)),2017-01-01,2017-01-31,humanities
The Guardian (London),138620,body(plural(arts)) or hlead(plural(arts)),2017-01-01,2017-01-15,the arts
TVEyes - BBC 1 Wales,402024,body(plural(humanities)) or hlead(plural(humanities)),2017-01-01,2017-01-31,humanities
Washington Post,8075,body(liberal PRE/1 plural(arts)) or hlead(liberal PRE/1 plural(arts)),2017-01-01,2017-01-31,liberal arts

The fields are:

  • source_title
    A human-readable label for the source / publication, used in file names and metadata.
  • source_id
    The LexisNexis WSK source id number, used in the query.
  • keyword_string
    The search query string. It may include keywords and WSK search operators such as hlead(), plural(), PRE/1, et cetera.
  • begin_date
    Beginning of the search date range, in YYYY-MM-DD format.
  • end_date
    End of the search date range, in YYYY-MM-DD format.
  • result_filter
    Optional post-processing filter string. Results that do not contain this literal string are still returned, but are filtered into the (no-exact-matches) result group.

The default query filename is queries.csv -- you may edit the existing queries.csv or create new .csv files with the required header columns and process those files using the -q argument.
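
To sanity-check a new query file before a run, any standard CSV reader can preview the jobs; a minimal sketch (not part of the collector itself):

    import csv

    # Print one line per query job to verify the header row and field values.
    with open('queries.csv', newline='') as f:
        for row in csv.DictReader(f):
            print(row['source_id'], row['begin_date'], row['end_date'],
                  row['keyword_string'])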

Docker

This repository comes with a Dockerfile for installing as a Docker container on a generic virtual machine running Debian Linux with Python 3.6. The container image may be built locally, or it may be run from a pre-built image available on Docker Hub.

adding credentials to Docker

Note that in order to submit queries, the container must have valid credentials for the WSK API. These can be added:

  1. before a local build, by editing the config/config.py file prior to building the image
  2. when launching a container, by passing --env arguments to the docker run command (see the sketch below)
  3. after entering a running container, by editing the /app/config/config.py file inside the container
  4. by running the container in a swarm and mounting credentials defined as "Docker secrets"

For more details on these options, see: config/CONFIG_README.md
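
For example, option 2 might look like the following; the variable names here are hypothetical, so use whatever names config/config.py expects:

    docker run -d --name we1s-collector \
      --env WSK_USERNAME='username' \
      --env WSK_PASSWORD='password' \
      whatevery1says/we1s-collector:latest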

run from Docker local build

  1. Download or git clone https://github.com/whatevery1says/we1s-collector.git

  2. From the directory, build the image:

    docker build -t we1s-collector .

  3. Run a container from the image.

    docker run -d --name we1s-collector we1s-collector

  4. Enter the shell of the running container

    docker exec -it we1s-collector /bin/bash

  5. Run a test query set:

    cd /app
    searchcmd.py -q queries.csv

run from Docker remote image

  1. Run a container from an image hosted on Docker Hub:

    docker run -d --name we1s-collector whatevery1says/we1s-collector:latest

  2. Enter the shell of the running container

    docker exec -it we1s-collector /bin/bash

  3. Run a test query set:

    cd /app
    searchcmd.py -q queries.csv


we1s-collector's Issues

rename repo and rebind to docker hub automated builds

There is no clean way to do this as a remapping via Docker Hub config -- it can be done via Docker Cloud, but reports are that this leaves some settings orphaned / stuck.

It requires deleting the current automated build repo on Docker Hub, renaming this repository, then creating a new Docker Hub build repo and changing our stack.yml files to point to it.

Why do searches abort?

We still don't fully understand the aborting behavior. Why do some searches with 0 results complete (but return 0 results), while other searches abort? We think it has something to do with the wsk library's time_delta handling. We should do some testing with searches known to abort and searches known to return 0 results "successfully."

`searchcmd.py` date collection

We still aren't collecting some dates correctly. Possible test set: NYT articles from Dan's NYT1980 project. Need others.

multibody duplicate data?

Articles may have multiple <body> tags. Sometimes those contents might be redundant?

If so, here is a potential fix: take only the first body tag. If the first is a preview, then perhaps they should be merged -- or only the second should be taken.

Potential patch on search.py:

             try:  # move dictionary keys
                 soup = BeautifulSoup(article.pop('full_text'), 'lxml')
                 body_divs = soup.find_all("div", {"class":"BODY"})
-                txt = ''
-                for b in body_divs:
-                    txt = txt + b.get_text(separator=u' ')
+                txt = body_divs[0].get_text(separator=u' ')
                 txt = string_cleaner(txt)
                 if bagify:
                     txt = ' '.join(sorted(txt.split(' '), key=str.lower))

failing to write wsk.log due to directory permissions kills script

Currently, running searchcmd.py in a directory with no write permissions causes the script to die when the log file cannot be written to disk.

An alternative might be to write the log file to the destination directory specified with -o.

jovyan@2dfeff0ab998:/app$ cd /
jovyan@2dfeff0ab998:/$ searchcmd.py -q /data/queries/queries_test_nyt.csv -o /data/test/
Traceback (most recent call last):
  File "/app/searchcmd.py", line 35, in <module>
    main(ARGS)
  File "/app/searchcmd.py", line 20, in main
    outpath=args.outpath, zip_output=args.zip)
  File "/app/search.py", line 195, in search_querylist
    zip_output=zip_output
  File "/app/search.py", line 68, in search_query
    logging.info(slug)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1901, in info
    root.info(msg, *args, **kwargs)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1307, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1443, in _log
    self.handle(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1453, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1515, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 864, in handle
    self.emit(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1070, in emit
    self.stream = self._open()
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1060, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: '/wsk.log'
jovyan@2dfeff0ab998:/$
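
A minimal sketch of that alternative, assuming the parsed arguments expose the output directory as args.outpath (matching the -o option above):

    import logging
    import os

    def configure_logging(outpath):
        """Write wsk.log into the user-specified output directory
        instead of the current working directory."""
        os.makedirs(outpath, exist_ok=True)
        logging.basicConfig(filename=os.path.join(outpath, 'wsk.log'),
                            level=logging.INFO)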

dateutil gives UnknownTimezoneWarning

In version 0.10.0, when run in a Docker container as a service (from the Docker Hub image jeremydouglass/we1s-collector:latest):

jovyan@7aaa98bd34dc:/data$ searchcmd.py -q queries/queries_test_nyt.csv -o test/
07/27/2018 02:28:21 AM thenewyorktimes_bodypluralhumanitiesorhleadpluralhumanities
 * querying for body(plural(humanities)) or hlead(plural(humanities)) 6742 1 10 2017-01-01 2017-01-31
/usr/local/lib/python3.6/site-packages/dateutil/parser/_parser.py:1204: UnknownTimezoneWarning: tzname EST identified but not understood.  Pass `tzinfos` argument in order to correctly return a timezone-aware datetime.  In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
2017-01-30T00:00:00Z
2017-01-30T00:00:00Z
2017-01-29T00:00:00Z
2017-01-28T00:00:00Z
2017-01-25T00:00:00Z
2017-01-24T00:00:00Z
2017-01-22T00:00:00Z
2017-01-22T00:00:00Z
2017-01-17T00:00:00Z
2017-01-16T00:00:00Z
 * querying for body(plural(humanities)) or hlead(plural(humanities)) 6742 11 20 2017-01-01 2017-01-31
2017-01-15T00:00:00Z
2017-01-13T00:00:00Z
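
As the warning itself suggests, one possible fix is to pass a tzinfos mapping so that dateutil can resolve the bare tzname; a sketch, assuming EST here should map to America/New_York:

    from dateutil import parser, tz

    # Map the bare tzname to a concrete zone so dateutil returns a
    # timezone-aware datetime instead of warning.
    TZINFOS = {'EST': tz.gettz('America/New_York')}
    parser.parse('2017-01-30 5:00 PM EST', tzinfos=TZINFOS)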

`query_expander` minor issue

Can't figure out how to get it to work if the queries input file isn't named queries.csv -- I am probably messing this up somehow, but it is worth a look.

Revert wsk request for FullText back to FullTextWithTerms

Consider reverting af2c89a and including Terms in the XML request. Now that the search script is making better use of tags, these terms should not be harmful, and they could potentially be useful.

The change from af2c89a in wsk.py that would be reverted:

               <documentId>{1}</documentId>
             </documentIdList>
             <retrievalOptions>
-              <documentView>FullTextWithTerms</documentView>
+              <documentView>FullText</documentView>
               <documentMarkup>Display</documentMarkup>
             </retrievalOptions>
           </GetDocumentsByDocumentId>

container displaying GMT timezone rather than host timezone

From inside a container console, the time appears to be GMT.

For example, at 7:28 PM PST a file is written -- in both the container and another container mounting the same volume, the create time is instead GMT (02:28). From ls:

-rw-r--r--.  1 jovyan users   65766 Jul 27 02:28 6742_thenewyorktimes_bodypluralhumanitiesorhleadpluralhumanities_2017-01-01_2017-01-10.zip

and

jovyan@6353b6e3938c:~/work/data/test$ date                                                                                          
  Fri Jul 27 02:28:19 UTC 2018
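
One common mitigation, assuming the image includes tzdata, is to set the container's TZ environment variable at launch:

    docker run -d --name we1s-collector -e TZ=America/Los_Angeles \
      whatevery1says/we1s-collector:latest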

Possibly relevant:
