
we1s-collector

Archived

With the retirement of the Web Services Kit (WSK) API by LexisNexis around 2019-07, core features of the collector are no longer functional. A new API -- along with a new document format -- requires new connection, query, and preprocessing code.

The collector is a system for searching, paging, and ingesting article data from news sources (with a current focus on sources from the LexisNexis Web Services Kit API). Its interface to LexisNexis is built on a fork of the Yale DH Lab "lexis-nexis-wsk," a Python package with convenience wrappers for accessing the LexisNexis WSK API.

It consists of wrappers and utilities for processing queries into collections of news article metadata, with a "bagify" feature to transform article content into word lists. Input may be a comma-delimited list of query jobs; output is written to WE1S schema JSON files (one per article), which may be batch-packaged into zip files. It comes with a command-line batch-query interface, searchcmd.py, and a single-query interface, search.py.
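
For illustration, the bagify transform amounts to a case-insensitive sort of an article's tokens, discarding word order; a minimal sketch based on the bagify logic in search.py:

    def bagify(text):
        """Reduce article text to an alphabetized word list, discarding order."""
        return ' '.join(sorted(text.split(' '), key=str.lower))

    bagify('The humanities are the arts')  # 'are arts humanities The the'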

This software was developed for WE1S (WhatEvery1Says) for studying collections of news articles.

Requires WSK API credentials issued by LexisNexis in order to submit queries.

Python 3-based. Uses the LexisNexis Web Services Kit (WSK) via the Yale DH Lab lexis-nexis-wsk. Other requirements are listed in requirements.txt.

local install

  1. Download or git clone https://github.com/whatevery1says/we1s-collector.git

  2. From the directory, install the required Python 3 packages with:

    pip install -r requirements.txt

  3. Edit config/config.py and add your WSK credentials.
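
The exact setting names are defined by the config template in the repository; as a purely hypothetical illustration, the file might contain entries such as:

    # config/config.py -- hypothetical field names for illustration only;
    # use the keys defined in the shipped config template.
    WSK_USERNAME = 'your-lexisnexis-username'
    WSK_PASSWORD = 'your-lexisnexis-password'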

run a query set

Run a query set with the built-in test file:

    ./searchcmd.py -q queries.csv

Here are the usage details from searchcmd.py --help:

WSK command line interface for WE1S (WhatEvery1Says)
usage examples:
    python searchcmd.py -o ../wskoutput -q queries.csv
    ./searchcmd.py -o ../wskoutput -q queries.csv

optional arguments:
  -h, --help            show this help message and exit
  -b, --bagify
  -o OUTPATH, --outpath OUTPATH
                        output path, e.g. "../output"
  -q QUERIES, --queries QUERIES
                        specify query file path, e.g. queries.csv
  -z, --zip             zip the json output

Query files are comma-separated-value (.csv) files with a header row and one query defined per row.

source_title,source_id,keyword_string,begin_date,end_date,result_filter
Chicago Daily Herald,163823,body(plural(humanities)) or hlead(plural(humanities)),2017-01-01,2017-01-31,humanities
BBC Monitoring: International Reports,10962,body(plural(humanities)) or hlead(plural(humanities)),2017-01-01,2017-01-31,humanities
The Guardian (London),138620,body(plural(arts)) or hlead(plural(arts)),2017-01-01,2017-01-15,the arts
TVEyes - BBC 1 Wales,402024,body(plural(humanities)) or hlead(plural(humanities)),2017-01-01,2017-01-31,humanities
Washington Post,8075,body(liberal PRE/1 plural(arts)) or hlead(liberal PRE/1 plural(arts)),2017-01-01,2017-01-31,liberal arts

The fields are:

  • source_title
    A human-readable label for the source / publication, used in file names and metadata.
  • source_id
    The LexisNexis WSK source id number, used in the query.
  • keyword_string
    The search query string. It may include keywords and WSK search operators such as hlead(), plural(), PRE/1, et cetera.
  • begin_date
    Beginning of the search date range, in YYYY-MM-DD format.
  • end_date
    End of the search date range, in YYYY-MM-DD format.
  • result_filter
    Optional post-processing filter string. Results that do not contain this literal string are still returned, but are filtered into the (no-exact-matches) result group.

The default query filename is queries.csv -- you may edit the existing queries.csv or create new .csv files with the required header columns and process those files using the -q argument.
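
To sanity-check a new query file before a run, any standard CSV reader can preview the jobs; a minimal sketch (not part of the collector itself):

    import csv

    # Print one line per query job to verify the header row and field values.
    with open('queries.csv', newline='') as f:
        for row in csv.DictReader(f):
            print(row['source_id'], row['begin_date'], row['end_date'],
                  row['keyword_string'])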

Docker

This repository comes with a Dockerfile for installing as a Docker container on a generic virtual machine running Debian Linux with Python 3.6. The container image may be built locally, or it may be run from a pre-built image available on Docker Hub.

adding credentials to Docker

Note that in order to submit queries, the container must have valid credentials for the WSK API. These can be added:

  1. before a local build, by editing the config/config.py file prior to building the image
  2. when launching a container, by passing --env arguments to the docker run command (see the sketch below)
  3. after entering a running container, by editing the /app/config/config.py file inside the container
  4. by running the container in a swarm and mounting credentials defined as "Docker secrets"

For more details on these options, see: config/CONFIG_README.md
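
For example, option 2 might look like the following; the variable names here are hypothetical, so use whatever names config/config.py expects:

    docker run -d --name we1s-collector \
      --env WSK_USERNAME='username' \
      --env WSK_PASSWORD='password' \
      whatevery1says/we1s-collector:latest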

run from Docker local build

  1. Download or git clone https://github.com/whatevery1says/we1s-collector.git

  2. From the directory, build the image:

    docker build -t we1s-collector .

  3. Run a container from the image.

    docker run -d --name we1s-collector we1s-collector

  4. Enter the shell of the running container

    docker exec -it we1s-collector /bin/bash

  5. Run a test query set:

    cd /app
    searchcmd.py -q queries.csv

run from Docker remote image

  1. Run a container from an image hosted on Docker Hub:

    docker run -d --name we1s-collector whatevery1says/we1s-collector:latest

  2. Enter the shell of the running container

    docker exec -it we1s-collector /bin/bash

  3. Run a test query set:

    cd /app
    searchcmd.py -q queries.csv


we1s-collector's Issues

rename repo and rebind to docker hub automated builds

There is no clean way to do this as a remapping via Docker Hub config -- it can be done via Docker Cloud, but reports are that this leaves some settings orphaned / stuck.

It requires deleting the current automated build repo on Docker Hub, renaming this repository, then creating a new Docker Hub build repo and changing our stack.yml files to point to it.

Why do searches abort?

We still don't fully understand the aborting behavior. Why do some searches with 0 results complete (but return 0 results), while other searches abort? We think it has something to do with the wsk library's time_delta handling. We should do some testing with searches known to abort and searches known to return 0 results "successfully."

`searchcmd.py` date collection

We still aren't collecting some dates correctly. Possible test set: NYT articles from Dan's NYT1980 project. Need others.

multibody duplicate data?

Articles may have multiple <body> tags. Sometimes those contents might be redundant?

If so, here is a potential fix: take only the first body tag. If the first is a preview, then perhaps they should be merged -- or only the second should be taken.

Potential patch on search.py:

             try:  # move dictionary keys
                 soup = BeautifulSoup(article.pop('full_text'), 'lxml')
                 body_divs = soup.find_all("div", {"class":"BODY"})
-                txt = ''
-                for b in body_divs:
-                    txt = txt + b.get_text(separator=u' ')
+                txt = body_divs[0].get_text(separator=u' ')
                 txt = string_cleaner(txt)
                 if bagify:
                     txt = ' '.join(sorted(txt.split(' '), key=str.lower))

failing to write wsk.log due to directory permissions kills script

Currently, running searchcmd.py in a directory with no write permissions causes the script to die when the log file cannot be written to disk.

An alternative might be to write the log file to the destination directory specified with -o.

jovyan@2dfeff0ab998:/app$ cd /
jovyan@2dfeff0ab998:/$ searchcmd.py -q /data/queries/queries_test_nyt.csv -o /data/test/
Traceback (most recent call last):
  File "/app/searchcmd.py", line 35, in <module>
    main(ARGS)
  File "/app/searchcmd.py", line 20, in main
    outpath=args.outpath, zip_output=args.zip)
  File "/app/search.py", line 195, in search_querylist
    zip_output=zip_output
  File "/app/search.py", line 68, in search_query
    logging.info(slug)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1901, in info
    root.info(msg, *args, **kwargs)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1307, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1443, in _log
    self.handle(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1453, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1515, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 864, in handle
    self.emit(record)
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1070, in emit
    self.stream = self._open()
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1060, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: '/wsk.log'
jovyan@2dfeff0ab998:/$
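
A minimal sketch of that alternative, assuming the parsed arguments expose the output directory as args.outpath (matching the -o option above):

    import logging
    import os

    def configure_logging(outpath):
        """Write wsk.log into the user-specified output directory
        instead of the current working directory."""
        os.makedirs(outpath, exist_ok=True)
        logging.basicConfig(filename=os.path.join(outpath, 'wsk.log'),
                            level=logging.INFO)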

dateutil gives UnknownTimezoneWarning

In version 0.10.0, when run in a Docker container as a service (from the Docker Hub image jeremydouglass/we1s-collector:latest):

jovyan@7aaa98bd34dc:/data$ searchcmd.py -q queries/queries_test_nyt.csv -o test/
07/27/2018 02:28:21 AM thenewyorktimes_bodypluralhumanitiesorhleadpluralhumanities
 * querying for body(plural(humanities)) or hlead(plural(humanities)) 6742 1 10 2017-01-01 2017-01-31
/usr/local/lib/python3.6/site-packages/dateutil/parser/_parser.py:1204: UnknownTimezoneWarning: tzname EST identified but not understood.  Pass `tzinfos` argument in order to correctly return a timezone-aware datetime.  In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
2017-01-30T00:00:00Z
2017-01-30T00:00:00Z
2017-01-29T00:00:00Z
2017-01-28T00:00:00Z
2017-01-25T00:00:00Z
2017-01-24T00:00:00Z
2017-01-22T00:00:00Z
2017-01-22T00:00:00Z
2017-01-17T00:00:00Z
2017-01-16T00:00:00Z
 * querying for body(plural(humanities)) or hlead(plural(humanities)) 6742 11 20 2017-01-01 2017-01-31
2017-01-15T00:00:00Z
2017-01-13T00:00:00Z
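
As the warning itself suggests, one possible fix is to pass a tzinfos mapping so that dateutil can resolve the bare tzname; a sketch, assuming EST here should map to America/New_York:

    from dateutil import parser, tz

    # Map the bare tzname to a concrete zone so dateutil returns a
    # timezone-aware datetime instead of warning.
    TZINFOS = {'EST': tz.gettz('America/New_York')}
    parser.parse('2017-01-30 5:00 PM EST', tzinfos=TZINFOS)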

`query_expander` minor issue

Can't figure out how to get it to work if the queries input file isn't named queries.csv -- I am probably messing this up somehow, but it is worth a look.

Revert wsk request for FullText back to FullTextWithTerms

Consider reverting af2c89a and including Terms in the XML request. Now that the search script is making better use of tags, these terms should not be harmful, and they could potentially be useful.

The change from af2c89a in wsk.py that would be reverted:

               <documentId>{1}</documentId>
             </documentIdList>
             <retrievalOptions>
-              <documentView>FullTextWithTerms</documentView>
+              <documentView>FullText</documentView>
               <documentMarkup>Display</documentMarkup>
             </retrievalOptions>
           </GetDocumentsByDocumentId>

container displaying GMT timezone rather than host timezone

From inside a container console, the time appears to be GMT.

For example, at 7:28 PM PST a file is written -- in both the container and another container mounting the same volume, the create time is instead GMT (02:28). From ls:

-rw-r--r--.  1 jovyan users   65766 Jul 27 02:28 6742_thenewyorktimes_bodypluralhumanitiesorhleadpluralhumanities_2017-01-01_2017-01-10.zip

and

jovyan@6353b6e3938c:~/work/data/test$ date                                                                                          
  Fri Jul 27 02:28:19 UTC 2018
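
One common mitigation, assuming the image includes tzdata, is to set the container's TZ environment variable at launch:

    docker run -d --name we1s-collector -e TZ=America/Los_Angeles \
      whatevery1says/we1s-collector:latest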

Possibly relevant:
