
living-with-machines / zoonyper

Code to make it easy to import and process Zooniverse annotations and their metadata in Python/Jupyter Notebooks

License: MIT License

Language: Python (100%)

Topics: zooniverse, data, data-processing, crowdsourcing, data-science, python


About zoonyper


Zoonyper is a Python library designed to make it easy to import and process Zooniverse annotations and their metadata in your own Python code. It is especially well suited for use in Jupyter Notebooks.

Purpose

The Zooniverse citizen science platform's Project Builder allows anyone to create crowdsourced tasks using uploaded or imported images and other media. However, its flexibility means that the data created can be difficult to process.

Zoonyper helps process the output files from the Zooniverse citizen science platform, facilitating data wrangling, compression, and output into JSON and CSV files. Those output files can then be used more easily in, for example, Observable visualisations, Excel, and other tools.
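Because the processed tables are ordinary Pandas DataFrames (see "Getting started" below), exporting them is straightforward. A minimal sketch, with placeholder paths and filenames:

    from zoonyper import Project

    # Set up the project as described under "Getting started" below
    project = Project("<path to the directory with all necessary files>")

    # The classifications table is a regular Pandas DataFrame, so the standard
    # Pandas exporters work; the output filenames here are placeholders
    project.classifications.to_csv("classifications-processed.csv")
    project.classifications.to_json("classifications-processed.json", orient="records")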

Background

The library was created as part of the Living with Machines project, a research project developing historical and data science methods to study the effects of mechanisation on the lives of ordinary people during the long nineteenth century.

As part of that work, we used digitised historical newspapers at scale. We chose crowdsourcing as a method for some of this work so that we could invite the public to actively contribute to our research, to observe how training data is created and annotated for machine learning, and to view the source material we were using across the project. We used the Zooniverse project builder as it is designed for citizen science projects in which volunteers contribute to scientific research by annotating and categorizing images or other data. The annotations created by volunteers are collected as "classifications" in the Zooniverse system.

We queried digitised newspapers for keywords related to our research topics; uploaded the images, automatically transcribed text (OCR), and metadata about the selected articles to Zooniverse; and then asked volunteers to help us with classifications or transcriptions (typing in text) of those articles. The final goal of the research overall was to use the annotations to study the content of these historical newspapers and gain insights into the events and trends of the past.

Getting started

Here's how you can use Zoonyper in your own project:

  1. Install the library: First, you'll need to install Zoonyper. You can do this by cloning the repository and installing it, following the instructions under "Installation" below.

  2. Import the Project class: Once you've installed the library, you can import the Project class into your own Python code by adding the following line to the top of your code:

    from zoonyper import Project
  3. Initialize a Project object: To start using the Project class, you'll need to create a Project object. You can do this by calling the constructor and passing in the path to the directory that contains all the required files exported from your Zooniverse project lab (see "Preparing your Zooniverse files" below):

    project = Project("<path to the directory with all necessary files>")
  4. Access the project's data and metadata: Once you have a Project object, you can access its annotations by using the .classifications attribute. This attribute is a Pandas DataFrame, where each row contains information about a single classification, including annotations.

  5. Process the data and metadata: Because the data structures in Zoonyper are Pandas DataFrames, you can process the classifications, subjects, and annotations in any way you like, using the tools and techniques you're familiar with. For example, you might want to calculate statistics about the annotations or create plots to visualize the data, as in the sketch below.
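Here is a minimal sketch of steps 4 and 5. Only the .classifications attribute is described above; the column name in the commented-out aggregation is an assumption based on the standard Zooniverse classification export, so inspect your own DataFrame first:

    from zoonyper import Project

    project = Project("<path to the directory with all necessary files>")

    # Each row of this Pandas DataFrame is one classification, including its annotations
    classifications = project.classifications

    # Inspect the data before processing it
    print(classifications.columns)
    print(classifications.head())

    # Example statistic: classifications per volunteer. "user_name" is an
    # assumption based on the standard Zooniverse export; check the column
    # names printed above before uncommenting.
    # print(classifications["user_name"].value_counts())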

Preparing your Zooniverse files

Via Zooniverse's web 'Lab' interface, go to the Data Exports page. Request and download these exports:

  • classification export
  • subject export
  • workflow export
  • talk comments (this export provides both comments and tags)

Rename the downloads "classifications.csv", "subjects.csv", "workflows.csv", "comments.json", and "tags.json" respectively, and place them together in one folder. This folder's path is what should be passed to the Project constructor.
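A minimal sketch of the resulting setup, assuming the renamed files have been placed in a folder called exports (the folder name is only an example):

    from zoonyper import Project

    # Assuming the five renamed export files live together in one folder, e.g.:
    #
    #   exports/
    #   ├── classifications.csv
    #   ├── subjects.csv
    #   ├── workflows.csv
    #   ├── comments.json
    #   └── tags.json
    #
    # pass that folder's path to the constructor:
    project = Project("exports")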

Installation

Because this project is in active development, you need to install it from the repository for the time being. To do so, follow the installation instructions.
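For reference, here is a sketch of one way such a repository install can look, assuming pip can install straight from the GitHub repository; where the official instructions differ, follow those instead:

    pip install git+https://github.com/living-with-machines/zoonyper.git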

Documentation

The public documentation is available at https://living-with-machines.github.io/zoonyper.

You can contribute to the documentation by using Sphinx to edit and render the docs directory.
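As a sketch, assuming a conventional Sphinx layout in which the docs directory holds the Sphinx configuration (the project's own tooling may differ):

    pip install sphinx
    sphinx-build -b html docs docs/_build/html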

Data model

The data model behind Zoonyper's DataFrames is illustrated in the following Mermaid entity-relationship diagram.

erDiagram
    workflow ||--|{ annotation : has
    workflow }|--o{ subject_set: has
    subject_set }|--|{ subject: contains
    annotation ||--|| subject: on
    user ||..|{ annotation : makes
    user ||..|{ comment : writes
    tag }|--|{ comment : in
    comment ||--o{ subject : "written about"

Contributors

dependabot[bot], jmiguelv, kallewesterling, mialondon


zoonyper's Issues

No module named 'wasabi'

from zoonyper import Project

resulted in:


ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 from zoonyper import Project

~/Zoonyper install/zoonyper/zoonyper/__init__.py in <module>
      1 __version__ = "0.1.0"
      2
----> 3 from .project import Project

~/Zoonyper install/zoonyper/zoonyper/project.py in <module>
     12 import time
     13
---> 14 from .utils import Utils, TASK_COLUMN, get_current_dir
     15 from .log import log
     16

~/Zoonyper install/zoonyper/zoonyper/utils.py in <module>
      8 import re
      9
---> 10 from .log import log
     11
     12

~/Zoonyper install/zoonyper/zoonyper/log.py in <module>
      1 # TODO: Make a better logger
      2
----> 3 from wasabi import Printer
      4
      5 printer = Printer()

ModuleNotFoundError: No module named 'wasabi'
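A likely workaround, assuming wasabi is simply missing from the environment zoonyper was installed into (rather than a deeper packaging problem), is to install it manually:

    pip install wasabi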

Abstract code for subject disambiguation into command line code

Here's a quick sketch:

# Setup:
# - put username and password separated with a space in a file called "auth"
# - make sure panoptes_client, tqdm and requests are installed packages

from pathlib import Path
from panoptes_client import Panoptes, Project, SubjectSet
from tqdm.notebook import tqdm

import requests
import hashlib
import json
import time

def get_md5(path: str):
    """
    Computes the MD5 hash of a file.

    .. versionadded:: 0.1.0

    Parameters
    ----------
    path : str
        The path of the file to compute the MD5 hash for.

    Returns
    -------
    str
        The computed MD5 hash in hexadecimal format.

    Notes
    -----
    The function is borrowed from https://bit.ly/3TvUrd1.
    """
    md5_hash = hashlib.md5()

    with open(path, "rb") as f:
        content = f.read()
        md5_hash.update(content)

        digest = md5_hash.hexdigest()

    return digest

# Read username and password (space-separated on a single line) from the "auth" file
username, password = Path("./auth").read_text().strip().split(" ")

# Project ID for LwM is 9943
project_id = 9943

# Connect API
Panoptes.connect(username=username, password=password)

# Set up Project
project = Project(project_id)

# Load subject sets + set up names
subject_set_ids = json.loads(Path("subject_sets.json").read_text())
subject_set_names = list(subject_set_ids.keys())

# Load in the done subject sets (so we don't double up)
done_subject_sets = json.loads(Path("done_subject_sets.json").read_text()) if Path("done_subject_sets.json").exists() else []

lst = [x for x in subject_set_names if x not in done_subject_sets] # and x in PROCESS_KEYS

for subject_set_name in lst: # tqdm(lst, position = 0, desc=subject_set_name):
    # print(f"Processing {subject_set_name}")
    subject_set_id = subject_set_ids[subject_set_name]
    subject_set = SubjectSet(subject_set_id)
    
    errors_occurred = False
    
    for subject in tqdm(subject_set.subjects, position=1, total = subject_set.set_member_subjects_count, desc=subject_set_name): # leave=False, 
        if "!zooniverse_file_md5" not in subject.metadata:
            # print(f"updating {subject.id}")
            urls = [url for x in subject.locations for url in x.values()]
            
            # Ensure we have only one URL
            if len(urls) > 1:
                raise NotImplementedError("This script has no ability to process multi-URL subjects yet.")
                
            if len(urls) == 0:
                print(f"--> Warning: subject {subject.id} had no URL!")
                continue
            
            # because we will only process subjects with one URL (see above)
            url = urls[0]
            
            filename = url.split('/')[-1]
            filepath = Path(f"downloads/{filename}")

            if not filepath.exists():
                filepath.parent.mkdir(parents=True, exist_ok=True)

                try:
                    r = requests.get(url, timeout=10)
                except requests.RequestException:
                    # Note the failure, back off briefly, and move on; the whole
                    # subject set will be retried on a later run
                    errors_occurred = True
                    time.sleep(50)
                    continue

                if r.status_code != 200:
                    raise RuntimeError(f"Failed with status {r.status_code}")

                filepath.write_bytes(r.content)

            md5 = get_md5(filepath)

            subject.metadata["!zooniverse_file_md5"] = md5
            subject.save()
    
    if not errors_occurred:
        done_subject_sets.append(subject_set_name)
        Path("done_subject_sets.json").write_text(json.dumps(done_subject_sets))

Text for front page - suggestions

For @mialondon to write: Something on the history/context of the project, e.g.:

  • that it works with files exported from Zooniverse;
  • that our interest has been in annotations (classifications (Zooniverse 'questions') or free text) on articles extracted from historical newspapers, with article metadata included in the Zooniverse manifests;
  • that our workflows have been built around the idea of winnowing out unsuitable or irrelevant articles, then asking for more detailed classifications;
  • something on the composition of our subject sets (noting that they've changed over time).

For @kallewesterling to note: something on assumptions about how people will run it, e.g. that they will write their own Python and import zoonyper to use its functions.

Notes on 'setting up a project'

I think the need to disambiguate subjects is pretty unusual - in our case it was the result of a sprawling project with many hands stirring the pot (and an accidental double upload, which I guess might happen more often).

Two thoughts:

We could explain that section of the text a little more (especially given the overhead of the process it mentions), and

How would someone proceed if they were happy to skip that step? i.e. they didn't want to run project.disambiguate_subjects but did want things in a dataframe?
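For what it's worth, a minimal sketch of that skip-the-step path, assuming (as the Getting started section above suggests) that the constructor alone is enough to populate the classification DataFrame:

    from zoonyper import Project

    # Construct the project without ever calling project.disambiguate_subjects(),
    # assuming the constructor alone populates the DataFrames (per "Getting started")
    project = Project("<path to the directory with all necessary files>")
    classifications = project.classifications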
