coursetable / ferry

A crawler for Yale courses and evaluation data. Integrates with Coursetable.

License: MIT License

Python 97.62% Shell 1.23% Dockerfile 0.73% JavaScript 0.43%
courses graphql hasura postgresql yale

ferry's Introduction

Coursetable

Coursetable is made of two big parts:

  1. Website: The site you see when you go to coursetable.com. The code for this – the front-end site as well as the back-end server that handles user actions – is contained within this repository.
  2. Crawler: The scripts behind the scenes that actually get all the data from Yale’s websites. The code for this is in our ferry repository.

Repository Layout

The various functions of the website are compartmentalized as follows:

  • /api: Source code for API server with Docker Compose configuration for backend logic.
  • /frontend: The current face of the site, built with React.

How to develop

Check out our contributing guide.

How to deploy

Deployments are automatically handled via GitHub Actions workflows. If necessary, you can also manually deploy. For all instructions relevant to deploying our code, see docs/deployment.md.

ferry's People

Contributors

avgupta456, bearsyankees, course-table, deepsource-autofix[bot], deepsourcebot, dependabot-preview[bot], dsjong, etherite1, hsheth2, hsheth2-bot, imgbotapp, inchkev, jlv34, josh-cena, kevinhu, lilyzhouzyj, maxyuan6717, neilsong, quintec

ferry's Issues

Reduce memory footprint

transform.py uses nearly 12GB of RAM to construct the tables in Pandas. This can be reduced by dropping unused columns, reducing deep copies, and specifying more efficient column datatypes.
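
The three reductions named above can be sketched on a toy table. This is a minimal illustration, not the code in transform.py: the `shrink` helper, the column names, and the "fewer than half the values are distinct" rule for choosing categoricals are all assumptions made for the example.

```python
import pandas as pd
from pandas.api import types as pdt


def shrink(df: pd.DataFrame, keep: list) -> pd.DataFrame:
    """Return a slimmer copy: only the needed columns, with compact dtypes."""
    out = df[keep].copy()  # one copy, limited to the columns actually used
    for name in out.columns:
        col = out[name]
        if pdt.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pdt.is_float_dtype(col):
            out[name] = pd.to_numeric(col, downcast="float")
        elif (pdt.is_object_dtype(col) or pdt.is_string_dtype(col)) and (
            col.nunique() < 0.5 * len(col)
        ):
            # Repetitive strings are stored once each as a categorical.
            out[name] = col.astype("category")
    return out


# Hypothetical data: the column names are made up for this sketch.
courses = pd.DataFrame(
    {
        "course_id": range(10_000),
        "season_code": ["202003", "202101"] * 5_000,
        "unused_blob": ["x" * 100] * 10_000,
    }
)
slim = shrink(courses, keep=["course_id", "season_code"])
before = courses.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

Measuring with `memory_usage(deep=True)` before and after each change is a simple way to check which of the three steps pays off on the real tables.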

Staging tables fails to rollback

Traceback (most recent call last):
  File "./ferry/stage.py", line 316, in <module>
    copy_from_stringio(raw_conn, professors, "professors_staged")
  File "/home/app/ferry/ferry/includes/importer.py", line 716, in copy_from_stringio
    cursor.rollback()
AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'rollback'
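
In psycopg2, transactions belong to the connection, so `rollback()` is a method of the connection object and not of the cursor. A minimal sketch of a corrected helper follows. The name `copy_buffer_to_table` and its signature are hypothetical and differ from the actual `copy_from_stringio` in importer.py, whose body is not shown here.

```python
import io


def copy_buffer_to_table(conn, buffer: io.StringIO, table: str) -> None:
    """Bulk-load tab-separated rows from `buffer` into `table`.

    `conn` is assumed to be a psycopg2 connection (or any object offering
    the same cursor(), commit() and rollback() methods).
    """
    cursor = conn.cursor()
    try:
        buffer.seek(0)
        cursor.copy_from(buffer, table, sep="\t", null="")
        conn.commit()
    except Exception:
        # Roll back on the connection: psycopg2 cursors have no rollback().
        conn.rollback()
        raise
    finally:
        cursor.close()
```

Re-raising after the rollback keeps the original error visible to the caller instead of masking it.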

Parse and import registrar notes

Certain courses have registrar notes associated with them. An example is CHNS 110, which has "Enrollment for this course is managed through Preference Selection". These are available in the regnotes field from the raw JSON responses.

Tiebreaking on last enrollment numbers

There are a couple of scenarios where we'll want to modify the last_enrollment field algorithm:

  • When there's multiple sections with different professors (e.g. MATH 120, where Sudesh tends to have larger classes than other profs)
  • When a previous professor comes back to teach a class that they previously taught, but didn't in the most recent instance (e.g. Stan with CPSC 323)

[Follow up on #37, cc @kevinhu]
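
One possible tiebreaking rule covering both scenarios is to prefer the most recent past offering that shares a professor with the current one, and to fall back to the most recent offering overall. The sketch below assumes that rule; the `Offering` type, its fields, and the placeholder professor names are hypothetical and not part of ferry.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    season: str  # e.g. "202003"; assumed to sort chronologically as a string
    professors: frozenset
    enrolled: int


def reference_offering(current_professors: frozenset, past: list) -> Offering | None:
    """Pick the past offering whose enrollment best predicts the current one."""
    if not past:
        return None
    # Prefer offerings (or sections) taught by one of the current professors.
    shared = [o for o in past if o.professors & current_professors]
    pool = shared or past
    return max(pool, key=lambda o: o.season)


past = [
    Offering("201903", frozenset({"Prof A"}), 120),
    Offering("202003", frozenset({"Prof B"}), 40),
]
print(reference_offering(frozenset({"Prof A"}), past).enrolled)  # 120
print(reference_offering(frozenset({"Prof C"}), past).enrolled)  # 40
```

Treating each section as its own `Offering` lets the same rule handle multi-section courses: a section is matched against earlier sections taught by the same professor.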

Store similar courses in database

  • Add one of the course embedding algorithms to the import pipeline
  • Store linked courses: perhaps a junction table?
    # Course-Professor association/junction table.
    course_professors = Table(
        "course_professors",
        Base.metadata,
        Column("course_id", ForeignKey("courses.course_id"), primary_key=True, index=True),
        Column(
            "professor_id",
            ForeignKey("professors.professor_id"),
            primary_key=True,
            index=True,
        ),
    )

Track enrollment last offered

Add computed fields in courses table tracking the evaluation_statistics.enrolled of the last time the course was offered (resolve by course codes) as well as the season and course_id of this last-enrolled date.

Planned fields:

  • courses.last_enrolled
  • courses.last_enrolled_season
  • courses.last_enrolled_course
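
A minimal sketch of how the three planned fields could be computed, assuming each course row carries a single course code, that season codes sort chronologically, and that "last offered" means a strictly earlier season with a known enrollment. The row layout and the function name are hypothetical.

```python
from itertools import groupby


def last_enrolled_fields(courses):
    """Map course_id -> the three planned last_enrolled* values."""
    latest = {}  # course code -> most recent earlier row with known enrollment
    fields = {}
    by_season = sorted(courses, key=lambda c: c["season"])
    for _, group in groupby(by_season, key=lambda c: c["season"]):
        batch = list(group)
        # First resolve every course in this season against earlier seasons...
        for course in batch:
            prev = latest.get(course["code"])
            fields[course["course_id"]] = {
                "last_enrolled": prev["enrolled"] if prev else None,
                "last_enrolled_season": prev["season"] if prev else None,
                "last_enrolled_course": prev["course_id"] if prev else None,
            }
        # ...then make this season available to later ones.
        for course in batch:
            if course["enrolled"] is not None:
                latest[course["code"]] = course
    return fields


rows = [
    {"course_id": 1, "code": "CPSC 323", "season": "201903", "enrolled": 100},
    {"course_id": 2, "code": "CPSC 323", "season": "202003", "enrolled": None},
    {"course_id": 3, "code": "CPSC 323", "season": "202103", "enrolled": 90},
]
print(last_enrolled_fields(rows)[3])
```

In this example course 3 points back to course 1, because course 2 has no enrollment figure to report.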

Using issues in this repo

Due to the nature of this repo (being a "peripheral" service of CourseTable), every feature implemented here needs to eventually have a corresponding change in the website, and most changes here are motivated by features on the website. Therefore, in order for the team to better track work, all requests for new data structures should be directed to https://github.com/coursetable/coursetable instead, with the label https://github.com/coursetable/coursetable/labels/New%20data%20structure. Issues in this repo should be limited to the following purposes:

  • Internal refactors
  • Code bugs
  • [Others; to be added later if they come up]

Ferry should not redump the entire database

Every time Ferry recrawls, it basically recreates the DB. This makes it hard to tell what things have changed unless we inspect the ferry-data diff. We should make Ferry generate a diff and write data incrementally.
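
As a rough illustration of the diff step, the sketch below compares two snapshots of a table held as dictionaries keyed by primary key and reports which rows would need an insert, a delete, or an update. The snapshot layout and the `diff_rows` name are assumptions for the example, not how Ferry stores its data.

```python
def diff_rows(old: dict, new: dict):
    """Compare two {primary_key: row} snapshots of one table."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return added, removed, changed


old = {1: {"title": "Intro"}, 2: {"title": "Algorithms"}}
new = {2: {"title": "Algorithms II"}, 3: {"title": "Systems"}}
added, removed, changed = diff_rows(old, new)
print(added)    # rows to INSERT
print(removed)  # rows to DELETE
print(changed)  # rows to UPDATE, as (before, after) pairs
```

Only the three returned sets would then be written, leaving untouched rows alone.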

Different classes reusing the same course code

A very niche issue, brought to our attention by Professor Bensinger. We've only found this issue to affect one course code (CSTC 300). A fix could perhaps cross-validate classes using both the course code and the extended class name; currently we only group course codes together with the assumption that Yale does not reuse course codes.

CSTC 300 01 - https://coursetable.com/Table/201903/course/CSTC_300_1
2019 Fall - Leadership as Behavior
2014 Spring - Captivity & Law World History
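
The cross-validation idea could look roughly like this: two offerings are treated as the same course only if they share a course code and their titles overlap enough. Word-level Jaccard similarity and the 0.3 threshold are arbitrary choices for this sketch, and both function names are hypothetical.

```python
def title_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two titles."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def same_course(code_a, title_a, code_b, title_b, threshold=0.3) -> bool:
    """Group two offerings only if code AND title both agree."""
    return code_a == code_b and title_similarity(title_a, title_b) >= threshold


print(
    same_course(
        "CSTC 300", "Leadership as Behavior",
        "CSTC 300", "Captivity & Law World History",
    )
)  # False: same code, unrelated titles
```

A threshold this simple would also split courses that were merely renamed, so the real rule would need tuning against known cases.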

Replace ujson.dumps with ujson.dump

The current method for exporting to JSON uses file.write(ujson.dumps(object)), which builds the entire JSON string in memory before writing it. Instead, we can use ujson.dump(object, file) to write directly to the file, which will reduce memory use and should be a bit faster.
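
A minimal sketch of the proposed change. The `export_json` wrapper is a hypothetical name, and the fallback to the standard-library `json` module (which offers the same `dump(obj, fp)` call) is only there so the example runs where ujson is not installed.

```python
try:
    import ujson as json_lib  # the library this issue is about
except ImportError:
    import json as json_lib  # stdlib fallback with the same dump(obj, fp) call


def export_json(obj, path: str) -> None:
    """Write `obj` as JSON directly to `path`."""
    with open(path, "w", encoding="utf-8") as file:
        # Before: file.write(json_lib.dumps(obj))
        json_lib.dump(obj, file)
```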

Question codes have divergent texts

The Fall 2020 course evaluations appear to have mismatches in some question texts:

Traceback (most recent call last):
  File "./ferry/transform.py", line 155, in <module>
    ) = import_evaluations(
  File "/home/app/ferry/ferry/includes/transform_import.py", line 735, in import_evaluations
    raise database.InvariantError(
ferry.database.database_utilities.InvariantError: Error: question codes YC401, YC401-YCWR, YC403, YC403-YCWR, YC409, YC409-YCWR have divergent texts
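
To see which texts actually disagree before deciding how to handle them, a check along these lines may help. It is a sketch, not the check in transform_import.py: the input format, the function name, and the whitespace and case normalization are assumptions, and the question texts are placeholders.

```python
from collections import defaultdict


def divergent_question_codes(questions):
    """Return the question codes that appear with more than one distinct text.

    `questions` is an iterable of (question_code, question_text) pairs.
    """
    texts = defaultdict(set)
    for code, text in questions:
        # Ignore differences that are only whitespace or letter case.
        texts[code].add(" ".join(text.split()).lower())
    return sorted(code for code, seen in texts.items() if len(seen) > 1)


sample = [
    ("YC401", "Placeholder text A"),
    ("YC401", "placeholder  text a"),  # same text after normalization
    ("YC403", "Placeholder text B"),
    ("YC403", "Placeholder text C"),   # genuinely different
]
print(divergent_question_codes(sample))  # ['YC403']
```

Printing the distinct texts for each reported code would show whether the mismatches are cosmetic or a real change in the question.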

Analyze Hasura performance

  • Search queries currently have a 600-800ms delay
  • Try the 'Analyze' tool in Hasura console's GraphQL explorer

Fetch and import first-year seminar status

  • Our responses don't include a status attribute indicating whether a course is a first-year seminar.
  • However, Yale Course Search allows us to filter for first-year seminars under the "Yale College Attributes" section.
