rxng8 / gettysburg-course-crawling-system

The system collects subject and course data from the college website to build the course catalog for each future school year. Estimated lines of code: ~10,000

License: Other

HTML 98.48% Python 1.52% Shell 0.01% Batchfile 0.01%
crawler scraper bs4 python

gettysburg-course-crawling-system's Introduction

Course Catalog Web Scraper System

This script grabs all of the relevant content for the course catalog from the gettysburg.edu website and compiles it into a single HTML file. It reads the URLs to fetch from a CSV file with the given format (NEED TO ADD).

When grabbing the content from each page, this script assigns the section a new id derived from its URL path so that the section can be linked to throughout the document. Additionally, all non-external links are re-created to point to the aforementioned id.
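The id and link handling described above could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual code: the names `SITE_HOST`, `section_id_from_url` and `rewrite_href` are hypothetical, and the exact id format the script produces is an assumption.

```python
import re
from urllib.parse import urlparse

# Assumption: links to this host (or with no host at all) count as internal.
SITE_HOST = "www.gettysburg.edu"


def section_id_from_url(url: str) -> str:
    """Build an id from the URL path, e.g. /a/b/ becomes "a-b" (hypothetical format)."""
    path = urlparse(url).path.strip("/").lower()
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", path).strip("-")
    return slug or "home"


def rewrite_href(href: str) -> str:
    """Point internal links at the in-document id; leave everything else untouched."""
    parsed = urlparse(href)
    if href.startswith("#") or parsed.scheme not in ("", "http", "https"):
        return href  # in-page anchors, mailto:, tel:, and similar
    if parsed.netloc and parsed.netloc != SITE_HOST:
        return href  # external link
    return "#" + section_id_from_url(href)
```

With this sketch, a relative link such as `/academic-programs/biology/` would become `#academic-programs-biology`, while a link to another host is returned unchanged.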

Pages are added to the final HTML file in the order they are listed in the csv.
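As an illustration of the ordering rule, the URLs can be read with the standard `csv` module, which yields rows in file order. The CSV format is not documented here, so the `url` column name below is purely an assumption, and `urls_in_csv_order` is a hypothetical helper.

```python
import csv
import io
from typing import List


def urls_in_csv_order(csv_text: str, column: str = "url") -> List[str]:
    """Return the non-empty values of one column, in the order the rows appear."""
    reader = csv.DictReader(io.StringIO(csv_text))
    urls = []
    for row in reader:
        # Missing cells come back as None, so fall back to an empty string.
        url = (row.get(column) or "").strip()
        if url:
            urls.append(url)
    return urls


# Hypothetical sample data; the real file's columns may differ.
sample = "url\nhttps://www.gettysburg.edu/b/\nhttps://www.gettysburg.edu/a/\n"
ordered = urls_in_csv_order(sample)
```

Because the list is never sorted, iterating over `ordered` and appending each page to the output keeps the CSV order.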

About Gettysburg Course Catalog Project

Snapshot script to crawl the relevant pages of the Gettysburg College website to fetch the Course Catalog content and generate a single HTML file with structured headings.

Main sections:

  • Title and date generated
  • Table of contents
  • 1 - Academic Policies
  • 2 - Admissions Policies
  • 3 - Financial Policies
  • 4 - Degree Requirements
  • 5 - Programs of Study (54 programs, including majors and minors)
  • 6 - Faculty Registry

The output is designed to be used as the source for a Word or InDesign document from which an accessible PDF is generated.

Reference documents:

Getting started

The codebase

  1. The codebase of the package is located in src/catalog_engine/.
  2. The codebase consists of the following files:
    • scraper.py: Scraper class that only crawls the pages listed in page.csv and stores them in data structures, i.e., a list of soups and a mapping from each CSV line to its soup. This class is finished and needs no further changes (for now).

    • extractor.py: Extractor classes that take the data structures produced by the Scraper and organize and extract them into key-value dictionaries and lists, following the template model that Adrian gave us. (A JSON file can easily be generated from this data structure, too!)

    • generator.py: Generator class that only generates the output HTML file. So far it has generators for every subject output (subjects, major-minor, courses, program) but not for the curriculum and policies.

    • explorer.py: Code for communicating with the courses API.
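To make the extractor-to-generator hand-off concrete, here is a small sketch of how a list of key-value records could be turned into both JSON and HTML. It does not use the project's classes: the record keys `id`, `title` and `body` and the functions `to_json` and `to_html` are hypothetical, and the real template model may look quite different.

```python
import json
from html import escape


def to_json(records):
    """Serialize the extracted records; plain dicts and lists map directly to JSON."""
    return json.dumps(records, indent=2)


def to_html(records):
    """Render each record as a section with a heading (hypothetical layout)."""
    parts = []
    for record in records:
        parts.append('<section id="{}">'.format(escape(record["id"], quote=True)))
        parts.append("<h2>{}</h2>".format(escape(record["title"])))
        parts.append("<p>{}</p>".format(escape(record["body"])))
        parts.append("</section>")
    return "\n".join(parts)


# Hypothetical record, standing in for what an extractor might produce.
records = [{"id": "biology", "title": "Biology & Life", "body": "Intro text"}]
html_out = to_html(records)
json_out = to_json(records)
```

Keeping the intermediate data as plain dictionaries and lists is what lets one extraction step feed both outputs.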

Data Structure

  1. Courses JSON Data

(To be implemented)

  2. Policies JSON Data

(To be implemented)

How to run the package

  1. Requirements:

  2. Run this in bash

# First change directory to the src folder
cd src

# Install related python library
pip3 install -r ./requirements.txt

# Run the main file from the package
python3 ./mainv2.py
  • Note: You can use pip instead of pip3 and python instead of python3 if those are the command names on your system.
  3. Optionally, generate the docs:
doxygen Doxyfile

Note that after the main file is executed, two output files are produced in the output folder: output_official.json and output_official.html.
