Data Scraping all years (unam-data, OPEN)

lopezpedres avatar lopezpedres commented on September 13, 2024
Data Scraping all years

from unam-data.

Comments (3)

mate-h avatar mate-h commented on September 13, 2024

Data sources from 1996 to present, HTML format:

https://web.archive.org/web/19970329111029/http://www.dgae.unam.mx/
https://www.dgae.unam.mx/admision/
https://web.archive.org/web/*/http://www.dgae.unam.mx*

Archival statistical agenda data source from 1959 to present, PDF format:
http://agendas.planeacion.unam.mx/

mate-h avatar mate-h commented on September 13, 2024

Collected all the available links from the Web Archive here:
https://storage.googleapis.com/mate-h.appspot.com/archive-links.csv

mate-h avatar mate-h commented on September 13, 2024

Need to gather all of the working links. Link example:

https://www.dgae.unam.mx/Licenciatura2021/resultados/1/10400035.html

Schematic link:

https://www.dgae.unam.mx/:term/resultados/:areaId/:facultyId:buildingId.html
https://servicios.dgae.unam.mx/:term/resultados/:areaId/:facultyId:buildingId.html
  1. Scrape all of the identifiers for:
     • Terms
     • Areas
     • Faculties
     • Buildings
  2. Construct all of the possible URLs using the IDs and the link schema above. The link schema version depends on the term.
  3. Download them one by one using lynx and discard the documents that did not resolve. Take all of the tags into account (special link in 1996, Archive links, servicios subdomain, etc.).
  4. Write scripts to process that data.
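
The URL construction step could be sketched roughly as below. This is only an illustration, not the project's script. The function name `build_urls`, the `host_for_term` mapping, and the sample ID values are all hypothetical, and the thread does not say which terms use the `servicios` host. The split of `10400035` into a faculty part and a building part is also a guess made only to reproduce the example link.

```python
from itertools import product

# The two URL shapes quoted in the thread. Which terms use which host is
# NOT stated there, so the mapping is left to the caller.
SCHEMAS = {
    "www": "https://www.dgae.unam.mx/{term}/resultados/{area_id}/{faculty_id}{building_id}.html",
    "servicios": "https://servicios.dgae.unam.mx/{term}/resultados/{area_id}/{faculty_id}{building_id}.html",
}


def build_urls(terms, area_ids, faculty_ids, building_ids, host_for_term=None):
    """Yield one candidate URL per (term, area, faculty, building) combination.

    host_for_term is a hypothetical dict mapping a term to "www" or
    "servicios"; terms missing from it fall back to "www".
    """
    host_for_term = host_for_term or {}
    for term, area_id, faculty_id, building_id in product(
        terms, area_ids, faculty_ids, building_ids
    ):
        schema = SCHEMAS[host_for_term.get(term, "www")]
        yield schema.format(
            term=term,
            area_id=area_id,
            faculty_id=faculty_id,
            building_id=building_id,
        )


if __name__ == "__main__":
    # Assumed split of "10400035" into "104" + "00035", for illustration only.
    for url in build_urls(["Licenciatura2021"], ["1"], ["104"], ["00035"]):
        print(url)
```

Most of the generated combinations will not exist, so the resulting list is only a set of candidates to feed into the download step, where the unresolved ones are discarded.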
