
NASA Scraper List (data-rescue-pdx, open issue)

maxogden commented on July 19, 2024
NASA Scraper List

from data-rescue-pdx.

Comments (14)

maxogden commented on July 19, 2024

https://earthdata.nasa.gov/nasa-data-policy

You may need an Earthdata login to access some of the data; registration is free.

Also, here is a list of all the FTP and HTTP servers from data.gov, which includes many NASA FTP servers: https://gist.github.com/maxogden/9885244926c1ab576287ff5047dd0e5f

znmeb commented on July 19, 2024

Working on Goddard Space Flight Center. Mr. Google sent me here:

https://daac.gsfc.nasa.gov/

And they encourage wget!!

https://disc.gsfc.nasa.gov/recipes/?q=recipes/How-to-Download-Data-Files-from-HTTP-Service-with-wget

I can code this up ... do we want to put it up on a server somewhere?

jmicrobe commented on July 19, 2024

https://genelab-data.ndc.nasa.gov/genelab/projects

A very nice database for genetic research done IN SPACE!

sckott commented on July 19, 2024

Sam and I are doing NSIDC

samavar14 commented on July 19, 2024

The Earth Sciences Level 1 and Atmosphere Archive and Distribution System (LAADS) DAAC has archived all of its data on both FTP and HTTP sites:
ftp://ladsweb.modaps.eosdis.nasa.gov
https://ladsweb.nascom.nasa.gov/archive

A useful README describing the data and how to access it is here:
https://ladsweb.nascom.nasa.gov/archive/README

samavar14 commented on July 19, 2024

Actually, it looks like all the DAACs' data is contained in the Common Metadata Repository (CMR):
https://wiki.earthdata.nasa.gov/display/CMR/CMR+Client+Partner+User+Guide

Based on this, would we only need one scraper to pull all the data from this system?

shawnbot commented on July 19, 2024

I've got dibs on crawling https://opendap.larc.nasa.gov/opendap/ 🚀

crhallberg commented on July 19, 2024

I took a look at the CMR page and started parsing the metadata provided at https://cmr.sit.earthdata.nasa.gov/search/collections.json.

I put together a script that traces the files linked there with curl and outputs their final location after redirects: https://gist.github.com/crhallberg/eebc86dd74ec36e9f2f522ac1559cb7b

That's just the bare-bones version. I also have one that does a lot more (saves collections.json, separates files into data, webpage, and broken, has status output) if needed.
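For anyone who would rather not shell out to curl, the redirect tracing described above could be sketched in Python roughly as follows. This is not the script from the gist; `trace_final_url` and its injectable `opener` parameter are hypothetical names chosen for illustration, and it assumes the server answers HEAD requests.

```python
import urllib.error
import urllib.request


def trace_final_url(url, opener=urllib.request.urlopen, timeout=30):
    """Follow redirects for `url` and report where it ends up.

    Returns (final_url, status) on success, or (None, reason) on failure.
    `opener` is injectable so the function can be exercised offline.
    """
    # HEAD avoids downloading the file body; some servers may reject it.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            # urlopen follows redirects; geturl() is the final location.
            return response.geturl(), response.status
    except urllib.error.HTTPError as err:
        return None, err.code
    except OSError as err:  # URLError, timeouts, DNS failures
        return None, str(err)


if __name__ == "__main__":
    print(trace_final_url("https://earthdata.nasa.gov/nasa-data-policy"))
```

Returning the failure reason instead of raising keeps a long crawl going when individual links are dead.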

maxogden commented on July 19, 2024

@crhallberg awesomeness, do you have an idea of how many datasets are available under that collections endpoint? is each collection a big group of datasets? do you have an example of the metadata that your script produces?

crhallberg commented on July 19, 2024

I'm glad you asked because I'm still very new to this. There is a LOT more info here than I thought. My initial thought was that what I was parsing was an update feed. Turns out I was on page 1 of 19,590 items. I still don't know how many there are in total. A part of the documentation I just found says "You can not page past the 1 millionth item." so there is (obviously) a heck of a lot.

Do you have any examples of good metadata that I can aim for as I iterate on this?
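Walking every page of that endpoint, rather than only page 1, might look something like the sketch below. It assumes the CMR search API accepts `page_size` and `page_num` query parameters and wraps results as `feed.entry` in the JSON response; both should be checked against the CMR documentation. The function names and the exact form of the million-item guard are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://cmr.sit.earthdata.nasa.gov/search/collections.json"
MAX_ITEMS = 1_000_000  # from the docs: "You can not page past the 1 millionth item."


def page_url(page_num, page_size=100):
    """Build the URL for one page (parameter names are assumed)."""
    if page_num * page_size > MAX_ITEMS:
        raise ValueError("cannot page past the 1 millionth item")
    query = urllib.parse.urlencode({"page_size": page_size, "page_num": page_num})
    return f"{BASE_URL}?{query}"


def entries_from(payload):
    """Pull the list of collections out of one response (assumed shape)."""
    return payload.get("feed", {}).get("entry", [])


def fetch_json(url):
    """Download and decode one page of results."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_collections(fetch=fetch_json, page_size=100):
    """Yield collections page by page until an empty page comes back."""
    page_num = 1
    while True:
        entries = entries_from(fetch(page_url(page_num, page_size)))
        if not entries:
            return
        yield from entries
        page_num += 1
```

Counting the items yielded by `iter_collections()` would answer the "how many" question, within the paging limit.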

maxogden commented on July 19, 2024

@crhallberg hah! that's a lot of data :) if you wanna check out the data.gov metadata, the gold standard in my opinion, check out this guide i wrote last month https://github.com/jsonlines/guide. the main idea is you have a JSON object for each dataset, and that object has an array of resource URLs, one for each data file.
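As a rough illustration of that layout (one JSON object per line, each with an array of resource URLs), here is a small Python sketch. The field names `title` and `resources` and the example URLs are placeholders, not the actual data.gov schema; see the guide for the real conventions.

```python
import io
import json


def write_jsonl(datasets, out):
    """Write one JSON object per line to the text stream `out`."""
    for dataset in datasets:
        out.write(json.dumps(dataset, sort_keys=True) + "\n")


def read_jsonl(lines):
    """Parse an iterable of lines back into a list of objects."""
    return [json.loads(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Hypothetical records; field names are placeholders.
    datasets = [
        {"title": "Example dataset A",
         "resources": ["https://example.org/a/part1.hdf",
                       "https://example.org/a/part2.hdf"]},
        {"title": "Example dataset B",
         "resources": ["ftp://example.org/b/data.nc"]},
    ]
    buffer = io.StringIO()
    write_jsonl(datasets, buffer)
    print(buffer.getvalue(), end="")
```

Because each line is a complete object, a crawl can append records as it goes and a reader can process the file as a stream.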

nichoth commented on July 19, 2024

Is this related to the tweet https://twitter.com/denormalize/status/838550043397234691 ? I was wondering if you found a solution to the parallel ftp problem.

crhallberg commented on July 19, 2024

Update: I've identified 48,126 links. Some are invalid and some are FTP folders; I'm weeding through them now by checking headers. After I've separated the wheat links from the chaff links, I'll reconcile them with the original metadata.
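One possible way to sort links once the headers are in hand is sketched below. The category names and rules (a trailing slash means an FTP folder, HTML content types mean a webpage) are assumptions made for illustration, not the logic in the actual scraper.

```python
from urllib.parse import urlparse

HTML_TYPES = ("text/html", "application/xhtml+xml")


def classify_link(url, status=None, content_type=""):
    """Bucket a link as 'ftp_folder', 'broken', 'webpage', or 'data'.

    `status` and `content_type` come from a prior header check;
    status=None means the request failed outright.
    """
    parsed = urlparse(url)
    if parsed.scheme == "ftp":
        # Heuristic: no path or a trailing slash suggests a directory.
        is_folder = not parsed.path or parsed.path.endswith("/")
        return "ftp_folder" if is_folder else "data"
    if status is None or status >= 400:
        return "broken"
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in HTML_TYPES:
        return "webpage"
    return "data"
```

For example, `classify_link("https://example.org/x", 404)` falls in the broken bucket, while an FTP URL ending in `/` is treated as a folder to crawl further.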

I will place a link here when I have a centralized place to show and tell progress: https://github.com/crhallberg/nasa-cmr-scraper.

crhallberg commented on July 19, 2024

I wasn't sure where else to push this, so I just made a new repository: https://github.com/crhallberg/nasa-cmr-scraper
