dataset-downloader's People

Contributors

navarrepratt, rexwang8

Forkers

parallelo segyges

dataset-downloader's Issues

Issue downloading datasets

Hi! I'm taking a look at CoreWeave's LLM documentation, and I'm trying to get GPT-J fine-tuned with the example dataset.

There seems to be an issue with the dataset-downloader -- my PVC does not get populated with the dataset.

I'm currently trying to work out whether the problem is the path being passed to it or the internals of main.go. Debugging this is more involved than usual because the dataset-downloader runs in a GitHub-produced distroless container, so inspecting its state at runtime takes extra effort.

My K8s logs for the relevant pod only show this:

2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/140
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/0
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/20
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/40
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/60
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/80
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/100
2023/01/17 20:04:12 Getting book links from https://www.smashwords.com/books/category/1245/downloads/0/free/any/120

I'd expect the logs to also show something like the following (from this Printf statement):

Downloaded XYZ to /data/finetune-data/dataset/xyz

Anyway, I'm still debugging this, but just wanted to reach out and see if you have suggestions in the meantime. Thanks in advance!
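One way to separate the two suspicions above (a bad output path versus a fault inside the downloader) is to verify the output directory before any download starts and fail loudly if it is unusable. The following is a minimal sketch under stated assumptions, not the code in main.go: `checkWritable` is a hypothetical helper, and the directory used in `main` is a temporary stand-in for the PVC mount path.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// checkWritable is a hypothetical helper: it creates dir if it is missing,
// then writes and removes a small probe file to confirm the directory
// can actually be written to.
func checkWritable(dir string) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return fmt.Errorf("create %s: %w", dir, err)
	}
	probe, err := os.CreateTemp(dir, ".probe-*")
	if err != nil {
		return fmt.Errorf("write to %s: %w", dir, err)
	}
	name := probe.Name()
	probe.Close()
	return os.Remove(name)
}

func main() {
	// Assumption: a temporary stand-in directory. In the setup described
	// above, this would be the path where the PVC is mounted.
	dir := filepath.Join(os.TempDir(), "dataset")
	if err := checkWritable(dir); err != nil {
		log.Fatalf("output directory check failed: %v", err)
	}
	log.Printf("output directory %s is writable", dir)
}
```

Because the check only logs and exits, its result shows up in the pod logs, which sidesteps the need for a shell inside a distroless image.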

Clean up and reify `dataset_downloader`

We should make it more useful and less hard-coded. For example, it is currently hard-coded to Western Romance.

  • Remove hard-coded assumptions and parameterize them.
  • Parallelize the downloader using goroutines.
  • Potentially support other sites.
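The first two items in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: the flag names, `pageURLs`, and `fetchAll` are hypothetical, the URL pattern and default values (category 1245, offsets in steps of 20) are taken from the log lines quoted in the issue above, and the fetch function is a stand-in that performs no network request.

```go
package main

import (
	"flag"
	"fmt"
	"sync"
)

// pageURLs builds listing-page URLs for offsets 0, pageSize, 2*pageSize, ...
// The path layout mirrors the URLs seen in the pod logs.
func pageURLs(baseURL string, categoryID, pages, pageSize int) []string {
	urls := make([]string, 0, pages)
	for i := 0; i < pages; i++ {
		urls = append(urls, fmt.Sprintf("%s/books/category/%d/downloads/0/free/any/%d",
			baseURL, categoryID, i*pageSize))
	}
	return urls
}

// fetchAll calls fetch for every URL using at most `workers` goroutines.
// Results and errors are returned in the same order as the input URLs.
func fetchAll(urls []string, workers int, fetch func(string) (string, error)) ([]string, []error) {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(urls))
	errs := make([]error, len(urls))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handled by exactly one worker, so the
			// writes to results[i] and errs[i] do not race.
			for i := range jobs {
				results[i], errs[i] = fetch(urls[i])
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results, errs
}

func main() {
	// Hypothetical flags replacing the hard-coded values.
	baseURL := flag.String("base-url", "https://www.smashwords.com", "site root")
	category := flag.Int("category", 1245, "category ID to download from")
	pages := flag.Int("pages", 8, "number of listing pages to scan")
	pageSize := flag.Int("page-size", 20, "offset step between listing pages")
	workers := flag.Int("workers", 4, "number of concurrent workers")
	flag.Parse()

	// Stand-in fetch: a real one would issue an HTTP request here.
	fetch := func(u string) (string, error) { return "would fetch " + u, nil }

	urls := pageURLs(*baseURL, *category, *pages, *pageSize)
	results, errs := fetchAll(urls, *workers, fetch)
	for i, r := range results {
		if errs[i] != nil {
			fmt.Println("error:", errs[i])
			continue
		}
		fmt.Println(r)
	}
}
```

Passing the fetch step in as a function keeps the worker pool independent of any one site, which would also leave room for the third item, supporting other sites.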
