dagshub / fds

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

Home Page: http://fastds.io

License: MIT License

Dockerfile 0.51% Python 99.49%
data-science dvc git

fds's Introduction

DagsHub Client



What is DagsHub?

DagsHub is a platform where machine learning and data science teams can build, manage, and collaborate on their projects. With DagsHub you can:

  1. Version code, data, and models in one place. Use the free provided DagsHub storage or connect it to your cloud storage
  2. Track Experiments using Git, DVC or MLflow, to provide a fully reproducible environment
  3. Visualize pipelines, data, and notebooks in an interactive, diff-able, and dynamic way
  4. Label your data directly on the platform using Label Studio
  5. Share your work with your team members
  6. Stream and upload your data in an intuitive and easy way, while preserving versioning and structure.

DagsHub is built firmly around open, standard formats for your project. Therefore, you can work with DagsHub regardless of your chosen programming language or frameworks.

DagsHub Client API & CLI

This client library is meant to help you get started quickly with DagsHub. It is made up of Experiment tracking and Direct Data Access (DDA), a component to let you stream and upload your data.

For more details on the different functions of the client, check out the docs segments:

  1. Installation & Setup
  2. Data Streaming
  3. Data Upload
  4. Experiment Tracking
    1. Autologging
  5. Data Engine

Some functionality is supported only in Python.

To read about some of the awesome use cases for Direct Data Access, check out the relevant doc page.

Installation

pip install dagshub

Direct Data Access (DDA) functionality requires authentication, which you can easily do by running the following command in your terminal:

dagshub login

Quickstart for Data Streaming

The easiest way to start using DagsHub is via the Python Hooks method. To do this:

  1. Open your DagsHub project.
  2. Copy the following 2 lines of code into your Python code which accesses your data:
    from dagshub.streaming import install_hooks
    install_hooks()
  3. That’s it! You now have streaming access to all your project files.

🤩 Check out this Colab to see an example of Data Streaming working end to end:

Open In Colab

Next Steps

Dive into the expanded documentation to learn more about data streaming, data upload, and experiment tracking with DagsHub.


Analytics

To improve your experience, we collect analytics on client usage. If you want to disable analytics collection, set the DAGSHUB_DISABLE_ANALYTICS environment variable to any value.
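As a minimal sketch, the variable can also be set from Python instead of the shell. This assumes the client reads the variable at import time or on first use, so it is set before any dagshub import; the value "1" is an arbitrary choice.

```python
import os

# Any value disables analytics according to the text above; "1" is arbitrary.
os.environ["DAGSHUB_DISABLE_ANALYTICS"] = "1"

# Import the client only after the variable is set (assumption: the client
# checks the variable when it is imported or first used).
# import dagshub
```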

Made with 🐶 by DagsHub.

fds's People

Contributors

deanp70, dependabot[bot], guysmoilov, idonov8, indweller, martintali, mohithg, nirbarazida, pwoolvett, simonlsk


fds's Issues

Get rid of sys.exit calls in commands

It makes composition (like in fds save #34 ) very problematic, and breaks encapsulation.
The commands should just raise errors or return error codes, and the main should be the only one to call sys.exit.
Will also make unit testing much simpler.
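A minimal sketch of the proposed structure, not the actual fds code: commands raise an error, and only main converts it into an exit code. CommandError, add_command, and save_command are hypothetical names used for illustration.

```python
import sys


class CommandError(Exception):
    """Hypothetical error type raised by commands instead of calling sys.exit."""

    def __init__(self, message, exit_code=1):
        super().__init__(message)
        self.exit_code = exit_code


def add_command(paths):
    """Hypothetical command: raises on failure, returns normally on success."""
    if not paths:
        raise CommandError("nothing specified, nothing added", exit_code=2)
    return f"added {len(paths)} path(s)"


def save_command(paths, message):
    """Composes add_command; its errors propagate because nothing exits early."""
    result = add_command(paths)
    return f"{result}; committed with message {message!r}"


def main(argv=None):
    """The only place that turns an error into a process exit code."""
    argv = sys.argv[1:] if argv is None else argv
    try:
        print(save_command(argv, "update"))
    except CommandError as err:
        print(f"error: {err}", file=sys.stderr)
        return err.exit_code
    return 0

# The entry-point script would be the single caller of sys.exit:
#     sys.exit(main())
```

With this layout a unit test can call add_command or main directly and assert on the raised error or the returned code, without intercepting a process exit.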

Implement `fds -V` or `fds --version`

Currently, to find out which fds version I'm using, I have to run pip list, which can be problematic if fds is installed globally and I'm working in a virtual environment. It would be great if I could check which version I have with a simple command. The command could also show the git and dvc versions fds is using, but that is a nice-to-have, not a must.
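One way this request could be sketched is with argparse's built-in version action. The distribution name fastds is taken from the install report further down this page; the helper names and the output format are assumptions, not the fds implementation.

```python
import argparse
import subprocess
from importlib import metadata


def package_version(name):
    """Version of an installed distribution, or a placeholder if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def tool_version(executable):
    """Output of `<executable> --version`, or a placeholder if it cannot run."""
    try:
        done = subprocess.run(
            [executable, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return "not found"
    return done.stdout.strip()


def build_parser():
    """Parser whose -V / --version flag prints the versions and exits."""
    parser = argparse.ArgumentParser(prog="fds")
    version_text = (
        f"fds {package_version('fastds')}; "
        f"git: {tool_version('git')}; dvc: {tool_version('dvc')}"
    )
    parser.add_argument("-V", "--version", action="version", version=version_text)
    return parser
```

Because the version is read from the environment that is running the parser, the answer reflects the active virtual environment rather than a global install.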

Fails to add files to DVC tracking

When running the fds add command for data files it tries to add them to DVC tracking but fails.

In my case I tried to add the raw-data directory that contains the following image files:

$ tree data/raw-data
data/raw-data
β”œβ”€β”€ IM-0001-0001.jpeg
β”œβ”€β”€ IM-0003-0001.jpeg
β”œβ”€β”€ IM-0005-0001.jpeg
β”œβ”€β”€ IM-0006-0001.jpeg
β”œβ”€β”€ IM-0007-0001.jpeg
β”œβ”€β”€ IM-0009-0001.jpeg
β”œβ”€β”€ IM-0010-0001.jpeg
β”œβ”€β”€ IM-0011-0001-0001.jpeg
β”œβ”€β”€ IM-0011-0001-0002.jpeg
β”œβ”€β”€ IM-0011-0001.jpeg
β”œβ”€β”€ IM-0013-0001.jpeg
β”œβ”€β”€ IM-0015-0001.jpeg
β”œβ”€β”€ IM-0016-0001.jpeg
β”œβ”€β”€ IM-0017-0001.jpeg
....

But fds failed to execute the add command:

$ fds add data/raw-data
========== Make your selection, Press "h" for help ==========

DVC add failed to execute

fds clone for non-DVC repos throws an error

When using fds clone on a non-DVC repo, it throws the following error:

ERROR: you are not inside of a DVC repository (checked up to mount point '/')

Cloning a non-DVC repo using FDS can be a common use case, e.g., cloning a DAGsHub repo that contains many files, none of which are tracked by DVC, and that contains no DVC config files.

I suggest that after cloning from the Git server, FDS check whether the repo contains DVC files.

If it contains DVC files:

  • echo 'Starting DVC Clone...'
  • FDS will start a wizard to set the user name and password for each remote storage in the local config. (consider checking if they are set in the global config file first?)
  • FDS will pull all the files from the remotes and show a progress bar (might be reasonable to ask if the user wants to pull the files from each remote)

If it doesn't contain DVC files:

  • FDS will initialize DVC

    if the Git server URL is DAGsHub's:

    • FDS will set DAGsHub storage as the remote using the Git URL (replacing .git with .dvc).
    • FDS will start a wizard to set the remote user name, password, and name.

    else:

    • FDS will start a wizard asking whether the user wants to set a DVC remote
      if yes:
      • With the wizard, the user will set the remote URL, name, username, and password.
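The decision flow suggested above can be sketched as a small planning function. This is an illustration under assumptions, not fds code: the function names, the remote name origin, the dagshub.com host check, and the step descriptions are all hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def has_dvc_files(repo_dir):
    """True if the clone has a .dvc directory or any *.dvc file."""
    repo = Path(repo_dir)
    return (repo / ".dvc").is_dir() or any(repo.rglob("*.dvc"))


def is_dagshub_url(git_url):
    """Assumption: DAGsHub repositories are served from dagshub.com."""
    host = urlparse(git_url).hostname or ""
    return host == "dagshub.com" or host.endswith(".dagshub.com")


def dagshub_dvc_remote(git_url):
    """Derive the storage URL by replacing a trailing .git with .dvc."""
    base = git_url[: -len(".git")] if git_url.endswith(".git") else git_url
    return base + ".dvc"


def plan_clone(git_url, repo_dir):
    """Return the follow-up steps to run after the Git clone has finished."""
    if has_dvc_files(repo_dir):
        return ["prompt for credentials of each remote", "dvc pull"]
    steps = ["dvc init"]
    if is_dagshub_url(git_url):
        steps.append(f"dvc remote add origin {dagshub_dvc_remote(git_url)}")
        steps.append("prompt for remote user name and password")
    else:
        steps.append("ask whether to set a DVC remote")
    return steps
```

Returning the plan as data keeps the branching logic testable without touching the network or running dvc.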

Detect missing authentication for dvc

When running fds commands like fds pull and fds clone that involve dvc pull or clone, we can check if the chosen DVC remote has any authentication configured - and if not, prompt the user to authenticate before running the command.
At least until iterative/dvc#5677 is solved.

Run tests on PRs

Right now, there's no indication whether tests are broken before merging to main

Check latest version on PyPI and suggest to upgrade

On any command of fds, it should check if a new version is available and either display a message clearly suggesting to upgrade, or actually ask whether to upgrade interactively.
Not sure if the interactive upgrade is actually possible though.
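A rough sketch of the check, assuming the PyPI JSON endpoint (https://pypi.org/pypi/<package>/json) is used and that versions are plain dotted integers. The function names and the message wording are hypothetical, and the comparison would need a real version parser to handle pre-release tags.

```python
import json
from urllib.request import urlopen


def parse_version(text):
    """Turn '1.2.3' into (1, 2, 3); only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split("."))


def fetch_latest_version(package, timeout=3.0):
    """Ask PyPI for the latest released version of a package."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)["info"]["version"]


def upgrade_hint(installed, latest, package="fastds"):
    """Return a message suggesting an upgrade, or None if already up to date."""
    if parse_version(latest) > parse_version(installed):
        return (
            f"A newer version of {package} is available "
            f"({installed} -> {latest}). Run: pip install --upgrade {package}"
        )
    return None
```

Keeping the network call separate from the comparison means a slow or unreachable PyPI can be caught around fetch_latest_version without blocking the requested command.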

fds init

I successfully installed fastds with pip, but whenever I run fds init, it prints a traceback that ends with this import error:
ImportError: cannot import name '__version__' from 'fds.__init__' < c:\users\<MyComputerName>\pneumonia-detection\.venv\lib\site-packages\fds\__init__.py>

Improve docker instructions

Running fds through Docker is not very clear; the docs need clearer instructions for users.

Support -m flag in fds commit

Either instead of the current scheme of fds commit "message" or in addition to it - otherwise, it's confusing for git users.
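One possible shape for accepting both forms, sketched with argparse. The parser layout and the helper name resolve_message are assumptions for illustration only.

```python
import argparse


def build_commit_parser():
    """Accept the message positionally (current scheme) or via -m (git style)."""
    parser = argparse.ArgumentParser(prog="fds commit")
    parser.add_argument("message", nargs="?", help="commit message (positional form)")
    parser.add_argument(
        "-m", "--message", dest="message_flag", help="commit message (git-style flag)"
    )
    return parser


def resolve_message(argv):
    """Return the commit message, rejecting missing or doubly given messages."""
    parser = build_commit_parser()
    args = parser.parse_args(argv)
    if args.message and args.message_flag:
        parser.error("give the message either positionally or with -m, not both")
    message = args.message_flag or args.message
    if not message:
        parser.error("a commit message is required")
    return message
```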

Error in command description for fds clone

The command description for fds clone is:

clone git repository and pull dvc repository based on dvc.yaml

The command doesn't actually need the dvc.yaml for anything, and it's in fact not using it.
The description should state that it's based on the tracked dvc config file.

fds fails to pull dvc on windows

When running the pull command against a DagsHub remote, I get a dvc pull failure, so I have to run dvc pull again manually.

This happens consistently on Windows.

fds clone <remote> 

It is not an urgent issue, but it is an annoyance.

"fds forget" feature proposal

Scenario: You accidentally git add'ed or dvc add'ed a path that you didn't intend to.

It's a commonly googled question: https://stackoverflow.com/questions/1274057/how-to-make-git-forget-about-a-file-that-was-tracked-but-is-now-in-gitignore

What fds forget can add:

  1. Easier naming - no more googling required
  2. Automatically detect whether the file is tracked by git or DVC
  3. Remove the file from DVC cache if it is tracked by DVC (after confirmation from the user)
  4. Remove the relevant .dvc file if it exists, and also make git forget about that file
  5. More?
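A minimal sketch of points 2 and 4, assuming a path added with dvc add is tracked through a sibling <path>.dvc file. The function names are hypothetical; cache cleanup and user confirmation (point 3) are left out.

```python
import subprocess
from pathlib import Path


def is_tracked_by_git(path):
    """`git ls-files --error-unmatch` exits non-zero for untracked paths."""
    done = subprocess.run(
        ["git", "ls-files", "--error-unmatch", path], capture_output=True
    )
    return done.returncode == 0


def is_tracked_by_dvc(path):
    """Assumption: a path added with `dvc add` has a sibling `<path>.dvc` file."""
    return Path(f"{path}.dvc").is_file()


def forget_commands(path, tracked_by_git, tracked_by_dvc):
    """Build the commands that would make git or DVC stop tracking the path."""
    commands = []
    if tracked_by_dvc:
        # `dvc remove` deletes the .dvc file and its .gitignore entry.
        commands.append(["dvc", "remove", f"{path}.dvc"])
    if tracked_by_git:
        # `git rm --cached` drops the path from the index but keeps it on disk.
        commands.append(["git", "rm", "-r", "--cached", path])
    return commands
```

Building the command list first makes it easy to show the user what would run before asking for confirmation.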

Extension-based logic for `fds add`

It's quite common for our use case that files which should go into dvc have the same extension (and that extension should not go into git according to our guidelines). For example: HDF5, pickle, etc.

Is that something you'd like to incorporate? eg via

  • a value in fds.domain.constants
  • a configuration file in $REPO_ROOT, or ~/.config/fds
  • env var
  • a cli flag

(or any combination of them, maybe overriding defaults given a preference order)
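To make the preference order concrete, here is a sketch in which a CLI flag overrides an environment variable, which overrides a config file value, which overrides built-in defaults. The environment variable name FDS_DVC_EXTENSIONS, the default extension set, and the function names are all hypothetical; fds does not necessarily have any of them.

```python
import os
from pathlib import Path

# Hypothetical defaults; a real list would come from the project's guidelines.
DEFAULT_DVC_EXTENSIONS = frozenset({".h5", ".hdf5", ".pkl", ".pickle"})


def normalize(ext):
    """Lower-case an extension and make sure it starts with a dot."""
    ext = ext.strip().lower()
    return ext if ext.startswith(".") else "." + ext


def resolve_dvc_extensions(cli_value=None, env=None, config_value=None):
    """Pick the first configured source: CLI flag, env var, config file, defaults."""
    env = os.environ if env is None else env
    for raw in (cli_value, env.get("FDS_DVC_EXTENSIONS"), config_value):
        if raw:
            return frozenset(normalize(e) for e in raw.split(",") if e.strip())
    return DEFAULT_DVC_EXTENSIONS


def suggested_tracker(path, dvc_extensions):
    """Suggest 'dvc' for files with a listed extension and 'git' otherwise."""
    return "dvc" if Path(path).suffix.lower() in dvc_extensions else "git"
```

A rule like this would complement the existing size threshold rather than replace it, since a small pickle file would otherwise default to git.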

If it is, I'd like to take a shot at it, starting from here https://github.com/DAGsHub/fds/search?q=MAX_THRESHOLD_SIZE

While we're at it, is a configuration file on your horizon somewhere?

fds not working on Mac with whitespaces in the path

The path to my venv is /Users/deanpleban/Project Talos/Code/fds/env/bin/fds and whenever I run any fds command I get:

/Users/deanpleban/Project Talos/Code/fds/env/bin/fds: bad interpreter: "/Users/deanpleban/Project: no such file or directory

I tried googling around, but so far I can't find an explanation for how to solve this.

Documentation request: using fds with git repo already using dvc

It's not clear from the readme, or the blog posting, how to start using fds with a git repo already using dvc.

I can guess that all I need to do is install the fds python package, and I use pipenv, so that might work, but it isn't clear. I know "make a copy of your repo and try it out!" but how many experiments until I recreate the documentation that is so badly needed?

Automatically trigger fds init

To make users' lives easier, it makes sense to trigger fds init on any invocation of fds which doesn't find an existing git or dvc repo.

It should probably ask the user whether to trigger it or not.

Also, it needs to handle the edge case that DVC isn't installed, so fds should first ask whether to install DVC, and then also ask to run fds init immediately after that, and then run the original requested command.
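The detection part could look roughly like the sketch below, which walks up from the working directory looking for .git and .dvc and lists what still has to happen before the requested command. The function names and step labels are hypothetical, and the interactive prompts are omitted.

```python
import shutil
from pathlib import Path


def find_repo_root(start, marker):
    """Walk up from start and return the first directory containing marker."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    return None


def pending_setup_steps(start=".", dvc_installed=None):
    """List the setup steps to offer before running the requested command."""
    if dvc_installed is None:
        dvc_installed = shutil.which("dvc") is not None
    steps = []
    if not dvc_installed:
        steps.append("install dvc")
    git_root = find_repo_root(start, ".git")
    dvc_root = find_repo_root(start, ".dvc")
    if git_root is None or dvc_root is None:
        steps.append("fds init")
    return steps
```

An empty list means the original command can run immediately; otherwise each step would be confirmed with the user in order.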

Add "skip" option to fds add wizard

It should not add to dvc, or git, or to ignore files.
This might be complicated to implement since it requires breaking down the git add which we do in the final step of fds add.
But it is more intuitive to users.
Requested by @nirbarazida

Print correct message when running fds init in an existing repo

Right now, if you run fds init inside an existing Git repo you get the output:

Git repo initialized successfully
DVC repo initialized successfully

This might look like we overwrite the existing .git or .dvc dirs, and instead it would be best to say that nothing was done since it exists already.
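A small sketch of the suggested behaviour: check for the existing directories first and report accordingly. The function name and the wording of the "already exists" message are assumptions; the success messages are the ones quoted above.

```python
from pathlib import Path


def init_report(repo_dir):
    """Return one status line each for Git and DVC, based on what already exists."""
    lines = []
    for label, marker in (("Git", ".git"), ("DVC", ".dvc")):
        if (Path(repo_dir) / marker).exists():
            lines.append(f"{label} repo already exists, nothing was changed")
        else:
            # A real implementation would run `git init` or `dvc init` here.
            lines.append(f"{label} repo initialized successfully")
    return lines
```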

Auto update FDS only tries with `pip` and not `pip3`

fds asks if I want to update to the latest version, but when I type y to perform the update, I get thrown a bad interpreter error, due to it using pip which doesn't work on my system for a few reasons.

It would be great if fds could check whether pip or pip3 work for the update, and use whichever one works.
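An alternative to probing pip and pip3 is to invoke pip as a module of the interpreter that is already running fds, which avoids relying on either launcher script. This is a different technique from the one requested, shown as a sketch; the function name is hypothetical and the package name fastds is taken from the install report on this page.

```python
import sys


def upgrade_command(package="fastds"):
    """Build an upgrade command that uses the current interpreter's own pip."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]

# The command could then be executed with, for example:
#     subprocess.run(upgrade_command(), check=True)
```

Since sys.executable points at the interpreter of the active environment, the upgrade lands in the same environment that fds was started from.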

`fds commit` doesn't really work

dvc commit fails when there is something to commit, since DVC in that case shows an interactive UI asking the user to approve any changed file. We're currently adding the -q flag which just returns 1 and fails.
