
Comments (8)

srikris commented on July 20, 2024

@brylie Thanks for your feedback!

We have been actively developing the SFrame for over 4 years now. There are many reasons why we like SFrame.

  • It's out-of-core, so you can work with really large datasets
  • Lazy evaluation lets you work interactively even on really large datasets
  • It's fast, and it has some awesome compression techniques that make sure you can do more with less
  • Because it is written in C++, we can support multiple languages in the future
  • It's parallel
  • Built-in visualization tools can handle a lot of data (in a streaming way, so you can see your plots right away)
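
To make the lazy-evaluation and out-of-core points above concrete, here is a minimal sketch in plain Python. It is not the SFrame API or its implementation: the LazyColumn class, the rows generator, and the stats counter are all hypothetical names made up for illustration. The idea it shows is that transformations are only recorded, and rows are read and transformed one at a time when a result is actually requested.

```python
import itertools
from typing import Callable, Iterable, Iterator, List


class LazyColumn:
    """Hypothetical lazily evaluated column (illustration only, not SFrame)."""

    def __init__(self, source: Callable[[], Iterable[float]]):
        # `source` is only called when values are actually needed.
        self._source = source
        self._ops: List[Callable[[float], float]] = []

    def apply(self, fn: Callable[[float], float]) -> "LazyColumn":
        # Record the transformation instead of running it right away.
        out = LazyColumn(self._source)
        out._ops = self._ops + [fn]
        return out

    def __iter__(self) -> Iterator[float]:
        # Stream values one by one; the full column is never held in memory.
        for value in self._source():
            for fn in self._ops:
                value = fn(value)
            yield value

    def head(self, n: int = 5) -> List[float]:
        # Only the first n rows are read and transformed.
        return list(itertools.islice(self, n))


stats = {"rows_read": 0}


def rows() -> Iterator[float]:
    # Stand-in for a large on-disk dataset (assumed: one million rows).
    for i in range(1_000_000):
        stats["rows_read"] += 1
        yield float(i)


# Nothing is read while the pipeline is being described.
col = LazyColumn(rows).apply(lambda x: x * 2).apply(lambda x: x + 1)
preview = col.head(3)
```

In this sketch, asking for a three-row preview touches only three rows of the source, which is the property that keeps interactive work responsive on large data.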

We love the pandas project, have met with its creators and contributors many times, and have great respect for it. For that reason, pandas and SFrame are fully interoperable: an SFrame converts to a pandas DataFrame through to_dataframe, and an SFrame can be constructed directly from a DataFrame.

What can help is the following:

  • Provide more clarification in the user guide on the differences between pandas and SFrame
  • Add a user guide chapter on interop between pandas and SFrame

from turicreate.

znation commented on July 20, 2024

@brylie To clarify re: SFrame project status, the SFrame codebase is still under active development; however, it's no longer developed or released as its own project. Development of SFrame has been folded into turicreate and is ongoing here in this repo.


brylie commented on July 20, 2024

Ah, it wasn't clear at first glance that SFrame is part of Turi Create. With such a tight relationship between SFrame and Turi Create, will there also be pandas support?

> We can provide more clarification with the user guide on the differences with Pandas and SFrame

At the risk of drifting off topic, what is the overlap between SFrame and pandas, and what are significant differences?

For context, I am coming from a JavaScript background where there is a lot of churn and bikeshedding. I am concerned, in general, when there is duplication among open source projects where resources are somewhat scarce. One thing I have appreciated about Python data science tools is that the ecosystem matters - meaning projects seem to build on common foundations, share consistent APIs, and attain higher levels of usability and abstraction than would otherwise be possible in a more fragmented environment.


brylie commented on July 20, 2024

Also, how does SFrame compare with the design goals and foundations of Dask? E.g.

> Dask is composed of ... "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.


srikris commented on July 20, 2024

> Ah, it wasn't clear at first glance that SFrame is part of Turi Create. With such a tight relationship between SFrame and Turi Create, will there also be pandas support?

As I mentioned, SFrame and pandas data frames are already deeply compatible. Having the APIs work with SFrames allows us to push a lot of optimizations into the model creation process that would not be possible with pandas.

> At the risk of drifting off topic, what is the overlap between SFrame and pandas, and what are significant differences?

I tried outlining the key differences above. The overlap in terms of functionality is pretty high. SFrame is pretty full-featured, and we plan to keep building on it and adding more things as we find gaps. The APIs are pretty similar, so it should be quite natural to move between them.

> Also, how does SFrame compare with the design goals and foundations of Dask? E.g.

Dask is another great project that we are aware of. It started around a year after SFrame, and at its core it is a task scheduler that helps scale up pandas using pure-Python primitives. It has some of the advantages of SFrame (parallelism, scale, interop with pandas, etc.) but not all of them (extensibility to multiple languages, lazy evaluation, compression, etc.).


brylie commented on July 20, 2024

@srikris thanks for your clarifications. It would be interesting to read a more thorough article outlining the above concepts, similarities, differences - as well as perhaps a plan to be good stewards of the data science ecosystem. Thanks for your innovative work :-)


brylie commented on July 20, 2024

I would just like to clarify that my intention here is not to diminish this project or the talent of its contributors.

For a slightly broader context, companies like Apple, Facebook, Google, and Microsoft exert a lot of influence over the developer community, with starry-eyed developers eager to try the latest offerings. In the sometimes competitive platform landscape (iOS, Android, and cloud services such as Azure, GCE, and AWS), as well as in an effort to attract talented developers, the large players often release open source platforms and frameworks to exert some leverage over the ecosystem. This is somewhat evidenced by the JavaScript frontend ecosystem, where many options with significantly overlapping purpose and design compete for adoption and mindshare. Competition comes at the cost of fragmentation and somewhat subverts efforts to standardize technologies in a vendor-neutral, cross-cutting manner (e.g. web standards).

I recognize that evolution takes diversity and redundancy. I am just concerned about varying motives (both pragmatic and competitive), and for the overall health and cooperation of the open source community.


brylie commented on July 20, 2024

How does SFrame relate to Apache Arrow? Might there be some parallel goals between Arrow and SFrame that might serve as a broader foundation for Turi Create? E.g.

> The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead. It is also focused on supporting a wide variety of industry-standard programming languages. Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto standard for columnar in-memory analytics.
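
For readers unfamiliar with the "zero-copy reads" mentioned in the quote, the following is a rough illustration using only the Python standard library. It does not use Arrow or SFrame and says nothing about how either is implemented; the column buffer and variable names are assumptions made up for the example. It only shows the general difference between viewing a contiguous buffer in place and copying out of it.

```python
import array

# A contiguous buffer of 64-bit floats, standing in for one in-memory column.
column = array.array("d", [1.0, 2.0, 3.0, 4.0])

# A memoryview exposes the same memory without copying it,
# and slicing the view is also copy-free.
view = memoryview(column)
window = view[1:3]

# Mutating the underlying buffer is visible through the view,
# which shows that no copy was made.
column[1] = 20.0
first_in_window = window[0]

# Slicing the array itself, by contrast, produces an independent copy.
copied = column[1:3]
column[2] = 30.0
second_in_window = window[1]   # follows the buffer
second_in_copy = copied[1]     # keeps the old value
```

A shared columnar memory format extends this idea across libraries and languages, so that consumers can read the same buffers rather than serializing and deserializing them.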

