

gianlucaborello commented on August 23, 2024

That is mostly because the read path is not concurrent. As a simple toy tool, it was never meant to be used for such big use cases :)

Some workarounds I can give you, from "easy" to "hard":

  1. Increase the value of FETCH_SIZE in the code. It might make the requests more efficient when reading a lot of data, and you should be able to notice an improvement.

  2. Ditch this tool altogether and adopt a sounder strategy with tools that scale well (e.g. nodetool snapshot + sstableloader), which makes a lot of sense when dealing with dozens of GB, as in your case.

  3. Run multiple instances of the tool at the same time, each one focusing on a different column family or a different subset of filters, and then merge the files manually. That way the exports run in parallel.

  4. Implement proper concurrency inside the tool. In other words, instead of always doing a single SELECT * FROM foo, split the range of primary keys into many sub-queries and execute them concurrently with cassandra.concurrent.execute_concurrent() (which I'm sort of using during the import phase, since it's easier there). Cassandra will happily scale to thousands of concurrent read requests per second, whereas right now I'm issuing just one at a time.
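
To make option 4 a bit more concrete, here is a rough sketch of how the split could look. This is not code from the tool. It assumes the cluster uses the default Murmur3Partitioner (token range -2^63 to 2^63-1), and the helper names (split_token_range, build_range_query, dump_rows) as well as the n_splits and concurrency defaults are made up for illustration. It uses execute_concurrent_with_args, the sibling of execute_concurrent() in the same driver module, because every sub-query shares one prepared statement.

```python
# Assumption: Murmur3Partitioner, whose tokens span the signed 64-bit range.
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_range(n_splits):
    """Return n_splits contiguous, inclusive (start, end) token ranges."""
    if n_splits < 1:
        raise ValueError("n_splits must be >= 1")
    total = MAX_TOKEN - MIN_TOKEN + 1
    step, extra = divmod(total, n_splits)
    ranges = []
    start = MIN_TOKEN
    for i in range(n_splits):
        # Spread the remainder over the first `extra` ranges.
        size = step + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def build_range_query(keyspace, table, partition_key_columns):
    """Build a CQL query that selects one token range of a table."""
    pk = ", ".join(partition_key_columns)
    return ("SELECT * FROM {ks}.{tbl} "
            "WHERE token({pk}) >= ? AND token({pk}) <= ?"
            ).format(ks=keyspace, tbl=table, pk=pk)


def dump_rows(session, keyspace, table, partition_key_columns,
              n_splits=256, concurrency=32):
    """Yield every row of the table, reading the token ranges concurrently.

    `session` is a connected cassandra.cluster.Session. The driver is
    imported here so the helpers above work without it installed.
    """
    from cassandra.concurrent import execute_concurrent_with_args

    prepared = session.prepare(
        build_range_query(keyspace, table, partition_key_columns))
    results = execute_concurrent_with_args(
        session, prepared, split_token_range(n_splits),
        concurrency=concurrency, results_generator=True)
    for success, result in results:
        if not success:
            raise result  # on failure the driver hands back the exception
        for row in result:
            yield row
```

Something like `for row in dump_rows(session, "ks", "foo", ["id"]): ...` would then replace the single big SELECT. The number of splits and the concurrency level would need tuning per cluster, and the ranges are inclusive on both ends so that no token is skipped or read twice.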

If I were to do this professionally, I would definitely go for option 2, or if I really needed a plain text dump, I'd implement option 4. Unfortunately, I don't have time at the moment to embark on this big feature.

Thanks

from cassandradump.
