Comments (9)

RandyGuck commented on August 26, 2024

Sorry for the late reply--just got back from vacation. You're right that the CSVLoader only handles comma separators, and it requires a first row with column names. But you can set the application name via the "-app " parameter.

Since you're not explicitly defining any fields, every field will be loaded as text, so every value should be accepted. However, field (column) names must follow identifier rules: first character must be a letter; all other characters must be letters or digits or underscores; names are case-sensitive.
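The identifier rules above can be checked before loading. The following is a minimal sketch, not part of CSVLoader: the function name `invalid_field_names` is hypothetical, and it assumes "letters" means ASCII letters only.

```python
import re

# Assumption: "letter" means ASCII A-Z / a-z; the thread does not say
# whether non-ASCII letters are accepted.
FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def invalid_field_names(header_row):
    """Return the column names that break the identifier rules.

    Rules from the comment above: the first character must be a letter,
    and all remaining characters must be letters, digits, or underscores.
    """
    return [name for name in header_row if not FIELD_NAME.fullmatch(name)]


header = ["Name", "zip_code", "2nd_line", "e-mail", "_id"]
print(invalid_field_names(header))
```

Running a check like this on the first row of the CSV file would show which columns are likely to trigger the "Invalid field name" error.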

If that doesn't explain what's happening, post a few lines of a typical input file and I'll see if I can debug it.

from doradus.

dzlab commented on August 26, 2024

Thanks for the reply, but what about the misleading log message "CSVLoader: ...loaded 10000 records." when nothing was actually uploaded because of the "Invalid field name: ..." error (which should be logged)?

Also, I'm seeing really bad performance (several minutes, I think around 30 min) when ingesting a dataset of 0.3M docs, each with 93 columns (no links, only scalar fields). I split the data into many batches of 10k docs and use finagle-http (I have a Scala client) to send the requests (JSON). Queries are also slow: a distinct aggregate on one attribute takes around 40 (both Cassandra and the Doradus server run on a MacBook).

p.s. Hope you had great time.

RandyGuck commented on August 26, 2024

I just fixed one issue with the CSVLoader: when one object in the batch fails, the overall batch status is set to "warning", and warning-only batches weren't getting reported. If there's an invalid field name or value, you should see CSVLoader report these now.

The misleading progress reporting (...loaded xxx records) is also fixable with more work. The problem is that this log message is generated by the main thread as it parses and queues reports for worker threads. At the time of reporting, it doesn't know how many records actually succeeded. But this could be changed to something like "...queued xxx records, yyy loaded successfully." I'll take a look at that next week.
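To illustrate the queued-versus-loaded distinction described above, here is a small sketch. It is not the CSVLoader code: `load_batches` and `fake_load` are hypothetical names, and the worker simply counts how many records a caller-supplied function accepts.

```python
import queue
import threading


def load_batches(batches, load_one, workers=2):
    """Queue batches on the main thread; workers count real successes.

    Returns (queued, loaded) so a progress message can report both.
    """
    q = queue.Queue()
    lock = threading.Lock()
    counts = {"queued": 0, "loaded": 0}

    def worker():
        while True:
            batch = q.get()
            if batch is None:  # sentinel: no more work
                break
            ok = sum(1 for record in batch if load_one(record))
            with lock:
                counts["loaded"] += ok

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for batch in batches:
        q.put(batch)
        counts["queued"] += len(batch)  # known immediately by the main thread
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()  # "loaded" is only final once the workers finish
    return counts["queued"], counts["loaded"]


def fake_load(record):
    # Hypothetical loader: rejects records with a space in any field name.
    return not any(" " in name for name in record)


queued, loaded = load_batches([[{"a": 1}, {"b c": 2}], [{"d": 3}]], fake_load)
print(f"...queued {queued} records, {loaded} loaded successfully")
```

The point of the sketch is that the queued count is available at parse time, while the success count has to be reported back by the workers.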

dzlab commented on August 26, 2024

Will these changes be available in the next release?
What about the second part of my previous comment, on how Doradus is performing?

RandyGuck commented on August 26, 2024

These changes are in the master branch, so you can download and build it if you like. Otherwise, they will be in the next release; however, we just created the v4.2 release and probably won't create another one for a while.

As for the performance issues: Spider databases are OK for moderate data volumes (millions of objects, but not billions) and moderate query requirements. Spider uses fully inverted indexing, so update performance is proportional to the number of fields indexed. Text fields generate the most mutations since they are parsed into terms. What kind of load rate (objs/sec) are you seeing? The queries that Spider is best at are "needle in the haystack" queries such as finding objects where a field contains some term or falls in some range. Aggregate queries (COUNT(*)) will be the slowest queries. If you send me a sample query, I can take a look.
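A toy inverted index shows why the number of mutations grows with the number of fields and with text length. This is only an illustrative sketch, not Spider's actual storage layout: `index_object` is a hypothetical helper, and "one index entry per distinct term" is an assumption made for the example.

```python
import re
from collections import defaultdict


def index_object(index, obj_id, fields):
    """Add one object to a toy inverted index.

    Returns the number of index entries (mutations) written.
    """
    mutations = 0
    for name, value in fields.items():
        # Each value is parsed into lowercase terms; every distinct
        # term becomes its own (field, term) -> object-id entry.
        terms = set(re.findall(r"\w+", str(value).lower()))
        for term in terms:
            index[(name, term)].add(obj_id)
            mutations += 1
    return mutations


index = defaultdict(set)
small = index_object(index, "doc1", {"status": "open"})
large = index_object(
    index, "doc2", {"status": "open", "body": "the quick brown fox jumps"}
)
print(small, large)
print(sorted(index[("status", "open")]))
```

A lookup such as "objects where status contains open" is then a single key read, which is the "needle in the haystack" case; a wide object with many fields pays for that at write time.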

When high-performance loading and fast aggregate queries are important, OLAP is much better, sometimes by several orders of magnitude. It uses columnar compression and no indexes, so updates generate far fewer mutations, hence load rates are much higher. In queries, OLAP can scan millions of objects per second. OLAP of course requires that data can be organized into shards.

If your data is time-stamped, immutable, and doesn't require links, the new Logging service is even faster. It doesn't require shards, and it loads and queries data even faster than OLAP. I can point out some links for more information if you like.

dzlab commented on August 26, 2024

My data is timestamped and immutable. I have a set of dimensions and metrics on which I want to do analytics (an OLAP workload). I have a data set of around 0.3M rows with 93 columns each. I submit batches of 10K JSON documents, and each request takes around 33.5s (mean). A distinct query GET http://localhost:1123/app_name/table_name/_aggregate?m=DISTINCT(field_name) takes around.
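For reference, the figures above translate into a load rate (objs/sec) as follows. This is a back-of-the-envelope sketch that assumes the batches are sent one after another and that every batch takes the mean time; the variable names are only for illustration.

```python
docs_total = 300_000     # "around 0.3M rows"
batch_size = 10_000      # "batches of 10K JSON documents"
secs_per_batch = 33.5    # mean request time reported above

# Load rate in objects per second for a single batch.
rate = batch_size / secs_per_batch

# Total time if all batches are sent sequentially (assumption).
batches = docs_total // batch_size
total_minutes = batches * secs_per_batch / 60

print(f"{rate:.1f} objs/sec, {batches} batches, {total_minutes:.2f} min")
```

Under those assumptions the rate is just under 300 objs/sec and the whole load takes roughly 17 minutes.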

I've explicitly set the storage service option to OLAPService in the app schema that I submit to Doradus. But when I check my app schema on Doradus (i.e. GET http://localhost:1123/_applications), I see the storage service set to SpiderService! I don't know why (maybe the server is not started with the OLAPService enabled; what's the default behaviour?), but this is definitely why inserts and queries are so slow.

What about the Logging service? It's not mentioned in the wiki. How do I use it?

RandyGuck commented on August 26, 2024

It sounds like the OLAP or Logging service would work much better for you. The Logging service is brand new and I'm still working on wiki pages/tutorials, but there is a PDF document for it located here: https://github.com/dell-oss/Doradus/blob/master/docs/Doradus%20Logging%20Database.pdf

dzlab commented on August 26, 2024

I'm trying to understand how Doradus stores its data in Cassandra. It looks like it creates a single SSTable with few row ids (36); I expected around 0.3M, as this is the size of my dataset. Also, it's not using any memtable, nor does it use bloom filters! I wonder how queries/aggregations can be fast then.

$ /bin/nodetool --host localhost cfstats
Keyspace: Doradus
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 0
    Write Latency: NaN ms.
    Pending Flushes: 0
        Table: Applications
        SSTable count: 1
        Space used (live): 8164
        Space used (total): 8164
        Space used by snapshots (total): 0
        Off heap memory used (total): 23
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 1
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 16
        Bloom filter off heap memory used: 8
        Index summary off heap memory used: 15
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 3312
        Compacted partition maximum bytes: 3973
        Compacted partition mean bytes: 3973
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

        Table: OLAP
        SSTable count: 1
        Space used (live): 19850466
        Space used (total): 19850466
        Space used by snapshots (total): 0
        Off heap memory used (total): 153
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 36
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 56
        Bloom filter off heap memory used: 48
        Index summary off heap memory used: 105
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 61
        Compacted partition maximum bytes: 654949
        Compacted partition mean bytes: 571379
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

        Table: Tasks
        SSTable count: 1
        Space used (live): 5128
        Space used (total): 5128
        Space used by snapshots (total): 0
        Off heap memory used (total): 36
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 2
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 16
        Bloom filter off heap memory used: 8
        Index summary off heap memory used: 28
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 36
        Compacted partition maximum bytes: 258
        Compacted partition mean bytes: 150
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

RandyGuck commented on August 26, 2024

OLAP uses columnar storage and various compression techniques to store data very compactly, so it doesn't use much disk. Here are some links to presentations that provide a little more insight on how OLAP works:

If you download the slides, the notes on each slide provide extra info. Hope this helps!
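As a rough illustration of how compact this is, the cfstats numbers posted earlier for the OLAP table can be turned into a per-row estimate. This sketch assumes all of the roughly 0.3M rows with 93 columns ended up in that table, which the thread does not confirm.

```python
# Values copied from the "Table: OLAP" section of the cfstats output.
space_used_bytes = 19_850_466   # Space used (live)
partition_keys = 36             # Number of keys (estimate)

# Assumptions taken from the dataset description, not from cfstats.
rows = 300_000
columns = 93

bytes_per_row = space_used_bytes / rows
bytes_per_value = bytes_per_row / columns
rows_per_partition = rows / partition_keys

print(f"{bytes_per_row:.1f} bytes/row, {bytes_per_value:.2f} bytes/value")
print(f"about {rows_per_partition:.0f} rows per Cassandra partition")
```

Under that assumption the data takes roughly 66 bytes per row, well under one byte per field value, and each of the 36 Cassandra partitions holds thousands of objects rather than one row per object.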
