Comments (9)
Sorry for the late reply--just got back from vacation. You're right that the CSVLoader only handles comma separators, and it requires a first row with column names. But you can set the application name via the "-app " parameter.
Since you're not explicitly defining any fields, every field will be loaded as text, so every value should be accepted. However, field (column) names must follow identifier rules: first character must be a letter; all other characters must be letters or digits or underscores; names are case-sensitive.
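A quick way to check a file's header row against these naming rules before running the loader is sketched below. This is only an illustration, not CSVLoader code: the function names are made up for the example, and it assumes "letter" means an ASCII letter, which the rules above don't spell out.

```python
import csv
import io
import re

# Assumption: "letter" means an ASCII letter. The rules in this thread do not
# say whether non-ASCII letters are accepted.
FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def invalid_field_names(header):
    """Return the header names that break the identifier rules."""
    return [name for name in header if not FIELD_NAME.fullmatch(name)]


def check_csv_header(text):
    """Parse the first row of comma-separated text and report bad names."""
    header = next(csv.reader(io.StringIO(text)))
    return invalid_field_names(header)


if __name__ == "__main__":
    sample = "id,user-name,City\n1,alice,Paris\n"
    print(check_csv_header(sample))
```

Names are compared as-is, so "City" and "city" count as two different (and both valid) field names.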
If that doesn't explain what's happening, post a few lines of a typical input file and I'll see if I can debug it.
from doradus.
Thanks for the reply, but what about the misleading log "CSVLoader: ...loaded 10000 records." when nothing was actually uploaded because of the "Invalid field name: ..." error (which should be logged)?
Also, I'm seeing really bad performance (around 30 minutes, I think) when ingesting a dataset of 0.3M docs, each with 93 columns (no links, only scalar fields). I've split the data into batches of 10k docs and use finagle-http (I have a Scala client) to send the requests (JSON). Queries are also slow: a distinct aggregate on one attribute takes around 40 (both Cassandra and the Doradus server run on a MacBook).
p.s. Hope you had a great time.
from doradus.
I just fixed one issue with the CSVLoader: when one object in the batch fails, the overall batch status is set to "warning", and warning-only batches weren't getting reported. If there's an invalid field name or value, you should see CSVLoader report these now.
The misleading progress reporting (...loaded xxx records) is also fixable with more work. The problem is that this log message is generated by the main thread as it parses and queues reports for worker threads. At the time of reporting, it doesn't know how many records actually succeeded. But this could be changed to something like "...queued xxx records, yyy loaded successful." I'll take a look at that next week.
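The queued-versus-loaded split described above could look roughly like the sketch below. It is not the CSVLoader implementation: the function load_with_progress and the load_one callback are hypothetical names, and it only shows one thread queuing records while workers count the ones that were actually stored.

```python
import queue
import threading


def load_with_progress(records, load_one, workers=4):
    """Queue records on the calling thread and count successes in workers.

    Returns (queued, loaded). `load_one` is a hypothetical callable that
    returns True when a record was stored; a False result or an exception
    counts as a failure. Records must not be None (None is the stop marker).
    """
    work = queue.Queue()
    lock = threading.Lock()
    counts = {"queued": 0, "loaded": 0}

    def worker():
        while True:
            record = work.get()
            if record is None:  # stop marker: no more work for this worker
                return
            try:
                ok = load_one(record)
            except Exception:
                ok = False
            if ok:
                with lock:  # several workers update the same counter
                    counts["loaded"] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for record in records:
        work.put(record)
        counts["queued"] += 1  # only the calling thread writes this counter
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return counts["queued"], counts["loaded"]


if __name__ == "__main__":
    queued, loaded = load_with_progress(range(100), lambda r: r % 10 != 0)
    print(f"...queued {queued} records, {loaded} loaded successfully.")
```

The point of the design is that the "queued" number is known immediately, while the "loaded" number is only trustworthy once the workers have reported back.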
from doradus.
Will these changes be available in the next release?
What about the second part of my previous comment on how Doradus is performing?
from doradus.
These changes are in the master branch, so you can download and build it if you like. Otherwise, they will be in the next release; however, we just created the v4.2 release and probably won't create another one for a while.
As for the performance issues: Spider databases are OK for moderate data volumes (millions of objects, but not billions) and moderate query requirements. Spider uses fully inverted indexing, so update cost grows in proportion to the number of fields indexed. Text fields generate the most mutations since they are parsed into terms. What kind of load rate (objs/sec) are you seeing? The queries that Spider is best at are "needle in the haystack" queries, such as finding objects where a field contains some term or falls in some range. Aggregate queries (COUNT(*)) will be the slowest queries. If you send me a sample query, I can take a look.
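To make the point about text fields concrete, here is a toy model of how index entries pile up per object. It is a simplification for illustration only, not Spider's real tokenizer or storage format: it assumes one index entry per scalar value and one per distinct term of a text field, and the function name is made up.

```python
import re


def estimate_index_mutations(obj, text_fields):
    """Toy count of index entries written for one object.

    Assumption: a scalar field costs one entry, and a text field costs one
    entry per distinct lower-cased word. Real tokenizers differ.
    """
    total = 0
    for field, value in obj.items():
        if field in text_fields:
            terms = set(re.findall(r"\w+", str(value).lower()))
            total += len(terms)
        else:
            total += 1
    return total


if __name__ == "__main__":
    doc = {
        "id": 7,
        "city": "Paris",
        "notes": "the quick brown fox jumps over the lazy dog",
    }
    print(estimate_index_mutations(doc, text_fields={"notes"}))
```

Under this model, an object with many text columns costs far more per update than one with the same number of scalar columns.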
When high-performance loading and fast aggregate queries are important, OLAP is much better, sometimes by several orders of magnitude. It uses columnar compression and no indexes, so updates generate far fewer mutations and load rates are much higher. In queries, OLAP can scan millions of objects per second. OLAP does, of course, require that data can be organized into shards.
If your data is time-stamped, immutable, and doesn't require links, the new Logging service is even faster. It doesn't require shards, and it loads and queries data even faster than OLAP. I can point out some links for more information if you like.
from doradus.
My data is timestamped and immutable. I have a set of dimensions and metrics on which I want to do analytics (an OLAP workload). I have a dataset of around 0.3M rows with 93 columns each. I submit batches of 10K JSON documents, and each request takes around 33.5s (mean). A distinct query, GET http://localhost:1123/app_name/table_name/_aggregate?m=DISTINCT(field_name), takes around.
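For reference, the batch figures above translate into a load rate as follows. The calculation assumes the batches are sent one after another with no overlap, which is a simplification, and the helper names are made up for the example.

```python
def load_rate(batch_size, seconds_per_batch):
    """Objects per second for a single batch request."""
    return batch_size / seconds_per_batch


def total_load_seconds(total_docs, batch_size, seconds_per_batch):
    """Wall-clock time if batches are sent strictly one at a time."""
    batches = -(-total_docs // batch_size)  # ceiling division
    return batches * seconds_per_batch


if __name__ == "__main__":
    rate = load_rate(10_000, 33.5)
    total = total_load_seconds(300_000, 10_000, 33.5)
    print(f"{rate:.0f} objs/sec, {total / 60:.2f} minutes in total")
```

With 10K documents per request at 33.5s each, that is roughly 300 objs/sec, or 30 requests taking about 1005s in total.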
I've explicitly set the storage service option to OLAPService in the app schema that I submit to Doradus. But when checking my app schema on Doradus (i.e. GET http://localhost:1123/_applications), I see the storage service set to SpiderService! I don't know why (maybe the server was not started with the OLAPService enabled; what's the default behaviour?), but this is definitely why the insert/query is so slow.
What about the Logging service? It's not mentioned in the wiki. How do I use it?
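A small script can make this check repeatable. The sketch below fetches the same /_applications URL and looks for the storage service option anywhere in the response. Two things in it are assumptions rather than confirmed Doradus behaviour: that the server returns JSON when asked via the Accept header, and that the option appears under a key named "StorageService". The function names are made up.

```python
import json
import urllib.request


def find_option(node, key):
    """Depth-first search for `key` in nested dicts and lists.

    Returns every value stored under that key, in traversal order.
    """
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_option(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_option(item, key))
    return found


def storage_services(base_url="http://localhost:1123"):
    """Fetch /_applications and list the storage services it mentions.

    Assumptions: the server honours "Accept: application/json" and names
    the option "StorageService"; adjust both if your server differs.
    """
    request = urllib.request.Request(
        base_url + "/_applications",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return find_option(json.load(response), "StorageService")


if __name__ == "__main__":
    print(storage_services())
```

Searching the whole document for the key avoids depending on the exact nesting of the schema response.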
from doradus.
It sounds like the OLAP or Logging service would work much better for you. The Logging service is brand new and I'm still working on wiki pages/tutorials, but there is a PDF document for it located here: https://github.com/dell-oss/Doradus/blob/master/docs/Doradus%20Logging%20Database.pdf
from doradus.
I'm trying to understand how Doradus stores its data in Cassandra. It looks like it creates a single SSTable with only a few row ids (36); I expected around 0.3M, as this is the size of my dataset. Also, it's not using any memtable, nor does it use bloom filters!! I wonder how queries/aggregations can be fast then.
$ /bin/nodetool --host localhost cfstats
Keyspace: Doradus
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 0
    Write Latency: NaN ms.
    Pending Flushes: 0
        Table: Applications
        SSTable count: 1
        Space used (live): 8164
        Space used (total): 8164
        Space used by snapshots (total): 0
        Off heap memory used (total): 23
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 1
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 16
        Bloom filter off heap memory used: 8
        Index summary off heap memory used: 15
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 3312
        Compacted partition maximum bytes: 3973
        Compacted partition mean bytes: 3973
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

        Table: OLAP
        SSTable count: 1
        Space used (live): 19850466
        Space used (total): 19850466
        Space used by snapshots (total): 0
        Off heap memory used (total): 153
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 36
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 56
        Bloom filter off heap memory used: 48
        Index summary off heap memory used: 105
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 61
        Compacted partition maximum bytes: 654949
        Compacted partition mean bytes: 571379
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

        Table: Tasks
        SSTable count: 1
        Space used (live): 5128
        Space used (total): 5128
        Space used by snapshots (total): 0
        Off heap memory used (total): 36
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 2
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 16
        Bloom filter off heap memory used: 8
        Index summary off heap memory used: 28
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 36
        Compacted partition maximum bytes: 258
        Compacted partition mean bytes: 150
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0
from doradus.
OLAP uses columnar storage and various compression techniques to store data very compactly, so it doesn't use much disk. Here are some links to presentations that provide a little more insight on how OLAP works:
- Strata presentation: http://www.slideshare.net/randyguck/strata-presentation-doradus
- O'Reilly webinar: http://www.slideshare.net/randyguck/extending-cassandra-with-doradus-olap-for-high-performance-analytics
If you download the slides, the notes on each slide provide extra info. Hope this helps!
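As a rough illustration of why a columnar layout can end up with far fewer stored rows than objects, here is a toy sketch of the general idea. It is not Doradus's actual storage format: the pivot into per-field value lists and the use of run-length encoding are both stand-ins chosen for the example, since all that is stated above is that OLAP uses columnar storage and "various compression techniques".

```python
from itertools import groupby


def to_columns(records):
    """Pivot a list of dicts into one value list per field.

    A batch of N objects becomes one list per field, so the number of
    stored units depends on the number of fields, not on N.
    """
    fields = sorted({name for record in records for name in record})
    return {name: [record.get(name) for record in records] for name in fields}


def run_length_encode(values):
    """Collapse runs of equal neighbours into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


if __name__ == "__main__":
    batch = [
        {"country": "FR", "clicks": 1},
        {"country": "FR", "clicks": 2},
        {"country": "US", "clicks": 2},
    ]
    columns = to_columns(batch)
    print(len(columns), "stored columns for", len(batch), "objects")
    print(run_length_encode(columns["country"]))
```

Low-cardinality dimension columns compress especially well under this kind of scheme, because long runs of the same value collapse into a single pair.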
from doradus.