
Comments (29)

nickva commented on June 26, 2024

I just rebuilt all the 3.3.3 packages to use the latest Erlang patch version 24.3.4.17. If you update from the deb repo you should get the latest version from there: 3.3.3-2
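
For example, on Ubuntu/Debian that should just be the usual package upgrade (a sketch, assuming the Apache CouchDB apt repository is already configured):

sudo apt-get update
sudo apt-get install --only-upgrade couchdb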

nickva commented on June 26, 2024

With include_docs=true, tweaking [fabric] all_docs_concurrency = 10 might have an effect.

https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_all_docs.erl#L297

That spawns 10 background workers at the fabric (cluster) level to open 10 documents in parallel, but that puts more work on the cluster. You can try decreasing or increasing it a bit and see if it changes anything.
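
If it helps, the setting can also be changed at runtime through the config API instead of editing local.ini. A minimal sketch (the _local node name, credentials and the value 20 are placeholders):

curl -X PUT http://admin:password@127.0.0.1:5984/_node/_local/_config/fabric/all_docs_concurrency -d '"20"'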

rnewson commented on June 26, 2024

Your expectation is well-founded, and couchdb does behave that way. So there must be something wrong somewhere. Could it be that the number of in-progress requests keeps climbing? That is, making more and more queries even though previous ones have not yet completed? If so, increased latency on the later requests is expected due to internal queues.

I suggest using a tool like ApacheBench to confirm or refute this: set it with some modest concurrency and point it at a view with some appropriate query parameters (i.e., don't read the entire view on each request).
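
Something along these lines, as a rough sketch (URL, credentials, view name and key range are placeholders):

ab -n 10000 -c 10 'http://admin:password@127.0.0.1:5984/dbname/_design/ddoc/_view/by_key?startkey="a"&endkey="b"&limit=100'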

Note that a view query operates in two phases. The first phase ensures the view is up to date with respect to any database updates (by running any new updates through the map and reduce functions). The second phase is the actual query. It will help us to know if the database in question is being written to (and, if so, how intensely) during these queries. E.g., we would expect a view query to take longer if a large number of updates were made to the database just beforehand.

luca-simonetti commented on June 26, 2024

Thanks for the reply!
I don't think the issue is with queries not being completed, since

a. we do them in series - i.e. we wait for one response before issuing the next query
b. we even have a sleep(.5) after each request before sending another one.
c. resources -- namely CPU and memory -- are pretty good (around 10% CPU and 10% memory usage)

From an outside point of view it looks like there's some kind of short-term memory cache that keeps growing, and each query traverses that ever-growing object.

For the record: the requests right after the service restart take something like a few milliseconds, whereas requests after a few thousand iterations get up to 120,000 ms (which is the maximum amount of time for us before calling it a day and raising an exception).

we'll try benchmarking with ab

rnewson commented on June 26, 2024

there is no cache of query results inside couchdb, for sure.

Is the database growing during this? We'd expect longer query times for larger views, even if the view is up-to-date when queried (it's ultimately B+Trees, so the more data the taller the tree, and the larger the database the more likely we have to read more nodes from disk).

At this point it would help to see your specific map/reduce functions and an example document.

rnewson commented on June 26, 2024

also check /_active_tasks as view compaction tasks may be running. you can still query while those run, but they'll obviously be using more i/o and perhaps this is a factor in performance (less of an issue on SSD which, hopefully, you are already using).
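
e.g. a quick check (credentials are placeholders):

curl -s http://admin:password@127.0.0.1:5984/_active_tasks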

luca-simonetti commented on June 26, 2024

No, the database is not growing during the read operations (the actual platform is a QA environment where the only operation is the "mega read"). We're speaking of 100,000 reads.
Also: right after the restart the whole thing gets back to normal (200ms) and starts to degrade again in a couple of minutes.

_active_tasks is empty ([])
I'm not sure what you mean by our map/reduce, since we're using the Hyperledger API to access CouchDB.

I can show you the Go code for the chaincode that does the reads though:

func (c *xxxxx) Token(ctx contractapi.TransactionContextInterface, tokenId string) (string, error) {

	token, err := GetTokenData(ctx, tokenId)
	if err != nil {
		return "", err
	}

	tokenxxxxx := xxxxxx
	iterator, err := ctx.GetStub().GetStateByPartialCompositeKey(PrefixTokenxxxxxx, []string{tokenId})
	if err != nil {
		return "", fmt.Errorf("error creating asset chaincode: %v", err)
	}
	for iterator.HasNext() {
		queryResponse, err := iterator.Next()
		if err != nil {
			return "", err
		}
		// Retrieve the tokenId from the composite key
		_, compositeKeyParts, err := ctx.GetStub().SplitCompositeKey(queryResponse.Key)
		if err != nil {
			return "", err
		}

		/// OTHER CODE REDACTED BUT OF NO CONCERN
	}
	_ = iterator.Close()
	resultByte, err := json.Marshal(tokenWithxxxxxxx)
	return string(resultByte), nil
}

rnewson commented on June 26, 2024

Can you express that as an http request?

luca-simonetti commented on June 26, 2024

I think that gets translated to two kinds of queries:

  1. /xxxxxx_yyyyyy/_all_docs?attachments=true&endkey="\\u0000xxxxx\\u000014302\\uxxxxx�"&include_docs=true&inclusive_end=false&limit=1001&startkey="\\u0000xxxxx\\u000014302\\u0000"
  2. /xxxxxx_yyyyyy/\x00xxxxx\x0014520\x009892299\x00?attachments=true

the first is the one I'm benchmarking now (-c 10 -n 100000) and it is sometimes really slow
the second seems to take around 10-20ms and is run over the 100 results of the first query

But I still don't understand why, if I run the requests one after another and not in parallel, they influence each other.

luca-simonetti commented on June 26, 2024

here's the ab result

Document Length:        47691 bytes

Concurrency Level:      10
Time taken for tests:   2219.996 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      4792500000 bytes
HTML transferred:       4769100000 bytes
Requests per second:    45.05 [#/sec] (mean)
Time per request:       222.000 [ms] (mean)
Time per request:       22.200 [ms] (mean, across all concurrent requests)
Transfer rate:          2108.19 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       45   54  41.0     52    1089
Processing:   147  168  28.6    163     854
Waiting:       47   57   7.1     56     361
Total:        193  222  51.2    216    1650

Percentage of the requests served within a certain time (ms)
  50%    216
  66%    220
  75%    223
  80%    227
  90%    238
  95%    250
  98%    291
  99%    349
 100%   1650 (longest request)

luca-simonetti commented on June 26, 2024

browsing through old issues I found this one, which sounds a lot like our problem:
#1319 (comment)
at this point: could this be related to how hyperledger manages connections?

rnewson commented on June 26, 2024

Hi, yes, it could be. A typical http client would reuse connections (indeed, would manage a pool of them), check that you're doing that first. Otherwise you could possibly use up all possible ports and have to wait for one to become available.

Try your ab command again, but change it to run for longer (either more requests or switch to -t) and compare a run with and without the -k parameter, which enables http keepalive. The results with -k should have much more consistent durations.
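
For example, something like this (URL and credentials are placeholders), once without and once with -k:

ab -t 120 -c 10 'http://admin:password@127.0.0.1:5984/dbname/_all_docs?include_docs=true&limit=100'
ab -t 120 -c 10 -k 'http://admin:password@127.0.0.1:5984/dbname/_all_docs?include_docs=true&limit=100'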

luca-simonetti commented on June 26, 2024

> Hi, yes, it could be. A typical http client would reuse connections (indeed, would manage a pool of them), check that you're doing that first. Otherwise you could possibly use up all possible ports and have to wait for one to become available.
>
> Try your ab command again, but change it to run for longer (either more requests or switch to -t) and compare a run with and without the -k parameter, which enables http keepalive. The results with -k should have much more consistent durations.

I tried with -k, which yielded similar results to yesterday's test. We don't have much we can do about the connections, since we're using CouchDB indirectly, through Hyperledger Fabric.

> With include_docs=true, tweaking [fabric] all_docs_concurrency = 10 might have an effect.
>
> https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_all_docs.erl#L297
>
> That spawns 10 background workers at the fabric (cluster) level to open 10 documents in parallel, but that puts more work on the cluster. You can try decreasing or increasing it a bit and see if it changes anything.

we'll definitely try this!!
thanks

luca-simonetti commented on June 26, 2024

> With include_docs=true, tweaking [fabric] all_docs_concurrency = 10 might have an effect.
>
> https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_all_docs.erl#L297
>
> That spawns 10 background workers at the fabric (cluster) level to open 10 documents in parallel, but that puts more work on the cluster. You can try decreasing or increasing it a bit and see if it changes anything.

We tried using this, but with no luck.
One thing that might be of help: we see the memory increasing linearly as the requests are made.

This is the memory in the last seven days:

[Screenshot: memory usage over the last seven days, taken 2024-05-06 at 16:08]

The points where the memory rapidly decreases are the restarts of the couchdb service, which make the system fast again (for a couple of minutes).
I'm really wondering if there's any kind of memory leak somewhere.

nickva commented on June 26, 2024

@luca-simonetti thanks for trying the concurrency setting.

Yeah, a memory leak is plausible; we have debugged a number of those in the last few years, some in OTP 25 and 24.

Some memory usage increase is expected, perhaps, if it's the page cache that's using the memory, though that wouldn't explain the slowdown...

Another thing to tweak might be max_dbs_open. That helped in another issue related to memory usage: #4992; another interesting one might be #4988 (comment). Though, again, in those cases performance didn't seem to be much of an issue.

Are there any exceptions or errors showing in the logs?

To get more details about the internals, you can try getting the output from /_node/_local/_system and see whether the "memory" objects and "message_queues" values increase. #4992 (comment) has an example of calling that and what part of the response might look like. _local can be replaced with an Erlang (distribution) node name as well.
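
For example (a sketch; jq is only there for readability and the credentials are placeholders):

curl -s http://admin:password@127.0.0.1:5984/_node/_local/_system | jq '{memory, message_queues}'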

rnewson commented on June 26, 2024

which version of erlang are you using?

luca-simonetti commented on June 26, 2024

> which version of erlang are you using?

{
    "javascript_engine": {
        "version": "78",
        "name": "spidermonkey"
    },
    "erlang": {
        "version": "24.3.4.15",
        "supported_hashes": [
            "blake2s",
            "blake2b",
            "sha3_512",
            "sha3_384",
            "sha3_256",
            "sha3_224",
            "sha512",
            "sha384",
            "sha256",
            "sha224",
            "sha",
            "ripemd160",
            "md5",
            "md4"
        ]
    },
    "collation_driver": {
        "name": "libicu",
        "library_version": "70.1",
        "collator_version": "153.112",
        "collation_algorithm_version": "14"
    }
}

this is the full /_node/_local/_versions output

As for the memory leaks mentioned: we tried to look into those, but they are on the order of 50 GB, whereas in our case the memory used is around 200 MB, which is only 15% of the total available. For this very reason I don't really think it's related to those problems.

Also: we noticed that CPU usage is around 25%, which is really low, and tweaking all_docs_concurrency to 100 didn't change much: the time taken still increases over time, and CPU and RAM usage are the same as before.

luca-simonetti commented on June 26, 2024

and this is the current state of _node/_local/_system
it doesn't look too bad I guess...

{
    "uptime": 77078,
    "memory": {
        "other": 26982505,
        "atom": 631001,
        "atom_used": 605053,
        "processes": 170112432,
        "processes_used": 170076040,
        "binary": 474168,
        "code": 13917182,
        "ets": 2267344
    },
    "run_queue": 0,
    "run_queue_dirty_cpu": 0,
    "ets_table_count": 157,
    "context_switches": 851304076,
    "reductions": 3036158123783,
    "garbage_collection_count": 20109323,
    "words_reclaimed": 35778913031,
    "io_input": 144808316,
    "io_output": 1204370753,
    "os_proc_count": 0,
    "stale_proc_count": 0,
    "process_count": 444,
    "process_limit": 262144,
    "message_queues": {
        "couch_file": {
            "50": 0,
            "90": 0,
            "99": 0,
            "count": 38,
            "min": 0,
            "max": 0
        },
        "couch_db_updater": {
            "50": 0,
            "90": 0,
            "99": 0,
            "count": 38,
            "min": 0,
            "max": 0
        },
        // ... a bunch of other keys, all with value 0
    }
}

luca-simonetti commented on June 26, 2024

For an extreme test we tried to set [fabric] all_docs_concurrency = 10000 (ten thousand) and it suddenly worked perfectly for the first 8,000 requests. Then it immediately slowed back down to the previous point.
Since we need to get to 200,000 requests we need this behaviour to be consistent.
With this new information what is the actual problem?

UPDATE: I actually restarted the process to verify that this value fixes the problem, and it does not.
When it was fast it was basically requesting the "container" of the subtokens and somehow the values were returned super fast. After restarting couchdb it went back to the previous state, where it fetches the token and then the 100 subdocuments.
It would help if you could explain what all_docs_concurrency does, since it's not documented.

UPDATE2: I don't know if this is the case, but it could just be related to some Hyperledger internal caching. I'm really struggling to understand how the whole system works.

nickva commented on June 26, 2024

> For an extreme test we tried to set [fabric] all_docs_concurrency = 10000 (ten thousand) and it suddenly worked perfectly for the first 8,000 requests.
>
> After restarting couchdb it went back to the previous state, where it fetches the token and then the 100 subdocuments.
> It would help if you could explain what all_docs_concurrency does, since it's not documented.

all_docs_concurrency limits how many parallel doc reads are done at the cluster level. That speeds up include_docs=true _all_docs requests, but at the expense of consuming more resources per request.

Another factor to play with is Q. What is your Q value for the dbs? By default it's 2. You can try experimenting with higher values like 8, 16, 64.
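
You can read the current value from the database info document, e.g. (dbname and credentials are placeholders):

curl -s http://admin:password@127.0.0.1:5984/dbname | jq .cluster.q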

> UPDATE2: I don't know if this is the case, but it could just be related to some Hyperledger internal caching. I'm really struggling to understand how the whole system works.

If on your containers the CPU and memory usage are not very high, it does look like a slow-disk issue, but with a page cache or some other limited, faster cache in front of the disk. Try using a faster local disk or another setup as an experiment with the same container and see what numbers you get.

luca-simonetti commented on June 26, 2024

I don't think the disk is the issue here:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 312512 478476 2741660    0    0     2    19    0    0  3  0 97  0  0

as you can see the system doesn't spend any time waiting for IO ops.

What I see though is a memory cache increasing. After restart that very same cache decreases a lot and starts growing back again.

One more note: the requests that are slow are not the "include_docs=true" ones, but the "attachments=true" ones.
I reiterate just so we're on the same page: we make 10-20 million requests to download ALL documents with a key prefix in the database. For example we might have
piece0000001
piece0000002
...
piece1000000
...
piece9999999

and we make a request for each of those. So if there's something cached, it's of no use, or even a problem in this case, since each subsequent request accesses a different document from the one before.

Also: we cannot change the Q factor since the database is already created and we don't want to risk resharding and losing the whole thing...

rnewson commented on June 26, 2024

a memory leak around attachments is feasible, as they are handled as large binaries in erlang, and there have been some leaks in BEAM around that.

What version of erlang is couchdb using here?

rnewson commented on June 26, 2024

as for resharding, our advice to increase q would be to make a new, empty database with the new q and replicate from the current db to the new one with admin credentials and no filter. This ensures everything gets across. You could leave a continuous:true replication running while switching clients over from the old url to the new one, and then cancel it and delete the old database once everything is switched (i.e., no more reads or writes arriving at the old db).
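
Roughly, as a sketch (db names, the q value and credentials are placeholders):

# create the new database with the desired q
curl -X PUT 'http://admin:password@127.0.0.1:5984/dbname_q16?q=16'

# copy everything across; add "continuous": true to keep it running while clients are switched over
curl -X POST http://admin:password@127.0.0.1:5984/_replicate \
  -H 'Content-Type: application/json' \
  -d '{"source": "http://admin:password@127.0.0.1:5984/dbname", "target": "http://admin:password@127.0.0.1:5984/dbname_q16"}'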

luca-simonetti commented on June 26, 2024

> a memory leak around attachments is feasible, as they are handled as large binaries in erlang, and there have been some leaks in BEAM around that.
>
> What version of erlang is couchdb using here?

the Erlang version inside couchdb is:
"version": "24.3.4.15"

rnewson commented on June 26, 2024

ok, let's get you onto 24.3.4.17, as we know there are bugs in earlier point releases of that series. Let's rule those out first.

luca-simonetti commented on June 26, 2024

ok, cool! How do I do that?

rnewson commented on June 26, 2024

you'll need to download or build 24.3.4.17 (https://www.erlang.org/patches/otp-24.3.4.17) and build couchdb against it.
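
Roughly (a sketch; the exact steps depend on your environment, and the OTP and CouchDB docs cover the details):

# build Erlang/OTP 24.3.4.17 from source
git clone https://github.com/erlang/otp.git && cd otp
git checkout OTP-24.3.4.17
./configure && make && sudo make install

# then rebuild couchdb against it
cd /path/to/apache-couchdb-3.3.3
./configure && make release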

How did you install couchdb initially? We should update our binary packages with 24.3.4.17 for the same reason.

luca-simonetti commented on June 26, 2024

we installed the current couchdb version (3.3.3) using the official repository version for Ubuntu. I think Erlang came bundled with it.

luca-simonetti commented on June 26, 2024

thank you for the help! We updated couchdb as requested:

{
    "javascript_engine": {
        "version": "78",
        "name": "spidermonkey"
    },
    "erlang": {
        "version": "24.3.4.17",
        "supported_hashes": [
            "blake2s",
            "blake2b",
            "sha3_512",
            "sha3_384",
            "sha3_256",
            "sha3_224",
            "sha512",
            "sha384",
            "sha256",
            "sha224",
            "sha",
            "ripemd160",
            "md5",
            "md4"
        ]
    },
    "collation_driver": {
        "name": "libicu",
        "library_version": "70.1",
        "collator_version": "153.112",
        "collation_algorithm_version": "14"
    }
}

but unfortunately it didn't help. The problem is still the same 😢
