
Comments (18)

brancz commented on May 19, 2024

My apologies for taking so long to get back to this. Re-reading your #36 (comment), @inge4pres, are you saying you are trying to view trace type profiles? If so, I think this is our mistake: we don't actually support viewing trace profiles, because we never ended up finishing #56 / #61.

I think we should probably disable scraping trace profiles by default until we have this support implemented, to avoid confusing situations like this (if this is the case).


Regarding your last comment, do you have a sequence of steps to run to reproduce this? I can't reproduce it by just running a conprof instance from scratch. However, if I run conprof, collect some data, shut it down, and then start it back up, I can indeed reproduce some "failed to fetch any source profiles" errors. I'll investigate why this is happening.

from parca.

brancz commented on May 19, 2024

Quick update: I appear to have found part of the problem. There was an unsafe loading of data in the WAL. So the good news is that there is no corruption of the data on disk. The bad news is that for some reason new appends still yield this error, but after a restart those appends can be viewed without issues. I'm continuing to investigate.

I'll open a PR as soon as I fix the remaining issue.


brancz commented on May 19, 2024

I finally managed to find that last bug, and all new e2e tests are passing with this patch: #113

Thank you so much everyone for bearing with me!


brancz commented on May 19, 2024

Yes, I've noticed this as well. I haven't fully figured out why this happens, as we do parse profiles before appending them. Maybe we need to run further validation on the profile rather than just parsing it. Maybe an empty profile is a valid one, I'm not sure.


brancz commented on May 19, 2024

I just double checked, and honestly I don't understand how those samples get persisted, as parsing the profile should encounter the exact same issue judging by: https://github.com/google/pprof/blob/27840fff0d09770c422884093a210ac5ce453ea6/profile/profile.go#L167-L177


josharian commented on May 19, 2024

We hit this as well. I think the stored data is getting corrupted: I've seen a particular profile go from readable (because it rendered nicely) to "failed to fetch any source profiles". A large proportion of our samples right now are failing...


brancz commented on May 19, 2024

Thanks for reporting! I believe since the last storage rebase this started happening for data older than 8h, potentially even 2h.

I'm gonna try to build some data integrity tooling to test this over time.


inge4pres commented on May 19, 2024

Hey, I'm using the latest version too and facing this same issue for data that was just collected. How can I debug it? Is there an option to enable verbose logging?
Can you please also point me to a document or the part of the code that handles scraping? I am not sure whether I need to configure the targets with the HTTP host and port only, or whether I need to add the /debug URI too...
Thanks for this beautiful tool 👍🏼

Additional context: I see this log line when issuing the HTTP request to visualize a single trace in pprof-ui

2020/11/18 16:54:30 : parsing profile: unrecognized profile format


inge4pres commented on May 19, 2024

One more hint on this: when switching to version master-2020-11-04-ce50636 this log line appears

2020/11/18 17:09:35 : decompressing profile: gzip: invalid header

But after removing the old tsdb storage and recreating it from scratch, that version works on the first data point of all inputs (heap, goroutines, etc.) but not on the subsequent data points 🤔
Might it be that if a profile collection times out, the profile is somehow stored corrupted in the time series?
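
One way to narrow down a "gzip: invalid header" message is to check the stored bytes directly. The following is a minimal sketch, not conprof's actual code: it only assumes that a stored profile is expected to be a gzip stream, and the helper names (checkGzip, isInvalidHeader, gzipBytes) are hypothetical. It uses Go's standard compress/gzip package, whose ErrHeader error carries that exact message.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// checkGzip tries to fully decompress data as a gzip stream and returns
// the first error encountered, or nil if the stream is intact.
func checkGzip(data []byte) error {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return fmt.Errorf("decompressing profile: %w", err)
	}
	defer zr.Close()
	// Drain the stream so that truncation or checksum errors surface too.
	if _, err := io.Copy(io.Discard, zr); err != nil {
		return fmt.Errorf("reading profile: %w", err)
	}
	return nil
}

// isInvalidHeader reports whether err wraps gzip.ErrHeader.
func isInvalidHeader(err error) bool {
	return errors.Is(err, gzip.ErrHeader)
}

// gzipBytes compresses payload; used here only to build a known-good input.
func gzipBytes(payload []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(payload) // writes to a bytes.Buffer do not fail
	zw.Close()
	return buf.Bytes()
}

func main() {
	// Both inputs are stand-ins, not real profile data.
	good := gzipBytes([]byte("stand-in for a serialized profile"))
	bad := []byte("this is not a gzip stream")

	fmt.Println("good:", checkGzip(good))
	fmt.Println("bad: ", checkGzip(bad))
}
```

Running a check like this against bytes read back from storage would show whether the data is already damaged at rest or only when it is loaded.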


inge4pres commented on May 19, 2024

Hi @brancz, the steps you describe are exactly the ones needed to reproduce this. For some reason it's as if the tsdb files become unreadable?


brancz commented on May 19, 2024

I have two hunches right now: either the chunks mmap'ed to disk are somehow corrupted, or they get corrupted in some way when WAL replay happens. The latter would be better for us, as that would mean the storage itself isn't corrupted, just the way we load it. Once I have a better idea I'll report back here! :)

Thank you so much for reporting!


inge4pres commented on May 19, 2024

If you can give a pointer to a piece of code and/or a test to debug, I'd be happy to fork and see if I can help 😄


brancz commented on May 19, 2024

I just opened a PR to fix most of the problems I have found so far in the tsdb (conprof/db#2), but the latter issue is still present. To debug, I clone both repos into the same directory and then use a replace directive in the conprof repo to point at the local tsdb.
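
For readers unfamiliar with replace directives, the setup described above might look roughly like the go.mod fragment below. This is a sketch under assumptions: the module path github.com/conprof/db is inferred from the conprof/db repository name, and the relative path ../db assumes both repos were cloned side by side.

```
// In the conprof repo's go.mod (module path and local path are assumptions):
replace github.com/conprof/db => ../db
```

With a directive like this in place, builds and tests in the conprof repo compile against the local checkout, so changes to the tsdb can be tried without publishing a new version.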


brancz commented on May 19, 2024

I just wrote an extensive test that seems to indicate that the database is functioning just fine. I'll continue investigating further up the stack.


brancz commented on May 19, 2024

I tried a number of things, and definitely found a couple of small problems, but none of those ended up fixing this symptom. For what it's worth I was finally able to write a test to reproduce this.


brancz commented on May 19, 2024

I think I found the last "failed to fetch" errors with: #112.

There is still a remaining problem though: after a restart, previous series don't seem to be continued for some reason. At least now we're in a state where all data is viewable and queryable (with more tests to prevent this from happening again in the future)! 🎉

(Unfortunately we're being rate limited heavily by Docker in our CI environment, so it may take a couple of hours until images are available. I'll look into moving to GitHub Actions to prevent this.)

edit: looks like at least the amd64 image managed to push


brancz commented on May 19, 2024

With #112 and #113 merged I think we can close this. Thank you everyone for reporting, and please open new issues if you find anything else or if you think this isn't resolved with the latest versions!


inge4pres commented on May 19, 2024

Thanks! Will test asap and send some feedback

