
Comments (22)

isaacna commented on August 26, 2024

Ok so I think the root cause is inconsistent event_datetime being used deep in the recursion. The following logs display it:

Logs during event generation: https://pastebin.com/tdKQN3UH
Logs during reprocessing: https://pastebin.com/etp8Y1PU
Also, for reference, my GCP project is stg-cdp-seattle-isaac if you want to look at my data.

Branch:

On lines 45-46, we see Event getting hashed with event_datetime 2021-08-16 02:42:10.318957-05:00. This results in the hash 2d79b33f34fc, which is stored as the Event.id in Firestore.

In the first paste, hash generation for Session starts at line 59.

  1. It recursively calls the generate function for Session -> Event -> Body.
  2. The Body is loaded via reference doc and is correct and consistent (4346a8351006) on lines 67-68.
  3. The Event is hashed using event_datetime 2021-08-16 07:42:10.318957+00:00! This is different from the datetime used on line 45.
  4. On line 71, the output of this is 29f33b8722df, while the hash used in GCP is 2d79b33f34fc.
  5. This leads to Session being hashed with a different Event.id, 29f33b8722df, during creation. During regeneration, Session gets hashed with 2d79b33f34fc.

The two different event_datetimes used are the datetimes I have for two different sessions in my test data:

{
   "body":{
      "name":"Full Council",
      "is_active":true,
      "start_datetime":null,
      "description":null,
      "end_datetime":null,
      "external_source_id":null
   },
   "sessions":[
      {
         "session_datetime":1629099730.318957,
         "video_uri":"assets/sample-5s.mp4",
         "session_index":0,
         "caption_uri":null,
         "external_source_id":null
      },
      {
         "session_datetime":1629099930.318957,
         "video_uri":"assets/wombo_combo.mp4",
         "session_index":1,
         "caption_uri":null,
         "external_source_id":null
      }
   ],
   "event_minutes_items":null,
   "agenda_uri":null,
   "minutes_uri":null,
   "static_thumbnail_uri":null,
   "hover_thumbnail_uri":null,
   "external_source_id":null
}

As to why the event_datetime used on Event is different, I'm not sure, because it should always default to the datetime from the first session. But I feel like the trickiest part of investigating the hash generation should be answered now?
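
As an aside on the mechanics: here is a minimal standalone sketch (not pipeline code) showing that pickling timezone-aware datetimes is sensitive to the offset they are expressed in, so two stamps for the same instant can still produce different hashes when the hasher pickles raw datetime values, as the function shown later in this thread does:

from datetime import datetime, timedelta, timezone
from hashlib import sha256
import pickle

# The two representations seen in the logs above
local = datetime(2021, 8, 16, 2, 42, 10, 318957, tzinfo=timezone(timedelta(hours=-5)))
utc = datetime(2021, 8, 16, 7, 42, 10, 318957, tzinfo=timezone.utc)

print(local == utc)  # True: they compare as the same instant
print(pickle.dumps(local, protocol=4) == pickle.dumps(utc, protocol=4))  # False: different bytes
print(sha256(pickle.dumps(local, protocol=4)).hexdigest()[:12])
print(sha256(pickle.dumps(utc, protocol=4)).hexdigest()[:12])  # two different 12-character ids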

isaacna commented on August 26, 2024

I think my comment above might make the hashing function consistent (for regeneration at least), but it doesn't identify the root cause...

I think a potential problem might be here:
https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/database/functions.py#L50-L53

        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field
            setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))

            # Now attach the generated hash document path
            hasher.update(pickle.dumps(field.id, protocol=4))

On line 50 we set the underlying reference model's id. But on line 53, when we update the hasher, we use field.id, and field was assigned before we recursively generated the id on line 50. I think this would cause a difference when generating the hash for the first time vs. when the underlying models already exist.

Maybe this is why the original ids don't match the regenerated ids?
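
Whether that actually diverges depends on whether generate_and_attach_doc_hash_as_id mutates the passed-in model in place or returns a fresh object; a tiny sketch of that aliasing question (toy class, not the real code):

class Doc:
    def __init__(self):
        self.id = None


def set_id_in_place(doc: Doc) -> Doc:
    doc.id = "new-id"
    return doc  # same object, mutated


def set_id_on_copy(doc: Doc) -> Doc:
    fresh = Doc()
    fresh.id = "new-id"
    return fresh  # different object


field = Doc()
attached = set_id_in_place(field)
print(field.id)  # 'new-id' -> the local name already sees the update

field = Doc()
attached = set_id_on_copy(field)
print(field.id)  # None -> the local name would be stale here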

evamaxfield commented on August 26, 2024

Whatever the fix is, we will need to patch it in and then reprocess all of the above meetings. Which pushes the need for a "delete all event data for event id x" function to high.

Basically, this single issue becomes two:

  1. fix the hashing function
  2. add a "delete_event" function (a rough sketch follows below)
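
As a starting point, here is a very rough sketch of what a "delete_event" helper could look like. This is an assumption-heavy illustration: it presumes fireo's Model.collection.delete(key) API and only handles Session children, not transcripts, files, minutes items, etc.

from cdp_backend.database import models as db_models


def delete_event(event_id: str) -> None:
    # Fetch the event document by its path
    event = db_models.Event.collection.get(
        f"{db_models.Event.collection_name}/{event_id}"
    )

    # Delete child sessions that reference this event
    for session in db_models.Session.collection.filter(
        "event_ref", "==", event.key
    ).fetch():
        db_models.Session.collection.delete(session.key)

    # Finally delete the event document itself
    db_models.Event.collection.delete(event.key)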

evamaxfield commented on August 26, 2024

I already have some old code lying around for deleting an entire event and all of its attachments and files. Still need to finish it, however.

dphoria commented on August 26, 2024

I have looked at a ton of events and to be honest I am not sure how this doesn't affect every single event of ours.

That's where I am at the moment as well.

evamaxfield commented on August 26, 2024

I have a change that gets all the regenerated hashes to match regardless of whether they start out as a reference model or an in-memory model:

In [1]: from cdp_backend.database import models as db_models
   ...: import fireo
   ...: from google.auth.credentials import AnonymousCredentials
   ...: from google.cloud.firestore import Client
   ...: 
   ...: from cdp_backend.database.functions import generate_and_attach_doc_hash_as_id
   ...: 
   ...: # Connect to the database
   ...: fireo.connection(client=Client(
   ...:     project="cdp-seattle-staging-dbengvtn",
   ...:     credentials=AnonymousCredentials()
   ...: ))
   ...: 
   ...: def print_session_hashes(event_id):
   ...:     event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
   ...:     sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
   ...:     for session in sessions:
   ...:         print("original", session.id)
   ...:         updated = generate_and_attach_doc_hash_as_id(session)
   ...:         print("regenerated", updated.id)
   ...: 
   ...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f

In [2]: def print_session_hashes(event_id):
   ...:     event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
   ...:     sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
   ...:     for session in sessions:
   ...:         session.event_ref = session.event_ref.get()
   ...:         print("original", session.id)
   ...:         updated = generate_and_attach_doc_hash_as_id(session)
   ...:         print("regenerated", updated.id)
   ...: 
   ...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f

dphoria commented on August 26, 2024

I am still not sure whether this is what caused the two variants of the issue. The scraper issue I originally labeled as "two of the same session for a single event". For the frontend issue, @tohuynh found out that they were the same sessions but one was missing a transcript.

In other words, "did the web page decide this event had 2 sessions because it got 2 different session.ids for that event"?

isaacna commented on August 26, 2024

Could it be related to this part of the hasher function, where if the field is a ReferenceDocLoader (not loaded in memory) we hash by id, but if it's already loaded then we hash by the field (instead of the id)?
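
A toy illustration (plain class, not a fireo Model) of why those two code paths can diverge: pickling the id string and pickling the loaded object produce different byte streams, and therefore different digests:

from hashlib import sha256
import pickle


class Toy:
    """Stand-in for a loaded model; not the real fireo Model class."""

    def __init__(self, id: str):
        self.id = id


loaded = Toy("2d79b33f34fc")

# Hash by id vs. hash by the whole object: different bytes, different digests
print(sha256(pickle.dumps(loaded.id, protocol=4)).hexdigest()[:12])
print(sha256(pickle.dumps(loaded, protocol=4)).hexdigest()[:12])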

But I have no idea why it'd only be affecting the Seattle staging deployment and not the other ones. In Firestore, is there anything different data-wise between the duplicate sessions besides their id?

isaacna commented on August 26, 2024

An easy but not optimal solution is to turn autoload back on to avoid these hashing problems, although that'd come at the cost of a lot of latency and extra GCP calls. But I don't think that's the route we want to go, given the original reasons for turning autoload off.

isaacna commented on August 26, 2024

@JacksonMaxfield I copied your script to test it out for myself and made the following changes to the hash function:

def generate_and_attach_doc_hash_as_id(db_model: Model) -> Model:
    # Create hasher and hash primary values
    hasher = sha256()
    for pk in db_model._PRIMARY_KEYS:
        field = getattr(db_model, pk)

        # Load doc if it's a reference doc
        if isinstance(field, ReferenceDocLoader):
            field = field.get()

        # Handle reference fields by using their doc path
        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field
            setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))

            # Now attach the generated hash document path
            hasher.update(pickle.dumps(field.id, protocol=4))

        # Otherwise just simply add the primary key value
        else:
            # Hash by field value
            hasher.update(pickle.dumps(field, protocol=4))

    # Set the id to the first twelve characters of hexdigest
    db_model.id = hasher.hexdigest()[:12]

    return db_model

This generates the following:

print_session_hashes("c847010ce7a4")
71d1777f22c7
71d1777f22c7
e8b6fd84193f
71d1777f22c7

print_session_hashes("9409f4d1ea09")
25d3a51447d0
25d3a51447d0
30701b4a0fbf
25d3a51447d0

(The order of the output hashes printed is: most recent pipeline-run hash, rehash after explicit model pull, original pipeline-run hash, rehash after explicit model pull.)

Does this fix the issue? I was a little confused by the ordering of the print statements described above.

The main difference here is that previously the following line wasn't being executed when the field was a ReferenceDocLoader:

setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))

I just force-load the reference doc when we need it, and then we either hash by the id if the field is a Model (while recursively calling the function), or by the field value if it's not a Model. Even though we're loading the model, I still think this works around the original autoload issue because we're only loading for primary keys rather than for the whole database object.

evamaxfield commented on August 26, 2024

@isaacna I have similar changes here: main...bugfix/model-id-hasher

We arrive at the same conclusion, just via different routes haha

However, this still confuses me a bit because while these changes solve the hashing problem, they don't explain why only some of our events are hit by this problem.

Like the fundamental "why did this happen in the first place" is still a thing.

isaacna commented on August 26, 2024

However, this still confuses me a bit because while these changes solve the hashing problem, they don't explain why only some of our events are hit by this problem.

Like the fundamental "why did this happen in the first place" is still a thing.

This I'm not sure about. Maybe the events that weren't affected just weren't reprocessed because they weren't picked up in the event gather pipeline.

For the changes, what do you think about something like this:

    for pk in db_model._PRIMARY_KEYS:
        field = getattr(db_model, pk)

        # Reprocessing use case
        if isinstance(field, ReferenceDocLoader):
            field = field.get()
            hasher.update(pickle.dumps(field.id, protocol=4))
        # First time generation use case
        elif isinstance(field, Model):
            field = generate_and_attach_doc_hash_as_id(field)
            setattr(db_model, pk, field)
            hasher.update(pickle.dumps(field.id, protocol=4))
        # Otherwise hash the plain field value
        else:
            hasher.update(pickle.dumps(field, protocol=4))

    # Set the id to the first twelve characters of hexdigest
    db_model.id = hasher.hexdigest()[:12]

    return db_model

In the above, we use elif to avoid unnecessarily doing the recursive work for the reprocessing use case. This does assume, however, that the existing data is all correct, since we're not repeating the hash generation. So if we want to favor consistency over speed/cost, I think regenerating the hash each time works, but once the database is clean, this should be more efficient.

One thing I thought was weird is when I run the script with these changes, I get:

print_session_hashes("c847010ce7a4")
71d1777f22c7
5626e734b33d
e8b6fd84193f
5626e734b33d

Whereas with your changes in the bugfix branch I get:

print_session_hashes("c847010ce7a4")
71d1777f22c7
71d1777f22c7
e8b6fd84193f
71d1777f22c7

So for this example with my changes, 5626e734b33d is generated when hashing with the fetched event id from ReferenceDocLoader. I double checked that the event id fetched in the ReferenceDocLoader case was c847010ce7a4.

But when we load the reference and call the generate hash function recursively, the generated event id is 8a84dc698063. I'm not sure why these results are inconsistent unless there was something that changed in the downstream linked Event -> Body that would change the hash from the original generated value.

Does this make sense or am I missing/overthinking something?

evamaxfield commented on August 26, 2024

This I'm not sure about. Maybe the events that weren't affected just weren't reprocessed because they weren't picked up in the event gather pipeline.

Yea... but that just doesn't make sense to me. Basically every event in the Seattle pipeline gets reprocessed I think? I would need to dig into the GitHub Action logs.


So for this example with my changes, 5626e734b33d is generated when hashing with the fetched event id from ReferenceDocLoader. I double checked that the event id fetched in the ReferenceDocLoader case was c847010ce7a4.

But when we load the reference and call the generate hash function recursively, the generated event id is 8a84dc698063. I'm not sure why these results are inconsistent unless there was something that changed in the downstream linked Event -> Body that would change the hash from the original generated value.

Does this make sense or am I missing/overthinking something?

Yea I saw that behavior too, which is why I went my route of forcing the recursive model loading. It results in consistent hashes, which is always the end goal. Especially because, as I said, the order of prints is:

  1. the reprocessed session hash from the pipeline
  2. the rehash of the reprocessed session
  3. the original session hash from the first time the session went through the pipeline
  4. the rehash of the original session

That is hard to follow, but basically, whatever our change is should try to match the third hash, as that is the hash produced when the data has "never been seen before."

isaacna commented on August 26, 2024

The part that still confuses me is that when we force the model to rehash recursively for a session, the event id it comes up with is '8a84dc698063', which is different from the hash it generated for the event the first time.

evamaxfield commented on August 26, 2024

The part that still confuses me is that when we force the model to rehash recursively for a session, the event id it comes up with is '8a84dc698063', which is different from the hash it generated for the event the first time.

Agree. It's concerning as well. I am wondering if we are doing something wrong with our hashing in general.

evamaxfield commented on August 26, 2024

I have spent an hour writing down various notes on how this processing happened and I truly cannot figure out how this resulted in a "bad" hash.

I guess I will try to test this in a dev infrastructure tomorrow and add a bunch of print / log statements. This is wild. Whatever I find I will likely make a recording of because the explanation will likely be some level of conspiracy theory.

isaacna commented on August 26, 2024

Yeah, I wasn't able to reproduce the duplicate session issue locally with a minimal event either. The more I look into this, the stranger the issue gets.

But I think the core of the confusion, as to why the session hashes aren't matching in the script we've been running, is that the recursively generated hash is not the id that's being attached to the session. This seems to happen only for the Session db model.

I added log statements like so and ran your script:

        # Handle reference fields by using their doc path
        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field

            print(type(field))
            print("PRE GENERATE FIELD ID " + str(field.id))
            new_field = generate_and_attach_doc_hash_as_id(field)
            setattr(db_model, pk, new_field)
            print(type(field))
            print("GENERATED FIELD ID " + str(new_field.id))

Running the event gather pipeline (only relevant logs attached):

# Body hash matches the existing model's id 
<class 'cdp_backend.database.models.Body'>
PRE GENERATE FIELD ID 4346a8351006
<class 'cdp_backend.database.models.Body'>
<class 'cdp_backend.database.models.Body'>
GENERATED FIELD ID 4346a8351006

# Session hash doesn't
<class 'cdp_backend.database.models.Session'>
PRE GENERATE FIELD ID f1356f6d3179
<class 'cdp_backend.database.models.Session'>
<class 'cdp_backend.database.models.Session'>
GENERATED FIELD ID 541ac1cd31e3

# Session hash doesn't
<class 'cdp_backend.database.models.Session'>
PRE GENERATE FIELD ID cb32d7e83781
<class 'cdp_backend.database.models.Session'>
<class 'cdp_backend.database.models.Session'>
GENERATED FIELD ID 0b766156eabf

Your print session hash script (2 valid, separate sessions exist for this one event):

print_session_hashes("2d79b33f34fc")
# cb32d7e83781
# 0b766156eabf

# f1356f6d3179
# 541ac1cd31e3

Looking in my database, however, the session ids were cb32d7e83781 and f1356f6d3179. So I think there's something generally wrong with how we generate hashes, at least when it comes to Session. Other db models matched the "pre generated field" and the "generated field". I wonder if the order in which we upload in store_event_processing_results could have some odd effects?

evamaxfield commented on August 26, 2024

But I think the core of the confusion, as to why the session hashes aren't matching in the script we've been running, is that the recursively generated hash is not the id that's being attached to the session. This seems to happen only for the Session db model.

I quoted the above, but basically everything you said makes sense to me. The thing that confuses me is how we could possibly be hashing incorrectly. Session's PKs are event_ref and video_uri: video_uri is easy, and event_ref hashes itself and uses the generated id as the hash. To generate an id / hash for event_ref we look to its PKs, which are body_ref and event_datetime: event_datetime is easy, and body_ref does the same thing, recursing down to the PK name on the body. We know that the event hash and the body hash worked perfectly fine, so how did we get a different hash for the session...
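
To make that chain concrete, here is a toy sketch (plain dataclasses and a standalone helper, not the real fireo models or the actual cdp-backend function) of how the primary-key recursion walks Session -> Event -> Body:

from dataclasses import dataclass
from datetime import datetime
from hashlib import sha256
import pickle


@dataclass
class Body:
    name: str
    _PRIMARY_KEYS = ("name",)


@dataclass
class Event:
    body_ref: Body
    event_datetime: datetime
    _PRIMARY_KEYS = ("body_ref", "event_datetime")


@dataclass
class Session:
    event_ref: Event
    video_uri: str
    _PRIMARY_KEYS = ("event_ref", "video_uri")


def toy_hash(model) -> str:
    """Toy version of the id generation: hash only the primary keys."""
    hasher = sha256()
    for pk in model._PRIMARY_KEYS:
        field = getattr(model, pk)
        if hasattr(field, "_PRIMARY_KEYS"):
            # Reference field: recurse and hash the referenced model's id
            hasher.update(pickle.dumps(toy_hash(field), protocol=4))
        else:
            hasher.update(pickle.dumps(field, protocol=4))
    return hasher.hexdigest()[:12]


body = Body(name="Full Council")
event = Event(body_ref=body, event_datetime=datetime(2021, 8, 16, 7, 42, 10, 318957))
session = Session(event_ref=event, video_uri="assets/sample-5s.mp4")

# Same inputs always give the same ids; any drift in event_datetime's
# representation changes the Event id and therefore the Session id.
print(toy_hash(body), toy_hash(event), toy_hash(session))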

I have this weird feeling that maybe we were hitting some other type besides either a Model or a ReferenceDocLoader and it was hitting the final else? But your print statements don't match up with that.

isaacna commented on August 26, 2024

I have this weird feeling that maybe we were hitting some other type besides either a Model or a ReferenceDocLoader and it was hitting the final else? But your print statements don't match up with that.

Yeah I didn't include all the print statements, just the part that shows the hash not matching.
Full logs can be found here: https://pastebin.com/w9bET2YV

I'll need to dig deeper into the hash generation to see what's actually causing the wrong hash to be produced.

evamaxfield commented on August 26, 2024

This is a great find. Well done.

This is somewhat expected. The difference is the time zone handling, and it's known. Timezone handling will always be the bane of programming.

I would say we simply add another if statement to our hashing function that checks whether the value is an instance of datetime or something and converts it to an ISO string, just without the milliseconds and the timezone. Thoughts?
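
A rough sketch of what that check could look like (the helper name is hypothetical, and it assumes we convert to UTC before dropping the offset and sub-second part, so the two representations from the logs above normalize to the same string):

from datetime import datetime, timedelta, timezone


def normalize_datetime_for_hash(value):
    """Hypothetical helper: make equal instants hash identically."""
    if isinstance(value, datetime):
        if value.tzinfo is not None:
            value = value.astimezone(timezone.utc)
        # ISO string without sub-second precision or timezone info
        return value.replace(microsecond=0, tzinfo=None).isoformat()
    return value


local = datetime(2021, 8, 16, 2, 42, 10, 318957, tzinfo=timezone(timedelta(hours=-5)))
utc = datetime(2021, 8, 16, 7, 42, 10, 318957, tzinfo=timezone.utc)
assert normalize_datetime_for_hash(local) == normalize_datetime_for_hash(utc)  # '2021-08-16T07:42:10'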

isaacna commented on August 26, 2024

I would say we simply add another if statement to our hashing function that checks whether the value is an instance of datetime or something and converts it to an ISO string, just without the milliseconds and the timezone. Thoughts?

Yep, I agree with this. We could also convert it to epoch millis, so we can be as granular as milliseconds but not have to consider timezones.
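
For the epoch-millis variant, a minimal sketch (again just an illustration, not the final implementation):

from datetime import datetime, timezone


def to_epoch_millis(value: datetime) -> int:
    # For timezone-aware datetimes, timestamp() accounts for the offset,
    # so equal instants always yield equal epoch values
    return int(value.timestamp() * 1000)


print(to_epoch_millis(datetime(2021, 8, 16, 7, 42, 10, 318957, tzinfo=timezone.utc)))  # 1629099730318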

isaacna commented on August 26, 2024

I'm working on the fix described above and will put out a PR soon.
