Ok so I think the root cause is inconsistent event_datetime
being used deep in the recursion. The following logs display it:
Logs during event generation: https://pastebin.com/tdKQN3UH
Logs during reprocessing: https://pastebin.com/etp8Y1PU
Also, for reference, my GCP project is stg-cdp-seattle-isaac if you want to look at my data.
Branch:
- https://github.com/isaacna/cdp-backend/blob/withoutHashFix/cdp_backend/database/functions.py#L26-L81
- https://github.com/isaacna/cdp-backend/blob/withoutHashFix/test_fire.py
On lines 45-46, we see `Event` getting hashed with `event_datetime` `2021-08-16 02:42:10.318957-05:00`. This results in the hash that's stored as the `Event.id` in firestore: `2d79b33f34fc`.
In the first paste, hash generation for `Session` starts at line 59.
- It recursively calls the generate function for `Session` -> `Event` -> `Body`.
- The `Body` is loaded via reference doc and is correct and consistent (`4346a8351006`) in lines 67-68.
- The event datetime is hashed using `event_datetime` `2021-08-16 07:42:10.318957+00:00`! This is different than the datetime used on line 45.
- On line 71, the output of this is `29f33b8722df`, while the hash used in GCP is `2d79b33f34fc`.
- This leads to `Session` being hashed with a different `Event.id` (`29f33b8722df`) during creation. During regeneration, `Session` gets hashed with `2d79b33f34fc`.

The two different `event_datetime`s used are the datetimes I have for two different sessions in my test data:
{
  "body": {
    "name": "Full Council",
    "is_active": true,
    "start_datetime": null,
    "description": null,
    "end_datetime": null,
    "external_source_id": null
  },
  "sessions": [
    {
      "session_datetime": 1629099730.318957,
      "video_uri": "assets/sample-5s.mp4",
      "session_index": 0,
      "caption_uri": null,
      "external_source_id": null
    },
    {
      "session_datetime": 1629099930.318957,
      "video_uri": "assets/wombo_combo.mp4",
      "session_index": 1,
      "caption_uri": null,
      "external_source_id": null
    }
  ],
  "event_minutes_items": null,
  "agenda_uri": null,
  "minutes_uri": null,
  "static_thumbnail_uri": null,
  "hover_thumbnail_uri": null,
  "external_source_id": null
}
As to why the `event_datetime` used on `Event` is different, I'm not sure, because it should always default to the datetime from the first session. But I feel like the trickiest part of investigating the hash generation should be answered now?
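For what it's worth, a quick standalone check (plain Python, not cdp-backend code) shows that the two datetimes quoted above are actually the same instant rendered in two different timezones, and that pickle-based hashing treats them as different values:

```python
import pickle
from datetime import datetime, timedelta, timezone
from hashlib import sha256

# The UTC datetime seen during reprocessing (line 71 of the paste)
utc = datetime(2021, 8, 16, 7, 42, 10, 318957, tzinfo=timezone.utc)
# The -05:00 rendering seen during generation (line 45 of the paste)
local = utc.astimezone(timezone(timedelta(hours=-5)))

assert str(local) == "2021-08-16 02:42:10.318957-05:00"
assert local == utc  # equal as instants in time...

# ...but their pickled bytes differ, so the derived digests differ
h_utc = sha256(pickle.dumps(utc, protocol=4)).hexdigest()
h_local = sha256(pickle.dumps(local, protocol=4)).hexdigest()
assert h_utc != h_local
```

Note that this instant also appears to correspond to the first session's `session_datetime` in the test data (1629099730.318957 is 2021-08-16 07:42:10.318957 UTC), so the mismatch may be timezone handling rather than datetimes from two different sessions.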
from cdp-backend.
I think my comment above might make the hashing function consistent (for regeneration at least), but doesn't identify the root cause...
I think a potential problem might be here:
https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/database/functions.py#L50-L53
if isinstance(field, Model):
    # Ensure that the underlying model has an id
    # In place update to db_model for this field
    setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))
    # Now attach the generated hash document path
    hasher.update(pickle.dumps(field.id, protocol=4))
On line 50 we set the underlying reference model's id. But on line 53, when we update the hasher, we use `field.id`, where `field` was assigned before we recursively generated the id on line 50. I think this would cause a difference between generating the hash for the first time and when the underlying models already exist.
Maybe this is why the original doesn't match the regenerated ids?
Whatever the fix is, we will need to patch it in and then reprocess all of the above meetings. Which pushes the need for a "delete all event data for event id `x`" function to high priority.
Basically, this single issue becomes two:
- fix the hashing function
- add a "delete_event" function
I already have some old code lying around for deleting an entire event and all of its attachments and files. Still need to finish it, however.
I have looked at a ton of events and, to be honest, I am not sure how this doesn't affect every single event of ours.
That's where I am at also at the moment.
I have a change that gets all the regenerated hashes to match regardless of if they start out as a reference model or an in-memory model:
In [1]: from cdp_backend.database import models as db_models
...: import fireo
...: from google.auth.credentials import AnonymousCredentials
...: from google.cloud.firestore import Client
...:
...: from cdp_backend.database.functions import generate_and_attach_doc_hash_as_id
...:
...: # Connect to the database
...: fireo.connection(client=Client(
...: project="cdp-seattle-staging-dbengvtn",
...: credentials=AnonymousCredentials()
...: ))
...:
...: def print_session_hashes(event_id):
...: event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
...: sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
...: for session in sessions:
...: print("original", session.id)
...: updated = generate_and_attach_doc_hash_as_id(session)
...: print("regenerated", updated.id)
...:
...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f
In [2]: def print_session_hashes(event_id):
...: event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
...: sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
...: for session in sessions:
...: session.event_ref = session.event_ref.get()
...: print("original", session.id)
...: updated = generate_and_attach_doc_hash_as_id(session)
...: print("regenerated", updated.id)
...:
...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f
I am still not sure if this is what caused the two variants of the issue as well. The scraper issue I originally labeled as "two of the same session for a single event". For the frontend issue, @tohuynh found out that they were the same sessions but one was missing a transcript.
In other words, "did the web page decide this event had 2 sessions because it got 2 different `session.id`s for that event"?
Could it be related to this part of the hasher function, where if the field is a `ReferenceDocLoader` (not loaded in memory) we hash by id, but if it's already loaded then we hash by the field (instead of the id)?
But I have no idea why it'd only be affecting the seattle staging deployment and not the other ones. In firestore, is there anything different data-wise between the duplicate sessions besides their id?
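To illustrate why that branch choice matters (a toy sketch with a hypothetical stand-in class, not fireo's actual `Model`): hashing the loaded object itself can never agree with hashing just its id, because the pickled bytes are completely different.

```python
import pickle
from dataclasses import dataclass
from hashlib import sha256

@dataclass
class LoadedDoc:
    """Hypothetical stand-in for a model already loaded in memory."""
    id: str
    name: str

doc = LoadedDoc(id="4346a8351006", name="Full Council")

# Hashing the whole field object vs. just its id gives different digests
by_field = sha256(pickle.dumps(doc, protocol=4)).hexdigest()[:12]
by_id = sha256(pickle.dumps(doc.id, protocol=4)).hexdigest()[:12]
assert by_field != by_id
```

So if the same field is sometimes a loaded model and sometimes an id-like reference, the two code paths would produce two different document ids for identical data.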
An easy but not optimal solution is to turn autoload back on to avoid these hashing problems, although that'd come at the cost of a lot of latency and extra GCP calls. But I don't think that's the route we want to go, given the original reasons for turning autoload off.
@JacksonMaxfield I copied your script to test it out for myself and made the following changes to the hash function:
# (imports assumed; these are the modules the function relies on)
import pickle
from hashlib import sha256

from fireo.models import Model
from fireo.queries.query_wrapper import ReferenceDocLoader

def generate_and_attach_doc_hash_as_id(db_model: Model) -> Model:
    # Create hasher and hash primary values
    hasher = sha256()
    for pk in db_model._PRIMARY_KEYS:
        field = getattr(db_model, pk)
        # Load doc if it's a reference doc
        if isinstance(field, ReferenceDocLoader):
            field = field.get()
        # Handle reference fields by using their doc path
        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field
            setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))
            # Now attach the generated hash document path
            hasher.update(pickle.dumps(field.id, protocol=4))
        # Otherwise just simply add the primary key value
        else:
            # Hash by field value
            hasher.update(pickle.dumps(field, protocol=4))
    # Set the id to the first twelve characters of hexdigest
    db_model.id = hasher.hexdigest()[:12]
    return db_model
This generates the following:
print_session_hashes("c847010ce7a4")
71d1777f22c7
71d1777f22c7
e8b6fd84193f
71d1777f22c7
print_session_hashes("9409f4d1ea09")
25d3a51447d0
25d3a51447d0
30701b4a0fbf
25d3a51447d0
(The order of the output hashes printed is: most recent pipeline-run-produced hash, rehash after explicit model pull, original pipeline-run-produced hash, rehash after explicit model pull.)
Does this fix the issue? I was a little confused by the ordering of the print statements described above.
The main difference here is that previously the following line wasn't being executed when the field was a `ReferenceDocLoader`:
    setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))
I just force loading the reference doc when we need it, and then we either hash by the `id` if the field is a `Model` (while recursively calling the function), or by the field value if it's not a `Model`. Even though we're loading the model, I still think this works around the original autoload issue, because we're only loading for primary keys rather than for the whole database object.
@isaacna I have similar changes here: main...bugfix/model-id-hasher
We arrive at the same conclusion, just via different routes haha
However, this still confuses me a bit because while these changes solve the hashing problem, they don't explain why only some of our events are hit by this problem.
Like the fundamental "why did this happen in the first place" is still a thing.
> However, this still confuses me a bit because while these changes solve the hashing problem, they don't explain why only some of our events are hit by this problem.
> Like the fundamental "why did this happen in the first place" is still a thing.
This I'm not sure about. Maybe the events that weren't affected just weren't reprocessed, if they weren't picked up in the event gather pipeline.
For the changes, what do you think about something like this:
for pk in db_model._PRIMARY_KEYS:
    field = getattr(db_model, pk)
    # Reprocessing use case
    if isinstance(field, ReferenceDocLoader):
        field = field.get()
        hasher.update(pickle.dumps(field.id, protocol=4))
    # First time generation use case
    elif isinstance(field, Model):
        field = generate_and_attach_doc_hash_as_id(field)
        setattr(db_model, pk, field)
        hasher.update(pickle.dumps(field.id, protocol=4))
    else:
        hasher.update(pickle.dumps(field, protocol=4))
# Set the id to the first twelve characters of hexdigest
db_model.id = hasher.hexdigest()[:12]
return db_model
In the above we use `elif` to avoid unnecessarily doing the recursive work for the reprocessing use case. This does assume, however, that the existing data is all corrected, since we're not repeating the hash generation. So if we want to favor consistency over speed/cost, I think regenerating the hash each time works, but once the database is clean, this should be more efficient.
One thing I thought was weird is when I run the script with these changes, I get:
print_session_hashes("c847010ce7a4")
71d1777f22c7
5626e734b33d
e8b6fd84193f
5626e734b33d
Whereas with your changes in the bugfix branch I get:
print_session_hashes("c847010ce7a4")
71d1777f22c7
71d1777f22c7
e8b6fd84193f
71d1777f22c7
So for this example with my changes, `5626e734b33d` is generated when hashing with the fetched event id from `ReferenceDocLoader`. I double-checked that the event id fetched in the `ReferenceDocLoader` case was `c847010ce7a4`.
But when we load the reference and call the generate hash function recursively, the generated event id is `8a84dc698063`. I'm not sure why these results are inconsistent, unless there was something that changed in the downstream linked `Event` -> `Body` that would change the hash from the original generated value.
Does this make sense or am I missing/overthinking something?
> This I'm not sure about. Maybe for the events that weren't affected just weren't reprocessed if they weren't picked up in the event gather pipeline.
Yea... but that just doesn't make sense to me. Basically every event in the Seattle pipeline gets reprocessed I think? I would need to dig into the GitHub Action logs.
> So for this example with my changes, 5626e734b33d is generated when hashing with the fetched event id from ReferenceDocLoader. I double checked that the event id fetched in the ReferenceDocLoader case was c847010ce7a4.
> But when we load the reference and call the generate hash function recursively, the generated event id is 8a84dc698063. I'm not sure why these results are inconsistent unless there was something that changed in the downstream linked Event -> Body that would change the hash from the original generated value.
> Does this make sense or am I missing/overthinking something?
Yea, I saw that behavior too, which is why I went my route of forcing the recursive model loading. It results in consistent hashes, which is always the end goal. Especially because, as I said, the order of prints is:
- the reprocessed session hash from the pipeline
- the rehash of the reprocessed session
- the original session hash from the first time the session went through the pipeline
- the rehash of the original session
That is hard to follow, but basically: whatever our change is, it should try to match the third hash, as that is the hash produced when the data has never been seen before.
The part that still confuses me is that when we force the model to rehash recursively for a session, the event id it comes up with is '8a84dc698063', which is different than the hash it generated for the event the first time.
> The part that still confuses me is that when we force the model to rehash recursively for a session, the event id it comes up with is '8a84dc698063' which is different than the hash it generated for the event the first time
Agree. It's concerning as well. I am wondering if we are doing something wrong with our hashing in general.
I have spent an hour writing down various notes on how this processing happened, and I truly cannot figure out how it resulted in a "bad" hash.
I guess I will try to test this in a dev infrastructure tomorrow and add a bunch of print / log statements. This is wild. Whatever I find I will likely make a recording of because the explanation will likely be some level of conspiracy theory.
Yeah, I wasn't able to reproduce the duplicate session issue locally with a minimal event either. The more I look into this, the stranger the issue gets.
But I think the core of the confusion, related to why the session hashes aren't matching in the script we've been running, is that the recursively generated hash is not the id that's being attached to the session. This seems to happen only for the `Session` db model.
I added log statements like so and ran your script:
# Handle reference fields by using their doc path
if isinstance(field, Model):
    # Ensure that the underlying model has an id
    # In place update to db_model for this field
    print(type(field))
    print("PRE GENERATE FIELD ID " + str(field.id))
    new_field = generate_and_attach_doc_hash_as_id(field)
    setattr(db_model, pk, new_field)
    print(type(field))
    print("GENERATED FIELD ID " + str(new_field.id))
Running the event gather pipeline (only relevant logs attached):
# Body hash matches the existing model's id
<class 'cdp_backend.database.models.Body'>
PRE GENERATE FIELD ID 4346a8351006
<class 'cdp_backend.database.models.Body'>
<class 'cdp_backend.database.models.Body'>
GENERATED FIELD ID 4346a8351006
# session hash doesn't
<class 'cdp_backend.database.models.Session'>
PRE GENERATE FIELD ID f1356f6d3179
<class 'cdp_backend.database.models.Session'>
<class 'cdp_backend.database.models.Session'>
GENERATED FIELD ID 541ac1cd31e3
# Session hash doesn't
<class 'cdp_backend.database.models.Session'>
PRE GENERATE FIELD ID cb32d7e83781
<class 'cdp_backend.database.models.Session'>
<class 'cdp_backend.database.models.Session'>
GENERATED FIELD ID 0b766156eabf
Your print session hash script (2 valid, separate sessions exist for this one event):
print_session_hashes("2d79b33f34fc")
# cb32d7e83781
# 0b766156eabf
# f1356f6d3179
# 541ac1cd31e3
Looking in my database, however, the session ids were `cb32d7e83781` and `f1356f6d3179`. So I think there's something generally wrong with how we generate hashes, at least when it comes to `Session`. Other db models matched the "pre generated field" and the "generated field". I wonder if the order in which we upload in `store_event_processing_results` could have some odd effects?
> But I think the core of the issue confusion related to why the session hashes aren't matching in the script we've been running is that the recursively generated hash is not the id that's being attached to session. This seems to happen only for the Session db model.
I quoted the above, but basically everything you said makes sense to me. The thing that confuses me is how we could possibly be hashing incorrectly. `Session`'s PKs are `event_ref` and `video_uri`. `video_uri` is easy; `event_ref` hashes itself and uses the generated `id` as the hash. To generate an id / hash for `event_ref`, we look to its PKs, which are `body_ref` and `event_datetime`. `event_datetime` is easy; `body_ref` does the same thing, and recursively it goes to the PK `name` on the body. We know that the event hash and the body hash worked perfectly fine, so how did we get a different hash for the session....
I have this weird feeling that maybe we were hitting some other type besides either a `Model` or a `ReferenceDocLoader` and it was hitting the final else? But your print statements don't match up with that.
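The recursion described above can be sketched with a toy hasher (tuples standing in for referenced models; names and values are illustrative only, not cdp-backend code), which shows the intended property: the id is deterministic, and a change to any nested PK changes every hash above it.

```python
import pickle
from hashlib import sha256

def doc_hash(primary_values):
    """Toy version of the PK recursion: a tuple stands in for a
    referenced model, whose "id" is its own doc_hash."""
    hasher = sha256()
    for value in primary_values:
        if isinstance(value, tuple):
            value = doc_hash(value)  # hash the reference by its generated id
        hasher.update(pickle.dumps(value, protocol=4))
    return hasher.hexdigest()[:12]

body = ("Full Council",)                     # Body PK: name
event = (body, "2021-08-16 07:42:10+00:00")  # Event PKs: body_ref, event_datetime
session = (event, "assets/sample-5s.mp4")    # Session PKs: event_ref, video_uri

# Deterministic: same inputs always give the same 12-char id
assert doc_hash(session) == doc_hash(session)
# Any change to a nested PK changes the hashes above it
assert doc_hash((event, "assets/other.mp4")) != doc_hash(session)
```

Given that property, a stable event hash plus a stable video_uri should always give a stable session hash, which is why a drifting session id points at one of the PK inputs (like the datetime) not being stable.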
> I have this weird feeling that maybe we were hitting some other type besides either a Model or a ReferenceDocLoader and it was hitting the final else? But your print statements don't match up with that.
Yeah, I didn't include all the print statements, just the part that shows the hash not matching.
Full logs can be found here: https://pastebin.com/w9bET2YV
I'll need to dig deeper into the hash generation to see what's actually causing the wrong hash to be produced.
This is a great find. Well done.
This is somewhat expected; the difference is the timezone handling, and it's a known issue. Timezone handling will always be the bane of programming.
I would say we simply add another if statement to our hashing function that checks whether the value is an instance of datetime or something, and converts it to an ISO string without the milliseconds and the timezone. Thoughts?
> I would say we simply add another if statement to our hashing function that checks for if the value is instance of datetime or something and converts it to an ISO string just without the milliseconds and the timezone. Thoughts?
Yep, I agree with this. We could also convert it to epoch millis, so we can be as granular as milliseconds but not have to consider timezone.
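A minimal sketch of the ISO-string idea (the helper name and placement are assumptions, not the actual cdp-backend patch): normalize datetimes to UTC and strip the sub-second and timezone parts before pickling, so equal instants always produce equal hashes.

```python
import pickle
from datetime import datetime, timedelta, timezone
from hashlib import sha256

def normalize_for_hash(value):
    """Hypothetical helper: render datetimes timezone- and
    microsecond-free before hashing, per the proposal above."""
    if isinstance(value, datetime):
        # Convert to UTC, drop microseconds and tzinfo, then ISO-format
        return value.astimezone(timezone.utc).replace(
            microsecond=0, tzinfo=None
        ).isoformat()
    return value

# The hasher branch would then become:
#   hasher.update(pickle.dumps(normalize_for_hash(field), protocol=4))

# The same instant in two timezones now hashes identically
a = datetime(2021, 8, 16, 7, 42, 10, 318957, tzinfo=timezone.utc)
b = a.astimezone(timezone(timedelta(hours=-5)))
ha = sha256(pickle.dumps(normalize_for_hash(a), protocol=4)).hexdigest()[:12]
hb = sha256(pickle.dumps(normalize_for_hash(b), protocol=4)).hexdigest()[:12]
assert ha == hb
```

The epoch-millis variant would return something like `int(value.timestamp() * 1000)` instead of the ISO string; either way the key point is normalizing to a single timezone-independent representation before hashing.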
I'm working on the fix described above; will put out a PR soon.