Comments (8)
This could be a Swarm topic to discuss as a team as well!
from avalon.
Currently, Ramp only loads transcript text when the user views the transcript. So if we implement search-within in Ramp for all transcripts in a section, Ramp will have to pre-load all of those transcripts.
IIIF content search would likely have an advantage over in-browser JavaScript search if we want to search all transcripts / annotations across all sections. We would not want to pre-load all that content just to support a client-side search across it.
IIIF content search also provides reusable functionality in Ramp that could be used by other implementers. However, it may make sense to implement a first pass using a simple Javascript solution and then re-work later when the use case arises.
from avalon.
Our transcript annotations point to an external file. What does the granularity of results look like? If we just return the entire transcript annotation as a result, we don't know anything about how many hits are within the transcript. All of our supplementing annotations are external files. Do the annotations returned as search results have to be symmetrical with annotations in the original manifest, or can we return annotations that are snippets within the original transcript?
Paging is optional, even if you have long lists of search results.
Could return search results as IIIF collection, and could utilize search API to search within them, possibly even to add the hit counts.
Key question is granularity of the search results. Do we return the entire transcript annotation, or do we break it out by cue times? Some transcripts are textual and do not have cues. Could TextQuoteSelector be used to point to chars, rather than just text before / after? Looks like this is allowed by TextPositionSelector; is this allowed by IIIF?
Response can include annotations that reference different canvases as targets, and point to the original annotation, allowing referencing to different originating annotations.
IIIF content search explicitly searches annotations, but not metadata.
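To make the selector question concrete, here is a minimal sketch of how a hit inside a cue-less text transcript could be described with both selector types from the W3C Web Annotation Data Model. The function name `make_selectors` and the `context` parameter are hypothetical, and whether IIIF clients would accept a TextPositionSelector in search results is still the open question above.

```python
import re


def make_selectors(text: str, term: str, context: int = 20) -> list[dict]:
    """Return one selector pair per case-insensitive occurrence of `term`.

    `make_selectors` and `context` are illustrative names, not part of
    Ramp, Avalon, or any IIIF library.
    """
    results = []
    for match in re.finditer(re.escape(term), text, re.IGNORECASE):
        start, end = match.span()
        results.append({
            # Points at the hit by quoting it plus surrounding text.
            "quote": {
                "type": "TextQuoteSelector",
                "exact": text[start:end],
                "prefix": text[max(0, start - context):start],
                "suffix": text[end:end + context],
            },
            # Points at the hit by character offsets (end is exclusive).
            "position": {
                "type": "TextPositionSelector",
                "start": start,
                "end": end,
            },
        })
    return results
```

Returning one selector pair per occurrence would also give a hit count for a transcript that is otherwise a single annotation.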
from avalon.
If we are searching all transcripts across all canvases, then it probably makes more sense to do it as IIIF content search rather than loading all those transcripts into memory.
For our first pass, we intended to implement searching across all transcripts for a single canvas, which might be easier to do client-side in memory rather than implementing IIIF content search.
For Ramp users without a IIIF content search service already in place, a JS based solution would be faster/easier. For users wanting to plug Ramp into existing IIIF infrastructure who already have IIIF content search implemented, they'd prefer that solution.
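As a rough illustration of the first-pass idea (all transcripts for a single canvas, searched in memory), the logic could look something like the sketch below. It is written in Python only to show the shape of the approach; Ramp itself would do this in JavaScript, and `Cue`, `search_canvas_transcripts`, and `hit_count` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # cue start time in seconds
    end: float    # cue end time in seconds
    text: str


def search_canvas_transcripts(transcripts: dict, query: str) -> dict:
    """Return {transcript name: [(cue, occurrences), ...]} for matching cues.

    `transcripts` maps a transcript name to its list of already-loaded cues.
    """
    if not query.strip():
        return {}
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    results = {}
    for name, cues in transcripts.items():
        matches = []
        for cue in cues:
            occurrences = len(pattern.findall(cue.text))
            if occurrences:
                matches.append((cue, occurrences))
        if matches:
            results[name] = matches
    return results


def hit_count(results: dict) -> int:
    """Total number of term occurrences across all matching cues."""
    return sum(n for matches in results.values() for _, n in matches)
```

This only works if every transcript for the canvas has been loaded up front, which is the pre-loading cost discussed earlier in the thread.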
from avalon.
IIIF content search examples:
- Example IIIF Content Search in Digital Collections (digitalcollections.iu.edu): Search for "Broderick", then facet on Paged Resource and go to the last result, "Notebook, January 23, 1955-May 9, 1955". Note that it doesn't appear to have the search term in any metadata, but it does automatically set the search in the UV to "Broderick" and it does find a hit. If you search multiple terms, the hit count seems to be the sum of the hits on each individual term. The search passes through from Blacklight to UV on the item. The search service appears to be version 0; here's a URL for this specific search: https://digitalcollections.iu.edu/catalog/rr171z590/iiif_search?q=Broderick
- Another example IIIF search result: https://miiify.rocks/iiif/content/search?q=london and manifest: https://miiify.rocks/manifest/diamond_jubilee_of_the_metro. This is search v.2. The example search looks like it is searching across multiple manifests, based on the target of the annotations. The granularity of text returned in the annotation varies: sometimes a single term and sometimes a paragraph.
- Here’s an Archipelago object that has content search enabled in Mirador. See the search button when you toggle the sidebar menu open on the left. https://studio.esmero.io/do/c33deb5f-97b2-4529-b823-55e5ed04ccdc. Here’s a v2 search URL (looks like they’re using v1 on the site but also support v2): https://studio.esmero.io/iiifcontentsearch/v2/do/c33deb5f-97b2-4529-b823-55e5ed04ccdc/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0?q=kitchen+family. It seems to be returning annotations that match either term or both terms. Their content is OCR, and it looks like the annotation responses are chunked at the paragraph level and the target is an x,y location on the canvas where the word actually appears.
- Here is a BL example: https://bl.iro.bl.uk/uv/uv.html#?manifest=https://bl.iro.bl.uk/concern/reports/400db5b7-ed6d-4bfb-898d-1f0bcd8a9d6e/manifest&config=https://bl.iro.bl.uk/uv/uv-config.json. Very slow.
- Princeton example: https://dpul.princeton.edu/sae/catalog/12263a4c-27ab-421c-b35f-b93ec0271a62. Princeton manifest with search v.0 service: https://figgy.princeton.edu/concern/ephemera_folders/12263a4c-27ab-421c-b35f-b93ec0271a62/manifest?manifest=https://figgy.princeton.edu/concern/ephemera_folders/12263a4c-27ab-421c-b35f-b93ec0271a62/manifest
- Yale example: https://collections.library.yale.edu/catalog/16807673. Looks like Yale is using search v.1 in their manifest: https://collections.library.yale.edu/manifests/16807673?manifest=https://collections.library.yale.edu/manifests/16807673. Here's an example of search results: https://collections.library.yale.edu/catalog/16807673/iiif_search?q=law+library. Looks like they may be using OCR divided up into about 1 line of text for annotation granularity. They are linking to each page with results, but not highlighting the search term on the page, which makes sense as the results are not providing an x,y location to target.
from avalon.
IIIF content search pros
- re-use same index required for search across the repository
- allow search across multiple sections and multiple transcripts/annotations for each section
- compliant with IIIF and allows institutions consuming Avalon manifests in other contexts to search within annotations
- allows Ramp re-use in repositories that already have v.2 search service
- highlighting info returned by search service may simplify front-end JS
- easier to start with a spec implementation and then customize than to start with a custom implementation and then make it comply with the spec
- If we plan to incorporate IIIF content search at some point, best to adhere to spec from beginning.
IIIF content search cons
- latency / slower when searches are submitted
- more complex to implement - requires implementation in both Ramp and Avalon
- unclear whether the IIIF content search response really meets our needs. Ex: hit counts for each section. If all we want are hit counts per section, retrieving all annotations just to get a hit count is probably more expensive.
- Hits in structure or metadata don't match the format of the IIIF content search response. Search of metadata is explicitly out of scope. If we want to search those, we might need to implement that separately.
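To illustrate the hit-count concern in the cons above: with a standard response, per-section counts would have to be derived by retrieving every annotation and grouping by target. A minimal sketch, assuming each result's `target` is either a URI string or an object with a `source` or `id` (the helper names are hypothetical):

```python
from collections import Counter
from urllib.parse import urldefrag


def target_uri(target):
    """Return the URI of a target given as a string or a nested object."""
    if isinstance(target, str):
        return target
    if isinstance(target, dict):
        return target_uri(target.get("source", target.get("id")))
    return None


def hits_per_target(annotation_page: dict) -> Counter:
    """Count result annotations per target, ignoring #t=... fragments."""
    counts = Counter()
    for annotation in annotation_page.get("items", []):
        uri = target_uri(annotation.get("target"))
        if uri:
            counts[urldefrag(uri).url] += 1
    return counts
```

If results target the transcript file rather than the canvas, an extra mapping from transcript URI to section would be needed before the counts are useful per section.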
Javascript search within pros
- in-memory storage of info will eliminate latency of searching IIIF content search over the network
- May be able to build upon prior art from Third Wave
- would provide transcript search feature for institutions that haven't implemented IIIF content search in their repository but are using Ramp
- Highlighting needed to return search within results from Solr requires us to store rather than just index the transcript text; unclear what impact that might have on response times.
Javascript search within cons
- May be challenging to store large files in memory for searching on mobile devices
- Using a JS library for indexing/searching introduces a dependency that may cause issues across platforms, especially mobile
- Reliant on an external JS library that we would need to keep up to date
- If we have to wait for code from Third Wave, that could delay our timelines. If we start from scratch, we are duplicating their work.
Third Wave reported their search service could use local search or call out to a search API. Could we enable this?
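We have not seen how Third Wave's service is structured, so the following is only a sketch of the general idea of one search interface with a local and a remote option. Every name here (`SearchBackend`, `LocalSearch`, `RemoteSearch`, `make_backend`, and the injected `fetch` callable) is hypothetical, and it assumes the service URL has no existing query string.

```python
from typing import Callable, Optional, Protocol
from urllib.parse import urlencode


class SearchBackend(Protocol):
    def search(self, query: str) -> list: ...


class LocalSearch:
    """Searches text that has already been loaded into memory."""

    def __init__(self, texts: list):
        self.texts = texts

    def search(self, query: str) -> list:
        needle = query.lower()
        return [text for text in self.texts if needle in text.lower()]


class RemoteSearch:
    """Calls out to a content search service via an injected fetch function."""

    def __init__(self, service_url: str, fetch: Callable[[str], dict]):
        self.service_url = service_url
        self.fetch = fetch  # assumed to return the parsed JSON response

    def search(self, query: str) -> list:
        page = self.fetch(f"{self.service_url}?{urlencode({'q': query})}")
        values = []
        for item in page.get("items", []):
            body = item.get("body")
            if isinstance(body, dict) and "value" in body:
                values.append(body["value"])
        return values


def make_backend(service_url: Optional[str], texts: list,
                 fetch: Callable[[str], dict]) -> SearchBackend:
    """Use the remote service when one is configured, else search locally."""
    if service_url:
        return RemoteSearch(service_url, fetch)
    return LocalSearch(texts)
```

With this shape, Ramp users without a content search service would get the local path, and sites with an existing service could point Ramp at it.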
from avalon.
Examples from discussion:
{
  "@context": "http://iiif.io/api/search/2/context.json",
  "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/search?q=April",
  "type": "AnnotationPage",
  "items": [
    // This item highlights the transcript by returning the whole cue
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE: All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/master_files/ft848q60n/supplemental_files/2/transcripts#t=00:00:00,00:00:16"
    },
    // This item highlights the canvas like a marker in a playlist item
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE: All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n#t=00:00:00,00:00:16"
    },
    // This item highlights in a txt transcript by returning the whole paragraph
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE: All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/master_files/ft848q60n/supplemental_files/2/transcripts"
    },
    // Further 'April' annotations here ...
  ]
}
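Whichever target we settle on (transcript file or canvas), the client has to split the target into a resource URI and a time range before it can highlight or seek. A minimal sketch, assuming `#t=start,end` fragments in either `hh:mm:ss` or plain-seconds form; `parse_clock` and `split_target` are hypothetical helper names, and an `npt:` prefix is not handled:

```python
from urllib.parse import urldefrag


def parse_clock(value: str) -> float:
    """Convert 'hh:mm:ss', 'mm:ss', or plain seconds to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def split_target(target: str):
    """Return (uri, start_seconds, end_seconds) for a search result target.

    start and end are None when the target has no temporal fragment, as
    with a plain-text transcript that has no cues.
    """
    uri, fragment = urldefrag(target)
    if not fragment.startswith("t="):
        return uri, None, None
    start_text, _, end_text = fragment[2:].partition(",")
    start = parse_clock(start_text) if start_text else 0.0
    end = parse_clock(end_text) if end_text else None
    return uri, start, end
```

Comparing the returned URI against the canvas id versus the transcript URL would tell the client whether to treat the hit as a marker on the canvas or as a highlight inside the transcript.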
from avalon.
Here is an example of IIIF content search of vtt annotations for a media file in Archipelago. They have a similar question about targeting the VTT vs. the canvas, and have experimented with both. They are using Mirador for their front-end.
I set up a small Video object demo for you here: https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6. It is my very own media, so there are no copyright issues. The two VTTs can be enabled in the viewer or downloaded directly on the Digital Object page (see the download tab).
For this object we are using Mirador V4 alpha 2 so you can use the interface to search (Mirador will only hit V1), but the results won't interact with the media at all (my open question).
The IIIF manifest V3 is dynamic like all of ours and can be seen at https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadata/iiifmanifest/Train%20Departure_manifest.jsonld so you can see the source of what is searchable
The direct endpoints for the IIIF Content Search are: (try "train", "dark", etc... the Vtts can be downloaded so that should be ok)
V1 https://studio.esmero.io/iiifcontentsearch/v1/do/99161a75-43d8-42ee-8f18-e8d1855640b6[…]datadisplayexposed/iiifmanifest/mode/advanced/page/0?q=train
V2 https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6[…]datadisplayexposed/iiifmanifest/mode/advanced/page/0?q=train
Note: I disabled a few parts of the "standards/specs" in these content search responses to make them simpler to consume. You are not getting "number of results" etc. Also, we are not using the extra "annotations" that could supplement (with a before/after text snippet), but you can ask for it if you need it and I will enable that.
Note 2: the output of the API is targeting the VTTs themselves. If you want to see a target against the canvas, let me know and I will turn the switch so you can compare outputs (basically a different target and a different motivation on the response).
{
  "@context": "http://iiif.io/api/search/2/context.json",
  "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0",
  "type": "AnnotationPage",
  "items": [
    {
      "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0/annotation/anno-result/1",
      "type": "Annotation",
      "motivation": "supplementing",
      "body": {
        "type": "TextualBody",
        "value": " - [Sounds of train over tracks.]",
        "format": "text/plain"
      },
      "target": "https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/iiif/subtitles/p1/3eff2938-151c-4bc2-be05-53267c0ec31b#t=0,8"
    },
    {
      "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0/annotation/anno-result/2",
      "type": "Annotation",
      "motivation": "supplementing",
      "body": {
        "type": "TextualBody",
        "value": " - [Sounds of train passing and sounds of train over tracks.]",
        "format": "text/plain"
      },
      "target": "https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/iiif/subtitles/p1/3eff2938-151c-4bc2-be05-53267c0ec31b#t=8,13"
    },
    ....
from avalon.