
Comments (8)

joncameron commented on July 17, 2024

This could be a Swarm topic to discuss as a team as well!


elynema commented on July 17, 2024

Currently, Ramp only loads transcript text when the user views the transcript. So if we implement search within via Ramp for all transcripts in a section, we'll have to have Ramp pre-load all the transcripts.

IIIF content search would likely provide an advantage over Javascript search in-browser if we want to search across all transcripts / annotations across all sections. We would not necessarily want to have to pre-load all that content to allow for a client-side search that searched across it all.

IIIF content search also provides reusable functionality in Ramp that could be used by other implementers. However, it may make sense to implement a first pass using a simple Javascript solution and then re-work later when the use case arises.
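
As a very rough sketch of the client-side option (hypothetical shape and names, not Ramp's actual API), assuming all cues for a section's transcripts have already been fetched and parsed:

// Minimal sketch of client-side search across pre-loaded transcripts.
// Assumes each transcript has been parsed into cues of { text, begin, end };
// these property names are illustrative, not Ramp's API.
function searchTranscripts(transcripts, query) {
  const escaped = query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(escaped, 'gi');
  const hits = [];
  transcripts.forEach((transcript) => {
    transcript.cues.forEach((cue) => {
      const matches = cue.text.match(pattern);
      if (matches) {
        hits.push({
          transcriptId: transcript.id,
          begin: cue.begin,
          end: cue.end,
          count: matches.length,
          // Wrap each match for highlighting in the UI.
          highlighted: cue.text.replace(pattern, (m) => `<em>${m}</em>`),
        });
      }
    });
  });
  return hits;
}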


elynema commented on July 17, 2024

Our transcript annotations point to an external file. What does the granularity of results look like? If we just return the entire transcript annotation as a result, we don't know anything about how many hits are within the transcript. All of our supplementing annotations are external files. Do the annotations that are returned as search results have to be symmetrical with the annotations in the original manifest, or can we return annotations that are snippets of the original transcript?

Paging is optional, even if you have long lists of search results.

Could return search results as IIIF collection, and could utilize search API to search within them, possibly even to add the hit counts.

The key question is the granularity of the search results. Do we return the entire transcript annotation, or do we break it out by cue times? Some transcripts are plain text and do not have cues. Could a selector be used to point to character positions, rather than just the text before/after a match as TextQuoteSelector does? That looks like what TextPositionSelector allows; is it allowed by IIIF?
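
For reference, the two W3C Web Annotation selectors being compared look roughly like this (values are purely illustrative; whether a IIIF content search response may carry a TextPositionSelector is exactly the open question):

// TextQuoteSelector identifies a hit by the matched text plus surrounding text.
const quoteSelector = {
  type: 'TextQuoteSelector',
  prefix: 'text immediately before the match ',
  exact: 'matched term',
  suffix: ' text immediately after the match',
};

// TextPositionSelector identifies a hit by character offsets into the text
// (offsets here are made up for illustration).
const positionSelector = {
  type: 'TextPositionSelector',
  start: 112,
  end: 124,
};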

The response can include annotations that reference different canvases as targets and point back to the original annotation, allowing results to reference different originating annotations.

IIIF content search explicitly searches annotations, but not metadata.


elynema commented on July 17, 2024

If we are searching all transcripts across all canvases, then it probably makes more sense to do it as IIIF content search rather than loading all those transcripts into memory.

For our first pass, we intended to implement searching across all transcripts for a single canvas, which might be easier to do client-side in memory rather than implementing IIIF content search.

For Ramp users without a IIIF content search service already in place, a JS-based solution would be faster and easier to adopt. Users wanting to plug Ramp into existing IIIF infrastructure that already has content search implemented would prefer the IIIF solution.
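
As a sketch of how the two approaches could coexist, assuming the manifest declares a Search API 2.0 service as a SearchService2 entry in its service array, and re-using the hypothetical searchTranscripts() helper sketched above:

// Prefer a declared IIIF content search service; otherwise fall back to
// searching pre-loaded transcripts in memory. Names are illustrative.
async function searchWithin(manifest, transcripts, query) {
  const services = manifest.service ?? [];
  const searchService = services.find((s) => s.type === 'SearchService2');
  if (searchService) {
    const res = await fetch(`${searchService.id}?q=${encodeURIComponent(query)}`);
    const annotationPage = await res.json();
    return annotationPage.items ?? [];
  }
  // No content search service declared: search client-side in memory.
  return searchTranscripts(transcripts, query);
}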


elynema commented on July 17, 2024

IIIF content search examples:

  1. Example IIIF Content Search in Digital Collections (digitalcollection.iu.edu)
    Search for "Broderick", then facet on Paged Resource and go to the last result, Notebook, January 23, 1955-May 9, 1955. Note it doesn't appear to have the search term in any metadata, but it does automatically set the search in the UV to "Broderick" and it does find a hit. If you search multiple terms, the hit count seems to be a summation of the hits on each individual term. The search passes through from Blacklight to UV on the item.
    The search service appears to be version 0 and here's a url to this specific search: https://digitalcollections.iu.edu/catalog/rr171z590/iiif_search?q=Broderick

  2. Another example IIIF search result: https://miiify.rocks/iiif/content/search?q=london and manifest: https://miiify.rocks/manifest/diamond_jubilee_of_the_metro. This is search v.2. The example search looks like it is searching across multiple manifests, based on the target of the annotations. The granularity of text returned in the annotation varies: sometimes a single term and sometimes a paragraph.

  3. Here’s an Archipelago object that has content search enabled in Mirador. See the search button when you toggle the sidebar menu open on the left. https://studio.esmero.io/do/c33deb5f-97b2-4529-b823-55e5ed04ccdc. Here’s a v2 search URL (looks like they’re using v1 on the site but also support v2): https://studio.esmero.io/iiifcontentsearch/v2/do/c33deb5f-97b2-4529-b823-55e5ed04ccdc/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0?q=kitchen+family. It seems to be returning annotations that match either term or both terms. Their content is OCR, and it looks like the annotation responses are chunked into paragraph level and the target is an x,y location on the canvas where the word actually appears.

  4. Here is a BL example: https://bl.iro.bl.uk/uv/uv.html#?manifest=https://bl.iro.bl.uk/concern/reports/400db5b7-ed6d-4bfb-898d-1f0bcd8a9d6e/manifest&config=https://bl.iro.bl.uk/uv/uv-config.json. Very slow.

  5. Princeton example: https://dpul.princeton.edu/sae/catalog/12263a4c-27ab-421c-b35f-b93ec0271a62. Princeton manifest with search v.0 service: https://figgy.princeton.edu/concern/ephemera_folders/12263a4c-27ab-421c-b35f-b93ec0271a62/manifest?manifest=https://figgy.princeton.edu/concern/ephemera_folders/12263a4c-27ab-421c-b35f-b93ec0271a62/manifest

  6. Yale example: https://collections.library.yale.edu/catalog/16807673. Looks like Yale is using search v.1 in their manifest: https://collections.library.yale.edu/manifests/16807673?manifest=https://collections.library.yale.edu/manifests/16807673. Here's an example of search results: https://collections.library.yale.edu/catalog/16807673/iiif_search?q=law+library. Looks like they may be using OCR divided into about one line of text per annotation for granularity. They are linking to each page with results, but not highlighting the search term on the page, which makes sense as the results do not provide an x,y location to target.


elynema commented on July 17, 2024

IIIF content search pros

  • re-use same index required for search across the repository
  • allow search across multiple sections and multiple transcripts/annotations for each section
  • compliant with IIIF and allows institutions consuming Avalon manifests in other contexts to have access to search within annotations
  • allows Ramp re-use in repositories that already have v.2 search service
  • highlighting info returned by search service may simplify front-end JS
  • easier to start with a spec implementation and then customize than to start with a custom implementation and later make it comply with the spec
  • If we plan to incorporate IIIF content search at some point, best to adhere to spec from beginning.

IIIF content search cons

  • latency / slower when searches are submitted
  • more complex to implement - requires implementation in both Ramp and Avalon
  • unclear whether the IIIF content search response really meets our needs. Ex: hit counts for each section. If all we want are hit counts per section, retrieving all annotations just to get a hit count is probably more expensive (see the sketch after this list).
  • Hits in structure or metadata don't match the format of the IIIF content search response. Search of metadata is explicitly out of scope. If we want to search those, we might need to implement that separately.
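
A sketch of the hit-count concern above, assuming each section (canvas) exposes its own search service and that the response's optional partOf carries a total; otherwise the client has to count (and potentially page through) the returned items itself:

// Per-section hit counts via a content search service. The total is optional
// in the spec (surfaced through partOf on the returned AnnotationPage), so
// fall back to counting the items actually returned.
async function hitCountForSection(searchServiceId, query) {
  const res = await fetch(`${searchServiceId}?q=${encodeURIComponent(query)}`);
  const page = await res.json();
  if (page.partOf && typeof page.partOf.total === 'number') {
    return page.partOf.total;
  }
  return (page.items ?? []).length;
}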

Javascript search within pros

  • in-memory storage of transcript info would eliminate the latency of querying an IIIF content search service over the network
  • May be able to build upon prior art from Third Wave
  • would provide transcript search feature for institutions that haven't implemented IIIF content search in their repository but are using Ramp
  • Highlighting for search-within results returned from Solr would require us to store, rather than just index, the transcript text; it's unclear what impact that might have on response times.

Javascript search within cons

  • May be challenging to store large files in memory for searching on mobile devices
  • Using a JS library for indexing/searching introduces a dependency that may cause issues across platforms, especially mobile
  • Reliance on an external JS library dependency
  • If we have to wait for code from Third Wave, that could delay our timelines. If we start from scratch, we are duplicating their work.

Third Wave reported their search service could use local search or call out to a search API. Could we enable this?


cjcolvar commented on July 17, 2024

Examples from discussion:

{
  "@context": "http://iiif.io/api/search/2/context.json",
  "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/search?q=April",
  "type": "AnnotationPage",

  "items": [
    // This item highlights the transcript by returning the whole cue
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE:  All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/master_files/ft848q60n/supplemental_files/2/transcripts#t=00:00:00,00:00:16"
    },
    // This item highlights the canvas like a marker in a playlist item
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE:  All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n#t=00:00:00,00:00:16"
    },
    // This item highlights in a txt transcript by returning the whole paragraph
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE:  All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/master_files/ft848q60n/supplemental_files/2/transcripts"
    },
    // Further 'April' annotations here ...
  ]
}
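
If Ramp consumes a response like this, the time range it needs comes from the media fragment on the target (e.g. #t=00:00:00,00:00:16). A rough parser (hypothetical helper; handles both hh:mm:ss and plain-seconds forms):

// Extract the start/end times from a search result target so the player can
// seek to the hit, e.g. "...#t=00:00:00,00:00:16" => { start: 0, end: 16 }.
function parseTimeFragment(target) {
  const match = target.match(/#t=([^,]+),(.+)$/);
  if (!match) return null;
  const toSeconds = (value) =>
    value.split(':').map(Number).reduce((total, part) => total * 60 + part, 0);
  return { start: toSeconds(match[1]), end: toSeconds(match[2]) };
}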


elynema commented on July 17, 2024

Here is an example of IIIF content search of VTT annotations for a media file in Archipelago. They have a similar question about targeting the VTT vs. the canvas, and have experimented with both. They are using Mirador for their front-end.

I set up a small Video object demo for you here https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6. It is my very own media so no copyright issues; the two VTTs can be enabled in the viewer or downloaded directly on the Digital Object page (see the download tab).
For this object we are using Mirador V4 alpha 2, so you can use the interface to search (Mirador will only hit V1), but the results won't interact with the media at all (my open question).
The IIIF manifest V3 is dynamic like all of ours and can be seen at https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadata/iiifmanifest/Train%20Departure_manifest.jsonld so you can see the source of what is searchable.
The direct endpoints for the IIIF Content Search are below (try "train", "dark", etc.; the VTTs can be downloaded so that should be ok):
V1 https://studio.esmero.io/iiifcontentsearch/v1/do/99161a75-43d8-42ee-8f18-e8d1855640b6[…]datadisplayexposed/iiifmanifest/mode/advanced/page/0?q=train
V2 https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6[…]datadisplayexposed/iiifmanifest/mode/advanced/page/0?q=train
Note: I disabled a few parts of the "standards/specs" in these content search responses to make them simpler to consume. You are not getting "number of results" etc. Also we are not using the extra "annotations" that could supplement (with a before/after text snippet), but you can ask for it if you need it and I'll enable that.
Note 2: the output of the API is targeting the VTTs themselves. If you want to see a target against the canvas, let me know and I'll turn the switch so you can compare outputs (basically a different target and a different motivation on the response).

{
  "@context": "http://iiif.io/api/search/2/context.json",
  "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0",
  "type": "AnnotationPage",
  "items": [
    {
      "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0/annotation/anno-result/1",
      "type": "Annotation",
      "motivation": "supplementing",
      "body": {
        "type": "TextualBody",
        "value": " - [Sounds of train over tracks.]",
        "format": "text/plain"
      },
      "target": "https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/iiif/subtitles/p1/3eff2938-151c-4bc2-be05-53267c0ec31b#t=0,8"
    },
    {
      "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0/annotation/anno-result/2",
      "type": "Annotation",
      "motivation": "supplementing",
      "body": {
        "type": "TextualBody",
        "value": " - [Sounds of train passing and sounds of train over tracks.]",
        "format": "text/plain"
      },
      "target": "https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/iiif/subtitles/p1/3eff2938-151c-4bc2-be05-53267c0ec31b#t=8,13"
    },

    ....
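
Since these results target the VTT files rather than canvases (see Note 2 above), a client that needs a canvas to seek against could map the target's file URL back to the canvas whose supplementing annotations reference it. A sketch, assuming a Presentation 3 manifest with inlined annotation pages (real manifests may reference external pages instead):

// Map a VTT-file target back to the canvas that references it by scanning each
// canvas's annotation pages for an annotation whose body id matches the file URL.
function canvasForVttTarget(manifest, target) {
  const fileUrl = target.split('#')[0];
  for (const canvas of manifest.items ?? []) {
    for (const page of canvas.annotations ?? []) {
      for (const anno of page.items ?? []) {
        const bodies = Array.isArray(anno.body) ? anno.body : [anno.body];
        if (bodies.some((body) => body && body.id === fileUrl)) {
          return canvas.id;
        }
      }
    }
  }
  return null;
}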

