
Comments (8)

bmquinn avatar bmquinn commented on July 17, 2024

@chrisdaaz Looking over the XML DTD (can be accessed at https://secure.etdadmin.com/dtds/etd.dtd) the element DISS_access_option doesn't specify the options (unlike embargo code, e.g. embargo_code (0 | 1 | 2 | 3 | 4) "0"):

<!--
  This element contains the text of the selected access option.
  For example "Open access", "Campus use only", etc.
-->
<!ELEMENT DISS_access_option (#PCDATA)>

Is it possible that there is only a limited number of actual values used for DISS_access_option that we could use to set visibility? If so, a mapping would help, i.e.

Open Access -> open
Campus use only -> authenticated
??? -> restricted
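
One way to check which values actually occur would be to tally DISS_access_option across the ProQuest XML files. This is only a sketch: the `tally_access_options` helper and the sample documents are made up for illustration, and only the DISS_access_option element name comes from the DTD.

```ruby
require "rexml/document"

# Hypothetical helper: count how often each DISS_access_option value occurs.
# Missing or empty elements are counted under "(blank)".
def tally_access_options(xml_strings)
  xml_strings.each_with_object(Hash.new(0)) do |xml, counts|
    doc = REXML::Document.new(xml)
    element = REXML::XPath.first(doc, "//DISS_access_option")
    value = element ? element.text.to_s.strip : ""
    counts[value.empty? ? "(blank)" : value] += 1
  end
end

# Made-up sample documents, not real ProQuest output.
samples = [
  "<DISS_submission><DISS_access_option>Open access</DISS_access_option></DISS_submission>",
  "<DISS_submission><DISS_access_option>Campus use only</DISS_access_option></DISS_submission>",
  "<DISS_submission><DISS_access_option/></DISS_submission>"
]

tally_access_options(samples).each { |value, count| puts "#{value}: #{count}" }
```

If the tally shows only a handful of distinct values, a fixed mapping to visibility is probably safe.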

from arch.

davidschober avatar davidschober commented on July 17, 2024

Hey @chrisdaaz, we have code written for the change for new dissertations.
We're trying to figure out how many of these it may have affected.

We don't have the source files; they lifecycle out. Can you download them? We think we may be able to do some fancy grepping to figure out what we're dealing with.

chrisdaaz avatar chrisdaaz commented on July 17, 2024

@bmquinn @davidschober

I ran a report and found 771 "Campus use only" dissertations that are likely in Arch. Here's the report.

When you look at the report, the first column ID refers to a value we put into "Alternative Identifier". For example, a dissertation with an ID of 15484 would map to http://dissertations.umi.com/northwestern:15594 in the Arch record.

From the report, I can tell that there are two options available for DISS_access_option:

Open Access -> open
Campus use only -> authenticated

There are also blanks which would mean not applicable -- do nothing.
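
The mapping could be expressed as a small lookup. A minimal sketch, assuming case-insensitive matching is acceptable (the thread spells the value both "Open access" and "Open Access"); the constant and the `visibility_for` helper are hypothetical names.

```ruby
# Mapping taken from the two values seen in the report. Anything else,
# including blanks, returns nil, meaning "do nothing".
ACCESS_OPTION_TO_VISIBILITY = {
  "open access" => "open",
  "campus use only" => "authenticated"
}.freeze

# Normalizes whitespace and case before the lookup (an assumption, see above).
def visibility_for(access_option)
  ACCESS_OPTION_TO_VISIBILITY[access_option.to_s.strip.downcase]
end

puts visibility_for("Campus use only").inspect
puts visibility_for("").inspect
```

Returning nil for unknown values keeps the script from touching records whose access option is blank or unexpected.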

Another thing: all dissertations added before the batch ingest feature was available will not have that "Alternative Identifier", so the ID field in the report won't help us. Can we match by Title?

bmquinn avatar bmquinn commented on July 17, 2024

Ok @chrisdaaz I wrote a script to generate a new CSV to determine if we can use titles to find all the dissertations. Here's the script I ran (for future reference if needed):

require "aws-sdk-s3"
require "csv"

# Run in a Rails console for Arch so that GenericWork is loaded.
s3 = Aws::S3::Client.new
resp = s3.get_object(bucket: "stack-p-arch-dropbox", key: "titles_names.csv")
csv = CSV.parse(resp.body.read, headers: true, header_converters: :symbol, liberal_parsing: true)

# One output row per input row: index, Arch ID (blank if no title match), title, last name, creator match.
csv_string = CSV.generate do |new_csv|
  csv.each.with_index(1) do |row, index|
    gw = GenericWork.where(title: Array(row[:title])).first
    last_name = row[:student_last_name]
    creators = gw&.creator || []
    # True only if the last name is present and appears in one of the work's creators.
    match = last_name.present? && creators.any? { |c| c.include?(last_name) }
    new_csv << [index, gw&.id, row[:title], last_name, match]
  end
end

s3.put_object(acl: "authenticated-read", body: csv_string, bucket: "stack-p-arch-dropbox", key: "title_matches.csv")

The output CSV is at s3://stack-p-arch-dropbox/title_matches.csv if you want to download it and take a look. If there is a title match, the second column should contain the Arch ID for the dissertation. Blank means no match, which could be for a number of reasons, including funky character encodings. There are 45 total that didn't match the title query. The last boolean column is a check to see whether the last name in the ProQuest spreadsheet is part of any of the creators' names in the record found by title (I hope that sentence is understandable).
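
For a quick read of that file, something like the following could count the rows with no title match and the rows where the title matched but the last name did not. It is a sketch only: `summarize_matches` is a hypothetical helper and the sample rows are made up; the column order follows the script above, which writes no header row.

```ruby
require "csv"

# Columns, in order: index, Arch ID, title, student last name, creator match.
def summarize_matches(csv_text)
  rows = CSV.parse(csv_text, liberal_parsing: true)
  blank_id = ->(row) { row[1].to_s.strip.empty? }
  {
    total: rows.size,
    # No Arch ID means the title query found nothing.
    no_title_match: rows.count { |r| blank_id.call(r) },
    # Title matched, but the last name was not found among the creators.
    name_mismatch: rows.count { |r| !blank_id.call(r) && r[4] != "true" }
  }
end

# Made-up rows for illustration only.
sample = <<~CSV
  1,abc123,First Title,Smith,true
  2,,Second Title,Jones,false
  3,def456,Third Title,Lee,false
CSV

p summarize_matches(sample)
```

The name-mismatch rows are probably the ones worth checking by hand, since a title alone may not be unique.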

davidschober avatar davidschober commented on July 17, 2024

@bmquinn moving this into In Progress. Toss points on it at some point.

chrisdaaz avatar chrisdaaz commented on July 17, 2024

@bmquinn wondering what your thoughts are about this idea: what if we applied authenticated access restrictions on filesets while keeping the works public?

authenticated works in Arch are not discoverable from Google, NUsearch, or Arch's browse/search features. They require the user to log in before they can find and access a work and its files. Users must somehow know a work exists in Arch before they can access it.

Users who can access works via NetID authentication currently have no way of discovering dissertations via Google or NUsearch. I wonder if the following scenario could be done programmatically:

  • Find dissertations that have "Campus use only" values in their ProQuest XML metadata
  • Change the visibility of those Works to Public
  • Change the visibility of those Works' FileSets to Northwestern

This would signal to campus (via Google/NUsearch indexing) that Arch has dissertations that may be relevant to their research. When they visit the public Work record in Arch and attempt to download the dissertation PDF, they will be prompted to log in with NetID. Does this make sense?
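
Roughly, the scenario above might look like the sketch below. The `WorkStub` and `FileSetStub` structs are stand-ins so it runs on its own, and `open_work_restrict_files` is a hypothetical helper; in Arch these would be the real work and FileSet records. The "open" and "authenticated" visibility values are the ones used earlier in this thread.

```ruby
# Stand-ins for the real models (an assumption, not Arch code).
WorkStub = Struct.new(:id, :visibility, :file_sets)
FileSetStub = Struct.new(:id, :visibility)

# Make the work public and restrict each of its files to authenticated users.
# With dry_run: true nothing is changed; the planned changes are only returned
# as [id, current visibility, new visibility] triples.
def open_work_restrict_files(work, dry_run: true)
  changes = [[work.id, work.visibility, "open"]]
  changes += work.file_sets.map { |fs| [fs.id, fs.visibility, "authenticated"] }
  unless dry_run
    work.visibility = "open"
    work.file_sets.each { |fs| fs.visibility = "authenticated" }
  end
  changes
end

work = WorkStub.new("w1", "authenticated", [FileSetStub.new("f1", "authenticated")])
open_work_restrict_files(work, dry_run: false)
puts work.visibility
puts work.file_sets.first.visibility
```

Defaulting to a dry run makes it easy to review the planned changes before anything is saved.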

As we discussed, you might not be able to find every dissertation in Arch via the script, so I can check on those remaining dissertations manually.

kdid avatar kdid commented on July 17, 2024

Please add your planning poker estimate with ZenHub @bmquinn

bmquinn avatar bmquinn commented on July 17, 2024

Hi @chrisdaaz I've been doing some dry-run testing of the script I've written to fix these, but I have a quick question before I hit "go". There are 17 works in the batch of 770 that have FileSets in addition to the PDF with the ProQuest id e.g. XXXX_1234.pdf (a range of types including video, documents, images, etc.). Should I set the visibility on those the same as the "main" one or leave their visibility as-is? Thanks!
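
For reference, separating the "main" PDF from the other FileSets could be sketched as below. It assumes the main file always ends in an underscore, digits, and ".pdf", as in the XXXX_1234.pdf example; the pattern, the helper name, and the other filenames are made up for illustration.

```ruby
# Assumed naming convention for the main PDF: "<anything>_<digits>.pdf".
MAIN_PDF_PATTERN = /_\d+\.pdf\z/i

# Returns two arrays: [main PDF filenames, everything else].
def split_main_and_supplementary(filenames)
  filenames.partition { |name| name.match?(MAIN_PDF_PATTERN) }
end

main, extras = split_main_and_supplementary(["XXXX_1234.pdf", "interview.mp4", "appendix.docx"])
puts "main: #{main.inspect}"
puts "supplementary: #{extras.inspect}"
```

Keeping the two groups separate means the supplementary files can be handled either way once the question is answered.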
