eckardm / droid-sqlite-analysis

This project forked from exponential-decay/demystify

Engine for analysis of DROID CSV and Siegfried export files. The tool has three purposes: to break the export into its components and store them within a SQLite database; to create additional columns that augment the output where useful; and to query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions.

Home Page: http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export

Python 69.01% HTML 30.99%

droid-sqlite-analysis's Introduction

droid/siegfried-sqlite-analysis

Engine for analysis of DROID CSV and Siegfried export files. The tool has three purposes: to break the export into its components and store them within a set of tables in a SQLite database; to create additional columns that augment the output where useful; and to query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions.

See the following blogs for more information:

There are three components to the tool:

droid2sqlite.py

Places DROID CSV export data into an SQLite database with the same filename as the input.

Single argument: --export

Translates the results of DROID and Siegfried into a static SQLite database structure. In a drastic change from the original tool, there are now five tables:

  • DBMD - Database Metadata
  • FILEDATA - File manifest and filesystem metadata
  • IDDATA - Identification metadata
  • IDRESULTS - Junction table linking FILEDATA and IDDATA
  • NSDATA - Namespace metadata; its key (NS_ID) also appears as a secondary key in the IDDATA table
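As a rough illustration of how a junction table like IDRESULTS ties the other tables together, here is a minimal sketch using an in-memory database. Only the table names and NS_ID come from the description above; every other column name (FILE_ID, ID_ID, FILE_PATH, NS_NAME, ID) and all sample rows are hypothetical, and the DBMD table is omitted for brevity. The real schema generated by droid2sqlite.py may differ.

```python
import sqlite3

# In-memory stand-in for a database produced by droid2sqlite.py.
# Column names other than NS_ID are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NSDATA (NS_ID INTEGER PRIMARY KEY, NS_NAME TEXT);
    CREATE TABLE FILEDATA (FILE_ID INTEGER PRIMARY KEY, FILE_PATH TEXT);
    CREATE TABLE IDDATA (ID_ID INTEGER PRIMARY KEY, NS_ID INTEGER, ID TEXT);
    CREATE TABLE IDRESULTS (FILE_ID INTEGER, ID_ID INTEGER);
    INSERT INTO NSDATA VALUES (1, 'pronom');
    INSERT INTO FILEDATA VALUES (1, 'docs/report.pdf'), (2, 'docs/notes.txt');
    INSERT INTO IDDATA VALUES (1, 1, 'fmt/18'), (2, 1, 'x-fmt/111');
    INSERT INTO IDRESULTS VALUES (1, 1), (2, 2);
""")

# Walk from each file, through the junction table, to its identification
# and the namespace that identification belongs to.
QUERY = """
    SELECT FILEDATA.FILE_PATH, NSDATA.NS_NAME, IDDATA.ID
    FROM IDRESULTS
    JOIN FILEDATA ON FILEDATA.FILE_ID = IDRESULTS.FILE_ID
    JOIN IDDATA ON IDDATA.ID_ID = IDRESULTS.ID_ID
    JOIN NSDATA ON NSDATA.NS_ID = IDDATA.NS_ID
    ORDER BY FILEDATA.FILE_PATH
"""
rows = conn.execute(QUERY).fetchall()
for file_path, namespace, identifier in rows:
    print(file_path, namespace, identifier)
```

Because a file can appear in IDRESULTS more than once, the same pattern would return one row per identification for files with multiple IDs.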

It also augments DROID or Siegfried export data with additional columns, including:

URI_SCHEME: Separates the URI scheme from the DROID URI column. This makes it possible to identify container objects in the export and to distinguish files stored inside container objects from standalone files.
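A minimal sketch of the idea behind URI_SCHEME, using only the Python standard library. The helper names `uri_scheme` and `is_in_container` are hypothetical, the sample URIs are illustrative, and the sketch assumes that standalone files carry a `file` scheme while files inside containers carry the container's scheme (for example `zip`). It is not the tool's own implementation.

```python
from urllib.parse import urlsplit


def uri_scheme(uri):
    """Return the leading scheme of a URI, e.g. 'file' or 'zip'."""
    return urlsplit(uri).scheme


def is_in_container(uri):
    """Assumption: anything not using the plain 'file' scheme sits in a container."""
    return uri_scheme(uri) != "file"


standalone = "file:/C:/data/report.doc"
contained = "zip:file:/C:/data/archive.zip!/inner/report.doc"

print(uri_scheme(standalone), is_in_container(standalone))
print(uri_scheme(contained), is_in_container(contained))
```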

DIR_NAME: Returns the base directory name from the filepath to enable analysis of directory names, e.g. the number of directories in the collection.
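The directory count mentioned above can be sketched as follows. This assumes DIR_NAME holds the directory portion of each file path and uses POSIX-style sample paths; the helper name `dir_name` is hypothetical and the tool itself may derive the value differently.

```python
import posixpath


def dir_name(file_path):
    """Return the directory portion of a POSIX-style file path."""
    return posixpath.dirname(file_path)


paths = [
    "/data/a/one.txt",
    "/data/a/two.txt",
    "/data/b/three.txt",
]

# Collect the distinct directories to count how many the collection uses.
directories = {dir_name(p) for p in paths}
print(len(directories), sorted(directories))
```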

droidsqliteanalysis.py

Combines the functions of droid2sqlite.py by calling droid2sqlite's primary class. It also queries a DROID SQLite database with the schema generated by droid2sqlite, and outputs the result to stdout.

Two primary arguments:

  • --export

First creates a database like droid2sqlite then outputs a report.

  • --db

Outputs a report from a pre-existing sqlite database. HTML is output by default, and it's a good idea to pipe this to a file, e.g.:

python droidsqliteanalysis.py --db opf-with-text.db > my_html_file.htm

TXT can be output by using a --txt flag:

python droidsqliteanalysis.py --db opf-with-text.db --txt > my_txt_file.txt

N.B. The Rogue and Hero output below will return, but it has been temporarily disabled in the current release to understand what a Rogues Gallery needs to look like when using Siegfried.

The following flags provide Rogue or Hero output:

  • --rogues

Outputs a list of problematic files returned by DROID e.g. non-IDs, multiple IDs, extension mismatches, zero-byte objects and duplicate files.

  • --heroes

Outputs a list of files considered to be comparatively 'clean' in the context of a DROID output. These files will not be duplicates and will be positively identified using Signature or Container mechanisms by DROID's standards.

More information can be found here: http://openpreservation.org/blog/2015/08/25/hero-or-villain-a-tool-to-create-a-digital-preservation-rogues-gallery/
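The Rogue and Hero criteria described above could be sketched like this. The record fields (`path`, `id_count`, `method`, `extension_mismatch`, `size`, `hash`) and the sample data are hypothetical, and duplicates are assumed to be files sharing a checksum. This is one reading of the criteria, not the tool's actual queries.

```python
from collections import Counter


def classify(records):
    """Split records into rogue and hero path lists using the stated criteria."""
    hash_counts = Counter(r["hash"] for r in records)
    rogues, heroes = [], []
    for r in records:
        duplicate = hash_counts[r["hash"]] > 1
        is_rogue = (
            r["id_count"] == 0          # non-ID
            or r["id_count"] > 1        # multiple IDs
            or r["extension_mismatch"]  # extension mismatch
            or r["size"] == 0           # zero-byte object
            or duplicate                # duplicate file
        )
        is_hero = not duplicate and r["method"] in ("Signature", "Container")
        if is_rogue:
            rogues.append(r["path"])
        if is_hero:
            heroes.append(r["path"])
    return rogues, heroes


records = [
    {"path": "a.pdf", "id_count": 1, "method": "Signature",
     "extension_mismatch": False, "size": 100, "hash": "h1"},
    {"path": "b.bin", "id_count": 0, "method": None,
     "extension_mismatch": False, "size": 10, "hash": "h2"},
    {"path": "c.doc", "id_count": 1, "method": "Signature",
     "extension_mismatch": False, "size": 50, "hash": "h3"},
    {"path": "d.doc", "id_count": 1, "method": "Signature",
     "extension_mismatch": False, "size": 50, "hash": "h3"},
    {"path": "e.txt", "id_count": 0, "method": None,
     "extension_mismatch": False, "size": 0, "hash": "h4"},
]

rogues, heroes = classify(records)
print("rogues:", rogues)
print("heroes:", heroes)
```

The two lists are computed independently, so under this reading a file that is neither problematic nor positively identified by Signature or Container appears in neither list.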

MsoftFnameAnalysis.py

Class to handle analysis of non-recommended filenames from Microsoft: http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx

The code contains a copy of a library from Cooper Hewitt that enables writing plain-text descriptions of characters: https://github.com/cooperhewitt/py-cooperhewitt-unicode
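For orientation, a few of the checks in Microsoft's file-naming guidance can be sketched as below: reserved device names, reserved characters, control characters, and a trailing space or period. The function names are hypothetical, the rule set is a subset, and character descriptions here come from the standard library's `unicodedata` module rather than the Cooper Hewitt library, so this is not the class's actual behaviour.

```python
import unicodedata

# Reserved device names from Microsoft's file-naming guidance (subset).
RESERVED_NAMES = {"CON", "PRN", "AUX", "NUL"}
RESERVED_NAMES |= {f"COM{i}" for i in range(1, 10)}
RESERVED_NAMES |= {f"LPT{i}" for i in range(1, 10)}

# Characters that are reserved in Windows file names.
RESERVED_CHARS = set('<>:"/\\|?*')


def describe(ch):
    """Plain-text description of a character, falling back to its code point."""
    return unicodedata.name(ch, f"U+{ord(ch):04X}")


def filename_issues(name):
    """Return a list of human-readable problems found in a file name."""
    issues = []
    # Reserved names are also discouraged when followed by an extension.
    stem = name.split(".")[0].upper()
    if stem in RESERVED_NAMES:
        issues.append(f"reserved name: {stem}")
    for ch in name:
        if ch in RESERVED_CHARS or ord(ch) < 32:
            issues.append(f"reserved character: {describe(ch)}")
    if name.endswith((" ", ".")):
        issues.append("ends with a space or period")
    return issues


for candidate in ["report.doc", "NUL.txt", "report?.doc", "notes."]:
    print(repr(candidate), filename_issues(candidate))
```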

Usage Notes

Summary/Aggregate Binary / Text / Filename identification statistics are output with the following priority:

Namespace (e.g. ordered by PRONOM first [configurable])

  1. Binary and Container Identifiers
  2. XML Identifiers
  3. Text Identifiers
  4. Filename Identifiers
  5. Extension Identifiers

We need to monitor how well this works. Namespace-specific statistics are also output further down the report.
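One way to read the priority described above is as a two-part sort key: namespace first, then identification method. The sketch below assumes exactly that; the namespace list, the method labels, and the sample identifications are all hypothetical, and the report's real ordering logic may differ.

```python
# Hypothetical, configurable namespace order with PRONOM first.
NAMESPACE_ORDER = ["pronom", "tika", "freedesktop.org"]

# Method ranks following the numbered list; binary and container share rank 0.
METHOD_RANK = {
    "binary": 0,
    "container": 0,
    "xml": 1,
    "text": 2,
    "filename": 3,
    "extension": 4,
}


def priority(identification):
    """Sort key: lower tuples win. Unknown namespaces or methods sort last."""
    ns = identification["namespace"]
    ns_rank = NAMESPACE_ORDER.index(ns) if ns in NAMESPACE_ORDER else len(NAMESPACE_ORDER)
    method_rank = METHOD_RANK.get(identification["method"], len(METHOD_RANK))
    return (ns_rank, method_rank)


identifications = [
    {"namespace": "tika", "method": "binary", "id": "application/pdf"},
    {"namespace": "pronom", "method": "extension", "id": "fmt/18"},
    {"namespace": "pronom", "method": "text", "id": "x-fmt/111"},
]

ordered = sorted(identifications, key=priority)
best = ordered[0]
print([i["id"] for i in ordered])
```

Under this assumption a lower-priority method in the preferred namespace still outranks a binary match from another namespace, which is one of the behaviours worth monitoring.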

TODO

  • Internationalizing archivist descriptions
  • Additional typing of database fields
  • Improved container listing/handling e.g. maybe via URIs in SF output...
  • Improved 'directory' listing and handling.
  • Unit tests!

License

Copyright (c) 2014 Ross Spencer

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.

Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.

This notice may not be removed or altered from any source distribution.
