License: zlib License


DROID Skeleton Test Suite Generator (skeleton-test-suite-generator)

Herein lies a tool for the automated generation of digital objects based on the digital signatures documented in the PRONOM database maintained by The National Archives, UK. PRONOM data is licensed under the Open Government Licence (OGL): http://www.nationalarchives.gov.uk/doc/open-government-licence/

The skeleton-test-suite-generator fills a gap: the community requires a corpus of digital objects for the validation and evaluation of format identification tools and techniques. The tool is intended to complement a methodology in which skeleton files are also generated manually by signature developers.

The research paper this work led to can be found here: http://www.ijdc.net/index.php/ijdc/article/view/8.1.120

Technical

The tool takes a signature specified for a digital object in PRONOM and constructs a digital object that will match its footprint. For example, given the signature:

CAFED00D{4}CAFEBABE(0D|0D0A)

The hex sequences comprising digital objects that will match this signature in DROID will look like the following:

CA FE D0 0D 00 00 00 00 CA FE BA BE 0D

Or:

CA FE D0 0D 00 00 00 00 CA FE BA BE 0D 0A
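The expansion above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's actual code: the `expand_signature` name and the `TOKEN` pattern are hypothetical, it handles only hex literals, fixed `{n}` gaps and `(a|b)` alternatives, and it always takes the first alternative (one possible reading of the "always turn-left" behaviour mentioned in the TODO list).

```python
import re

# Hypothetical tokenizer: a fixed gap {n}, an alternation (a|b), or a run of hex pairs.
TOKEN = re.compile(r"\{(\d+)\}|\(([^)]*)\)|((?:[0-9A-Fa-f]{2})+)")


def expand_signature(signature: str, fill: int = 0x00) -> bytes:
    """Build one byte string that should match a simple PRONOM-style signature."""
    out = bytearray()
    pos = 0
    while pos < len(signature):
        match = TOKEN.match(signature, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        gap, choices, literal = match.groups()
        if gap is not None:
            # {n}: n arbitrary bytes; filled here with a single fill value.
            out.extend(bytes([fill]) * int(gap))
        elif choices is not None:
            # (a|b): assumption - always take the first alternative.
            out.extend(bytes.fromhex(choices.split("|")[0]))
        else:
            out.extend(bytes.fromhex(literal))
        pos = match.end()
    return bytes(out)


skeleton = expand_signature("CAFED00D{4}CAFEBABE(0D|0D0A)")
print(skeleton.hex(" ").upper())
```

Taking the second alternative instead would yield the longer sequence ending in 0D 0A.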

The scripts take an export of the PRONOM database in XML, extract the internal signature information belonging to each format record and generate the digital objects - creating the 'skeleton test suite'.

The objects can be used for:

  • Understanding where signatures in the PRONOM database will conflict, therefore generating multiple identifications for some files.

  • Creating signatures purely based on format specifications where getting sample files or making them available to those able to create signatures is extremely difficult.

  • Incorporation into the DROID unit test-suite to ensure modifications to the identification engine do not impact identification capability.

  • Testing the stability of signature files over time.

Other benefits include a small footprint: zipped, the suite is just over 150 kB in size; unzipped, it is approximately 390 kB.

The suite does not suffer from issues relating to IPR and copyright. The suite and generator tool are licensed under CC BY-SA (see below).

The tool is so far a prototype and does not yet handle every sequence in PRONOM. Signatures with multiple BOF sequences, for example, will not generate correctly. While this can be corrected by the team working on PRONOM, these are legitimate sequences that the tool should handle.

HOWTO

python skeletongenerator.py

Easy as. The scripts require the existence of the 'pronom-export' folder generated by the scripts in the pronom-xml-export repository: https://github.com/exponential-decay/pronom-xml-export

The input and output locations can be configured by modifying the accompanying cfg file skeletonsuite.cfg.

By default, files are generated using NULL bytes to 'fill' the gaps dictated by a signature. This can be changed in the cfg file: set the fill value to the byte value (0 to 255) to use, or to a value below 0 or above 255 for random bytes.
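The fill rule described above might look something like the following. This is a minimal sketch under stated assumptions: the `make_fill` helper is hypothetical and is not read from skeletonsuite.cfg, whose option names are not shown here.

```python
import os


def make_fill(count: int, fill_value: int = 0) -> bytes:
    """Return `count` fill bytes.

    A fill_value in 0..255 repeats that byte (the default, 0, gives NULL bytes);
    anything outside that range yields random bytes instead.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if 0 <= fill_value <= 255:
        return bytes([fill_value]) * count
    return os.urandom(count)


print(make_fill(4))        # four NULL bytes
print(make_fill(4, 0x20))  # four spaces
print(len(make_fill(4, -1)))  # four random bytes
```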

Version information can be displayed by running:

python skeletongenerator.py --version

Testing reports

I completed two reports on the Skeleton Test Suite back in 2012/2013. They document testing of the files on DROID and explore reasons why some files do or do not work. The reports and links to the test-suites used for testing can be found on the repo wiki, here: https://github.com/exponential-decay/skeleton-test-suite-generator/wiki

TODO

  • Handle multiples of sequence types, e.g. multiple non-colliding BOF sequences.

  • Understand the requirements for metadata to be associated with files, e.g. should the internal structure of files be self-describing?

  • A repository needs to be created on GitHub to host the first non-prototypical output of this generator and the test-suite henceforth.

  • Understand what we need to do with multiple combinations of byte sequences; currently we always turn left.

  • Unit tests for signature2bytegenerator.py and filewriter.py as a priority.

For the community TODO

  • Incorporate suite into unit tests for DROID and FIDO

  • Together understand if we can adapt this approach for the UNIX File utility

  • Talk about this tool and potential approach and help to understand how to refine it!

  • Sit tight as we build an infrastructure to host the suite itself online.

License

Copyright (c) 2012 Ross Spencer

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.

  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.

  3. This notice may not be removed or altered from any source distribution.

Contributors

anjackson, ross-spencer

skeleton-test-suite-generator's Issues

Create inverse/negative test output

This thread is interesting: digital-preservation/droid#805

If we reorganize the output, can we create files, one per signature, that //shouldn't// match the DROID signature?

This does change the spirit of the suite a little, and right now I think the general use-case is "do all skeleton files match". Folks would be cherry-picking negative matches and would need enough context to say - this file //shouldn't// match signature X and sequence Y.

fmt/1157

The new Folio Infobase File has overlapping beginning-of-file sequences:

[image not preserved]

Only the "Folio" bit is being generated in the test signature.

Variable sequence output does not honour PRONOM documented offsets, e.g. fmt/161 (SIARD)

Hi Ross
Thanks for looking into fmt/189 for me and following up with TNA.
Hope you don't mind, but I've got another curly one...
The SIARD signature (http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=876&strPageToDisplay=signatures) contains a variable sequence beginning "786d". Your signature has that sequence but at offset 14. Unusually for a variable sequence (and perhaps illegally, depending how you read "Technical Paper 1") it has been given an offset of 1024. I'd interpret this to mean a minimum offset of 1024 from the BOF, making your skeleton file invalid.

Thanks in advance!
Richard

Anti-virus false positives - MPEG Elementary Streams

Rather oddly, the fmt/640 and fmt/649 skeleton files are both getting picked up as 'trojans' by McAfee, as https://nvd.nist.gov/vuln/detail/CVE-2011-4259. These are MPEG-2 Elementary Stream and MPEG-1 Elementary Stream respectively. The signatures are:

000001B3{8-256}000001B5{6-256}000001B8

and

000001B3{8}000001B8

Not sure what to do about it, but it was causing issues with local DROID builds so we're currently having to exclude them from our tests. I've yet to tinker with skeleton files to find a byte pattern McAfee will ignore but will update if I get the chance.

cc @sparkhi @jcharlet
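For anyone wanting to tinker with alternative byte patterns for these two signatures, a rough sketch follows. It is not the generator's own code: `shortest_match` and `PART` are hypothetical names, only hex literals and `{n}` / `{n-m}` gaps are handled, and each ranged gap is filled with its minimum length. Whether any particular fill value avoids the anti-virus detection is untested.

```python
import re

# Hypothetical tokenizer: a gap {n} or {n-m}, or a run of hex pairs.
PART = re.compile(r"\{(\d+)(?:-(\d+))?\}|((?:[0-9A-Fa-f]{2})+)")


def shortest_match(signature: str, fill: int = 0x00) -> bytes:
    """Build the shortest byte string matching a literal-and-gap signature."""
    out = bytearray()
    pos = 0
    while pos < len(signature):
        match = PART.match(signature, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        low, _high, literal = match.groups()
        if literal is not None:
            out += bytes.fromhex(literal)
        else:
            # Use the minimum gap length; the upper bound is ignored here.
            out += bytes([fill]) * int(low)
        pos = match.end()
    return bytes(out)


for fill in (0x00, 0xFF):
    print(shortest_match("000001B3{8-256}000001B5{6-256}000001B8", fill).hex())
    print(shortest_match("000001B3{8}000001B8", fill).hex())
```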

Out of order EOF sequences

Hi Ross
This new signature from the latest PRONOM batch doesn't seem to generate a proper test:
http://apps.nationalarchives.gov.uk/PRONOM/fmt/648

The issue is with this bit:

Position type: Absolute from EOF
Offset: 8
Value: 30333069

which isn't appearing in an absolute position from the EOF.

This seems a bit of a dodgy one anyway as the TNA's example "BOF: 030i (OFFSET 4) EOF: Svar (Offset 62) EOF: 030i (OFFSET 8)" doesn't conform to the signature because that "Svar" sequence should be between offset 20-46 from the EOF, not offset 62.

thanks!
Richard

fmt/894 (v85)

Hi Ross
The second byte sequence in this signature is a "#" (hex 23). It should appear at an absolute offset of 128 from the BOF. In yours it is at offset 170, appended to the end of the first byte sequence. This seems a bit like that EOF issue I reported, where you have two byte sequences that can overlay each other rather than being adjacent.
Otherwise I'm passing all of yours for v85 (except fmt/899 and fmt/900 which you got errors printed for).
ta!
Rich
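The difference between appending sequences and overlaying them at their absolute BOF offsets can be sketched as below. This is only an illustration of the idea raised in this issue, not the generator's code: `place_at_bof_offsets` is a hypothetical helper, and the `b"HEAD"` sequence is placeholder data rather than the real fmt/894 signature.

```python
def place_at_bof_offsets(sequences, fill=0x00):
    """Write each (offset, data) pair at its absolute offset from the BOF.

    Sequences may share a region of the file; a ValueError is raised only if
    two sequences demand different bytes at the same position.
    """
    seqs = list(sequences)
    if not seqs:
        return b""
    size = max(offset + len(data) for offset, data in seqs)
    buf = bytearray([fill]) * size
    written = {}
    for offset, data in seqs:
        for index, byte in enumerate(data, start=offset):
            if written.get(index, byte) != byte:
                raise ValueError(f"conflicting bytes at offset {index}")
            written[index] = byte
            buf[index] = byte
    return bytes(buf)


# Placeholder first sequence at offset 0, then "#" (hex 23) at absolute offset 128.
skeleton = place_at_bof_offsets([(0, b"HEAD"), (128, b"#")])
print(len(skeleton), hex(skeleton[128]))
```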
