BagItTools

Introduction

BagItTools is a PHP implementation of the BagIt v1.0 specification (RFC-8493).

Features:

  • Create new bag
  • Load existing directory as a bag.
  • Load archive file (*.zip, *.tar, *.tar.gz, *.tgz, *.tar.bz2)
  • Validate a bag
  • Add/Remove files
  • Add/Remove fetch urls
  • Add/Remove hash algorithms (md5, sha1, sha224, sha256, sha384, sha512, sha3-224, sha3-256, sha3-384, sha3-512)
  • Generate payload manifests for all data/ files for all hash algorithms (depending on PHP support)
  • Generate tag manifests for all root level files and any additional tag directories/files.
  • Add/Remove tags from bag-info.txt files, maintains ordering of tags loaded.
  • Generates/updates payload-oxum and bagging-date.
  • Passes all bagit-conformance-suite tests.
  • Create an archive (zip, tar, tar.gz, tgz, tar.bz2)
  • In-place upgrade of bag from v0.97 to v1.0

Installation

Composer

composer require "whikloj/bagittools"

Clone from Github

git clone https://github.com/whikloj/BagItTools
cd BagItTools
composer install --no-dev

Dependencies

All dependencies are installed or identified by composer.

Some PHP extensions are required and this library will not install if they cannot be found in the default PHP installation (the one used by composer).

The required extensions are:

Usage

You can integrate BagItTools into your own code as a library using the API, or use the CLI commands for some simple functionality.

Command line

Validating a bag

./bin/console validate <path to bag>

This will output a message as to whether the bag is or is NOT valid. It will also respond with an appropriate exit code (0 == valid, 1 == invalid).

If you add the -v flag it will also print any errors or warnings.

This command can be used with the bagit-conformance-suite like this:

./test-harness <path to BagItTools>/bin/console -- -v validate

API

API Documentation

Create a new bag

As this is a v1.0 implementation, by default bags created use the UTF-8 file encoding and the SHA-512 hash algorithm.

require_once './vendor/autoload.php';

use \whikloj\BagItTools\Bag;

$dir = "./newbag";

// Create new bag as directory $dir
$bag = Bag::create($dir);

// Add a file
$bag->addFile('../README.md', 'data/documentation/myreadme.md');

// Add another algorithm
$bag->addAlgorithm('sha1');

// Get the algorithms
$algos = $bag->getAlgorithms();
var_dump($algos); // array(
                  //   'sha512',
                  //   'sha1',
                  // )

// Add a fetch url
$bag->addFetchFile('http://www.google.ca', 'data/mywebsite.html');

// Add some bag-info tags
$bag->addBagInfoTag('Contact-Name', 'Jared Whiklo');
$bag->addBagInfoTag('CONTACT-NAME', 'Additional admins');

// Check for tags.
if ($bag->hasBagInfoTag('contact-name')) {

    // Get tags
    $tags = $bag->getBagInfoByTag('contact-name');
    
    var_dump($tags); // array(
                     //    'Jared Whiklo',
                     //    'Additional admins',
                     // )

    // Remove a specific tag value using array index from the above listing.
    $bag->removeBagInfoTagIndex('contact-name', 1); 

    // Get tags
    $tags = $bag->getBagInfoByTag('contact-name');

    var_dump($tags); // array(
                     //    'Jared Whiklo',
                     // )

    $bag->addBagInfoTag('Contact-NAME', 'Bob Saget');
    // Get tags
    $tags = $bag->getBagInfoByTag('contact-name');
    
    var_dump($tags); // array(
                     //    'Jared Whiklo',
                     //    'Bob Saget',
                     // )
    // The value comparison is case sensitive by default, so 'bob saget'
    // does not match 'Bob Saget' and nothing is removed here.
    $bag->removeBagInfoTagValue('contact-name', 'bob saget');
    $tags = $bag->getBagInfoByTag('contact-name');

    var_dump($tags); // array(
                     //    'Jared Whiklo',
                     //    'Bob Saget',
                     // )

    // With the case sensitive flag set to false, you can be less careful
    $bag->removeBagInfoTagValue('contact-name', 'bob saget', false);
    $tags = $bag->getBagInfoByTag('contact-name');

    var_dump($tags); // array(
                     //    'Jared Whiklo',
                     // )

    // Remove all values for the specified tag.
    $bag->removeBagInfoTag('contact-name');
}

// Write bagit support files (manifests, bag-info, etc)
$bag->update();

// Write the bag to the specified path and filename using the expected archiving method.
$bag->package('./archive.tar.bz2');

Maintainer

Jared Whiklo

License

MIT

Development

To-Do:

  • CLI interface to handle simple bag CRUD (CReate/Update/Delete) functions.

bagittools's People

Contributors

henning-gerhardt, jonasraoni, whikloj

bagittools's Issues

Path encoding bug

I recently discovered that the BagIt 1.0 specification requires that CR, LF, and % in file paths within manifest files are percent-encoded, and that there isn't a single BagIt implementation that does this correctly. Implementations either only encode CR and LF but not % or they encode nothing.

This implementation does not encode paths in the manifest, which means that it would fail to validate BagIt 1.0 bags that include file paths containing CR, LF, or %. Likewise, it would create bags that would fail BagIt 1.0 validation in the case that there are paths that naturally contain percent-encoded characters.

For example, let's say a bag contains the file data/file%0A1.txt. Per the spec, this file should be written to the manifest as data/file%250A1.txt. However, this implementation writes it as data/file%0A1.txt. This means that when this implementation validates a properly constructed 1.0 bag, it will look for the file data/file%250A1.txt, which does not exist. Similarly, if another implementation that follows the spec attempts to validate a bag produced by this implementation, it would look for data/file\n1.txt, which does not exist.
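The encoding rule in the example above can be sketched in a few lines of PHP. This is a minimal illustration, not code from BagItTools; the function names encodeManifestPath() and decodeManifestPath() are hypothetical. It relies on strtr() with an array argument, which makes a single pass and never rescans its own output, so a % is not encoded or decoded twice.

```php
<?php
// Hypothetical helpers (not part of BagItTools) that percent-encode only
// CR, LF and % in a manifest path, as described in the issue above.

function encodeManifestPath(string $path): string
{
    // Single pass: the "%" produced for CR/LF is not re-encoded.
    return strtr($path, [
        '%'  => '%25',
        "\r" => '%0D',
        "\n" => '%0A',
    ]);
}

function decodeManifestPath(string $encoded): string
{
    // Hex digits in percent-encoding are case-insensitive, so accept both.
    return strtr($encoded, [
        '%0D' => "\r",
        '%0d' => "\r",
        '%0A' => "\n",
        '%0a' => "\n",
        '%25' => '%',
    ]);
}
```

With these helpers, a file named data/file%0A1.txt round-trips through the manifest form data/file%250A1.txt, while a manifest entry of data/file%0A1.txt decodes to a name containing a real line feed.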

It would seem desirable to me to move the ecosystem in the direction of properly implementing the 1.0 specification, while at the same time acknowledging that there are a large number of 1.0 bags in existence that may then become invalid.

As such, it may be prudent to fall back on a series of tests when validating bags. You may want to first attempt to validate per the spec, and then, if a file cannot be found, attempt to locate it by either decoding only the CR and LF or leaving the path unchanged, ideally validating all of the files using the same method.
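One way to picture the fallback described above is a function that tries each decoding strategy in turn and keeps the first one under which every manifest path resolves. The sketch below is an assumption about how this could look, not the library's behaviour; manifestPathDecoders(), pickDecodingStrategy() and the $exists callback are all hypothetical names.

```php
<?php
// Hypothetical sketch (not part of BagItTools): choose one decoding
// strategy that locates ALL manifest paths, trying the spec first.

function manifestPathDecoders(): array
{
    $crLf = ['%0D' => "\r", '%0d' => "\r", '%0A' => "\n", '%0a' => "\n"];
    return [
        // 1. Full RFC 8493 decoding: CR, LF and %.
        'spec' => function (string $p) use ($crLf): string {
            return strtr($p, $crLf + ['%25' => '%']);
        },
        // 2. Legacy behaviour: decode only CR and LF.
        'cr-lf-only' => function (string $p) use ($crLf): string {
            return strtr($p, $crLf);
        },
        // 3. Legacy behaviour: no decoding at all.
        'unchanged' => function (string $p): string {
            return $p;
        },
    ];
}

// $exists is a callback that reports whether a decoded path is present.
function pickDecodingStrategy(array $manifestPaths, callable $exists): ?string
{
    foreach (manifestPathDecoders() as $name => $decode) {
        $allFound = true;
        foreach ($manifestPaths as $path) {
            if (!$exists($decode($path))) {
                $allFound = false;
                break;
            }
        }
        if ($allFound) {
            return $name;
        }
    }
    return null; // no single strategy locates every file
}
```

Passing the existence check in as a callback keeps the sketch independent of the filesystem; a caller could supply a wrapper around file_exists() relative to the bag root.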

I have not examined fetch.txt implementations, but the same encoding requirements exist for paths in that file as well. This is potentially a thornier problem to address in a backward compatible way as it is unclear if the path data/file%250A1.txt is supposed to create data/file%250A1.txt (incorrect) or data/file%0A1.txt (correct).

Finally, I created a related ticket against the spec discussing this encoding problem, in particular how it breaks checksum utility compatibility.

Allow setting multiple algorithms at once

Currently you can set a single algorithm or add/remove an algorithm.

It seems convenient to be able to just set all the algorithms at once if you know what they are.

Essentially a setAlgorithm() call followed by multiple addAlgorithm() calls.

No way to add a special tag file

Tag files are outside the data/ directory and are not constrained in their names. Currently you can only add to the standard Bag-Info.txt or add a file into the data/ directory. Need to allow for adding tag files so long as they don't conflict with standard ones.

New empty bag did not validate

A new bag created without adding any data should still be valid. However, calling the validate() method raises a message like array_merge(): Argument #1 is not an array, because the warnings array it uses was never initialised and is null.

Bags with CR only line endings not handled correctly.

The spec says lines can end with CR, LF or CRLF. Currently this library only deals with LF and CRLF. Files with only a carriage return cause problems.

This ticket is to correctly handle bagit.txt, bag-info.txt, fetch.txt and all manifest and tagmanifest files that might only use CR line endings.
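As a rough illustration of what handling all three line endings involves, the sketch below splits file content on CRLF, CR or LF. It is not taken from BagItTools, and the function name splitLines() is hypothetical. The CRLF alternative is listed first so that a CRLF pair counts as one line break rather than two.

```php
<?php
// Hypothetical helper (not part of BagItTools): split tag or manifest
// file content into lines, accepting CRLF, CR-only or LF-only endings.

function splitLines(string $content): array
{
    // "\r\n" must come first, otherwise CRLF would yield an empty line.
    return preg_split('/\r\n|\r|\n/', $content);
}
```

Note that content ending in a line break yields a trailing empty string, which a caller would normally skip.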

Automatically set isExtended()

Currently you need to $bag->setExtended(true); before using bag-info tags and adding fetch files.

Instead it would be nice to allow you to do any of:

  • $bag->setExtended(true);
  • $bag->addBagInfoTag('External-Description', 'Some files');
  • $bag->addFetchFile('http://example.org/some-file.txt', 'some-file.txt');

Any of these would enable the extended mode (if not enabled) and cause the tag manifests to be created.

Multiline issue on bag info tag

If you add a bag-info tag with a long text value and that text contains at least one colon in the "right" place, validation of the created bag fails. Consider this example program:

require_once './vendor/autoload.php';

use \whikloj\BagItTools\Bag;

$dir = __DIR__ . "/newbag";

// Create new bag as directory $dir
$bag = Bag::create($dir);

// Set extended to true so that the Payload-Oxum value is calculated
$bag->setExtended(true);

$bag->addBagInfoTag('Title', 'A really long long long long long long long long long long long title with a colon : between and more information are on the way');

$bag->update();

if (!$bag->validate()) {
    echo 'Validation error!' . PHP_EOL;
    foreach ($bag->getErrors() as $number => $reason) {
        echo $reason['file'] .  ': ' . $reason['message'] . PHP_EOL;
    }
} else {
    echo 'Bag successfully validated' . PHP_EOL;
}

Output is:

Validation error!
bag-info.txt: Labels cannot begin or end with a whitespace. (Line 2)

Content of bag-info.txt is:

Title: A really long long long long long long long long long long long title
  with a colon : between and more information are on the way
Payload-Oxum: 0.0
Bagging-Date: 2020-09-28
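The failure above appears to come from the wrapped second line being read as a new "label: value" pair because it contains a colon. A parser that treats any line starting with whitespace as a continuation of the previous tag avoids that. The following is a minimal sketch under that assumption, not the library's actual parser; parseBagInfo() is a hypothetical name.

```php
<?php
// Hypothetical sketch (not part of BagItTools): parse bag-info.txt content,
// folding indented continuation lines into the preceding tag's value.

function parseBagInfo(string $content): array
{
    $tags = [];
    foreach (preg_split('/\r\n|\r|\n/', $content) as $line) {
        if (trim($line) === '') {
            continue;
        }
        // A leading space or tab marks a continuation line, even if the
        // line itself contains a colon.
        $isContinuation = ($line[0] === ' ' || $line[0] === "\t");
        if ($isContinuation && count($tags) > 0) {
            $tags[count($tags) - 1]['value'] .= ' ' . trim($line);
            continue;
        }
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // not a "label: value" line; a real parser would report it
        }
        $tags[] = ['label' => trim($parts[0]), 'value' => trim($parts[1])];
    }
    return $tags;
}
```

Applied to the bag-info.txt content shown above, the Title tag keeps its full value, including the " : " in the middle, and Payload-Oxum and Bagging-Date follow as separate tags.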

Bag::loadTagManifests() leaves $this->tagManifests uninitialized if bag files count is 0

If the early-return condition of Bag::loadTagManifests() is met, the $this->tagManifests property is left uninitialized, leading to the Undefined property: whikloj\BagItTools\Bag::$tagManifests error, e.g. on a $bag->validate() call (because of this line).

A solution would be to set $this->tagManifests before returning, e.g.:

    private function loadTagManifests(): bool
    {
        $tagManifests = [];
        $pattern = $this->getBagRoot() . "/tagmanifest-*.txt";
        $files = BagUtils::findAllByPattern($pattern);
        if (count($files) < 1) {
            # ADDED LINE:
            $this->tagManifests = $tagManifests;
            return false;
        }
        (...)
    }

Abort oversized fetch requests

According to the spec section 5.3

... Implementers SHOULD take steps to monitor and abort transfer when the
received file size exceeds the file size reported in the fetch file. ...

This ticket is to abort the download if a fetch file has a size specified and the downloaded file exceeds that size by some variance.
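A minimal sketch of the idea, assuming the download is available as a PHP stream: copy in chunks and stop as soon as the running total passes the allowed size. This is illustrative only and not how BagItTools performs fetches; copyWithLimit() and its parameters are hypothetical, and the caller is assumed to compute $maxBytes as the size from fetch.txt plus whatever variance is acceptable.

```php
<?php
// Hypothetical sketch (not part of BagItTools): copy from one stream to
// another, aborting once more than $maxBytes have been received.

function copyWithLimit($source, $dest, int $maxBytes, int $chunkSize = 8192): int
{
    $total = 0;
    while (!feof($source)) {
        $chunk = fread($source, $chunkSize);
        if ($chunk === false) {
            throw new RuntimeException('Failed to read from source stream.');
        }
        $total += strlen($chunk);
        if ($total > $maxBytes) {
            // Stop before writing the chunk that crosses the limit.
            throw new LengthException(
                sprintf('Transfer aborted: received more than %d bytes.', $maxBytes)
            );
        }
        fwrite($dest, $chunk);
    }
    return $total;
}
```

Checking the total before each write means at most one chunk beyond the limit is ever read, and none of it is written to the destination.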

Upgrade to Symfony 5

We want to move up to Symfony 5, but currently phpcpd version 4 requires symfony/console 2, 3 or 4 and supports PHP 7.2.

phpcpd version 5 supports symfony/console version 5, but requires PHP 7.3. So this will have to wait until we are ready to drop support for 7.2.

Bag-Info is not written unless the bag is explicitly set as an extended bag.

You are able to add bag-info tags, but those are not written out if you have not also set the extended attribute. For example:

$bag = Bag::create($this->tmpdir);
$bag->addBagInfoTag("Contact-Name", "Jared Whiklo");
$bag->addBagInfoTag("Source-Organization", "The room.");
$bag->update();

Creates a bag like

bagit.txt
data/
manifest-sha512.txt

Should throw an exception to warn you to make your bag extended before adding tags.

Path issues on a Windows environment

Hi @whikloj!

We're using your package to preserve academic journals at @pkp, and some users had issues under Windows, which I've just confirmed myself.

Are you open for contributions? For now I saw just two issues:

  • Path issues
    The package is failing to format the path properly in some cases (e.g. trying to open /c:/bag).

  • The repository cannot be even cloned in Windows without hacks, due to invalid file names (below I've left the offending entries):

tests/resources/TestBadFilePathsBag/data/this-file-will-be-fine-%-and-
.txt

tests/resources/TestEncodingBag/data/carriage
return-file.txt

tests/resources/TestEncodingBag/data/directory
line
break/carriage
return/Example-file-%-and-%25.txt

tests/resources/TestEncodingBag/data/directory
line
break/some-file.txt

p.s.: About the cloning problem, that's not a real issue for us, as composer is "fixing" the paths by itself.

Explore using php-vcr

For the Fetch function tests we use donatj/mock-webserver, explore whether php-vcr/php-vcr might be easier and faster.

Wrong calculation of Payload-Oxum value

Scenario: if you use more than one hash algorithm for your bag, the Payload-Oxum is multiplied by the number of hash algorithms used. For example, if you use two hash algorithms such as sha512 and md5 and your bag contains 4 files inside the data directory, the StreamCount value of Payload-Oxum is 8 and not 4. Likewise, the OctetCount value is double the correct value.

The reason is that payloadFiles contains the same data directory entry multiple times.

I can provide a simple correction against master or development branch.
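The double counting described above can be avoided by collecting the distinct payload paths across all manifests before summing. The sketch below shows that idea in isolation; it is not the correction that was offered for the library, and payloadOxum() with its two array parameters is a hypothetical shape chosen for illustration.

```php
<?php
// Hypothetical sketch (not part of BagItTools): compute Payload-Oxum
// ("OctetCount.StreamCount") so each payload file is counted once,
// regardless of how many hash algorithms list it.

/**
 * @param array $manifests algorithm => list of payload paths
 * @param array $sizes     payload path => size in bytes
 */
function payloadOxum(array $manifests, array $sizes): string
{
    // Using paths as array keys removes duplicates across algorithms.
    $unique = [];
    foreach ($manifests as $paths) {
        foreach ($paths as $path) {
            $unique[$path] = true;
        }
    }
    $octets = 0;
    foreach (array_keys($unique) as $path) {
        $octets += $sizes[$path];
    }
    return $octets . '.' . count($unique);
}
```

With four 10-byte files listed under both sha512 and md5, this yields 40.4 rather than 80.8.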

Support for Bag-Size tag

When a bag is created in extended mode, the Payload-Oxum value is generated from the total file size and the number of files in the data directory. The total file size could also be used to fill the Bag-Size tag with a human-readable value.
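Turning the octet count into a human-readable string is a small calculation. The following sketch assumes 1024-based units and two decimal places, which are choices made for illustration rather than anything the request or the library prescribes; humanReadableSize() is a hypothetical name.

```php
<?php
// Hypothetical sketch (not part of BagItTools): format a byte count for
// use as a Bag-Size value, using 1024-based units.

function humanReadableSize(int $bytes): string
{
    $units = ['B', 'KB', 'MB', 'GB', 'TB'];
    $value = (float) $bytes;
    $index = 0;
    // Divide down until the value fits the current unit or units run out.
    while ($value >= 1024 && $index < count($units) - 1) {
        $value /= 1024;
        $index++;
    }
    return round($value, 2) . ' ' . $units[$index];
}
```

For example, 1536 bytes formats as 1.5 KB and 1048576 bytes as 1 MB.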

Node tar-stream files cannot be opened by Archive_Tar

When testing out DART I discovered that the *.tar files generated cannot be opened by Archive_Tar.

You get the directories (in this case data), and the file in data but none of the top-level files.

So a bag that when un-tarred with tar -xf looks like

> ls -l ~/.dart/bags/An\ Example\ Bag
total 40
-rw-r--r--  1 whikloj  staff   77 26 Feb 14:58 aptrust-info.txt
-rw-r--r--  1 whikloj  staff  350 26 Feb 14:58 bag-info.txt
-rw-r--r--  1 whikloj  staff   54 26 Feb 14:58 bagit.txt
drwxr-xr-x  3 whikloj  staff   96 26 Feb 14:58 data
-rw-r--r--  1 whikloj  staff   46 26 Feb 14:58 manifest-md5.txt
-rw-r--r--  1 whikloj  staff  317 26 Feb 14:58 tagmanifest-sha256.txt

> ls -l ~/.dart/bags/An\ Example\ Bag/data 
total 592504
-rw-r--r--  1 whikloj  staff  303360153  7 Feb 15:09 PDF.pdf

ends up like

> ls -l Tx0GFf
total 622720
drwxr-xr-x  3 whikloj  staff         96 26 Feb 15:14 An Example Bag
-rw-r--r--  1 whikloj  staff  303360153  7 Feb 15:09 PaxHeader

> ls -l Tx0GFf/An\ Example\ Bag 
total 0
drwxr-xr-x  2 whikloj  staff  64 26 Feb 15:14 data

> ls -l Tx0GFf/An\ Example\ Bag/data 

Creation of bag with relative path fails unless ./ prefix is used

The following code fails:

$bag = Bag::create('my_bag_path');

...with...

PHP Fatal error:  Uncaught whikloj\BagItTools\Exceptions\FilesystemException: Unable to put contents to file /my_bag_path/bagit.txt in /.../vendor/whikloj/bagittools/src/BagUtils.php:283

It appears that my_bag_path is interpreted as an absolute path unless you prefix it with ./

I'm not sure if this is intentional, but in my opinion this library should behave like the PHP file functions, which would interpret my_bag_path as a relative path unless there is a / prefix in front.
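The behaviour asked for here amounts to resolving anything that is not already absolute against the current working directory. Below is a minimal sketch of that rule, not the library's own path handling; toAbsolutePath() is a hypothetical name, and the working directory is passed in explicitly (a caller would typically supply getcwd()).

```php
<?php
// Hypothetical sketch (not part of BagItTools): resolve a bag path the way
// PHP file functions would, treating "my_bag_path" as relative.

function toAbsolutePath(string $path, string $cwd): string
{
    $isAbsolute = $path !== '' && (
        $path[0] === '/' ||
        $path[0] === '\\' ||
        preg_match('#^[A-Za-z]:[\\\\/]#', $path) === 1 // e.g. c:/bag or C:\bag
    );
    if ($isAbsolute) {
        return $path;
    }
    // Drop a leading "./" so both spellings resolve identically.
    if (strpos($path, './') === 0) {
        $path = substr($path, 2);
    }
    return rtrim($cwd, '/\\') . '/' . $path;
}
```

Under this rule my_bag_path and ./my_bag_path both resolve to the same directory below the working directory, while /tmp/bag and c:/bag are left untouched.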

Allow adding multiple bag-info tags in one call

If you have already got the information you want as an array of tag-name => tag-value, it would be nice to just pass that to the bag instead of having to iterate over the array and call addBagInfoTag in the loop.

Something like addBagInfoTags(array $tags)

Remove bag info tags by value

Allow people to remove a bag info tag by providing the bag info tag and the tag value.

Comparisons on bag info tag will be case insensitive, but comparisons of value would require exact match.

Add framework for BagItProfiles

Develop a way to read and validate a bag against a BagItProfile based on the specification.

This does not need to actually implement a specific profile, hopefully it will be agnostic enough to handle any/all/multiple profiles that fit the specification.

End support for PHP 5.6 & 7.1

My plan is that the next major release (2.0.0) of BagItTools will require PHP 7.2.

@henning-gerhardt, @mjordan, @elizoller does this change in requirements cause any concerns?

I know that islandora_bagger currently requires 7.1.3. If it is more desirable I can move to a PHP 7.1 requirement for the 2.0.0 version and PHP 7.2 for 3.0.0.

Just trying to see how many versions I need to maintain.

Handling creation errors

If the destination directory or NFS share runs out of space, is accidentally remounted read-only, or the creation of directories and files fails for some other reason, there is no reaction except PHP warnings like

PHP Warning:  mkdir(): No space left on device in /<path_to_application>/vendor/whikloj/bagittools/src/Bag.php on line 1142
PHP Warning:  file_put_contents(/<path_to_bag>/bagit.txt): failed to open stream: No such file or directory in /<path_to_application>/vendor/whikloj/bagittools/src/Bag.php on line 1749
PHP Warning:  copy(/<path_to_bag>/data/<some_file>): failed to open stream: No space left on device in /<path_to_application>/vendor/whikloj/bagittools/src/Bag.php on line 441

The return values of the functions used, such as mkdir(), copy() and file_put_contents(), should be checked, and in case of an error an exception with a clear error message should be thrown, or something similar should be done.
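As an illustration of checking return values, the sketch below wraps mkdir() and converts a failure into an exception carrying the underlying PHP message. It is a generic pattern, not the change made in BagItTools; makeDirectoryOrFail() is a hypothetical name and RuntimeException stands in for whatever exception class the library would use.

```php
<?php
// Hypothetical sketch (not part of BagItTools): create a directory and
// throw instead of only emitting a PHP warning when it fails.

function makeDirectoryOrFail(string $path, int $mode = 0777): void
{
    if (is_dir($path)) {
        return;
    }
    // "@" suppresses the warning; the failure is reported via the exception.
    // The second is_dir() check covers a concurrent creation of the directory.
    if (!@mkdir($path, $mode, true) && !is_dir($path)) {
        $error = error_get_last();
        throw new RuntimeException(sprintf(
            'Unable to create directory %s: %s',
            $path,
            $error['message'] ?? 'unknown error'
        ));
    }
}
```

The same shape applies to copy() and file_put_contents(): test for a false return value and raise an exception that names the path involved.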
