dcache-endit-provider's Introduction

dCache ENDIT Provider

This is the Efficient Northern Dcache Interface to TSM (ENDIT) dCache provider plugin. It interfaces with the ENDIT daemons to form an integration for the IBM Spectrum Protect (TSM) storage system.

Installation

To install the plugin, unpack the tarball in the dCache plugin directory (usually /usr/local/share/dcache/plugins).

Configuration

There are two flavors of the ENDIT provider: The watching provider and the polling provider.

The watching provider uses the least system resources.

The polling provider is the most performant. It is what's used in production on NDGF and what we recommend using.

Watching provider

To use, define a nearline storage in the dCache admin interface:

hsm create osm the-hsm-name endit -directory=/path/to/endit/directory

The endit directory must be on the same file system as the pool's data directory.

The above will create a provider that uses the JVM's file event notification feature, which in most cases maps directly to a native file event notification facility of the operating system.
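As a rough illustration of the JVM feature mentioned above, the following sketch uses java.nio.file.WatchService to report files created in a directory. It is not the plugin's code: the class and method names (DirectoryWatchSketch, watch, nextCreated) are hypothetical, and only creation events are handled.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the plugin.
final class DirectoryWatchSketch {
    /** Registers dir for creation events; the caller should close the returned service. */
    static WatchService watch(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        return watcher;
    }

    /** Waits up to the timeout and returns the names (relative to dir) created in the next batch. */
    static List<Path> nextCreated(WatchService watcher, long timeout, TimeUnit unit)
            throws InterruptedException {
        List<Path> created = new ArrayList<>();
        WatchKey key = watcher.poll(timeout, unit);
        if (key == null) {
            return created; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            // OVERFLOW events mean notifications were lost; a real provider would rescan.
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                created.add((Path) event.context());
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return created;
    }
}
```

No thread has to wake up periodically with this approach, which fits the statement that the watching provider uses the least system resources.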

Polling provider

To use a provider that polls for changes, use:

hsm create osm the-hsm-name endit-polling -directory=/path/to/endit/directory

This provider accepts two additional options with the following default values:

-threads=20
-period=5000

The first is the number of threads used for polling for file changes and the second is the poll period in milliseconds.

For sites with large request queues, we recommend increasing the thread count further; 200 threads are used in production on NDGF.
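To make the roles of the two options concrete, here is a minimal sketch of a generic poller in which the thread count sizes a scheduled thread pool and the period is the delay between checks. This is an assumption about how such options are typically used, not the plugin's actual implementation; PollingSketch and awaitFile are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical poller, not part of the plugin.
final class PollingSketch {
    private final ScheduledExecutorService executor;
    private final long periodMillis;

    /** threads corresponds to -threads, periodMillis to -period (for example 20 and 5000). */
    PollingSketch(int threads, long periodMillis) {
        this.executor = Executors.newScheduledThreadPool(threads);
        this.periodMillis = periodMillis;
    }

    /** Completes with the path once the file exists; checks once per period. */
    CompletableFuture<Path> awaitFile(Path file) {
        CompletableFuture<Path> result = new CompletableFuture<>();
        ScheduledFuture<?> poller = executor.scheduleWithFixedDelay(() -> {
            if (Files.exists(file)) {
                result.complete(file);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        // Stop polling once the result is known or the caller cancels it.
        result.whenComplete((path, error) -> poller.cancel(false));
        return result;
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

With many outstanding requests, each check occupies a pool thread for its duration, which is one plausible reason a larger pool helps sites with long request queues.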

Notes on the provider behaviour

  • The polling provider does not monitor request files once they are created. Editing or deleting them has no effect from dCache's perspective.
  • Before writing a new request file, the polling provider checks whether the requested file already exists in the /in folder. If so, it moves the file into the pool's inventory without staging anything.
  • The polling provider overwrites existing request files when the pool receives a request that isn't satisfied by the content of the /in folder. This matters for retried recalls from the pool and for pool restarts!
  • The polling provider checks for error files on every poll. If such a file exists for a requested file, its content is read and raised verbatim as an exception from the staging task. Because the exception is raised, the task is aborted and all related files should get purged.
  • The error file's path has to be /request/<pnfsid>.err
  • Shutting down the polling provider and/or the pool does clean up existing request files.
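The error-file behaviour described in the notes above can be sketched as follows. This is an illustration under assumptions: the class, the method, and the exception type StageFailedException are hypothetical stand-ins, and UTF-8 is assumed as the file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the plugin.
final class ErrorFileSketch {
    /** Stand-in exception type; the name is an assumption. */
    static final class StageFailedException extends Exception {
        StageFailedException(String message) {
            super(message);
        }
    }

    /** Throws if <enditDir>/request/<pnfsid>.err exists, using its content verbatim as the message. */
    static void checkForError(Path enditDir, String pnfsid)
            throws IOException, StageFailedException {
        Path errorFile = enditDir.resolve("request").resolve(pnfsid + ".err");
        if (Files.exists(errorFile)) {
            // Encoding is assumed to be UTF-8; the content is not trimmed or altered.
            String message = new String(Files.readAllBytes(errorFile), StandardCharsets.UTF_8);
            throw new StageFailedException(message);
        }
    }
}
```

A poll loop would call checkForError before looking for the staged file, so that a reported failure aborts the task instead of waiting indefinitely.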

More documentation

More verbose instructions are available at https://wiki.neic.no/wiki/DCache_TSM_interface.

Collaboration

Patches, suggestions, and general improvements are most welcome.

We use the GitHub issue tracker to track and discuss proposed improvements.

When submitting code, open an issue to track/discuss pull-request(s) and refer to that issue in the pull-request. Pull-requests should be based on the master branch.

License

AGPL-3.0, see LICENSE

Versioning

Semantic Versioning 2.0.0

Building

To compile the plugin, run:

mvn package

API

FIXME: The file-based API between the ENDIT dCache plugin and the ENDIT daemons needs to be formally documented. For now, read the source of both for documentation.

dcache-endit-provider's People

Contributors

dependabot[bot], gbehrmann, hmushegh, kofemann, krishna-veni, maswan, vingar, znikke

Forkers

kofemann hmushegh

dcache-endit-provider's Issues

Rethink how we sleep in StageTask poll()

In order to allow dsmc to finish setting attributes etc., there is a sleep() in the StageTask poll().

I suspect that this sleep() might be the reason for the somewhat unexpected behavior that the WatchingProvider is so slow, and the observation that the PollingProvider performs much better provided that we allocate a LOT of threads to it.

My reasoning is that although it's only a Thread.sleep(), it still suspends execution of the thread. This will wreak havoc with the watching provider's performance and also increases the likelihood of event overflows. For the polling provider, the GRACE_PERIOD of 1000 ms correlates directly with the observed staging performance of about 1 Hz per thread.

What we really ought to do is something along the lines of:

  • When we see a file with a correct size, add it to a FIFO queue together with the current time.
    • Thus allowing the thread to continue processing.
  • Process the queue whenever we have a file at output that's been sitting there for more than GRACE_PERIOD
    • This can probably be done in lots of ways, the naive approach of checking the queue every 100ms will probably be good enough although there is likely a more elegant Java-ish way to do it. Since this ties in to the magic mess of Futures I'm at a loss what the best method may be.

I believe this would allow threads to do actual work instead of sleeping all the time.
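One way to express the queue idea above in Java is java.util.concurrent.DelayQueue, which only hands out elements whose delay has expired. The sketch below is a suggestion under assumptions, not tested against the plugin; GraceQueueSketch, PendingFile, and the method names are hypothetical.

```java
import java.nio.file.Path;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not part of the plugin.
final class GraceQueueSketch {
    /** A file that becomes eligible once its grace period has elapsed. */
    static final class PendingFile implements Delayed {
        final Path file;
        private final long readyAtNanos;

        PendingFile(Path file, long graceMillis) {
            this.file = file;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(graceMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    private final DelayQueue<PendingFile> queue = new DelayQueue<>();
    private final long graceMillis;

    GraceQueueSketch(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Called when a file with the correct size is seen; returns immediately. */
    void fileLooksComplete(Path file) {
        queue.add(new PendingFile(file, graceMillis));
    }

    /** Non-blocking: a file whose grace period has expired, or null if none is ready. */
    Path pollReady() {
        PendingFile ready = queue.poll();
        return ready == null ? null : ready.file;
    }

    /** Blocking: waits until the oldest file's grace period has expired. */
    Path takeReady() throws InterruptedException {
        return queue.take().file;
    }
}
```

A single consumer thread calling takeReady() would replace the per-task sleep, so worker threads are free as soon as they have queued a file.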

Possibility for cancelled/aborted stages to leave stray files in pool

Considering the abort() function

return Files.deleteIfExists(requestFile) && Files.deleteIfExists(errorFile) && Files.deleteIfExists(inFile);

I think it should probably delete the final destination file as well, if it exists. Since the stage was aborted, the pool will have no knowledge of the file if the provider was too quick in placing it there. This of course assumes that abort() isn't used for general cleanup after a successful stage.
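A related detail: in Java, && short-circuits, so in the expression quoted above the later deletions are skipped whenever an earlier deleteIfExists returns false because the file was absent. The sketch below attempts every deletion, including a final destination file passed in by the caller. The helper name deleteAll and the choice to return true when at least one file was removed are assumptions; what abort() is expected to return is not stated here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the plugin.
final class AbortSketch {
    /**
     * Attempts to delete every given file, for example the request file,
     * error file, in file and final destination file.
     * Returns true if at least one file was actually removed.
     */
    static boolean deleteAll(Path... files) throws IOException {
        boolean deletedAny = false;
        for (Path file : files) {
            // |= does not short-circuit: each file is attempted regardless of earlier results.
            deletedAny |= Files.deleteIfExists(file);
        }
        return deletedAny;
    }
}
```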

It looks to me that there is no check in poll()

public Set<Checksum> poll() throws IOException, InterruptedException, EnditException

that verifies the stage is still active before moving a staged file to its final destination. Does the dCache nearline SPI handle this, or should the provider do some additional checks?

exception from getXattrs()?

Hello,

TRIUMF tested an adapted version of 1.0.9 with dCache 7.2.25 and found that flush requests got stuck. It turned out that the following statement in FlushTask.java was the source of the issue, and commenting it out bypassed the issue.

xattrs = request.getFileAttributes().getXattrs();

Stuck flush meant that no flush requests were processed. To track it down, I added a few debug messages, including ones before and after the call. The debug message before the call appeared only once, for the first flush request, and no other debug messages from FlushTask showed up after it. There was no similar debug message for the remaining flush requests, so it seemed that flush requests were stuck.

There were still other debug messages from WatchingEnditNearlineStorage, so the plugin did not seem to have crashed. We tested flush requests only, so we're not sure whether stage requests would have worked or not.

After more investigation, I found that XATTR was not defined, so guard() raised an IllegalStateException. I wrapped the statement in an if/else as a temporary fix.

The first question that came to mind was: are we the only site that has this issue? If so, why was XATTR not defined in our case? What is needed to fill the map?

The second question is what happened to the exception. I'm not a Java person, so I can only guess that something higher up in the call stack handled it, or that only that line crashed(?).
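One possible explanation, stated as an assumption because nothing here confirms how FlushTask is executed: when a task runs on a Java executor, an unchecked exception is stored in the returned Future and never printed unless something calls get() on it. The self-contained sketch below shows that general behaviour with a stand-in task; the class name and the exception message are illustrative only.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demonstration, not part of the plugin.
final class SwallowedExceptionSketch {
    /** Runs a failing task on an executor and returns the cause seen via Future.get(). */
    static Throwable runFailingTask() throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for a task that hits an undefined attribute.
            Runnable task = () -> {
                throw new IllegalStateException("XATTR not defined");
            };
            Future<?> future = executor.submit(task);
            try {
                future.get(); // without this call the exception stays invisible
                return null;
            } catch (ExecutionException e) {
                return e.getCause();
            }
        } finally {
            executor.shutdown();
        }
    }
}
```

If the plugin's tasks are run this way, the failure would leave no stack trace in the log and the request would simply never complete, which would match the symptoms described.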

Could you please enlighten me about it?

Thank you and kind regards,
Yun-Ha

P.S.
When I first adapted ENDIT, back when Gerd introduced this wonderful plugin, I changed activate() to activateWithPath() to guarantee that we're able to get path info all the time. If I remember correctly, at that time it was not guaranteed that path info was always available in StorageInfo. I don't think that results in undefined XATTR, but that's the only technical difference.

It seems that now you're extracting path info from StorageInfo. Is path always in StorageInfo now? If so, I'd like to change it back to simple activate method.

Change polling provider default threads from 1 to 20

The polling provider currently defaults to -threads=1, which is too low for anything except functional/developer tests.

Consequently, this should be changed to a more reasonable default. Earlier tests have shown that the big performance improvement happens when the thread count is increased to 20, so let's use that as the default instead.

As a reference, NDGF uses 200 threads on their production tape pools for absolute maximum performance when processing huge numbers of requests.

Error while compiling the plugin

Hi,

I am trying to install the plugin to use it in a new dCache instance.

The compilation fails with the following error:

[ERROR] /root/dcache-endit-provider/src/main/java/org/ndgf/endit/AbstractEnditNearlineStorage.java:[137,24] method transformAsync in class com.google.common.util.concurrent.Futures cannot be applied to given types;
  required: com.google.common.util.concurrent.ListenableFuture<I>,com.google.common.util.concurrent.AsyncFunction<? super I,? extends O>,java.util.concurrent.Executor
  found: com.google.common.util.concurrent.ListenableFuture<java.lang.Void>,<anonymous com.google.common.util.concurrent.AsyncFunction<java.lang.Void,java.lang.Void>>
  reason: cannot infer type-variable(s) I,O
    (actual and formal argument lists differ in length)

As far as I can see, Maven is using the right Java version:

# mvn -version
Apache Maven 3.0.5 (Red Hat 3.0.5-17)
Maven home: /usr/share/maven
Java version: 1.8.0_342, vendor: Red Hat, Inc.
Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre
Default locale: en_US, platform encoding: ANSI_X3.4-1968
OS name: "linux", version: "3.10.0-1160.76.1.el7.x86_64", arch: "amd64", family: "unix"

Maybe I am missing something obvious, could you please help me with this?

Thank you.

Cristina

New start() method in NearlineStorage v7.2

Just as a heads-up: we tested an adapted version of endit-provider at TRIUMF with dCache v7.2. With 7.2, WatchingEnditNearlineStorage raised exceptions caused by the following one:

java.lang.AbstractMethodError: Receiver class org.ndgf.endit.WatchingEnditNearlineStorage does not define or inherit an implementation of the resolved method 'void start()' of interface org.dcache.pool.nearline.spi.NearlineStorage.

It turned out that NearlineStorage in v7.2 introduced a new start() method:

default void start() throws IOException {}

which was hidden by the private start() method in WENS. We changed the access modifier of start() in WENS to public as a quick fix. I'd like to let you know that this project may need the same change, or a cleaner one.
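The quick fix described above amounts to the following shape. The interface here is a stand-in that models only the start() method quoted above, not the real org.dcache.pool.nearline.spi.NearlineStorage, and WatchingStorageSketch is a hypothetical class. When compiled against an interface that declares start(), a private method with the same signature is rejected by the compiler, so the runtime error is presumably seen only with a plugin built against an older version of the interface.

```java
import java.io.IOException;

// Stand-in for the dCache interface; only the quoted method is modelled.
interface NearlineStorageLike {
    default void start() throws IOException {}
}

// Hypothetical provider class, not the plugin's implementation.
final class WatchingStorageSketch implements NearlineStorageLike {
    private boolean started;

    // Public, so that it properly overrides the interface's default method.
    @Override
    public void start() throws IOException {
        started = true; // a real provider would start its watcher here
    }

    boolean isStarted() {
        return started;
    }
}
```

A cleaner alternative would be to rename the private helper so that it no longer collides with the interface method, leaving the default start() untouched.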
