profiler's Introduction

Profiler

The Profiler is a Java library that profiles an arbitrary file, i.e. determines its mediatype, format variant and language.

How to use the Profiler

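import java.io.File;
import java.util.List;
// Profiler, DefaultProfiler and Profile are provided by this library.
File myFile = new File("/path/to/my/file");
Profiler profiler = new DefaultProfiler();
List<Profile> detectedProfiles = profiler.profile(myFile); // may throw IOException or ProfilingException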

API

A profiler is a Java class that implements the Profiler interface (Profiler.java). The interface declares a single method:

List<Profile> profile(File file) throws IOException, ProfilingException;

A profiler can return multiple Profile objects for a single file when the data is ambiguous. The order of the returned profiles matters: the profile with the highest confidence should come first in the list.

The Profile is a simple data object that stores a data profile (e.g. mediatype, language, version, other features). Use the static builder() method to create a profile builder, or the nested Profile.Flat class for serialization and deserialization.
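The ordering rule above can be sketched as follows. This is a minimal illustration, not the library's code: CandidateProfile is a hypothetical stand-in for Profile, and its numeric confidence field is an assumption made only so that the candidates can be sorted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the library's Profile class; only the fields needed here.
final class CandidateProfile {
    final String mediaType;
    final String language;
    final double confidence; // assumed score, not part of the documented API

    CandidateProfile(String mediaType, String language, double confidence) {
        this.mediaType = mediaType;
        this.language = language;
        this.confidence = confidence;
    }
}

final class ProfileOrdering {
    // Returns a new list with the most confident candidate first; the input is not modified.
    static List<CandidateProfile> byConfidence(List<CandidateProfile> candidates) {
        List<CandidateProfile> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((CandidateProfile p) -> p.confidence).reversed());
        return sorted;
    }
}
```

A caller that only needs the best guess can then take the first element of the returned list and ignore the rest.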

The default profiler

A profiler can be specialized for detection of a single format, or be more general and perform just a few simple tests then delegate the detection to other specialized profilers. The main profiler (DefaultProfiler) invokes the Apache Tika library for detecting the general mediatype, then invokes specialized profilers for various formats (xml, text). These profilers in turn invoke more specialized profilers.
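The delegation pattern described above could look roughly like the sketch below. It rests on several assumptions: MediaTypeProfiler is a hypothetical, simplified interface that returns mediatype strings instead of Profile objects, the register method is invented for illustration, and the general detection step is a trivial content check standing in for the Apache Tika call made by the real DefaultProfiler.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simplified interface: returns mediatype strings instead of Profile objects.
interface MediaTypeProfiler {
    List<String> profile(File file) throws IOException;
}

final class DelegatingProfiler implements MediaTypeProfiler {
    // Specialized profilers keyed by the general mediatype they refine.
    private final Map<String, MediaTypeProfiler> specialized = new HashMap<>();

    void register(String generalMediaType, MediaTypeProfiler profiler) {
        specialized.put(generalMediaType, profiler);
    }

    // Stand-in for the general detection step; deliberately naive.
    private static String detectGeneralType(File file) throws IOException {
        String text = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        return text.trim().startsWith("<") ? "application/xml" : "text/plain";
    }

    @Override
    public List<String> profile(File file) throws IOException {
        String general = detectGeneralType(file);
        MediaTypeProfiler delegate = specialized.get(general);
        if (delegate != null) {
            List<String> refined = delegate.profile(file);
            if (!refined.isEmpty()) {
                return refined; // a specialized profiler recognized the file
            }
        }
        // No specialized profiler matched: fall back to the general mediatype.
        return Collections.singletonList(general);
    }
}
```

Because a delegate is itself a MediaTypeProfiler, it can hold its own table of more specialized profilers, which gives the layered structure the paragraph describes.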

Adding a specialized profiler

To add your own specialized profiler, first add a separate class with its implementation, then find the place where a call to your profiler should be inserted, starting from the DefaultProfiler.

For instance, a profiler for an xml subformat would be placed in the eu.clarin.switchboard.profiler.xml package, and a call to it would be placed in the more general XmlProfiler, in the profile method.
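As a rough illustration of what such an xml subformat profiler might check, the sketch below reads the root element of a file with the standard StAX API and reports a TEI mediatype when the root element is TEI. The class names RootElementSniffer and TeiSketchProfiler are hypothetical, the result is a plain string rather than a Profile, and the way the real XmlProfiler calls its specialized profilers is not shown here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class RootElementSniffer {
    // Returns the local name of the root element, or empty if the file is not well-formed XML.
    static Optional<String> rootElement(File file) throws IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // do not process DTDs
        try (InputStream in = new FileInputStream(file)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return Optional.of(reader.getLocalName());
                    }
                }
                return Optional.empty();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return Optional.empty();
        }
    }
}

// Hypothetical specialized profiler for one xml subformat.
final class TeiSketchProfiler {
    static final String TEI_MEDIA_TYPE = "application/tei+xml";

    // Returns the TEI mediatype when the root element is TEI, otherwise empty.
    static Optional<String> detect(File file) throws IOException {
        return RootElementSniffer.rootElement(file)
                .filter("TEI"::equals)
                .map(root -> TEI_MEDIA_TYPE);
    }
}
```

In this sketch, the more general profiler would call TeiSketchProfiler.detect(file) from its profile method and fall back to its own generic xml result when the returned Optional is empty.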

profiler's People

Contributors

andmor-, dependabot[bot], emanueldima, proycon

profiler's Issues

Create extension mechanism for recognizing new formats

We need a mechanism to describe custom format variants, mainly for the TEI subformats. The mediatype, root element and schemas (RELAX NG, Schematron, XML Schema or DTD) must be testable with a common Turing-complete language for full power. Adding a new profile should require only adding a new file with a lambda function that takes the list of features and returns a new format (for a match) or an empty result (for a non-match).
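One possible shape for this proposal, sketched in Java under stated assumptions: the FormatRules class, the feature keys "mediaType" and "rootElement", and the format string in the example rule are all hypothetical, and the features are passed as a map rather than a list so that each rule can look them up by name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical registry of format rules; not part of the existing library.
final class FormatRules {
    // Each rule inspects the extracted features and returns a format on a match, or empty otherwise.
    private final List<Function<Map<String, String>, Optional<String>>> rules = new ArrayList<>();

    void add(Function<Map<String, String>, Optional<String>> rule) {
        rules.add(rule);
    }

    // Returns the result of the first matching rule, in registration order.
    Optional<String> match(Map<String, String> features) {
        for (Function<Map<String, String>, Optional<String>> rule : rules) {
            Optional<String> result = rule.apply(features);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    // Example rule set with a single, illustrative TEI rule.
    static FormatRules example() {
        FormatRules registry = new FormatRules();
        registry.add(features ->
                "application/xml".equals(features.get("mediaType"))
                        && "TEI".equals(features.get("rootElement"))
                        ? Optional.of("application/tei+xml")
                        : Optional.empty());
        return registry;
    }
}
```

With this shape, supporting a new variant means adding one more lambda to the registry; schema checks would need the schema location exposed as another feature.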

Failure to recognise CMDI profiles

When I give the Switchboard a PID pointing to a VCR collection (say http://hdl.handle.net/11372/VC-1034), the profiler returns the media-type "application/xhtml+xml", so no matching tools are found. When I manually correct the media-type to "application/x-cmdi+html", I get the CMDI explorer tool (the one tool that I expect for my input).
