xdg-mime-rs's Introduction

xdg-mime-rs

Xdg-mime-rs is a library that parses the shared-mime-info database and allows querying it to determine the MIME type of a file from its extension or from its contents.

Xdg-mime-rs is a complete re-implementation of the xdgmime C library, with some added functionality that typically resides in higher level components, like determining the appropriate icon name for a file from the icon theme.

Documentation.

Installation

Add the following to your Cargo.toml file:

[dependencies]
xdg-mime = "^0.4"

or install cargo-edit and call:

Copyright and license

Copyright 2020 Emmanuele Bassi

This software is distributed under the terms of the Apache License version 2.0.

xdg-mime-rs's People

Contributors

anomalocaridid, ebassi, federicomenaquintero, hfiguiere, veeshi

xdg-mime-rs's Issues

Test data/filename for specific mime type(s)

A nice feature would be to test if some data or a filename is a certain mimetype or mimetypes. This way it will be performant as it will only be carrying out a check for the requested mime types and not for each mimetype.
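A rough sketch of what such a check could look like for file names follows. Everything in it is hypothetical: the `Glob` struct is a simplified stand-in rather than the crate's type, `file_name_is_any_of` is an invented name, and only literal patterns and `*suffix` patterns are handled.

```rust
/// Simplified stand-in for a glob entry (assumption, not the crate's type).
pub struct Glob {
    pub mime_type: String,
    pub pattern: String,
}

/// Returns true if `file_name` matches a glob registered for one of the
/// MIME types in `wanted`. Globs for other MIME types are never evaluated.
pub fn file_name_is_any_of(globs: &[Glob], file_name: &str, wanted: &[&str]) -> bool {
    globs
        .iter()
        // Only consider globs that belong to a requested MIME type.
        .filter(|g| wanted.contains(&g.mime_type.as_str()))
        .any(|g| match g.pattern.strip_prefix('*') {
            // "*.ttl" style pattern: compare the suffix.
            Some(suffix) => file_name.ends_with(suffix),
            // Literal pattern such as "Makefile": compare the whole name.
            None => file_name == g.pattern,
        })
}
```

The point is only that the set of requested types filters the candidates before any matching happens; a real implementation would also need to cover full glob syntax and content sniffing.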

Error handling strategy

Right now, read_globs_v1_from_file and the v2 equivalent return an Option<Vec<Glob>>, with the scheme that None means an I/O error happened, and Some(v) means only the lines that didn't have syntax errors.

How detailed would you like to make this? For example, with this:

pub fn read_globs_v1_from_file<P: AsRef<Path>>(file_name: P) -> Result<Vec<Glob>, ReadGlobsError> { ... }

enum ReadGlobsError {
    Io(io::Error),
    Syntax(GlobLineSyntaxError),
}

struct GlobLineSyntaxError {
    line_num: usize,
    reason: Whatever,
}

It would let you abort on the first syntax error and report it. If you'd rather return the globs for the valid lines, and maybe a list of syntax errors on the side, that's also possible:

Result<(Vec<Glob>, Vec<GlobLineSyntaxError>), io::Error>
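To make the second option concrete, here is a minimal sketch. It assumes a simplified `Glob` with two string fields and a `String` in place of `Whatever` for the reason, and it splits out a hypothetical `parse_globs_v1` helper so the parsing can be tested without touching the file system. It assumes the v1 `mime/type:pattern` line format.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Simplified stand-in for the crate's `Glob` type (assumption).
#[derive(Debug, PartialEq)]
pub struct Glob {
    pub mime_type: String,
    pub pattern: String,
}

/// One rejected line: where it was and why.
#[derive(Debug, PartialEq)]
pub struct GlobLineSyntaxError {
    pub line_num: usize,
    pub reason: String,
}

/// Hypothetical helper: parses `mime/type:pattern` lines.
/// Valid lines become globs; invalid ones are collected instead of dropped.
pub fn parse_globs_v1(contents: &str) -> (Vec<Glob>, Vec<GlobLineSyntaxError>) {
    let mut globs = Vec::new();
    let mut errors = Vec::new();

    for (idx, line) in contents.lines().enumerate() {
        let line = line.trim();
        // Skip blank lines and comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        match line.split_once(':') {
            Some((mime, pattern)) if !mime.is_empty() && !pattern.is_empty() => {
                globs.push(Glob {
                    mime_type: mime.to_string(),
                    pattern: pattern.to_string(),
                });
            }
            _ => errors.push(GlobLineSyntaxError {
                line_num: idx + 1, // 1-based, as an editor would show it
                reason: format!("expected `mime/type:pattern`, got `{}`", line),
            }),
        }
    }
    (globs, errors)
}

/// I/O failures abort; syntax errors are returned alongside the valid globs.
pub fn read_globs_v1_from_file<P: AsRef<Path>>(
    file_name: P,
) -> Result<(Vec<Glob>, Vec<GlobLineSyntaxError>), io::Error> {
    let contents = fs::read_to_string(file_name)?;
    Ok(parse_globs_v1(&contents))
}
```

With this shape a caller can ignore the error list entirely, log it, or treat a non-empty list as fatal, so the policy decision stays outside the parser.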

Unable to guess `text/turtle` type

Hi,

Here's the simple code I use to read a Turtle file:

let filename = path.file_name().unwrap().to_str().unwrap();
let meta = std::fs::metadata(path)?;
let mut file = File::open(path)?;
let mut buffer = [0; 512];
let len = file.read(&mut buffer)?; // number of bytes actually read

let guess = mime_db.guess_mime_type().file_name(filename).metadata(meta).data(&buffer[0..len]).guess();

It is unable to guess the correct text/turtle type for the following file:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .

<http://www.w3.org/TR/rdf-syntax-grammar>
  dc:title "RDF/XML Syntax Specification (Revised)" ;
  ex:editor [
    ex:fullname "Dave Beckett";
    ex:homePage <http://purl.org/net/dajobe/>
  ] .

and instead guesses text/plain. It does succeed on some other Turtle files. Note that the following command finds the correct media type:

$ xdg-mime query filetype example.ttl
text/turtle

Also note that on example.ttl, get_mime_types_from_file_name returns a Vec of size 2 with the duplicate entry text/turtle.

AliasesList::unalias_mime_type() uses slow linear search in a sorted Vec

I'm writing a program that heavily utilizes MIME alias resolution. I've noticed that xdg-mime's alias resolution function (AliasesList::unalias_mime_type()) takes linear runtime in the number of MIME aliases on the current system, which scales poorly (my system has around 300).

xdg-mime-rs/src/alias.rs

Lines 68 to 73 in 23204a3

pub fn unalias_mime_type(&self, mime_type: &Mime) -> Option<Mime> {
    self.aliases
        .iter()
        .find(|a| a.alias == *mime_type)
        .map(|a| a.mime_type.clone())
}

Oddly enough, it appears that AliasesList.aliases is sorted every time you call add_aliases() (so hopefully it's never out of order), yet unalias_mime_type() calls .iter().find() which performs a linear search. I think it would probably be faster, and scale better, to turn the vector into a slice, then run binary search (binary_search or binary_search_by).

Note that I haven't done any benchmarks yet. I probably should run benchmarks, and afterwards either submit a PR to fix this, or add a comment saying that binary search isn't faster in practice.
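For reference, a sketch of the binary-search variant. It is not the crate's code: `Alias` and `AliasesList` here are simplified stand-ins that use `String` instead of `Mime`, and the sketch assumes the vector is kept sorted by the `alias` field, which would need to be checked against the ordering the crate actually uses.

```rust
/// Simplified stand-in for the crate's alias entry (assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Alias {
    pub alias: String,
    pub mime_type: String,
}

#[derive(Debug, Default)]
pub struct AliasesList {
    aliases: Vec<Alias>,
}

impl AliasesList {
    pub fn add_aliases(&mut self, new: Vec<Alias>) {
        self.aliases.extend(new);
        // Keep the vector sorted by alias so lookups can use binary search.
        self.aliases.sort_by(|a, b| a.alias.cmp(&b.alias));
    }

    /// O(log n) lookup; relies on `aliases` being sorted by `alias`.
    pub fn unalias_mime_type(&self, mime_type: &str) -> Option<String> {
        self.aliases
            .binary_search_by(|a| a.alias.as_str().cmp(mime_type))
            .ok()
            .map(|idx| self.aliases[idx].mime_type.clone())
    }
}
```

One difference from `.iter().find()`: if the same alias appears more than once, binary search may return any of the matching entries rather than the first one.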

GuessBuilder: specifying file_name ignores sniffed content mime

Hello again @ebassi :)

First of all wanted to thank you for this crate.

I found a minor issue in GuessBuilder::path() where the mime type derived from content is ignored.


The file in question: ~/.dotfiles/bin/gp

Content:

#!/usr/bin/bash
git add --all
git commit -a -m "$@"
git push -u origin $(git rev-parse --abbrev-ref HEAD)

Here's what fails (returning application/octet-stream) and what works.

Fails:

    let guess = SHARED_MIME_DB.guess_mime_type().path(path).guess();

Fails:

    let guess = SHARED_MIME_DB
        .guess_mime_type()
        .file_name(&path.as_ref().to_string_lossy())
        .data(&std::fs::read(&path)?)
        .guess();

Works:

    let guess = SHARED_MIME_DB
        .guess_mime_type()
        .data(&std::fs::read(path)?)
        .guess();

I tried to figure out why this happens but nothing stood out to me in the sniffing logic. Any help would be appreciated.

Make checking for `application/x-zerosize` optional

I would like checking for application/x-zerosize to be optional. This would be useful when knowing the "intended" (for lack of a better term) type of the file is more important than its exact contents.

This could probably be accomplished by adding a boolean field to GuessBuilder and a method to set it that would be checked in guess when the length of the file is being checked.

I intend to do a PR for this myself, but since this affects the public API, I figured it would be a good idea to discuss what approach to take first.
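To anchor the discussion, here is a stripped-down illustration of the boolean-field approach. `SketchGuessBuilder`, its fields, and the `zero_size` setter name are all hypothetical; the real `GuessBuilder` does far more, and this sketch only models the order of the checks.

```rust
/// Hypothetical, stripped-down builder; not the crate's `GuessBuilder`.
#[derive(Debug)]
pub struct SketchGuessBuilder {
    name_guess: Option<String>,
    data: Vec<u8>,
    zero_size: bool,
}

impl SketchGuessBuilder {
    pub fn new() -> Self {
        // The zero-size check stays on by default, preserving current behavior.
        Self { name_guess: None, data: Vec::new(), zero_size: true }
    }

    /// Stand-in for whatever the file name lookup produced.
    pub fn name_guess(mut self, mime: &str) -> Self {
        self.name_guess = Some(mime.to_string());
        self
    }

    pub fn data(mut self, data: &[u8]) -> Self {
        self.data = data.to_vec();
        self
    }

    /// Proposed setter: pass `false` to skip the zero-size check.
    pub fn zero_size(mut self, check: bool) -> Self {
        self.zero_size = check;
        self
    }

    pub fn guess(self) -> String {
        // Empty content wins only while the check is enabled.
        if self.zero_size && self.data.is_empty() {
            return "application/x-zerosize".to_string();
        }
        self.name_guess
            .unwrap_or_else(|| "application/octet-stream".to_string())
    }
}
```

Because the flag defaults to `true`, existing callers would see no change, and only callers that opt out get the name-based type for empty files.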

Content sniffing issues

Hey @ebassi

I'm attempting to migrate to this crate for handlr.
Having some issues with content type detection using get_mime_type_for_data.

Here's the code in question:

static SHARED_MIME_DB: Lazy<SharedMimeInfo> = Lazy::new(SharedMimeInfo::new);
pub fn from_filename(name: &str) -> Result<Mime> {
    let mimes = SHARED_MIME_DB.get_mime_types_from_file_name(&name);
    dbg!(&name, &mimes);

    match &*mimes {
        [mime_type] if mime_type == &"application/octet-stream" => {
            let buf = std::fs::read_to_string(name)?;
            let file_mimes =
                SHARED_MIME_DB.get_mime_type_for_data(buf.as_ref());
            dbg!(&buf, file_mimes);

            match SHARED_MIME_DB.get_mime_type_for_data(buf.as_ref()) {
                Some((mime, _)) => Ok(Mime(mime.to_string())),
                None => Err(Error::Ambiguous),
            }
        }
        [mime_type, ..] => Ok(Mime(mime_type.to_string())),
        &[] => Err(Error::Ambiguous),
    }
}

`SharedMimeInfo::get_mime_types_from_file_name()` returns guesses in a wrong order

SharedMimeInfo::get_mime_types_from_file_name() and respectively GlobMap::lookup_mime_type_for_file_name() return mime guesses in a wrong/unintuitive order, with the less likely mime-types before the more likely ones.

The offending line is

matching_globs.sort_by(|a, b| a.weight.cmp(&b.weight));
which sorts the matching globs in ascending weight order instead of descending.

This affects GuessBuilder::guess() too, as it assumes in general that get_mime_types_from_file_name() returns the more likely mimes first, e.g:

xdg-mime-rs/src/lib.rs

Lines 430 to 432 in 23204a3

// If there are conflicts, and the data does not help us,
// we just pick the first result
if let Some(mime_type) = name_mime_types.get(0) {

Reproduction

Given globs2 with

50:audio/x-mod:*.mod
40:application/x-object:*.mod

get_mime_types_from_file_name(".mod") returns ["application/x-object", "audio/x-mod"]

Expected result

It should return ["audio/x-mod", "application/x-object"]
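A minimal sketch of the fix, using a simplified `Glob` with only the two fields needed here (an assumption; the crate's struct has more) and a hypothetical helper name. Swapping the operands of the comparison is enough to reverse the order:

```rust
/// Simplified stand-in for the crate's glob entry (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Glob {
    pub weight: i32,
    pub mime_type: String,
}

/// Sort so that the heaviest (most likely) glob comes first.
pub fn sort_by_weight_desc(matching_globs: &mut [Glob]) {
    // Comparing `b` against `a` yields descending order. `sort_by` is
    // stable, so globs with equal weight keep their relative order.
    matching_globs.sort_by(|a, b| b.weight.cmp(&a.weight));
}
```

Applied to the two globs2 entries from the reproduction, this puts audio/x-mod (weight 50) ahead of application/x-object (weight 40).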

Read the MIME cache binary blob

The shared-mime-info database has two representations:

  • the various plain files: aliases, globs2, icons, magic, etc.
  • the mime.cache file, which contains all of the above in a single blob, easily memory mappable

We should add a parser for the mime.cache file, and use it whenever it's available, instead of loading all the other files.
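As a starting point, here is a sketch of reading the beginning of the blob. It assumes the layout used by the xdgmime C implementation: big-endian integers, a 16-bit major version at offset 0, a 16-bit minor version at offset 2, and a 32-bit alias list offset at offset 4. These offsets should be verified against the shared-mime-info specification before relying on them. `CacheHeader` and the helper names are hypothetical.

```rust
/// Hypothetical view of the first fields of a mime.cache header.
#[derive(Debug, PartialEq)]
pub struct CacheHeader {
    pub major_version: u16,
    pub minor_version: u16,
    pub alias_list_offset: u32,
}

/// Read a big-endian u16, returning None if the buffer is too short.
pub fn read_u16_be(buf: &[u8], offset: usize) -> Option<u16> {
    let bytes = buf.get(offset..offset.checked_add(2)?)?;
    Some(u16::from_be_bytes([bytes[0], bytes[1]]))
}

/// Read a big-endian u32, returning None if the buffer is too short.
pub fn read_u32_be(buf: &[u8], offset: usize) -> Option<u32> {
    let bytes = buf.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Parse the leading header fields (offsets are an assumption, see above).
pub fn parse_header(buf: &[u8]) -> Option<CacheHeader> {
    Some(CacheHeader {
        major_version: read_u16_be(buf, 0)?,
        minor_version: read_u16_be(buf, 2)?,
        alias_list_offset: read_u32_be(buf, 4)?,
    })
}
```

The helpers take a plain byte slice, so the same code would work whether the file is read into a `Vec<u8>` or memory mapped.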
