ebassi / xdg-mime-rs

Rust crate for querying the shared-mime-info database

License: Apache License 2.0

Rust 99.77% Shell 0.16% HTML 0.07%

rust rust-crate xdg mime-types freedesktop

xdg-mime-rs's Introduction

xdg-mime-rs

Xdg-mime-rs is a library that parses the shared-mime-info database and allows querying it to determine the MIME type of a file from its extension or from its contents.

Xdg-mime-rs is a complete re-implementation of the xdgmime C library, with some added functionality that typically resides in higher level components, like determining the appropriate icon name for a file from the icon theme.

Documentation.

Installation

Add the following to your Cargo.toml file:

[dependencies]
xdg-mime = "^0.4"

or install cargo-edit and call:

cargo add xdg-mime@0.4

Copyright and license

This software is distributed under the terms of the Apache License version 2.0.

xdg-mime-rs's People

Contributors

Stargazers

Watchers

Forkers

hfiguiere federicomenaquintero veeshi liias nethunterslabs

xdg-mime-rs's Issues

Test data/filename for specific mime type(s)

A nice feature would be a way to test whether some data or a file name matches a given MIME type, or one of a set of MIME types. This should be faster than a full guess, because only the requested MIME types need to be checked rather than every type in the database.
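As a rough illustration of the idea, the sketch below filters the glob entries down to the requested MIME types before doing any pattern matching. `GlobEntry` and `file_name_is_any_of` are hypothetical names, not part of the crate's API, and only `*.ext` and literal patterns are handled.

```rust
/// Hypothetical, simplified glob entry (not the crate's own type):
/// a MIME type plus either a `*.ext` pattern or a literal file name.
struct GlobEntry {
    mime_type: &'static str,
    pattern: &'static str,
}

/// Returns true if `file_name` matches a glob belonging to one of the
/// `wanted` MIME types. Entries for other types are skipped before any
/// pattern comparison takes place.
fn file_name_is_any_of(globs: &[GlobEntry], file_name: &str, wanted: &[&str]) -> bool {
    let name = file_name.to_lowercase();
    globs
        .iter()
        .filter(|g| wanted.contains(&g.mime_type))
        .any(|g| match g.pattern.strip_prefix('*') {
            // `*.ext` style pattern: compare the suffix, ignoring case.
            Some(suffix) => name.ends_with(suffix.to_lowercase().as_str()),
            // Anything else is treated as a literal file name.
            None => name == g.pattern.to_lowercase(),
        })
}
```

With a table holding `image/png` for `*.png` and `text/plain` for `*.txt`, asking whether `photo.PNG` is one of `image/png` or `image/jpeg` only ever compares against the PNG entry.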

Warning about `dirs` crate being unmaintained

GitHub has been sending security warnings about my project since it started to depend on xdg-mime.

According to the advisory (link below), the dirs crate is unmaintained and dirs-next should be used instead. Both crates seem to have identical APIs.

https://github.com/RustSec/advisory-db/blob/main/crates/dirs/RUSTSEC-2020-0053.md

Error handling strategy

Right now, read_globs_v1_from_file and the v2 equivalent return an Option<Vec<Glob>>, where None means an I/O error happened, and Some(v) contains only the lines that didn't have syntax errors.

How detailed would you like to make this? For example, with this:

pub fn read_globs_v1_from_file<P: AsRef<Path>>(file_name: P) -> Result<Vec<Glob>, ReadGlobsError> { ... }

enum ReadGlobsError {
    Io(io::Error),
    Syntax(GlobLineSyntaxError),
}

struct GlobLineSyntaxError {
    line_num: usize,
    reason: Whatever,
}

It would let you abort on the first syntax error and report it. If you'd rather return the globs for the valid lines, and maybe a list of syntax errors on the side, that's also possible:

Result<(Vec<Glob>, Vec<GlobLineSyntaxError>), io::Error>
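To make the second option concrete, here is a minimal sketch that reads from any `BufRead` rather than a file path. The `Glob` fields, the `reason` type, and the name `read_globs_v1` are simplified stand-ins chosen for illustration. It assumes the v1 format of one `mime/type:pattern` entry per line, with `#` starting a comment.

```rust
use std::io::{self, BufRead};

/// Simplified stand-in for the crate's glob type.
#[derive(Debug, PartialEq)]
struct Glob {
    mime_type: String,
    pattern: String,
}

#[derive(Debug, PartialEq)]
struct GlobLineSyntaxError {
    line_num: usize,
    reason: &'static str,
}

/// Reads `mime/type:pattern` lines. I/O errors abort the whole read;
/// syntax errors are collected next to the globs that did parse.
fn read_globs_v1<R: BufRead>(
    reader: R,
) -> Result<(Vec<Glob>, Vec<GlobLineSyntaxError>), io::Error> {
    let mut globs = Vec::new();
    let mut errors = Vec::new();

    for (idx, line) in reader.lines().enumerate() {
        let line = line?; // propagate I/O errors to the caller
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // blank lines and comments are not errors
        }
        match line.split_once(':') {
            Some((mime, pattern)) if !mime.is_empty() && !pattern.is_empty() => {
                globs.push(Glob {
                    mime_type: mime.to_string(),
                    pattern: pattern.to_string(),
                });
            }
            Some(_) => errors.push(GlobLineSyntaxError {
                line_num: idx + 1, // report 1-based line numbers
                reason: "empty MIME type or pattern",
            }),
            None => errors.push(GlobLineSyntaxError {
                line_num: idx + 1,
                reason: "missing ':' separator",
            }),
        }
    }
    Ok((globs, errors))
}
```

The caller can then decide whether a non-empty error list is fatal, worth logging, or safe to ignore, which the current Option-based scheme does not allow.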

Unable to guess `text/turtle` type

Hi,

Here's the simple code I use to read a Turtle file:

let filename = path.file_name().unwrap().to_str().unwrap();
let meta = std::fs::metadata(path)?;
let mut file = File::open(path)?;
let mut buffer = [0; 512];
let len = file.read(&mut buffer)?;

let guess = mime_db.guess_mime_type().file_name(filename).metadata(meta).data(&buffer[0..len]).guess();

It is unable to guess the correct text/turtle type for the following file:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .

<http://www.w3.org/TR/rdf-syntax-grammar>
  dc:title "RDF/XML Syntax Specification (Revised)" ;
  ex:editor [
    ex:fullname "Dave Beckett";
    ex:homePage <http://purl.org/net/dajobe/>
  ] .

and instead guesses text/plain. It sometimes succeeds on other Turtle files. Note that the following command successfully finds the correct media type:

$ xdg-mime query filetype example.ttl
text/turtle

Also note that on example.ttl, get_mime_types_from_file_name returns a Vec of size 2 with the duplicate entry text/turtle.

AliasesList::unalias_mime_type() uses slow linear search in a sorted Vec

I'm writing a program that heavily utilizes MIME alias resolution. I've noticed that xdg-mime's alias resolution function (AliasesList::unalias_mime_type()) takes linear runtime in the number of MIME aliases on the current system, which scales poorly (my system has around 300).

xdg-mime-rs/src/alias.rs

Lines 68 to 73 in 23204a3

    
    pub fn unalias_mime_type(&self, mime_type: &Mime) -> Option<Mime> {
        self.aliases
            .iter()
            .find(|a| a.alias == *mime_type)
            .map(|a| a.mime_type.clone())
    }

Oddly enough, it appears that AliasesList.aliases is sorted every time you call add_aliases() (so hopefully it's never out of order), yet unalias_mime_type() calls .iter().find() which performs a linear search. I think it would probably be faster, and scale better, to turn the vector into a slice, then run binary search (binary_search or binary_search_by).

Note that I haven't done any benchmarks yet. I probably should run benchmarks, and afterwards either submit a PR to fix this, or add a comment saying that binary search isn't faster in practice.
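The proposed change could look roughly like the sketch below. It uses plain strings instead of the crate's `Mime` type so that it stands alone, and the constructor `new` is an assumption made for the example. The lookup relies on the vector being sorted by alias.

```rust
/// Stand-in for the crate's alias entry, using strings instead of `Mime`.
struct Alias {
    alias: String,
    mime_type: String,
}

/// Alias list whose entries are kept sorted by `alias`.
struct AliasesList {
    aliases: Vec<Alias>,
}

impl AliasesList {
    /// Hypothetical constructor: sorts once so lookups can binary search.
    fn new(mut aliases: Vec<Alias>) -> Self {
        aliases.sort_by(|a, b| a.alias.cmp(&b.alias));
        AliasesList { aliases }
    }

    /// O(log n) lookup; correct only while `aliases` stays sorted by alias.
    fn unalias_mime_type(&self, mime_type: &str) -> Option<&str> {
        self.aliases
            .binary_search_by(|a| a.alias.as_str().cmp(mime_type))
            .ok()
            .map(|idx| self.aliases[idx].mime_type.as_str())
    }
}
```

Whether this beats the linear scan for a few hundred entries is still something only a benchmark can settle.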

GuessBuilder: specifying file_name ignores sniffed content mime

Hello again @ebassi :)

First of all wanted to thank you for this crate.

I found a minor issue in GuessBuilder::path() where the mime type derived from content is ignored.

The file in question: ~/.dotfiles/bin/gp

Content:

#!/usr/bin/bash
git add --all
git commit -a -m "$@"
git push -u origin $(git rev-parse --abbrev-ref HEAD)

Here's what fails (returning application/octet-stream) and what works.

Fails:

    let guess = SHARED_MIME_DB.guess_mime_type().path(path).guess();

Fails:

    let guess = SHARED_MIME_DB
        .guess_mime_type()
        .file_name(&path.as_ref().to_string_lossy())
        .data(&std::fs::read(&path)?)
        .guess();

Works:

    let guess = SHARED_MIME_DB
        .guess_mime_type()
        .data(&std::fs::read(path)?)
        .guess();

I tried to figure out why this happens but nothing stood out to me in the sniffing logic. Any help would be appreciated.
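For reference, the behaviour this report expects can be written down as a small rule. The sketch below is not the crate's implementation; `combine_guesses` is a hypothetical helper that takes the name-based and content-based results as plain strings.

```rust
/// MIME type used when nothing more specific is known.
const FALLBACK: &str = "application/octet-stream";

/// Hypothetical rule for merging a name-based and a content-based guess:
/// a specific name match wins, otherwise the sniffed type is used, and
/// the fallback is returned only when both are uninformative.
fn combine_guesses<'a>(from_name: Option<&'a str>, from_data: Option<&'a str>) -> &'a str {
    match (from_name, from_data) {
        // The file name produced something more specific than the fallback.
        (Some(name), _) if name != FALLBACK => name,
        // No useful name match: trust the content sniffing result.
        (_, Some(data)) => data,
        // Nothing helped.
        _ => FALLBACK,
    }
}
```

Under this rule, a script named `gp` with no extension and a `#!/usr/bin/bash` first line would resolve to whatever the content sniffer reports rather than application/octet-stream.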

I think you meant `xdg-mime = "^0.3"` in the README

and not xdg_mime = "^0.3", with a _.

MagicRule should use word_size

Right now it is parsed from the magic file, but ignored in the code.
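As far as I understand the magic file format, word_size gives the size of the groups in which the value and mask should be byte-swapped on little-endian hosts. This should be checked against the shared-mime-info specification. A minimal sketch of that swap, with `swap_words` as a hypothetical helper name:

```rust
/// Reverses the bytes inside each `word_size`-byte group of `bytes`.
/// Returns false and leaves `bytes` untouched if `word_size` is zero or
/// does not evenly divide the length.
fn swap_words(bytes: &mut [u8], word_size: usize) -> bool {
    if word_size == 0 || bytes.len() % word_size != 0 {
        return false;
    }
    for word in bytes.chunks_exact_mut(word_size) {
        word.reverse();
    }
    true
}
```

A caller would apply this to both the value and the mask of a rule, and only when the host is little-endian and word_size is greater than one.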

Make checking for `application/x-zerosize` optional

I would like checking for application/x-zerosize to be optional. This would be useful when knowing the "intended" (for lack of a better term) type of the file is more important than its exact contents.

This could probably be accomplished by adding a boolean field to GuessBuilder and a method to set it that would be checked in guess when the length of the file is being checked.

I intend to do a PR for this myself, but since this affects the public API, I figured it would be a good idea to discuss what approach to take first.
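To make the proposal easier to discuss, here is a stripped-down sketch of the builder shape. The setter name `zero_size` and the default of `true` are assumptions for illustration, and the real GuessBuilder has more fields and a richer return type.

```rust
/// Stripped-down stand-in for the crate's GuessBuilder.
struct GuessBuilder<'a> {
    data: &'a [u8],
    /// Whether empty data should be reported as application/x-zerosize.
    zero_size: bool,
}

impl<'a> GuessBuilder<'a> {
    fn new(data: &'a [u8]) -> Self {
        // Assumed default: keep the current behaviour unless told otherwise.
        GuessBuilder { data, zero_size: true }
    }

    /// Hypothetical setter for the proposed flag.
    fn zero_size(mut self, check: bool) -> Self {
        self.zero_size = check;
        self
    }

    fn guess(&self) -> &'static str {
        if self.zero_size && self.data.is_empty() {
            return "application/x-zerosize";
        }
        // A real implementation would continue with file name and content
        // matching here; this sketch just returns the generic fallback.
        "application/octet-stream"
    }
}
```

Keeping the check enabled by default would leave existing callers unaffected, so only code that opts out sees a different result for empty files.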

Content sniffing issues

Hey @ebassi

I'm attempting to migrate to this crate for handlr.
Having some issues with content type detection using get_mime_type_for_data.

Here's the code in question:

static SHARED_MIME_DB: Lazy<SharedMimeInfo> = Lazy::new(SharedMimeInfo::new);
pub fn from_filename(name: &str) -> Result<Mime> {
    let mimes = SHARED_MIME_DB.get_mime_types_from_file_name(&name);
    dbg!(&name, &mimes);

    match &*mimes {
        [mime_type] if mime_type == &"application/octet-stream" => {
            let buf = std::fs::read_to_string(name)?;
            let file_mimes =
                SHARED_MIME_DB.get_mime_type_for_data(buf.as_ref());
            dbg!(&buf, file_mimes);

            match SHARED_MIME_DB.get_mime_type_for_data(buf.as_ref()) {
                Some((mime, _)) => Ok(Mime(mime.to_string())),
                None => Err(Error::Ambiguous),
            }
        }
        [mime_type, ..] => Ok(Mime(mime_type.to_string())),
        &[] => Err(Error::Ambiguous),
    }
}

`SharedMimeInfo::get_mime_types_from_file_name()` returns guesses in a wrong order

SharedMimeInfo::get_mime_types_from_file_name() and respectively GlobMap::lookup_mime_type_for_file_name() return mime guesses in a wrong/unintuitive order, with the less likely mime-types before the more likely ones.

The offending line is

xdg-mime-rs/src/glob.rs

Line 285 in 23204a3

matching_globs.sort_by(|a, b| a.weight.cmp(&b.weight));

which sorts the matching globs in ascending weight order instead of descending order.

This affects GuessBuilder::guess() too, as it assumes in general that get_mime_types_from_file_name() returns the more likely mimes first, e.g:

xdg-mime-rs/src/lib.rs

Lines 430 to 432 in 23204a3

    
    // If there are conflicts, and the data does not help us,
    // we just pick the first result
    if let Some(mime_type) = name_mime_types.get(0) {

Reproduction

Given globs2 with

50:audio/x-mod:*.mod
40:application/x-object:*.mod

get_mime_types_from_file_name(".mod") returns ["application/x-object", "audio/x-mod"]

Expected result

It should return ["audio/x-mod", "application/x-object"]
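The reproduction can be replayed with a small self-contained sketch. `WeightedGlob`, `parse_globs2_line`, and `mime_types_by_weight` are hypothetical names, only `*.ext` patterns are handled, and any fields after the pattern are ignored. The relevant part is the comparator, which swaps `a` and `b` to sort by descending weight.

```rust
/// Simplified globs2 entry: `weight:mime/type:pattern`.
#[derive(Debug)]
struct WeightedGlob {
    weight: u32,
    mime_type: String,
    pattern: String,
}

/// Parses one globs2 line, returning None on malformed input.
fn parse_globs2_line(line: &str) -> Option<WeightedGlob> {
    let mut parts = line.split(':');
    let weight = parts.next()?.parse::<u32>().ok()?;
    let mime_type = parts.next()?.to_string();
    let pattern = parts.next()?.to_string();
    Some(WeightedGlob { weight, mime_type, pattern })
}

/// Returns the MIME types whose `*.ext` pattern matches `file_name`,
/// most likely (highest weight) first.
fn mime_types_by_weight(lines: &[&str], file_name: &str) -> Vec<String> {
    let mut matching: Vec<WeightedGlob> = lines
        .iter()
        .filter_map(|l| parse_globs2_line(l))
        .filter(|g| {
            g.pattern
                .strip_prefix('*')
                .map_or(false, |suffix| file_name.ends_with(suffix))
        })
        .collect();

    // Descending order: compare b against a, not a against b.
    matching.sort_by(|a, b| b.weight.cmp(&a.weight));
    matching.into_iter().map(|g| g.mime_type).collect()
}
```

Because `sort_by` is stable, entries with equal weight keep the order in which they appeared in the file.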

Read the MIME cache binary blob

The shared-mime-info database has two representations:

the various plain files: aliases, globs2, icons, magic, etc.
the mime.cache file, which contains all of the above in a single blob, easily memory mappable

We should add a parser for the mime.cache file, and use it whenever it's available, instead of loading all the other files.
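A parser would start with the header. As far as I recall from the shared-mime-info specification, mime.cache begins with a big-endian 16-bit major version and minor version, followed by a series of big-endian 32-bit section offsets, the first of which points at the alias list. That layout is an assumption to verify against the spec. `CacheHeader` and `parse_cache_header` are hypothetical names.

```rust
/// First fields of the mime.cache header (assumed layout, see above).
#[derive(Debug, PartialEq)]
struct CacheHeader {
    major_version: u16,
    minor_version: u16,
    alias_list_offset: u32,
}

/// Reads the leading header fields from a cache blob, or returns None
/// if the blob is too short to contain them.
fn parse_cache_header(blob: &[u8]) -> Option<CacheHeader> {
    if blob.len() < 8 {
        return None;
    }
    Some(CacheHeader {
        // All integers are assumed to be stored big-endian.
        major_version: u16::from_be_bytes([blob[0], blob[1]]),
        minor_version: u16::from_be_bytes([blob[2], blob[3]]),
        alias_list_offset: u32::from_be_bytes([blob[4], blob[5], blob[6], blob[7]]),
    })
}
```

Since the function only borrows a byte slice, the same code would work on a memory-mapped file as well as on a buffer read with `std::fs::read`.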

HTML files without doctype detected as plain text, despite extension

Hi @ebassi

Here's a reproduction of the bug.

test.html: (incorrect: mime guess is text/plain despite html extension)

<p>test</p>

test_doctype.html (correct: mime guess is text/html)

<!DOCTYPE html>
asdf
