Comments (7)

worldveil commented on July 22, 2024

Only doing checksums on mp3s is pretty limiting.

I think something like this is what we want. I'd prefer sha1, but it doesn't matter too much. Basically what is important is that the hashing is done on small parts of the file rather than the entire thing to avoid memory issues.
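A minimal sketch of the chunked hashing described above, assuming Python's standard `hashlib`. The function name `file_sha1` and the 1 MiB default block size are illustrative choices, not anything taken from dejavu:

```python
import hashlib


def file_sha1(path, block_size=2**20):
    """Return the hex SHA-1 digest of a file, reading it one block at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF),
        # so at most block_size bytes of the file are held in memory at once.
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Feeding the blocks to `update()` one after another gives the same digest as hashing the whole file in one call, so the block size only affects memory use and speed. Swapping `hashlib.sha1` for `hashlib.md5` would work the same way.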

from dejavu.

pguridi commented on July 22, 2024

What about skipping the metadata? Two files with different ID3 tags are still the same audio file.

worldveil commented on July 22, 2024

Ah yes, good point. Perhaps for now, if the file is an mp3, we use the library you mentioned above (it appears to process only small pieces in memory at a time), and otherwise just do a straight md5/sha1 hash of the file contents? Not perfect, but a good way to start. Further contributions could add support for other audio file types (.wav, .ogg, etc.).
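To illustrate the idea, here is a rough pure-Python sketch. It is not the library discussed in this thread. It assumes the only metadata is an ID3v2 tag at the start of the file and/or a 128-byte ID3v1 tag at the end; other tag formats are not handled. The names `audio_byte_range` and `audio_sha1` are hypothetical:

```python
import hashlib
import os


def audio_byte_range(path):
    """Return (start, end) offsets of the bytes between the ID3v2 and ID3v1 tags."""
    size = os.path.getsize(path)
    start, end = 0, size
    with open(path, "rb") as f:
        header = f.read(10)
        if len(header) == 10 and header[:3] == b"ID3":
            # The ID3v2 tag size is a "synchsafe" integer: 7 usable bits per byte.
            b = header[6:10]
            start = 10 + ((b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3])
            if header[5] & 0x10:  # ID3v2.4 "footer present" flag adds 10 bytes
                start += 10
        if size - start >= 128:
            f.seek(size - 128)
            if f.read(3) == b"TAG":  # ID3v1 tag occupies the last 128 bytes
                end = size - 128
    return min(start, end), end


def audio_sha1(path, block_size=2**20):
    """Hex SHA-1 of the file contents with ID3 tags skipped, read block by block."""
    start, end = audio_byte_range(path)
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(block_size, remaining))
            if not block:
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()
```

A file with no ID3 tags falls through to a plain hash of its full contents, which matches the fallback proposed for non-mp3 files.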

pguridi commented on July 22, 2024

Sounds good. The project I linked is a standalone C app, but if we can use it for the .mp3s, I'll write a Python wrapper for it.

Wessie commented on July 22, 2024

I don't really feel dejavu should be the one keeping track of what files have been fingerprinted.

At best I would just use an md5 of the full file's contents. Metadata changes will of course change the result, but is that really something dejavu should be worrying about?

worldveil commented on July 22, 2024

It's an interesting idea. Perhaps we add a checksum field to the songs table? Then you check it before you fingerprint. It wouldn't add any appreciable disk usage.

The good news is that since we enforce the uniqueness constraint, the exact same file being hashed won't take up space unnecessarily (inserts will be ignored). It would however waste CPU cycles. Given that, I can see the argument for including checksums.

Again, as long as we don't affect the performance of other aspects, I think this is a worthy feature.
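A sketch of what the checksum field could look like, using an in-memory SQLite database so it runs anywhere. The table layout, the column name `file_sha1`, and the helper names are assumptions for illustration, not dejavu's actual schema or database backend:

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create a hypothetical songs table with a unique checksum."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS songs ("
        " song_id INTEGER PRIMARY KEY,"
        " song_name TEXT NOT NULL,"
        " file_sha1 TEXT NOT NULL UNIQUE)"
    )
    return conn


def already_fingerprinted(conn, checksum):
    """Cheap lookup to run before fingerprinting a file."""
    row = conn.execute(
        "SELECT 1 FROM songs WHERE file_sha1 = ?", (checksum,)
    ).fetchone()
    return row is not None


def register_song(conn, name, checksum):
    """Insert the song; return True if a row was added, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO songs (song_name, file_sha1) VALUES (?, ?)",
        (name, checksum),
    )
    conn.commit()
    return cur.rowcount == 1
```

The `UNIQUE` constraint keeps duplicate inserts from taking up space, while the lookup in `already_fingerprinted` is what saves the CPU time, since it lets the caller skip fingerprinting altogether.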

thesunlover commented on July 22, 2024

Let's weigh the options.

If we add the feature:
+ we will not re-fingerprint the same songs
+ we will not need to look up the same hashes in the DB during recognition
- we will do some additional CPU work, but how much depends only on how often we add new songs.
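The trade-off above can be sketched as a small loop: every file costs one hash pass, and only files with an unseen checksum go through the expensive step. `fingerprint_new` and the `fingerprint` callback are hypothetical names, and a plain set stands in for whatever store holds the known checksums:

```python
import hashlib


def fingerprint_new(paths, known_checksums, fingerprint, block_size=2**20):
    """Call fingerprint(path) only for files whose SHA-1 is not in known_checksums.

    Returns the list of paths that were actually fingerprinted.
    """
    done = []
    for path in paths:
        digest = hashlib.sha1()
        with open(path, "rb") as f:
            # Hash in blocks so large files are never loaded whole.
            for block in iter(lambda: f.read(block_size), b""):
                digest.update(block)
        checksum = digest.hexdigest()
        if checksum in known_checksums:
            continue  # same bytes seen before: skip the expensive step
        fingerprint(path)
        known_checksums.add(checksum)
        done.append(path)
    return done
```

The extra hashing happens only when songs are added, not during recognition.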
