
Comments (4)

kam193 commented on June 12, 2024

Great, thank you! I believe EPUB/Mobi are the most important, and when it comes to extracting data, I would extract everything because a) it prevents hiding content inside a book archive, and b) it triggers specialized services to look at the core files.

When it comes to other formats, not only books: I think the most essential is to extract content if the file is any form of a known archive. In this case, I wouldn't bother you with the identification if the file had been identified as an archive and extracted.

from assemblyline.

gdesmar commented on June 12, 2024

The unit/integration tests within Assemblyline are going to be on the light side, because I want to make absolutely sure we do not get into any trouble with books, but from my manual testing, it should all work.

Mobi files and AZW3 files, which are based on Mobi files, are going to be identified as document/mobi. Extract is going to be able to extract their content thanks to the https://github.com/iscc/mobi library. It looks like there is a chance that a mobi file ends up extracting an epub file, which would go back to Extract and be extracted again. The system should use caching for duplicated images and files, but if you find it bothersome in the filetree or for some other reason, you're welcome to open another ticket for Extract. Instead of an epub, it may extract an html or pdf file, and I wasn't sure if in those cases we still wanted it.
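To make the "identified as document/mobi" part a bit more concrete, here is a minimal sketch of a signature check. It assumes the usual PalmDB container layout, where the 8-byte type/creator field at offset 60 reads `BOOKMOBI`; to my understanding AZW3 files carry the same marker, which is why they would land in the same bucket. The function name is hypothetical and this is not how Identify itself is implemented.

```python
def looks_like_mobi(data: bytes) -> bool:
    """Hypothetical helper: check the PalmDB type/creator field for BOOKMOBI.

    Assumption: the file is a PalmDB container whose type ("BOOK") and
    creator ("MOBI") fields sit at byte offsets 60-67 of the header.
    """
    return data[60:68] == b"BOOKMOBI"


# Build a fake 78-byte PalmDB header: 60 bytes of name/attributes,
# the 8-byte type/creator field, then 10 bytes of padding.
fake_header = b"\x00" * 60 + b"BOOKMOBI" + b"\x00" * 10
print(looks_like_mobi(fake_header))
print(looks_like_mobi(b"PK\x03\x04 some zip data"))
```

A check like this only says the container looks like Mobi; the actual unpacking would still be left to the mobi library mentioned above.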

I very quickly looked into other ebook formats, and am not certain how often people would need to analyze them, or how much of an attack vector they are. A few of them have not been updated for a long time, like DjVu since 2005 and FictionBook since 2008, and are very difficult to identify (FictionBook is just a big xml file). For those reasons, I'll wait for a direct request before adding support for other ebook formats.

For now, we need a new minor core release/build for Identify to be updated. I don't plan on triggering it until we get a few more things in, but keep an eye out for 4.5.0.6 or higher, and then the next Extract build. :)


gdesmar commented on June 12, 2024

If you're in a hurry, you can set use_custom_safelisting to False in Extract to bypass the Java-specific (and other) safelisting until it is correctly fixed. I will add identification for epub files as document/epub, and handle it like a simple zip in Extract. I see that Characterize already gives most of the characteristics for it through exiftool. I'm thinking that a screenshot preview of the first page would be nice, but besides that, do you see any other need or anything I'm missing regarding epubs?
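As a rough illustration of the "handle it like a simple zip" idea: an EPUB is a zip container that, per the EPUB container format, includes a `mimetype` entry containing `application/epub+zip`. The sketch below uses only Python's standard `zipfile` module; the function names are hypothetical and this is not the actual Identify or Extract code.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Hypothetical helper: True if data is a zip whose 'mimetype' entry declares an EPUB."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
    except (zipfile.BadZipFile, KeyError):
        # Not a zip at all, or a zip without a 'mimetype' entry.
        return False


def list_epub_members(data: bytes) -> list[str]:
    """Hypothetical helper: list every file in the container, as a plain zip listing."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# Build a tiny in-memory EPUB-like archive to exercise the helpers.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/content.opf", "<package/>")

sample = buffer.getvalue()
print(looks_like_epub(sample))
print(list_epub_members(sample))
```

Listing (or extracting) every member rather than only the book's text is what lets downstream services look at whatever is hidden inside the archive.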


kam193 commented on June 12, 2024

Don't worry, there is no hurry - just an improvement request. A preview of the first page sounds great, and I don't think anything else would be more useful at the moment. Extracting content should allow analyzing whatever there is inside. Would you have time to ensure other ebook formats (Mobi? azw3?) are also supported in the same way? They can probably be handled in almost the same way.

