EPUB identified as java/jar (assemblyline, closed)

kam193 commented on June 12, 2024


Comments (4)

kam193 commented on June 12, 2024

Great, thank you! I believe EPUB/Mobi are the most important, and when it comes to extracting data, I would extract everything because a) it prevents hiding content inside a book archive, and b) it triggers specialized services to look at the core files.

When it comes to other formats, not only books: I think the most essential is to extract content if the file is any form of a known archive. In this case, I wouldn't bother you with the identification if the file had been identified as an archive and extracted.

gdesmar commented on June 12, 2024

The unit and integration tests within Assemblyline are going to be on the light side, because I want to make absolutely sure we do not get into any trouble with books, but from my manual testing, it should all work.

Mobi files and AZW3 files, which are based on Mobi files, are going to be identified as document/mobi. Extract is going to be able to extract their content thanks to the https://github.com/iscc/mobi library. It looks like there is a chance that a mobi file ends up extracting an epub file, which would go back to Extract and be extracted again. The system should use caching for duplicated images and files, but if you find it bothersome in the file tree or for some other reason, you're welcome to open another ticket for Extract. Instead of an epub, it may extract an html or pdf file, and I wasn't sure if in those cases we still wanted it.
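To illustrate the point about duplicated files, here is a minimal sketch of digest-based de-duplication when an archive is extracted more than once. It is not Assemblyline's or Extract's actual caching mechanism; `extract_unique`, `make_zip` and the `seen` set are hypothetical names used only for this example, and a plain ZIP stands in for the ebook container.

```python
import hashlib
import io
import zipfile
from typing import Dict, Set


def make_zip(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name -> content mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def extract_unique(data: bytes, seen: Set[str]) -> Dict[str, bytes]:
    """Return the archive members whose SHA-256 digest is not yet in `seen`.

    `seen` is updated in place, so passing the same set across several
    extractions skips content that was already handled.
    """
    fresh: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            content = archive.read(info)
            digest = hashlib.sha256(content).hexdigest()
            if digest in seen:
                continue  # identical content was already extracted
            seen.add(digest)
            fresh[info.filename] = content
    return fresh
```

With a shared `seen` set, extracting the same book a second time (for example, an epub that came out of a mobi file) yields no new members, which is the effect the caching is meant to have on the file tree.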

I very quickly looked into other ebook formats, and am not certain how often people would need to analyze them, or how much of an attack vector they are. A few of them have not been updated for a long time, like DjVu since 2005 and FictionBook since 2008, and are very difficult to identify (FictionBook is just a big xml file). For those reasons, I'll wait for a direct request before adding support for other ebook formats.

For now, we need a new minor core release/build for Identify to be updated. I don't plan on triggering it until we get a few more things in, but keep an eye out for 4.5.0.6 or higher, and then the next Extract build. :)

gdesmar commented on June 12, 2024

If you're in a hurry, you can set use_custom_safelisting to False in Extract to bypass Java-specific (and other) safelisting until it is correctly fixed. I will add identification for epub files as document/epub, and handle it like a simple zip in Extract. I see that Characterize already gives most of the characteristics for it through exiftool. I'm thinking that a screenshot preview of the first page would be nice, but besides that, would you see any other need or anything I'm missing regarding epubs?
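For readers wondering how an epub can be told apart from other ZIP-based files, here is a minimal sketch. It relies on the EPUB container rule that the first ZIP entry is named `mimetype` and contains `application/epub+zip`. It is not how Assemblyline's Identify is implemented; `looks_like_epub` and `build_sample` are hypothetical helpers for this example only.

```python
import io
import zipfile
from typing import Optional

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Return True if `data` is a ZIP whose first entry declares an EPUB."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            # The EPUB container format puts `mimetype` first in the archive.
            if not names or names[0] != "mimetype":
                return False
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
    except zipfile.BadZipFile:
        return False


def build_sample(mimetype: Optional[bytes]) -> bytes:
    """Build a tiny in-memory ZIP, optionally starting with a `mimetype` entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        if mimetype is not None:
            # Stored uncompressed, as the EPUB container format expects.
            archive.writestr("mimetype", mimetype, compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", "<container/>")
    return buffer.getvalue()
```

Both EPUB and JAR archives carry a `META-INF/` directory, so a check on the `mimetype` entry is one way to separate the two before falling back to generic ZIP handling.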

kam193 commented on June 12, 2024

Don't worry, there is no hurry - just an improvement request. A preview of the first page sounds great; I don't feel there is anything more useful at the moment. Extracting content should allow analyzing whatever there is inside. Would you have time to ensure other ebook formats (Mobi? azw3?) are also supported in the same way? It can probably be done in almost the same way.
