Comments (4)
Great, thank you! I believe EPUB and Mobi are the most important. When it comes to extracting data, I would extract everything, because a) it prevents hiding content inside a book archive, and b) it triggers specialized services to look at the core files.
When it comes to other formats, not only books: I think the most essential thing is to extract the content if the file is any form of known archive. In that case, I wouldn't bother you about the identification, as long as the file had been identified as an archive and extracted.
from assemblyline.
The unit tests and integration tests within Assemblyline are going to be on the light side, because I want to make absolutely sure we do not get into any trouble with books, but from my manual testing, it should all work.
Mobi files and AZW3 files, which are based on Mobi files, are going to be identified as document/mobi. Extract is going to be able to extract their content thanks to the https://github.com/iscc/mobi library. It looks like there is a chance that a mobi file ends up extracting an epub file, which would go back to Extract and be extracted again. The system should use caching for duplicated images and files, but if you find it bothersome in the file tree or for some other reason, you're welcome to open another ticket for Extract. Instead of an epub, it may extract an html or pdf file, and I wasn't sure if we still wanted it in those cases.
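To illustrate why the double extraction should be cheap, here is a minimal sketch of content-hash deduplication. It only shows the general idea and is not Assemblyline's actual caching code; the function name `dedupe_by_sha256` and the sample paths are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def dedupe_by_sha256(
    files: Iterable[Tuple[str, bytes]]
) -> Tuple[Dict[str, str], List[str]]:
    """Group extracted files by the SHA-256 of their content.

    Returns a mapping of digest -> first name seen with that content,
    plus the list of names whose content was already seen.
    """
    seen: Dict[str, str] = {}
    duplicates: List[str] = []
    for name, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            # Same bytes as an earlier file: no need to process it again.
            duplicates.append(name)
        else:
            seen[digest] = name
    return seen, duplicates


# Hypothetical example: the same cover image extracted once from the
# mobi file and once more from the epub that the mobi file contained.
extracted = [
    ("book.mobi/cover.jpg", b"IMG"),
    ("book.epub/cover.jpg", b"IMG"),
    ("book.epub/ch1.html", b"<p>hi</p>"),
]
unique, repeated = dedupe_by_sha256(extracted)
```

With a scheme like this, the second copy of the cover image would be recognized by its hash and would not need to be analyzed twice, even if it still shows up in the file tree.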
I very quickly looked into other ebook formats, and I am not certain how often people would need to analyze them, or how much of an attack vector they are. A few of them have not been updated for a long time, like DjVu since 2005 and FictionBook since 2008, and some are very difficult to identify (FictionBook is just a big XML file). For those reasons, I'll wait for a direct request before supporting other ebook formats.
For now, we need a new minor core release/build for Identify to be updated. I don't plan on triggering it until we get a few more things in, but keep an eye out for 4.5.0.6 or higher, and then the next Extract build. :)
from assemblyline.
If you're in a hurry, you can set use_custom_safelisting to False in Extract to bypass the Java-specific (and other) safelisting until it is correctly fixed. I will add identification for epub files as document/epub, and handle them like a simple zip in Extract. I see that Characterize already gives most of the characteristics for them through exiftool. I'm thinking that a screenshot preview of the first page would be nice, but besides that, would you see any other need or anything I'm missing regarding epubs?
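As a rough sketch of what "handle it like a simple zip" can mean: an EPUB is a zip container that carries a `mimetype` entry declaring `application/epub+zip`, so the standard `zipfile` module is enough to recognize it and read its members. This is only an illustration, not the code used by Identify or Extract; the helper names `looks_like_epub` and `read_members` are hypothetical.

```python
import io
import zipfile
from typing import Dict

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Return True if data is a zip whose 'mimetype' entry declares EPUB."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        try:
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
        except KeyError:
            # A plain zip without a 'mimetype' entry is not an EPUB.
            return False


def read_members(data: bytes) -> Dict[str, bytes]:
    """Read every file in the archive into memory, keyed by its path."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }
```

Reading everything, rather than only the chapters, matches the request above: nothing stays hidden inside the book archive, and each member can be handed to whichever service understands it.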
from assemblyline.
Don't worry, there is no hurry - this is just an improvement request. A preview of the first page sounds great; I don't feel there is anything more useful at the moment. Extracting the content should allow analyzing whatever is inside. Would you have time to ensure other ebook formats (Mobi? AZW3?) are also supported in the same way? It can probably be done in almost the same way.
from assemblyline.