paulproteus / debian-dedup Goto Github PK
View Code? Open in Web Editor NEWFork of git://murkel.subdivi.de/~helmut/debian-dedup.git
License: Other
Fork of git://murkel.subdivi.de/~helmut/debian-dedup.git
License: Other
Required packages ----------------- aptitude install python python-debian python-lzma curl python-jinja2 python-werkzeug sqlite3 python-imaging python-yaml python-concurrent.futures python-pkg-resources Create a database ----------------- The database name is currently hardcoded as `test.sqlite3`. So copy the SQL statements from `schema.sql` into `sqlite3 test.sqlite3`. In addition it is highly recommended to put the database into WAL mode. Otherwise all your reading queries will block forever when doing an import. This setting is permanent. PRAGMA journal_mode = WAL; Import packages --------------- Import individual packages by feeding them to importpkg.py and readyaml.py: ./importpkg.py < somepkg.deb | ./readyaml.py You can import your local apt cache: ./autoimport.py /var/cache/apt/archives Import a full mirror (only http supported): ./autoimport.py -n -p http://your.mirror.example/debian After changing the database, a few tables caching expensive computations need to be (re)generated. Execute `./update_sharing.py`. Without this step the web interface will report wrong results. Viewing the results ------------------- Run `./webapp.py` and enjoy a webinterface at `0.0.0.0:8800` or inspect the SQL database by hand. Here are some example queries. Finding the 100 largest files shared with multiple packages. SELECT pa.name, a.filename, pb.name, b.filename, a.size FROM content AS a JOIN hash AS ha ON a.id = ha.cid JOIN hash AS hb ON ha.hash = hb.hash JOIN content AS b ON b.id = hb.cid JOIN package AS pa ON a.pid = pa.id JOIN package AS pb ON b.pid = pb.id WHERE (a.pid != b.pid OR a.filename != b.filename) ORDER BY a.size DESC LIMIT 100; Finding those top 100 files that save most space when being reduced to only one copy in the archive. SELECT hash, sum(size)-min(size), count(*), count(distinct pid) FROM content JOIN hash ON content.id = hash.cid JOIN function ON hash.fid = function.id WHERE function.name = "sha512" GROUP BY hash ORDER BY sum(size)-min(size) DESC LIMIT 100; Finding PNG images that do not carry a .png file extension. SELECT package.name, content.filename, content.size FROM content JOIN hash ON content.id = hash.cid JOIN package ON content.pid = package.id JOIN function ON hash.fid = function.id WHERE function.name = "png_sha512" AND lower(filename) NOT LIKE "%.png"; Finding .gz files which either are not gziped or contain errors. SELECT package.name, content.filename FROM content JOIN package ON content.pid = package.id WHERE filename LIKE "%.gz" AND (SELECT count(*) FROM hash JOIN function ON hash.fid = function.id WHERE hash.cid = content.id AND function.name = "gzip_sha512") = 0;
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.