Comments (17)

sebbacon commented on August 19, 2024

Some possible mitigation strategies for the current approach, current bugs, etc.:

from alaveteli.

sebbacon commented on August 19, 2024

Install a more recent poppler-utils; e.g. 0.12.0 can definitely convert this to HTML, extracting the images:
http://www.whatdotheyknow.com/request/13903/response/36117/attach/html/4/FOI%20beaver%20site%20species%20audit%20SNH%20review%20of%20proposal%20redact.pdf.html
Really need a "pdftk -nodrm" to remove compression from encrypted PDFs, so that emails get stripped from e.g. http://www.whatdotheyknow.com/request/14414/response/38590/attach/html/3/090807%20FOI.pdf.html

... this misses a whole page out (someone emailed us) http://www.whatdotheyknow.com/request/unredacted_expense_claims_for_jo#incoming-49674

from alaveteli.

sebbacon commented on August 19, 2024

Worth doing View as HTML ourselves for .docx, .ppt, .tif (covered now by Google Docs)
View as HTML for .txt requested

from alaveteli.

sebbacon commented on August 19, 2024

Failed to detect that attachments are emails and decode them:
http://www.whatdotheyknow.com/request/malicious_communication_act#incoming-12964
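
A minimal sketch of the kind of detection meant here, written in Python with the standard-library `email` package rather than Alaveteli's own Ruby/TMail code. The name `nested_emails` and the `.eml` filename heuristic are assumptions for illustration, not taken from the codebase. An attachment is treated as an email if it is declared as `message/rfc822`, or if it is named `*.eml` whatever its declared type.

```python
from email import message_from_bytes, policy


def nested_emails(msg):
    """Yield parsed messages for attachments that are themselves emails.

    `msg` must have been parsed with policy=policy.default so that the
    modern EmailMessage API (iter_attachments, get_content) is available.
    """
    for part in msg.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            # The parser has already decoded the enclosed message.
            yield part.get_content()
        elif (part.get_filename() or "").lower().endswith(".eml"):
            # Mislabelled (e.g. application/octet-stream): parse the raw bytes.
            raw = part.get_payload(decode=True)
            yield message_from_bytes(raw, policy=policy.default)
```

Each yielded message can then go through the same body and attachment extraction as a top-level incoming message.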

from alaveteli.

sebbacon commented on August 19, 2024

When indexing .docx, do we need to index docProps/custom.xml and docProps/app.xml
as well as word/document.xml? (A thread on xapian-discuss does so.)
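
For reference, a rough Python sketch of pulling indexable text out of all three parts, since a .docx is a zip archive of XML files. The function name `docx_index_text` is hypothetical, and this is not how Alaveteli or the xapian-discuss thread does it. Body text is joined per paragraph so that words split across runs stay whole; text boxes and other nested content are not handled specially.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PROPERTY_PARTS = ("docProps/app.xml", "docProps/custom.xml")


def docx_index_text(data):
    """Return {part name: extracted text} for the parts present in `data` (bytes)."""
    texts = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        if "word/document.xml" in names:
            root = ET.fromstring(archive.read("word/document.xml"))
            paragraphs = (
                "".join(t.text or "" for t in p.iter(W + "t"))
                for p in root.iter(W + "p")
            )
            texts["word/document.xml"] = "\n".join(p for p in paragraphs if p)
        for part in PROPERTY_PARTS:
            if part in names:
                root = ET.fromstring(archive.read(part))
                chunks = (chunk.strip() for chunk in root.itertext())
                texts[part] = " ".join(c for c in chunks if c)
    return texts
```

Parts that are missing from the archive are simply left out of the result, so custom.xml can be indexed when present without being required.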

from alaveteli.

sebbacon commented on August 19, 2024

Consider using odt2txt or unoconv
http://www-verimag.imag.fr/~moy/opendocument/

from alaveteli.

sebbacon commented on August 19, 2024

VSD files vsdump - example in zip file
http://www.whatdotheyknow.com/request/dog_control_orders#incoming-3510
doing file RESPONSE/Internal documents/Briefing with Contact Islington/Contact Islington Flowchart Jul 08.vsd content type

from alaveteli.

sebbacon commented on August 19, 2024

Search for other file extensions that we have now and look for ones we could
and should be indexing
(call IncomingMessage.find_all_unknown_mime_types to find them; it needs
updating to work in clumps, as all requests won't load into RAM at once now)
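
The "clumps" idea, sketched in Python rather than in the Rails code itself: fetch a fixed-size page of attachment filenames at a time and keep only a running tally in memory. Everything named here is an assumption for illustration. `fetch_page(offset, limit)` stands in for whatever paged database query is used, and `KNOWN_EXTENSIONS` is a made-up list, not the set Alaveteli actually handles.

```python
import os
from collections import Counter

# Hypothetical set of extensions that are already handled.
KNOWN_EXTENSIONS = {".pdf", ".doc", ".txt", ".html"}


def unknown_extensions(fetch_page, batch_size=1000):
    """Tally unhandled file extensions, loading one page of names at a time.

    `fetch_page(offset, limit)` must return a list of at most `limit`
    filenames starting at `offset`, and an empty list when exhausted.
    """
    counts = Counter()
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            break
        for name in page:
            ext = os.path.splitext(name)[1].lower()
            if ext and ext not in KNOWN_EXTENSIONS:
                counts[ext] += 1
        offset += len(page)
    return counts
```

Memory use is then bounded by the batch size plus the number of distinct extensions, not by the total number of requests.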

from alaveteli.

sebbacon commented on August 19, 2024

Render HTML alternative rather than text (so tables look good) e.g.:
http://www.whatdotheyknow.com/request/parking_policy
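
As an illustration of preferring the HTML alternative, here is a small Python sketch using the standard-library `email` package (not the TMail code Alaveteli uses). The helper name `preferred_body` is made up. It asks for the `text/html` part first and falls back to `text/plain` when no HTML alternative exists.

```python
from email import message_from_bytes, policy


def preferred_body(raw_bytes):
    """Return (content_type, text) for the best body part, HTML first.

    Returns (None, None) if the message has no text/html or text/plain body.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("html", "plain"))
    if body is None:
        return None, None
    return body.get_content_type(), body.get_content()
```

Any HTML returned this way would still need sanitising before being shown on a public page.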

from alaveteli.

sebbacon commented on August 19, 2024

These attachment.bin files should come out as winmail.dat and be parsed
by existing TNEF code. For some reason though TMail doesn't get the right
content-type out of them. Not sure why.
http://www.whatdotheyknow.com/request/acting_up_in_a_higher_rank
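
One possible workaround, sketched in Python as an assumption rather than a diagnosis of the TMail problem: ignore the declared content type for generic attachments and sniff the payload instead. TNEF streams start with the 32-bit signature 0x223E9F78, stored little-endian. The helper names and the choice of fallback MIME type are hypothetical.

```python
# TNEF signature 0x223E9F78 as it appears on disk (little-endian).
TNEF_SIGNATURE = bytes.fromhex("789f3e22")


def is_tnef(data):
    """Return True if `data` (bytes) starts with the TNEF signature."""
    return data[:4] == TNEF_SIGNATURE


def effective_content_type(declared_type, data):
    """Override a generic declared type when the payload is clearly TNEF."""
    if declared_type == "application/octet-stream" and is_tnef(data):
        return "application/ms-tnef"
    return declared_type
```

An attachment.bin detected this way could then be handed to the existing TNEF parser as if it had arrived as winmail.dat.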

from alaveteli.

sebbacon commented on August 19, 2024

Make HTML attachments have view as HTML :)
http://www.whatdotheyknow.com/request/enforced_medication#incoming-7395

from alaveteli.

sebbacon commented on August 19, 2024

Knackered view as HTML:
http://www.whatdotheyknow.com/request/1385/response/5483/attach/html/3/Response%20465.2008.pdf.html

from alaveteli.

sebbacon commented on August 19, 2024

Some other pdftohtml bugs (fix them or file about them)
http://www.whatdotheyknow.com/request/sale_of_public_land#incoming-8146
http://www.whatdotheyknow.com/request/childrens_database_compliance_wi#incoming-8088
http://www.whatdotheyknow.com/request/3326/response/7701/attach/html/2/Scan001.PDF.pdf.html
http://www.whatdotheyknow.com/request/risk_log#incoming-8090 (bad tables)
http://www.whatdotheyknow.com/request/4635/response/11248/attach/html/4/FOI%20request.pdf.html (bad table)
Orientation wrong:
http://www.whatdotheyknow.com/request/3153/response/7726/attach/html/2/258850.pdf.html
Bug in wvHtml, segfaults when converting this:
http://www.whatdotheyknow.com/request/subject_access_request_guide_sar#incoming-10242

Images aren't coming out here
http://www.whatdotheyknow.com/request/33682/response/83455/attach/html/3/100428%20Reply%201519%2010.doc.html

Doesn't correctly detect the doc type of a few garbage results in this list:
http://www.whatdotheyknow.com/search/UWE

from alaveteli.

sebbacon commented on August 19, 2024

.tif files are hard for people to view as multi page, consider automatically
separating out the pages as separate links (to .png files or whatever)
http://www.whatdotheyknow.com/request/windsor_maidenhead_council_commo#incoming-1910
Heck, may as well give thumbnails of all images, indeed all docs while you're at it :)

from alaveteli.

hsenag commented on August 19, 2024

Another knackered HTML conversion: http://www.whatdotheyknow.com/request/registered_pharmacists_prescribi#incoming-245446

from alaveteli.

hsenag commented on August 19, 2024

Just to emphasise that tables that really need the HTML alternative are quite common, e.g.:

 http://www.whatdotheyknow.com/request/it_support_services_1295#incoming-258044
 http://www.whatdotheyknow.com/request/it_support_services_347#incoming-258014
 http://www.whatdotheyknow.com/request/it_support_services_1236#incoming-257000

from alaveteli.

TomSteinberg commented on August 19, 2024

Replaced by #1529 #1528 #1527 - if we've missed any substantive issues please make new specific tickets @hsenag

from alaveteli.
