Comments (7)

GoogleCodeExporter avatar GoogleCodeExporter commented on May 19, 2024
As a workaround, I have the following code that uses 
http://chardet.feedparser.org/ to check for strange encodings:

import chardet

def fix_ffdl_encoding(data):
    ''' Deal with utf8 encoded as a unicode object, amongst others.
        Doesn't deal with lists of objects.'''

    if not isinstance(data, unicode):
        return data
    try:
        # recover the raw bytes; cp1252 is the standard windows western encoding
        endata = data.encode('cp1252')
    except UnicodeEncodeError:
        # chances are it's utf8, or not a string
        return data
    results = chardet.detect(endata)
    # chardet reports an encoding of None when it has no guess at all
    if results['encoding'] and results['confidence'] > 0.8:
        return unicode(endata, results['encoding'])
    else:
        return data

Original comment by [email protected] on 24 Aug 2011 at 3:25


from fanficfare.

GoogleCodeExporter avatar GoogleCodeExporter commented on May 19, 2024
The problem you've found is ultimately caused by the fact that fictionalley.org 
reports all its pages as utf8, even when they're really cp1252.
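To make the mismatch concrete, here is a small Python 3 sketch (the code earlier in this thread is Python 2). The sample string and the helper name decode_or_none are made up for illustration and are not part of the downloader:

```python
# Illustrative only: "café" is a made-up sample, not text from the site.
text = "café"

cp1252_bytes = text.encode("cp1252")  # é is the single byte 0xE9
utf8_bytes = text.encode("utf-8")     # é is the two bytes 0xC3 0xA9


def decode_or_none(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# A cp1252 page trusted as utf8: the lone 0xE9 byte is not valid utf8,
# so a strict decode fails outright.
as_utf8 = decode_or_none(cp1252_bytes, "utf-8")

# The reverse mistake (utf8 bytes read as cp1252) decodes without error,
# but é turns into the two characters Ã and ©.
as_cp1252 = decode_or_none(utf8_bytes, "cp1252")
```

The first case shows up as a decode error, while the second fails silently and produces the garbled text seen in the story description.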

In fact, most of the older stories I've found there that aren't just ascii were 
cp1252.  I don't recall seeing one with true utf8 before.

Re-encoding the description only fixes the problem for this particular story, 
but won't help when the title, chapter names, story text, etc are true utf8.

I'm intrigued by the idea of using chardet or something like it to detect the 
real encoding for sites like fictionalley.org that lie to us.  

It's not necessarily going to solve all the problems, though.  I believe I once 
even saw utf8 and cp1252 on the same page.

Original comment by [email protected] on 14 Sep 2011 at 6:19

  • Changed state: Accepted

GoogleCodeExporter avatar GoogleCodeExporter commented on May 19, 2024
(info update)
chardet can spot utf8 with reasonable confidence in the stories I've tested.  
But it keeps incorrectly calling the windows-1252/ISO-8859-1 texts ISO-8859-2 
with 70-85% confidence.

Here's an example:
http://www.fictionalley.org/authors/aerie22/DWM01a.html

Original comment by [email protected] on 14 Sep 2011 at 8:06


GoogleCodeExporter avatar GoogleCodeExporter commented on May 19, 2024
Hmmm, I think some level of intelligence is possibly useful here. I believe 
fictionalley is 100% English fics, so there are really only two encodings we 
need to worry about - utf8 and cp1252.

utf8 is fairly easy to detect, so maybe say "if chardet reckons it's 95% sure 
it's utf8, call it utf8, otherwise cp1252" and you should catch nearly all the 
edge cases there...
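
A rough Python 3 sketch of that rule, not the project's actual code. The function name pick_encoding and its detect/threshold parameters are hypothetical; detect is assumed to be any callable that returns a dict with 'encoding' and 'confidence' keys, as chardet.detect does:

```python
def pick_encoding(raw, detect=None, threshold=0.95):
    """Return 'utf-8' only when the detector is very sure, else 'cp1252'.

    `detect` is a hypothetical hook: any callable returning a dict with
    'encoding' and 'confidence' keys. It defaults to chardet.detect.
    """
    if detect is None:
        import chardet  # third-party; only needed when no detector is passed
        detect = chardet.detect

    result = detect(raw)
    # chardet may report an encoding of None when it has no guess.
    encoding = (result.get("encoding") or "").lower()
    confidence = result.get("confidence") or 0.0

    if encoding in ("utf-8", "utf8") and confidence >= threshold:
        return "utf-8"
    # Anything else, including a confident but wrong ISO-8859-2 guess,
    # falls back to cp1252 for this English-only site.
    return "cp1252"
```

Because only a confident utf8 result is trusted, a mistaken ISO-8859-2 guess like the one reported above never gets used.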

Original comment by [email protected] on 14 Sep 2011 at 8:26


GoogleCodeExporter avatar GoogleCodeExporter commented on May 19, 2024
That's just what I was thinking.  Check it against the list of encodings 
suitable for the adapter in case there are non-English sites some day.

I'm also considering making it an optional feature that can be turned on/off 
from the ini.  I like options, but I'm not sure it's useful.

Original comment by [email protected] on 14 Sep 2011 at 9:51


GoogleCodeExporter avatar GoogleCodeExporter commented on May 19, 2024
I've added chardet to the system, but I don't trust it completely.  So I've made 
a user customizable option for website encoding and added a pseudo-encoding 
'auto' that uses chardet's encoding, but only if it's 90+% confident.

It's checked into HG, but it's not the default web version yet.  You can pull it 
from HG or try it out here:

http://4-0-5.fanfictionloader.appspot.com/

Be sure to add this to your personal.ini (CLI) or User Configuration (web):

[www.fictionalley.org]
website_encodings: auto, Windows-1252, utf8

Please confirm if this works sufficiently for you.

Changeset: 214 (61d72dfc9f63) 
Add website_encodings option to change encoding list, add 'auto' as encoding 
type.
'auto' uses Universal Encoding Detector(http://chardet.feedparser.org/) to
derive encoding.  It spots utf-8 fairly well, but not iso8859-1/windows-1252,
so we require 90+% confidence before using it.
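
For readers who want a feel for how an ordered encoding list with an 'auto' entry can behave, here is a minimal Python 3 sketch. It is not the code from the changeset: the function name decode_with_fallback, its parameters, and the final ASCII fallback are all assumptions made for illustration.

```python
def decode_with_fallback(raw, encodings=("auto", "Windows-1252", "utf8"),
                         detect=None, min_confidence=0.9):
    """Try each entry in order and return (text, encoding_used).

    'auto' asks a detector for a guess and uses it only at or above
    min_confidence. `detect` is a hypothetical hook that defaults to
    chardet.detect.
    """
    for name in encodings:
        if name.lower() == "auto":
            if detect is None:
                import chardet  # third-party; only needed for 'auto'
                detect = chardet.detect
            guess = detect(raw)
            confidence = guess.get("confidence") or 0.0
            if not guess.get("encoding") or confidence < min_confidence:
                continue  # not trusted, move on to the next entry
            name = guess["encoding"]
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Assumed last resort: keep the ASCII and mark everything else.
    return raw.decode("ascii", errors="replace"), "ascii"
```

With the list from the ini example, a low-confidence guess is skipped and Windows-1252 is tried next. Since cp1252 accepts almost any byte sequence, the trailing utf8 entry would rarely be reached in this sketch.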

Original comment by [email protected] on 15 Sep 2011 at 7:56

  • Changed state: Started

GoogleCodeExporter avatar GoogleCodeExporter commented on May 19, 2024
4.0.5 now default.

Original comment by [email protected] on 19 Sep 2011 at 9:18

  • Changed state: Fixed
