Comments (7)
As a workaround, I have the following code that uses
http://chardet.feedparser.org/ to check for strange encodings:
import chardet

def fix_ffdl_encoding(data):
    '''Deal with utf8 encoded as a unicode object, amongst others.
    Doesn't deal with lists of objects.'''
    if not isinstance(data, unicode):
        return data
    try:
        # standard Windows western encoding
        endata = data.encode('cp1252')
    except UnicodeEncodeError:
        # chances are it's utf8, or not a string
        return data
    results = chardet.detect(endata)
    if results['confidence'] > 0.8:
        return unicode(endata, results['encoding'])
    else:
        return data
Original comment by [email protected]
on 24 Aug 2011 at 3:25
from fanficfare.
The problem you've found is ultimately caused by the fact that fictionalley.org
reports all its pages as utf8, even when they're really cp1252.
In fact, most of the older stories I've found there that aren't just ascii were
cp1252. I don't recall seeing one with true utf8 before.
Re-encoding the description only fixes the problem for this particular story,
but won't help when the title, chapter names, story text, etc are true utf8.
I'm intrigued by the idea of using chardet or something like it to detect the
real encoding for sites like fictionalley.org that lie to us.
It's not necessarily going to solve all the problems, though. I believe I once
even saw utf8 and cp1252 on the same page.
Original comment by [email protected]
on 14 Sep 2011 at 6:19
- Changed state: Accepted
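To illustrate the mismatch described above, here is a minimal Python 3 sketch of what happens when cp1252 bytes are served with a utf8 label. The sample string and the decode_page helper are hypothetical and are not taken from the project's code.

```python
# Hypothetical sample: a curly-quoted word as a cp1252 page would send it.
raw = "\u201cHello\u201d".encode("cp1252")  # bytes 0x93 ... 0x94


def decode_page(raw_bytes, declared="utf-8", fallback="cp1252"):
    """Try the declared encoding first; fall back if the bytes are invalid."""
    try:
        return raw_bytes.decode(declared), declared
    except UnicodeDecodeError:
        # The site said utf8, but the bytes are not valid utf8.
        return raw_bytes.decode(fallback), fallback


text, used = decode_page(raw)
print(used)
```

A strict utf8 decode fails on these bytes, so the fallback is what recovers the quotes. This only helps when the whole page is in one encoding; it does nothing for a page that mixes the two.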
(info update)
chardet can spot utf8 with reasonable confidence in the stories I've tested.
But it keeps incorrectly calling the windows-1252/ISO-8859-1 texts ISO-8859-2
with 70-85% confidence.
Here's an example:
http://www.fictionalley.org/authors/aerie22/DWM01a.html
Original comment by [email protected]
on 14 Sep 2011 at 8:06
Hmmm, I think some level of intelligence is possibly useful here. I believe
fictionalley is 100% English fics, so there are really only two encodings we need
to worry about: utf8 and cp1252.
utf8 is fairly easy to detect, so maybe say "if chardet reckons it's 95% sure
it's utf8, call it utf8, otherwise cp1252" and you should catch nearly all the
edge cases there...
Original comment by [email protected]
on 14 Sep 2011 at 8:26
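The rule suggested in the comment above could look roughly like this in Python 3. It is only a sketch: pick_encoding and the two stand-in detectors are hypothetical names, and the detector is passed in as a parameter so the example runs without chardet installed. It assumes the detector returns a dict with 'encoding' and 'confidence' keys, which is the shape chardet.detect() returns.

```python
def pick_encoding(raw_bytes, detect, threshold=0.95):
    """Return 'utf-8' only when the detector is confident; otherwise 'cp1252'."""
    result = detect(raw_bytes)
    encoding = (result.get("encoding") or "").lower()
    confidence = result.get("confidence") or 0.0
    if encoding in ("utf-8", "utf8") and confidence >= threshold:
        return "utf-8"
    # Anything else, including a confident but wrong ISO-8859-2 guess,
    # is treated as cp1252.
    return "cp1252"


# Stand-in detectors for illustration only; in practice pass chardet.detect.
def sure_utf8(_data):
    return {"encoding": "utf-8", "confidence": 0.99}


def guess_latin2(_data):
    return {"encoding": "ISO-8859-2", "confidence": 0.8}


print(pick_encoding(b"sample", sure_utf8))
print(pick_encoding(b"sample", guess_latin2))
```

Plain ASCII text would also end up as cp1252 under this rule, which is harmless because the two encodings agree on ASCII bytes.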
That's just what I was thinking. Check it against the list of encodings
suitable for the adapter in case there are non-English sites some day.
I'm also considering making it an optional feature that can be turned on/off
from the ini. I like options, but I'm not sure it's useful.
Original comment by [email protected]
on 14 Sep 2011 at 9:51
I've added chardet to the system, but I don't trust it completely. So I've made
a user-customizable option for website encoding and added a pseudo-encoding
'auto' that uses chardet's encoding, but only if it's 90+% confident.
It's checked into HG, but it's not the default web version yet. You can pull it
from HG or try it out here:
http://4-0-5.fanfictionloader.appspot.com/
Be sure to add to your personal.ini (CLI) or User Configuration (web):
[www.fictionalley.org]
website_encodings: auto, Windows-1252, utf8
Please confirm if this works sufficiently for you.
Changeset: 214 (61d72dfc9f63)
Add website_encodings option to change encoding list, add 'auto' as encoding
type.
'auto' uses Universal Encoding Detector(http://chardet.feedparser.org/) to
derive encoding. It spots utf-8 fairly well, but not iso8859-1/windows-1252,
so we require 90+% confidence before using it.
Original comment by [email protected]
on 15 Sep 2011 at 7:56
- Changed state: Started
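For readers wondering how an ordered encoding list with an 'auto' entry might behave, here is a rough Python 3 sketch. It is not the code from the changeset: decode_with_encodings, its parameters, and the final replacement-character fallback are all assumptions made for illustration, and the detector is injected so the example does not need chardet.

```python
def decode_with_encodings(raw_bytes, encodings, detect=None, min_confidence=0.9):
    """Try each entry in order and return (text, encoding_used).

    'auto' asks `detect` for a guess and uses it only at or above
    min_confidence. Every other entry is tried as a strict decode.
    """
    for name in encodings:
        candidate = name.strip()
        if candidate.lower() == "auto":
            if detect is None:
                continue
            guess = detect(raw_bytes)
            confidence = guess.get("confidence") or 0.0
            if not guess.get("encoding") or confidence < min_confidence:
                continue  # not confident enough, move on to the next entry
            candidate = guess["encoding"]
        try:
            return raw_bytes.decode(candidate), candidate
        except (UnicodeDecodeError, LookupError):
            continue
    # Assumed last resort: decode with replacement characters rather than fail.
    return raw_bytes.decode("utf-8", errors="replace"), "utf-8"


# Hypothetical detector mimicking the low-confidence ISO-8859-2 guess
# reported earlier in the thread.
def unsure_detector(_data):
    return {"encoding": "ISO-8859-2", "confidence": 0.8}


configured = "auto, Windows-1252, utf8".split(",")
page = "\u201cHi\u201d".encode("cp1252")
text, used = decode_with_encodings(page, configured, detect=unsure_detector)
print(used)
```

Order matters in a list like this: Windows-1252 accepts almost any byte sequence, so an entry placed after it is rarely reached.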
4.0.5 now default.
Original comment by [email protected]
on 19 Sep 2011 at 9:18
- Changed state: Fixed