Comments (8)
Example search demonstrating the behaviour: https://uidemo.commonsearch.org/?g=en&q=emoji
from cosr-back.
U+1F525 belongs to the Miscellaneous Symbols and Pictographs Unicode block and can be filtered using Python's unicodedata module; the category in question is 'Cn'.
>>> import unicodedata
>>> print u'\U0001F42D'
🐭
>>> [unicodedata.category(c) for c in u'\U0001F42D']
['Cn']
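As a minimal sketch of the category-based filtering described above, assuming Python 3 rather than the Python 2 used in this thread: the helper name `strip_by_category` and the set `DROPPED_CATEGORIES` are hypothetical, not taken from the cosr-back patch. On a recent Python 3 these code points are reported as 'So' rather than 'Cn'; the 'Cn' result above is most likely down to the older Unicode database bundled with Python 2, so the sketch drops both.

```python
import unicodedata

# Assumption: these are the categories to drop. 'So' (Symbol, other) is what
# a current Unicode database reports for most emoji; 'Cn' (unassigned) is
# what an older database reports for code points it does not know about.
DROPPED_CATEGORIES = {"So", "Cn"}


def strip_by_category(text, dropped=DROPPED_CATEGORIES):
    """Return text without characters whose general category is in dropped."""
    return "".join(c for c in text if unicodedata.category(c) not in dropped)


cleaned = strip_by_category("Super \U0001F42D Emoji-Land.com")
```

Note that 'So' is broader than emoji: it also covers symbols such as the copyright sign, so this blacklist removes those as well.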
@Sentimentron that looks like the right solution!
Should we whitelist or blacklist classes? https://en.wikipedia.org/wiki/Unicode_character_property
Should be straightforward to implement now!
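To make the whitelist option concrete, here is a rough Python 3 sketch that keeps only characters whose major Unicode class is explicitly allowed. The name `keep_allowed` and the choice of allowed classes are assumptions for illustration, not what the project settled on.

```python
import unicodedata

# Assumption: keep letters (L), marks (M), numbers (N), punctuation (P) and
# separators (Z). Everything else, including all symbol (S*) and "other"
# (C*) categories, is dropped.
ALLOWED_MAJOR_CLASSES = {"L", "M", "N", "P", "Z"}


def keep_allowed(text, allowed=ALLOWED_MAJOR_CLASSES):
    """Keep only characters whose major general-category class is allowed."""
    return "".join(c for c in text if unicodedata.category(c)[0] in allowed)


title = keep_allowed("Super \U0001F60B Emoji-Land.com")
```

The trade-off is visible in what a whitelist like this removes beyond emoji: "+" and "$" are symbols, and newlines and tabs are control characters, so all of them disappear unless the allowed set is tuned. A blacklist leaves such characters alone but has to be extended whenever a new unwanted category turns up.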
Also, an interesting discovery: cchardet is not able to correctly determine the encoding of these symbols, even though it can detect UTF-8. Thus, if the page is decoded with the encoding chardet reports, we're unlikely to be able to strip these symbols.
>>> import cchardet
>>> import urllib2
>>> ta_dic = urllib2.urlopen("http://www.tamildict.com/english.php").read()
>>> cchardet.detect(ta_dic)
{'confidence': 0.9900000095367432, 'encoding': u'UTF-8'}
>>> ta_em = u"😋 Super Emoji-Land.com"
>>> ta_em = ta_em.encode('utf8')
>>> cchardet.detect(ta_em)
{'confidence': 0.8154354095458984, 'encoding': u'ISO-8859-9'}
>>> print ta_em.decode('ISO-8859-9')
ğ Super Emoji-Land.com
>>> ta_em = urllib2.urlopen("http://unicode.org/emoji/charts/full-emoji-list.html").read()
>>> cchardet.detect(ta_em)
{'confidence': 0.4998016357421875, 'encoding': u'WINDOWS-1252'}
The last example is particularly damning, since it's a page that consists of basically nothing except emoji and their UTF-8 encodings.
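One possible workaround, sketched here under assumptions rather than taken from cosr-back, is to attempt a strict UTF-8 decode first and only consult the detector when that fails. Bytes in a legacy single-byte encoding rarely happen to form valid UTF-8, so a successful strict decode is strong evidence. The function `decode_preferring_utf8` and its `detect` parameter are hypothetical; `detect` stands for any callable returning a dict with an 'encoding' key, the shape shown in the session above.

```python
def decode_preferring_utf8(raw, detect=None, fallback="latin-1"):
    """Decode bytes, trusting a strict UTF-8 decode over the detector's guess.

    Returns a (text, encoding_used) tuple.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Only ask the detector once UTF-8 has been ruled out.
    guess = None
    if detect is not None:
        guess = (detect(raw) or {}).get("encoding")

    encoding = guess or fallback
    try:
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        # The detector named an encoding Python does not know.
        return raw.decode(fallback, errors="replace"), fallback


raw = "\U0001F60B Super Emoji-Land.com".encode("utf-8")
# Stand-in detector that repeats the wrong guess reported above.
text, used = decode_preferring_utf8(raw, detect=lambda b: {"encoding": "ISO-8859-9"})
```

With this ordering the detector's wrong guess is never consulted for the emoji string, because the strict UTF-8 decode succeeds first.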
Good find! I'm not sure how we could fix this. Maybe there are not enough emoji in the dataset cchardet was trained on?
@Sentimentron looking back at the patch, I think we should also remove emojis in descriptions, don't you think?
So I did some searching: Google does strip emoji from descriptions, but Bing doesn't. I think Bing's results for "pile of poop emoji" are actually more descriptive.
Interesting! My instinct would be to remove them, but maybe we can reconsider later, once the results have evolved a bit!