commonsearch / cosr-back
Backend of Common Search. Analyses webpages and sends them to the index.
Home Page: https://about.commonsearch.org
License: Apache License 2.0
https://lexborisov.github.io/myhtml/
They are reporting an impressive 10x speedup over Gumbo:
http://lexborisov.github.io/benchmark-html-persers/
There are a few concerns beyond performance (testing on huge datasets, security, Python bindings, ...) but a 10x improvement is large enough that we should look into it!
Our current tokenizer is... rather simple :)
Let's discuss reasonable short-term improvements, as well as some mid-term ideas.
We should take into account the way documents are indexed in elasticsearch (currently a big list of words) and the tokenization we could do on search queries (currently none).
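As a strawman for that discussion, here is a minimal sketch of what a slightly less naive tokenizer could look like (the function name and stopword list are illustrative, not the current cosrlib code):

```python
import re

_WORD_RE = re.compile(r"\w+", re.UNICODE)
_STOPWORDS = {"the", "a", "an", "and", "or", "of"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on non-word runs, drop 1-char tokens and stopwords."""
    tokens = _WORD_RE.findall(text.lower())
    return [t for t in tokens if len(t) > 1 and t not in _STOPWORDS]

# tokenize("The quick brown fox!") -> ['quick', 'brown', 'fox']
```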
FilterLists is a meta-list of filter lists covering advertising, spam, astroturfing, fraud, and piracy. It would be relatively easy to manually pull in many of these feeds, but some require parsing, as the advertising filters often contain regular sites with CSS selectors. I've emailed the admin and asked if s/he could produce a JSON or TXT master list that we can parse.
I've also submitted PrivacyBadger's yellowlist for inclusion. If it isn't added for some reason, we should add it ourselves.
Dumps seem to be available at https://archive.org/details/stackexchange
Still not sure if we should entirely remove searcher.py
It's only used in tests. If we remove it, we'd have to make at least the cosr-back tests depend on the cosr-front code.
Pros:
Cons:
Just announced: http://blog.commoncrawl.org/2016/02/february-2016-crawl-archive-now-available/
Should be rather easy to switch in the code: https://github.com/commonsearch/cosr-back/blob/master/scripts/import_commoncrawl.sh#L13
We don't seem to be indexing the public suffix part of domains. The intention may have been to avoid indexing "com" all the time, but this is too restrictive.
https://github.com/commonsearch/cosr-back/blob/master/cosrlib/document/__init__.py#L96
As a result, nord.gouv.fr is not found in https://uidemo.commonsearch.org/?g=fr&q=nord+gouv+fr
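A possible fix, sketched with the tldextract package (an assumption; the actual code may rely on a different public suffix library), would be to keep the suffix words as extra tokens instead of dropping them:

```python
import tldextract

def domain_words(url):
    """Return the words of a domain, including its public suffix."""
    ext = tldextract.extract(url)
    words = []
    if ext.subdomain:
        words.extend(ext.subdomain.split("."))
    words.append(ext.domain)
    words.extend(ext.suffix.split("."))  # keep "gouv", "fr" instead of dropping them
    return [w for w in words if w]

# domain_words("http://nord.gouv.fr/") -> ['nord', 'gouv', 'fr']
```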
I've got some additional lists not covered by UT1 with notes on their availability. Given that the information stored with the crawl will be dated, I doubt anyone would mind us publishing the information.
Google's Safe Browsing is the big one, aggregating anti-phishing feeds (probably PhishTank), malware data from stopbadware.org, as well as their own list of unwanted software.
They specifically state that "All use of Safe Browsing APIs is free of charge." and their usage restrictions are strictly concerned with displaying information to users.
They offer dumps for use in local databases and provide the following contact information for large scale users: [email protected]
Collection of domains pushing malware. They offer data for research purposes; they may be fine with us making it publicly available (especially if we introduce a time lag).
Collaborative phishing list, not CC licensed, but I seriously doubt they would mind if we used their information. Ping the mailing list for more info.
Not sure if this should be done before indexing or completely in Elasticsearch, but it would be helpful for cases like commonsearch/cosr-results#5 to improve the tokenization of URLs to allow better partial matches.
As a first step, could the presence of separated terms ("Le Monde") in the title have an influence on the tokenization of the URLs? ("lemonde.fr" => "le monde fr" in addition to "lemonde fr")
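A rough sketch of that idea follows; the function name is illustrative and not an existing cosr-back helper. It greedily tries to cover the concatenated domain word with consecutive title terms:

```python
def split_domain_with_title(domain_word, title):
    """Split a concatenated domain word using space-separated terms from the title."""
    title_tokens = [t for t in title.lower().split() if t.isalnum()]
    for i in range(len(title_tokens)):
        candidate = []
        rest = domain_word
        for token in title_tokens[i:]:
            if rest.startswith(token):
                candidate.append(token)
                rest = rest[len(token):]
            else:
                break
        if candidate and not rest:
            return candidate
    return [domain_word]

# split_domain_with_title("lemonde", "Le Monde - Actualites") -> ['le', 'monde']
```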
As explained in our blog post, our host-level PageRank is very experimental and still very subject to spam.
Here is a list of our current ideas to improve it, feel free to contribute yours!
rel=nofollow links
Going to URL-level PageRanks would obviously help a lot, but it is out of scope for this issue.
We currently download the DMOZ data but we only store a boolean signal for the presence of URLs or domains in their dumps.
We should start storing titles and descriptions, and then use them as fallbacks in the search results. An example where this would help is commonsearch/cosr-results#3
We should also add support for <META NAME="ROBOTS" CONTENT="NOODP">
as explained here:
http://sitemaps.blogspot.com/2006/07/more-control-over-page-snippets.html
A few pointers:
format_title and format_summary using url_metadata: https://github.com/commonsearch/cosr-back/blob/master/cosrlib/formatting.py
I had to restart the Explainer, which was unresponsive because of this:
Traceback (most recent call last):
File "/cosr/back/venv/lib/python2.7/site-packages/gevent/baseserver.py", line 175, in _do_read
File "/cosr/back/venv/lib/python2.7/site-packages/gevent/server.py", line 114, in do_read
File "/cosr/back/venv/lib/python2.7/site-packages/gevent/_socket2.py", line 180, in accept
error: [Errno 24] Too many open files
Not sure where we could be leaking some file descriptors, or if we happened to have too many open sockets.
Currently we have different weights/boosts for title, url and body text.
We should further split the body text to give a higher weight to text in h1-h6 titles for instance.
First question is how to store those different groups of text in Elasticsearch? Do we create as many fields as levels of weight we can have?
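One possible answer, sketched below, is one text field per weight group, with the boosts applied at query time; the field names and the pre-5.x "string" type are assumptions, not the current mapping:

```python
# One text field per weight group; boosts applied at query time so they can
# be tuned without reindexing.
PAGE_MAPPING = {
    "properties": {
        "title":      {"type": "string"},
        "url_words":  {"type": "string"},
        "body_h1_h6": {"type": "string"},  # text found in h1-h6 headings
        "body":       {"type": "string"},  # remaining body text
    }
}

QUERY = {
    "multi_match": {
        "query": "nord gouv fr",
        "fields": ["title^5", "url_words^4", "body_h1_h6^2", "body"],
    }
}
```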
As Greg Lindahl (@wumpus) pointed out, DMOZ's data is rather low-quality these days, so it could be great to add presence in https://github.com/blekko/slashtag-data as another signal.
This should be pretty straightforward to do in the code, by duplicating what is currently done with DMOZ.
Is there an explicit license to this data though?
Cookie notices are more of an annoyance than regular boilerplate because they usually appear on top of the page and may pollute the snippets.
Right now we have very basic code to filter some of them, but we could use some of the lists at https://filterlists.com/ to filter more of them.
One big issue is the format of these lists though: they use CSS selectors, sometimes as complex as cofunds.co.uk###idrMasthead > .idrPageRow[style*='z-index:1']. We don't have a CSS selector engine at the moment and it's unclear if we could add one without a massive performance hit.
We may want to start by only using definitions by IDs and classes, which should take care of most cases.
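A minimal sketch of that first step, assuming the usual Adblock-style domain##selector element-hiding syntax (helper names are illustrative): keep only rules whose selector is a bare #id or .class, so no CSS selector engine is needed.

```python
import re

SIMPLE_SELECTOR_RE = re.compile(r"^[.#][\w-]+$")

def parse_simple_hiding_rules(lines):
    """Yield (domains, selector) for rules we can apply without a CSS engine."""
    for line in lines:
        if line.startswith("!") or "##" not in line:  # "!" lines are comments
            continue
        domains, selector = line.split("##", 1)
        selector = selector.strip()
        if SIMPLE_SELECTOR_RE.match(selector):
            yield (domains.split(","), selector)

# Kept:    "example.com##.cookie-banner"
# Skipped: "cofunds.co.uk###idrMasthead > .idrPageRow[style*='z-index:1']"
```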
Rough todo list:
tests/testdata directory
cosrlib/document/html/
It's hard to believe that with @mlinksva in the loop this hasn't been proposed before ;-)
How important/useful would it be to index Creative Commons (and others?) license tags and be able to filter results depending on them?
Travis builds are timing out :-/
There are a few things to investigate:
Hello,
Thanks to everyone who replied! It really helped me.
So all last week I have been trying to set up this project using AWS.
I have multiple questions about development!
Does the cosr-ops script provide everything needed for creating the ES clusters and all the worker setup?
You didn't mention that you use RocksDB and Gumbo; I installed them manually on the spark-master.
When I try to index test data by running make aws_spark_deploy_cosrback, it gives me this error:
This happens after ALEXATOP1M is downloaded and it tries to write it to RocksDB.
Traceback (most recent call last):
File "urlserver/import.py", line 21, in <module>
ds.import_dump()
File "./urlserver/datasources/__init__.py", line 62, in import_dump
for i, row in self.iter_dump():
File "./urlserver/datasources/__init__.py", line 102, in iter_dump
f = self.open_dump()
File "./urlserver/datasources/__init__.py", line 144, in open_dump
return GzipStreamFile(f)
File "/cosr/back/venv/src/gzipstream/gzipstream/gzipstreamfile.py", line 62, in __init__
super(GzipStreamFile, self).__init__(self._gzipstream)
File "/usr/lib64/python2.6/io.py", line 921, in __init__
raw._checkReadable()
AttributeError: '_GzipStreamFile' object has no attribute '_checkReadable'
Any ideas? I have been beating my head on this for a long time.
I also ran make aws_elasticsearch_create and successfully created ES instances (3 of them). How do I access them? What port is ES on, to point the frontend to?
I really hope to finally set everything up, so I can work on the issues (frontend and backend). I have high hopes for this project.
This would be helpful in the future, but also in the short-term to debug issues like commonsearch/cosr-results#2
A domain field will probably need to be added to the Elasticsearch mapping.
Then we have to parse terms like site:example.com in the queries. Possibly related: #2 & commonsearch/cosr-front#18
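A minimal sketch of the query-side parsing (the domain field and helper names are assumptions): extract site: terms from the raw query and turn them into a term filter, leaving the rest as free text.

```python
import re

SITE_RE = re.compile(r"\bsite:([\w.-]+)")

def parse_site_filters(query):
    """Split a raw query into free text and a list of site: domains."""
    domains = SITE_RE.findall(query)
    remaining = SITE_RE.sub("", query).strip()
    return remaining, domains

def build_domain_filter(domains):
    if not domains:
        return None
    return {"terms": {"domain": [d.lower() for d in domains]}}

# parse_site_filters("nord site:gouv.fr") -> ("nord", ["gouv.fr"])
```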
The explainer is a web service for debugging query results.
It should be useful in these use cases:
HTMLDocument, to view which parts of the page we ignore or consider as boilerplate.
I'm still cleaning up an early prototype of this, and will push it soon.
Hey Guys,
I get some errors after I execute make docker_test_coverage. I am unsure of what they mean exactly. Apparently, I pass 49 tests, skip 1, and fail 16. The last error I get is:
ProtocolError: ('Connection aborted.', error(111, 'Connection refused'))
Here are some screenshots as well. Help would be greatly appreciated.
Some titles contain characters like 🔥, which we probably do not want in the search results.
Is there a simple way (or existing Python module?) to clean all those characters without messing with international characters?
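One simple approach, sketched below, is to filter on Unicode categories: dropping Symbol (S*) and Other (C*) characters removes emoji and control characters while keeping letters in any script. Note it would also drop currency and math symbols, which may or may not be acceptable.

```python
import unicodedata

def strip_symbols(text):
    """Remove Symbol (S*) and Other (C*) characters, keep letters in any script."""
    return u"".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("S", "C")
    )

# strip_symbols(u"Hot deals \U0001F525 - Unicode ok") -> u"Hot deals  - Unicode ok"
```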
This would avoid errors late in the job like this:
Traceback (most recent call last):
File "/cosr/back/spark/jobs/pagerank.py", line 459, in <module>
job.run()
File "/cosr/back/cosrlib/spark.py", line 207, in run
self.run_job(sc, sqlc)
File "/cosr/back/spark/jobs/pagerank.py", line 75, in run_job
self.custom_pagerank(sc, sqlc)
File "/cosr/back/spark/jobs/pagerank.py", line 289, in custom_pagerank
compression="gzip" if self.args.gzip else "none"
File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 632, in text
File "/usr/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'path file:/cosr/back/out/pagerank already exists.;'
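A fail-fast check along these lines could run at the top of the job, before any Spark work starts (argument and helper names are assumptions, not the actual pagerank.py interface):

```python
import os
import shutil
import sys

def check_output_path(path, overwrite=False):
    """Abort immediately if the output directory already exists."""
    if os.path.exists(path):
        if overwrite:
            shutil.rmtree(path)
        else:
            sys.exit("Output path %s already exists, pass --overwrite to replace it." % path)

# Called at the top of run_job(), before any Spark work starts.
```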
reported by @HenriqueLimas
http://www.robotstxt.org/meta.html
Not sure if Common Crawl already filters those pages, but we should do it on our side too anyway.
Some pointers:
Add an is_indexable method to Document.
HTMLDocument.is_indexable() should test for the presence of noindex in self.head_metas["robots"].
For nofollow, we could simply test for its presence in close_tag.
Currently we force canonical URLs declared in the meta tags to be on the same domain as the base document:
https://github.com/commonsearch/cosr-back/blob/master/cosrlib/document/html/htmldocument.py#L346
Is this requirement too strict? If we relax it (same root domain? same DNS owner? any domain?), would some abuse/impersonation be possible?
Hello,
So your architecture diagram tells us that there are 2 ES clusters: https://about.commonsearch.org/developer/architecture
But the Operations documentation says there is 1 cluster with 3 nodes, so which one is it?
Also, I successfully executed the ES CloudFormation create and it created 1 instance, but when I open the public link, Elasticsearch doesn't seem to be installed?
I am really confused, can someone explain?
There are many simple things we could do to improve our normalized URLs and avoid duplicates:
Removing well-known session and tracking parameters (PHPSESSID, utm_*, ..)
Some good ideas there:
https://github.com/iipc/webarchive-commons/tree/master/src/main/java/org/archive/url
https://github.com/rajbot/surt/tree/master/surt
Code for this should be done in https://github.com/commonsearch/cosr-back/blob/master/cosrlib/url.py. Exhaustive unit tests would be great!
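An illustrative sketch of the parameter stripping for cosrlib/url.py (the function name and the parameter list are assumptions):

```python
from urlparse import urlparse, urlunparse, parse_qsl  # urllib.parse on Python 3
from urllib import urlencode

TRACKING_PARAMS = {"phpsessid", "jsessionid", "sid"}

def strip_tracking_params(url):
    """Remove well-known session/tracking query parameters and sort the rest."""
    parts = urlparse(url)
    kept = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS and not k.lower().startswith("utm_")
    ]
    kept.sort()  # stable ordering so equivalent URLs normalize identically
    return urlunparse(parts._replace(query=urlencode(kept)))

# strip_tracking_params("http://example.com/a?utm_source=x&id=1&PHPSESSID=abc")
#   -> "http://example.com/a?id=1"
```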
They mention opening their data:
https://github.com/blog/2201-making-open-source-data-more-available
I'm not sure if the dumps are publicly accessible outside of BigQuery? If not, is using the API the only solution?
We will need to have a model that evaluates many features from documents and gives us a document quality score.
Before doing any machine learning, it would be great to explore the first few features/signals we could include.
A first list of ideas, please add your own!
<blink> :-)
Currently our test coverage score is lower than it should be.
We are not collecting coverage information for the code running under spark-submit:
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_index.py#L79
We are already able to collect coverage data for subprocesses (we collect it while doing make import_local_data, see https://github.com/commonsearch/cosr-back/blob/master/Makefile#L144), but pyspark seems to copy the job file to a temporary folder, which breaks coverage collection, probably because the filenames change and/or the pytest-cov.pth file is not executed?
https://filterlists.com/ could help determine which sites serve ads.
Allow users to filter results based on this at index time and/or boost results lacking ad presence.
Looking at https://about.commonsearch.org/values it seems such filters would be mainstream (more so than license filters) and possibly aligned with privacy, though as stated the value is only about what Common Search does with user data. But Common Search's independence could allow it to take stronger (or at least different) measures to protect searchers than Google does.
I'd love to be able to search the web sans ad-laden sites. Not to avoid the ads (for that I use an ad blocker) but to avoid the junk content. Searching for info on many consumer products on Google, one has to wade through ad/affiliate-driven reviews and stores to find neutral information or even information provided by the manufacturer. Filtering out stores would be harder, so I didn't put it in the title of this issue.
It's not a bottleneck at this point, but it could clearly be improved.
A first step could be to avoid using the Python _bulk helper class and use ujson to dump results directly.
Batch sizes should also be added as config parameters so that they can be optimized by ops at index time.
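A sketch of that idea (not the current indexer code): build the _bulk payload by hand with ujson and send it through the low-level client, with the batch size exposed as a parameter.

```python
import ujson

def index_batch(es, index, doc_type, docs, batch_size=500):
    """docs is an iterable of (doc_id, document dict) tuples."""
    lines = []
    for i, (doc_id, doc) in enumerate(docs, 1):
        lines.append(ujson.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(ujson.dumps(doc))
        if i % batch_size == 0:
            es.bulk(body="\n".join(lines) + "\n")
            lines = []
    if lines:
        es.bulk(body="\n".join(lines) + "\n")
```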
Decompressing the WARC files from Common Crawl is a relatively slow step in the indexing process. It would be great to see how much improvement CloudFlare's version of zlib can bring.
Some benchmarks here:
http://www.snellman.net/blog/archive/2015-06-05-updated-zlib-benchmarks/
We should keep the ability to fallback on the regular implementation.
To do this we may have to fork commoncrawl/gzipstream, which is the place where zlib is imported: https://github.com/commoncrawl/gzipstream/blob/master/gzipstream/gzipstreamfile.py#L2
If the improvement is significant we should use it and add the build commands to our Dockerfile.
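The fallback could look something like this in a gzipstream fork; "zlib_ng" is a stand-in name for whatever binding ends up wrapping the optimized build, not a module that necessarily exists under that name:

```python
try:
    import zlib_ng as zlib   # stand-in name for a binding of the optimized zlib
except ImportError:
    import zlib              # regular implementation as fallback

# gzipstream-style decompressor for gzip-wrapped WARC streams
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
```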
We will end up importing the whole Wikipedia dumps soon enough, however a simple first step would be to import the abstracts, after #10 is finished.
That could allow us to add good descriptions, possibly of better quality than DMOZ. That would be helpful for commonsearch/cosr-results#1 for instance.
We could also possibly start including all wikipedia urls in the results, even if we didn't index their whole content yet.
https://meta.wikimedia.org/wiki/Data_dumps
There seems to be a combined abstract.xml file ("Recombine extracted page abstracts for Yahoo"), is this the one we should use?
https://dumps.wikimedia.org/enwiki/20160204/
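If it is, a streaming parse could look like the sketch below, assuming the <doc>/<title>/<url>/<abstract> layout of the abstracts dump (to be verified against the actual file):

```python
import xml.etree.cElementTree as ET

def iter_abstracts(fileobj):
    """Stream (url, title, abstract) dicts from the abstracts XML dump."""
    for _, elem in ET.iterparse(fileobj):
        if elem.tag == "doc":
            yield {
                "url": elem.findtext("url"),
                "title": elem.findtext("title"),
                "abstract": elem.findtext("abstract") or "",
            }
            elem.clear()  # keep memory bounded while streaming
```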
Link text is a powerful signal for relevance.
Current code can already extract the text. The main issue is that it's an external factor to the page and has to be determined (inverted) before we index the page if we want to keep a single indexing pass.
A couple options I see:
The good news is that unlike PageRank it doesn't need to be a graph operation. We should be fine for now (or ever) with 1 level of transmission.
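A rough Spark sketch of that single level of transmission (the RDD and helper names are illustrative, not existing cosr-back code): emit (target_url, anchor_text) pairs while parsing each page, then aggregate them per target before indexing.

```python
def extract_link_texts(document):
    """Yield (normalized_target_url, anchor_text) for every external link with text."""
    for link in document.get_external_hyperlinks():  # assumed helper
        if link.get("words"):
            yield (link["href"], " ".join(link["words"]))

link_texts_rdd = (
    documents_rdd                                    # assumed: RDD of parsed documents
    .flatMap(extract_link_texts)
    .reduceByKey(lambda a, b: (a + " " + b)[:1000])  # cap the accumulated text per URL
)
# link_texts_rdd can then be joined by URL with the documents being indexed.
```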
https://www.wikidata.org/wiki/Wikidata:Database_download
Importing wikidata would be (for starters) a good way to associate a lot of official URLs to their named entity and wikipedia page. This would help commonsearch/cosr-results#4 for instance.
The way we import Alexa should be a good starting point.
For a first version I think it should be ok to store (key, value) in rocksdb as (normalized_url, (name, english description, english wikipedia slug)).
Wikidata should then be added to https://about.commonsearch.org/data-sources
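A sketch of that storage, under the assumptions that P856 ("official website") claims carry the URL and that a python-rocksdb binding is used; normalized_url is an assumed helper:

```python
import ujson
import rocksdb  # assumption: python-rocksdb binding

db = rocksdb.DB("local-data/wikidata.rocksdb", rocksdb.Options(create_if_missing=True))

def store_entity(entity):
    """Store normalized_url -> (name, english description, english wikipedia slug)."""
    for claim in entity.get("claims", {}).get("P856", []):  # P856 = official website
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        url = snak["datavalue"]["value"]
        value = (
            entity.get("labels", {}).get("en", {}).get("value"),
            entity.get("descriptions", {}).get("en", {}).get("value"),
            entity.get("sitelinks", {}).get("enwiki", {}).get("title"),
        )
        db.put(normalized_url(url), ujson.dumps(value))  # normalized_url: assumed helper
```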
Rather easy to do and helps to see the project's "health" immediately:
IMHO a lot of people (including myself) look for these badges (e.g. to get a first impression on test quality).
Example project with (IMHO relevant) badges: https://github.com/zalando/connexion
So I recently deployed the project on AWS and I was surprised by the low performance of the indexer. I investigated and found out that the Spark indexer only uses 1 core of all available (at 100%), why is that?
I can't seem to figure out how to fix this, any ideas?
As mentioned in commonsearch/cosr-results#2, some big domains are missing from Common Crawl for various reasons that we will try to fix, but we should have a fallback with "fake" documents created from DMOZ and/or Wikidata items, in order to avoid any large gaping holes in the short term.
The main question is how this would fit in our current pipeline. The simplest way would probably be to iterate over entries from DMOZ & Wikidata (either with a range query on URLServer or straight from the dumps?) and send op_type=create queries to Elasticsearch, to avoid overwriting documents that were already indexed from Common Crawl:
https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
This method would only work after a clean reindex from Common Crawl, but this shouldn't be a big issue short-term. Open to other ideas though!
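A minimal sketch of the create-only indexing with the elasticsearch-py client; the index/type names and the document shape are placeholders:

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch()

def index_fallback_document(doc_id, title, summary, url):
    """Index a "fake" document only if nothing was indexed under that id yet."""
    try:
        es.index(
            index="page", doc_type="page", id=doc_id, op_type="create",
            body={"title": title, "summary": summary, "url": url},
        )
    except ConflictError:
        # A real document from Common Crawl is already there: keep it.
        pass
```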
I'm not sure if it is relevant, but it may be interesting to use JSON-LD [1] for document description, maybe even with the schema.org [2] vocabulary.
For example, https://en.wikipedia.org/wiki/Jean-François_Champollion could be represented by (in an amazing future when Common Search is able to guess page topics):
{
"@context": "http://schema.org",
"@type": "WebPage",
"@id": "https://en.wikipedia.org/wiki/Jean-François_Champollion",
"name": "Jean-François Champollion - Wikipedia, the free encyclopedia",
"description": "Jean-François Champollion (a.k.a. Champollion le jeune; 23 December 1790 – 4 March 1832) was a French scholar, philologist and orientalist",
"inLanguage": "en",
"image": "//upload.wikimedia.org/wikipedia/commons/thumb/6/6c/Leon_Cogniet_-_Jean-Francois_Champollion.jpg/220px-Leon_Cogniet_-_Jean-Francois_Champollion.jpg",
"dateModified": "2016-03-01T14:47:00",
"fileFormat": "text/html",
"mainEntity":{
"@type": "Person",
"@id": "http://www.wikidata.org/entity/Q260",
"name": "Jean-François Champollion"
}
}
[1] http://json-ld.org
[2] http://schema.org
In many cases it would be interesting information to show in the results.
There are many ways of getting this data from page or headers, with varying complexity and confidence. Let's investigate them!
We currently have ad-hoc scripts in https://github.com/commonsearch/cosr-back/tree/master/scripts to import our various datasources, each of them creating a different rocksdb database.
This is not ideal and should be improved in the following ways:
mprpc for the sake of simplicity.
There seem to be a few cases left where we get � characters in search results:
Check to see if they can be fixed, and prefer ignoring the error completely rather than showing any �.
There is a dataset available for 2006 to August 2015:
https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/
How to use it? The votes are probably an interesting signal for ranking.
There are a few ranking signals we could extract from data related to the domain name registration and records:
Getting complete whois/zonefile dumps doesn't seem easy at the moment. Any ideas?
A couple interesting links:
When I run
spark-submit jobs/spark/index.py --warc_limit 1 --only_homepages --profile
as described in README.md, the following error appears:
16/03/15 07:15:18 INFO BlockManagerMaster: Registered BlockManager
Traceback (most recent call last):
File "/cosr/back/jobs/spark/index.py", line 174, in
spark_main()
File "/cosr/back/jobs/spark/index.py", line 145, in spark_main
warc_filenames = list_warc_filenames()
File "/cosr/back/jobs/spark/index.py", line 72, in list_warc_filenames
warc_files = list_commoncrawl_warc_filenames(limit=args.warc_limit, skip=args.warc_skip)
File "/cosr/back/cosrlib/webarchive.py", line 22, in list_commoncrawl_warc_filenames
with open(warc_paths, "r") as f:
IOError: [Errno 2] No such file or directory: '/cosr/back/local-data/common-crawl/warc.paths.txt'
Currently for images we index the alt attribute as well as the filename.
However we don't exclude Data URIs, which we should, since it makes no sense to index them.
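A sketch of the check (the surrounding extraction code is assumed):

```python
def indexable_image_src(src):
    """Return False for Data URIs and empty sources, True for regular image URLs."""
    src = (src or "").strip().lower()
    return bool(src) and not src.startswith("data:")

# indexable_image_src("data:image/png;base64,iVBORw0...") -> False
# indexable_image_src("/static/logo.png") -> True
```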
Some page titles out there are either plain wrong or unhelpful. It's way worse for descriptions.
Most other search engines take some liberty and don't just use the <title> tag as source for the title or the <meta> tags for the description.
Other sources of data could include: <h1> tags. Any other ideas?
How to choose between these sources will be a complex topic, but we can build something reasonably simple in the short term. We already support a blacklist of titles and a few fallbacks in formatting.py.