commonsearch / cosr-back
Backend of Common Search. Analyses webpages and sends them to the index.
Home Page: https://about.commonsearch.org
License: Apache License 2.0
https://lexborisov.github.io/myhtml/
They are reporting an impressive 10x speedup over Gumbo:
http://lexborisov.github.io/benchmark-html-persers/
There are a few concerns beyond performance (testing on huge datasets, security, Python bindings, ...) but a 10x improvement is large enough that we should look into it!
Our current tokenizer is... rather simple :)
Let's discuss reasonable short-term improvements, as well as some mid-term ideas.
We should take into account the way documents are indexed in elasticsearch (currently a big list of words) and the tokenization we could do on search queries (currently none).
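As a strawman for that discussion, here is a minimal sketch of what a slightly less naive tokenizer could look like (the function name and stopword list are illustrative, not the current cosrlib code):

```python
import re

_WORD_RE = re.compile(r"\w+", re.UNICODE)
_STOPWORDS = {"the", "a", "an", "and", "or", "of"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on non-word runs, drop 1-char tokens and stopwords."""
    tokens = _WORD_RE.findall(text.lower())
    return [t for t in tokens if len(t) > 1 and t not in _STOPWORDS]

# tokenize("The quick brown fox!") -> ['quick', 'brown', 'fox']
```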
FilterLists is a meta-list of filter lists covering advertising, spam, astroturfing, fraud, and piracy. It would be relatively easy to manually pull in many of these feeds, but some require parsing, as the advertising filters often contain regular sites with CSS selectors. I've emailed the admin and asked if s/he could produce a JSON or TXT master list that we can parse.
I've also submitted PrivacyBadger's yellowlist for inclusion. If it isn't added for some reason, we should add it ourselves.
Dumps seem to be available at https://archive.org/details/stackexchange
Still not sure if we should entirely remove searcher.py
It's only used in tests. If we remove it, we'd have to make at least the cosr-back tests depend on the cosr-front code.
Pros:
Cons:
Just announced: http://blog.commoncrawl.org/2016/02/february-2016-crawl-archive-now-available/
Should be rather easy to switch in the code: https://github.com/commonsearch/cosr-back/blob/master/scripts/import_commoncrawl.sh#L13
We don't seem to be indexing the public suffix part of domains. The intention may have been to avoid indexing "com" all the time, but this is too restrictive.
https://github.com/commonsearch/cosr-back/blob/master/cosrlib/document/__init__.py#L96
As a result, nord.gouv.fr is not found in https://uidemo.commonsearch.org/?g=fr&q=nord+gouv+fr
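A possible fix, sketched with the tldextract package (an assumption; the actual code may rely on a different public suffix library), would be to keep the suffix words as extra tokens instead of dropping them:

```python
import tldextract

def domain_words(url):
    """Return the words of a domain, including its public suffix."""
    ext = tldextract.extract(url)
    words = []
    if ext.subdomain:
        words.extend(ext.subdomain.split("."))
    words.append(ext.domain)
    words.extend(ext.suffix.split("."))  # keep "gouv", "fr" instead of dropping them
    return [w for w in words if w]

# domain_words("http://nord.gouv.fr/") -> ['nord', 'gouv', 'fr']
```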
I've got some additional lists not covered by UT1 with notes on their availability. Given that the information stored with the crawl will be dated, I doubt anyone would mind us publishing the information.
Google's Safe Browsing is the big one, aggregating anti-phishing feeds (probably PhishTank), malware data from stopbadware.org, as well as their own list of unwanted software.
They specifically state that "All use of Safe Browsing APIs is free of charge." and their usage restrictions are strictly concerned with displaying information to users.
They offer dumps for use in local databases and provide the following contact information for large scale users: [email protected]
Collection of domains pushing malware. They offer data for research purposes; they may be fine with us making it publicly available (especially if we introduce a time lag).
Collaborative phishing list, not CC licensed, but I seriously doubt they would mind if we used their information. Ping the mailing list for more info.
Not sure if this should be done before indexing or completely in Elasticsearch, but it would be helpful for cases like commonsearch/cosr-results#5 to improve the tokenization of URLs to allow better partial matches.
As a first step, could the presence of separated terms ("Le Monde") in the title have an influence on the tokenization of the URLs? ("lemonde.fr" => "le monde fr" in addition to "lemonde fr")
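A rough sketch of that idea follows; the function name is illustrative and not an existing cosr-back helper. It greedily tries to cover the concatenated domain word with consecutive title terms:

```python
def split_domain_with_title(domain_word, title):
    """Split a concatenated domain word using space-separated terms from the title."""
    title_tokens = [t for t in title.lower().split() if t.isalnum()]
    for i in range(len(title_tokens)):
        candidate = []
        rest = domain_word
        for token in title_tokens[i:]:
            if rest.startswith(token):
                candidate.append(token)
                rest = rest[len(token):]
            else:
                break
        if candidate and not rest:
            return candidate
    return [domain_word]

# split_domain_with_title("lemonde", "Le Monde - Actualites") -> ['le', 'monde']
```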
As explained in our blog post, our host-level PageRank is very experimental and still very subject to spam.
Here is a list of our current ideas to improve it, feel free to contribute yours!
rel=nofollow links
Going to URL-level PageRanks would obviously help a lot, but it is out of scope for this issue.
We currently download the DMOZ data but we only store a boolean signal for the presence of URLs or domains in their dumps.
We should start storing titles and descriptions, and then use them as fallbacks in the search results. An example where this would help is commonsearch/cosr-results#3
We should also add support for <META NAME="ROBOTS" CONTENT="NOODP">
as explained here:
http://sitemaps.blogspot.com/2006/07/more-control-over-page-snippets.html
A few pointers:
format_title and format_summary using url_metadata: https://github.com/commonsearch/cosr-back/blob/master/cosrlib/formatting.py
I had to restart the Explainer, which was unresponsive because of this:
Traceback (most recent call last):
File "/cosr/back/venv/lib/python2.7/site-packages/gevent/baseserver.py", line 175, in _do_read
File "/cosr/back/venv/lib/python2.7/site-packages/gevent/server.py", line 114, in do_read
File "/cosr/back/venv/lib/python2.7/site-packages/gevent/_socket2.py", line 180, in accept
error: [Errno 24] Too many open files
Not sure where we could be leaking some file descriptors, or if we happened to have too many open sockets.
Currently we have different weights/boosts for title, url and body text.
We should further split the body text to give a higher weight to text in h1-h6 titles for instance.
First question is how to store those different groups of text in Elasticsearch? Do we create as many fields as levels of weight we can have?
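One possible answer, sketched below, is one text field per weight group, with the boosts applied at query time; the field names and the pre-5.x "string" type are assumptions, not the current mapping:

```python
# One text field per weight group; boosts applied at query time so they can
# be tuned without reindexing.
PAGE_MAPPING = {
    "properties": {
        "title":      {"type": "string"},
        "url_words":  {"type": "string"},
        "body_h1_h6": {"type": "string"},  # text found in h1-h6 headings
        "body":       {"type": "string"},  # remaining body text
    }
}

QUERY = {
    "multi_match": {
        "query": "nord gouv fr",
        "fields": ["title^5", "url_words^4", "body_h1_h6^2", "body"],
    }
}
```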
As Greg Lindahl (@wumpus) pointed out, DMOZ's data is rather low-quality these days, so it could be great to add presence in https://github.com/blekko/slashtag-data as another signal.
This should be pretty straightforward to do in the code, by duplicating what is currently done with DMOZ.
Is there an explicit license to this data though?
Cookie notices are more of an annoyance than regular boilerplate because they usually appear on top of the page and may pollute the snippets.
Right now we have very basic code to filter some of them, but we could use some of the lists at https://filterlists.com/ to filter more of them.
One big issue is the format of these lists though: they use CSS selectors, sometimes as complex as cofunds.co.uk###idrMasthead > .idrPageRow[style*='z-index:1']. We don't have a CSS selector engine at the moment and it's unclear if we could add one without a massive performance hit.
We may want to start by only using definitions by IDs and classes, which should take care of most cases.
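A minimal sketch of that first step, assuming the usual Adblock-style domain##selector element-hiding syntax (helper names are illustrative): keep only rules whose selector is a bare #id or .class, so no CSS selector engine is needed.

```python
import re

SIMPLE_SELECTOR_RE = re.compile(r"^[.#][\w-]+$")

def parse_simple_hiding_rules(lines):
    """Yield (domains, selector) for rules we can apply without a CSS engine."""
    for line in lines:
        if line.startswith("!") or "##" not in line:  # "!" lines are comments
            continue
        domains, selector = line.split("##", 1)
        selector = selector.strip()
        if SIMPLE_SELECTOR_RE.match(selector):
            yield (domains.split(","), selector)

# Kept:    "example.com##.cookie-banner"
# Skipped: "cofunds.co.uk###idrMasthead > .idrPageRow[style*='z-index:1']"
```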
Rough todo list:
tests/testdata directory
cosrlib/document/html/
It's hard to believe that with @mlinksva in the loop this hasn't been proposed before ;-)
How important/useful would it be to index Creative Commons (and others?) license tags and be able to filter results depending on them?
Travis builds are timing out :-/
There are a few things to investigate:
Hello,
Thanks to everyone who replied! It really helped me.
So all last week I have been trying to set up this project using AWS.
I have multiple questions about development!
Does the cosr-ops script provide everything needed for creating the ES clusters and all the worker setup?
You didn't mention that you use RocksDB and Gumbo; I installed them manually on the spark-master.
When I try to index test data by running make aws_spark_deploy_cosrback, it gives me this error:
This happens after ALEXATOP1M is downloaded and it tries to write it to RocksDB.
Traceback (most recent call last):
File "urlserver/import.py", line 21, in <module>
ds.import_dump()
File "./urlserver/datasources/__init__.py", line 62, in import_dump
for i, row in self.iter_dump():
File "./urlserver/datasources/__init__.py", line 102, in iter_dump
f = self.open_dump()
File "./urlserver/datasources/__init__.py", line 144, in open_dump
return GzipStreamFile(f)
File "/cosr/back/venv/src/gzipstream/gzipstream/gzipstreamfile.py", line 62, in __init__
super(GzipStreamFile, self).__init__(self._gzipstream)
File "/usr/lib64/python2.6/io.py", line 921, in __init__
raw._checkReadable()
AttributeError: '_GzipStreamFile' object has no attribute '_checkReadable'
Any ideas? I have been beating my head on this for a long time.
I also ran make aws_elasticsearch_create and successfully created ES instances (3 of them). How do I access them? What port is ES on, to point the frontend to?
I really hope to finally set everything up, so I can work on the issues (frontend and backend). I have high hopes for this project.
This would be helpful in the future, but also in the short-term to debug issues like commonsearch/cosr-results#2
A domain field will probably need to be added to the Elasticsearch mapping.
Then we have to parse terms like site:example.com in the queries. Possibly related: #2 & commonsearch/cosr-front#18
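A minimal sketch of the query-side parsing (the domain field and helper names are assumptions): extract site: terms from the raw query and turn them into a term filter, leaving the rest as free text.

```python
import re

SITE_RE = re.compile(r"\bsite:([\w.-]+)")

def parse_site_filters(query):
    """Split a raw query into free text and a list of site: domains."""
    domains = SITE_RE.findall(query)
    remaining = SITE_RE.sub("", query).strip()
    return remaining, domains

def build_domain_filter(domains):
    if not domains:
        return None
    return {"terms": {"domain": [d.lower() for d in domains]}}

# parse_site_filters("nord site:gouv.fr") -> ("nord", ["gouv.fr"])
```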
The explainer is a web service for debugging query results.
It should be useful in these use cases:
HTMLDocument, to view which parts of the page we ignore or consider as boilerplate.
I'm still cleaning up an early prototype of this, and will push it soon.
Hey Guys,
I get some errors after I execute make docker_test_coverage. I am unsure of what they mean exactly. Apparently, I pass 49 tests, skip 1, and fail 16. The last error I get is:
ProtocolError: ('Connection aborted.', error(111, 'Connection refused'))
Here are some screenshots as well. Help would be greatly appreciated.
Some titles contain characters like 🔥, which we probably do not want in the search results.
Is there a simple way (or existing Python module?) to clean all those characters without messing with international characters?
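One simple approach, sketched below, is to filter on Unicode categories: dropping Symbol (S*) and Other (C*) characters removes emoji and control characters while keeping letters in any script. Note it would also drop currency and math symbols, which may or may not be acceptable.

```python
import unicodedata

def strip_symbols(text):
    """Remove Symbol (S*) and Other (C*) characters, keep letters in any script."""
    return u"".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("S", "C")
    )

# strip_symbols(u"Hot deals \U0001F525 - Unicode ok") -> u"Hot deals  - Unicode ok"
```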
This would avoid errors late in the job like this:
Traceback (most recent call last):
File "/cosr/back/spark/jobs/pagerank.py", line 459, in <module>
job.run()
File "/cosr/back/cosrlib/spark.py", line 207, in run
self.run_job(sc, sqlc)
File "/cosr/back/spark/jobs/pagerank.py", line 75, in run_job
self.custom_pagerank(sc, sqlc)
File "/cosr/back/spark/jobs/pagerank.py", line 289, in custom_pagerank
compression="gzip" if self.args.gzip else "none"
File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 632, in text
File "/usr/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'path file:/cosr/back/out/pagerank already exists.;'
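A fail-fast check along these lines could run at the top of the job, before any Spark work starts (argument and helper names are assumptions, not the actual pagerank.py interface):

```python
import os
import shutil
import sys

def check_output_path(path, overwrite=False):
    """Abort immediately if the output directory already exists."""
    if os.path.exists(path):
        if overwrite:
            shutil.rmtree(path)
        else:
            sys.exit("Output path %s already exists, pass --overwrite to replace it." % path)

# Called at the top of run_job(), before any Spark work starts.
```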
reported by @HenriqueLimas
http://www.robotstxt.org/meta.html
Not sure if Common Crawl already filters those pages, but we should do it on our side too anyway.
Some pointers:
Add an is_indexable method to Document.
HTMLDocument.is_indexable() should test for the presence of noindex in self.head_metas["robots"].
For nofollow, we could simply test for its presence in close_tag.
Currently we force canonical URLs declared in the meta tags to be on the same domain as the base document:
https://github.com/commonsearch/cosr-back/blob/master/cosrlib/document/html/htmldocument.py#L346
Is this requirement too strict? If we relax it (same root domain? same DNS owner? any domain?), would some abuse/impersonation be possible?
Hello,
So your architecture diagram tells us that there are 2 ES clusters: https://about.commonsearch.org/developer/architecture
But the Operations documentation says there is 1 cluster with 3 nodes, so which one is it?
Also, I successfully executed the ES CloudFormation create and it created 1 instance, but when I open the public link, Elasticsearch doesn't seem to be installed?
I am really confused, can someone explain?
There are many simple things we could do to improve our normalized URLs and avoid duplicates:
Removing well-known session and tracking parameters (PHPSESSID, utm_*, ..)
Some good ideas there:
https://github.com/iipc/webarchive-commons/tree/master/src/main/java/org/archive/url
https://github.com/rajbot/surt/tree/master/surt
Code for this should be done in https://github.com/commonsearch/cosr-back/blob/master/cosrlib/url.py. Exhaustive unit tests would be great!
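An illustrative sketch of the parameter stripping for cosrlib/url.py (the function name and the parameter list are assumptions):

```python
from urlparse import urlparse, urlunparse, parse_qsl  # urllib.parse on Python 3
from urllib import urlencode

TRACKING_PARAMS = {"phpsessid", "jsessionid", "sid"}

def strip_tracking_params(url):
    """Remove well-known session/tracking query parameters and sort the rest."""
    parts = urlparse(url)
    kept = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS and not k.lower().startswith("utm_")
    ]
    kept.sort()  # stable ordering so equivalent URLs normalize identically
    return urlunparse(parts._replace(query=urlencode(kept)))

# strip_tracking_params("http://example.com/a?utm_source=x&id=1&PHPSESSID=abc")
#   -> "http://example.com/a?id=1"
```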
They mention opening their data:
https://github.com/blog/2201-making-open-source-data-more-available
I'm not sure if the dumps are publicly accessible outside of BigQuery? If not, is using the API the only solution?
We will need to have a model that evaluates many features from documents and gives us a document quality score.
Before doing any machine learning, it would be great to explore the first few features/signals we could include.
A first list of ideas, please add your own!
<blink> :-)
Currently our test coverage score is lower than it should be.
We are not collecting coverage information for the code running under spark-submit:
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_index.py#L79
We are already able to collect coverage data for subprocesses (we collect it while doing make import_local_data, see https://github.com/commonsearch/cosr-back/blob/master/Makefile#L144), but pyspark seems to copy the job file to a temporary folder, which breaks coverage collection, probably because the filenames change and/or the pytest-cov.pth file is not executed?
https://filterlists.com/ could help determine which sites serve ads.
Allow users to filter results based on this at index time and/or boost results lacking ad presence.
Looking at https://about.commonsearch.org/values it seems such filters would be mainstream (more so than license filters) and possibly aligned with privacy, though as stated the value is only about what Common Search does with user data. But Common Search's independence could allow it to take stronger (or at least different) measures to protect searchers than Google does.
I'd love to be able to search the web sans ad-laden sites. Not to avoid the ads (for that I use an ad blocker) but to avoid the junk content. Searching for info on many consumer products on Google, one has to wade through ad/affiliate-driven reviews and stores to find neutral information or even information provided by the manufacturer. Filtering out stores would be harder, so I didn't put it in the title of this issue.
It's not a bottleneck at this point, but it could clearly be improved.
A first step could be to avoid using the Python _bulk helper class and use ujson to dump results directly.
Batch sizes should also be added as config parameters so that they can be optimized by ops at index time.
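A sketch of that idea (not the current indexer code): build the _bulk payload by hand with ujson and send it through the low-level client, with the batch size exposed as a parameter.

```python
import ujson

def index_batch(es, index, doc_type, docs, batch_size=500):
    """docs is an iterable of (doc_id, document dict) tuples."""
    lines = []
    for i, (doc_id, doc) in enumerate(docs, 1):
        lines.append(ujson.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(ujson.dumps(doc))
        if i % batch_size == 0:
            es.bulk(body="\n".join(lines) + "\n")
            lines = []
    if lines:
        es.bulk(body="\n".join(lines) + "\n")
```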
Decompressing the WARC files from Common Crawl is a relatively slow step in the indexing process. It would be great to see how much improvement CloudFlare's version of zlib can bring.
Some benchmarks here:
http://www.snellman.net/blog/archive/2015-06-05-updated-zlib-benchmarks/
We should keep the ability to fallback on the regular implementation.
To do this we may have to fork commoncrawl/gzipstream, which is the place where zlib is imported: https://github.com/commoncrawl/gzipstream/blob/master/gzipstream/gzipstreamfile.py#L2
If the improvement is significant we should use it and add the build commands to our Dockerfile.
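The fallback could look something like this in a gzipstream fork; "zlib_ng" is a stand-in name for whatever binding ends up wrapping the optimized build, not a module that necessarily exists under that name:

```python
try:
    import zlib_ng as zlib   # stand-in name for a binding of the optimized zlib
except ImportError:
    import zlib              # regular implementation as fallback

# gzipstream-style decompressor for gzip-wrapped WARC streams
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
```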
We will end up importing the whole Wikipedia dumps soon enough, however a simple first step would be to import the abstracts, after #10 is finished.
That could allow us to add good descriptions, possibly of better quality than DMOZ. That would be helpful for commonsearch/cosr-results#1 for instance.
We could also possibly start including all wikipedia urls in the results, even if we didn't index their whole content yet.
https://meta.wikimedia.org/wiki/Data_dumps
There seems to be a combined abstract.xml file ("Recombine extracted page abstracts for Yahoo"), is this the one we should use?
https://dumps.wikimedia.org/enwiki/20160204/
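If it is, a streaming parse could look like the sketch below, assuming the <doc>/<title>/<url>/<abstract> layout of the abstracts dump (to be verified against the actual file):

```python
import xml.etree.cElementTree as ET

def iter_abstracts(fileobj):
    """Stream (url, title, abstract) dicts from the abstracts XML dump."""
    for _, elem in ET.iterparse(fileobj):
        if elem.tag == "doc":
            yield {
                "url": elem.findtext("url"),
                "title": elem.findtext("title"),
                "abstract": elem.findtext("abstract") or "",
            }
            elem.clear()  # keep memory bounded while streaming
```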
Link text is a powerful signal for relevance.
Current code can already extract the text. The main issue is that it's an external factor to the page and has to be determined (inverted) before we index the page if we want to keep a single indexing pass.
A couple options I see:
The good news is that unlike PageRank it doesn't need to be a graph operation. We should be fine for now (or ever) with 1 level of transmission.
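A rough Spark sketch of that single level of transmission (the RDD and helper names are illustrative, not existing cosr-back code): emit (target_url, anchor_text) pairs while parsing each page, then aggregate them per target before indexing.

```python
def extract_link_texts(document):
    """Yield (normalized_target_url, anchor_text) for every external link with text."""
    for link in document.get_external_hyperlinks():  # assumed helper
        if link.get("words"):
            yield (link["href"], " ".join(link["words"]))

link_texts_rdd = (
    documents_rdd                                    # assumed: RDD of parsed documents
    .flatMap(extract_link_texts)
    .reduceByKey(lambda a, b: (a + " " + b)[:1000])  # cap the accumulated text per URL
)
# link_texts_rdd can then be joined by URL with the documents being indexed.
```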
https://www.wikidata.org/wiki/Wikidata:Database_download
Importing wikidata would be (for starters) a good way to associate a lot of official URLs to their named entity and wikipedia page. This would help commonsearch/cosr-results#4 for instance.
The way we import Alexa should be a good starting point.
For a first version I think it should be ok to store (key, value) in rocksdb as (normalized_url, (name, english description, english wikipedia slug)).
Wikidata should then be added to https://about.commonsearch.org/data-sources
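A sketch of that storage, under the assumptions that P856 ("official website") claims carry the URL and that a python-rocksdb binding is used; normalized_url is an assumed helper:

```python
import ujson
import rocksdb  # assumption: python-rocksdb binding

db = rocksdb.DB("local-data/wikidata.rocksdb", rocksdb.Options(create_if_missing=True))

def store_entity(entity):
    """Store normalized_url -> (name, english description, english wikipedia slug)."""
    for claim in entity.get("claims", {}).get("P856", []):  # P856 = official website
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        url = snak["datavalue"]["value"]
        value = (
            entity.get("labels", {}).get("en", {}).get("value"),
            entity.get("descriptions", {}).get("en", {}).get("value"),
            entity.get("sitelinks", {}).get("enwiki", {}).get("title"),
        )
        db.put(normalized_url(url), ujson.dumps(value))  # normalized_url: assumed helper
```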
Rather easy to do and helps to see the project's "health" immediately:
IMHO a lot of people (including myself) look for these badges (e.g. to get a first impression on test quality).
Example project with (IMHO relevant) badges: https://github.com/zalando/connexion
So I recently deployed the project on AWS and I was surprised by the low performance of the indexer. I investigated and found out that the Spark indexer only uses 1 core of all available (at 100%), why is that?
I can't seem to figure out how to fix this, any ideas?
As mentioned in commonsearch/cosr-results#2, some big domains are missing from Common Crawl for various reasons that we will try to fix, but we should have a fallback with "fake" documents created from DMOZ and/or Wikidata items, in order to avoid any large gaping holes in the short term.
The main question is how this would fit in our current pipeline. The simplest way would probably be to iterate over entries from DMOZ & Wikidata (either with a range query on URLServer or straight from the dumps?) and send op_type=create queries to Elasticsearch, to avoid overwriting documents that were already indexed from Common Crawl:
https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
This method would only work after a clean reindex from Common Crawl, but this shouldn't be a big issue short-term. Open to other ideas though!
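A minimal sketch of the create-only indexing with the elasticsearch-py client; the index/type names and the document shape are placeholders:

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch()

def index_fallback_document(doc_id, title, summary, url):
    """Index a "fake" document only if nothing was indexed under that id yet."""
    try:
        es.index(
            index="page", doc_type="page", id=doc_id, op_type="create",
            body={"title": title, "summary": summary, "url": url},
        )
    except ConflictError:
        # A real document from Common Crawl is already there: keep it.
        pass
```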
I'm not sure if it is relevant, but it may be interesting to use JSON-LD [1] for document description, maybe even with the schema.org [2] vocabulary.
For example, https://en.wikipedia.org/wiki/Jean-François_Champollion could be represented by (in an amazing future when Common Search is able to guess page topics):
{
"@context": "http://schema.org",
"@type": "WebPage",
"@id": "https://en.wikipedia.org/wiki/Jean-François_Champollion",
"name": "Jean-François Champollion - Wikipedia, the free encyclopedia",
"description": "Jean-François Champollion (a.k.a. Champollion le jeune; 23 December 1790 – 4 March 1832) was a French scholar, philologist and orientalist",
"inLanguage": "en",
"image": "//upload.wikimedia.org/wikipedia/commons/thumb/6/6c/Leon_Cogniet_-_Jean-Francois_Champollion.jpg/220px-Leon_Cogniet_-_Jean-Francois_Champollion.jpg",
"dateModified": "2016-03-01T14:47:00",
"fileFormat": "text/html",
"mainEntity":{
"@type": "Person",
"@id": "http://www.wikidata.org/entity/Q260",
"name": "Jean-François Champollion"
}
}
[1] http://json-ld.org
[2] http://schema.org
In many cases it would be interesting information to show in the results.
There are many ways of getting this data from page or headers, with varying complexity and confidence. Let's investigate them!
We currently have ad-hoc scripts in https://github.com/commonsearch/cosr-back/tree/master/scripts to import our various datasources, each of them creating a different rocksdb database.
This is not ideal and should be improved in the following ways:
mprpc for the sake of simplicity.
There seem to be a few cases left where we get � characters in search results:
Check to see if they can be fixed, and prefer ignoring the error completely rather than showing any �.
There is a dataset available for 2006 to August 2015:
https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/
How to use it? The votes are probably an interesting signal for ranking.
There are a few ranking signals we could extract from data related to the domain name registration and records:
Getting complete whois/zonefile dumps doesn't seem easy at the moment. Any ideas?
A couple interesting links:
When I run
spark-submit jobs/spark/index.py --warc_limit 1 --only_homepages --profile
as described in README.md, the following error appears:
16/03/15 07:15:18 INFO BlockManagerMaster: Registered BlockManager
Traceback (most recent call last):
File "/cosr/back/jobs/spark/index.py", line 174, in
spark_main()
File "/cosr/back/jobs/spark/index.py", line 145, in spark_main
warc_filenames = list_warc_filenames()
File "/cosr/back/jobs/spark/index.py", line 72, in list_warc_filenames
warc_files = list_commoncrawl_warc_filenames(limit=args.warc_limit, skip=args.warc_skip)
File "/cosr/back/cosrlib/webarchive.py", line 22, in list_commoncrawl_warc_filenames
with open(warc_paths, "r") as f:
IOError: [Errno 2] No such file or directory: '/cosr/back/local-data/common-crawl/warc.paths.txt'
Currently for images we index the alt attribute as well as the filename.
However we don't exclude Data URIs, which we should, since it makes no sense to index them.
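A sketch of the check (the surrounding extraction code is assumed):

```python
def indexable_image_src(src):
    """Return False for Data URIs and empty sources, True for regular image URLs."""
    src = (src or "").strip().lower()
    return bool(src) and not src.startswith("data:")

# indexable_image_src("data:image/png;base64,iVBORw0...") -> False
# indexable_image_src("/static/logo.png") -> True
```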
Some page titles out there are either plain wrong or unhelpful. It's way worse for descriptions.
Most other search engines take some liberty and don't just use the <title> tag as source for the title or the <meta> tags for the description.
Other sources of data could include: <h1> tags. Any other ideas?
How to choose between these sources will be a complex topic, but we can build something reasonably simple in the short term. We already support a blacklist of titles and a few fallbacks in formatting.py.