trivio / common_crawl_index


Index URLs in Common Crawl

Python 100.00%


common_crawl_index's Issues

Any plans to index and support the newer datasets?

Docs should mention that URLs are stored in reverse hostname order.

arcFileParition should be arcFilePartition in code example at end of README.md

project deprecated?

Is this project deprecated? I see there are no commits since 2013, and there appears to be a new index scheme available since 2015: http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/

Since the example in the README.md is not working, I'm guessing that this project is not being (and does not need to be) maintained. If that's the case, I think we could help some people by updating the README.md file to indicate that the project is deprecated by the new index.

If that's true, I am happy to create a pull request for it.

typo: arcFileParition should be arcFilePartition

Using the remote_read script I think there is a typo in the item_keys. Shouldn't arcFileParition be arcFilePartition?

Retrieving URLs for a specific country TLD?

Hello,

Could this script retrieve URLs belonging to a country TLD, e.g. *.fi?

AttributeError: 'NoneType' object has no attribute 'get_key'

Input:

./bin/remote_copy check "org.domain.www"

Output:

Traceback (most recent call last):
  File "./bin/remote_copy", line 152, in <module>
    mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')
  File "./bin/remote_copy", line 35, in __init__
    self.key = bucket.get_key(key_name)
AttributeError: 'NoneType' object has no attribute 'get_key'

How can I get a file in text mode?

I am trying to copy a file in text mode, but it is not working. The URL is com.wordpress.alinebessa/2011/06/11/documenting-accerciser-first-impressions/:http

which exists in CommonCrawl. When I check it out here: http://urlsearch.commoncrawl.org/page/1346876860454/1346973204444/3513/41986721/13163

It loads correctly there, but not when I try to fetch it in remote_copy (method copy_arc_files) with:

if src_key:
    print src_key.get_contents_as_string(headers=headers, encoding="iso-8859-1")

It comes back to me as bytes. Can you folks please help me in retrieving the actual text? Thanks!
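
One possible explanation is that the bytes fetched for the record are still gzip-compressed. That is an assumption about how the ARC data is stored, not something stated in this issue. If it holds, a minimal Python 3 sketch like the one below would turn the raw bytes into text. `decode_arc_record` is a hypothetical helper and is not part of remote_copy.

```python
import gzip
import zlib


def decode_arc_record(raw: bytes, encoding: str = "iso-8859-1") -> str:
    """Hypothetical helper: gunzip one record's bytes and decode them to text."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    data = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    # errors="replace" only matters if a stricter encoding such as utf-8 is passed.
    return data.decode(encoding, errors="replace")


# Round-trip check with locally built data (no network access needed).
sample = gzip.compress("<html>caf\u00e9</html>".encode("iso-8859-1"))
print(decode_arc_record(sample))
```

The decompressed record would still contain the ARC and HTTP headers ahead of the page body, so further splitting may be needed before the text is usable.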

./remote_read www.direkt-einkauf.at does not return anything

The reason I believe it should return something is:

curl -r 1048584-1048744 http://s3.amazonaws.com/aws-publicdatasets/common-crawl/projects/url-index/url-index.1356128792

which is the second index block, which contains prefixes for this domain

URLs not correctly sorted in index?

I think I found incorrectly sorted URLs in the index. For example, in block 4, net.about-plumbing.www/... comes after net.absolutely.www/...:

  'my.com.mahsuri.www/blog/rawatan/ti\x00',
  'name.armando.francesco.www/gallery/andrea_e_vera/slides/DSCN4325\x00',
  'net.123tools.www/audio_multimedia/audio_file_players/index-n-123tools-3\x00',
+ 'net.about-plumbing.www/new-hampshire/colebrook-m\x00',
  "net.absolutely.www/event/Premiere_of_'Land_of_the_Lost'/land_of_the_lost_02_wenn24377\x00",
- 'net.about-plumbing.www/new-hampshire/colebrook-m\x00',
  'net.adiochiropractic.www/templates20/article/1296\x00',
  'net.agilpage.www/index~m~1~w~11276\x00',
  'net.alblasserdam.www/nieuws/2009-07-02-6051-internetprovider-proserve-bouwt-regionaal-datacenter-in-alblasserdam.html:http\x00',

(The above is a diff of the actual list of prefixes in the block and the sorted list of prefixes in the block.)

This could cause a failure of the bisection lookup.
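
To illustrate why the ordering matters, here is a simplified sketch that runs a binary search over shortened versions of the prefixes quoted above, kept in the order reported for the block. The `contains` helper and the flat list are assumptions made for illustration; the real index lookup works on binary blocks and may behave differently.

```python
from bisect import bisect_left


def contains(prefixes, key):
    """Binary-search membership test; only valid if prefixes is sorted."""
    i = bisect_left(prefixes, key)
    return i < len(prefixes) and prefixes[i] == key


# Shortened keys in the order reported for the block (not sorted).
as_stored = [
    "net.123tools.www/",
    "net.absolutely.www/",
    "net.about-plumbing.www/",
    "net.adiochiropractic.www/",
    "net.agilpage.www/",
]
target = "net.about-plumbing.www/"

print(contains(as_stored, target))          # binary search misses the key
print(contains(sorted(as_stored), target))  # finds it once the list is sorted
```

The key is present in both lists, yet only the sorted copy lets the search land on it.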

ImportError: No module named boto

Hello,
I am experimenting with your Python script which seems very promising for a separate project requiring lookups of Common Crawl URLs. I have installed your current version from Git on my Ubuntu machine and ran:

$ bin/index_lookup_remote france.fr
Traceback (most recent call last):
  File "bin/index_lookup_remote", line 26, in <module>
    import boto
ImportError: No module named boto
Can you please tell me what boto is supposed to be?
Thanks,
Dan

Update index?

Not sure if this is the right place to create a ticket, but I'm wondering if it's possible to update the url index to the March 2014 crawl?

reverse hostname transformation is ambiguous

In some cases the function reversehost() makes recovery of the original URL ambiguous.

For example, ua.com.book-hunter.www/book/view/231/page:16:http represents both of these URLs:

reversehost('http://www.book-hunter.com.ua/book/view/231/page:16')
== reversehost('http://www.book-hunter.com.ua:16/book/view/231/page')
== 'ua.com.book-hunter.www/book/view/231/page:16:http'

This makes the reverse hostname form ambiguous: reversehost() cannot always be inverted to recover the actual URL.
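
The collision can be reproduced with a small stand-in written for illustration. `reverse_host_key` below is hypothetical and is not the project's reversehost(); it only assumes the key layout implied by the example above (reversed host, then path, then an optional `:port`, then `:scheme`).

```python
from urllib.parse import urlsplit


def reverse_host_key(url: str) -> str:
    """Hypothetical stand-in for reversehost(); not the project's code."""
    parts = urlsplit(url)
    # Reverse the dot-separated host labels: www.example.com -> com.example.www
    host = ".".join(reversed(parts.hostname.split(".")))
    key = host + parts.path
    if parts.port is not None:
        # The port is appended after the path, which is what causes the clash.
        key += ":%d" % parts.port
    return key + ":" + parts.scheme


a = reverse_host_key("http://www.book-hunter.com.ua/book/view/231/page:16")
b = reverse_host_key("http://www.book-hunter.com.ua:16/book/view/231/page")
print(a)
print(a == b)
```

Because a path may itself contain a colon, appending the port after the path with the same separator leaves no way to tell the two inputs apart.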

Unexpected results with different key lengths

I expected all of these to give the same first result:

$ bin/remote_read org.wikipedia.en | head -1
org.wikipedia.en/wiki/1525:http {'compressedSize': 15889, 'arcSourceSegmentId': 1346876860777, 'arcFilePartition': 4752, 'arcFileDate': 1346910706993, 'arcFileOffset': 75900503}
$ bin/remote_read org.wikipedia.en/wiki | head -1
org.wikipedia.en/wiki/1647_in_literature:http {'compressedSize': 9294, 'arcSourceSegmentId': 1346876860777, 'arcFilePartition': 1900, 'arcFileDate': 1346910123817, 'arcFileOffset': 77716338}
$ bin/remote_read org.wikipedia.en/wiki/1 | head -1
org.wikipedia.en/wiki/1942:_Joint_Strike:http {'compressedSize': 10488, 'arcSourceSegmentId': 1346823846039, 'arcFilePartition': 1724, 'arcFileDate': 1346872155599, 'arcFileOffset': 10475238}
$ bin/remote_read org.wikipedia.en/wiki/1942 | head -1
[no output]

The last one didn't even find the existing URL. Am I doing something wrong?

reverse hostname transformation breaks urls with username:[email protected]

Right now Common Crawl contains some URLs with username:password in them.
During index creation they were transformed by the function reversehost(), and as a result they are not searchable.

reversehost('http://123456:[email protected]/') 
== '123456/:[email protected]:http'

reversehost('http://Dennis:[email protected]/members/index.shtml') 
== 'Dennis/members/index.shtml:[email protected]:http'
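
One possible approach, sketched here under assumptions, is to strip the userinfo part before building the key. `key_without_userinfo` and the example URL are hypothetical and are not taken from the project or from the crawl. The sketch ignores ports for brevity.

```python
from urllib.parse import urlsplit


def key_without_userinfo(url: str) -> str:
    """Hypothetical key builder that drops any user:password@ prefix."""
    parts = urlsplit(url)
    # .hostname excludes both the "user:password@" prefix and any ":port" suffix.
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path + ":" + parts.scheme


print(key_without_userinfo("http://user:secret@www.example.com/members/index.shtml"))
```

Whether credentials should be dropped or kept somewhere else in the key is a design choice this sketch does not settle.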

Investigate anomalies

https://github.com/vbabu75/common_crawl/blob/master/other/common_crawl_index_anomalies.txt