bookcorpus's Issues

HTTPError: HTTP Error 401: Authorization Required

Thanks for your code, but I ran into network trouble when running the download_list script. The full error message is:
Failed to open https://www.smashwords.com/books/category/1/downloads/0/free/medium/0
HTTPError: HTTP Error 401: Authorization Required

What's more, when I use your url_list.jsonl to download files, the download_files script gives the same error message:
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 401: Authorization Required

I also tried opening the URL in Chrome, and the page loads there without a 401 error. Could you help me find a solution? Thanks a lot~

Can anyone download all the files in the url list file?

I tried to download the BookCorpus data. So far I have downloaded only around 5000 books. Has anyone managed to get all the books? I ran into a lot of "HTTP Error 403: Forbidden" responses. How can I fix this? Or can I get the full BookCorpus data from somewhere?

Thanks

Update on the `url_list.jsonl`

Hello, on 2022-12-17 I ran the script download_list.py with the page number changed to 31430, which covers the last search page. Here is the updated url_list.jsonl.zip

Compared with the original file, 4544 entries were lost and 8475 entries were added.
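
Counts like these can be reproduced with a set difference between the two list files. The following is a minimal sketch, not the method used for the numbers above; the helper name `diff_url_lists` and the choice of `"page"` as the identifying field are assumptions, so substitute whichever field uniquely identifies a book in your copy of url_list.jsonl.

```python
import json


def diff_url_lists(old_lines, new_lines, key="page"):
    """Return (lost, added) sets of identifiers between two JSONL listings.

    `key` is an assumed field name; pass the field that uniquely
    identifies an entry in your files.
    """
    def identifiers(lines):
        # One JSON object per non-empty line.
        return {json.loads(line)[key] for line in lines if line.strip()}

    old_ids = identifiers(old_lines)
    new_ids = identifiers(new_lines)
    return old_ids - new_ids, new_ids - old_ids
```

With two files on disk, pass the open file objects (for example `diff_url_lists(open("old.jsonl"), open("new.jsonl"))`) and take `len()` of each returned set.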

Hope this helps.

Books3 Links are Dead

The download links provided for books3.tar.gz no longer work. Is there an updated host?

Here’s a download link for all of bookcorpus as of Sept 2020

You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21

It contains 18k plain text files. The results are very high quality. I spent about a week fixing the epub2txt script, which you can find at https://github.com/shawwn/scrap under the name “epub2txt-all” (not “epub2txt”).

The new script:

  1. Correctly preserves structure, matching the table of contents very closely;

  2. Correctly renders tables of data (by default html2txt produces mostly garbage-looking results for tables);

  3. Correctly preserves code structure, so that source code and similar things are visually coherent;

  4. Converts numbered lists from “1\.” to “1.”;

  5. Runs the full text through ftfy.fix_text() (which is what OpenAI does for GPT), replacing Unicode apostrophes with ascii apostrophes;

  6. Expands Unicode ellipses to “...” (three separate ascii characters).
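
Steps 4 and 6 are simple text substitutions and can be sketched with the standard library alone. This is an illustration of those two steps, not the code from epub2txt-all; the function name `normalize_text` is hypothetical, and step 5 is left out because it depends on the third-party ftfy package.

```python
import re


def normalize_text(text: str) -> str:
    """Apply two of the cleanup steps listed above (hypothetical helper)."""
    # Step 4: unescape the dot after a list number at the start of a line,
    # turning "1\." into "1.".
    text = re.sub(r"^(\s*\d+)\\\.", r"\1.", text, flags=re.MULTILINE)
    # Step 6: expand the single-character Unicode ellipsis (U+2026)
    # into three ASCII dots.
    text = text.replace("\u2026", "...")
    return text
```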

The tarball download link (see tweet above) also includes the original ePub URLs, updated for September 2020, which ended up being about 2k more than the URLs in this repo. But they’re hard to crawl. I do have the epub files, but I’m reluctant to distribute them for obvious reasons.

How to resolve URLError SSL: CERTIFICATE_VERIFY_FAILED

If you get the following error:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>

Adding this block of code at the top of download_files.py will resolve it.

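import os, ssl
# Workaround only: this disables HTTPS certificate verification for urllib.
# It is skipped when PYTHONHTTPSVERIFY is set to a non-empty value.
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context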

intermittent issues with connections and file names

example:
python3.6 download_files.py --list url_list.jsonl --out out_txts --trash-bad-count
0 files had already been saved in out_txts.
File is not a zip file |
File is not a zip file
File is not a zip file
File is not a zip file
File is not a zip file
Failed to open https://www.smashwords.com/books/download/490185/8/latest/0/0/existence.epub
HTTPError: HTTP Error 503: Service Temporarily Unavailable
Succeeded in opening https://www.smashwords.com/books/download/490185/8/latest/0/0/existence.epub
File is not a zip file
File is not a zip file |
File is not a zip file
File is not a zip file
File is not a zip file |
File is not a zip file
File is not a zip file
"There is no item named '' in the archive"
File is not a zip file
File is not a zip file
"There is no item named 'OPS/' in the archive"
File is not a zip file
File is not a zip file |
File is not a zip file
Failed to open https://www.smashwords.com/books/download/793264/8/latest/0/0/jaynells-wolf.epub
HTTPError: HTTP Error 503: Service Temporarily Unavailable
Succeeded in opening https://www.smashwords.com/books/download/793264/8/latest/0/0/jaynells-wolf.epub
Failed to open https://www.smashwords.com/books/download/479710/6/latest/0/0/tainted-ava-delaney-lost-souls-1.txt
HTTPError: HTTP Error 503: Service Temporarily Unavailable
Succeeded in opening https://www.smashwords.com/books/download/479710/6/latest/0/0/tainted-ava-delaney-lost-souls-1.txt
File is not a zip file
"There is no item named 'OPS/' in the archive"
File is not a zip file
Failed to open https://www.smashwords.com/books/download/496160/8/latest/0/0/royal-blood-royal-blood-1.epub
HTTPError: HTTP Error 404: Not Found
Failed to open https://www.smashwords.com/books/download/496160/8/latest/0/0/royal-blood-royal-blood-1.epub
HTTPError: HTTP Error 404: Not Found
Gave up to open https://www.smashwords.com/books/download/496160/8/latest/0/0/royal-blood-royal-blood-1.epub
[Errno 2] No such file or directory: 'out_txts/royal-blood-royal-blood-1.epub'
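
In the log above, the 503 responses succeed on a second attempt while the 404 never does. A retry loop that backs off on temporary errors and gives up immediately on permanent ones might look like the sketch below. It is not the logic in download_files.py; `fetch_with_retry`, its parameters, and the set of retryable status codes are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request

# Status codes treated as temporary (an assumption for this sketch).
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_retry(url, max_tries=3, base_delay=4.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying temporary HTTP errors with exponential backoff.

    `opener` and `sleep` are injectable so the function can be exercised
    without touching the network.
    """
    for attempt in range(max_tries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Permanent errors (e.g. 404) and the final attempt are re-raised.
            if err.code not in RETRYABLE or attempt == max_tries - 1:
                raise
            # Wait 4 s, 8 s, 16 s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The caller should still catch the final `HTTPError` and skip the book, so that a missing download does not lead to the later "No such file or directory" failure.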

smashwords.com forbids this; readme should tell people to get permission first

The code in this repo violates both the robots.txt of smashwords.com:

$ curl -s https://www.smashwords.com/robots.txt | tail -4
User-agent: *
Disallow: /books/search?
Disallow: /books/download/
Crawl-delay: 4

and their terms of service, as far as I can see: “Third parties are not authorized to download, host and otherwise redistribute Smashwords books without prior written agreement from Smashwords” (you could imagine that this only prohibits downloading for subsequent hosting or redistribution, but I think that would be an opportunistic interpretation :) ).

The readme should tell people very clearly that they must get permission from smashwords.com before running this stuff against their site.
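
For anyone who wants to check this programmatically, Python's standard `urllib.robotparser` can evaluate the rules quoted above. The sketch below parses only the four lines shown in this issue (the live file may contain more), so it illustrates the check rather than giving a complete picture of what the site allows.

```python
from urllib.robotparser import RobotFileParser

# Only the tail of robots.txt quoted in this issue; the real file may be longer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /books/search?
Disallow: /books/download/
Crawl-delay: 4
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
url = "https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt"
print(rules.can_fetch("*", url))   # False: the download path is disallowed
print(rules.crawl_delay("*"))      # 4
```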

Network Error

Hi, thanks for your code. It's really useful for NLP researchers, and thank you again.

When I run this code, it is often interrupted by a network error after downloading only a few files. I think this may be caused by my network.
So could you please send me an email with the crawled BookCorpus dataset attached, if you have it?

My email is: [email protected]. Thank you very much.

Best,

Sort by author

Is it possible to sort the downloaded files author-wise here?
Thanks!
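
One possible approach is to group the entries of the list file by author before or after downloading. The sketch below assumes each line of url_list.jsonl is a JSON object with an `"author"` field; that field name and the helper `group_by_author` are assumptions, so adjust them to match the actual file.

```python
import json
from collections import defaultdict


def group_by_author(jsonl_lines, author_key="author"):
    """Group JSONL records by author, returned in alphabetical author order.

    `author_key` is an assumed field name; records without it are filed
    under "unknown".
    """
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        groups[record.get(author_key, "unknown")].append(record)
    # Sort by author name; records keep their original order within a group.
    return dict(sorted(groups.items()))
```

The resulting mapping could then be used to move each downloaded file into a per-author directory.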

download_list.py not working due to title change.

Apparently the titles on smashwords changed.
txt is now found under "Plain text; contains no formatting"
and epub under "Supported by many apps and devices (e.g., Apple Books, Barnes and Noble Nook, Kobo, Google Play, etc.)"

Could you share the processed all.txt?

Hi Sosuke,

Thanks a lot for the wonderful work! I hoped to obtain the BookCorpus dataset with your crawler, but I failed to crawl the books owing to some network errors, and I am afraid that I cannot get a complete dataset. So could you please share the dataset you have, e.g. the all.txt? My email address is [email protected]. Thanks!

Zhijie

epub2txt.py produces incorrect results for many epubs

Specifically this line:

html = file.read(ops + t.content.split("#")[0])

When I tried to convert a book on TensorFlow to text using this script, I noticed chapter 1 was being repeated multiple times.

The reason is that the Table of Contents looks similar to this:

ch1.html#section1
ch1.html#section2
ch1.html#section3
...
ch2.html#section1
ch2.html#section2
...

The epub2txt script iterates over this table of contents, splits "ch1.html#section1" into "ch1.html", then converts that file to text. It then repeats the process for "ch1.html#section2", which converts the same chapter to text a second time.

I have a fixed version here: https://github.com/shawwn/scrap/blob/afb699ee9c8181b3728b81fc410a31b66311f0d8/epub2txt#L158-L206
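
The core idea of avoiding the repetition can be sketched as follows: strip the fragment from each table-of-contents entry and keep each file only the first time it appears. This is a simplified illustration and not necessarily what the linked fix does; `unique_content_files` is a hypothetical helper operating on plain strings.

```python
def unique_content_files(toc_entries):
    """Return each content file once, in table-of-contents order.

    Entries such as "ch1.html#section1" and "ch1.html#section2" both
    refer to "ch1.html", which should be converted only once.
    """
    seen = set()
    files = []
    for entry in toc_entries:
        path = entry.split("#")[0]  # drop the "#fragment" part
        if path not in seen:
            seen.add(path)
            files.append(path)
    return files
```

Each returned path would then be read and converted once, instead of once per table-of-contents entry.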
