docdown's Introduction

DocDown

Documentation: DocDown is a Playwright-driven tool for downloading preview documents from Docin (豆丁), Book118 (原创力文档), and Baidu Wenku (百度文库).

Project Description

A download tool for Book118 (原创力文档), Docin (豆丁网), and Baidu Wenku (百度文库), driven by Playwright.

Supported: book118 doc, ppt, and pdf; docin doc; Baidu Wenku.

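Usage

Packaged Release

Download link

Visit the site you want to download from, click Preview, and copy the link. It looks like this:

https://max.book118.com/html/2017/1105/139064432.shtm

Using the link above as an example: in the target download folder, right-click and choose "Open in Terminal" (Windows 11), or hold Shift, right-click, and choose "Open PowerShell window here" (Windows 10), then run

./docdown 'DOWNLOAD_LINK'  # the link, wrapped in ASCII single quotes

# For example:
./docdown 'https://max.book118.com/html/2017/1105/139064432.shtm'

A browser window then opens, and after a while a PDF file is generated in that folder.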

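Running from Source

Clone this project and install the dependencies:

pip install -r requirements.txt

# Install the playwright library
pip install playwright

# Install the browser driver files (this step is somewhat slow)
python3 -m playwright install

# Or, if the command above fails
playwright install

Visit the site you want to download from, click Preview, and copy the link. It looks like this:

https://max.book118.com/html/2017/1105/139064432.shtm

Using the link above as an example, run the following in the project folder: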

## book118
python run.py 'https://max.book118.com/html/2017/1105/139064432.shtm'

# or

python3 run.py 'https://max.book118.com/html/2019/0929/6203012025002111.shtm'

## docin
python run.py 'https://www.docin.com/p-1052644960.html'

The PDF document is generated in the directory you run the command from.
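The commands above pass links from different sites to the same run.py. How run.py tells the sites apart is not shown in this README; the following is only a minimal sketch of one way to do it by hostname. The function name detect_site, the SITE_BY_HOST_SUFFIX table, and the wenku.baidu.com entry are assumptions for illustration, not part of DocDown.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host suffix to a site label (not taken from DocDown).
SITE_BY_HOST_SUFFIX = {
    "book118.com": "book118",
    "docin.com": "docin",
    "wenku.baidu.com": "baidu",
}


def detect_site(url: str) -> str:
    """Return a site label for a supported URL, or raise ValueError."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, site in SITE_BY_HOST_SUFFIX.items():
        # Match the bare domain or any subdomain such as max.book118.com.
        if host == suffix or host.endswith("." + suffix):
            return site
    raise ValueError(f"unsupported site: {url}")


if __name__ == "__main__":
    print(detect_site("https://max.book118.com/html/2017/1105/139064432.shtm"))
    print(detect_site("https://www.docin.com/p-1052644960.html"))
```

Matching on the host suffix rather than on a substring of the whole URL avoids false positives when a site name appears in a path or query string.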

If you see the message "Image contains an alpha channel which will be stored as a separate soft mask (/SMask) image in PDF.", this is normal and does not affect the final result.

Packaging from Source

Clone the project, open cmd, and run

set PLAYWRIGHT_BROWSERS_PATH=0
playwright install webkit

to install WebKit, then package run.py with pyinstaller.

Reference: packaging Playwright with PyInstaller

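FAQ

Usage Issues

If you hit a runtime error, please check the following before opening an issue.

  • Turn off the system proxy.
  • When pasting the link, wrap it in ASCII single quotes (').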
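Whether the quotes help depends on the shell. PowerShell removes single quotes before the program sees the argument, while cmd.exe passes them through literally, which matches the "Cannot navigate to invalid URL" report in the issues further down. A minimal sketch of how a script could accept both forms follows. The helper name clean_url_argument is hypothetical and is not part of DocDown.

```python
from urllib.parse import urlparse


def clean_url_argument(raw: str) -> str:
    """Strip whitespace and one pair of matching surrounding quotes, then validate."""
    url = raw.strip()
    # cmd.exe hands over 'https://...' with the quotes still attached.
    if len(url) >= 2 and url[0] == url[-1] and url[0] in "'\"":
        url = url[1:-1].strip()
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a valid http(s) URL: {raw!r}")
    return url


if __name__ == "__main__":
    quoted = "'https://max.book118.com/html/2017/1221/145235973.shtm'"
    print(clean_url_argument(quoted))
```

Validating the result up front turns a later browser navigation error into an immediate, readable message.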

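Technical Issues

These problems currently have no solution. If you know a good fix, please open an issue.

  • Some document formats are not supported.
  • Documents that require payment to preview are not supported.
  • Output is PDF only (images converted to PDF).
  • Baidu Wenku output has low resolution (a limitation of Playwright screenshots).

Advanced Usage

You can consider running Baidu OCR on the downloaded PDF to convert it to text.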

docdown's People

Contributors

alitterman, kerm-me, kermsite

docdown's Issues

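Book118 pptx download problem

Some PPTs on Book118 show a specific problem: paging forward and backward normally works, but once you reach the last page and then page back, one page comes up blank. I am not sure whether this specifically targets this downloader. I hope the author can consider not exiting with an exception on this kind of download failure and continuing the download instead; I am fine with filling in the missing page myself afterwards.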
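The behaviour requested here, skipping a failed page and carrying on, could look roughly like the sketch below. It is not DocDown's code: download_pages and the fetch_page callable are hypothetical stand-ins for whatever captures a single page.

```python
from typing import Callable, List, Optional, Tuple


def download_pages(
    page_count: int,
    fetch_page: Callable[[int], bytes],
) -> Tuple[List[Optional[bytes]], List[int]]:
    """Fetch pages 1..page_count, recording failures instead of aborting.

    fetch_page is a hypothetical callable that returns the image bytes of one
    page; it may raise, or return empty bytes for a blank page.
    """
    pages: List[Optional[bytes]] = []
    failed: List[int] = []
    for number in range(1, page_count + 1):
        try:
            data = fetch_page(number)
            if not data:
                raise ValueError(f"page {number} is empty")
            pages.append(data)
        except Exception as exc:  # keep going and report the gap afterwards
            print(f"page {number} failed: {exc}")
            pages.append(None)
            failed.append(number)
    return pages, failed


if __name__ == "__main__":
    def fake_fetch(number: int) -> bytes:
        # Stand-in for a real capture: page 3 comes back blank.
        return b"" if number == 3 else f"image-{number}".encode()

    pages, failed = download_pages(4, fake_fetch)
    print("missing pages:", failed)
```

Keeping a None placeholder for each failed page preserves page order, so the missing pages can be filled in by hand later.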

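Problem with the doc88 algorithm

The data=false on line 148: why is it outside the loop? Because of that, at first it kept saving the first page over and over. Shouldn't it be inside the loop?

"Download failed" message

Every Docin download now reports that the download failed and tells me to turn off the proxy, but the proxy has been off the whole time.

Generated PDF file is empty

After downloading the document at 'https://max.book118.com/html/2017/0425/102360892.shtm', the generated PDF file is empty: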
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
下载失败,请注意关闭代理,如果还有问题,请至GitHub提交issue,附上下载链接

After entering the command, the browser never appears, and after a while docdown exits with an error

No proxy is enabled on my computer. The command I entered is the example: ./docdown 'https://max.book118.com/html/2017/1105/139064432.shtm'

After I press Enter, the docdown window appears, then after a while it reports an error and closes on its own. The browser never appears.

docdown error output:
Traceback (most recent call last):
  File "run.py", line 7, in <module>
  File "tool.py", line 151, in download_from_url
  File "playwright\sync_api\_generated.py", line 11538, in launch
  File "playwright\_impl\_sync_base.py", line 88, in _sync
  File "playwright\_impl\_browser_type.py", line 90, in launch
  File "playwright\_impl\_connection.py", line 39, in send
  File "playwright\_impl\_connection.py", line 63, in inner_send
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
[5196] Failed to execute script run due to unhandled exception!

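Problem packaging with pyinstaller

After packaging into an exe, running it fails because the playwright module cannot be found, even though it is installed in my conda virtual environment. Did the author run into this problem when packaging?

The URL in the command does not need single quotes (Win11, release v1.1.0)

My actual results are shown below.
The author might consider updating the README; I have seen earlier issues reporting the same problem.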

E:\tools>docdown 'https://max.book118.com/html/2017/1221/145235973.shtm'
Traceback (most recent call last):
  File "run.py", line 6, in <module>
  File "tool.py", line 121, in download_from_url
  File "playwright\sync_api\_generated.py", line 7413, in goto
  File "playwright\_impl\_sync_base.py", line 88, in _sync
  File "playwright\_impl\_page.py", line 493, in goto
  File "playwright\_impl\_frame.py", line 122, in goto
  File "playwright\_impl\_connection.py", line 39, in send
  File "playwright\_impl\_connection.py", line 63, in inner_send
playwright._impl._api_types.Error: Protocol error (Playwright.navigate): Cannot navigate to invalid URL [{"code":-32000,"message":"Cannot navigate to invalid URL"}]
=========================== logs ===========================
navigating to "'https://max.book118.com/html/2017/1221/145235973.shtm'", waiting until "load"
============================================================
[12224] Failed to execute script 'run' due to unhandled exception!

E:\tools>docdown https://max.book118.com/html/2017/1221/145235973.shtm
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.
Image contains an alpha channel. Computing a separate soft mask (/SMask) image to store transparency in PDF.

Error: navigating to "", waiting until "load"

PS C:\WINDOWS\system32> cd
PS C:\WINDOWS\system32> cd C:\Users\Desktop\docdown-1.1.0
PS C:\Users\Desktop\docdown-1.1.0> ./docdown 'https://max.book118.com/html/2022/1028/6113124131005010.shtm'
您可以直接访问PPT预览(无广告):
https://view41.book118.com/?readpage=jDA0r@OLdvjWvcxa0TKXsQ==&furl=c0LZaqbditkfo3Mn_ZthAeCN86wcDVAMFcwHXsMVYHJtg27mBxWPCRGee0YeQxh9NyNl0gM8pactvnE1TwnN11DXLFTQGzZbJDOxY2rxS_M=&n=1
invalid literal for int() with base 10: ''
下载PPT失败,请至GitHub提交issue,附上下载链接
下载失败,请注意关闭代理,如果还有问题,请至GitHub提交issue,附上下载链接
PS C:\Users\Desktop\docdown-1.1.0> ./docdown 'https://max.book118.com/html/2022/1028/6113124131005010.shtm'
您可以直接访问PPT预览(无广告):

Protocol error (Playwright.navigate): Cannot navigate to invalid URL [{"code":-32000,"message":"Cannot navigate to invalid URL"}]
=========================== logs ===========================
navigating to "", waiting until "load"
============================================================
下载PPT失败,请至GitHub提交issue,附上下载链接
下载失败,请注意关闭代理,如果还有问题,请至GitHub提交issue,附上下载链接
PS C:\Users\Desktop\docdown-1.1.0>

Document download link: https://max.book118.com/html/2022/1028/6113124131005010.shtm
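Both failures in this log come from an empty value being used as if it were valid: int('') raises the "invalid literal" error, and an empty preview URL produces "Cannot navigate to invalid URL". A minimal sketch of checking such values before use follows. The helper names parse_page_count and is_navigable are hypothetical, and where DocDown actually reads these values is not shown in this report.

```python
from typing import Optional
from urllib.parse import urlparse


def parse_page_count(text: str) -> Optional[int]:
    """Return the page count as an int, or None when the text is not a plain number."""
    text = text.strip()
    if not (text.isascii() and text.isdigit()):
        return None  # covers the empty string that makes int() raise
    return int(text)


def is_navigable(url: str) -> bool:
    """True only for absolute http(s) URLs that include a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    print(parse_page_count(""))
    print(is_navigable(""))
    print(is_navigable("https://max.book118.com/html/2022/1028/6113124131005010.shtm"))
```

With checks like these, the tool could report "preview link not found" directly instead of surfacing a lower-level exception.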
