Comments (3)
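Yes, this is supported. First set the content page rule, for example:
    'content_url_regexes' => array(
        // Dots are escaped so they match a literal "." only
        "http://www\.mafengwo\.cn/i/\d+\.html",
    ),
Then generate the content page URLs in bulk inside on_scan_page:
    $spider->on_scan_page = function($page, $content, $phpspider)
    {
        // Queue content page URLs for IDs 0 to 999
        for ($i = 0; $i < 1000; $i++)
        {
            $url = "http://www.mafengwo.cn/i/{$i}.html";
            $phpspider->add_url($url);
        }
    };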
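The rule has to be set first because add_url only queues a URL that passes the list page or content page check. The following is a minimal sketch for checking the generated URLs against the rule before running the spider. It assumes the rule is applied as a regular expression wrapped in "#" delimiters; matches_content_rule is a hypothetical helper written for this example and is not part of phpspider.

```php
<?php
// Same rule as above, with the dots escaped.
$content_url_regexes = array(
    'http://www\.mafengwo\.cn/i/\d+\.html',
);

// Hypothetical helper (not a phpspider function): returns true when the URL
// matches at least one rule. Assumes "#" delimiters and case-insensitive matching.
function matches_content_rule($url, array $regexes)
{
    foreach ($regexes as $regex)
    {
        if (preg_match('#' . $regex . '#i', $url))
        {
            return true;
        }
    }
    return false;
}

// Check a few generated URLs the same way the loop above builds them.
for ($i = 0; $i < 3; $i++)
{
    $url = "http://www.mafengwo.cn/i/{$i}.html";
    echo $url, ' => ', matches_content_rule($url, $content_url_regexes) ? 'match' : 'no match', PHP_EOL;
}
```

A URL that fails this check would be silently skipped by add_url, so a quick test like this can save a crawl that finds nothing.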
from phpspider.
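What if the content pages are not linked from the entry page or from any list page? I only want to generate the content page URLs in bulk and have the spider crawl them one by one.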
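A small change to the add_url function is enough. Add a new $force_content parameter and pass true for it when calling add_url. The content page and list page rules can then both be left empty.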
    public function add_url($url, $options = array(), $depth = 0, $force_content = false)
    {
        // Delivery status: true once the link has been pushed to the queue
        $status = false;
        $link = $options;
        $link['url'] = $url;
        $link['depth'] = $depth;
        $link = $this->link_uncompress($link);
        if ($this->is_list_page($url))
        {
            $link['url_type'] = 'list_page';
            $status = $this->queue_lpush($link);
        }
        // $force_content queues the URL as a content page even when no rule matches
        if ($this->is_content_page($url) || $force_content)
        {
            $link['url_type'] = 'content_page';
            $status = $this->queue_lpush($link);
        }
        if ($status)
        {
            if ($link['url_type'] == 'scan_page')
            {
                log::debug("Find scan page: {$url}");
            }
            elseif ($link['url_type'] == 'list_page')
            {
                log::debug("Find list page: {$url}");
            }
            elseif ($link['url_type'] == 'content_page')
            {
                log::debug("Find content page: {$url}");
            }
        }
        return $status;
    }
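To illustrate the effect of the new parameter, here is a self-contained sketch. SketchSpider is a stand-in class written for this example, not the real phpspider class: it keeps only the queueing decision from the modified add_url, uses a plain array as the queue, and assumes rules are matched as regular expressions with "#" delimiters.

```php
<?php
// Stand-in for the spider (hypothetical, NOT the real phpspider class).
class SketchSpider
{
    public $queue = array();
    public $list_url_regexes = array();     // left empty, as in the suggestion
    public $content_url_regexes = array();  // left empty, as in the suggestion

    // Returns true when the URL matches at least one rule.
    private function matches($url, array $regexes)
    {
        foreach ($regexes as $regex)
        {
            if (preg_match('#' . $regex . '#i', $url))
            {
                return true;
            }
        }
        return false;
    }

    public function add_url($url, $options = array(), $depth = 0, $force_content = false)
    {
        $status = false;
        $link = $options;
        $link['url'] = $url;
        $link['depth'] = $depth;
        if ($this->matches($url, $this->list_url_regexes))
        {
            $link['url_type'] = 'list_page';
            array_unshift($this->queue, $link);
            $status = true;
        }
        // With no rules set, only $force_content lets the URL through.
        if ($this->matches($url, $this->content_url_regexes) || $force_content)
        {
            $link['url_type'] = 'content_page';
            array_unshift($this->queue, $link);
            $status = true;
        }
        return $status;
    }
}

$spider = new SketchSpider();
for ($i = 0; $i < 5; $i++)
{
    // Fourth argument true: queue as a content page without any rule.
    $spider->add_url("http://www.mafengwo.cn/i/{$i}.html", array(), 0, true);
}
// Without the flag and without rules, the URL is not queued.
$skipped = $spider->add_url("http://www.mafengwo.cn/i/999.html");
echo count($spider->queue), PHP_EOL;
var_dump($skipped);
```

Because $force_content defaults to false, existing calls to add_url keep their current behavior; only callers that pass true bypass the rules.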