Comments (5)
On line 2114 of phpspider.php, the download should use $collect_url: `$html = requests::$method($collect_url, $params);`
Otherwise the attached_url is never downloaded.
Could you make the change and send me a patch?
from phpspider.
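Thank you so much, seriously!!
I searched for ages and could not work out why pages were loaded but nothing was downloaded except the home page. It turns out the code itself was the problem!
@owner888, you really made us suffer!! Your code does save us a lot of time, but please at least test it.
I spent three days and three nights without finding the cause.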
@yingzheng1980 One more question: some detail pages are paginated and some are not. How can I tell which is which?
`
'fields' => array(
array(
'name' => "contents",
'selector' => "//div[contains(@class,'art-pre')]/a/@href",
// Alternative selectors that were tried:
// //div[contains(@class,'art-pre')]//a//@href
// //*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
'repeated' => true,
'required' => true, // required field
'children' => array(
array(
// Extract the URLs of the other pages for later use
'name' => 'content_page_url',
'selector' => "//text()"
),
array(
// Extract the content of the other pages
'name' => 'page_content',
'source_type' => 'attached_url',
'attached_url' => 'content_page_url', // 'attached_url'=>"https://www.zhihu.com/r/answers/{comment_id}/comments",
'selector' => "//div[contains(@class,'textWrap')]"
),
),`
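One way to tell whether a detail page is paginated is to check whether the pagination selector matches anything. The following is only a minimal sketch using PHP's built-in DOM extension: `find_pagination_links` is a hypothetical helper, not part of phpspider, and the sample HTML is made up for illustration.

```php
<?php
// Hypothetical helper (not part of phpspider): returns the hrefs found inside
// the pagination container, or an empty array when the page has no pagination.
function find_pagination_links(string $html): array
{
    if (trim($html) === '') {
        return array();
    }
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    // Same XPath as the 'contents' selector in the config above.
    foreach ($xpath->query("//div[contains(@class,'art-pre')]/a/@href") as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}

// Made-up sample pages: one with a pagination block, one without.
$paged  = '<html><body><div class="art-pre"><a href="/articleinfo/x-2.html">2</a></div></body></html>';
$single = '<html><body><div class="textWrap">only one page</div></body></html>';

echo count(find_pagination_links($paged)) > 0 ? "paginated\n" : "single page\n";
echo count(find_pagination_links($single)) > 0 ? "paginated\n" : "single page\n";
```

An empty result means the page has no pagination links, so only the current page needs to be scraped.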
http://www.qikan.com.cn/articleinfo/dinb20222801-2.html
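Even after the change it still does not work correctly:
1: Pages without pagination at the bottom are silently ignored and skipped.
'selector' => "//div[contains(@class,'art-pre')]/a/@href", because some pages are not paginated, so there is no "next page" link at the bottom.
Without that link the page is never scraped.
2: For paginated pages, only the next page and the other content pages are scraped; the content of the current page is not.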
`<?php
require_once __DIR__ . '/../autoloader.php';
use phpspider\core\phpspider;
use phpspider\core\requests;
use phpspider\core\selector;
/* Do NOT delete this comment */
/* Important: simulated login */
$cookies = "ASP.NET_SessionId=uqbyzahwaa5fedgldcawsogx; Hm_lvt_782a719ae16424b0c7041b078eb9804a=1657892367,1658402814,1658581663,1658932747; Hm_lvt_29f14b13cac2f8b4e5fc964806f3ea52=1657892367,1658402820,1658581663,1658932747; Hm_lpvt_782a719ae16424b0c7041b078eb9804a=1658932755; Hm_lpvt_29f14b13cac2f8b4e5fc964806f3ea52=1658932755; UserToken=nrbjZ+ZFD3ulIoEX50957cwO1CrVaO5/NLAFj6bcy1Gx6rsh; LoginUserName=kavt12; LoginPassword=NRW/PSbsXFo=";
requests::set_cookies($cookies, 'www.qikan.com.cn');
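/*
 * For scan_urls: compute the year and the week number and add them to each request.
 * Get the current year.
 * Get the current week number.
 */
// Week number
$year = date('Y');
$week = date('W'); // 电脑报 (diannaobao) usually comes out on Monday afternoon, so use the week number minus 2
$week = $week - 2;
//die;
// 7: mainly trying to add pagination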
$configs = array(
'name' => 'diannaobao',
'log_show' => true,
'max_fields' => 2, // collect at most 2 records per run
'domains' => array(
'www.qikan.com.cn'
),
// Entry point
'scan_urls' => array(
"http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html" // http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
),
// Content pages; this part also works now
'content_url_regexes' => array(
"http://www.qikan.com.cn/article/[\s\S]+", //http://www.qikan.com.cn/article/dinb20222701.html
"http://www.qikan.com.cn/articleinfo/[\s\S]+"
),
'fields' => array(
array(
'name' => "contents",
//'selector_type' => 'regex',
'selector' => "//div[contains(@class,'art-pre')]/a/@href",
// Alternative selectors that were tried:
// //div[contains(@class,'art-pre')]//a//@href
// //*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
'repeated' => true,
'required' => true, // required field
'children' => array(
array(
// Extract the URLs of the other pages for later use
'name' => 'content_page_url',
'selector' => "//text()"
),
array(
// Extract the content of the other pages
'name' => 'page_content',
'source_type' => 'attached_url',
'attached_url' => 'content_page_url', // 'attached_url'=>"https://www.zhihu.com/r/answers/{comment_id}/comments",
'selector' => "//div[contains(@class,'textWrap')]"
),
),
),
// Extract the article title from the content page
array(
'name' => "title",
'selector' => "//div[contains(@class,'article')]//h1", // fallback: //*[@id=\"form1\"]/div[6]/div/div[2]/div[1]/h1
'required' => true
),
// Body text: //div[contains(@class,'textWrap')]
/*
array(
'name' => "text",
'selector' => "//div[contains(@class,'textWrap')]", // alternatives: /article-main, //div[contains(@class,"article-content")] (content section), //div[contains(@class,'textWrap')]
),
*/
/*
array(
'name' => "contents",
'selector' => "//html", ////div[@id='art-pre']//a//@href
'repeated' => true,
'children' => array(
array(
// Extract the URLs of the other pages for later use
'name' => 'content_page_url',
'selector' => "div[contains(@class,'art-pre')]//a//@href" ////div[contains(@class,'art-pre')]//a//@href
),
array(
// Extract the content of the other pages
'name' => 'page_content',
// Send an attached_url request to fetch the data of the other pages
// attached_url uses the content_page_url extracted above
'source_type' => 'attached_url',
'attached_url' => 'http://www.qikan.com.cn/{content_page_url}', //"https://www.zhihu.com/r/answers/{comment_id}/comments", http://www.qikan.com.cn/
'selector' => "//div[contains(@class,'textWrap')]"
)
)
),
*/
// Images
array(
'name' => "pic",
'selector' => "//figure[contains(@class,'image')]//img", ///html/body/div[1]/div[3]/div[1]/div/div[2]/div[1]/div[1]/div[5]
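// This returns an array of images, so one has to be picked, e.g. $data = $data[0]; it probably needs a check: show the image directly if there is one, show the first if there are several. For now it seems fine without any handling.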
),
),
'export' => array(
'type' => 'sql',
'file' => './data/8.sql',
'table' => '数据表',
),
);
$spider = new phpspider($configs);
// How to post-process the extracted fields? Use on_extract_field.
$spider->on_extract_field = function($fieldname, $data, $page)
{
if ($fieldname == 'contents')
{
$contents = $data;
$data = "";
// Note: the loop stops before the last entry, so the final element is not appended
$num = count($contents) - 1;
for ($i = 0; $i < $num; $i++) {
$data .= $contents[$i]['page_content'];
}
/*
foreach ($contents as $content)
{
$data .= $content['page_content'];
}
*/
}
return $data;
};
$spider->start();`
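Both symptoms described above might be addressed at the merge step. As far as I know, phpspider discards a page when a field marked `'required' => true` extracts nothing, which would explain why pages without the `art-pre` links are skipped; removing `required` from `contents` may help, but that is an assumption and untested here. The sketch below only shows the merging logic: `merge_page_contents` is a hypothetical helper, not part of phpspider, that puts the current page's text first and tolerates a missing or empty list of attached pages.

```php
<?php
// Hypothetical helper (not part of phpspider): joins the current page's text
// with the text of every attached page, skipping missing or empty entries.
function merge_page_contents(string $current_page_text, $attached_pages): string
{
    $parts = array();
    if (trim($current_page_text) !== '') {
        $parts[] = $current_page_text;
    }
    // A page without pagination may yield null or an empty array here.
    if (is_array($attached_pages)) {
        foreach ($attached_pages as $page) {
            if (is_array($page) && !empty($page['page_content'])) {
                $parts[] = $page['page_content'];
            }
        }
    }
    return implode("\n", $parts);
}

// Made-up sample data in the shape the 'contents' field produces above.
$attached = array(
    array('content_page_url' => 'x-2.html', 'page_content' => 'page 2 text'),
    array('content_page_url' => 'x-3.html', 'page_content' => 'page 3 text'),
);
echo merge_page_contents('page 1 text', $attached), "\n";
echo merge_page_contents('single page text', null), "\n";
```

Inside `on_extract_field`, the first argument would have to come from the current page itself, for example from a separate field using the `textWrap` selector; how that value is obtained depends on phpspider and is not shown here.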