
Comments (3)

owner888 commented on July 16, 2024

Yes, this is supported. First set up the content-page rule, for example:

    'content_url_regexes' => array(
        "http://www\.mafengwo\.cn/i/\d+\.html",
    ),

Then generate the content-page URLs in bulk inside on_scan_page:

    $spider->on_scan_page = function($page, $content, $phpspider)
    {
        for ($i = 0; $i < 1000; $i++)
        {
            $url = "http://www.mafengwo.cn/i/{$i}.html";
            $phpspider->add_url($url);
        }
    };
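
As a rough, standalone illustration of the idea above (it does not use phpspider itself), the sketch below builds candidate URLs from a template and keeps only those that match the content-page pattern. The helper name `build_content_urls`, the `{id}` placeholder, and the anchored `#...#` regex wrapping are assumptions made for this example, not phpspider's API.

```php
<?php
// Hypothetical helper, not part of phpspider: build candidate content-page
// URLs from a template and keep only those matching the content-page pattern.
function build_content_urls(string $template, int $start, int $end, string $pattern): array
{
    $urls = array();
    for ($i = $start; $i <= $end; $i++) {
        // Substitute the numeric id into the (assumed) {id} placeholder.
        $url = str_replace('{id}', (string) $i, $template);
        // Assumption: the rule is applied as a delimited, anchored regex.
        if (preg_match('#^' . $pattern . '$#', $url) === 1) {
            $urls[] = $url;
        }
    }
    return $urls;
}

// Same pattern as the rule above, with the literal dots escaped.
$pattern = 'http://www\.mafengwo\.cn/i/\d+\.html';
$urls = build_content_urls('http://www.mafengwo.cn/i/{id}.html', 1, 5, $pattern);
print_r($urls);
```

Checking generated URLs against the rule up front is useful because a URL that matches neither the list-page nor the content-page rule is not queued by add_url, as the function body quoted later in this thread shows.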

from phpspider.

p0h5 commented on July 16, 2024

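What if the content pages are not linked from the entry page or any list page? I just want to generate the content-page URLs in bulk and have the crawler fetch them one by one.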


eddy8 commented on July 16, 2024

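A small tweak to the add_url function is enough. Add a $force_content parameter and set it to true when calling add_url. The content-page and list-page rules can then both be left empty.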

    public function add_url($url, $options = array(), $depth = 0, $force_content = false)
    {
        // Enqueue status: true once the link has been pushed onto the queue
        $status = false;

        $link = $options;
        $link['url'] = $url;
        $link['depth'] = $depth;
        $link = $this->link_uncompress($link);

        if ($this->is_list_page($url))
        {
            $link['url_type'] = 'list_page';
            $status = $this->queue_lpush($link);
        }

        if ($this->is_content_page($url) || $force_content)
        {
            $link['url_type'] = 'content_page';
            $status = $this->queue_lpush($link);
        }

        if ($status)
        {
            if ($link['url_type'] == 'scan_page')
            {
                log::debug("Find scan page: {$url}");
            }
            elseif ($link['url_type'] == 'list_page')
            {
                log::debug("Find list page: {$url}");
            }
            elseif ($link['url_type'] == 'content_page')
            {
                log::debug("Find content page: {$url}");
            }
        }

        return $status;
    }
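
To see the effect of the flag in isolation, here is a minimal, self-contained sketch. `ToySpider` is a hypothetical stand-in written for this example, not the real phpspider class: its queue is a plain array, its add_url signature is shortened, and its content-page check is only assumed to resemble the library's.

```php
<?php
// Hypothetical stand-in for the spider, for illustration only.
class ToySpider
{
    public $queue = array();               // stands in for the real URL queue
    public $content_url_regexes = array(); // left empty, as suggested above

    private function is_content_page(string $url): bool
    {
        // Assumption: each rule is applied as a case-insensitive regex.
        foreach ($this->content_url_regexes as $regex) {
            if (preg_match("#{$regex}#i", $url)) {
                return true;
            }
        }
        return false;
    }

    public function add_url(string $url, bool $force_content = false): bool
    {
        // A forced URL is queued as a content page even with no rules set.
        if ($this->is_content_page($url) || $force_content) {
            $this->queue[] = array('url' => $url, 'url_type' => 'content_page');
            return true;
        }
        return false;
    }
}

$spider = new ToySpider();
for ($i = 1; $i <= 3; $i++) {
    $spider->add_url("http://www.mafengwo.cn/i/{$i}.html", true);
}
// Without the flag and without rules, the URL is rejected.
$rejected = $spider->add_url('http://www.mafengwo.cn/i/4.html');
```

With the patched signature shown above, the equivalent call would be `$phpspider->add_url($url, array(), 0, true)`.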

