
lucasxlu / lagoujob

Stars: 260 | Watchers: 29 | Forks: 129 | Size: 28.81 MB

Data Analysis & Mining for lagou.com

Home Page: https://www.zhihu.com/question/36132174/answer/94392659

License: Apache License 2.0

Languages: Python 99.29%, Shell 0.71%
Topics: lagou, data-analysis, web-crawler, machine-learning, data-mining, python3, nlp

lagoujob's Introduction

Hi there 👋

lagoujob's People

Contributors

lucasxlu, m2shad0w


lagoujob's Issues

Crawling "machine learning" postings fails immediately with an error

items = response.json()['content']['data']['page']['result'] fails here with
KeyError: 'content'

Debugging further, the response itself is {'status': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '112.95.180.65', 'state': 2402} (the msg translates to "you are operating too frequently, please visit again later").
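That payload is lagou's throttling response. A minimal sketch, not from this repo, of guarding against it: check 'status' before indexing into 'content', and back off between retries (fetch_page, its parameters, and the delays are all assumptions):

```python
import time
import requests

def fetch_page(session: requests.Session, url: str, data: dict, max_retries: int = 3):
    """Hypothetical helper: POST a search request and return the result list.

    Retries with a growing delay when lagou answers with the throttling
    payload ({'status': False, 'msg': '您操作太频繁...'}).
    """
    for attempt in range(max_retries):
        payload = session.post(url, data=data).json()
        if payload.get('status'):  # False means we were throttled
            return payload['content']['data']['page']['result']
        time.sleep(10 * (attempt + 1))  # wait longer before each retry
    raise RuntimeError('still throttled after %d retries' % max_retries)
```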

There is no PositionAjax.json

lagou.com loads data asynchronously, but when I inspect the XHR requests I only see CompanyAjax.json; there is no PositionAjax.json request at all. How does lagou.com serve position data now? Every resource I found refers to PositionAjax.json, so has lagou added yet another anti-crawling mechanism? I am a beginner; any guidance would be appreciated, thanks!
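For reference, the request most write-ups (and this repo) targeted looked roughly like the sketch below; the URL, the form fields (first/pn/kd), and the Referer/cookie handshake are assumptions that may no longer match the live site, which is exactly what this issue suggests:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    # the job-list page for the keyword; visiting it first used to set the
    # cookies that positionAjax.json checks
    'Referer': 'https://www.lagou.com/jobs/list_%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0',
}
session = requests.Session()
session.get(headers['Referer'], headers=headers)
resp = session.post(
    'https://www.lagou.com/jobs/positionAjax.json',
    headers=headers,
    data={'first': 'true', 'pn': 1, 'kd': '机器学习'},  # pn: page number, kd: keyword
)
print(resp.status_code, resp.json())
```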

Couldn't find a tree builder with the features you requested: html5lib

When running the spider, the following error occurs:
File "C:\Users\Admin\Documents\PythonSpyder\LagouJob\spider\jobdetail_spider.py", line 30, in crawl_job_detail
soup = BeautifulSoup(response.text, 'html5lib')

File "C:\Users\Admin\Anaconda3\lib\site-packages\bs4_init_.py", line 156, in init
if isinstance(features, str):

FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?

Note: I have already installed html5lib, but I do not know how to resolve this. Thanks!
Environment: Windows 7, Anaconda, Python 3.5
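The usual cause is that html5lib was installed into a different Python environment than the one running the spider (for example, system Python instead of Anaconda), so installing it inside the active environment and restarting often resolves it. As a defensive measure, here is a sketch (not from the repo) that falls back to other parsers bs4 supports:

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html: str) -> BeautifulSoup:
    # Prefer html5lib, but fall back to lxml or the stdlib parser when the
    # requested tree builder is missing from the current environment.
    for parser in ('html5lib', 'lxml', 'html.parser'):
        try:
            return BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue
    raise RuntimeError('no usable HTML parser found')
```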

The code implements no anti-crawling countermeasures

The code has a couple of problems (see the sketch after this list):
1. It implements no anti-crawling countermeasures, so it cannot keep crawling for long.
2. It derives the number of pages from total_count, but not every page in that range actually has data; toward the end the returned result is empty.
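A sketch of how both points could be handled, stopping on an empty result instead of trusting total_count and sleeping between requests; fetch_page is the hypothetical helper sketched under the first issue above:

```python
import time

def crawl_all(session, url: str, keyword: str) -> list:
    jobs, pn = [], 1
    while True:
        data = {'first': 'true' if pn == 1 else 'false', 'pn': pn, 'kd': keyword}
        result = fetch_page(session, url, data)
        if not result:  # pages near the end can be empty despite total_count
            break
        jobs.extend(result)
        pn += 1
        time.sleep(5)  # crude rate limiting between page requests
    return jobs
```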

Suggest to loosen the dependency on snownlp

Hi, your project LagouJob pins "snownlp==0.12.3" as a dependency. After analyzing the source code, we found that several other versions of snownlp are also suitable without affecting your project, namely snownlp 0.9.8, 0.9.8.2, 0.9.9, 0.9.10, 0.10.1, 0.11.1, 0.12.0, 0.12.1, and 0.12.2. We therefore suggest loosening the dependency from "snownlp==0.12.3" to "snownlp>=0.9.8,<=0.12.3" to avoid possible conflicts when importing more packages, or for downstream projects that use LagouJob.

May I submit a pull request to loosen the dependency on snownlp?

By the way, could you tell us whether this kind of dependency analysis could make maintaining dependencies easier during your development?



For your reference, here are details in our analysis.

Your project LagouJob (commit id: 2c95ac0) directly uses 1 API from the snownlp package.

snownlp.__init__.SnowNLP.__init__

From this API, 3 functions are indirectly called, including 2 of snownlp's internal APIs and 1 external API, as follows (omitting some repeated function occurrences).

[/lucasxlu/LagouJob]
+--snownlp.__init__.SnowNLP.__init__

We scanned snownlp's versions among [0.9.8, 0.9.8.2, 0.9.9, 0.9.10, 0.10.1, 0.11.1, 0.12.0, 0.12.1, 0.12.2] against 0.12.3; the changed functions (diffs listed below) have no intersection with any function or API mentioned above (whether directly or indirectly called by this project).

diff: 0.12.3(original) 0.9.8
['snownlp.seg.y09_2047.CharacterBasedGenerativeModel.load', 'snownlp.tag.__init__.train', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.sentiment.__init__.Sentiment', 'snownlp.classification.bayes.Bayes.classify', 'snownlp.sentiment.__init__.train', 'snownlp.utils.trie.Trie', 'snownlp.__init__.SnowNLP.pinyin', 'snownlp.seg.seg.Seg.load', 'snownlp.__init__.SnowNLP.keywords', 'snownlp.utils.tnt.TnT', 'snownlp.tag.__init__.save', 'snownlp.seg.seg.Seg.__init__', 'snownlp.utils.tnt.TnT.train', 'snownlp.classification.bayes.Bayes.save', 'snownlp.summary.textrank.TextRank.solve', 'snownlp.sentiment.__init__.load', 'snownlp.sentiment.__init__.save', 'snownlp.summary.words_merge.SimpleMerge.merge', 'snownlp.utils.tnt.TnT.load', 'snownlp.utils.trie.Trie.__init__', 'snownlp.utils.tnt.TnT.save', 'snownlp.sentiment.__init__.Sentiment.save', 'snownlp.__init__.SnowNLP', 'snownlp.normal.pinyin.PinYin.__init__', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.normal.zh.transfer', 'snownlp.seg.__init__.train', 'snownlp.seg.__init__.save', 'snownlp.summary.words_merge.SimpleMerge', 'snownlp.summary.textrank.TextRank', 'snownlp.utils.trie.Trie.translate', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.__init__.load', 'snownlp.seg.seg.Seg', 'snownlp.tag.__init__.load', 'snownlp.summary.textrank.KeywordTextRank.solve', 'snownlp.summary.words_merge.SimpleMerge.__init__', 'snownlp.utils.trie.Trie.find', 'snownlp.classification.bayes.Bayes.load', 'snownlp.seg.seg.Seg.save', 'snownlp.seg.seg.Seg.train', 'snownlp.classification.bayes.Bayes', 'snownlp.normal.pinyin.PinYin', 'snownlp.normal.__init__.get_pinyin', 'snownlp.sentiment.__init__.Sentiment.load', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.save', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.utils.trie.Trie.insert', 'snownlp.seg.__init__.seg', 'snownlp.normal.pinyin.PinYin.get']

diff: 0.12.3(original) 0.9.8.2
['snownlp.seg.y09_2047.CharacterBasedGenerativeModel.load', 'snownlp.tag.__init__.train', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.sentiment.__init__.Sentiment', 'snownlp.classification.bayes.Bayes.classify', 'snownlp.sentiment.__init__.train', 'snownlp.utils.trie.Trie', 'snownlp.__init__.SnowNLP.pinyin', 'snownlp.seg.seg.Seg.load', 'snownlp.utils.tnt.TnT', 'snownlp.tag.__init__.save', 'snownlp.seg.seg.Seg.__init__', 'snownlp.utils.tnt.TnT.train', 'snownlp.classification.bayes.Bayes.save', 'snownlp.summary.textrank.TextRank.solve', 'snownlp.sentiment.__init__.load', 'snownlp.sentiment.__init__.save', 'snownlp.utils.tnt.TnT.load', 'snownlp.utils.trie.Trie.__init__', 'snownlp.utils.tnt.TnT.save', 'snownlp.sentiment.__init__.Sentiment.save', 'snownlp.__init__.SnowNLP', 'snownlp.normal.pinyin.PinYin.__init__', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.normal.zh.transfer', 'snownlp.seg.__init__.train', 'snownlp.seg.__init__.save', 'snownlp.summary.textrank.TextRank', 'snownlp.utils.trie.Trie.translate', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.__init__.load', 'snownlp.seg.seg.Seg', 'snownlp.tag.__init__.load', 'snownlp.summary.textrank.KeywordTextRank.solve', 'snownlp.classification.bayes.Bayes.load', 'snownlp.utils.trie.Trie.find', 'snownlp.seg.seg.Seg.save', 'snownlp.seg.seg.Seg.train', 'snownlp.classification.bayes.Bayes', 'snownlp.normal.pinyin.PinYin', 'snownlp.normal.__init__.get_pinyin', 'snownlp.sentiment.__init__.Sentiment.load', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.save', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.utils.trie.Trie.insert', 'snownlp.seg.__init__.seg', 'snownlp.normal.pinyin.PinYin.get']

diff: 0.12.3(original) 0.9.9
['snownlp.seg.y09_2047.CharacterBasedGenerativeModel.load', 'snownlp.tag.__init__.train', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.sentiment.__init__.Sentiment', 'snownlp.classification.bayes.Bayes.classify', 'snownlp.sentiment.__init__.train', 'snownlp.utils.trie.Trie', 'snownlp.__init__.SnowNLP.pinyin', 'snownlp.seg.seg.Seg.load', 'snownlp.utils.tnt.TnT', 'snownlp.tag.__init__.save', 'snownlp.seg.seg.Seg.__init__', 'snownlp.utils.tnt.TnT.train', 'snownlp.classification.bayes.Bayes.save', 'snownlp.sentiment.__init__.load', 'snownlp.sentiment.__init__.save', 'snownlp.utils.tnt.TnT.load', 'snownlp.utils.trie.Trie.__init__', 'snownlp.utils.tnt.TnT.save', 'snownlp.sentiment.__init__.Sentiment.save', 'snownlp.__init__.SnowNLP', 'snownlp.normal.pinyin.PinYin.__init__', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.normal.zh.transfer', 'snownlp.seg.__init__.train', 'snownlp.seg.__init__.save', 'snownlp.utils.trie.Trie.translate', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.__init__.load', 'snownlp.seg.seg.Seg', 'snownlp.tag.__init__.load', 'snownlp.summary.textrank.KeywordTextRank.solve', 'snownlp.classification.bayes.Bayes.load', 'snownlp.utils.trie.Trie.find', 'snownlp.seg.seg.Seg.save', 'snownlp.seg.seg.Seg.train', 'snownlp.classification.bayes.Bayes', 'snownlp.normal.pinyin.PinYin', 'snownlp.normal.__init__.get_pinyin', 'snownlp.sentiment.__init__.Sentiment.load', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.save', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.utils.trie.Trie.insert', 'snownlp.seg.__init__.seg', 'snownlp.normal.pinyin.PinYin.get']

diff: 0.12.3(original) 0.9.10
['snownlp.seg.y09_2047.CharacterBasedGenerativeModel.load', 'snownlp.tag.__init__.train', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.sentiment.__init__.Sentiment', 'snownlp.classification.bayes.Bayes.classify', 'snownlp.sentiment.__init__.train', 'snownlp.utils.trie.Trie', 'snownlp.__init__.SnowNLP.pinyin', 'snownlp.seg.seg.Seg.load', 'snownlp.utils.tnt.TnT', 'snownlp.tag.__init__.save', 'snownlp.seg.seg.Seg.__init__', 'snownlp.utils.tnt.TnT.train', 'snownlp.classification.bayes.Bayes.save', 'snownlp.sentiment.__init__.load', 'snownlp.sentiment.__init__.save', 'snownlp.utils.tnt.TnT.load', 'snownlp.utils.trie.Trie.__init__', 'snownlp.utils.tnt.TnT.save', 'snownlp.sentiment.__init__.Sentiment.save', 'snownlp.__init__.SnowNLP', 'snownlp.normal.pinyin.PinYin.__init__', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.normal.zh.transfer', 'snownlp.seg.__init__.train', 'snownlp.seg.__init__.save', 'snownlp.utils.trie.Trie.translate', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.__init__.load', 'snownlp.seg.seg.Seg', 'snownlp.tag.__init__.load', 'snownlp.summary.textrank.KeywordTextRank.solve', 'snownlp.classification.bayes.Bayes.load', 'snownlp.utils.trie.Trie.find', 'snownlp.seg.seg.Seg.save', 'snownlp.seg.seg.Seg.train', 'snownlp.classification.bayes.Bayes', 'snownlp.normal.pinyin.PinYin', 'snownlp.normal.__init__.get_pinyin', 'snownlp.sentiment.__init__.Sentiment.load', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.save', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.utils.trie.Trie.insert', 'snownlp.seg.__init__.seg', 'snownlp.normal.pinyin.PinYin.get']

diff: 0.12.3(original) 0.10.1
['snownlp.tag.__init__.train', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.sentiment.__init__.Sentiment', 'snownlp.classification.bayes.Bayes.classify', 'snownlp.sentiment.__init__.train', 'snownlp.utils.trie.Trie', 'snownlp.__init__.SnowNLP.pinyin', 'snownlp.tag.__init__.save', 'snownlp.classification.bayes.Bayes.save', 'snownlp.sentiment.__init__.load', 'snownlp.sentiment.__init__.save', 'snownlp.utils.trie.Trie.__init__', 'snownlp.sentiment.__init__.Sentiment.save', 'snownlp.__init__.SnowNLP', 'snownlp.normal.pinyin.PinYin.__init__', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.normal.zh.transfer', 'snownlp.seg.__init__.train', 'snownlp.seg.__init__.save', 'snownlp.utils.trie.Trie.translate', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.__init__.load', 'snownlp.seg.seg.Seg', 'snownlp.tag.__init__.load', 'snownlp.summary.textrank.KeywordTextRank.solve', 'snownlp.classification.bayes.Bayes.load', 'snownlp.utils.trie.Trie.find', 'snownlp.classification.bayes.Bayes', 'snownlp.seg.seg.Seg.train', 'snownlp.normal.pinyin.PinYin', 'snownlp.normal.__init__.get_pinyin', 'snownlp.sentiment.__init__.Sentiment.load', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.utils.trie.Trie.insert', 'snownlp.seg.__init__.seg', 'snownlp.normal.pinyin.PinYin.get']

diff: 0.12.3(original) 0.11.1
['snownlp.classification.bayes.Bayes', 'snownlp.tag.__init__.train', 'snownlp.seg.__init__.train', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.classification.bayes.Bayes.classify', 'snownlp.sentiment.__init__.train', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.normal.__init__.get_pinyin', 'snownlp.seg.__init__.seg', 'snownlp.summary.textrank.KeywordTextRank.solve']

diff: 0.12.3(original) 0.12.0
['snownlp.classification.bayes.Bayes', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.classification.bayes.Bayes.classify', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.normal.__init__.get_pinyin', 'snownlp.seg.__init__.seg', 'snownlp.summary.textrank.KeywordTextRank.solve']

diff: 0.12.3(original) 0.12.1
['snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.normal.__init__.get_pinyin', 'snownlp.seg.__init__.seg', 'snownlp.summary.textrank.KeywordTextRank.solve']

diff: 0.12.3(original) 0.12.2
['snownlp.seg.y09_2047.CharacterBasedGenerativeModel.tag', 'snownlp.summary.textrank.KeywordTextRank', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel.train', 'snownlp.seg.y09_2047.CharacterBasedGenerativeModel', 'snownlp.summary.textrank.KeywordTextRank.solve']

As for other packages, the APIs of @outside_package_name that snownlp calls in the call graph keep the same dependencies across our suggested versions, thus avoiding any outside conflict.

Therefore, we believe it is quite safe to loosen your dependency on snownlp from "snownlp==0.12.3" to "snownlp>=0.9.8,<=0.12.3". This will improve the applicability of LagouJob and reduce the chance of dependency conflicts with other projects and packages.
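Concretely, assuming the pin lives in a requirements.txt (the actual file in the repo may differ), the suggested change is:

```
# before
snownlp==0.12.3
# after: accepts every analyzed version up to 0.12.3
snownlp>=0.9.8,<=0.12.3
```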

Some lagou postings have no city field

1. Some lagou postings have no city field; please add a fallback for this (a sketch follows below).
2. Running lagou_spider.py crawls the fields defined directly in that file instead of reading the job list under config. Does this require some special handling?
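A minimal sketch of the fallback suggested in point 1 (the field names are hypothetical), using dict.get so a missing city does not raise a KeyError:

```python
def parse_position(item: dict) -> dict:
    # dict.get supplies a default when lagou omits a field entirely
    return {
        'positionName': item.get('positionName', ''),
        'city': item.get('city', 'unknown'),
        'salary': item.get('salary', ''),
    }
```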
