FinTech Spider

FinTech (i.e., Financial Technology)

"FinTech Spider" is a Scrapy-based spider that crawls large volumes of financial data from the Internet.

The data crawled by "FinTech Spider" has been used by 嗅金牛 and 数知源.

Structure of "FinTech Spider"

Only important dirs & files are listed here.

| Directory/File | Author | Usage |
| --- | --- | --- |
| README.md | lxw | Documentation for this project |
| Anti_Anti_Spider/ | hee | |
| Demo/ | | Demonstrations (e.g. PhantomJS, proxies, etc.) |
| Demo/ArticleSpider/ | hee | |
| Demo/CNKI_Patent/ | lxw | A demo Scrapy spider project that supports Selenium/PhantomJS/User-Agent/IP-Proxy |
| Demo/geetestcrack.py | hee | |
| Demo/phantomjs_proxy.py | lxw | Adds an IP proxy to PhantomJS |
| Demo/user_agent.txt | hee | A large list of User-Agent strings |
| Spiders/ | | Python scripts that crawl the data we need from the Internet |
| Spiders/CJODocIDSpider/ | lxw | (w/ Scrapy) Spiders for crawling case details from 裁判文书网 (China Judgements Online) |
| Spiders/CJOSpider/ | lxw | (w/ Scrapy) Spiders for crawling basic case info from 裁判文书网 (China Judgements Online) |
| Spiders/CninfoSpider/ | hee | Spiders for crawling data from 巨潮资讯 (Cninfo) |
| Spiders/CNKI_Patent_Spider/ | lxw | (w/o Scrapy) Spiders for crawling patent data from 知网 (CNKI) |
| Spiders/NECIPSSpider/ | lxw | (w/ Scrapy) Spiders for crawling data from 国家企业信用信息公示系统 (National Enterprise Credit Information Publicity System) |
| Spiders/new_three_board/ | lxw | (w/ Scrapy) Spiders for crawling data from 全国中小企业股份转让系统 (National Equities Exchange and Quotations) |
| Spiders/SBJSpider/ | hee | |
| Spiders/TYCSpider/ | lxw | (w/ Scrapy, PhantomJS) Spiders for crawling patent/copyright data from 天眼查 (Tianyancha) |

TODO

He Chen:

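  1. Update README.md with the purpose of each key directory that has been committed (if a subdirectory contains key files, please list those as well).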

Xiaowei Liu:

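  • CJOSpider: the CJOSpider architecture has a problem. URL deduplication has been turned off, so the same pages may be crawled more than once.
  1. [Probably slightly better than rpush; leaving this unchanged for now, since every alternative seems to have its own problems] Change the proxy acquisition strategy to lpop() + insert (at the sixth position) instead of lpop() + rpush().
  2. [NO, in principle rerunning CJOSpider.py alone should be enough] Add code to crawl the tasks in the Redis TASKS_HASH that have not finished crawling (there should be fewer than CONCURRENT_REQUESTS of them?).
  3. [NO, in principle rerunning CJODocIDSpider.py alone should be enough] Add code to crawl the tasks in the Redis DOC_ID_HASH that have not finished crawling.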
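
The two proxy rotation strategies named in item 1, lpop() + rpush() versus lpop() + insert at the sixth position, can be compared with a small in-memory sketch. It is only an illustration: it models the proxy list with a Python `deque` rather than the Redis list the spiders use, and the function names `rotate_rpush` and `rotate_insert`, the `position` parameter, and the proxy labels are hypothetical, not taken from this project.

```python
from collections import deque


def rotate_rpush(pool):
    """lpop() + rpush(): take the head proxy and put it back at the tail."""
    proxy = pool.popleft()
    pool.append(proxy)
    return proxy


def rotate_insert(pool, position=5):
    """lpop() + insert: take the head proxy and put it back at a fixed index.

    position=5 is the sixth slot (0-based). The index is clamped so that
    pools shorter than six entries still work.
    """
    proxy = pool.popleft()
    pool.insert(min(position, len(pool)), proxy)
    return proxy


# Hypothetical proxy labels, used only for illustration.
rpush_pool = deque(["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"])
insert_pool = deque(rpush_pool)

first_rpush = rotate_rpush(rpush_pool)     # p1 moves to the very end
first_insert = rotate_insert(insert_pool)  # p1 moves to the sixth slot
```

With rpush, a proxy is reused only after every other proxy in the pool has been handed out once. With a fixed insert position, it comes back after five other proxies regardless of pool size, so the entries at the tail of the list are never reached while the head keeps cycling. Which behavior is preferable depends on how quickly the target site bans an address, which this sketch does not model.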

Contributors

lxw0109, hee0624
