awesome-web-data-extractor

A curated list of promising Web Data Extractors resources

80legs - Powerful and Economical Service Platform for Crawling and Processing Web Content
http://www.80legs.com/
Agenty – Hosted Web Scraping Tool
https://www.agenty.com/
Anthracite
http://freecode.com/projects/anthracite
Aristo - Answer Questions with a Knowledgeable Machine http://allenai.org/aristo/
artoo.js - The Client-Side Scraping Companion http://medialab.github.io/artoo/
AutoMate - Automate Data Extraction
https://www.networkautomation.com/

Automated RSS Scraper Scripts
http://www.djeaux.com/rss/
Automated Information Solutions
http://www.automated-info-solutions.com/
Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery
http://portal.acm.org/citation.cfm?id=640423&dl=ACM&coll=portal
Beautiful Soup
http://freecode.com/projects/beautifulsoup
Beautiful Soup - HTML/XML Parser for Quick Turnaround Screen Scraping and Web Data Extraction http://www.crummy.com/software/BeautifulSoup/
BLIASoft Knowledge Discovery http://www.bliasoft.com/Eindex.html
Bot Research
http://www.BotResearch.info/
BYU Data Extraction Research Group
http://www.deg.byu.edu/
Captiva Software: Digital Information Capture Software
http://www.emc.com/enterprise-content-management/captiva/captiva.htm
ChartSearch Data Search Technology
http://www.ChartSearch.net/
Client-Side Deep Web Data Extraction
http://www.tic.udc.es/~mad/publications/ceceast2004.pdf
CloudScrape – Extract, Enrich and Connect
http://www.cloudscrape.com/
Common Crawl
http://www.commoncrawl.org/

Connotate – Web Data Extraction and Monitoring
http://www.connotate.com/
Content Grabber – Extract Data from Websites
http://www.ContentGrabber.com/
ContextMiner - Tools to Collect Data, Metadata and Contextual Information http://www.contextminer.org/
cQuery - Content Query Engine
http://cquery.com/
CrawlMonster
http://www.crawlmonster.com/
Crawly
http://crawly.diffbot.com/
Create a Crawler - Extract Data From an Entire Website https://www.import.io/
cURL groks URLs - Command Line Tool for Transferring Data http://curl.haxx.se/
Data Extraction Services
http://www.dataextractionservices.com/
DataHen – Advanced Web Scraping and Data Extraction Services
https://www.datahen.com/
Data Mining Resources
http://www.DataMiningResources.info/
Data Miner – Extract Data From any Website in Seconds
https://data-miner.io/
Dataminr - Real-time Information Discovery http://www.dataminr.com/
Data Scraper – Easy Web Scraping with Google Chrome
https://chrome.google.com/webstore/detail/data-scraper-easy-web-scr/nndknepjnldbdbepjfgmncbggmopgden?hl=en-US

DataSift - Powerful Social Data Platform http://datasift.com/
Data Toolbar – Web Data Extraction Software Made Simple
http://datatoolbar.com/
DataWatch Monarch – Self-Service Data Preparation
http://www.datawatch.com/
DataWrangler - Data Cleaning and Transformation Tool http://vis.stanford.edu/wrangler/
Deep Web Research 2017
http://www.DeepWebResearch.info/
DEiXTo – Powerful Web Data Extraction Tool Based on W3C DOM
http://deixto.com/
dexi.io – Web Data Processing for Professionals – Extract, Enrich and Connect
https://dexi.io/
DiffBot – Web Data Extraction Using Artificial Intelligence
http://www.DiffBot.com/
Digital Footprints - Collect Facebook Data http://digitalfootprints.dk/
DiscoverText - Import, Sort, Distribute and Analyze Electronic Content from eMail, Document Repositories, and Social Media http://discovertext.com/
Easy PDF Cloud https://www.easypdfcloud.com/
Easy Web Extract – Best Tool for Web Scraping
http://webextract.net/
eGrabber - Data Capture Tools
http://www.egrabber.com/
Facepager - Fetching Public Data From Facebook https://github.com/strohne/Facepager

FeedsAPI - Extract Content from Web Pages Tool http://www.feedsapi.com/
Ficstar Software - Web Data Extraction
http://www.ficstar.com/
File Information Tool Set (FITS) https://projects.iq.harvard.edu/fits
FMiner – Web Scraping Software
http://www.fminer.com/
Fresh WebSuction
http://www.freshwebmaster.com/
Grabby
https://grabby.io/
Grepsr – Web Scraping Made Simple, Fast and Manageable
https://www.grepsr.com/
Helium Scraper
http://www.heliumscraper.com/
Huginn - Your Agents Are Standing By https://github.com/cantino/huginn
iMacros – Data Extraction
http://imacros.net/overview
Imagination Engines
http://www.Imagination-Engines.com/
Import.io - Turn the Web Into Data With Extractors, Crawlers and Connectors https://import.io/
InfoExtractor - Extracts Relevant Information from Blogs, YouTube and Twitter http://www.infoextractor.org/
Information Retrieval (IR) and Information Extraction (IE) on the Web
http://www.webir.org/

Introduction to Information Retrieval
http://www-nlp.stanford.edu/IR-book/
iOpus Internet Macros
http://www.iopus.com/imacros/
iRobotSoft – Visual Web Scraping and Web Automation
http://irobotsoft.com/
iWeb Scraping Services
http://www.iwebscraping.com/
Junar - Discovering Data http://www.junar.com/
Karma - Data Integration Tool
http://www.isi.edu/integration/karma/
Kimono - Turn Website Into Structured APIs From Your Browser In Seconds https://www.kimonolabs.com/
Knowledge Discovery Resources
http://www.KnowledgeDiscovery.info/
Knowlesys® - Web Data Extraction, Web Grabber and Screen Scraper
http://www.knowlesys.com/index.htm
Liberty Metrics – Web Scraping Services
http://libertymetrics.com/
LingPipe – Information Extraction and Data Mining Tools
http://alias-i.com/lingpipe/
Metadata Extraction Tool
http://meta-extractor.sourceforge.net/
Mozenda – Comprehensive Web Data Gathering
http://www.mozenda.com/
NCapture - Capture Web Content http://www.qsrinternational.com/products_nvivo_add-ons.aspx

Netlytic - Making Sense of Online Conversations https://netlytic.org/home/
Newprosoft – Web Data Extraction Software
http://newprosoft.com/
NewsClipper.com - Snip and Ship Dynamic News Content to Your Web Pages
http://www.newsclipper.com/
Octoparse – Automated Web Scraping Software
http://www.octoparse.com/
Online Data Extractor Tool
http://www.onlinedataextractor.com/
OutWit Hub - Harvest the Web With Your Own Web Collection Engine http://www.outwit.com/
ParseHub – Web Crawling Using Machine Learning
http://www.ParseHub.com/
Pervasive Data Management and Integration Products
http://www.pervasive.com/
Priceonomics - Crawl Data From the Web http://priceonomics.com/
QL2 Software - Unstructured Data Management and Web Mining Software
http://www.ql2.com/
Quick Code
https://quickcode.io/
REBOL Technologies
http://www.rebol.com/
SalesTools.io
https://salestools.io/
Semantic Scholar - Free Scientific Literature Search and Discovery http://allenai.org/semantic-scholar/

ScrapeForge
http://freecode.com/projects/scrapeforge
ScrapeHero
https://www.scrapehero.com/
Scraper
http://freecode.com/projects/scraper
ScrapingHub – Cloud Based Data Extraction Tool
http://www.ScrapingHub.com/
Scraping Solutions – When the Solution You Seek Seems Impossible
https://www.scrapingsolutions.com.au/
Scrapy – Open Source Web Scraping Framework for Python
http://scrapy.org/
Screen-Scraper
http://freecode.com/projects/screenscraper
Screen-Scraper – Extracts Information From Web Sites
http://www.Screen-Scraper.com/
Screenscraping the Senate by Paul Ford
http://www.xml.com/pub/a/2004/09/01/hack-congress.html
Search and Replace with TextPipe Pattern Matching
http://www.datamystic.com/textpipe.html
Sensible Code
http://sensiblecode.io/
Social Media Data Collection Tools http://socialmediadata.wikidot.com/
Software for Web Scraping
http://scraping.pro/software-for-web-scraping/
Spinn3r - Indexing the Blogosphere http://docs.spinn3r.com/#overview

SPSS Modeler
http://developer.ibm.com/predictiveanalytics
Squirro - Find, Remember, Organize and Share Important Information https://squirro.com/
STACKS - Social Media Tracker, Analyzer, & Collector Toolkit at Syracuse https://github.com/bitslabsyr/stack
TadaWeb - Clone and Amplify Human Intelligence for Web Data Collection and Analysis https://www.tadaweb.com/
Texifter - Search, Sift, Sort, Classify and Analyze http://texifter.com/
TextConverter 4
https://www.simx.com/
TextRazor - Text Analysis Infrastructure https://www.textrazor.com/
Topicgrazer - Graze On Web Pages and Documents http://www.topicscape.com/Topicgrazer/help.php
UiPath – Web Data Extraction
https://www.uipath.com/guides/web-data-extraction
Unit Miner - Web Data Extraction Software
http://www.unitminer.com/
VietSpider
http://binhgiang.sourceforge.net/
VisualScraper – Web Data Extractor
http://www.VisualScraper.com/
Visual Web Ripper – Data Extraction Software
http://www.VisualWebRipper.com/
Visual Web Task
http://www.lencom.com/VisualWTSite.html

W3C Publishes Data Extraction Language (DEL) as W3C Note
http://xml.coverpages.org/ni2001-11-06-a.html
Web Content Extractor
http://www.newprosoft.com/
Web Data Extraction
http://www.wintask.com/web-data-extraction.php
Web Data Extraction Software Data Toolbar
https://webdataextractionsoftwaredatatoolbar.en.softonic.com/
Web Data Extractor
http://www.rafasoft.com/
Web Data Extractor
http://www.webextractor.com/
Web Data Extractor
http://fivesmallq.github.io/web-data-extractor
Web Data Extractor
http://www.lantechsoft.com/web-data-extractor.html
Web Data Guru – Web Data Extraction and Scraping Services
http://www.webdataguru.com/
Web-Harvest – Open Source Web Data Extraction Tool
http://web-harvest.sourceforge.net/index.php
WebHarvy – Intuitive Powerful Visual Web Scraper
https://www.webharvy.com/index.html
Webhose.io – Web Data For Your Business
http://www.webhose.io/
Web Robots – Web Scraping and Crawling
https://webrobots.io/
Web Scraper
http://www.webscraper.io/

Web Scraping – Wikipedia
https://en.wikipedia.org/wiki/Web_scraping
Website Data Extractor – Time to Rethink Web Scraping
http://www.kofax.com/
Website Extractor – Offline Browser
http://www.internet-soft.com/extractor.htm
WebSunDew – Advanced Web Scraping Tool
http://www.websundew.com/
Wikimedia Public Data Dumps http://meta.wikimedia.org/wiki/Data_dumps
WinAutomation
http://www.winautomation.com/
XRay Web Scraping Tool
http://freecode.com/projects/xrayguibasedwebscrapingtool
YaCy Web page Indexer
http://freecode.com/projects/yacy

Subject Tracer™ Information Blogs
Subject Tracer™ Information Blogs, created and developed by the Virtual Private Library™, combine the best of the latest tools on the Internet. Using bots, blogs and news aggregators, the Subject Tracer™ Information Blogs generate RSS feeds with the latest resources, creating a current flow of information resources through niche subject tracers. I am proud to be the creator of the Internet’s first Subject Tracer™ Information Blogs:
Virtual Private Library™ http://www.VirtualPrivateLibrary.com/
Accessibility Resources
http://www.AccessibilityResources.info/
Agriculture Resources
http://www.AgricultureResources.info/
AnswerSpot
http://www.AnswerSpot.us/
Artificial Intelligence Resources
http://www.AIResources.info/
Astronomy Resources
http://www.AstronomyResources.info/
Auction Resources
http://www.AuctionResources.info/
Biological Informatics
http://www.BiologicalInformatics.info/
Biotechnology Resources
http://www.BiotechnologyResources.info/
Bot Research
http://www.BotResearch.info/
Business Intelligence Resources
http://www.BIResources.info/

ChatterBots
http://www.ChatterBots.info/
Data Mining Resources
http://www.DataMiningResources.info/
Deep Web Research
http://www.DeepWebResearch.info/
Directory Resources
http://www.DirectoryResources.info/
eCommerce Resources
http://eCommerceResources.info/
Education and Academic Resources
http://www.EducationResources.info/
Elder Resources
http://www.ElderResources.info/
Employment Resources
http://www.EmploymentResources.info/
Entrepreneurial Resources
http://www.EntrepreneurialResources.info/
Fact Checkers Directory
http://www.FactCheckers.info/
Financial Sources
http://www.FinancialSources.info/
Finding People
http://www.FindingPeople.info/
Games Resources
http://www.GamesResources.info/
Genealogy Resources
http://www.GenealogyResources.info/

Grant Resources
http://www.GrantResources.info/
Green Files
http://www.GreenFiles.info/
Grid, Distributed and Cloud Computing Resources
http://www.GridResources.info/
Healthcare Resources
http://www.HealthcareResources.info/
Information Futures Markets
http://www.InformationFuturesMarkets.com/
Information Quality Resources
http://www.InformationQualityResources.info/
International Trade Resources
http://www.InternationalTradeResources.info/
Internet Alerts
http://www.InternetAlerts.info/
Internet Demographics
http://www.InternetDemographics.info/
Internet Experts 2016
http://www.InternetExperts.info/
Internet Hoaxes
http://www.InternetHoaxes.info/
Intrapreneurial Resources
http://www.IntrapreneurialResources.info/
Journalism Resources
http://www.JournalismResources.info/
Knowledge Discovery
http://www.KnowledgeDiscovery.info/

Military Resources
http://www.MilitaryResources.info/
New Economy Analytics, Resources and Alerts
http://www.NewEconomyAnalytics.com/
Outsourcing/Offshoring Information and Resources
http://www.OutsourcingOffshore.us/
Privacy Resources
http://www.PrivacyResources.info/
ProxyCrawl - Crawling and Scraping Tools https://proxycrawl.com
Reference Resources
http://www.ReferenceResources.info/
Research Resources
http://www.ResearchResources.info/
RestStress™
http://www.RestStress.com/
Script Resources
http://www.ScriptResources.info/
ShoppingBots
http://www.ShoppingBots.info/
Social Informatics
http://www.SocialInformatics.info/
Statistics Resources and Big Data
http://www.StatisticsResources.info/
Student Research
http://www.StudentResearch.info/
Theology Resources
http://www.TheologyResources.info/
Tutorial Resources
http://www.TutorialResources.info/

World Wide Web Reference
http://www.WWWReference.info/

Original material from, and inspired by, Web Data Extractors 2018, a white paper link compilation written by Marcus P. Zillman, M.S., A.M.H.A.
