salimk / Rcrawler
An R web crawler and scraper
Home Page: http://www.sciencedirect.com/science/article/pii/S2352711017300110
License: Other
When I set MaxDepth=1, it crawls depth-first instead of breadth-first.
For example, page1.html contains two links (page2.html and page3.html), and page2.html links to page5.html.
I want to crawl page1.html, page2.html, and page3.html, but Rcrawler crawls page1.html, page2.html, and page5.html.
Also, how can I crawl only the starting page of a website with Rcrawler (just page1.html)? I tried MaxDepth=0, but it does not download any page content; it just creates a folder with the domain name.
Is it possible to use RCrawler on a password protected site?
Preparing browser process Error in "browser:" + i : non-numeric argument to binary operator
Starting at line 514:
for (i in 1:no_cores) {
  pkg.env$Lbrowsers[[i]] <- run_browser()
  cat("browser:"+i+" port:"+pkg.env$Lbrowsers[[i]]$process$port)
  Sys.sleep(1)
  cat(".")
  flush.console()
}
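The error arises because `+` is not a string-concatenation operator in R; cat() joins its comma-separated arguments itself. A minimal corrected sketch of the offending line:

```r
# cat() concatenates its arguments, so commas (not "+") join the pieces:
cat("browser:", i, " port:", pkg.env$Lbrowsers[[i]]$process$port)
```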
When I try to crawl the following website, it always returns an error when the crawler gets to 24.98% complete.
Here is the R command:
Rcrawler(Website = "http://www.lamoncloa.gob.es", no_cores = 1, no_conn = 1)
Here is the error:
24.98 % : 582 parssed from 2330
In process : 586 ..
Error in allpaquet[[s]][[1]][[3]] : subscript out of bounds
Thanks,
Tim
I ran the sample example code for Rcrawler, but it did not return any results.
Can someone let me know what I am missing?
Rcrawler(Website = "http://www.glofile.com", no_cores = 4, no_conn = 4)
In process : 1..
Progress: 100.00 % : 1 parssed from 1 | Collected pages: 0 | Level: 1
I used Rcrawler to scrape this site: "http://www.thegreenbook.com/". I set it to crawl only one level using CSS and used a URL regular-expression filter, but it ignores some links for no apparent reason.
I used rvest and stringr to double-check and found that 7 links are omitted.
Below is the code I used for double-checking the results.
library(Rcrawler)
library(rvest)
library(stringr)
library(dplyr)

url <- "http://www.thegreenbook.com/"
css <- "#classificationIndex a"
filter_string <- "products/search"

# using Rcrawler
Rcrawler(Website = url,
         no_cores = 4,
         no_conn = 4,
         ExtractCSSPat = c(css),
         MaxDepth = 1,
         urlregexfilter = c(filter_string))
length_Rcrawler <- nrow(INDEX[INDEX$Level == 1, ])

# using rvest -----------------------------------------------------
# getting hrefs using the same css
hrefs <- html_session(url) %>%
  html_nodes(css) %>%
  html_attr("href")
hrefs_filtered <- hrefs[str_detect(hrefs, filter_string)]  # same filter as urlregexfilter
length_rvest <- length(hrefs_filtered)
The numbers of links retrieved by Rcrawler and rvest are:
> length_Rcrawler
[1] 28
> length_rvest
[1] 35
Below are the links that Rcrawler omitted:
> setdiff(hrefs_filtered,INDEX[INDEX$Level==1,]$Url)
[1] "http://www.thegreenbook.com/products/search/electrical-guides/"
[2] "http://www.thegreenbook.com/products/search/pharmaceutical-guides/"
[3] "http://www.thegreenbook.com/products/search/office-equipment-supplies-guides/"
[4] "http://www.thegreenbook.com/products/search/garment-textile-guides/"
[5] "http://www.thegreenbook.com/products/search/pregnancy-parenting-guides/"
[6] "http://www.thegreenbook.com/products/search/beauty-care-guides"
[7] "http://www.thegreenbook.com/products/search/golden-year-guides/"
I don't know what could be causing this issue, as the response codes are all 200 and the Stats are all "finished". Also, ExtractCSSPat and urlregexfilter are correct, as I have double-checked them with rvest. So my conclusion is that these links are simply ignored.
Did I do something wrong while using Rcrawler, or is it a bug? Any help is appreciated, thanks!
Thanks for this super useful package. I want to restrict the crawl to certain URL specifications, but capture all links on the crawled pages regardless of whether they match the filter. I can't get this to work in practice. An example:
Rcrawler(
  Website = "https://beta.companieshouse.gov.uk/company/02906991",
  no_cores = 4, no_conn = 4,
  NetworkData = TRUE, statslinks = TRUE,
  crawlUrlfilter = '02906991',
  saveOnDisk = F
)
The page https://beta.companieshouse.gov.uk/company/02906991/officers (which is crawled) includes links such as
https://beta.companieshouse.gov.uk/officers/...
but these pages are not included in the results. E.g.:
NetwIndex %>% str_subset('uk/officers')
character(0)
Shouldn't these links be captured, since I have provided no dataUrlfilter argument? Or am I missing something here?
Thank you for developing this very useful package. However, I have a problem with the crawlUrlfilter argument.
From a large website, I would like to crawl and scrape only those URLs that match a specific pattern. According to the documentation, crawlUrlfilter does exactly what I am looking for.
When the pattern passed to crawlUrlfilter contains only one level of the URL, as in the following code,
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/")
I get the desired results, i.e. only those URLs that match the pattern "article", e.g.
https://www.somewebsite.org/article/sample-article-217 or
https://www.somewebsite.org/article/2019-01-20-another-example
However, when I want to filter URLs based on a pattern of two levels of the URL, such as:
https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers or
https://www.somewebsite.org/article/news/review-of-meetup
the following code does not find any matches:
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/news")
Is this a bug, or am I getting something wrong?
Following the example given in the documentation, dataUrlfilter = "/[0-9]{4}/[0-9]{2}/[0-9]{2}/", it should be no problem at all to pass an argument that contains several "/".
Hello,
macOS 10.12.5
R 3.4.0
RStudio 1.0.143
Java 8 Update 131
I've installed the package using install.packages but get the following error message when I try to load it with the library command. The same happens with the developer version:
+++
library(Rcrawler)
Error: package or namespace load failed for ‘Rcrawler’:
.onLoad failed in loadNamespace() for 'rJava', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so
Reason: image not found
+++
Any help appreciated. Thank you.
When I execute this code
Rcrawler("http://www.tirodefensivoperu.com/forum/", 4, 4,
urlregexfilter = "?topic",
ExtractXpathPat = c("//*[(@id = 'top_subject')]", "//div[@class='inner']"),
PatternsNames = c("title", "post"),
ManyPerPattern = TRUE,
ignoreUrlParams = c("PHPSESSID", "prev_next")
)
I still get URLs of this type scraped:
https://www.tirodefensivoperu.com/forum/index.php?topic=11796.0;prev_next=prev
Is there a way to specify the joining character?
First...impressive work here, good job. I use the httr and curl packages to pull data from Smartsheet. Where I work they use a proxy server, so I have to use 'use_proxy' from the curl package and then feed that into my 'GET' statement from httr. The issue is that I want to scrape a web page with various links and cannot seem to find a way to incorporate the proxy into the 'Rcrawler' function. I tried without success using "website = paste("www.website.com", config_proxy), no_cores = 4, no_conn = 4)" where I set up my proxy to be named config_proxy with the curl package 'use_proxy' function. Is there a specific way I can pass the proxy information to the Rcrawler function? I tried your example with various alterations on incorporating the config_proxy variable however no success.
The example in the Reference manual for the Rcrawler function is not displaying correctly.
Hi,
I was scraping data from a website when I realized that it's downloading .html files only. I couldn't find any info on whether or not it downloads PDF files from the website.
Can you please clarify this?
Regards,
Jatin Gupta
Hello,
I'm a new user of R and RStudio, and I'm very interested in your web crawler. Thanks for it.
But I have a problem when I try to install the package.
> system("java -version")
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-468-11M4833)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-468, mixed mode)
Downloading GitHub repo salimk/Rcrawler@master
from URL https://api.github.com/repos/salimk/Rcrawler/zipball/master
Installing Rcrawler
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ \
--no-save --no-restore --quiet CMD INSTALL \
'/private/var/folders/jz/3njpl5xx15zc_xrsgpn53z_00000gn/T/Rtmpz4mtT1/devtools38c076f3b618/salimk-Rcrawler-fb92537' \
--library='/Users/damiencosta/Library/R/3.4/library' --install-tests
* installing *source* package ‘Rcrawler’ ...
** R
** inst
** preparing package for lazy loading
Error : .onLoad failed in loadNamespace() for 'rJava', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so
Reason: image not found
ERROR: lazy loading failed for package ‘Rcrawler’
* removing ‘/Users/damiencosta/Library/R/3.4/library/Rcrawler’
Installation failed: Command failed (1)
Can someone help, please?
Is it possible to get attributes using ContentScraper(), like I would get using rvest with the following commands?
read_html(url) %>%
  html_nodes(xpath) %>%
  html_attr("href")
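One approach that appears to work is selecting the attribute node directly in the XPath expression passed to ContentScraper. This is a hedged sketch: the XpathPatterns argument name is taken from the package documentation, and the URL is illustrative.

```r
library(Rcrawler)
# Select href attribute nodes directly, analogous to rvest's html_attr("href").
links <- ContentScraper(Url = "http://www.glofile.com",
                        XpathPatterns = c("//a/@href"),
                        ManyPerPattern = TRUE)
```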
Hi,
The latest version does not sync with the ReadMe page.
There are several references to XPath and CSS patterns that no longer exist in the code.
In Rcrawler, only "ExtractPatterns" is a valid argument.
In ContentScraper, only "patterns" is a valid argument.
No CSS patterns are allowed anymore.
Maybe the ReadMe page should be updated.
Regards,
Herman
I tried the example in the tutorial and found that the text in the downloaded HTML is garbled (I opened it with Chrome in UTF-8 encoding): garbled text (left is the downloaded version, right is the online version).
I tried switching the system locale from Chinese to English, but it still doesn't work.
The encoding should be recognized correctly:
Id Url Stats Level OUT IN Http Resp Content Type Encoding Accuracy
1 http://www.glofile.com finished 0 13 1 200 text/html UTF-8
#Doesn't work
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
#Also doesn't work
> Sys.setlocale("LC_ALL","English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> devtools::session_info()
Session info ------------------------------------------------------------------------------------------------------------
setting value
version R version 3.4.4 (2018-03-15)
system x86_64, mingw32
ui RStudio (1.1.383)
language (EN)
collate English_United States.1252
tz America/Los_Angeles
date 2018-04-28
Packages ----------------------------------------------------------------------------------------------------------------
package * version date source
base * 3.4.4 2018-03-15 local
clipr 0.4.0 2017-11-03 CRAN (R 3.4.2)
codetools 0.2-15 2016-10-05 CRAN (R 3.4.4)
compiler 3.4.4 2018-03-15 local
curl 3.1 2017-12-12 CRAN (R 3.4.3)
data.table 1.10.4-3 2017-10-27 CRAN (R 3.4.3)
datasets * 3.4.4 2018-03-15 local
devtools 1.13.4 2017-11-09 CRAN (R 3.4.3)
digest 0.6.14 2018-01-14 CRAN (R 3.4.3)
doParallel 1.0.11 2017-09-28 CRAN (R 3.4.3)
foreach 1.4.4 2017-12-12 CRAN (R 3.4.3)
graphics * 3.4.4 2018-03-15 local
grDevices * 3.4.4 2018-03-15 local
httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
iterators 1.0.9 2017-12-12 CRAN (R 3.4.3)
magrittr 1.5 2014-11-22 CRAN (R 3.4.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.1)
methods * 3.4.4 2018-03-15 local
parallel 3.4.4 2018-03-15 local
purrr 0.2.4 2017-10-18 CRAN (R 3.4.2)
R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
Rcrawler * 0.1.7-0 2017-11-01 CRAN (R 3.4.4)
rlang 0.1.6 2017-12-21 CRAN (R 3.4.3)
rstudioapi 0.7.0-9000 2018-01-17 Github (rstudio/rstudioapi@109e593)
selectr 0.3-1 2016-12-19 CRAN (R 3.4.1)
stats * 3.4.4 2018-03-15 local
stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.2)
tools 3.4.4 2018-03-15 local
utils * 3.4.4 2018-03-15 local
withr 2.1.1 2017-12-19 CRAN (R 3.4.3)
XML 3.98-1.9 2017-06-19 CRAN (R 3.4.1)
xml2 1.1.1 2017-01-24 CRAN (R 3.4.1)
yaml 2.1.16 2017-12-12 CRAN (R 3.4.3)
Thanks a lot for all the work!
Several websites include links with relative references (e.g., "page-1.html" instead of "http://domain.com/page-1.html"). The LinkNormalization function works fine for absolute links but fails to correctly normalize relative links. Could you please extend the function so that it correctly recognizes relative links and, if necessary, adds not only the protocol but also the base URL?
Best wishes,
Michael
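For reference, this is the behavior being requested; a hedged sketch (the signature follows the package docs, and the URLs are illustrative):

```r
library(Rcrawler)
# Desired behavior: resolve a relative link against the URL of the page it
# appeared on, producing an absolute link.
LinkNormalization(links = c("page-1.html"),
                  current = "http://domain.com/index.html")
# Desired result: "http://domain.com/page-1.html"
```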
Hi,
Thanks for your great package! I wanted to extract data from the following url, however it throws an error.
Data<-ContentScraper(Url = "https://www.ge.ch/votations/20180304/participation",
CssPatterns = c("li"), ManyPerPattern = T)
Error in LinkExtractor(url = Ur, encod = encod) :
object 'Extlinks' not found
I can get the content with rvest, though, so I assume it's not an issue with the page itself.
part <- read_html("https://www.ge.ch/votations/20150308/cantonal/participation/")
part %>%
  html_nodes("li") %>%
  html_text()
Do you have an idea what could cause this error and what i can do to avoid it?
Is it somehow possible to crawl a list of URLs?
I tried mapply like in this example:
mapply(dataset['url'], FUN=Rcrawler)
But it throws this error:
Error in (function (Website, no_cores, no_conn, MaxDepth, DIR, RequestsDelay = 0, :
object 'getNewM' not found
P.S.: I know it is possible to scrape a list of URLs with ContentScraper, but I would like to crawl a rather long list of different domains with the Rcrawler function.
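As a workaround, a plain loop avoids mapply's argument matching. A minimal sketch, assuming dataset$url holds one domain per row (the global INDEX is overwritten on each call, so keep a copy per domain):

```r
library(Rcrawler)
results <- list()
for (u in dataset$url) {
  Rcrawler(Website = u, no_cores = 4, no_conn = 4)
  # Rcrawler populates INDEX in the global environment; save it per domain.
  results[[u]] <- INDEX
}
```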
Hi,
This is a very interesting package you have written here, but I can't really get into it...
I want to download the data stored in an online database that unfortunately doesn't have a download function itself. Therefore, I would like to use your package.
The database can be accessed via http://gepris.dfg.de/gepris/OCTOPUS?language=en
As a result, I would like to have a data frame containing the structured data of the database, either the complete database or entries filtered by specific keywords.
Can you help me with this issue? That would be great, thanks!
Wonderful package Salim!!!
I would like to know how I can avoid storing the HTML pages of the websites that I am crawling. I just need the list of URLs for each website, which is already available in the INDEX data.frame.
Is there an option to do it?
Thank you :)
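Another issue in this thread passes saveOnDisk = F to Rcrawler, which looks like exactly this option; a hedged sketch (availability may depend on the package version, and the URL is illustrative):

```r
library(Rcrawler)
# Crawl without writing HTML files to disk; only the INDEX data.frame is kept.
Rcrawler(Website = "http://www.glofile.com",
         no_cores = 4, no_conn = 4,
         saveOnDisk = FALSE)
```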
I'd like to apply Rcrawler to various major news outlets (e.g., BBC, NBC, Fox, etc.) but only scrape articles that are relevant to my topic (e.g., the Volkswagen emissions scandal). Is it possible for me to do this with Rcrawler?
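The KeywordsFilter argument used elsewhere in this thread seems designed for this; a hedged sketch (the site URL and keywords are illustrative, not a tested recipe):

```r
library(Rcrawler)
# Collect only pages whose text matches the topic keywords.
Rcrawler(Website = "https://www.example-news-site.com",
         no_cores = 4, no_conn = 4,
         KeywordsFilter = c("Volkswagen", "emissions", "dieselgate"))
```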
Hello,
I'm trying to scrape press releases from the UN Office of the High Commissioner for Human Rights. The problem is that the website uses the same URL for its news search tool and any specific search that one runs -- it's always http://www.ohchr.org/EN/NewsEvents/Pages/NewsSearch.aspx. I should note that while the articles themselves have unique URLs, I also need the data from the search tables for my project.
So how can I crawl a website structured like this using Rcrawler? The program doesn't seem to find the table segments even if I specify them using CSS.
I've run the following script for a whole day without the crawler finding any match:
Rcrawler(Website = "http://www.ohchr.org/EN/NewsEvents/Pages/NewsSearch.aspx", ExtractCSSPat=c("#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblTitle", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblDate", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_NewsType li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_CountryID li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_MandateID li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_SubjectID li"), ManyPerPattern=T, PatternsNames = c("Title","Date", "News type", "Country ID", "Mandate", "Subject"))
Any help you can provide would be very much appreciated!
Rcrawler doesn't find the links in the filterable grid here:
https://www.crowdpac.com/campaigns
When I run:
Rcrawler(Website = 'https://www.crowdpac.com/campaigns',
no_cores = 4,
no_conn = 4)
It only crawls 5 pages -- the pages linked in the top and bottom banners.
Thank you so much for putting this together!
I would like to scrape a page that has a menu of categories, but the crawler stops at the "show all" or "show more" buttons that load the remaining content of the menu. Is there a workaround solution to this?
How can I download pages faster when specifying a KeywordsFilter value in Rcrawler()?
For example
Rcrawler(Website ="http://www.salvex.com/listings/index.cfm?catID=1280&regID=0&mmID=0&orderBy=1&order=0&filterWithin=&f",Timeout=7,Encod=Getencoding("http://www.salvex.com/listings/index.cfm?catID=1280&regID=0&mmID=0&orderBy=1&order=0&filterWithin=&f"),no_cores = 2, KeywordsFilter = c("A250","aeroplane","AH-1" ,"AH-64" ,"aircraft","airframe","airplane","Allison","AS365 Dauphin","auction","aviation","Aviation Fueling Directory","Aviation Museums","Avionics","Bell 205 ","Bell 206 ","Bell 212","Bell 214","Bell 412","blades","Blades","Boeing","C20B","CH-47","CH-47 ","CH-53 ","Chinook","Cobra","driveshaft","engine","Eurocopter AS350","FLIR unit","fuel cell","Fuel Control","fuselage","gearbox","Ground Support Equipment (GSE)","Helicopter","hub assembly","Huey","Hughs","J85 ","Jet Ranger","JT8 ","JT9 ","Kamon","Kiowa","Long Ranger","LTS101 ","Lycoming","M250","main rotor blades","MR Blade","main rotor hub","MR Hub","MD500 ","OH-58","OH-58","OH-6","Pratt and Whitney","PT-6","PT6","Rolls Royce","servos","Sikorsky","skids","surplus","swashplate assembly","T53","T53-L-13B","T53-L-703","T55","T56 ","T58","T63","T63 ","T700 ","tail rotor blades","TR Blades","tail rotor hub","TR Hub","tailboom","transmission","turbine","Turboprop","UH1","UH-1","UH-1H","UH-60","UH60 ","vertical stabilizer","wire strike kit"),no_conn = 2,MaxDepth=1)
Hi Rcrawler team,
I am new to R and Rcrawler. I would like to know if Rcrawler can be used to scrape/crawl bilingual sites. Say I have this English site:
https://government.ae/en
and this corresponding Arabic one:
https://government.ae/ar-ae
How can I use Rcrawler to get the bitext from them and save the output in a tab-delimited file?
Can it crawl only text selected by div tag, CSS selectors, or maybe XPath?
Thanks
Thank you for this convenient package for crawling web-page data.
I am especially interested in getting hrefs from a website. In the README.md I found that it is possible to pass an argument, ExtractPatterns = c("//*/a/@href"), to the Rcrawler function, which should do the job.
Unfortunately, though, this argument has been removed?!
I am usually not the impatient type, but could you tell me the easiest way to get all hrefs from a web page under the current Rcrawler version? Using LinkExtractor, for instance, does not do it by default.
Thanks for the support!
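For what it's worth, LinkExtractor does return the page's links in its result; a hedged sketch (the component name below is an assumption, since in older releases the links sit in a nested, unnamed list):

```r
library(Rcrawler)
page <- LinkExtractor(url = "http://www.glofile.com")
# In recent versions the result appears to expose named components such as
# InternalLinks; in older versions, inspect str(page) to locate the links.
page$InternalLinks
```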
Hi,
I would like to crawl and scrape the content of a whole website. This is the code:
Rcrawler(Website = URL, no_cores = 4, no_conn = 4,
         ExtractXpathPat = c("//./div[@class='bodytext']//p",
                             "//./h1[@class='blogtitle']",
                             "//./div[@id='kommentare']//p"),
         PatternsNames = c("article", "title", "comments"),
         ManyPerPattern = TRUE)
After retrieving approx. 19% of the data, I get the following error message:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
It always happens at the same point; DATA and INDEX are created correctly with all entries crawled before the error.
Am I doing something wrong, or is it something about the website I would like to crawl? I am using Rcrawler version 0.1.9-1.
Thanks for helping me out!
Dear Salim,
Thank you so much for such great efforts and this useful package!
I have an issue with crawling and data gathering on search-result pages.
I specified a few CSS rules in ExtractCSSPat, but some of the pages don't contain the required data, and the CSS selectors don't match anything on them.
The following error occurs:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
In addition: Warning messages:
1: In UseMethod("xml_remove") :
closing unused connection 7 (<-DESKTOP-K73RC4R:11502)
2: In UseMethod("xml_remove") :
closing unused connection 6 (<-DESKTOP-K73RC4R:11502)
3: In UseMethod("xml_remove") :
closing unused connection 5 (<-DESKTOP-K73RC4R:11502)
4: In UseMethod("xml_remove") :
closing unused connection 4 (<-DESKTOP-K73RC4R:11502)
5: In UseMethod("xml_remove") :
closing unused connection 3 (C:/Users/Hamed/Documents/tripadvisor.com-281027/extracted_data.csv)
I thought it should be possible to add a conditional statement that checks whether the CSS tag is missing and, if so, returns NULL in the data set.
When I try to install I get this error.
> install.packages("Rcrawler")
Installing package into 'W:/R-3.4._/R_LIBS_USER_3.4._'
(as 'lib' is unspecified)
installing the source package 'Rcrawler'
trying URL 'http://cran.rstudio.com/src/contrib/Rcrawler_0.1.tar.gz'
Content type 'application/x-gzip' length 20944 bytes (20 KB)
downloaded 20 KB
* installing *source* package 'Rcrawler' ...
** package 'Rcrawler' successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
Error: package or namespace load failed for 'Rcrawler':
.onLoad failed in loadNamespace() for 'rJava', details:
call: fun(libname, pkgname)
error: No CurrentVersion entry in Software/JavaSoft registry! Try re-installing Java and make sure R and Java have matching architectures.
Error: loading failed
Execution halted
*** arch - x64
ERROR: loading failed for 'i386'
* removing 'W:/R-3.4._/R_LIBS_USER_3.4._/Rcrawler'
The downloaded source packages are in
'W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM\downloaded_packages'
Warning messages:
1: running command '"W:/R-3.4._/App/R-Portable/bin/x64/R" CMD INSTALL -l "W:\R-3.4._\R_LIBS_USER_3.4._" W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM/downloaded_packages/Rcrawler_0.1.tar.gz' had status 1
2: In install.packages("Rcrawler") :
installation of package 'Rcrawler' had non-zero exit status
But rJava loads fine
> library(rJava)
I tried running the installation manually
shell("start cmd /k", wait = FALSE)
W:\R-3.4._>"W:/R-3.4._/App/R-Portable/bin/x64/R" CMD INSTALL -l "W:\R-3.4._\R_LIBS_USER_3.4._" W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpSojYP9/downloaded_packages/Rcrawler_0.1.tar.gz'
Warning: invalid package 'W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM/downloaded_packages/Rcrawler_0.1.tar.gz''
Error: ERROR: no packages specified
I checked the contents of the following file (which does exist):
W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\RtmpOGEDmM/downloaded_packages/Rcrawler_0.1.tar.gz
Rcrawler_0.1.tar
Perhaps the contents are not correct? Was the .tar.gz made with "R CMD"?
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.0
Hi there,
I tried to crawl https://www.gov.sg/ using Rcrawler; however, it only returns one HTML page in the local directory (and one row in the INDEX object).
Can you please let me know what is going wrong here?
Many thanks,
Tim
I am trying to crawl Polish websites. Here a word can have several inflectional endings, and it may also appear with a slight change to its root. Can the "keyword" options be used with wildcards or basic regex? Can Boolean logic be used?
I am very grateful for your very useful package, and I wish you lots of success!
Line 516 of the code:
cat("browser:"+i+" port:"+pkg.env$Lbrowsers[[i]]$process$port)
returns a "non-numeric argument to binary operator" error, because "+" is not a string-concatenation operator in R. It should use "," instead, i.e.:
cat("browser:", i, " port:", pkg.env$Lbrowsers[[i]]$process$port)
I want to crawl the website "https://www.vebeg.de/web/en/verkauf/suchen.htm?DO_SUCHE=1&SUCH_MATGRUPPE=1300". This website has content in German.
Also, how can I overwrite or add content to the folder created with the domain name? I am unable to crawl two different pages of the same website one after the other, because I get a warning saying that the folder already exists when I start crawling the second page.
For example: the first command downloaded pages successfully; the second command does not download any pages.
Hi Salim,
When I try to use the LinkExtractor function to crawl the New Zealand government website, I get an error.
Here is the code:
pageinfo <- LinkExtractor(url="https://www.govt.nz")
Here is the error:
Error in if (!is.na(links[t])) { : argument is of length zero
Can you please advise how to resolve this issue? I have tried many different variations of the url parameter for this website, but none work (e.g., "https://www.govt.nz/", "https://govt.nz", "www.govt.nz", etc.).
Edit: this error also occurs for other sites; I encountered it when crawling the Canadian government website ("https://www.canada.ca/").
Thanks,
Tim
Dear Salim, Dear Mohamed,
Your tool is awesome.
I would like to propose a big feature.
Let's assume we have a corpus of files we have already scraped and found more interesting than others, and we have built a classification model that we can use on new documents with the predict command to get a category or a yes/no answer.
Could the Rcrawler crawling and scraping functions be extended to accept a model as a parameter, so as to scrape only content that falls into a specific category?
Example:
I have collected a couple of texts related to individual bad mortgage loans in Swiss francs, which is a controversy in my country, and an equal number of articles related to other issues in the same "economy" section.
It would be a marvelous tool if I could tell it to:
scrape(starting_point = "http://mywebsite/economy", predict_filter = predict(model = my_classification_model))
Wishing you all the best,
Jacek
Hi,
For me the package installs perfectly, but I can't use Rcrawler for some reason. There is no error; RStudio simply stops responding in the console. I can still use RStudio, but the function never returns and nothing (no errors or warnings) is displayed in the console, except occasionally this warning: "unknown timezone 'zone/tz/2017c.1.0/zoneinfo/Europe/Berlin'".
However, the folder (project) and "extracted_contents.csv" are created.
Even with a one-page website (no pages other than the homepage), it never finishes, replies, or does anything; only the RStudio console cursor blinks.
Edit: additional info.
ContentScraper works as expected: Data <- ContentScraper(Url = "http://glofile.com/index.php/2017/06/08/athletisme-m-a-rome/", CssPatterns = c(".entry-title", ".entry-content")) creates the Data variable containing the expected info.
I have also updated my R, and the issue remains.
My setup: macOS High Sierra Version 10.13.1
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer
How can I save the HTML downloaded by Rcrawler to a database instead of saving it to a local folder?
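Rcrawler itself writes pages to disk, but one workaround sketch is to fetch pages yourself and insert the HTML into a database with DBI. Everything below (the table name, the use of httr and RSQLite) is an assumption for illustration, not Rcrawler functionality:

```r
library(httr)
library(DBI)
# Hypothetical illustration: store raw HTML in an SQLite table instead of files.
con <- dbConnect(RSQLite::SQLite(), "pages.db")
url <- "http://www.glofile.com"
html <- content(GET(url), as = "text")
dbWriteTable(con, "pages",
             data.frame(url = url, html = html, stringsAsFactors = FALSE),
             append = TRUE)
dbDisconnect(con)
```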
When calling the function Rcrawler::Rcrawler() on any website without running library(Rcrawler) first, I get this error:
Rcrawler::Rcrawler("https://www.google.com")
#> Error in get(name, envir = envir) : object 'LinkExtractor' not found
If I simply call library(Rcrawler) first, everything works as expected (but I must load the package via library to avoid the error).
The source of the error message is this line in the file Rcrawlerp.R:
clusterExport(cl, c("LinkExtractor","LinkNormalization"))
If Rcrawler isn't explicitly loaded, the two package functions cannot be found and exported by clusterExport.
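One possible fix (an untested sketch, not the maintainer's solution) is to export the functions from the package namespace explicitly, so they resolve even when Rcrawler is not attached:

```r
# Pull the functions from Rcrawler's namespace instead of the global search path.
clusterExport(cl, c("LinkExtractor", "LinkNormalization"),
              envir = asNamespace("Rcrawler"))
```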
I'm running R 3.4.2 on Windows 10. I'm getting the error in both the CRAN version and the dev version from GitHub.
Here's a complete workflow that causes the error for me, followed by session_info:
> # Start with a fresh install from GitHub.
> devtools::install_github("salimk/Rcrawler")
Downloading GitHub repo salimk/Rcrawler@master
from URL https://api.github.com/repos/salimk/Rcrawler/zipball/master
Installing Rcrawler
* installing *source* package 'Rcrawler' ...
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (Rcrawler)
>
>
> # Attempt to crawl a website.
> Rcrawler::Rcrawler("https://www.google.com")
Error in get(name, envir = envir) : object 'LinkExtractor' not found
>
>
> # Print session info.
> devtools::session_info()
Session info --------------------------------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.4.2 (2017-09-28)
system x86_64, mingw32
ui RStudio (1.1.383)
language (EN)
collate English_United States.1252
tz America/Chicago
date 2017-11-25
Packages ------------------------------------------------------------------------------------------------------------------------------------------------
package * version date source
base * 3.4.2 2017-09-28 local
codetools 0.2-15 2016-10-05 CRAN (R 3.4.2)
compiler 3.4.2 2017-09-28 local
curl 3.0 2017-10-06 CRAN (R 3.4.2)
data.table 1.10.4-3 2017-10-27 CRAN (R 3.4.2)
datasets * 3.4.2 2017-09-28 local
devtools 1.13.4 2017-11-09 CRAN (R 3.4.2)
digest 0.6.12 2017-01-27 CRAN (R 3.4.1)
doParallel 1.0.11 2017-09-28 CRAN (R 3.4.2)
foreach 1.4.3 2015-10-13 CRAN (R 3.4.2)
git2r 0.19.0 2017-07-19 CRAN (R 3.4.1)
graphics * 3.4.2 2017-09-28 local
grDevices * 3.4.2 2017-09-28 local
httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
iterators 1.0.8 2015-10-13 CRAN (R 3.4.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.1)
methods * 3.4.2 2017-09-28 local
parallel 3.4.2 2017-09-28 local
R6 2.2.2 2017-06-17 CRAN (R 3.4.2)
Rcpp 0.12.13 2017-09-28 CRAN (R 3.4.2)
Rcrawler 0.1.5 2017-11-26 Github (salimk/Rcrawler@db76deb)
stats * 3.4.2 2017-09-28 local
tools 3.4.2 2017-09-28 local
utils * 3.4.2 2017-09-28 local
withr 2.1.0 2017-11-01 CRAN (R 3.4.2)
xml2 1.1.1 2017-01-24 CRAN (R 3.4.1)
yaml 2.1.14 2016-11-12 CRAN (R 3.4.1)
When trying to use Rcrawler, I get an error:
Rcrawler(Website = "http://www.nytimes.com", KeywordsFilter = c("Paris accord"), KeywordsAccuracy = 100)
Preparing multihreading cluster .. Error in checkForRemoteErrors(lapply(cl, recvResult)) :
15 nodes produced errors; first error: there is no package called ‘webdriver’
However, the webdriver package is installed and loaded.
Sys.info()
sysname release version
"Linux" "4.15.0-1031-azure" "#32-Ubuntu SMP Wed Oct 31 15:44:56 UTC 2018"
"x86_64"
I appreciate any guidance, thanks!
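The error says the worker nodes, not the master session, cannot find webdriver, which often means the package sits in a user library that the spawned workers do not see. A small sketch (base R only) to check what each worker can actually load:

```r
library(parallel)

# Ask each worker whether it can load 'webdriver'; a FALSE here
# pinpoints the node-side library-path problem from the error above.
cl <- makeCluster(2)
ok <- unlist(clusterEvalQ(cl, requireNamespace("webdriver", quietly = TRUE)))
stopCluster(cl)
ok
```

If a worker reports FALSE, installing webdriver into a site-wide library (and running `webdriver::install_phantomjs()` once) is the usual remedy.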
The following code gives an empty result.
pageinfo<-LinkExtractor("https://www.michaelkors.com/blakely-leather-satchel/_/R-US_30S8SZLM6L")
The result is
[[1]]
[[1]][[1]]
[1] 748
[[1]][[2]]
[1] "https://www.michaelkors.com/blakely-leather-satchel/_/R-US_30S8SZLM6L"
[[1]][[3]]
[1] "NULL"
[[1]][[4]]
[1] 628
[[1]][[5]]
[1] ""
[[1]][[6]]
[1] ""
[[1]][[7]]
[1] ""
[[1]][[8]]
[1] ""
[[1]][[9]]
[1] ""
[[2]]
logical(0)
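For what it's worth, a quick way to see whether the empty result comes from the site rather than from LinkExtractor is to request the page directly and inspect the HTTP status; many retail sites return 403 to non-browser user agents. A diagnostic sketch using httr:

```r
library(httr)

# Fetch the same URL directly and inspect status and content type;
# a 403/503 here would explain the empty extraction above.
resp <- GET("https://www.michaelkors.com/blakely-leather-satchel/_/R-US_30S8SZLM6L",
            user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
status_code(resp)
headers(resp)[["content-type"]]
```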
I need to download data from the following website, but it only shows 90 records per page. Can we crawl through these pages and get all of the data?
https://www.michigantrafficcrashfacts.org/querytool/lists/0#q1;0;2016;;
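A note on this particular site: everything after the # is a URL fragment, which is handled by the page's JavaScript and never sent to the server, so a plain crawler cannot page through it. If the data were exposed through an ordinary query parameter (the base URL and offset parameter below are placeholders, not the real query-tool API), the pagination loop would be straightforward:

```r
# Hypothetical pagination sketch: 'offset' and the base URL are
# placeholders, not the real query-tool API.
base_url  <- "https://example.org/querytool/lists"
offsets   <- seq(0, 270, by = 90)          # 4 pages of 90 records
page_urls <- vapply(offsets, function(o) paste0(base_url, "?offset=", o),
                    character(1))
page_urls
```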
Hi Salim,
I am running Rcrawler on a vector of websites. I have noticed that it fails to crawl some of them, for example:
http://www.alahleia.com
http://www.almalki.com
I tried several depth levels and timeout values.
Thank you
When running the following code:
Rcrawler(Website = crawl_page
, no_cores = 2
, no_conn = 2
, RequestsDelay = 1
, MaxDepth = 3
, DIR = 'https://www.indeed.com/jobs?q=senior+data+analyst&l=Tampa%2C+FL&sort=date'
, urlregexfilter = c("/rc/","start="))
There is an issue when there is an ampersand in a web address returned to the Data INDEX.
It is getting converted to &amp;amp; instead of the real & sign.
So in the INDEX it will show as
https://www.indeed.com/jobs?q=senior+data+analyst&amp;l=Tampa%2C+FL&amp;sort=date&amp;start=40
instead of
https://www.indeed.com/jobs?q=senior+data+analyst&l=Tampa%2C+FL&sort=date&start=40
So in effect, the parameters l, sort, and start are not being passed the correct values in the URL.
Let me know if you need any more info to help correct.
Thanks
--Michael
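Until this is fixed upstream, a one-line workaround (assuming the only damage is the double-escaped ampersand) is to unescape the URLs pulled from the INDEX:

```r
# Undo HTML-escaped ampersands in extracted URLs.
fix_amp <- function(u) gsub("&amp;", "&", u, fixed = TRUE)

fix_amp("https://www.indeed.com/jobs?q=senior+data+analyst&amp;l=Tampa%2C+FL&amp;sort=date&amp;start=40")
# -> "https://www.indeed.com/jobs?q=senior+data+analyst&l=Tampa%2C+FL&sort=date&start=40"
```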
I found that by simply changing a single line in LinkExtractor.R, replacing read_html with render_html from the splashr package, one can apparently crawl JavaScript-enforcing sites too.
Especially interesting is the combination with this Docker image, which makes Tor crawls optional as well:
https://github.com/TeamHG-Memex/aquarium
It would be nice to see this as an option in a future version. Or, even better, mixing the framework with the interactivity options provided by RSelenium, but that would mean larger changes, I guess. Anyway, this is, as far as I can see, the most advanced Scrapy competitor out there in the R language; it would be nice to see it grow as well. Much better than rvest.
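As a rough illustration of the proposed swap (this assumes a local Splash instance is already running, e.g. via the aquarium docker-compose setup linked above; function names are from the splashr package):

```r
library(splashr)

# render_html() returns the DOM after JavaScript has executed,
# where xml2::read_html() would only see the initial payload.
pg <- render_html(url = "http://quotes.toscrape.com/js/")
```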
Can you elaborate further on the filter and accuracy? In the section concerning the filter, it is mentioned that providing a keyword, say "keyword", will search for pages that include "keyword" at least once on the page.
In the accuracy section following the filter guidelines, it is then said that a 50% accuracy rate means that "keyword" occurs at least once, while 100% means it occurs at least five times.
Is there something I'm not understanding? Web crawling is a new concept for me, but I still can't make sense of this particular section.
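For reference, this is the kind of invocation in question (argument names as used elsewhere in this thread; whether 50 means "at least once" or something else is exactly what needs clarifying):

```r
library(Rcrawler)

# Example invocation under my (possibly wrong) reading: with
# KeywordsAccuracy = 50 a page qualifies if "keyword" occurs at
# least once; with 100 it must occur about five times.
Rcrawler(Website = "http://www.example.com",
         KeywordsFilter = c("keyword"),
         KeywordsAccuracy = 50)
```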
Hello! I love the package! Extremely powerful!
I thought about two possible features that could come in handy while reading about the current functions:
For example, I might want to scrape a set of websites with unknown structures, knowing that each of them has a specific .pdf file somewhere. If I am only interested in these files from each of the websites, it would be much faster for me if I could tell the robot to stop searching a particular website once it finds the file.
Thanks and best!
Hi Salim,
I was testing Rcrawler with 'http://www.emiratesmarsmission.ae/ar//' and found that all Arabic characters in the saved HTML files are turned into <U+XXXX> Unicode escape sequences.
Here is an example:
<U+0628><U+0639><U+062F><U+0627><U+0644><U+0627><U+0646><U+062A><U+0647><U+0627><U+0621><U+0645><U+0646><U+0639><U+0645><U+0644><U+064A><U+0629><U+062A><U+0635><U+0646><U+064A><U+0639><U+0623><U+062F><U+0648><U+0627><U+062A><U+0627><U+0644><U+0645><U+062C><U+0633><U+0645><U+0627><U+0644><U+0647><U+0646><U+062F><U+0633><U+064A><U+0644><U+0644><U+0642><U+0645><U+0631><U+0627><U+0644><U+0627><U+0635><U+0637><U+0646><U+0627><U+0639><U+064A><U+064A><U+0642><U+0648><U+0645><U+0645><U+0647><U+0646><U+062F><U+0633><U+0636><U+0645><U+0627><U+0646><U+0627><U+0644><U+062C><U+0648><U+062F><U+0629><U+0628><U+0645><U+0639><U+0627><U+064A><U+0646><U+0629><U+0627><U+0644><U+0623><U+062F><U+0648><U+0627><U+062A><U+0627><U+0644><U+062A><U+064A><U+062A><U+0645><U+062A><U+0635><U+0646><U+064A><U+0639><U+0647><U+0627> <U+064A><U+0642><U+0648><U+0645><U+0645><U+0647><U+0646><U+062F><U+0633><U+0627><U+0644><U+0645><U+0631><U+0643><U+0632><U+0628><U+062A><U+0635><U+0646><U+064A><U+0639><U+0623><U+062F><U+0648><U+0627><U+062A><U+0627><U+0644><U+0645><U+062C><U+0633><U+0645><U+0627><U+0644><U+0647><U+0646><U+062F><U+0633><U+064A><U+0644><U+0644><U+0642><U+0645><U+0631><U+0627><U+0644><U+0627><U+0635><U+0637><U+0646><U+0627><U+0639><U+064A>
Any idea how to fix this so the Arabic text renders normally?
Thanks,
Mohamed Zeid
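One workaround that helps in similar cases (a sketch assuming the pages really are UTF-8 and the <U+XXXX> escapes come from writing through a non-UTF-8 Windows locale): mark the string as UTF-8 and write raw bytes, so R does not transliterate.

```r
# Write HTML as UTF-8 bytes instead of letting the native locale
# transliterate Arabic characters to <U+XXXX> escapes.
save_utf8 <- function(html, path) {
  con <- file(path, open = "w", encoding = "UTF-8")
  on.exit(close(con))
  writeLines(enc2utf8(html), con, useBytes = TRUE)
}

tmp <- tempfile(fileext = ".html")
save_utf8("\u0628\u0639\u062F", tmp)   # Arabic sample text
readLines(tmp, encoding = "UTF-8")
```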