when executed this command bitextor -u https://www.tutor

I get this message "line 590: [: too many arguments" when I try to crawl the tutorials-point website

bitextor commented on May 27, 2024

I get this message "line 590: [: too many arguments" when I try to crawl the tutorials-point website.

Comments (6)

lpla commented on May 27, 2024

It is not a warning. I tried to reproduce the problem but the crawling is just massive for that website. Did you try with a smaller website? Take into account that Bitextor starts crawling at the base domain to find parallel documents, in this case, "tutorialspoint.com". So it ignores the full path of a specific page or part of the web ("numpy/numpy_advanced_indexing.htm" here).

Anyway, I cancelled the crawl after a few minutes (just pressing Ctrl + C) and the process finished correctly without errors (with an empty TMX), so I couldn't reproduce your problem. I don't know exactly why it shows you that error about arguments; I need more information to debug it. How did you install Bitextor? Does it happen with other websites? Are your dictionaries in the required format?
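
To illustrate the base-domain behaviour described above, here is a minimal Python sketch of reducing a page URL to the site root. It is not Bitextor's own code: the function name `crawl_root` is hypothetical, and it keeps the full host name (including `www.`) rather than reproducing whatever normalisation Bitextor applies.

```python
from urllib.parse import urlsplit


def crawl_root(url: str) -> str:
    """Return the scheme and host of `url`, dropping path, query and fragment.

    Hypothetical helper for illustration; Bitextor's real logic may differ.
    """
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


# The page path is discarded, so the crawl would start from the site root.
print(crawl_root("https://www.tutorialspoint.com/numpy/numpy_advanced_indexing.htm"))
```

This is why pointing the crawler at a single page of a large site can still trigger a very large crawl.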

omarkaraksi commented on May 27, 2024

I re-tested it with a smaller website, 'f-g.com', with the same result: an empty TMX file. I used the installation guide in the repository. What do you mean by dictionaries being in the required format?
Thanks in advance for your help.
Here are the logs:
logs.tar.gz

lpla commented on May 27, 2024

Everything in the logs looks like warnings. I also ran Bitextor on my server with that 'f-g.com' site and I only get an empty TMX, but that makes sense because I don't see any parallel pages on that website.

Try with this small random website with parallel content and tell me if you get a TMX with content: https://bordercollies.es

omarkaraksi commented on May 27, 2024

OK, it worked fine with this website, but it failed with 'expedia.com' with this message:
robots.txt forbids crawling URL: https://www.expedia.com/Flights-Search?trip=roundtrip&leg1=from%3AORD%2Cto%3AMNL%2Cdeparture%3A11%2F18%2F2018TANYT&leg2=from%3AMNL%2Cto%3AORD%2Cdeparture%3A12%2F03%2F2018TANYT&passengers=children%3A0%2Cadults%3A1%2Cseniors%3A0%2Cinfantinlap%3AN&mode=search&options=sortby%3Aprice&paandi=true

How can I bypass robots.txt?

lpla commented on May 27, 2024

Our crawler (Creepy) does not have an option to ignore robots.txt right now. It could be added as a feature or improvement in the future, but that is outside the scope of this issue. If you want us to discuss and evaluate it, please open a new issue requesting it and add the tag "enhancement".

As a temporary workaround, you could try another crawler that can ignore robots.txt, manually generate an ETT file from the crawl, and run Bitextor on it with the option --ett FILE.
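
Before switching crawlers, it can help to see which URLs a site's robots.txt actually forbids. The sketch below uses Python's standard `urllib.robotparser`; the rules in `SAMPLE_ROBOTS` are made up for illustration and are not Expedia's real robots.txt, and `is_allowed` and the agent name `my-crawler` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real robots.txt of any site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /Flights-Search
"""


def is_allowed(robots_txt: str, url: str, agent: str = "my-crawler") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A search URL under the disallowed prefix versus the site root.
print(is_allowed(SAMPLE_ROBOTS, "https://www.expedia.com/Flights-Search?trip=roundtrip"))
print(is_allowed(SAMPLE_ROBOTS, "https://www.expedia.com/"))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()` to download the file before calling `can_fetch()`.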

I will close this issue now as the main problems look solved.

omarkaraksi commented on May 27, 2024

Thanks a lot for your effort.
