
Comments (6)

lpla commented on May 27, 2024

It is not a warning. I tried to reproduce the problem, but the crawl is just massive for that website. Did you try with a smaller website? Take into account that Bitextor starts crawling at the base domain to find parallel documents, in this case "tutorialspoint.com", so it ignores the full path of a specific page or section of the site ("numpy/numpy_advanced_indexing.htm" here).

Anyway, I cancelled the crawl after a few minutes (by pressing Ctrl + C) and the process finished correctly without errors (with an empty TMX), so I couldn't reproduce your problem. I don't know exactly why it shows you that error about arguments; I need more information to debug it. How did you install Bitextor? Does it happen with other websites? Are your dictionaries in the required format?
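For reference, the dictionary format Bitextor expects (as I remember it; verify against the documentation for your version) is a plain-text, tab-separated file whose first line holds the two language codes and whose remaining lines pair a word in one language with its translation. The words below are purely illustrative:

```text
en	fr
house	maison
dog	chien
```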

from bitextor.

omarkaraksi commented on May 27, 2024

I re-tested it with a smaller website, 'f-g.com', with the same result: an empty TMX file. I used the installation guide in the repository. I don't know what you mean by dictionaries in the required format.
Thanks in advance for your help.
Here are the logs :
logs.tar.gz


lpla commented on May 27, 2024

Everything in the logs looks like warnings. I also ran Bitextor on my server with that 'f-g.com' site and I only got an empty TMX, but that makes sense because I don't see any parallel pages on that website.

Try this small random website with parallel content and tell me if you get a TMX with content: https://bordercollies.es


omarkaraksi commented on May 27, 2024

OK, it worked fine with this website, but it failed with 'expedia.com' with this message:
robots.txt forbids crawling URL: https://www.expedia.com/Flights-Search?trip=roundtrip&leg1=from%3AORD%2Cto%3AMNL%2Cdeparture%3A11%2F18%2F2018TANYT&leg2=from%3AMNL%2Cto%3AORD%2Cdeparture%3A12%2F03%2F2018TANYT&passengers=children%3A0%2Cadults%3A1%2Cseniors%3A0%2Cinfantinlap%3AN&mode=search&options=sortby%3Aprice&paandi=true

How can I bypass robots.txt?
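For context, this is the crawler honoring the site's robots.txt rules. Python's standard `urllib.robotparser` illustrates how such a check works; the rules below are a hypothetical example, not Expedia's actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules (NOT the real expedia.com file),
# just to show how a crawler decides whether a URL may be fetched.
robots_txt = [
    "User-agent: *",
    "Disallow: /Flights-Search",
]

rp = RobotFileParser()
rp.parse(robots_txt)  # parse() accepts an iterable of lines

url = "https://www.expedia.com/Flights-Search?trip=roundtrip"
print(rp.can_fetch("*", url))                          # False: path is disallowed
print(rp.can_fetch("*", "https://www.expedia.com/"))   # True: root is allowed
```

A compliant crawler calls a check like `can_fetch` before every request and skips disallowed URLs, which is exactly the behavior reported in the log line above.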


lpla commented on May 27, 2024

Our crawler (Creepy) does not have an option to ignore robots.txt right now. It could be a feature or improvement in the future, but that is off topic for this issue. If you want us to discuss and evaluate it, please open a new issue requesting it and add the "enhancement" tag.

As a temporary workaround, you could use another crawler that can ignore robots.txt, manually generate an ETT file from the crawl, and run Bitextor from it with the option --ett FILE.
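A minimal sketch of producing an ETT-like file from pages fetched with another tool. The column layout used here (MIME type, charset, URL, base64-encoded document, tab-separated) is an assumption from memory; check the ETT format description in the Bitextor documentation for your version before relying on it:

```python
import base64

# Stand-in for pages obtained with some other crawler; in a real run
# these would come from disk or from the crawler's output.
pages = {
    "https://example.com/en/index.html": "<html><body>Hello</body></html>",
    "https://example.com/fr/index.html": "<html><body>Bonjour</body></html>",
}

# ASSUMED ETT layout: mime \t charset \t url \t base64(document).
# Verify the exact columns against the Bitextor docs.
with open("crawl.ett", "w", encoding="utf-8") as out:
    for url, html in pages.items():
        doc = base64.b64encode(html.encode("utf-8")).decode("ascii")
        out.write("\t".join(["text/html", "utf-8", url, doc]) + "\n")
```

The resulting file would then be passed to Bitextor as `--ett crawl.ett` so the pipeline skips its own crawling step.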

I will close this issue now, as the main problems look solved.


omarkaraksi commented on May 27, 2024

Thanks a lot for your effort.

