Comments (6)
It is not a warning. I tried to reproduce the problem, but the crawl of that website is massive. Did you try a smaller website? Keep in mind that Bitextor starts crawling at the base domain to find parallel documents, in this case "tutorialspoint.com", so it ignores the full path of a specific page or section of the site ("numpy/numpy_advanced_indexing.htm" here).
Anyway, I cancelled the crawl after a few minutes (just pressing Ctrl + C) and the process finished correctly without errors (with an empty TMX), so I couldn't reproduce your problem. I don't know exactly why it shows you that error about arguments; I need more information to debug it. How did you install Bitextor? Does it happen with other websites? Are your dictionaries in the required format?
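To illustrate the point about the base domain, here is a minimal sketch of how a page URL reduces to the host that a crawl would start from. This is not Bitextor's own code: the helper name `crawl_root`, the stripping of a leading "www.", and the full example URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def crawl_root(url: str) -> str:
    """Return the host of a URL, dropping any path, query, and a leading 'www.'.

    Hypothetical helper: it only mimics the idea that the crawl starts
    at the base domain rather than at the specific page given.
    """
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Assumed full URL built from the path quoted in the comment above.
page = "https://www.tutorialspoint.com/numpy/numpy_advanced_indexing.htm"
print(crawl_root(page))
```

The path component is discarded entirely, which is why pointing the tool at a single page still results in a crawl of the whole site.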
from bitextor.
I re-tested it with a smaller website, 'f-g.com', and got the same result: an empty TMX file. I used the installation guide in the repository. I don't know what you mean by dictionaries in the required format.
Thanks in advance for your help.
Here are the logs:
logs.tar.gz
Everything in the logs looks like warnings. I also ran Bitextor on my server with that 'f-g.com' site and I only got an empty TMX, but that makes sense because I don't see any parallel pages on that website.
Try with this small random website with parallel content and tell me if you get a TMX with content: https://bordercollies.es
OK, it worked fine with this website, but it failed with 'expedia.com' with this message:
robots.txt forbids crawling URL: https://www.expedia.com/Flights-Search?trip=roundtrip&leg1=from%3AORD%2Cto%3AMNL%2Cdeparture%3A11%2F18%2F2018TANYT&leg2=from%3AMNL%2Cto%3AORD%2Cdeparture%3A12%2F03%2F2018TANYT&passengers=children%3A0%2Cadults%3A1%2Cseniors%3A0%2Cinfantinlap%3AN&mode=search&options=sortby%3Aprice&paandi=true
How can I bypass robots.txt?
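For context on the message above, here is a rough sketch of how a crawler typically decides that robots.txt forbids a URL, using Python's standard `urllib.robotparser`. The rule set below is hypothetical and is not Expedia's actual robots.txt, and this is not the check that Creepy itself performs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
rules = [
    "User-agent: *",
    "Disallow: /Flights-Search",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the rules directly, no network access needed

blocked = "https://www.expedia.com/Flights-Search?trip=roundtrip"
allowed = "https://www.expedia.com/"

# can_fetch() returns False when a Disallow rule matches the URL path.
print(parser.can_fetch("*", blocked))
print(parser.can_fetch("*", allowed))
```

A crawler that honours robots.txt skips every URL for which this kind of check fails, so search-result pages like the one in the message are never downloaded.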
Our crawler (Creepy) does not have an option to ignore robots.txt right now. It could be added as a feature or improvement in the future, but that is outside the scope of this issue. If you want us to discuss and evaluate it, please open a new issue requesting it and add the "enhancement" tag.
As a temporary workaround, you could try another crawler that can ignore robots.txt and manually generate an ETT file from the crawl, then run Bitextor from it with the option --ett FILE.
I will close this issue now as the main problems look solved.
Thanks a lot for your effort.