Fix Duplication for Wikipedia (multidefine issue, closed)

museHD commented on September 26, 2024
Fix Duplication for Wikipedia

from multidefine.

Comments (4)

raavann commented on September 26, 2024

I did go through the code and there are a couple of points I would like to highlight:

  • You are web scraping Google, and there are a couple of problems with that. Google keeps changing its markup, so your code will have to be updated regularly. Also, Google gets the definition from Oxford's dictionary, so why not scrape Oxford's website directly?
  • You can also get the definitions from various dictionary APIs, which eliminates the need for web scraping.
    Various dictionary APIs
  • The program needs the chromedriver inside the main directory. For Windows users that might be comfortable, but on Linux and macOS, where the chromedriver is already on the PATH, it can be made easier.
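
A minimal sketch of the chromedriver point above: look on the PATH first and fall back to a copy in the project directory. `find_chromedriver` and its `project_dir` parameter are hypothetical names for illustration, not part of multidefine.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_chromedriver(project_dir: str = ".") -> Optional[str]:
    """Return a path to chromedriver, or None if it cannot be found.

    Hypothetical helper: checks the PATH first (typical on Linux/macOS),
    then a copy sitting in the project directory (the current setup).
    """
    on_path = shutil.which("chromedriver")
    if on_path:
        return on_path
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(project_dir) / name
        if candidate.is_file():
            return str(candidate)
    return None
```

The returned path could then be handed to whatever code starts the browser; if it is `None`, the program can print a hint about installing the driver.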

thanks,
I can make these changes and get back to you. Are you okay with that?


museHD commented on September 26, 2024

Hi @raavann,
Firstly, thank you for your interest in the project! ⭐

  • I originally scraped Google's results because they provide a website, a Wikipedia snippet or a snippet from the Oxford dictionary, and fall back to other sources if a certain source doesn't have the definition. I haven't changed the XPath for the past year or so and it still seems to function for the time being. But you are right; it's bound to break one day.

    Back when I first implemented it, I found that going to the dictionary website was much slower than scraping the definition directly from Google, which is why I chose that approach. Please correct me if I'm wrong or if this isn't the case anymore.

  • Initially, I was going to use an API. However, some of these were not free and/or rate limited the user. Also, the user would have to create their own API key, which reduces user friendliness.

  • Unfortunately, I don't have access to a full Linux environment (I have WSL but that doesn't really count), so making it more accessible would be fantastic! I recently added a feature to update the chromedriver to the latest version on Windows. This would also need to be adapted for Linux, as I believe Google Chrome on Linux doesn't auto-update. I was wondering if this could also help fix #1, as I believe most Linux distros come with a Firefox-based browser, which would require geckodriver.

Sorry for the delay. Please let me know what you think :)
Cheers!


raavann commented on September 26, 2024

Hello @museHD,
Okay, I understand. So what we could do is:

  • First use several APIs, like Urban Dictionary and Wordnik, and query them asynchronously to get the definitions of the words!
    PS: these are free APIs and do not have rate limits.
  • If these APIs can't find the word, we can web scrape Google, like you already implemented: if there is no definition, get the text of the top link.
    PS: I don't know why Google is faster to scrape; maybe it's because Google takes less time to load than Oxford.
    By the way, since both Google's search and the Oxford dictionary are static, you can use BeautifulSoup to scrape these websites, which will increase the speed.
  • We can also check for typos with the TextBlob API.
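
The asynchronous lookup suggested above could be sketched roughly as follows. This only shows the concurrency pattern with `asyncio.gather`; `define_all` and `fake_source` are hypothetical names, and `fake_source` stands in for a real HTTP call to a dictionary API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# A source takes a word and returns a definition, or None if it has none.
Source = Callable[[str], Awaitable[Optional[str]]]


async def define_all(words: List[str], source: Source) -> Dict[str, Optional[str]]:
    """Look up every word concurrently with one source."""
    results = await asyncio.gather(*(source(w) for w in words))
    # gather() returns results in the same order as its inputs.
    return dict(zip(words, results))


async def fake_source(word: str) -> Optional[str]:
    """Stand-in for a real dictionary API request (hypothetical)."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"cat": "a small domesticated feline"}.get(word)


if __name__ == "__main__":
    print(asyncio.run(define_all(["cat", "qwzx"], fake_source)))
    # prints {'cat': 'a small domesticated feline', 'qwzx': None}
```

Words that come back as `None` are the ones that would be passed on to the scraping fallback.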

Okay so the complete process would look like,

  • get the words
  • check for typos
  • search via the APIs
  • if there is a typo, return the result and inform the user of the typo. The user can then say the spelling was intended, in which case we'll scrape the web.
  • if there is no typo, return the API results
  • if no API results are found, scrape the web
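
As a rough sketch of the process listed above, with every external piece injected as a plain function: `define`, `correct`, `apis`, `scrape` and `keep_original` are all hypothetical names, and the spell checker and lookups are stubs rather than real TextBlob or dictionary API calls.

```python
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[str], Optional[str]]


def define(
    word: str,
    correct: Callable[[str], str],
    apis: List[Lookup],
    scrape: Lookup,
    keep_original: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return (word actually searched, definition or None).

    All callables are hypothetical stand-ins:
    - correct: spell checker returning the corrected word
    - apis: dictionary API lookups, tried in order
    - scrape: web-scraping fallback
    keep_original=True means the user says the spelling was intended,
    so the APIs are skipped and the word is scraped as typed.
    """
    fixed = correct(word)
    if fixed != word and keep_original:
        return word, scrape(word)
    for lookup in apis:
        definition = lookup(fixed)
        if definition is not None:
            return fixed, definition
    # No API had the word, so fall back to scraping.
    return fixed, scrape(fixed)
```

Returning the searched word alongside the definition lets the caller tell the user when a correction was applied.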

Phew, that was a long description!
I am currently participating in Hacktoberfest. Please let me know, and I could implement some of these changes!

thanks :)


museHD commented on September 26, 2024

Hi,

  • Wordnik is limited to 100 calls/hour and will require the user to create their own API key, which is not ideal. I had a look online and this API seems to be promising
    Adding UrbanDictionary as a source is definitely interesting; however, it should be queried last as it may not provide accurate definitions. Also, the UrbanDictionary API that you linked has not been maintained for the past 2 years and seems to have some problems with the current website, so we will have to create our own method for scraping UD or use another library.

  • Yup.
    Actually, Google's search is not static. It makes AJAX calls after loading the page, which is why it can't be scraped with BeautifulSoup. I originally used BS but it gave me errors when trying to scrape Google. For Oxford, the content is probably static, but I switched everything to Selenium just to be consistent. If Oxford is static, then its performance can definitely be improved by using BS. I removed a lot of old code that used the requests library and replaced it with Selenium.

  • Checking typos also seems interesting! Currently the program only relies on Google's autocorrect so having this built in would be great as it would also improve queries to Oxford.
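
To illustrate the static-page point above: when the definition is already in the HTML the server sends, it can be parsed without a browser. The sketch below uses the standard library's `html.parser` in place of BeautifulSoup so it runs without extra installs. The `<span class="def">` markup is made up for the example and is not Oxford's actual page structure; `DefinitionParser` and `extract_definitions` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class DefinitionParser(HTMLParser):
    """Collect the text of every <span class="def"> element.

    The tag and class name are assumptions for illustration only.
    """

    def __init__(self) -> None:
        super().__init__()
        self.definitions: List[str] = []
        self._depth = 0  # > 0 while inside a matching <span>

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "span":
                self._depth += 1  # nested span inside a definition
        elif tag == "span" and ("class", "def") in attrs:
            self._depth = 1
            self.definitions.append("")

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.definitions[-1] += data


def extract_definitions(html: str) -> List[str]:
    """Return the stripped text of each definition span in the HTML."""
    parser = DefinitionParser()
    parser.feed(html)
    parser.close()
    return [d.strip() for d in parser.definitions]
```

In practice the HTML string would come from a plain HTTP GET. A page that fills in its content with AJAX after loading would yield an empty list here, which is the situation described for Google.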

Regarding the structure, I would leave out informing that a word was a typo as it would slow down the process of retrieving definitions and the program prints the word being searched in the output anyway. Also, UrbanDictionary should be the last query as it is not very reliable or accurate.

It's great that you're taking part in Hacktoberfest! Feel free to look at my other projects and suggest changes cuz there are a lot of easy fixes that you will be able to do haha. ⭐s are appreciated :)

Let me know if you are happy to make the changes
Happy Hacktoberfest!
