Fix Duplication for Wikipedia about multidefine HOT 4 CLOSED

museHD commented on September 26, 2024

Fix Duplication for Wikipedia

from multidefine.

Comments (4)

raavann commented on September 26, 2024

I went through the code and there are a couple of points I would like to highlight:

You are web scraping Google, and there are a couple of problems with that. Google keeps changing its code, so your code will have to be updated regularly. Also, Google gets its definitions from Oxford's dictionary, so why not scrape the Oxford website directly?
You can also get the definitions from various dictionary APIs, which eliminates the need for web scraping.
The program needs chromedriver inside the main directory. That may be comfortable for Windows users, but on Linux and macOS, where chromedriver is usually already on the PATH, it can be made easier.
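
For the chromedriver point, a minimal sketch of looking the driver up on the PATH first and only then falling back to the main directory might look like the following. The function name `find_chromedriver` and the lookup order are assumptions for illustration, not code from multidefine.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def find_chromedriver(project_dir: Path) -> Optional[Path]:
    """Return a chromedriver path, or None if no driver is found.

    Hypothetical helper: the lookup order below is an assumption.
    """
    # Prefer a driver that is already on PATH (the usual Linux/macOS setup).
    on_path = shutil.which("chromedriver")
    if on_path:
        return Path(on_path)

    # Otherwise fall back to a copy placed in the project's main directory.
    name = "chromedriver.exe" if sys.platform.startswith("win") else "chromedriver"
    local = project_dir / name
    return local if local.is_file() else None
```

The returned path could then be handed to whatever code starts the browser. If the function returns None, the program can print a message asking the user to install the driver.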

thanks,
I can make these changes and get back to you. Are you okay with that?


museHD commented on September 26, 2024

Hi @raavann,
Firstly, thank you for your interest in the project! ⭐

I originally scraped Google's results because they provide either a website, a Wikipedia snippet or a snippet from the Oxford dictionary, and they display information from other sources if a certain source doesn't have the definition. I haven't changed the XPath for the past year or so and it still seems to function for the time being. But you are right; it's bound to break one day.

Back when I first implemented it, I found that going to the dictionary website was much slower than scraping the definition directly from Google, which is why I chose that approach. Please correct me if I'm wrong or if this isn't the case anymore.
Initially, I was going to use an API, but some of these were not free and/or rate-limited the user. Also, the user would have to create their own API key, which reduces user friendliness.
Unfortunately, I don't have access to a full Linux environment (I have WSL but that doesn't really count), so making it more accessible would be fantastic! I recently added a feature to update chromedriver to the latest version on Windows. This would also need to be adapted for Linux, as I believe Google Chrome on Linux doesn't auto-update. I was wondering if this could also help fix #1, as I believe most Linux distros come with a Firefox-based browser, which would require geckodriver.

Sorry for the delay. Please let me know what you think :)
Cheers!


raavann commented on September 26, 2024

Hello @museHD ,
Okay, I understand. What we could do is:

First, use several APIs, like Urban Dictionary and Wordnik, to fetch the definitions of the words asynchronously.
PS: these are free APIs and do not have a rate limit.
If these APIs can't find the word, then we can scrape Google, like you already implemented: if there is no definition, get the text of the top link.
PS: I don't know why Google is faster to scrape; maybe it's because Google takes less time to load than Oxford.
By the way, since both Google's search and the Oxford dictionary are static, you can use BeautifulSoup to scrape these websites, which will increase the speed.
We can also check for typos with the TextBlob API.
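
The idea of querying several APIs asynchronously and falling through to the next source could be sketched roughly as below. Everything here is hypothetical: `define_all` and the two stand-in sources are made up for illustration and do not call Wordnik, Urban Dictionary, or any real service.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# A source is any coroutine function mapping a word to a definition (or None).
Source = Callable[[str], Awaitable[Optional[str]]]


async def define_all(words: List[str], sources: List[Source]) -> Dict[str, Optional[str]]:
    """Look up all words concurrently; for each word, try sources in order."""

    async def define_one(word: str) -> Optional[str]:
        for source in sources:
            try:
                result = await source(word)
            except Exception:
                # A failing source should not abort the whole lookup.
                continue
            if result:
                return result
        return None

    results = await asyncio.gather(*(define_one(word) for word in words))
    return dict(zip(words, results))


# Stand-in sources (assumptions): real ones would make HTTP requests.
async def fake_dictionary_api(word: str) -> Optional[str]:
    await asyncio.sleep(0)  # placeholder for network latency
    return {"cat": "a small domesticated feline"}.get(word)


async def fake_slang_api(word: str) -> Optional[str]:
    await asyncio.sleep(0)
    return {"yeet": "to throw with force"}.get(word)
```

Calling `asyncio.run(define_all(["cat", "yeet"], [fake_dictionary_api, fake_slang_api]))` looks up both words concurrently. The order of the `sources` list decides which source is trusted first.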

Okay so the complete process would look like,

get the words
check for typos
if there is a typo
search via API with the corrected word
return the result and tell the user the word was treated as a typo. The user can then say the original spelling was intended, in which case we'll scrape the web.
if there is no typo
return the API results
if no API results are found, scrape the web
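
The process above can be sketched as a single function. This is only an outline under stated assumptions: `define`, `Lookup`, and the three callables (`correct`, `api_lookup`, `scrape`) are hypothetical names, and the typo checker is passed in rather than tied to TextBlob.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lookup:
    searched: str                         # the word that was actually searched
    definition: Optional[str]             # None if nothing was found
    corrected_from: Optional[str] = None  # original spelling, if it was corrected


def define(
    word: str,
    correct: Callable[[str], str],               # hypothetical typo checker
    api_lookup: Callable[[str], Optional[str]],  # hypothetical API query
    scrape: Callable[[str], Optional[str]],      # hypothetical web scraper
    keep_original: bool = False,
) -> Lookup:
    suggestion = correct(word)
    if suggestion != word:
        if keep_original:
            # The user says the unusual spelling was intended: scrape the web.
            return Lookup(word, scrape(word))
        # Search the corrected word and report what it was corrected from.
        return Lookup(suggestion, api_lookup(suggestion), corrected_from=word)

    # No typo: use the API result, scraping only if the API finds nothing.
    definition = api_lookup(word)
    if definition is None:
        definition = scrape(word)
    return Lookup(word, definition)
```

The `corrected_from` field is how the caller would know to tell the user about the typo. A second call with `keep_original=True` covers the case where the user meant the original spelling.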

pheww.. that was a long description!
I am currently participating in Hacktoberfest. Please get back to me and I could implement some of these changes!

thanks :)


museHD commented on September 26, 2024

Hi,

Wordnik is limited to 100 calls/hour and will require the user to create their own API key, which is not ideal. I had a look online and this API seems to be promising.
Adding UrbanDictionary as a source is definitely interesting, however it should be queried last as it may not provide accurate definitions. Also, the UrbanDictionary API that you linked has not been maintained for the past 2 years and seems to have some problems with the current website, so we will have to create our own method for scraping UD or use another library.
Yup.
Actually, Google's search is not static. It makes AJAX calls after loading the page, which is why it can't be scraped with BeautifulSoup. I originally used BS but it gave me errors when trying to scrape Google. For Oxford, the content is probably static, but I switched everything to Selenium just to be consistent. If Oxford is static, then its performance can definitely be improved by using BS. I removed a lot of old code that used the requests library and replaced it with Selenium.
Checking typos also seems interesting! Currently the program only relies on Google's autocorrect so having this built in would be great as it would also improve queries to Oxford.
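
On the static-page point: if a page's HTML already contains the definition when it is fetched, it can be parsed without driving a browser. The sketch below uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup. The markup it looks for (`<span class="def">`) is a made-up assumption and not the real structure of the Oxford site.

```python
from html.parser import HTMLParser
from typing import List


class DefinitionExtractor(HTMLParser):
    """Collect the text of every <span class="def"> (hypothetical markup)."""

    def __init__(self) -> None:
        super().__init__()
        self.definitions: List[str] = []
        self._depth = 0               # span nesting depth inside a match
        self._buffer: List[str] = []  # text pieces of the current match

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "def" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Normalise whitespace and store the finished definition.
                self.definitions.append(" ".join("".join(self._buffer).split()))
                self._buffer.clear()

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_definitions(html: str) -> List[str]:
    parser = DefinitionExtractor()
    parser.feed(html)
    return parser.definitions
```

This only works on HTML that is present in the initial response. Content that a page loads later through AJAX calls would not be in the string passed to `extract_definitions`, which is the case where a browser driver such as Selenium is still needed.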

Regarding the structure, I would leave out informing the user that a word was a typo, as it would slow down the process of retrieving definitions, and the program prints the word being searched in the output anyway. Also, UrbanDictionary should be the last query as it is not very reliable or accurate.

It's great that you're taking part in Hacktoberfest! Feel free to look at my other projects and suggest changes cuz there are a lot of easy fixes that you will be able to do haha. ⭐s are appreciated :)

Let me know if you are happy to make the changes
Happy Hacktoberfest!
