clessn / clessn-blend

Code used to construct the data ETL pipelines between the various data sources required for the CLESSN and the datamarts providing datasets needed for research or visualization.

R 95.34% Python 3.37% Shell 1.15% Dockerfile 0.13%

clessn-blend's People

Contributors

adriclout, alexbouillon, camilletremblayantoine, clementcadieux, cloegp, dave-doume, hubcad25, jergil, judith-bourque, kadufresne, p2xcode, patoscope, skelalex

clessn-blend's Issues

Article returning to the front page - CBC

An article that had previously been on the front page came back to the front page, and the scraper seems to detect it as a new article: the hashed_html is different.

[Screenshots taken 2023-05-12 at 15:59:17, 16:00:55 and 15:59:30]
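
A possible mitigation can be sketched in Python, under assumptions. The report does not say how hashed_html is computed. If it is a hash of the raw markup, any change in the surrounding markup (position attributes, ad slots, injected scripts) would change the hash even when the article text is identical. The sketch below hashes only the normalized visible text instead. The names stable_content_hash and _TextExtractor are hypothetical and are not part of clessn-blend.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def stable_content_hash(html: str) -> str:
    """Hash the visible text only, so markup-only changes keep the same hash."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all whitespace runs so layout changes do not affect the hash.
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With such a hash, an article that leaves the front page and later returns with different wrapper markup would still map to the same value, while a genuine edit to the text would produce a new one.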

CBC timestamp

On CBC, a modification is recorded that did not necessarily take place. In addition, the timestamp is odd (the start and the end are identical).
[Screenshots taken 2023-05-12 at 12:24:47 and 12:25:32]

CBC bug

Several changes are reported for the CBC article, even though it has been unchanged for an hour.

[Screenshots taken 2023-05-12 at 15:39:40 and 15:40:36]

Several headlines for the same frontpage - Toronto Star

Please let me know if I should split this issue into three separate issues. The pattern I noticed is exactly the same in each case.

I do not know if this is specific to the Toronto Star or a more general problem.

In four instances (one of which will be discussed in a separate issue because it raises other challenges), there are two or more Hublot headline elements for the same frontpage. Interestingly, some of these headlines include duplicates, and some frontpages in the past day were associated with only one headline, as they should be, so the issue does not happen every time. The headline elements associated with the same frontpage differ from each other in only a few ways: (1) the hashed_html is different; (2) the timestamps are different and non-overlapping (but, taken together, they match the frontpage); (3) the final numbers of the lake item are different.

Instance #1:

Instance #2:

Instance #3:

Screenshots from instance #3 only (since the pattern is very similar for instances #1 and #2; happy to add more screenshots if needed):
[Screenshots taken 2023-05-12 at 13:22:45, 13:22:26 and 13:22:36]
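
One way to check for this pattern, and to collapse it, is sketched below in Python. It assumes that headline records for the same frontpage share a title and have back-to-back, non-overlapping time ranges, as described above. The Headline class, its field names, and the 15-minute max_gap tolerance are all hypothetical choices for illustration, not the actual Hublot schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Headline:
    title: str        # hypothetical field name
    start: datetime   # stands in for start_timestamp
    end: datetime     # stands in for end_timestamp


def merge_adjacent(records, max_gap=timedelta(minutes=15)):
    """Merge records with the same title whose time ranges touch or nearly touch."""
    merged = []
    for rec in sorted(records, key=lambda r: (r.title, r.start)):
        last = merged[-1] if merged else None
        if last and last.title == rec.title and rec.start - last.end <= max_gap:
            # Same story continuing: extend the previous interval.
            last.end = max(last.end, rec.end)
        else:
            # New story, or a real gap: start a new interval (copy, do not alias).
            merged.append(Headline(rec.title, rec.start, rec.end))
    return merged
```

If the number of merged records is smaller than the number of input records, the input contained several headline elements that collectively cover one continuous frontpage period.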

Timestamp of lake item name change in headline vs. frontpage

As explained in issue #39, some Radar+ Hublot frontpage elements are associated with two or more headline elements. This is the case for this story, but there are two additional issues.

  1. The article seems to be about the same story, but the title eventually changes, creating two frontpages. I am not sure if this is an issue or is something that we want to keep. @ClementCadieux ?
  2. The transition between both titles is 10 minutes earlier in the frontpages than in the timestamps. Not a big issue but a little odd to me.
  3. And, just like in issue #39, there should be one headline per frontpage.

Screenshots of frontpage 1, frontpage 2, headline 4 and headline 5 (this is where the title transition happens):
[Screenshots taken 2023-05-12 at 13:34:17, 13:34:25, 13:34:37 and 13:34:45]

Toronto Star - 2-min late scraping

Yesterday at 1:51 p.m. no scraping occurred for the TorStar. It instead occurred 2 min later, at 1:53. Not a big issue but just wanted to flag this for both the headline and frontpage. Headline: https://clhub.clessn.cloud/admin/core/lake/56980/change/?_changelist_filters=p%3D3. Frontpage: https://clhub.clessn.cloud/admin/core/lake/56979/change/?_changelist_filters=p%3D3.

Looking at the channel 77x_clessn-blend-radar-plus, it seems the Slack channel was updated even later (1:54) and the issue also happened at 2:03 (2 min late) and 2:12 (1 min late).

Screenshots:
[Screenshots taken 2023-05-16 at 12:19:28 and 12:19:33]

start_timestamp/end_timestamp chronology issue

This issue affects both the frontpage (https://clhub.clessn.cloud/admin/core/lake/55761/change/?_changelist_filters=p%3D2%26path%3Dradarplus%252Ffrontpage) and the headline (https://clhub.clessn.cloud/admin/core/lake/55762/change/?_changelist_filters=p%3D3%26path%3Dradarplus%252Fheadline) of a Radar+ Hublot element. The start_timestamp is 2023-05-11, 13:53, while the end_timestamp is 2023-05-11, 9:55. So the interval "ends before it starts." The problem could be with the start_timestamp, the end_timestamp, or both.

[Screenshots taken 2023-05-12 at 13:01:05 and 13:01:13]
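
A minimal check for this kind of record can be sketched in Python. The sketch also tests one hypothesis, which is only a guess and not confirmed anywhere in this report: the gap between the two values is close to four hours, which would be consistent with one timestamp being written in UTC and the other in Eastern Daylight Time (UTC-4). The helper names is_reversed and to_utc are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Daylight Time (UTC-4), in effect in Toronto and Montreal in May.
EDT = timezone(timedelta(hours=-4))


def is_reversed(start: datetime, end: datetime) -> bool:
    """Return True when an interval ends before it starts."""
    return end < start


def to_utc(naive: datetime, assumed_tz: timezone) -> datetime:
    """Interpret a naive timestamp in assumed_tz and convert it to UTC."""
    return naive.replace(tzinfo=assumed_tz).astimezone(timezone.utc)


# Values reported in the issue (seconds are unknown, so they are left at zero).
start_timestamp = datetime(2023, 5, 11, 13, 53)
end_timestamp = datetime(2023, 5, 11, 9, 55)

# As stored, the interval is reversed.
reversed_as_stored = is_reversed(start_timestamp, end_timestamp)

# Hypothesis only: start was written in UTC and end in local Eastern time.
reversed_if_mixed = is_reversed(
    to_utc(start_timestamp, timezone.utc),
    to_utc(end_timestamp, EDT),
)
```

Under the mixed-timezone hypothesis the interval is no longer reversed and lasts two minutes. That makes the hypothesis plausible but does not prove it; the scraper code would need to be checked to confirm how each field is written.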

Same bug for CBC

Once again, an article is recorded with two hashed_html values even though it has not been modified for 11 hours.

[Screenshots taken 2023-05-13 at 14:45:14 and 14:46:11]

CBC duplication

Once again, an article that has not been modified for 8 hours is recorded with 2 hashed_html values.
[Screenshots taken 2023-05-14 at 12:00:01, 12:01:37 and 12:01:59]

Toronto Star - wrong CSS element scraped?

Is it possible that the Toronto Star scraper scrapes the GTA section's headline & frontpage instead of the general headline & frontpage? When I visit the Toronto Star's webpage, the headline/frontpage that appears is not the same as the one in Radar+. I currently see "STAR INVESTIGATION: Ontario’s top pathologist was accused of abusing his power. Now, judges say the bitter dispute never should have made it to their court" as the first article on the Toronto Star's webpage. On the other hand, "Man wanted after woman sexually assaulted on Toronto walking trail" is the current headline/frontpage according to Radar+. It only appears further down the page, as the main element in the "GTA" section.

Radar+ headline: https://clhub.clessn.cloud/admin/core/lake/57273/change/
Radar+ frontpage: https://clhub.clessn.cloud/admin/core/lake/57272/change/

Screenshots:
[Screenshots taken 2023-05-16 at 12:36:38, 12:36:49, 12:38:14 and 12:38:21]
