Comments (5)

KartikSoneji commented on September 26, 2024

a. Who is scraping the catchup page?

We've already had one project that does it (https://github.com/mihikagaonkar/OTC-Dashboard), so let us not make any assumptions and keep things open for the future.

There was no need to scrape the website; all the data was available in the repo.

b. The idea is to expose an api, which people can use instead of scraping.

That is an alternative, but it requires additional effort. What is your plan for this API?

No, that is a side effect of implementing infinite scroll.
The endpoint that will be called to get the next set of summaries will be the same one that someone might use to scrape them.
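A minimal sketch of how such an endpoint could hand out summaries in batches. The thread does not specify the endpoint's shape, so the `offset` and `limit` parameters, the `next_batch` name, and the response keys are all assumptions made for illustration.

```python
def next_batch(fragments, offset, limit=10):
    """Return one batch of summary fragments and where the next batch starts.

    `fragments` is a list of HTML strings, one per summary, newest first.
    `offset` and `limit` are hypothetical parameter names, not from the thread.
    """
    batch = fragments[offset:offset + limit]
    end = offset + limit
    # `next` is None once the last summary has been served.
    next_offset = end if end < len(fragments) else None
    return {"html": "\n".join(batch), "next": next_offset}


# Stand-in data: five tiny fragments instead of real summaries.
fragments = [f"<section>{i}</section>" for i in range(5)]
first = next_batch(fragments, offset=0, limit=2)
last = next_batch(fragments, offset=4, limit=2)
```

The page would call this with the `next` value from the previous response until it gets `None`; a scraper could follow the same loop.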

How much detail will it include? Will it send over the entire file or will it provide options to get dates, durations and other specific parts of the content?

Just the <section> tags that currently contain each summary in the combined summary page.

-e, --embedded
Output an embeddable document, which excludes the header, the footer, and everything outside the body of the document. This option is useful for producing documents that can be inserted into an external template.
We shouldn't need a new parser, just the -e flag.
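As a rough sketch, the embedded output could be produced by shelling out to the converter. The thread does not name the tool; the quoted option text matches Asciidoctor's `-e, --embedded` flag, so this assumes the `asciidoctor` CLI is installed. The function names and the `summary.adoc` path are hypothetical.

```python
import subprocess


def embedded_command(source_path):
    """Build the command line (assumes the Asciidoctor CLI)."""
    # -e    : embeddable output, no header, footer, or outer document shell
    # -o -  : write the result to stdout instead of a file
    return ["asciidoctor", "-e", "-o", "-", source_path]


def render_embedded(source_path):
    """Run the converter and return the embeddable HTML as a string."""
    result = subprocess.run(
        embedded_command(source_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the conversion fails
    )
    return result.stdout


# Hypothetical file name, shown only to illustrate the call.
command = embedded_command("summary.adoc")
```

Calling `render_embedded("summary.adoc")` would then return the body markup that the endpoint can send as-is.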

Also more importantly, how would we let someone who wanted to scrape our pages know that we have such a feature available?

Hmm maybe add a page, but most likely someone who wants to scrape the page will analyze network requests.
Or ask us about it.

But in general, there are very few reasons to scrape the summaries from the website.
If someone wants to run static analysis, individual files in the repo are better for that.
The only other reason might be to integrate with another website, but in that case an api would be easier.

from catchup.

sreekaransrinath commented on September 26, 2024

Will make it a pain in the ass to scrape ;-;

KartikSoneji commented on September 26, 2024

a. Who is scraping the catchup page?
b. The idea is to expose an api, which people can use instead of scraping.

HarshKapadia2 commented on September 26, 2024

a. Who is scraping the catchup page?

We've already had one project that does it (https://github.com/mihikagaonkar/OTC-Dashboard), so let us not make any assumptions and keep things open for the future.

b. The idea is to expose an api, which people can use instead of scraping.

That is an alternative, but it requires additional effort.
What is your plan for this API? How much detail will it include? Will it send over the entire file or will it provide options to get dates, durations and other specific parts of the content? (This API will also become a blocker if we ever change the file formatting, as we would then have to parse and return several different file formats.)
Also more importantly, how would we let someone who wanted to scrape our pages know that we have such a feature available?

HarshKapadia2 commented on September 26, 2024

Makes sense. Thank you.

We should add a note somewhere for scrapers though, just to inform them about the API. (Maybe in the API response?)
We will also have to document the API.
