certification-data-scraper's People

Contributors

carolinejli, ellasoderberg, emmahag

certification-data-scraper's Issues

Read number of pages in TED script

Currently we have to manually check how many pages the TED script should iterate through. It would be much better to read all the numbers in the pagination element, find the highest one, and iterate through that many pages.
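
A minimal sketch of the idea, assuming the text labels of the pagination links have already been collected into a list of strings (how they are read from the TED page is not shown here). The function name `highest_page_number` and the sample labels are hypothetical.

```python
import re


def highest_page_number(pagination_labels):
    """Return the largest page number among the pagination link labels.

    Labels that are not plain numbers (for example "Next" or "...") are
    ignored. If no number is found, assume there is a single page.
    """
    numbers = [
        int(label.strip())
        for label in pagination_labels
        if re.fullmatch(r"\d+", label.strip())
    ]
    return max(numbers) if numbers else 1


# Hypothetical labels, as they might appear in a pagination element.
labels = ["Previous", "1", "2", "3", "...", "47", "Next"]
page_count = highest_page_number(labels)
for page in range(1, page_count + 1):
    pass  # scrape the given page here
```

If the pagination element only shows a window of nearby pages, the highest visible number may be lower than the real page count, so the value might need to be re-read while iterating.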

Fix zip issue

Sometimes a file cannot be unzipped. When that happens, the script hangs and does nothing instead of crashing. Add a timeout for this case so that the script can restart instead.
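
One possible approach, sketched under the assumption that the archives are ordinary zip files: run the extraction in a separate process through the standard library's `zipfile` command-line interface, so that a hung extraction can be killed after a timeout. The function name `unzip_with_timeout` and the default of 60 seconds are hypothetical.

```python
import subprocess
import sys


def unzip_with_timeout(archive_path, target_dir, timeout_seconds=60):
    """Extract archive_path into target_dir in a child process.

    Returns True on success and False if the extraction fails or does not
    finish within timeout_seconds. subprocess.run kills the child process
    when the timeout expires.
    """
    command = [sys.executable, "-m", "zipfile", "-e",
               str(archive_path), str(target_dir)]
    try:
        subprocess.run(command, check=True, timeout=timeout_seconds,
                       capture_output=True)
    except subprocess.TimeoutExpired:
        return False  # extraction hung; the caller can skip or restart
    except subprocess.CalledProcessError:
        return False  # extraction failed, e.g. a corrupt archive
    return True
```

The caller can then decide whether a `False` result should skip the file or trigger a restart of the script.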

Implement tests

Build test modules to test the script. For example, implement type checking for the scraped data.
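
As a rough illustration of type checking for scraped data, the sketch below validates one scraped record against a table of expected types. The field names in `EXPECTED_TYPES` are assumptions made for the example, not the scraper's actual columns.

```python
# Hypothetical field names and types; the real columns may differ.
EXPECTED_TYPES = {
    "title": str,
    "url": str,
    "published": str,
    "value_eur": (int, float, type(None)),  # the value may be missing
}


def find_type_errors(record, expected_types=EXPECTED_TYPES):
    """Return a list of human-readable problems found in one scraped record."""
    errors = []
    for field, expected in expected_types.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"wrong type for {field}: {actual}")
    return errors


good = {"title": "Laptops", "url": "https://example.org/1",
        "published": "2021-05-01", "value_eur": None}
bad = {"title": 123, "url": "https://example.org/2", "published": "2021-05-01"}
```

A test module could call `find_type_errors` on a handful of sample records and assert that the returned list is empty for valid ones.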

Enhance the file structure

Restructure the files so that the code is easier to maintain and organized more logically.

Add tender type info

Add a column for tender type, showing whether the notice was published before, during, or after the tender process.

Ensure the script can start without intervention

Currently, in order to start the script, we have to manually check where the scraper should stop scraping. By looking up the title of the first tender scraped last week, the scraper can find its own stopping point and start without intervention.
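
A minimal sketch of that stopping rule, assuming tenders arrive newest first and each tender is a dict with a `"title"` key (both are assumptions). The function name `tenders_until_seen` is hypothetical.

```python
def tenders_until_seen(tenders, last_seen_title):
    """Collect tenders until the first one scraped last week is reached.

    `tenders` is assumed to be ordered newest first. The tender whose title
    equals last_seen_title, and everything after it, is left out.
    """
    new_tenders = []
    for tender in tenders:
        if tender["title"] == last_seen_title:
            break  # everything from here on was scraped last week
        new_tenders.append(tender)
    return new_tenders


# Hypothetical sample data.
scraped = [
    {"title": "Monitors for schools"},
    {"title": "Office laptops"},
    {"title": "Printers"},          # first tender scraped last week
    {"title": "Network switches"},
]
fresh = tenders_until_seen(scraped, "Printers")
```

Titles are not guaranteed to be unique, so matching on a stable identifier, where one is available, would likely be more robust than matching on the title alone.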

Add comments

Comment everything in order to make the code more readable.

Ensure the script can fail and restart with minimal intervention

Currently, when the script fails, we have to look through the log files manually to see where to restart the script and to find the current Google Docs ID. It would be better if this information were saved and read back when the script is restarted, to reduce manual intervention.
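
One way this could look, sketched with a small JSON state file. The file layout and the keys `page` and `document_id` are assumptions for illustration only.

```python
import json
from pathlib import Path


def save_state(state_path, page, document_id):
    """Write the restart information, replacing any previous state file."""
    state_path = Path(state_path)
    temp_path = state_path.with_suffix(".tmp")
    temp_path.write_text(
        json.dumps({"page": page, "document_id": document_id}),
        encoding="utf-8",
    )
    # Write to a temporary file first, then rename it over the target, so a
    # crash in the middle of writing does not leave a half-written state file.
    temp_path.replace(state_path)


def load_state(state_path):
    """Return the saved state as a dict, or None if there is nothing saved."""
    state_path = Path(state_path)
    if not state_path.exists():
        return None
    return json.loads(state_path.read_text(encoding="utf-8"))
```

The script could call `save_state` after each completed page and `load_state` at startup, resuming from the saved page when a state file exists.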

Improve logging

Improve the logging so that it gives more specific information about the errors that occur. It should be possible to look at the log and immediately see why the script has crashed.
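
A sketch of one option using the standard `logging` module: `logger.exception` records the message together with the full traceback, so the log shows both where and why the failure happened. The logger name, the log format, and the stand-in `scrape_page` function are hypothetical.

```python
import logging


def build_logger(log_path):
    """Create a file logger that records time, level, and message."""
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def scrape_page(page_number):
    """Stand-in for the real scraping step; always fails in this sketch."""
    raise ValueError("pagination element not found")


def run_page(logger, page_number):
    """Scrape one page and log the full traceback if it fails."""
    try:
        scrape_page(page_number)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Failed while scraping page %d", page_number)
        return False
    return True
```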

Discuss and fix the GDPR issue

We will not store contact details connected to any European citizens due to GDPR. However, the contact details are part of what makes the scraping so valuable. We need to find a way to extract valuable information while still complying with GDPR.

Read more here: https://www.zyte.com/blog/web-scraping-gdpr-compliance-guide/
https://www.zyte.com/blog/gdpr-public-personal-data-update/
https://www.zyte.com/blog/solution-architecture-part-3-conducting-a-web-scraping-legal-review/

Include information from Salesforce

Add information from Salesforce to the data delivery. Ideas:

  • whether or not a person or company has already been in contact
  • whether a company has used TCO Certified as a requirement in a tender

Retrieve documents from TED

See if it would be possible to retrieve documents from TED tenders, or at least from a few of them.

Improve the duplicate contact check for TED

As of now, we match every scraped contact against a static list of historically scraped contacts to see whether we have seen them before. This could be done in a better way, potentially by connecting to the Salesforce API.

Add functionality to gracefully shut down script from an input

Currently, if the script stops, it does not shut down gracefully (the Chrome driver might still be running, and the cache is not saved). It would be good to have a command that runs the shutdown function, with flags to specify what should be saved to the cache (whether the run was successful or not).
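
A rough sketch of the flag handling and the shutdown function, assuming the driver is a Selenium WebDriver (whose `quit()` method ends the browser session) and that the cache is a JSON file. The flag names `--success`, `--failure`, and `--cache` are hypothetical, and how the command reaches an already running script is not covered here.

```python
import argparse
import json


def shutdown(driver, cache_path, run_successful):
    """Quit the browser driver and record the outcome of the run."""
    try:
        if driver is not None:
            driver.quit()  # ends the browser session and the driver process
    finally:
        # Save the cache even if quitting the driver raised an error.
        with open(cache_path, "w", encoding="utf-8") as cache_file:
            json.dump({"run_successful": run_successful}, cache_file)


def parse_args(argv=None):
    """Parse the hypothetical shutdown flags."""
    parser = argparse.ArgumentParser(description="Shut the scraper down.")
    parser.add_argument("--cache", default="cache.json",
                        help="path of the cache file to write")
    outcome = parser.add_mutually_exclusive_group(required=True)
    outcome.add_argument("--success", dest="run_successful",
                         action="store_true",
                         help="mark the run as successful")
    outcome.add_argument("--failure", dest="run_successful",
                         action="store_false",
                         help="mark the run as failed")
    return parser.parse_args(argv)
```

With these names, a call such as `shutdown(driver, args.cache, args.run_successful)` after `args = parse_args()` would tie the two pieces together.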

Create algorithm to sort tenders on relevance

For many databases, the search functions suck. It would be very useful to build our own relevance index, based on, for example, keywords, CPV codes, and other codes, so that we can sort tenders by relevance in the Google Sheet ourselves.
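
As a starting point, a simple weighted score could combine keyword matches with CPV code prefixes. The sketch below is only one possible scoring scheme; the field names, weights, keywords, and CPV prefixes are all illustrative assumptions.

```python
def relevance_score(tender, keywords, cpv_prefixes,
                    keyword_weight=1.0, cpv_weight=2.0):
    """Score a tender by keyword matches and matching CPV code prefixes."""
    text = f"{tender.get('title', '')} {tender.get('description', '')}".lower()
    keyword_hits = sum(1 for keyword in keywords if keyword.lower() in text)
    prefixes = tuple(cpv_prefixes)
    cpv_hits = sum(1 for code in tender.get("cpv_codes", [])
                   if code.startswith(prefixes))
    return keyword_weight * keyword_hits + cpv_weight * cpv_hits


def sort_by_relevance(tenders, keywords, cpv_prefixes):
    """Return the tenders ordered from most to least relevant."""
    return sorted(
        tenders,
        key=lambda tender: relevance_score(tender, keywords, cpv_prefixes),
        reverse=True,
    )


# Hypothetical sample data.
tenders = [
    {"title": "Road maintenance", "cpv_codes": ["45233141"]},
    {"title": "Laptops and monitors", "cpv_codes": ["30213100"]},
    {"title": "Monitors for the library", "cpv_codes": []},
]
ranked = sort_by_relevance(tenders, keywords=["laptop", "monitor"],
                           cpv_prefixes=["302"])
```

The resulting score could be written to its own column so that the sheet can be sorted or filtered on it.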
