thegreenwebfoundation / greencheck-api Goto Github PK

The green web foundation API

Home Page: https://www.thegreenwebfoundation.org/

License: Apache License 2.0

PHP 83.53% HTML 4.74% CSS 6.91% Shell 1.85% Python 0.36% Jinja 2.61%

greencheck-api's Introduction

Green Web Foundation API

In this repo you can find the source code for the API and checking code that the Green Web Foundation servers use to check the power a domain uses.

Overview

Following Simon Brown's C4 model this repo includes the API server code, along with the green check worker code in packages/greencheck.

Apps - API Server at api.thegreenwebfoundation.org

This repository contains the code served to you when you visit http://api.thegreenwebfoundation.org.

When requests come in, symfony accepts and validates the request, and creates a job for enqeueue to service with a worker.

The greenweb api application running on https://api.thegreenwebfoundation.org

This provides a backend for the browser extensions and the website on https://www.thegreenwebfoundation.org

This needs:

an enqueue adapter, like fs for development, amqp for production
php 7.3
nginx
redis for greencheck library
ansible and ssh access to server for deploys

Currently runs on symfony 5.x

To start development:

Clone the monorepo git clone [email protected]:thegreenwebfoundation/thegreenwebfoundation.git
Configure .env.local (copy from .env) for a local mysql database
composer install
bin/console server:run
check the fixtures in packages/greencheck/src/TGWF/Fixtures to setup a fixture database

To deploy:

bin/deploy

To test locally:

Go to http://127.0.0.1:8000 for homepage
Go to http://127.0.0.1:8000/greencheck/www.nu.nl to test www.nu.nl
If this keeps loading, everything is correctly setup, Now run bin/console enqueue:consume in a seperate terminal to process the checks

Packages - Greencheck

In packages/greencheck is the library used for carrying out checks against the Green Web Foundation Database. Workers take jobs in a RabbitMQ queue, and call the greencheck code to return the result quickly, before passing the result, RPC-style to the original calling code in symfony API server.

Packages - public suffix

In packages/publicsuffix is a library provides helpers for retrieving the public suffix of a domain name based on the Mozilla Public Suffix list. Used by the API Server.

greencheck-api's People

Contributors

Stargazers

Watchers

Forkers

fossabot janeklb prototypefund

greencheck-api's Issues

Update rabbit config to reject messages when queue is too large

At present, we have Rabbit MQ set to keep accepting messages even when we have too many backed up.

Given we've had something like a 20x increase in traffic in the last 6 months, this turns out to have been a bad idea, and as a result, RabbitMQ keeps falling over.

I'm not that well versed in RabbitMQ, but it looks like we'll need to implement some kind of backpressure to reject jobs that are coming in when we can't service them.

https://www.rabbitmq.com/maxlength.html

This will also involve thinking through a helpful error message for consumers of the API, so they know to back off.

Images are not being cached the same way as regular API requests are

We're seeing some strange behaviour with the API, that I think it is related to our caching strategy (or rather the various strategies as we add ever more layers between us and the poor MySQL database)

We sometimes have results like this when doing a web check at https://www.thegreenwebfoundation.org/green-web-check:

I think this is down to us serving images and direct API requests from different cache stores.

Looking in the code

This line here - where we are doing a greencheck when serving an image inside the greencheckimageAction call - I think this is returning a grey result:
https://github.com/thegreenwebfoundation/greencheck-api/blob/master/src/Controller/DefaultController.php#L187

I'm not clear on why the API would return green thoough, elsewhere, as the regular greencheck API call we make does the same call too:
https://github.com/thegreenwebfoundation/greencheck-api/blob/master/src/Controller/DefaultController.php#L408

The only thing I can think of off the top of my head would be nginx caching images longer than the regular API requests.

I thought the code might look on the filesystem for an image as a short cut inside greencheckimageAction, and this might mean we don't ever hit the Redis cache. That doesn't seem to be the case though.

Any other ideas @arendjantetteroo ?

I wonder if it's being cached by cloudflare after it's left the server now, and end users are seeing that result instead.

Difference in output for sites with www. and without

Where both sites return the same ip address and the ip address itself also returns a green result.

The only recent change on the api side is using nginx caching for the result.

Add public IP ranges for cloud giants - Microsoft

After reading over the Github Actions documentation I found out that they list their public IP ranges too of the machines running in Azure's various regions.

These are updated each week, so like we do with Amazon in #18, I think we can do the same for Microsoft's cloud too. This would mean people building in these regions would get the smiley green badges without loads of tiresome manually updates.

The link to the JSON list of IP ranges and services is below:

https://www.microsoft.com/en-us/download/details.aspx?id=56519

What we do for AWS, we should do for Microsoft too, with a different JSON parser as the data is likely to be structured differently, even we we're still dealing with regions, ASNs and IP ranges.

Set up release of data from the green web DB

We need to make release of the content in the greenweb database, which would ideally be the same as the API results, to make documentation easier:

For reference responses look like so (abridged, to remove the bits for the browser extension we don't use):

{
  "green": true,
  "url": "www.thegreenwebfoundation.org",
  "data": true,
  "hostedby": "LeaseWeb",
  "hostedbyid": 156,
  "hostedbywebsite": "www.leaseweb.com",
}

We'd need to query the greencheck table, then for each url we have we'd ideally need the result of the latest check.

Todo

transfer across GreencheckDataDumpCommand CLI command to dump to a csv file (we may need to update this)
transfer across GreencheckCsvCheckerCommand to make sure we have the top 1m urls in redis (we might make this work a separate queue to avoid messing with performance)
update current caching code to write to the datastructure we'd pull from for the CSV file
Decide about adding the checked_on value

Add simple redirect

Right now, hitting /greencheck with no slash at the end triggers a 404, giving a traceback like so:

Symfony\Component\HttpKernel\Exception\NotFoundHttpException: No route found for "GET /greencheck

This fills up our error catcher, so we want to stop this happening.

one solution - add a redirect url for `/greencheck` to `/greencheck/`

We could add a redirect route. This would stop triggering the 404.

https://github.com/thegreenwebfoundation/thegreenwebfoundation/blob/master/apps/api/src/Controller/DefaultController.php#L98

It's not the only solution, but would be a good first issue.

Handle timeouts more gracefully in the API when doing DNS looksups

At present, when you use the multi domain API, if a domain is taking a long time to resolve, we make a use wait for tens of seconds, before eventually serving a 500 error.

I saw this when debugging requests made for the new sitespeed plugin, and was eventually able to trace it down doing a multi-look-up with this API url:

http://api.thegreenwebfoundation.org/v2/greencheckmulti/[%22srv-2020-02-13-22.pixel.parsely.com%22]

What happens now

Right now, when look up and what I assume is a DNS look takes a very long time, we wait for nearly a minute, and then we serve a 500 page as html.

What I'd expect to happen

Because we're relying on a bunch of DNS lookups in a multidomain API call, and each could take an undefined amount of time, I would expect a response listing:

the results where we could see green hosted infrastructure (green domains)
the ones where we have no evidence of green infra (grey)
some kind of representation where DNS lookups timed out or otherwise failed (this isn't the same as a grey hosted domain, as the domain might be malformed, and never able to resolve to an ip address)

Think about adding way to check for IP behind CDNs, like cloudflare

We had a user ask for the first time ask about getting a smiley green face instead of a grey face as they want their host to use green power.

This is the opposite of what people normally ask for!

This also bring up the issue that CDNs that provide DDOS protection like cloudflare make it harder for us to see the 'real' ip address a website is hosted behind.

Can we find the real IP address in a safe, non-intrusive way?

I don't think it is,, but it's not black and white - it looks like it's possible to do this in some cases, if you read these links:

https://securitytrails.com/blog/ip-address-behind-cloudflare
https://support.cloudflare.com/hc/en-us/articles/115003687931-Warning-about-exposing-your-origin-IP-address-via-DNS-records

After reading through these, it doesn't seem like a good idea, a we're more interested in knowing the original hosting organisation, than the IP address - that's just the way we current look it up.

This might actually be a good use case for describing the original host with something like a carbon.txt file, do decouple the hosting from the IP address.

Add fixtures and phpunit/phpspec tests for greencheck library

We have tests for greencheck, but before we can run these on CI, we need to be able to run them locally.

Running them locally

Install the dependencies:

composer install

Set up the connections and config

Update the config with the connection details for Redis and MySQL/ MariaDB.

You will need to set up database tables in a mysql database you have created my calling:

php ./tests/doctrine-cli.php orm:schema-tool:create

Then run the tests

bin/phpunit -c configuration.xml

Add params for CSV Checker console command

When reviewing the PR #26, AJ said this about the GreencheckCsvCheckerCommand:

could improve this with a InputArgument on the configure() above so you can define which file to import instead of hardcoding it. Might make the python script easier?

I'm not sure how to pass in args in php, but I agree it would be useful to have.

@arendjantetteroo I'm parking this here as a nice first issue for someone 👍

Archive this repo?

@mrchrisadams should we archive this repo? As the api stuff moved to the new django admin repo right?

On a side note, with the huge number of repositories, shouldn't we start on moving to a monorepo?

Adapt logger to update the green urls table as well as the logger.

We have a list of updated urls that we make available at the link below:

https://www.thegreenwebfoundation.org/green-web-datasets/

Sadly, the way we update this table sucks.

We end up doing a nasty query to update all the urls and the stored procedures, which never really worked that well.

What would be nicer would be to update the single domains table as part of the logger saving process, as we're making a bulk update anyway with them. This would give us an easy to export single table we could use as a cache, or read only version of the API:

https://github.com/thegreenwebfoundation/greencheck-api/blob/master/src/Greencheck/Logger.php#L95

The other alternative would be to compress the values in redis using something liek snappy or lzw. It would be slight hike in CPU usage, but we'd likely be able to store all the domains in memory.

https://docs.redislabs.com/latest/ri/memory-optimizations/

Investigate database migration options

Significant parts of the TGWF database rely on MySQL and MyISAM tables.

Investigate our options for switching away.

If we stay with MySQL

Possible drawbacks migrating to postgres

https://wiki.postgresql.org/wiki/How_to_make_a_proper_migration_from_MySQL_to_PostgreSQL
Possible changes to the php application code?
We need more disk space to execute the migration.

are the trade-offs, and possible pitfalls

If we consider other database options

like Postgres, (timescale db, etc)

Migration guide

Why this issue appeared and which needs to be resolved.

We tried creating a new table Hostingcommunication (innodb) that references hostingproviders (MyISAM). That foreign key constraint couldn't be created because the two tables were different.

The change from MyISAM to InnoDB can be done by altering the table to use InnoDB instead. See the following answer: https://stackoverflow.com/a/30648414/

Migrating from MyISAM to InnoDB requires us to be aware of the following issues described here: https://mariadb.com/kb/en/library/converting-tables-from-myisam-to-innodb/

Parts of the TGWF API app we would need to change

(AJ, I'm using the wrong terminology, but can you list the bits we'd need, and any open PRs for them? I'll look over them, comment and merge if appropriate)

logger
worker
main resource

Add public IP ranges for cloud giants - AWS

At the moment, we rely on Amazon being nice enough to update their green regions themselves.

This rarely happens, but they do expose their IP ranges for each region at the the url below

https://ip-ranges.amazonaws.com/ip-ranges.json

More info here:

https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

Amazon have different green and non green regions, so we might represent the green regions as separate green hosters, or as one huge host, with an absolutely massive set of IP ranges available.

https://aws.amazon.com/about-aws/sustainability/

{
  "syncToken": "1559746744",
  "createDate": "2019-06-05-14-59-04",
  "prefixes": [
    {
      "ip_prefix": "18.208.0.0/13",
      "region": "us-east-1",
      "service": "AMAZON"
    },
    {
      "ip_prefix": "52.95.245.0/24",
      "region": "us-east-1",
      "service": "AMAZON"
    },
    {
      "ip_prefix": "52.194.0.0/15",
      "region": "ap-northeast-1",
      "service": "AMAZON"
    }]
}

https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

AWS's own docs say these change a few times a week, so we'd likely need this running on a cronjob to stay accurate.

Set up deploy on merge into master with github actions

We currently deploy with ansible playbooks, but this is manual process.

It would be great to be able to trigger deploys from a github action, maybe with an action like this ansible one, that refers to playbooks we already have:

https://github.com/saubermacherag/ansible-playbook-docker-action

The other approach might be to trigger logging into the server, and running ansible from the machine itself.

https://github.com/Greening-Digital/greening-digital-ghost/blob/production/.github/workflows/cd.yml

Controlling currency of deploys

Github Actions doesn't have great concurrency control, and running two ansible playbooks together could be really messy.

It's worth looking at turnstyle for limiting the number of concurrent jobs:

https://github.com/softprops/turnstyle

Update data set to reflect new renewable energy guidance for AWS Regions

Following the release of the Amazon annual sustainability report the Amazon public website was updated and now details an expanded list of AWS Regions that were powered by over 95% renewable energy in 2021.

"
To achieve our goal of powering our operations with 100% renewable energy by 2025—five years ahead of our original 2030 target—Amazon contracts for renewable power from utility scale wind and solar projects that add clean energy to the grid. These new renewable projects support hundreds of jobs while providing hundreds of millions of dollars of investment in local communities. We also may choose to support these grids through the purchase of environmental attributes, like Renewable Energy Certificates and Guarantees of Origin, in line with our Renewable Energy Methodology.

As a result, in 2021, the following AWS Regions were powered by over 95% renewable energy:

US East (Northern Virginia)
GovCloud (US-East)
US East (Ohio)
US West (Oregon)
GovCloud (US-West)
US West (Northern California)
Canada (Central)
Europe (Ireland)
Europe (Frankfurt)
Europe (London)
Europe (Milan)
Europe (Paris)
Europe (Stockholm)
"

https://sustainability.aboutamazon.com/environment/the-cloud?energyType=true

Make the batch logger quicker by avoiding doctrine orm + allowing a table with concurrent inserts

The batch logger is now using doctrine and a flush every 50 records. This works but is rather slow and gets slower with each record to be processed by the logger. If we can avoid doctrine and just use sql inserts which we can batch we should be able to process way more messages with one worker.

At a later point we could see if we can make the table we write all these requests to allow multi writes so we can process with multiple workers. Currently we use one to avoid concurrency issues on the greencheck table

Remove http:// api and let it use https always

We currently provide the api on both http and https. Given the privacy implications and for caching practices we should move everything to https.

It's unclear which sites/extension still use the non secure one, i think our current extensions are all on the secure one by default.

Lets see if we can redirect to https with keeping the systems that use it still working.

Add docs on troubleshooting with DNS

We just spent some time investigating why some domain names weren't showing up, and after the investigation, it turned out it was down to the DNS servers in the datacentre we run the production servers in.

The DNS servers we lookup couldn't see specific domains, and the fix was to change the DNS servers we refer to.

We should document this.

Switch out cloud build in favour of travis

Travis is nice and easy to set up and much easier to make build status public.

Find way to support runaway memory usage with workers

We had an incident today where runaway memory usage with workers consuming from the queue with RabbitMQ would eat so much memory in production that it would free the whole box.

We have a few options to catch runaway memory usage to avoid this, but given that we're using supervisord to maintain a pool of workers it's worth looking at superlance, an extension to supervisord that tracks memory usage, to automatically catch process that are using too much memory.

You can see some more guidance here on setting it up an installing, but generally speaking, the approach is:

install with pip install superlance
add a stanza like the one below to the supervisor config file at /etc/supervisor/conf.d/enqueue_greencheck.conf

[eventlistener:memmon]
command=memmon -p <program_name>=3GB
events=TICK_60

We probably need to do this for a group rather than a single process, as we have pool of workers that we care about.

More here:

https://thepracticalsysadmin.com/quicktip-manage-memory-usage-with-supervisord/

https://github.com/corvus-ch/rabbitmq-cli-consumer

Add ci setup to run tests from #1

Right now we have a test suite, but we'd really want to be able to run them as part of a continuous integration pipeline.

CircleCI is probably the most popular service, and fast and free to use, but it's not clear where they run their servers (my guess probably a mix of AWS and Google, and maybe their own boxes if Travis is anything to go by)
Google Cloud Build runs on renewable power as it's using all of Google's infrastructure, and integrates easily enough with Github, so we're not spending ages doing undifferentiated stuff.

Steps needed

Decide provider (probably CircleCI, unless Google Cloud Build is also really easy to set up)
Get fixtures set up using appropriate data
Get tests running locally
Set up run on each new commit
Set up runnable option in the CI environment (we may need to faff around in docker, as they mostly use this now, and we have a at least three processes - MySQL, Redis and PHP jut for the green check library
Document setup up for using cloud build in the repo, and how to set it up on a local dev instance

Move phpstan into seperate build step (and include in MR review?)

we run one for mysql and one for postgres. However both do run the phpstan stuff. That might be a bit wastefull, something to optimise later?

Originally posted by @arendjantetteroo in #52

[greencheck] Update codestyle to PSR2/Symfony

Some parts still look like the php4 age with _ for properties.

add a pre-commit hook, maybe using pre-commit

Rename this project to represent its role

@arendjantetteroo the name thegreenwebfoundation / thegreenwebfoundation isn't all that helpful for new users, and is a bit confusing when managing issues too.

Given that this repo contains the greencheck API, and we have a separate admin repo (with an equally poor repo name right now…), how do you feel about renaming this to something more descriptive- maybe greencheck-api?

I'd like to rename the django based admin project something different than thegreenwebfoundation/greenwebfoundation-admin/, as that's not all that descriptive either, but I'm not sure yet.

Suggestions for both welcome…

Check why we see occasional timeouts with AMQP in the workers

We're seeing timeouts along the lines of when some domains are being checked:

PhpAmqpLib\Exception\AMQPTimeoutException: The connection timed out after 3 sec while awaiting incoming data

It doesn't seem to be tied to the domain. Maybe we need to check the max connections in rabbitmq?

Split website from api fpm pools

Currently all sites go down if for some reason our workers can't answer for greencheck checks as we saw this morning.

One solution to keep the other sites properly running is making different php fpm pools, so the api is not affecting the others.

Switch our rollbar for Sentry

We keep bumping against the free limit with Rollbar, so I suggest we switch to using Sentry.

It's open source, so we can host it ourselves, but I have a paid account, so we have a higher message limit.