analytics-reporter's Introduction


Analytics Reporter

A lightweight system for publishing analytics data from the Digital Analytics Program (DAP) Google Analytics 4 government-wide property. This project uses the Google Analytics Data API v1 to acquire analytics data and then processes it into a flat data structure.

The project previously used the Google Analytics Core Reporting API v3 and the Google Analytics Real Time API v3, also known as Universal Analytics, which has slightly different data points. See Upgrading from Universal Analytics for more details. The Google Analytics v3 API will be deprecated on July 1, 2024.

This is used in combination with analytics-reporter-api to power the government analytics website, analytics.usa.gov.

Available reports are named and described in api.json and usa.json. For now, they're hardcoded into the repository.

The process for adding features to this project is described in Development and deployment process.

Local development setup

Prerequisites

  • Node.js > v20.x
  • A PostgreSQL DB running and/or Docker installed

Install dependencies

npm install

Linting

This repo uses ESLint and Prettier for static code analysis and formatting. Run the linter with:

npm run lint

Automatically fix lint issues with:

npm run lint:fix

Install git hooks

There are some git hooks provided in the ./hooks directory to help with common development tasks. These will check out current NPM packages on branch change events and run the linter on pre-commit.

Install the provided hooks with the following command:

npm run install-git-hooks

Running the unit tests

The unit tests for this repo require a local PostgreSQL database. You can run a local DB server or create a docker container using the provided test compose file. (Requires docker and docker-compose to be installed)

Starting a docker test DB:

docker-compose -f docker-compose.test.yml up

Once you have a PostgreSQL DB running locally, you can run the tests. The test DB connection in knexfile.js has default connection config which can be overridden with environment variables. If you're using the provided docker-compose DB, you don't need to set any connection details.
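
If the defaults don't match your local server, you can override them before running the tests. A sketch, assuming the test connection honors the same POSTGRES_* variables described in "Saving data to postgres" below:

export POSTGRES_HOST="localhost"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="123abc"
export POSTGRES_DATABASE="analytics"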

Run the tests (pre-test hook runs DB migrations):

npm test

Running the unit tests with code coverage reporting

If you wish to see a code coverage report after running the tests, use the following command. This runs the DB migrations, tests, and the NYC code coverage tool:

npm run coverage

Running the integration tests

The integration tests for this repo require the Google Analytics credentials to be set in the environment. This can be set up with the dotenv-cli package as described in the "Setup environment" section below.

Note that these tests make real requests to the Google Analytics APIs and should be run sparingly to avoid being rate limited in our live apps, which use the same account credentials.

# Run cucumber integration tests
dotenv -e .env npm run cucumber

# Run cucumber integration tests with node debugging enabled
dotenv -e .env npm run cucumber:debug

The cucumber features and support files can be found in the features directory.

Running the application as an npm package

  • To run the utility on your computer, install it through npm:
npm install -g analytics-reporter

Running the application locally

To run the application locally with database reporting, you'll need a postgres database running on port 5432. There is a docker-compose file provided in the repo so that you can start an empty database with the command:

docker-compose up

Setup environment

See "Configuration and Google Analytics Setup" below for the required environment variables and other setup for Google Analytics auth.

It may be easiest to use the dotenv-cli package to configure the environment for the application.

Create a .env file using env.example as a template, with the correct credentials and other config values. This file is ignored in the .gitignore file and should not be checked in to the repository.
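
For example, a minimal .env might look like the following sketch (the values are placeholders; the variable names are described in "Configuration and Google Analytics Setup" below):

ANALYTICS_REPORT_EMAIL="your-service-account@your-project.iam.gserviceaccount.com"
ANALYTICS_REPORT_IDS="XXXXXX"
ANALYTICS_KEY_PATH="/path/to/secret_key.json"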

Run the application

# running the app with no config
npm start

# running the app with dotenv-cli
dotenv -e .env npm start

Configuration and Google Analytics Setup

  • Enable Google Analytics API for your project in the Google developer dashboard.

  • Create a service account for API access in the Google developer dashboard.

  • Go to the "KEYS" tab for your service account, create a new key using the "ADD KEY" button, and download the JSON private key file it gives you.

  • Grab the generated client email address (ends with gserviceaccount.com) from the contents of the .json file.

  • Grant that email address Read, Analyze & Collaborate permissions on the Google Analytics profile(s) whose data you wish to publish.

  • Set environment variables for analytics-reporter. It needs the email address of the service account and the view ID of the profile you authorized it to access:

export ANALYTICS_REPORT_EMAIL="[email protected]"
export ANALYTICS_REPORT_IDS="XXXXXX"

You may wish to manage these using autoenv. If you do, there is an example.env file you can copy to .env to get started.

To find your Google Analytics view ID:

  1. Sign in to your Analytics account.
  2. Select the Admin tab.
  3. Select an account from the dropdown in the ACCOUNT column.
  4. Select a property from the dropdown in the PROPERTY column.
  5. Select a view from the dropdown in the VIEW column.
  6. Click "View Settings"
  7. Copy the view ID. You'll need to enter it with ga: as a prefix.
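
For example:

export ANALYTICS_REPORT_IDS="ga:XXXXXX"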
  • You can specify your private key through environment variables either as a file path, or the contents of the key (helpful for Heroku and Heroku-like systems).

To specify a file path (useful in development or Linux server environments):

export ANALYTICS_KEY_PATH="/path/to/secret_key.json"

Alternatively, to specify the key directly (useful in a PaaS environment), paste in the contents of the JSON file's private_key field directly and exactly, in quotes, with actual line breaks rather than \n's (the example below has been sanitized):

export ANALYTICS_KEY="-----BEGIN PRIVATE KEY-----
[contents of key]
-----END PRIVATE KEY-----
"

If you have multiple accounts for a profile, you can set the ANALYTICS_CREDENTIALS variable with a JSON encoded array of those credentials and they'll be used to authorize API requests in a round-robin style.

export ANALYTICS_CREDENTIALS='[
  {
    "key": "-----BEGIN PRIVATE KEY-----\n[contents of key]\n-----END PRIVATE KEY-----",
    "email": "[email protected]"
  },
  {
    "key": "-----BEGIN PRIVATE KEY-----\n[contents of key]\n-----END PRIVATE KEY-----",
    "email": "[email protected]"
  }
]'
  • Make sure your computer or server is syncing its time with the world over NTP. Your computer's time will need to match the time on Google's servers for the authentication to work.
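
On most systemd-based Linux systems, for example, you can verify this with:

timedatectl status

and check that it reports "System clock synchronized: yes".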

  • Test your configuration by printing a report to STDOUT:

./bin/analytics --only users

If you see a nicely formatted JSON file, you are all set.

  • (Optional) Authorize yourself for S3 publishing.

If you plan to use this project's lightweight S3 publishing system, you'll need to add 6 more environment variables:

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=[your-key]
export AWS_SECRET_ACCESS_KEY=[your-secret-key]

export AWS_BUCKET=[your-bucket]
export AWS_BUCKET_PATH=[your-path]
export AWS_CACHE_TIME=0

There are cases where you may want to use a custom object storage server compatible with the Amazon S3 APIs, like Minio. In that case, you should set an extra environment variable:

export AWS_S3_ENDPOINT=http://your-storage-server:port

Other configuration

If you use a single domain for all of your analytics data, then your profile is likely set to return relative paths (e.g. /faq) and not absolute paths when accessing real-time reports.

You can set a default domain, to be returned as the domain value in all real-time data points:

export ANALYTICS_HOSTNAME=https://konklone.com

This will produce points similar to the following:

{
  "page": "/post/why-google-is-hurrying-the-web-to-kill-sha-1",
  "page_title": "Why Google is Hurrying the Web to Kill SHA-1",
  "active_visitors": "1",
  "domain": "https://konklone.com"
}

Use

Reports are created and published using npm start or ./bin/analytics:

# using npm scripts
npm start

# running the app directly
./bin/analytics

This will run every report, in sequence, and print out the resulting JSON to STDOUT.

A report might look something like this:

{
  "name": "devices",
  "frequency": "daily",
  "slim": true,
  "query": {
    "dimensions": [
      {
        "name": "date"
      },
      {
        "name": "deviceCategory"
      }
    ],
    "metrics": [
      {
        "name": "sessions"
      }
    ],
    "dateRanges": [
      {
        "startDate": "30daysAgo",
        "endDate": "yesterday"
      }
    ],
    "orderBys": [
      {
        "dimension": {
          "dimensionName": "date"
        },
        "desc": true
      }
    ]
  },
  "meta": {
    "name": "Devices",
    "description": "30 days of desktop/mobile/tablet visits for all sites."
  },
  "data": [
    {
      "date": "2023-12-25",
      "device": "mobile",
      "visits": "13681896"
    },
    {
      "date": "2023-12-25",
      "device": "desktop",
      "visits": "5775002"
    },
    {
      "date": "2023-12-25",
      "device": "tablet",
      "visits": "367039"
    },
   ...
  ],
  "totals": {
    "visits": 3584551745,
    "devices": {
      "mobile": 2012722956,
      "desktop": 1513968883,
      "tablet": 52313579,
      "smart tv": 5546327
    }
  },
  "taken_at": "2023-12-26T20:52:50.062Z"
}

Options

  • --output - write the report result to a provided directory. Report files will be named with the name in the report configuration.
./bin/analytics --output /path/to/data
  • --publish - Publish to an S3 bucket. Requires AWS environment variables set as described above.
./bin/analytics --publish
  • --write-to-database - write data to a database. Requires a postgres configuration to be set in environment variables as described below.

  • --only - only run one or more specific reports. Multiple reports are comma separated.

./bin/analytics --only devices
./bin/analytics --only devices,today
  • --slim - Where supported, use totals only (omit the data array). Only applies to JSON, and only to reports where "slim": true.
./bin/analytics --only devices --slim
  • --csv - Formats reports as CSV instead of the default JSON format.
./bin/analytics --csv
  • --frequency - Run only reports with 'frequency' value matching the provided value.
./bin/analytics --frequency=realtime
  • --debug - Print debug details on STDOUT.
./bin/analytics --publish --debug

Saving data to postgres

The analytics reporter can write the data it pulls from Google Analytics to a Postgres database. The Postgres configuration can be set using environment variables:

export POSTGRES_HOST="my.db.host.com"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="123abc"
export POSTGRES_DATABASE="analytics"

The database expects a particular schema, which is described in the API server that consumes and publishes this data.

To write reports to a database, use the --write-to-database option when starting the reporter.
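
For example, with the Postgres variables above set in your .env file:

dotenv -e .env ./bin/analytics --write-to-database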

Upgrading from Universal Analytics

Background

This project previously acquired data from Google Analytics V3, also known as Universal Analytics (UA).

Google is retiring UA and encouraging users to move to the new version, Google Analytics V4 (GA4). UA will be deprecated on July 1, 2024.

Migration details

Some data points have been removed or added by Google as part of the move to GA4.

Deprecated fields

  • browser_version
  • has_social_referral
  • exits
  • exit_page

New fields

bounce_rate

The percentage of sessions that were not engaged. GA4 defines an engaged session as one that lasts longer than 10 seconds or has multiple pageviews. For example, if 40 of 100 sessions were engaged, the bounce rate would be 60%.

file_name

The page path of a downloaded file.

language_code

The ISO 639 language setting of the user's device, e.g. 'en-us'.

session_default_channel_group

An enum which describes the session. Possible values:

'Direct', 'Organic Search', 'Paid Social', 'Organic Social', 'Email', 'Affiliates', 'Referral', 'Paid Search', 'Video', and 'Display'

Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.

analytics-reporter's People

Contributors

abitrolly, adborden, arctansusan, brendannee, dependabot-preview[bot], dependabot[bot], dorianjp, echappen, ericschles, gbinal, geramirez, jmcarp, jmhooper, juliawinn, konklone, laurenancona, lazysoundsystem, levinmr, mesele29, paolomainardi, ramirezg, rogeruiz, ryanhofdotgov, ryanwoldatwork, scottqueen-bixal, smarina04, stvnrlly, tdlowden, volodymyrrudyi, xaviermetichecchia


analytics-reporter's Issues

TODO

  • rearrange project structure to fit NodeJS best practices
  • load/cache GA data into database properly. Probably fix a line here
  • better documentation
  • create a way of loading data initially
  • create a way to loop through both specific and general queries when needed.

Combine the two commands into one

One command that either prints to STDOUT, writes to files, or publishes to S3. Defaults to all, can use --only to limit it to one report.

Currently, two commands are report and all-reports. Let's make it just reports.

Access to raw data

Great to see that this data is being aggregated. Is there any chance to get access to the raw data so that more complex reports can be created? I want to look into top pages by domain.

Rendering View Data

I am able to get well-formatted JSON output by running the command:
./bin/analytics

When I run the web server, I receive the error
Error: Failed to lookup view "index" in views directory "../"
    at Function.app.render (/home/ec2-user/apps/analytics-reporter/node_modules/express/lib/application.js:516:17)
    at ServerResponse.res.render (/home/ec2-user/apps/analytics-reporter/node_modules/express/lib/response.js:900:7)
    at /home/ec2-user/apps/analytics-reporter/app/routes.js:13:11
    at Layer.handle [as handle_request] (/home/ec2-user/apps/analytics-reporter/node_modules/express/lib/router/layer.js:82:5)
    at next (/home/ec2-user/apps/analytics-reporter/node_modules/express/lib/router/route.js:100:13)
    at Route.dispatch (/home/ec2-user/apps/analytics-reporter/node_modules/express/lib/router/route.js:81:3)
    at Layer.handle [as handle_request] (/home/ec2-user/apps/analytics-reporter/node_modules/express/lib/router/layer.js:82:5)
    at /home/ec2-user/apps/analytics-reporter/node_modules/express/lib/router/index.js:234:24
    at Function.proto.process_params (/home/ec2-user/apps/analytics-reporter/node_modules/express/lib/router/index.js:312:12)
    at /home/ec2-user/apps/analytics-reporter/node_modules/express/lib/router/index.js:228:12

Where are the views rendered?

Do you have any examples of views that I could test out?

Crontab or equivalent

Make a cron or cron-like system for running the report tasks as they need to be run. That could definitely be a small .js file that uses node-schedule, and is run with forever.
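
A minimal sketch of that approach, assuming node-schedule's cron-style API and shelling out to the existing ./bin/analytics executable (the schedules and file name are illustrative):

// scheduler.js - hypothetical sketch, kept alive with forever
const schedule = require("node-schedule");
const { execFile } = require("child_process");

function runReports(frequency) {
  // Shell out to the reporter with the documented --publish and --frequency flags.
  execFile("./bin/analytics", ["--publish", `--frequency=${frequency}`], (err, stdout, stderr) => {
    if (err) console.error(`${frequency} reports failed:`, stderr);
  });
}

// Every minute for realtime reports, once a day for daily reports.
schedule.scheduleJob("* * * * *", () => runReports("realtime"));
schedule.scheduleJob("0 10 * * *", () => runReports("daily"));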

Every-1-hour data for visits throughout the day

This will be visits accumulated by that time in the day, not "people onsite", and will need to be represented appropriately. But we can get the entire (up to) 1,440 minutes of the day for a given day through the Core Reporting API.

Datestamp all output

Date of generation needs to be included in report metadata, for every report.

See if we can isolate the IE technical preview

We have to whitelist individual browsers and versions, so to keep on top of IE we'll want to proactively whitelist "12", "13", "14", etc. I think the Technical Preview will be "TP", but it's not unlikely we'll find an example already in the analytics.

Deployment infrastructure: fabric and sendak

We should add a fabric deploy file, and a Sendak container for it. It doesn't have to be there for launch -- we can always ssh in and git pull to do updates for now. But it's necessary in the medium run.

Rename repo

It'd be nice to drop the -nodejs, and I also like that this project could be more generally usable than just a proxy (though it can do that too).

How about analytics-reporter? Other ideas?

gzip is now optional for S3 buckets backed by a CloudFront distribution

Amazon announced that CloudFront will now gzip responses at the CloudFront level:

[image: screenshot of Amazon's CloudFront gzip announcement]

It's ostensibly free of charge, though I guess there's an increase in cost in bytes transferred between CloudFront and the origin. At analytics.usa.gov, we have caching turned to 0 seconds to facilitate immediate deploys for a site which doesn't get a ton of traffic, so we may actually see this cost increase more than most users would. However, it should still be very small given the amount of traffic we get there.

The small S3 integration code we have should probably let gzipping be optional (and opt-in), to let users take advantage of this who wish to.
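
A sketch of what opt-in gzipping could look like, assuming the AWS SDK for JavaScript v2; the publish function and shouldGzip flag are illustrative, not the project's actual publisher code:

const AWS = require("aws-sdk");
const zlib = require("zlib");

function publish(bucket, key, json, shouldGzip) {
  const params = {
    Bucket: bucket,
    Key: key,
    // Compress the body only when gzipping is opted in.
    Body: shouldGzip ? zlib.gzipSync(json) : json,
    ContentType: "application/json",
  };
  // Content-Encoding must match the body, so set it conditionally too.
  if (shouldGzip) params.ContentEncoding = "gzip";
  return new AWS.S3().putObject(params).promise();
}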

add OAuth2 requirement - ntp sync

A key requirement, and a common source of 'invalid grant' errors from Google when it's missed, is making sure your clock is synced over NTP. It would be good to mention this as a prerequisite.

Transform Google format into friendlier common format

We should use a provider-agnostic format, and something we're fine with people downloading and starting to potentially write code against.

Gov.UK's schema is cleaner than Google's, but aesthetically, I'm not a fan of the underscore prefixes and colons.

The schema may vary for different metrics. For weekly visitors, how about:

{
  "name": "weekly_visitors",
  "site": "all",
  "description:": "Weekly visitors to all .gov sites tracked by the U.S. federal government's Digital Analytics Program.",
  "documentation": "http://www.usa.gov/performance",
  "data": [
    {
      "date": "2014-12-22",
      "visitors": 2340123
    },
    {
      "date": "2014-12-23",
      "visitors": 2045623
    }
  ],
  "totals": {
    "start_date": "2014-12-22",
    "end_date": "2014-12-23",
    "visitors": 4385746
  }
}

Thoughts?

Data publication URL

We're working this out inside OCSIT now - but this needs nameserver hostnames generated, and an IT ticket created.

Downloads report

Top downloads report, which contains the page title and the page the file was downloaded from, for:

  • realtime
  • 7 days
  • 30 days

Bulk, dated snapshots

We're currently overwriting JSON files in place every minute and every day.

We should also, as a byproduct, save time/date versioned JSON files so that anyone (ourselves included) can use the bulk data at any time to produce reports other than the ones we query Google for directly.
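
One possible shape for this, as a hypothetical sketch (the key layout is illustrative): keep overwriting the current file, and additionally write a copy under a dated key:

// e.g. datedKey("devices") -> "2014-12-22/devices.json"
function datedKey(reportName, date = new Date()) {
  const stamp = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `${stamp}/${reportName}.json`;
}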

clarify environment variables in installation instructions

Really awesome tool. Thanks so much for getting it out and available.

I would just suggest a tiny bit of clarification around the installation instructions. Specifically, I got stuck at two spots:

  • the path to the .pem file apparently needs to be absolute? (which the path you provide hints at; I just didn't see it).
  • I was unfamiliar with where to find ANALYTICS_REPORT_IDS, and originally had the ID of my account instead of my website.

Maybe something along the lines of:

Set the following environment variables:

export ANALYTICS_REPORT_EMAIL="[email protected]"
export ANALYTICS_REPORT_IDS="ga:XXXXXX"
export ANALYTICS_KEY_PATH="/path/to/secret_key.pem"

ANALYTICS_REPORT_EMAIL is the generated client email address from the API service account that you have granted read & analyze permissions to within Google Analytics for the View you want to extract data from.

ANALYTICS_REPORT_IDS is the Google Analytics ID for that View. You can find it in the Admin panel for your GA account under "View Settings". Prefix it with ga: to make it work.

ANALYTICS_KEY_PATH is the absolute path to the .pem file generated above.


Just a suggestion!

Also, you may consider outputting the full error message, when errors occur. That was useful for tracking down my problems.

Thanks again!

Deployable to Heroku

Go through the process of making this deployable on Heroku, so that it can be closer to deployable in any given PaaS environment.

config flag for custom domain prefix

Add config var to specify a custom domain prefix (e.g. phila.gov) to be appended to pagePath. This should at least help single-domain implementations (read: state/municipal gov) solve for broken links in Top 20 pages report. Then let any exceptions defined in analytics.js override prefix, if present.

Test script

Some sort of sanity check, that Travis will execute for us, seems merited.

Load config from environment variables instead

Instead of from a config.js. This will allow us to install the publisher as a command line tool without having to check it out from git, like npm install -g analytics-publisher, and set configuration values in the environment. And while I anticipate this will run on AWS somewhere, it does broaden the pool of possible deployment locations.

Update crontab for --head and --csv

Right now we're still only publishing full JSON data. We need to update the crontab to publish --head versions (suitable for live pull-down from a web client) and --csv versions (so that a CSV is available for download as well). We will still publish full JSON data, which people can download on-demand.
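
For example, crontab entries along these lines (paths and schedules are illustrative; --csv is documented above, and --head is the flag proposed here):

# illustrative crontab sketch
* * * * * cd /path/to/analytics-reporter && ./bin/analytics --frequency=realtime --publish
0 10 * * * cd /path/to/analytics-reporter && ./bin/analytics --csv --publish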

CSV versions of the data

A --csv flag that, once given the transformed JSON, turns it into a CSV version (and then publishes, writes, or prints it as it ordinarily would).
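
A minimal sketch of that transformation, assuming every row in a report's data array has the same keys (the function names are illustrative):

// Quote a value only when it contains a comma, quote, or newline.
function csvEscape(value) {
  const s = String(value == null ? "" : value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

function toCSV(report) {
  const headers = Object.keys(report.data[0]);
  const lines = report.data.map((row) => headers.map((h) => csvEscape(row[h])).join(","));
  return [headers.join(","), ...lines].join("\n");
}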

General architecture

From the readme I could not fully understand the architecture. My guess is:

  • each site installs the analytics-reporter
  • each site's analytics reporter pushes to the same s3 bucket
  • the portal (analytics.usa.gov) reads multiple entries from the s3 bucket

Is that correct? Anyway, could you describe that succinctly in the readme (of one of the two projects)? Or point me to the place where it is described, which I've missed.

Clean executable for processing a given report

I should be able to run:

report weekly-visitors

And have this print out the report JSON to STDOUT. I could then e.g. save the JSON to disk with:

report weekly-visitors > weekly-visitors.json

Publish to an S3 bucket

This could definitely be taken care of with straight unix tools, but there might be an argument for a --publish flag that uses the AWS Node SDK to do the upload to S3 from within this app. We'll figure it out when we get there, I guess.

Realtime count of visitors

Not sure what the API ramifications are, but we should make an attempt at getting real time data from the GA API.

Expand report.json to hold all desired reports

Make reports.json have report data for the list we're going for:

  • unique visitors over a week by date
  • sessions by device category over a week by date
  • sessions by device category over a week
  • top pages in the last week
  • top pages in the last 30 days
  • top pages in the last 90 days
  • breakdown of sources in the last week

I'm blurring this slightly - the first two metrics say "by date", but let's plan to do "the last 7 days" for now.

Dockerfile

This should be done in a Docker environment, so it's containerizable and deployable in a more flexible way.

Installable via npm

Publish to npm, and indicate a runnable command in the package.json, so that you can do npm install analytics-report [or something] and then run report [report-name] to generate/publish a given report.
