Home Page: https://sanam.live/SoftwareEnergyCost

License: Apache License 2.0

Google Summer of Code 2023


gsoc


Contributor : Manas Pratim Biswas

Description


Project Details

Estimate the energy efficiency and performance of the scientific software Baler, and attempt to identify where this efficiency can be improved.

Background: The Large Hadron Collider (LHC) experiments generate massive datasets composed of billions of proton-proton collisions. The analysis of this data requires high-throughput scientific computing that relies on efficient software algorithms. In today’s world, where the energy crisis and environmental issues are becoming more pressing concerns, it is crucial that we start taking action to develop sustainable software solutions. As scientific software is used more and more in high-throughput computing, there is a growing need to optimize its energy efficiency and reduce its carbon footprint.


Project Report

The Baler project is a collaboration between 12 research physicists, computer scientists, and machine learning experts at the universities of Lund, Manchester, and Uppsala. Baler is a tool that uses machine learning to derive a compression that is tailored to the user’s input data, achieving large data reduction and high fidelity where it matters. Read more about Baler

Throughout the summer, I have spent most of my time exploring profilers and learning about profiling small code snippets and software in general.

Initially, I profiled Baler with multiple profilers using varied techniques. Some profilers, like codecarbon, had to be used as wrappers or invoked through their APIs, while others, like cProfile, pyinstrument, and powermetrics, had to be run as standalone commands with optional flags directly from the terminal.
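To illustrate the wrapper/API style of profiling, here is a minimal sketch using only the standard-library cProfile (which, besides its CLI, exposes an in-process API); the `train_step` function is a hypothetical stand-in for a real workload:

```python
import cProfile
import io
import pstats

def train_step(n):
    # stand-in for a single training step (hypothetical workload)
    return sum(i * i for i in range(n))

# wrapper/API style: profile one specific call instead of the whole program
profiler = cProfile.Profile()
profiler.enable()
result = train_step(100_000)
profiler.disable()

# collect the statistics and print the top entries by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Wrapping only the call of interest keeps the profile focused, whereas the standalone-command style profiles the entire program run.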

πŸ’‘ Visualizing cProfile logs

How to profile Baler using cProfile?

Baler can be profiled with cProfile by adding the flag --cProfile while training

Example:

poetry run baler --project CFD_workspace CFD_project_animation --mode train --cProfile

The profile logs will be stored at: workspaces/CFD_workspace/CFD_project_animation/output/profiling/

Note: A Keyboard Interrupt is necessary to stop and exit from the SnakeViz server

cProfile profiles visualized using SnakeViz

call stack generated by SnakeViz from cProfile profiles
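Outside of Baler's `--cProfile` flag, the same pipeline can be reproduced manually for any Python program; a sketch (the script name is a placeholder):

```shell
# profile any Python entry point with cProfile and dump the stats to a file
# (my_script.py is a placeholder for the program being profiled)
python -m cProfile -o output.prof my_script.py

# visualize the dump in the browser; snakeviz starts a local server
# that must be stopped with a keyboard interrupt
snakeviz output.prof
```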

πŸ•“ The majority time is taken by the optimizer for performing the gradient descent

Directed Graphs (Digraphs):

  • A directed graph (digraph) is a graph that is made up of vertices (nodes) connected by directed edges.
  • In the context of profiling, a directed graph can represent the flow of program execution. Nodes may correspond to functions or code blocks, and edges indicate the direction of execution from one node to another.

Call Graphs:

  • A call graph is a type of directed graph that represents the relationships between functions in a program.
  • In the context of profiling, a call graph visualizes how functions call each other during program execution. Each node in the graph typically corresponds to a function, and directed edges show the flow of control from one function to another.

Usage:

  • Digraphs and call graphs are powerful tools for profiling and optimization, offering insights into the structure and behavior of the code.
  • Visualization aids in comprehending complex relationships between functions, making it easier to see the call order and the time taken by the various parts of the call chain.
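The caller-to-callee edges that make up such a call graph can be extracted directly from cProfile data, which is essentially what digraph generators do before emitting DOT; a minimal sketch with toy function names:

```python
import cProfile
import pstats

def leaf():
    return 1

def branch():
    return leaf() + leaf()

def root():
    return branch()

prof = cProfile.Profile()
prof.runcall(root)

# pstats keeps, for every function, the set of functions that called it;
# inverting that mapping yields the caller -> callee edges of the call graph
edges = set()
for func, (cc, nc, tt, ct, callers) in pstats.Stats(prof).stats.items():
    for caller in callers:
        edges.add((caller[2], func[2]))  # index 2 of the key tuple is the function name

print(sorted(edges))
```

Each edge here becomes a directed arrow in the rendered graph, typically annotated with call counts and cumulative times.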

This is the call graph generated, rooted at the perform_training() function, when Baler is trained for 2000 epochs on the CFD dataset.

πŸ•“ The majority time is taken by the optimizer for performing the gradient descent
πŸ•“ The Back Propagation takes more time than Forward Propagation

Hence, the results are in compliance with each other

πŸ’‘ Powermetrics with InfluxDB

Powermetrics is a built-in macOS tool that monitors CPU usage and determines how much CPU time and CPU power are being allocated to different quality-of-service (QoS) classes.

InfluxDB is a time-series database that can be used to visualize the real-time logs generated by powermetrics.

  1. Install InfluxDB
brew install influxdb

In case you don't have Homebrew, install it first from brew.sh

  2. Start the InfluxDB service
brew services start influxdb

InfluxDB runs on port 8086 by default. Assuming you have not changed its default configuration, you can check that everything is properly installed and running by accessing it on localhost at port 8086:

http://127.0.0.1:8086 or http://localhost:8086

If the InfluxDB welcome page loads, everything is set up properly. Steps to set up the credentials and bucket can be found in the official documentation
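This check can also be scripted rather than done in a browser; a minimal sketch using only the Python standard library (InfluxDB 2.x exposes a `/health` endpoint that reports `"pass"` when the server is healthy):

```python
import json
import urllib.error
import urllib.request

def influx_health(url="http://localhost:8086/health"):
    """Return InfluxDB's reported status (e.g. 'pass'), or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp).get("status")
    except (urllib.error.URLError, OSError, ValueError):
        return None

print("InfluxDB status:", influx_health() or "unreachable")
```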

  3. Script to save the powermetrics logs into InfluxDB

This script should be executed before running Baler. The script runs continuously in the background while Baler runs in one of its modes in a separate terminal.

from datetime import datetime
import subprocess
import sys

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

def powermetrics_profile():

    token = "<Your API Token>"
    org = "<Your Organization Name>"
    bucket = "<Your Bucket Name>"

    client = InfluxDBClient(url="http://localhost:8086", token=token, org=org)
    write_api = client.write_api(write_options=SYNCHRONOUS)

    # stream powermetrics output; matching lines look like "CPU Power: 470 mW"
    process = subprocess.Popen(
        "/usr/bin/powermetrics -i 300 --samplers cpu_power -a --hide-cpu-duty-cycle",
        shell=True,
        stdout=subprocess.PIPE,
        bufsize=3,
    )
    while True:
        out = process.stdout.readline().decode()
        if out == '' and process.poll() is not None:
            break
        if ' Power: ' in out:
            # e.g. "CPU Power: 470 mW" -> measurement "CPU", value "470 mW"
            measurement, value = out.split(' Power: ')
            point = Point(measurement) \
                .tag("host", "host1") \
                .field("power", int(value.replace('mW', ''))) \
                .time(datetime.utcnow(), WritePrecision.NS)

            write_api.write(bucket, org, point)
            sys.stdout.flush()


powermetrics_profile()

sudo access is needed to run the script. Assuming you have saved the script as influxdb.py, run it as below, depending on your environment.

sudo python3 influxdb.py

or

sudo poetry run python influxdb.py
  4. InfluxDB dashboard setup
  • Open the bucket in the InfluxDB UI
  • Create a new empty dashboard in the bucket
  • Add gauge widgets for CPU Power and GPU Power
  • Once the gauges are added, we need to define the data they display
  • Values stored in the time-series database are in mW. It is more intuitive to use watts, which makes it easier to compare the data with other machines online
  • Edit each gauge by clicking the Settings (cog) button and then Configure. This opens the Query window; the query is used to pull data from the time-series database
  5. Query to view the power consumption in real time

Note: Here the bucket is named baler

  • CPU Power Consumption
from(bucket: "baler")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "CPU")
  |> map(fn: (r) => ({ r with _value: float(v:r._value)/1000.00}))
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "last")
  • GPU Power Consumption
from(bucket: "baler")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "GPU")
  |> map(fn: (r) => ({ r with _value: float(v:r._value)/1000.00}))
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "last")

Save the queries in the respective gauges and reload the page. The gauges should now update their values in real time. The auto-refresh option can be set to Indefinite with a 10s interval.

  6. Sample outputs from the InfluxDB dashboard. All readings are in watts ($W$)

Baler training starts

Baler training ongoing

Baler training ends

Note: A Keyboard Interrupt is necessary to stop and exit from the influxdb.py script.

πŸ’‘ Visualizing codecarbon logs

How to profile Baler using codecarbon?

Baler can be profiled with codecarbon by adding the flag --energyProfile while training

Example:

poetry run baler --project CFD_workspace CFD_project_animation --mode train --energyProfile

The profile logs will be stored at: workspaces/CFD_workspace/CFD_project_animation/output/profiling/

Conventions and Units used by codecarbon

Setup to estimate energy using codecarbon

  • Training was done on the CFD Dataset with my Apple MacBook Air having the following specifications

  • A scaling factor of 1e6 or $10^6$ was used to generate the plots for 50 baler runs each with 1000 epochs with the following configuration:

Plots for Train

Plots for Compression

Plots for Decompression

Summarizing the results

| $Mode$ | $CO_{2}\ Emission$ ($CO_{2}$ $Eqv\ in\ Kg$) | $Energy\ Consumed$ ($kWh$) |
|---|---|---|
| Train | $6.25$ | $15.35$ |
| Compress | $0.024$ | $0.075$ |
| Decompress | $0.022$ | $0.063$ |

Note - The scaling factor was introduced simply because the numbers generated were small in magnitude and difficult to plot. Hence, each value was scaled up by a factor of $10^6$. So, apart from the time axis, if a particular value on any axis is read as $V_{plot}$ from the plot, it should be scaled down as: $$V_{actual} = V_{plot} \times 10^{-6}$$
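The conversion is trivial but easy to get backwards; a one-line sketch (the plotted reading used here is hypothetical):

```python
SCALE = 1e6  # the scaling factor applied before plotting

def unscale(v_plot):
    # recover the actual magnitude from a value read off the scaled plots
    return v_plot / SCALE

# e.g. a hypothetical plotted reading of 6.25 corresponds to about 6.25e-6 in real units
print(unscale(6.25))
```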

πŸ” List of all Tools used

  1. cProfile - provides deterministic profiling of Python programs. A profile is a set of statistics describing how often and for how long various parts of the program executed. It measures CPU time.
  2. pyinstrument - provides statistical profiling of Python programs. It does not track every function call the program makes; instead, it records the call stack every 1 ms and measures wall-clock time.
  3. experiment-impact-tracker - tracks energy usage, carbon emissions, and compute utilization of the system, currently on Linux systems with Intel chips (that support the RAPL or PowerGadget interfaces) and NVIDIA GPUs. It records power draw from CPU and GPU, hardware information, Python package versions, and estimated carbon emissions.
  4. scalene - a high-performance CPU, GPU, and memory profiler for Python that incorporates AI-powered optimization proposals.
  5. memory-profiler - a Python module for monitoring the memory consumption of a process, as well as line-by-line analysis of memory consumption of Python programs.
  6. memray - a memory profiler for Python. It can track memory allocations in Python code, in native extension modules, and in the Python interpreter itself, and can generate several different types of reports to analyze the captured memory usage data.
  7. codecarbon - a Python package that estimates the hardware electricity consumption (GPU + CPU + RAM) and applies to it the carbon intensity of the region where the computation is done. By default, a scheduler triggers a measurement every 15 seconds, and emissions are computed as $CO_2eq = C \times E$, where C is the carbon intensity of the electricity consumed (g of CO₂ emitted per kilowatt-hour) and E is the energy consumed by the computational infrastructure (kilowatt-hours).
  8. Eco2AI - a Python library for CO₂ emission tracking. It monitors the energy consumption of CPU and GPU devices and estimates the equivalent carbon emissions, taking the regional emission coefficient into account.
  9. powermetrics with InfluxDB - powermetrics gathers and displays CPU usage statistics (divided into time spent in user mode and supervisor mode), timer and interrupt wake-up frequency, interrupt frequencies per CPU, package C-state statistics (an indication of the time the core complex and integrated graphics, if any, spent in low-power idle states), and the average execution frequency of each CPU when not idle. It ships by default with macOS and can therefore be considered a standard tool. InfluxDB is a time-series database that can be used to visualize the logs generated by powermetrics.
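To make the codecarbon formula concrete, a worked example; the carbon intensity used here is an assumed global-average figure, not the region-specific value codecarbon applied to these runs, and the energy figure is the Train value from the results table above:

```python
# codecarbon's model: CO2eq [kg] = C [kg CO2 per kWh] * E [kWh]
C = 0.475   # assumed global-average carbon intensity (kg CO2 / kWh)
E = 15.35   # energy consumed during training, from the results table
co2_kg = C * E
print(f"{co2_kg:.3f} kg CO2eq")
```

With a region-specific intensity the result would differ; the table's Train emission of 6.25 kg implies an intensity of roughly 0.41 kg CO₂/kWh for the machine's actual region.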

Contributions

I have incorporated some of the profilers into the Baler codebase and made multiple commits spread across the following pull requests at baler-collaboration/baler, listed in reverse chronological order:

  • Integrate scalene, eco2ai and memray to profile Baler on the basis of a conditional flag (WIP)

  • Visualize cProfile logs and dumps (PR #331) 🟨

    • Implemented SnakeViz visualizations from cProfile .prof dumps through wrappers and subprocess
    • Implemented digraph generation using yelp-gprof2dot
    • Utilized graphviz to parse the dot files and render them into various formats like .svg, .png, and .pdf

  • Implement codecarbon plots (PR #330) 🟨

    • Dumped the codecarbon logs into a .CSV file
    • Subsequently parsed the .CSV file to generate plots

  • Added -m flag while training baler. Fixes Import Error (PR #286) πŸŸ₯

    • Minor bug fix. It was resolved by a previous PR from a fellow contributor before mine could get merged

  • MacOS Installation Issues (PR #280) πŸŸ₯

    • The instructions to install poetry, the dependency manager used in this project, were specific to a particular system, which caused confusion and ambiguity during installation. This was later fixed, and the poetry installation guide was changed in (Commit #c17e684)

Apart from this, most of my work and experiments can be found, unorganized, across the various branches of the repositories sanam2405/baler and sanam2405/SoftwareEnergyCost and inside the profiling folder of this repository.

References

[1] Baler - Machine Learning Based Compression of Scientific Data (LINK🔗)

[2] Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning (LINK🔗)

[3] Green Software Foundation (LINK🔗)

[4] Green Algorithms: Quantifying the Carbon Footprint of Computation (LINK🔗)


License

Copyright 2023 Baler-Collaboration. Distributed under the Apache License 2.0. See LICENSE for more information.


Summary

Participating in Google Summer of Code (GSoC) for the very first time was an exhilarating experience for me. I'm immensely grateful to my mentor, Caterina Doglioni, for this opportunity, and thankful for her invaluable guidance, feedback, and understanding during tough times throughout the project.

Special Thanks to Leonid Didukh (@neogyk) for providing immense support and help throughout the program and Anirban Mukerjee (@anirbanm1728) & Krishnaneel Dey (@Krishnaneel) for their valuable feedback on the proposal.

Beyond GSoC, I'm committed to contributing to the organization on an ongoing basis. Feel free to connect on LinkedIn for any suggestions and feedback! 😄
