
Calculating your Eddington Number for Cycling 🚲📈

Home Page: https://eddington.bike

License: GNU General Public License v3.0


eddington2

A collection of code for calculating your Eddington number for cycling. This will also calculate an author's h-index, which is mathematically equivalent but applied to a different field of interest. Both of these numbers are specific applications of computing the side length of a Durfee square.

This repo is a spin-off of a different project where I had occasion to write a PHP script for calculating my Eddington number. From there, I have treated it as a sort of programming chrestomathy project à la Rosetta Code.

If you're here because you just need a quick way to calculate your Eddington number, check out the online calculator at https://eddington.bike.

Languages

Implementations are available in many languages, including R, C++, C, PHP, Python, Fortran, Awk, SAS, APL, Julia, VBA, Java, Scala, Perl, and Rust. More to come. Please feel free to contribute!

The algorithms

The algorithms used throughout this repo fall into two general categories. The first, which I creatively call Algorithm A, relies on sorting the data. Although this is fast enough for most use cases, there are rare situations where the sorting operation can be too computationally expensive. It is also not well suited to calculating a cumulative Eddington number, which would require a separate sorting operation for every new element of the dataset. That is trivial for small datasets but can have a huge impact on larger ones.

Therefore, I have developed a second algorithm (named Algorithm B) which does not require sorting the data first. The algorithm passes through the data only once, maintaining a tally of ride lengths and a running value of E, and computes the summary value in linear time. Since the running E is available at every step of the iteration, it is also an efficient way of computing the cumulative E.

The two algorithms, A (slow) and B (fast), are shown below in Python.

A (slow)

def E_num(rides):
    # Algorithm A: sort the rides in descending order, then find the largest
    # rank E whose ride is at least E long
    if not rides:
        return 0

    for E, ride in enumerate(sorted(rides, reverse=True), 1):
        if ride < E:
            E -= 1
            break

    return E

B (fast)

def E_num(rides):
    # Algorithm B: a single pass, no sorting.
    #   E     -- the running Eddington number
    #   above -- count of rides seen so far that are longer than E
    #   H     -- histogram of ride lengths (E can never exceed n, so lengths
    #            of n or more never need to be tallied)
    n, E, above = len(rides), 0, 0
    H = [0]*(n + 1)

    for r in rides:
        ride = int(r)
        if ride > E:
            above += 1
            if ride < n:
                H[ride] += 1
            if above > E:
                E += 1
                above -= H[E]  # rides of length exactly E no longer exceed E

    return E
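
As a quick sanity check, both functions can be run against a small list of hypothetical ride distances (the values below are made up purely for illustration); the two implementations should always agree.

# Hypothetical ride distances, chosen only for illustration
rides = [12, 5, 31, 22, 18, 3, 45, 27, 9, 14]

print(E_num(rides))  # prints 8 with either implementation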

Time and space complexity

Algorithm B is generally faster than the conventional approach, which uses a comparison sort (as illustrated in Algorithm A), even when used just for summary statistics. Whereas we would expect Algorithm A to complete in O(n log n) asymptotic time, Algorithm B has Θ(n) time complexity and uses Θ(n) auxiliary space. This is accomplished through exploiting three constraints peculiar to the metric:

  1. Inputs are integers or can be truncated to be so.
  2. A tally (histogram) of past data suffices to track the necessary state information. Elements of the tally can be accessed in constant time when needed.
  3. We know in advance the required auxiliary space is at most n.

Improvement over sorting

The next obvious improvement over comparison sort would be an integer sorting algorithm such as counting sort. In fact, Algorithm B is very comparable to, and a little faster than, counting sort; both use a tally. Compared with the expected O(n log n) time complexity of most comparison sorts and the Θ(n) of Algorithm B, counting sort falls in between at O(n + k) asymptotic time complexity, where k is the range of the input vector. In real-world cycling data, k is small enough to be negligible, but there are caveats: in pathological cases a large k can be problematic, and counting sort tends to iterate through the data several times, so it can be slower by a constant factor for any given n.
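
For comparison, a minimal sketch of the counting-sort-style approach described above might look like the following. The function name E_num_counting and the decision to cap the tally at n (valid because E can never exceed the number of rides) are my own choices for illustration, not code from this repo.

def E_num_counting(rides):
    # Tally rides by (truncated) length; lengths beyond n are capped at n,
    # since E can never exceed the number of rides
    n = len(rides)
    H = [0] * (n + 1)
    for r in rides:
        H[min(int(r), n)] += 1

    # Walk the tally from the longest candidate downward, accumulating the
    # number of rides that are at least each candidate value of E
    above = 0
    for E in range(n, 0, -1):
        above += H[E]
        if above >= E:
            return E
    return 0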

For summary statistics, a counting sort algorithm is practically equivalent to Algorithm B with respect to speed and efficiency. But that is not the whole story: Algorithm B is remarkable because of how it handles cumulative statistics. Neither out-of-the-box comparison sort nor counting sort facilitates efficient computation of the cumulative statistic, since the effort of previous sorts is not conserved as we iterate through the growing vector of data. As a result, the total time complexity grows to the sum of O(i + k) over i = 1..n for counting sort, which reduces to approximately O(n^2) if we ignore k. Using comparison sort would of course be much worse. Meanwhile, Algorithm B remains Θ(n); this is where the true value of Algorithm B lies.
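
The cumulative case can be sketched by having Algorithm B yield the running E after every ride instead of returning only the final value. The generator below is an illustrative variant, not the reference implementation used elsewhere in this repo.

def E_cumulative(rides):
    # Identical state to Algorithm B, but the running E is yielded per ride
    n, E, above = len(rides), 0, 0
    H = [0] * (n + 1)

    for r in rides:
        ride = int(r)
        if ride > E:
            above += 1
            if ride < n:
                H[ride] += 1
            if above > E:
                E += 1
                above -= H[E]
        yield E

# list(E_cumulative([12, 5, 31, 22, 18, 3, 45, 27, 9, 14]))
# => [1, 2, 3, 4, 5, 5, 5, 6, 7, 8]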

Disclaimer

When speaking of sorting algorithms, I am speaking in generalities; many sorting algorithms are exceptions to certain rules. All things considered, though, the overall thrust is that Algorithm B is still an improvement over the other approaches because it uses specific information about the output metric to eliminate unnecessary computation.

Variations and applications

Although Algorithm B is shown above with a simple Python list serving as the histogram, other data structures were considered. Either a binary tree or a hashmap could be used as a substitute.

If space complexity is an issue, they may be preferable. However, in my experience the overhead of implementing the algorithm with a more complex data structure results in a slower overall runtime. Meanwhile, space concerns are not relevant for most "realistic" datasets.

There are still other reasons a binary tree or hashmap may be considered. Either data structure facilitates turning this into an online algorithm, meaning inputs can be streamed in without prior knowledge of the data length. Although it is unlikely that an online algorithm will be necessary for a cyclist's Eddington number or an author's h-index, it may be useful for computing the side length of a Durfee square for partitions of large integers, or in some as-yet-undiscovered application.
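
As a sketch of such an online variant, the fixed-size list can be replaced with a dictionary (a hashmap), so values can be streamed in without knowing the length of the data up front. The name E_online and the details here are only an illustration of the idea.

from collections import defaultdict

def E_online(stream):
    # Histogram keyed by ride length; no prior knowledge of the data length
    H = defaultdict(int)
    E = above = 0

    for r in stream:
        ride = int(r)
        if ride > E:
            above += 1
            H[ride] += 1
            if above > E:
                E += 1
                above -= H.pop(E, 0)  # the bucket for the new E is never needed again
        yield E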

Using binary trees or hashmaps also presents new opportunities for OOP with Eddington numbers. The classes defined in the Python and C++ folders illustrate this. One benefit is that a class can be instantiated with an initial dataset and then updated with more data at a later time.
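
As an illustrative sketch only (the actual classes live in the Python and C++ folders and may differ), such a class might hold the running state and expose an update method:

class Eddington:
    """Illustrative incremental tracker; not the class shipped in this repo."""

    def __init__(self, rides=None):
        self._H = {}        # histogram of ride lengths above the current E
        self._E = 0         # running Eddington number
        self._above = 0     # rides seen so far that are longer than E
        if rides:
            self.update(rides)

    def update(self, rides):
        for r in rides:
            ride = int(r)
            if ride > self._E:
                self._above += 1
                self._H[ride] = self._H.get(ride, 0) + 1
                if self._above > self._E:
                    self._E += 1
                    self._above -= self._H.pop(self._E, 0)

    @property
    def current(self):
        return self._E

# e = Eddington([12, 5, 31])   # instantiate with an initial dataset
# e.update([22, 18])           # add more data later
# e.current                    # => 5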

It is also possible to design a database schema so that row insertions can compute the new Eddington number in constant time. At the expense of an extra table and a few columns, the E number for new entries can be computed from a single row that captures the algorithm state at the previous iteration, rather than from the rider's entire history for each new row. This has the potential to benefit fitness-tracking software backends, where scale is relevant.
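
A hypothetical sketch of that idea, written in Python rather than SQL for brevity: the imagined state row holds E, the count of rides above E, and a serialized histogram, and each insertion updates it in constant time.

import json

def insert_ride(state_row, ride):
    # state_row mimics the columns such a table might hold:
    # {"E": int, "above": int, "H": JSON-encoded histogram}
    E, above = state_row["E"], state_row["above"]
    H = json.loads(state_row["H"])

    ride = int(ride)
    if ride > E:
        above += 1
        H[str(ride)] = H.get(str(ride), 0) + 1
        if above > E:
            E += 1
            above -= H.pop(str(E), 0)

    return {"E": E, "above": above, "H": json.dumps(H)}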

Deprecation

The entries in this repo are a mix of Algorithms A and B. If both A and B are present for a given language, I append -deprecated to the file name containing Algorithm A.

eddington2's Issues

Why R >=4.3?

The minimum requirement of R >= 4.3.0 is unusual. Why is it necessary?

R6 Eddington class cannot be saved without data loss.

The hash map of ride history is just a pointer (Rcpp::XPtr) to a std::unordered_map. So if a user attempts to save an Eddington object using save() or saveRDS(), they will be disappointed when reloading it, since the pointer will be nil and the data will be lost.

This means that, in the current state, several functions will not work as intended when attempting to manipulate the same object between sessions. Some will fail silently (like the update() method) and others will raise an error (like attempting to access the H active field).

Clone method in R6 Eddington class causes undesired behavior

Using the clone() or clone(deep = TRUE) method results in the clone's internal hashmap pointing to the same memory address as the original's. Therefore, changes made to any one of these objects propagate to the others.

# Random data
rides <- rgamma(15, shape = 2, scale = 10)

# Create the Eddington object and inspect its hashmap
e <- Eddington$new(rides)
e$H
##   ride_length ride_count
## 1          11          1
## 2          12          2
## 3          13          1
## 4          16          2
## 5          19          1
## 6          36          1
## 7          37          1
## 8          47          1

# Make shallow and deep clones
foo <- e$clone()
bar <- e$clone(TRUE)

# See that the pointer for each hashmap is the same
e$.__enclos_env__$private$.H
## <pointer: 0x0123456789ab>
foo$.__enclos_env__$private$.H
## <pointer: 0x0123456789ab>
bar$.__enclos_env__$private$.H
## <pointer: 0x0123456789ab>

# Update data in 'e' ONLY
e$update(rep(50,50))

# Hashmap has been updated for clones as well
e$H
## [1] ride_length ride_count 
## <0 rows> (or 0-length row.names)
foo$H
## [1] ride_length ride_count 
## <0 rows> (or 0-length row.names)
bar$H
## [1] ride_length ride_count 
## <0 rows> (or 0-length row.names)
