
hpjansson / fornalder


Visualize long-term trends in collections of Git repositories.

License: GNU General Public License v3.0

Rust 100.00%
analysis community community-health developers git history metrics-visualization plots statistics

fornalder's People

Contributors

3v1n0, echolon, federicomenaquintero, figsoda, gasche, hpjansson, razzeee, romainreignier, wjt

fornalder's Issues

GNOME module list suggestions

Which modules are included in the definition of GNOME is one of the trickiest aspects of measuring activity in the project, and is something I've struggled with myself in the past. That said, looking at the list of modules, I'd suggest a few changes:

Core dependencies

The list includes a number of core dependencies which, while they are vital and important to GNOME, have a significant life beyond the project. This includes:

gstreamer
wayland
NetworkManager
cairo
pipewire
ModemManager

The concern here is that a project like gstreamer or wayland could skew the results.

My suggestion would be to remove these modules from the analysis, and possibly conduct a separate analysis for this set of modules.

There are a bunch of libraries which have a similar status, including:

grilo
lvfs
fwupd
flatpak

It might be good to include those in the "core dependencies" group.

Edit: actually, there's a huge list of additional core dependencies we could consider. Certainly udisks2 and upower should be in there. I also wonder about buildstream, cups, WebKit, meson...

Questionable apps

The module list includes a few apps which are somewhat questionable.

The first are f-spot and banshee. These were never 100% official, which is maybe fine if the goal is to have a somewhat fuzzy definition of GNOME based on the main areas of activity and interest in the community, as opposed to a strict product definition.

Of course, if we go down that road, then we might also ask whether other apps should be included, like shotwell, geary and polari. (shotwell is a bit tricky because it became an Elementary app at some point.)

One app that I would argue fairly strongly for removing is GIMP and its associated modules (gimp-web and gegl). While GIMP has a historical association with the GNOME project, it is fairly independent and has been for some time.

Missing modules

From a cursory inspection, these seem like obvious omissions which it would be good to include:

clutter
polari
gnome-builder
simple-scan
sushi
dconf-editor
gnome-online-accounts

@felipeborges @neilmcgovern

Distribution of commits/changes per author over time

Looking at the blog post, there's no mention of the distribution of commits/changes per author over time. It would be really interesting to know if the GNOME project has become more or less dependent on its core developers.

This could be charted as mean/standard deviation over time, or it might be interesting to do distribution charts for specific years.
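As a rough illustration of the mean/standard deviation idea, here is a minimal sketch that computes both per year from a list of (year, author) pairs, one pair per commit. The input shape, the function name and the sample data are assumptions made for the example; this is not fornalder's data model or output.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def commits_per_author_stats(commits):
    """Return {year: (mean, population std dev)} of commits per author.

    `commits` is an iterable of (year, author) pairs, one per commit.
    This input shape is hypothetical and chosen for illustration only.
    """
    per_year = defaultdict(Counter)
    for year, author in commits:
        per_year[year][author] += 1

    stats = {}
    for year in sorted(per_year):
        counts = list(per_year[year].values())
        stats[year] = (mean(counts), pstdev(counts))
    return stats


# Made-up sample: two authors in 2019, three in 2020.
sample = [
    (2019, "alice"), (2019, "alice"), (2019, "alice"), (2019, "bob"),
    (2020, "alice"), (2020, "bob"), (2020, "carol"),
]
stats = commits_per_author_stats(sample)
for year, (avg, spread) in stats.items():
    print(year, avg, spread)
```

A rising standard deviation relative to the mean would suggest that commits are concentrating in fewer authors. For heavily skewed distributions, a median or percentile chart may be easier to read than the mean.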

@felipeborges @neilmcgovern

dropping the "lines changed" measurement for performance reasons?

Thanks again for fornalder, it's a very interesting tool.

I was wondering if you had opinions about the idea of dropping the "lines changed" measurement from the tool, along with the related interface options. There are two different aspects that make me think this could be a good idea:

  1. It's not a very good metric: In practice I find that the "lines changed" count is extremely noisy and not useful for studying repositories over time. It gets worse when you aggregate different repositories of a community, which may have different practices. (One way to phrase this: I was never able to successfully interpret the results of --unit changes in an interesting way for the repositories I was looking at.)

  2. Performance: If we didn't have to compute this information at ingest time, then we could ask git to clone the repository without actually downloading the file contents (blobs), just the history. This can be done with partial clones; I think that git clone --filter=blob:none would work. This would massively decrease the cloning time needed to start the analysis, which takes a substantial part of the total analysis time.

Concrete example: for a long list of repositories I wanted to look at, cloning everything in parallel takes about 30 minutes, and ingesting also takes 30 minutes (presumably this could be parallelized). Doing blob:none partial clones completes in 3 minutes. (The difference could be arbitrarily large for repositories that require downloading large amounts of data.)

One could of course consider adding a --no-changes option to the tool that would not collect this information and be compatible with those partial clones, and possibly even record this fact in the database to later fail properly if users try to plot changes. But:

  • This could be invasive, while removing changes altogether makes the script simpler, which is good.
  • Git partial clones don't fail nicely if you ask for changes (or anything blob-related) after the fact: they try to re-download the missing objects on demand, and the performance is terrible (I've observed one request for each commit when running git log --stat). This means that if we fail to protect some blob-dependent part of the codebase, the user experience is going to be horrible in this case. I would find it easier if we could just globally enforce that only the history, not the patches, is accessed by the fornalder ingestion logic.
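To make the proposal concrete, here is a minimal sketch of the two git invocations involved: a blob-less partial clone, and a history-only log that avoids --stat, --numstat and -p so that no blobs need to be fetched. The helper names and the log format string are assumptions for illustration, as is the optional --bare flag (which skips the working-tree checkout); none of this is taken from fornalder's ingestion code.

```python
import subprocess


def partial_clone_command(url, dest, bare=False):
    """Build a `git clone` command that omits file contents (blobs)."""
    cmd = ["git", "clone", "--filter=blob:none"]
    if bare:
        # Assumption: a bare clone has no checkout, so the blobs of HEAD
        # are not fetched for a working tree either.
        cmd.append("--bare")
    return cmd + [url, dest]


def history_only_log_command(repo_dir):
    """Build a `git log` command that reads commit metadata only.

    The format prints hash, author date (ISO 8601) and author email,
    separated by tabs. There is deliberately no --stat/--numstat/-p.
    """
    return ["git", "-C", repo_dir, "log", "--format=%H%x09%aI%x09%ae"]


def run(cmd):
    """Run a command and return its standard output (raises on failure)."""
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout


if __name__ == "__main__":
    # Printing instead of running keeps the sketch free of network access.
    print(" ".join(partial_clone_command("https://github.com/scummvm/scummvm", "scummvm")))
    print(" ".join(history_only_log_command("scummvm")))
```

Keeping every git call behind one helper like history_only_log_command would give a single place to enforce the "history only" rule described above.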

error: Gnuplot reported error

Hi,

After reading your blog post I wanted to try this for myself. I get the following error on the scummvm repository: https://github.com/scummvm/scummvm.

error: Gnuplot reported error

Here is my full log. I'm using fornalder commit: 95f40e4

roland@MiX ~/wb8> git clone https://github.com/scummvm/scummvm
Cloning into 'scummvm'...
remote: Enumerating objects: 1012387, done.
remote: Counting objects: 100% (3359/3359), done.
remote: Compressing objects: 100% (1321/1321), done.
remote: Total 1012387 (delta 2183), reused 2985 (delta 2030), pack-reused 1009028
Receiving objects: 100% (1012387/1012387), 723.47 MiB | 3.33 MiB/s, done.
Resolving deltas: 100% (831372/831372), done.
Updating files: 100% (15528/15528), done.
roland@MiX ~/wb8> fornalder/target/debug/fornalder ingest db.sqlite scummvm
scummvm: 2021-08 (124143 commits)
roland@MiX ~/wb8 [1]> fornalder/target/debug/fornalder plot db.sqlite graph.png
error: Gnuplot reported error
roland@MiX ~/wb8 [1]> gnuplot -V
gnuplot 5.4 patchlevel 1

Missing GNOME modules

Taking a quick look at https://apps.gnome.org/, one can find a bunch of missing core apps.

Then on developer tools, there are:

One has to wonder whether these should be included, but dconf-editor is included.

Rust Bindings: Should these be included? I mean vala and gjs are included. There are a bunch of bindings at https://gitlab.gnome.org/World/Rust besides what can be seen at https://github.com/gtk-rs.

Python bindings at https://gitlab.gnome.org/GNOME/pygobject are also missing.

a workflow to remove a repository after ingestion?

It has happened to me several times now that I ingest a large set of repositories, I look at the data, and I notice oddities caused by a repository that should not have been there in the first place.

Is there a workflow to remove a repository from the database, and rerun the plotting?

Currently I don't know of such a workflow, so I manually remove the repository, delete the database, and restart ingestion from scratch. This is ok, but it can be annoying when ingestion is slow (several minutes on large repository sets).

I thought about running sqlite on the database and doing a DELETE operation on all raw_commits coming from this directory. However, if I understand correctly, the plotting data comes from the authors table that I would need to update with new aggregates, and I don't know how to do it easily.

Assuming this does not currently exist, my proposal would be to have a command fornalder reanalyze foo.db that would drop the current authors table and recompute it from the raw_commits table as it currently exists.

(Another option of course would be to have a fornalder repo-remove foo.db repo.git command that removes a repository from a table, instead of adding it as fornalder ingest foo.db repo.git does. But that sounds like more work.)
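The delete-then-recompute idea could look roughly like the sketch below. The column names (repo_name, author_name, author_year) and the shape of the authors aggregate are hypothetical, invented for this example; only the table names raw_commits and authors come from the description above, and the real fornalder schema may differ substantially.

```python
import sqlite3


def remove_repo_and_reaggregate(conn, repo_name):
    """Delete one repository's commits, then rebuild the aggregate table.

    Assumes a hypothetical schema:
      raw_commits(repo_name TEXT, author_name TEXT, author_year INTEGER)
    """
    with conn:  # commit on success, roll back on error
        conn.execute(
            "DELETE FROM raw_commits WHERE repo_name = ?", (repo_name,)
        )
        conn.execute("DROP TABLE IF EXISTS authors")
        conn.execute(
            "CREATE TABLE authors AS "
            "SELECT author_name, "
            "       MIN(author_year) AS first_year, "
            "       MAX(author_year) AS last_year, "
            "       COUNT(*) AS n_commits "
            "FROM raw_commits GROUP BY author_name"
        )


# Toy in-memory database standing in for foo.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_commits "
    "(repo_name TEXT, author_name TEXT, author_year INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_commits VALUES (?, ?, ?)",
    [
        ("good.git", "alice", 2018),
        ("good.git", "alice", 2020),
        ("unwanted.git", "alice", 2015),
        ("unwanted.git", "bob", 2016),
    ],
)

remove_repo_and_reaggregate(conn, "unwanted.git")
rows = conn.execute(
    "SELECT author_name, first_year, last_year, n_commits "
    "FROM authors ORDER BY author_name"
).fetchall()
print(rows)
```

Note how removing the repository changes alice's first year from 2015 to 2018 in this toy data, which is why the aggregates have to be recomputed rather than patched.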

Help: I don't get the steps

Hello,

unfortunately I am quite inexperienced with getting those repo files running. And I don't exactly understand the instructions.

  • I cloned and built fornalder successfully
  • I cloned the repo of interest to my repository folder
    -- /repositories/fornalder
    -- /repositories/monal

How exactly do I run these commands now?

$ target/debug/fornalder --meta projects/project-meta.json \
                         ingest db.sqlite repo-1.git repo-2.git ...

Sorry, I am a bit limited with my capabilities here... :(

Readme missing out-path

Hey there, it seems like the readme is missing the out-path that seems mandatory :)

specifically in this command

target/debug/fornalder --meta project-meta.json \
                         plot db.sqlite \
                         --cohort firstyear \
                         --interval year \
                         --unit authors

Filter out translation and documentation commits

It's not that translation or documentation commits are unimportant (of course they are important); it's just that they are a different type of activity which deserves its own analysis. In particular, including translation commits in the general analysis makes the results hard to analyze, since they can make up a high proportion of the overall number of commits. It's helpful to be able to look at the results and charts and know that they reflect only code changes.
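One possible way to separate these activity types is to classify each commit by the paths it touches. The sketch below is only a heuristic under stated assumptions: the suffix and directory lists are guesses at common conventions (gettext .po files, a help/ or docs/ directory), and the function names are hypothetical. It is not something fornalder currently does.

```python
from pathlib import PurePosixPath

# Hypothetical heuristics; real projects would need their own lists.
TRANSLATION_SUFFIXES = {".po", ".pot"}
DOC_SUFFIXES = {".md", ".rst", ".page"}
DOC_DIRS = {"doc", "docs", "help"}


def classify_path(path):
    """Classify one changed file as translation, documentation or code."""
    p = PurePosixPath(path)
    if p.suffix in TRANSLATION_SUFFIXES or p.name == "LINGUAS":
        return "translation"
    if p.suffix in DOC_SUFFIXES or DOC_DIRS & set(p.parts[:-1]):
        return "documentation"
    return "code"


def classify_commit(paths):
    """Classify a commit from the list of paths it changed.

    A commit counts as code as soon as it touches any code file;
    commits with no paths at all also fall through to "code".
    """
    kinds = {classify_path(p) for p in paths}
    if kinds == {"translation"}:
        return "translation"
    if kinds and kinds <= {"translation", "documentation"}:
        return "documentation"
    return "code"


print(classify_commit(["po/sv.po"]))
print(classify_commit(["help/C/index.page", "po/sv.po"]))
print(classify_commit(["src/main.rs", "po/sv.po"]))
```

Treating mixed commits as code is a deliberate choice here: it keeps the "code only" chart from silently losing commits that also update translations.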

@felipeborges @neilmcgovern
