hpjansson / fornalder
Visualize long-term trends in collections of Git repositories.
License: GNU General Public License v3.0
Which modules are included in the definition of GNOME is one of the trickiest aspects of measuring activity in the project, and is something I've struggled with myself in the past. That said, looking at the list of modules, I'd suggest a few changes:
The list includes a number of core dependencies which, while they are vital and important to GNOME, have a significant life beyond the project. This includes:
gstreamer
wayland
NetworkManager
cairo
pipewire
ModemManager
The concern here is that a project like gstreamer or wayland could skew the results.
My suggestion would be to remove these modules from the analysis, and possibly conduct a separate analysis for this set of modules.
There are a bunch of libraries which have a similar status, including:
grilo
lvfs
fwupd
flatpak
It might be good to include those in the "core dependencies" group.
Edit: actually, there's a huge list of additional core dependencies we could consider. Certainly udisks2 and upower should be in there. I also wonder about buildstream, cups, WebKit, meson...
The module list includes a few apps which are somewhat questionable.
The first are f-spot and banshee. These were never 100% official, which is maybe fine if the goal is to have a somewhat fuzzy definition of GNOME based on the main areas of activity and interest in the community as opposed to a strict product definition.
Of course, if we go down that road, then we might also ask whether other apps should be included, like shotwell, geary and polari. (shotwell is a bit tricky because it became an Elementary app at some point.)
One app that I would argue fairly strongly to remove is GIMP and its associated modules (gimp-web and gegl). While GIMP has a historical association with the GNOME project, it is fairly independent and has been for some time.
From a cursory inspection, these seem like obvious omissions which it would be good to include:
clutter
polari
gnome-builder
simple-scan
sushi
dconf-editor
gnome-online-accounts
Looking at the blog post, there's no mention of the distribution of commits/changes per author over time. It would be really interesting to know if the GNOME project has become more or less dependent on its core developers.
This could be charted as mean/standard deviation over time, or it might be interesting to do distribution charts for specific years.
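As a rough illustration of the mean/standard deviation idea, here is a minimal Python sketch. It assumes the commit history has already been reduced to (year, author) pairs, one per commit; the function name and the sample data are hypothetical and not part of fornalder.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def commits_per_author_stats(commits):
    """Return {year: (mean, stddev)} of commits per active author.

    `commits` is an iterable of (year, author) pairs, one per commit.
    This input shape is an assumption made for the sketch.
    """
    per_year = defaultdict(Counter)
    for year, author in commits:
        per_year[year][author] += 1

    stats = {}
    for year in sorted(per_year):
        counts = list(per_year[year].values())
        # Population standard deviation: every author active in that
        # year is counted, so this is not a sample estimate.
        stats[year] = (mean(counts), pstdev(counts))
    return stats


# Hypothetical sample: alice dominates 2019, 2020 is evenly split.
sample = [
    (2019, "alice"), (2019, "alice"), (2019, "alice"), (2019, "bob"),
    (2020, "alice"), (2020, "bob"),
]
stats = commits_per_author_stats(sample)
```

A standard deviation that grows relative to the mean over the years would suggest that activity is concentrating in fewer authors; a per-year histogram of the same counts would give the distribution charts mentioned above.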
Thanks again for fornalder, it's a very interesting tool.
I was wondering if you had opinions about the idea of dropping the "lines changed" measurement from the tool, and related interface options. There are two different aspects that make me think this could be a good idea:
It's not a very good metric: In practice I find that the "lines changed" count is extremely noisy and not useful for studying repositories over time. It gets worse when you aggregate different repositories of a community that may have different practices. (One way to phrase this: I was never able to successfully interpret the results of --unit changes in an interesting way for the repositories I was looking at.)
Performance: If we didn't have to compute this information at ingest time, then we could ask git to clone the repository without actually downloading the file contents (blobs), just the history. This can be done with partial clones; I think that git clone --filter=blob:none would work. This would massively decrease the cloning time needed to start the analysis, which takes a substantial part of the total analysis time.
Concrete example: for a long list of repositories I wanted to look at, cloning everything in parallel takes about 30 minutes, and ingesting also takes 30 minutes (presumably this could be parallelized). Doing blob:none partial clones completes in 3 minutes. (The difference could be arbitrarily large for repositories that require downloading large amounts of data.)
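For reference, a blobless partial clone could be scripted along these lines. This is only a sketch: the helper names are hypothetical, and the optional --no-checkout flag is an addition not mentioned above (without it, git still fetches the blobs needed to populate the working tree for the checked-out branch).

```python
import subprocess


def partial_clone_command(url, dest, checkout=True):
    """Build the argv for a blobless partial clone of `url` into `dest`."""
    cmd = ["git", "clone", "--filter=blob:none"]
    if not checkout:
        # Skip the working tree entirely, so no blobs are fetched at all.
        cmd.append("--no-checkout")
    cmd += [url, dest]
    return cmd


def clone_all(repos, checkout=True):
    """Clone each (url, dest) pair in turn; raises if git reports an error."""
    for url, dest in repos:
        subprocess.run(partial_clone_command(url, dest, checkout), check=True)
```

Any later operation that needs file contents, such as computing per-commit diff statistics, would make git fetch the missing blobs on demand, which is why this only helps if ingestion never looks at the patches.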
One could of course consider adding a --no-changes option to the tool that would not collect this information and would be compatible with those partial clones, and possibly even record this fact in the database to later fail properly if users try to plot changes. But:
git log --stat
). This means that if we fail to protect some blob-depending parts of the script in the codebase, the user experience is going to be horrible in this case. I would find it easier if we could just globally enforce that only the history, not the patches, are accessed by the fornalder ingestion logic.
Hi,
After reading your blog post I wanted to try this for myself. I get the following error on the scummvm repository: https://github.com/scummvm/scummvm.
error: Gnuplot reported error
Here is my full log. I'm using fornalder commit: 95f40e4
roland@MiX ~/wb8> git clone https://github.com/scummvm/scummvm
Cloning into 'scummvm'...
remote: Enumerating objects: 1012387, done.
remote: Counting objects: 100% (3359/3359), done.
remote: Compressing objects: 100% (1321/1321), done.
remote: Total 1012387 (delta 2183), reused 2985 (delta 2030), pack-reused 1009028
Receiving objects: 100% (1012387/1012387), 723.47 MiB | 3.33 MiB/s, done.
Resolving deltas: 100% (831372/831372), done.
Updating files: 100% (15528/15528), done.
roland@MiX ~/wb8> fornalder/target/debug/fornalder ingest db.sqlite scummvm
scummvm: 2021-08 (124143 commits)
roland@MiX ~/wb8 [1]> fornalder/target/debug/fornalder plot db.sqlite graph.png
error: Gnuplot reported error
roland@MiX ~/wb8 [1]> gnuplot -V
gnuplot 5.4 patchlevel 1
Giving a quick look at https://apps.gnome.org/ one can find a bunch of missing core apps
Then on developer tools, there are:
One has to wonder whether these should be included, but dconf-editor is included.
Rust Bindings: Should these be included? I mean vala and gjs are included. There are a bunch of bindings at https://gitlab.gnome.org/World/Rust besides what can be seen at https://github.com/gtk-rs.
Python bindings at https://gitlab.gnome.org/GNOME/pygobject are also missing.
It has happened to me several times now that I ingest a large set of repositories, I look at the data, and I notice oddities caused by a repository that should not have been there in the first place.
Is there a workflow to remove a repository from the database, and rerun the plotting?
Currently I don't know of such a workflow, so I manually remove the repository, delete the database, and restart ingestion from scratch. This is ok, but it can be annoying when ingestion is slow (several minutes on large repository sets).
I thought about running sqlite on the database and doing a DELETE operation on all raw_commits coming from this directory. However, if I understand correctly, the plotting data comes from the authors table that I would need to update with new aggregates, and I don't know how to do it easily.
Assuming this does not currently exist, my proposal would be to have a command fornalder reanalyze foo.db that would drop the current authors table and recompute it from the raw_commits table as it currently exists.
(Another option of course would be to have a fornalder repo-remove foo.db repo.git command that removes a repository from a table, instead of adding it as fornalder ingest foo.db repo.git does. But that sounds like more work.)
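To make the proposal concrete, here is a minimal Python sketch of the "delete, then recompute" idea against a toy in-memory database. The table names raw_commits and authors come from the discussion above, but every column name (repo_name, author_name, author_year, first_year, n_commits) is a hypothetical stand-in; the real fornalder schema is likely different and would need to be checked before trying anything like this.

```python
import sqlite3


def remove_repo_and_reanalyze(conn, repo_name):
    """Drop one repository's commits, then rebuild the per-author aggregates.

    Column names are assumptions made for this sketch, not fornalder's schema.
    """
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM raw_commits WHERE repo_name = ?", (repo_name,))
        conn.execute("DELETE FROM authors")
        conn.execute(
            "INSERT INTO authors (author_name, first_year, n_commits) "
            "SELECT author_name, MIN(author_year), COUNT(*) "
            "FROM raw_commits GROUP BY author_name"
        )


# Toy database with the hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_commits (repo_name TEXT, author_name TEXT, author_year INTEGER)"
)
conn.execute(
    "CREATE TABLE authors (author_name TEXT, first_year INTEGER, n_commits INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_commits VALUES (?, ?, ?)",
    [
        ("good", "alice", 2018),
        ("good", "alice", 2019),
        ("bad", "alice", 2015),
        ("bad", "mallory", 2016),
    ],
)

remove_repo_and_reanalyze(conn, "bad")
rows = conn.execute(
    "SELECT author_name, first_year, n_commits FROM authors ORDER BY author_name"
).fetchall()
```

In this toy example, removing the unwanted repository also moves alice's first active year from 2015 to 2018, which is the kind of aggregate that cannot be patched by deleting raw rows alone.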
Hello,
Unfortunately I am quite inexperienced with getting those repo files running, and I don't exactly understand the instructions.
How exactly do I run these commands?
$ target/debug/fornalder --meta projects/project-meta.json \
ingest db.sqlite repo-1.git repo-2.git ...
Sorry, I am a bit limited with my capabilities here... :(
Hey there, it seems like the readme is missing the out-path that seems mandatory :)
specifically in this command:
target/debug/fornalder --meta project-meta.json \
plot db.sqlite \
--cohort firstyear \
--interval year \
--unit authors
It's not that translation or documentation commits are unimportant - of course they are - it's just that they are a different type of activity which deserves its own analysis. In particular, including translation commits in the general analysis makes the results hard to analyze, since they could be a high proportion of the overall number of commits. It's helpful to be able to look at the results and charts and know that they reflect just code changes.