Python is fantastic for what we've been doing so far, which consists of querying our GHTorrent database and exposing those queries through a Flask API.
Now that we are going to begin working on metrics that require analyzing repos, I'm not sure Python+GitPython will meet our needs well. GitPython has a beautiful API, but it is probably too slow for a usable web app to deliver the more complicated metrics without some serious caching.
pygit2 would probably be fast enough for our needs, but then the installation requirements for Windows users would be more involved. We could containerize it, but then we lose the benefits of having a Python library. Also, it's GPL, and a modified GPL at that (GPLv2 with a linking exception), so I'm not even sure we can use it.
One solution is to store every repo we analyze and keep them in sync, re-analyzing them when they change and storing the results. Essentially we'd be keeping a database up to date with our metrics. That would allow for practically instant responses when users look up a repo, but it would limit which repos they can look at, much like http://gittrends.io/ does.
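To make that concrete, here is a rough sketch of the "re-analyze only when the repo changes" idea. It is not existing ghdata code: `MetricsCache`, `get_head_sha`, and `analyze` are hypothetical names, the two callables stand in for whatever git backend we end up with, and a real version would persist to the database instead of an in-memory dict.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedMetrics:
    head_sha: str  # commit the metrics were computed at
    metrics: dict  # whatever the analysis produced


class MetricsCache:
    """Recompute a repo's metrics only when its HEAD commit changes."""

    def __init__(self,
                 get_head_sha: Callable[[str], str],
                 analyze: Callable[[str], dict]):
        # Both callables are hypothetical hooks, e.g. thin wrappers
        # around GitPython or any other git backend.
        self._get_head_sha = get_head_sha
        self._analyze = analyze
        self._store: Dict[str, CachedMetrics] = {}

    def metrics_for(self, repo_url: str) -> dict:
        head = self._get_head_sha(repo_url)
        cached = self._store.get(repo_url)
        if cached is not None and cached.head_sha == head:
            # Repo unchanged since the last analysis: serve the stored result.
            return cached.metrics
        # Slow path: new repo or new commits, so run the full analysis.
        metrics = self._analyze(repo_url)
        self._store[repo_url] = CachedMetrics(head, metrics)
        return metrics
```

The Flask handlers would call `metrics_for(repo_url)`, so the expensive analysis only runs the first time we see a repo and again after its HEAD moves. The catch is the same as above: it only pays off for repos we are already tracking.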
I think a better solution would be switching to Go.
I was looking for solutions when I stumbled upon go-git, which is written in Go and would let us download and manipulate git repos entirely in memory. It was created by a company that is doing some ML work on the code from every GitHub repo out there, so it was made to tackle a problem similar to ours. We would still cache, and we'd likely still want to keep local copies of large/popular repos to avoid re-downloading them over and over, but for smaller, more obscure repos our Go package could download and analyze them in seconds. The entire GitHub ecosystem would be open to our users, while still being practical for us and fast for them.
We would also get the other benefits of Go, including easy cross-platform binaries with no dependencies, and much better general performance vs Python. This would make it trivial for our users to host their own ghdata or use it behind firewalls.
The disadvantage is that go-git doesn't come close to the breadth of GitPython, so we'd have to reimplement a lot of GitPython's functionality. One important example is that it lacks git blame.
What do you all think?
If Go is not an option, any thoughts on the best way to deal with the slowness of GitPython?