buildinspace / peru Goto Github PK

View Code? Open in Web Editor NEW

1.1K 1.1K 81.0 4.01 MB

a generic package manager, for including other people's code in your projects

License: MIT License

Shell 1.41% Makefile 0.31% Python 98.16% Batchfile 0.11%

dependency-manager package-manager packaging plugin-manager toolchain

peru's People

Contributors

Stargazers

Watchers

Forkers

olson-sean-k tml arvindhm mikeatm nomaddevw lamegame sheesha90 noulik23 nedeterma buldingblocks kindagirl loadrunner98 llazzaro newsoft12 mightwriter8 outlawit pnsk tst2005 eramosfigueroa mjgorman nivertech scalp42 ierceg kaos-addict atomatt jmbrads22 xyproto spasarok panickervinod edbrannin liuliangsir shs96c glennneiger stefb965 happy-ferret anima-os-stash pombredanne lyyyd nkabir zeroc0d3 pixel-perfect-metodology bulletsilver hongtanhao foobarquaxx langelgjm mexisme librechain khaledhassan felipefoz rkx-forks colindean 00mjk shism2 booniepepper norgate-av mtcoin406 thierryhumphrey trellixvulnteam sthagen icodein araxeus maxwell-k kevinvdkaverna jankap

peru's Issues

improve the README

Some things we should probably mention:

The general rule fields: build, export, and files.
"Treat downloaded files like generated files."
A little bit about named rules.

allow imports to be a list of pairs

This would allow the user to import the same target twice without hacks. It would also let them control the merge order, which I don't know why you'd want to do that, but maybe.

peru overrides list

Should just work after I add it to docopt.

`peru copy` should support `--all`

I should be able to build (and force a build) of any target, not just local rules. Likewise, I should be able to export any tree, including the local imports. Once export can do that, our validate_third_party.sh script can use it and be simpler/faster.

parallel fetching can cause cache conflicts

Our parallelism uses module object locking to avoid fetching the same module twice. But there's nothing preventing two different modules from using the same URL. Those two modules could get fetched in parallel, and then you have two instances of the git plugin (or whatever) trying to write to the same directory.

We definitely don't want to shove any locking responsibilities down to the plugins. What we should do is create more granular plugin cache directories (instead of one big global one) and use a lock in peru itself to prevent two fetches from touching one cache at the same time. I'm tempted to use the full hash of a module's fields to name this directory, but we don't want to invalidate a git clone when the user changes rev for example. We could use the name+type of a module (because a module should definitely get a clean plugin cache if it changes type), but that could still get confused if one module swaps names with another. Maybe the solution is to name/lock the cache dir with a hash of all plugin fields, but also allow plugin.yaml to restrict the list of fields that get hashed. So the git plugin for example could say, "Only use my url field for the purposes of plugin caching." Is that too complicated? It might even make sense to make this configuration semi-mandatory, so that plugins that don't specify their cacheable fields get /dev/null as their PERU_PLUGIN_CACHE. Random upside to all this: we can get rid of the urlencoding that the plugins are doing now.

Related: You could have two modules with exactly the same fields. Ideally the second one should be a cache hit. But if they're fetched in parallel, they might both be cache misses, and then they would duplicate work. The solution to this would be to take module locks by cache key, rather than just by module object instance. (This should've been obvious from the beginning, since the read-write that we're protecting is done on that key.) Unlike the plugin issue above, this distinction is just a duplicated-work issue in a weird corner case, rather than a serious correctness issue. But since we already have to do module-level locking (to cover the case where both A and B depend on C), we might as well do it right.

All together, here's what that locking is going to look like. All of this lives in RemoteModule.get_tree, though RemoteModule.reup will probably want to do it too, so hopefully we can share it cleanly.

Take a module lock keyed off of the module's cache key (the hash of all module fields). Think of this as the "don't fetch the same module twice" lock, though it will also handle identical modules with different names.
Check the module cache and exit early if it's a hit.
Take a plugin lock keyed off the relevant-to-plugin-caching fields specified in plugin.yaml. Think of this as the "only one job at a time using a given plugin cache directory" lock. If the plugin hasn't configured these fields, there's no lock here, and we don't provide a cache dir at all.
Take the max-parallel-fetches semaphore. Think of this as the "even though we could run infinity jobs in parallel, let's be sensible and only run 10" semaphore.
Actually shell out to the plugin.

make hg, svn, and curl play nicely with the fancy display

Right now they only output one line.

add support for hg, svn

These need plugins. We should probably refactor some shared logic out of the plugin main functions when we do this.

support whitespace in plugin field names?

We tend to use spaces in our field names, because it's just nicer to read (required fields vs required_fields). We should probably allow plugins to do the same with the names they define. For example, suppose the curl plugin wanted a field called "fallback url". We should probably let them call it fallback url with the space. But we'd want to pass it along as $PERU_MODULE_FALLBACK_URL rather than allowing whitespace into an env var name.

`peru diff`

When doing a reup, it would be nice to see the before/after diff. One way to do this would be to support some kind of peru diff FILE [FILE2]. Note that FILE could be something like

<(git show HEAD^:peru.yaml

Peru could prepare the imports tree and then do some kind of git diff between that and the current tree.

One way to hack around this right now is to git add -A --force and commit all your imported files in a temp branch, do the reup, make another commit, and then compare those two.

should reup do a sync automatically?

If so, presumably we'd want a flag to suppress it.

replace the plugin command line protocol with plugin.yaml and environment variables

Right now plugins have to do some nontrivial parsing to separate out plugin fields from command arguments. This gets duplicated in every plugin, even though some of it is shared. A fairly trivial plugin like cp, which should be one line, ends up being four or five (let along the Bash rsync plugin), and also the sets of mandatory and optional fields get duplicated between fetch and reup scripts.

One of the reasons we didn't use more environment variables earlier is that it's difficult for the plugin to recognize invalid fields if it doesn't its fields in a list. But it shouldn't be the plugin's responsibility to recognize invalid fields -- that's more duplicated logic that should live in peru core. We should create a plugin.yaml convention that lets the plugin declare what fields it supports. (And possibly other stuff in the future, who knows.)

Once that's done, there's no reason not to pass the url field as e.g. PERU_FIELD_URL or something. Then the plugin never needs to parse anything.

allow different builds for different platforms

We should probably let the build field optionally take a mapping of system names to build commands. Complicated build commands can already do their own uname testing on posix systems, so this will almost exclusively be intended to support Windows. We should probably use sys.platform and the .startswith() idiom (https://docs.python.org/3.4/library/sys.html#sys.platform), but it might be nice to also check against os.name, so that users could specify posix without needing to duplicate things for each different posix-like os. Should we match against an ordered list?

add a `filter` field to rules and modules

It's been fairly common for me to use the build field to do something like

mkdir out && cp myfile out/

when I want to export only part of a directory. That feels pretty hacky, and it will be very inconvenient in builds that need to support Windows or even just cp -r (Mac requires the -R flag instead).

It would be better to have some explicit filter field. git add supports * and ** globs natively, so it shouldn't be too much trouble to expose this through Cache.import_tree.

My guess is that it would make sense to apply the filter step after export, which means that filter paths would be relative to the export dir rather than relative to the module root. That would save the user from duplicating the export path in the filter spec. The order of application of rule fields would then be:

build
export
filter

Add a test to make sure third-party matches peru.yaml (was: bootstrap with pip)

We should add a requirements.txt file for Python dependencies, and also a peru.yaml file that does the same thing, so that we can be self-hosting without having to bootstrap ourselves.

How should we keep these two things in sync. Can we generate one from the other?

add some logging

.peru/log seems like a reasonable place. It would be nice to record entries like

module foo cached: 5d5fb9a5c41a0bca34af6fcb1e554b79af6534ea

so that when I want to clear the cache for just one module, I can find its cache key in the log. And of course, we should be logging errors.

stop complaining on deleted imports

We want peru sync to be careful about overwriting the user's files. Peru doesn't pave over preexisting files, or any changes that have been made since a file was created. This is to avoid accidentally deleting users' work, and also to try to catch some of the cases where users have done the Wrong Thing with peru (like checking in synced files (#dowhatisaynotwhatido)). But currently we also freak out if the user has deleted files that peru synced, and I think that might be overzealous. Consider this scenario:

I want to clean junk out of my repo, but I don't want to lose my peru cache.
So I run git clean -dfx --exclude .peru. Maybe I have an alias for that.
Now I call peru sync. Peru absolutely refuses until I use -f.

I think it would be better if peru stopped complaining here. There's no risk of losing work, and it's not really catching any Wrong Things. If a user sees this error all the time, they're not going to be paying attention when it eventually catches a real mistake. (Especially if we get them in the habit of using -f.)

tldr: peru sync should consider deleted files "clean".

print stacktraces from PrintableErrors if --verbose

make sure our .peru dir is versioned

We want to be able to make changes to the format without needing everyone to git clean their projects. Possibly also version the plugin caches?

make it possible to keep .peru under .git

It's often convenient to store metadata inside the .git directory. See http://tbaggery.com/2011/08/08/effortless-ctags-with-git.html. We should make it easy to move .peru to .git/peru, without breaking the ability to run peru commands outside the project root. Probably some modification to how we handle $PERU_DIR or some new variable.

support user-defined plugins

Maybe support a plugins: field in peru.yaml, given as a path or a list of paths.

backup peru.yaml during a reup

There's no reason we shouldn't keep backups of peru.yaml under the .peru dir when we modify that file. It wouldn't take very much disk space, and it could be helpful for users who aren't under version control. (As, for example, our future workspace feature might not be.)

specify the cache on a per-project basis

Something in .peru. Maybe a pseudo symlink like git uses.

The PERU_CACHE env var is a little too broad. You might want to have some projects share the cache but not others.

peru workspaces

It would be nice to be able to manage a big ecosystem of projects with peru. We'd probably build on the existing overrides feature to do it. One idea we had was generating a peru.yaml file (not version controlled) that refers to your project repository as an overridden remote module. We might recursive peru to make this work.

Windows compatibility

We need to make sure we're using the ProactorEventLoop on Windows. The default loop doesn't support subprocesses.
We need to use create_subprocess_shell instead of create_subprocess_exec to execute e.g. .py plugins.

Should the cache directory live in $HOME by default?

As I'm writing the README, I find myself telling new users to set the $PERU_CACHE variable to avoid recloning things after they clean. When new users need to configure some random setting, that's usually a sign that the default is bad. Should we be storing the cache in a centralized spot by default?

Pros:

This is what Maven and Ivy do.
This makes the fastest setup the default for new users.
Different projects with the same dependencies would share their networking and disk space by default.

Cons:

This is what Maven and Ivy do.
This is not what git does by default, even though it can be configured to.
This default might be bad for complicated disk setups.
- Say your build machine uses an NFS mount for /home, and uses /var/local or something for the actual local disk. You might want to do all your clones and builds in /var/local for speed, but behind the scenes peru is doing big git operations over the network in /home.
This can be confusing for modules without an explicit rev.
- If two projects use the same dependency, the rev that one of them is getting will be affected by the other. A "new" dependency could be very stale because the other caller cached it in the past. We will probably also have a --skipcache flag or something in the future to force plugin fetches, and doing that would update the cache for all callers.
- Similar problem for nondeterministic build commands.
This is a band-aid for our slow, serial plugin fetching. We should make it faster instead.
When we start actually using locks for our cache writes, this default could create more lock contention and stale lock issues.

Is it really a good idea for the imports list to be separate?

This is a consistent question I get when I demo peru for people. Why not put the import path for a module in the module's declaration? There are two decent reasons and one bad reason:

It's nice to be able to look in one place and see everything that peru is going to do when you sync.
Imports are not necessarily one-module-one-path. A module could be imported multiple times with different rules.
Bad reason: this is a holdover from when remote modules were more involved, with potentially their own imports, when some symmetry between the remote modules and the toplevel module made sense.

Allowing import paths as part of a module declaration definitely simplifies the hello world example. I could go either way on examples that are more complicated than that. I really want to avoid having two different ways to do the same thing, like allowing both an imports list and inline import paths. I think the biggest question for me right now is whether point (1) is really true, or whether I just think it's true because I'm used to it...

fix Sean's dotfiles

https://github.com/olson-sean-k/dot-config
https://gist.github.com/oconnor663/fffb0f2fcd7ca472b589

Doing a peru sync in that repo causes YouCompleteMe to get fetched ten times. It's probably something to do with submodules. Weirdly, in the end everything seems to work fine.

use libgit2 via pygit2 instead of shelling out

The main blocker for this one is Cache.merge_trees(). We use the --prefix flag for git-read-tree, and libgit2 doesn't seem to support a similar feature. Tracking issue: libgit2/libgit2#154 We could use the treebuilder feature to build a prefixed tree, and then use git_merge_trees() on that, but that function isn't exposed through pygit2 anyway.

Other features we need in pygit2 that we've already implemented:
setting the working dir: https://github.com/oconnor663/pygit2/commit/a063867fe0e4506e29f22c45dd403d805e3fb1b7
setting a detatched HEAD: https://github.com/oconnor663/pygit2/commit/b190169f5e83cbdb2346acd52cea30e14a205eb5
EDIT: These were pushed as part of pygit2 v0.21.0 libgit2/pygit2#377

recursive peru

Remote modules should be able to include their own peru.yaml files. This should allow default rules, as well as referencing rules and modules defined in the remote.

add missing tests for svn plugin

The svn plugin does not have adequate test coverage. The initial version of the plugin didn't even check out the correct revision, as seen in this change: https://phabricator.buildinspace.com/D37

The tests will likely need some refactoring for this.

get a cert for buildinspace.com

I've opened up http:// on port 80, but our .arcconfig and commit logs are still pointing to https://, so we should really fix this.

implement PERU_WORK_DIR

The validate_third_party.sh script has to copy peru.yaml around and then clean it up. That's annoying. We also have hacks in tests to handle peru.yaml when we're comparing contents of directories. Make all this cleaner.

consider validating our markup with Schema

https://github.com/halst/schema Same guy who wrote Docopt.

Others:
https://github.com/alecthomas/voluptuous
https://github.com/Julian/jsonschema
https://github.com/podio/valideer

Cache behaves badly around .gitignore files

Export loses it's safety guarantees hen the stomped files are gitignored. Import (probably) fails to import gitignored files.

peru.sh appends to PYTHONPATH rather than prepending

Otherwise when we install peru, we won't be able to run it out of the repo anymore.

remove the build command

sync only ever syncs one thing (everything). That's a good thing. It means you don't have a lot of state that you need to worry about. You're either synced, or you're not.

build has a similar restriction, but it seems to make a lot less sense. Almost all builds need to support multiple different invocations, like make and make install. To be useful in anything but the most trivial cases, build would need to start taking parameters that it passes along to build commands, and the target syntax would need to support this too.

Rather than trying to patch up a bad model, I think we should scrap the build command. We should encourage the pattern where other build tools call peru sync.

One question this raises: Projects can have a toplevel build: field. Previously peru sync ignored this, and only peru build triggered it. With this change, the only way to invoke a toplevel build field will be to have another module depend on you as a recursive project (not implemented yet). Is that a world that makes sense?

Actually, it's no different from export: and files:, neither of which is meaningful at the top level unless someone depends on you as a recursive project. Maybe it's good that build: would be more like those.

But that raises another question: Does it really make sense for build, export, and files to be first-class, toplevel fields? Maybe we should cordon them off in a section of their own?

set up code coverage (on plugins)

https://coveralls.io/ Should work well with Travis. Here's a Python API for coveralls that also showcases the coverage badge: https://github.com/z4r/python-coveralls

use shallow fetching for the git plugin

It can fall back to an --unshallow fetch if the needed rev is still missing after a standard fetch. (Or clone?)

add a `peru status` command

should remote modules even have imports?

Maybe that's more complicated than it's worth. (Especially when it comes to overridden modules, where we have to stick a .peru dir in them.) I'm not sure I can think of a good use case. We want to encourage nontrivial build commands to come out of the peru.yaml file anyway, right? Maybe only the toplevel project (and hypothetical recursive projects) should have imports.

run fetches in parallel

We'll use asyncio for this, from the 3.3-compatible "tulip" library. Some things to remember:

Make sure modules and rules lock their get_tree methods. We don't want to allow multiple fetches to happen at once for the same module.
Use a semaphore to limit the number of parallel jobs.
Refactor resolver.py so that not everything needs to become a coroutine.

refactor test code

The test harness is kind of a mess. In particular, the plugin tests have some inflexible scaffolding that doesn't work well for anything but distributed VCS plugins like git and hg. Until this is done, it may be difficult and hacky to test plugins like svn.

write a bash_cp plugin

This will help us force the plugin interface to stay simple, and to give an example of how plugins in other languages should be written.

checking cache keys should not cause a build

We can compute the cache key for a rule without building it. So we should really be able to do that without building its dependencies either. The current approach has the benefit of noticing when we've run a rule on the same inputs before though. Can we get both?

don't let builds print straight to stdout

That conflicts with the fancy display. Maybe the displays could be extended to provide a different kind of output writer, which works like the print method does now.

cancel subprocesses when a job fails?

We don't do anything special to cancel existing jobs when something fails. What actually happens? Presumably we should be sending a kill signal to existing jobs. What if a job fails to die?

allow plugin scripts to be one file

Right now we force plugins to separate their fetch and reup scripts, at least to some degree. This forces all of our plugins to use the *_shared idiom, which is pretty annoying. That layout used to make sense before we have plugins.yaml, but now maybe it doesn't. It should be easy enough for that file to tell us what to invoke for fetch and reup, and there's no reason those couldn't be the same thing. (We could use another env var like PERU_PLUGIN_COMMAND to make it possible-but-not-required to use one script for both.) @olson-sean-k what do you think?

Changing the cache can cause "failed to unpack tree object" errors

Say you run peru sync and then you change the value of PERU_CACHE and run peru sync again. The lastimports file will contain a reference to a tree that's not in your new cache, and you'll get a git error. We should detect this case ("hey, it looks like your last imports tree is no longer in cache") and allow the --force flag to just pave over everything.

move .peru/cache/tmp to .peru/tmp

Users should be able to set PERU_CACHE to their home dir without causing peru to write a ton of temp files there. Honestly, there really shouldn't be a reason to set PERU_PLUGINS_CACHE instead of PERU_CACHE.