fermitools / jobsub_lite Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 7.0 1.44 MB

jobsub_lite is a wrapper for HTCondor job submission

License: Apache License 2.0

Python 75.25% Shell 20.28% Batchfile 3.94% Makefile 0.53%

jobsub_lite's Issues

jobsub_q doesn't support the --role flag

Make sure that something like jobsub_q -G dune --role=production works

Handle DESIRED_usage_model=OPPORTUNISTIC,DEDICATED for backward compatibility

Put something in jobsub lite where if someone said DESIRED_usage_model=OPPORTUNISTIC,DEDICATED on a jobsub_submit line, it would trigger DESIRED_Sites="Fermigrid". This value should be overrideable by an environment variable.

Note that this is a stopgap for backward compatibility, and is not something we want to encourage. Once this is done, we need to open another issue to plan its phase-out, along with a possible replacement (maybe an --onsite flag that's configurable in the environment?)

Check dependencies of jobsub_lite RPM

When I tried to install the jobsub_lite RPM on a fresh node, I found that I had to install a number of other RPMs to make it work. Add these as dependencies in the spec file. Partial list includes:

voms-clients, globus-gssapi-gsi, osg-ca-certs, vo-client

jobsub_transfer_data does not respect the JOBSUB_GROUP envrionment variable

If I set JOBSUB_GROUP to say dune and then try jobsub_transfer_data myjobid, I get a failure:

$ jobsub_transfer_data [email protected]
sh: line 1: _Production_50762: command not found
Checking if /tmp/x509up_u50762 can be reused ... yes
...........++++++
.....++++++
sh: line 1: _Production_50762: command not found
sh: line 2: /Role=Production: No such file or directory
sh: line 1: _Production_50762: command not found
htgettoken: Kerberos negotiate with https://fermicloud543.fnal.gov:8200/v1/auth/kerberos-dunepro_default/login failed: HTTPError: HTTP Error 400: Bad Request: missing client token
sh: line 1: -r: command not found

But all works happily with jobsub_transfer_data -G dune [email protected].

Need to write unit tests for jobsub_lite

Now that most of the code is finished and only corrections are needed, we should write unit tests to aid present and future development.

New bug in master wrt template rendering

https://github.com/marcmengel/jobsub_lite/blob/250859b2c8066d375d398bf0850bbbcbc22a86c9/templates/simple/simple.sh#L323

This line is an example of a bug introduced in f6fbb3d, which occurs when any template variable that is set to "None" by default (like timeout) is checked like this. The intended behavior is that if the variable is undefined or None, that we omit the portion of the script that invokes it. Instead, since the variable is defined, as None, the template tries to substitute None for the value of the variable, which usually causes problems. Look through the various templates and fix this.

This breaks even the simplest of jobs.

jinja2 to jinja3 transition

The current major version of jinja is jinja3. Currently, we use jinja2 because its RPM is available on EPEL. Once jinja3 is available, we'll want to be ready to transition our templating to jinja3 templates.

Still an issue with --group when submitting to DUNE global pool

We need to figure out why this happens:

[sbhat@fifeutilgpvm01 jobsub_lite_sb_fork]$ jobsub_q --group dune [email protected]
Attempting OIDC authentication with https://fermicloud543.fnal.gov:8200

htgettoken: Initiating authentication to https://fermicloud543.fnal.gov:8200 failed: HTTPError: HTTP Error 400: Bad Request: missing client token


-- Schedd: dunegpschedd02.fnal.gov : <131.225.240.252:9615?... @ 12/21/21 14:30:11
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 6 jobs; 0 completed, 0 removed, 6 idle, 0 running, 0 held, 0 suspended

But this works:

[sbhat@fifeutilgpvm01 jobsub_lite_sb_fork]$ jobsub_q [email protected]
sed: can't read /tmp/bt_token_sbhat: No such file or directory
htgettoken: Initiating authentication to https://fermicloud543.fnal.gov:8200 failed: HTTPError: HTTP Error 400: Bad Request: missing client token


-- Schedd: dunegpschedd02.fnal.gov : <131.225.240.252:9615?... @ 12/21/21 14:31:14
OWNER BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
sbhat ID: 6       12/21 14:27      _      _      1      _      1 6.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 6 jobs; 0 completed, 0 removed, 6 idle, 0 running, 0 held, 0 suspended

They should both work.

DEFAULT_SINGULARITY_IMAGE in get_parser.py should be latest tag

For testing, we've frozen the tag of the DEFAULT_SINGULARITY_IMAGE in lib/get_parser.py. This needs to be switched to latest before going to production.

-G flag should set the environment variables GROUP and JOBSUB_GROUP

Currently, the -G flag to jobsub_lite commands sets the group of the submission, which is only used in tarfile upload. We need to also populate the GROUP environment variable with this value for submission to the DUNE global pool, since otherwise, token generation (and thus authorization to submit) fails.

The current set of steps needed to submit a job to GPGrid is:

export GROUP=dune
jobsub_submit -G dune  --resource-provides=usage_model=OPPORTUNISTIC,DEDICATED  file:///home/sbhat/TestJobs/basicsleep.sh 123

For the DUNE pool, the process is:

export _condor_COLLECTOR_HOST=dunegpcoll02.fnal.gov
export GROUP=dune
jobsub_submit -G dune  --resource-provides=usage_model=OPPORTUNISTIC,DEDICATED  file:///home/sbhat/TestJobs/basicsleep.sh 123

The third line is unnecessary, since we're already specifying group.

bare condor_submit should return help, not a traceback

$ condor_submit
Traceback (most recent call last):
  File "/opt/jobsub_lite/bin/condor_submit", line 83, in <module>
    main()
  File "/opt/jobsub_lite/bin/condor_submit", line 71, in main
    proxy, token = get_creds(args)
NameError: name 'args' is not defined

Document release build steps

We need to have a clear procedure for developers on how to build a release. This should be in the wiki.

DAG issues

A few issues noted by vito:

I got small-ish issue in the DAG setup, the dagbegin.cmd and dagend.cmd are inheriting resource requests from the jobsub_submit command, which could mean few memory GB and 10s GB of disk for jobs just starting/ending the SAM project, I think in current jobsub_client those jobs have their own pre-sets for resource requirements
3:09
also, all user jobs in the DAG are named WORKER_0 , in current jobsub_client, the index suffix is increasing
New
3:09
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
761.0 vito 8/24 20:20 0+02:41:04 R 0 0.0 dagman_wrapper.sh -p 0 -f -l . ...
763.0 |-WORKER_0 8/24 20:22 0+00:00:00 I 0 0.0 simple.sh --debug --find_setups ...
763.1 |-WORKER_0 8/24 20:22 0+00:00:00 I 0 0.0 simple.sh --debug --find_setups ...
763.2 |-WORKER_0 8/24 20:22 0+00:00:00 I 0 0.0 simple.sh --debug --find_setups ...

f.close should be f.close()

Currently, on this line:
https://github.com/marcmengel/jobsub_lite/blob/c26e3631f7412a6876bc9bc6b99539d51b1428ad/lib/creds.py#L29

there's a call to f.close. This should be f.close()

Drop jobsub_transfer_data for condor_transfer_data

jobsub_transfer_data should be dropped from the package. Make that condor_transfer_data, and have it do the same as jobsub_transfer_data does. This should be rolled out at the same time as jobsub_fetchlog is implemented.

When rendering templates, we should be strict about checking for variables

Any time we're rendering a template, we should make sure that a variable the template expects is present in the arguments passed in. If not, it should return an error.

This is branching off Issue #45 , and was mostly implemented by adding undefined=jinja2.StrictUndefined to the jinja environment. Doing that broke a lot of the jobsub_submit unit tests, and a lot needs to be added to utils.py and test_unit.py to make the tests work again. Most of this work is done, but thorough unit testing and general testing is needed, which is why this being split off.

Look into running github actions (pre-commit, tests)

Look into running github actions (pre-commit, unit tests, integration tests). First pass, we definitely want pre-commit running. Then I'd say unit tests and integration tests, though the latter might be a bit tougher.

Implement pre-commit with linting and autoformatting (black)

We need pre-commit actions like linting and autoformatting (black)

Move setting of BEARER_TOKEN into ifdh and remove from jobsub_lite

Move setting of BEARER_TOKEN into ifdh and remove from jobsub_lite. This relies on ifdh setting BEARER_TOKEN in the environment by reading from BEARER_TOKEN_FILE. This change should mostly need to take place in the wrapper scripts. (see #30)

AlmaLinux 9 testing

We need to make sure that jobsub_lite works well with CentOS8. Preferably, this should happen before our go-live.

Implement jobsub_lite schedd classad

Nick Peregonow created a classad for jobsub_lite schedds. We need to implement this in jobsub_lite.
Classad is "SCHEDD.IsJobsubLite = True" for jobsub_lite schedd picker.

jobsub_lite should use tokens for authentication to RCDS frontend webserver

jobsub_lite should use tokens for authentication to RCDS frontend webserver. This needs to wait until the RCDS service has tokens enabled. Then we can test this.

RPM should make sure that 00_htcondor_9.0.config gets moved upon installation

spec should have a line where if 00_htcondor_9.0.config is installed by the condor RPM, it gets moved. The ONLY condor config files we should have should be the ones jobsub_lite installs.

Replacement for jobsub_fetchlog

We need in the new jobsub_lite a replacement for jobsub_fetchlog. @retzkek is already working on this, so this is simply a placeholder issue to document the progress.

Create a replacement for jobsub_fetchlog

We need some replacement for jobsub_fetchlog, whatever that may look like in the end.

condor_submit has a call to the variable "args" that doesn't exist

In the following line:

https://github.com/marcmengel/jobsub_lite/blob/e6e43df0a095a851f48a88e83b7faebcd716cd1b/bin/condor_submit#L71

There's a call to args that isn't declared. This causes direct condor_submit calls to fail, such as the following example:

$ condor_submit -name jobsubdevgpvm01.fnal.gov simple_pro.cmd
Traceback (most recent call last):
  File "/opt/jobsub_lite/bin/condor_submit", line 83, in <module>
    main()
  File "/opt/jobsub_lite/bin/condor_submit", line 71, in main
    proxy, token = get_creds(args)
NameError: name 'args' is not defined

It looks like we'll also run into the same problem on line 80, with the call
https://github.com/marcmengel/jobsub_lite/blob/e6e43df0a095a851f48a88e83b7faebcd716cd1b/bin/condor_submit#L80
In this case, the variable varg is not previously defined.

Assignee should probably look to jobsub_submit to see how the function was intended to be implemented.

Schedd selection should take into account SupportedVOList, InDownTime, and Test classad

Currently, the schedd selection code filters based on SupportedVOList. We also want to check InDownTime (if it's true, we shouldn't use that schedd), and Test classad.

As far as the Test classad, we should discuss with group idea of having a test flag that triggers this.

Change all "multi-word" flags from underscores to dashes

We decided at a meeting a few weeks ago that all flags should use dashes, rather than underscores. For example, --tar_file_name should become --tar-file-name. This will result in a more consistent look across all the flags.

Add --version flag to jobsub commands

jobsub_ --version should give the version of jobsub_lite.

Clean up error messages in various jobsub command output

For example, if I run something like jobsub_q MYJOB@MYSCHEDD, as much as possible, I should only get the output I need, unless of course there is an actual error we care about.

[sbhat@fifeutilgpvm01 jobsub_lite_sb_fork]$ jobsub_q  [email protected]
sed: can't read /tmp/bt_token_sbhat: No such file or directory
htgettoken: Initiating authentication to https://fermicloud543.fnal.gov:8200 failed: HTTPError: HTTP Error 400: Bad Request: missing client token


-- Schedd: dunegpschedd02.fnal.gov : <131.225.240.252:9615?... @ 12/17/21 14:26:48
OWNER BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
sbhat ID: 4       12/17 14:09      _      _      1      _      1 4.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended

--role value is not passed to templates, resulting in submission failures

When I try to submit a job using the production role, the following happens (cut some of output here):

jobsub_submit -G dune --role=production   --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE --devserver file:///usr/bin/printenv
...
Running: _condor_CREDD_HOST=jobsubdevgpvm01.fnal.gov BEARER_TOKEN_FILE=/tmp/bt_token_dune_Analysis_10610 /usr/bin/condor_submit -spool -pool gpcollector04.fnal.gov -remote jobsubdevgpvm01.fnal.gov /home/sbhat/.jobsub_lite/dune/sbhat/2022_02_09_135318.3792bd86-4bf1-46b3-a639-09114db72cb4/simple.cmd
...
Checking if -s dune credentials exist
Attempting to get tokens for dune
Credkey from /home/sbhat/.config/htgettoken/credkey-dune-default: sbhat
Attempting to get bearer token from https://fermicloud543.fnal.gov:8200
  using vault token from /tmp/vt_u10610-dune
  at path secret/oauth/creds/dune/sbhat:default
Storing bearer token in /tmp/bt_token_dune_Analysis_10610
Submitting job(s).
1 job(s) submitted to cluster [email protected]
Output will be in /home/sbhat/.jobsub_lite/dune/sbhat/2022_02_09_135318.3792bd86-4bf1-46b3-a639-09114db72cb4 after running jobsub_transfer_data.

So the submission works, but note that the bearer token filename has "Analysis" in there. Looking at the bearer token itself (again omitting some lines):

$ httokendecode /tmp/bt_token_dune_Analysis_10610
{
  "sub": "[email protected]",
  "nbf": 1644436373,
  "scope": "storage.create:/dune/scratch/users/sbhat compute.create compute.read compute.cancel compute.modify storage.read:/dune",
  "wlcg.groups": [
    "/dune"
  ],

}

the correct /dune/production group hasn't been included in the wlcg.groups entry. This is because the line here (and corresponding ones in other templates)

https://github.com/marcmengel/jobsub_lite/blob/9a25dfa5711fafc547274467111398c1afcec9ee/templates/simple/simple.cmd#L62

doesn't include the role that we passed in. Thus, the correct role doesn't get passed to the submit file. I think that there's another, related bug, with fake_ifdh, but I'll put that in a separate issue. For now, we'll address this.

RPM should move (and save) vanilla condor config files to avoid our configs getting preempted

Currently, vanilla condor (a dependency of jobsub_lite), installs /etc/condor/config.d/00-htcondor-9.0.config and /etc/condor/config.d/40-vault-credmon.conf.

The RPM spec file should make sure that these files are either deleted, moved, or renamed to have later priority than our jobsub_lite condor configs.

Enable jobsub_lite to use tokens for RCDS/tarball uploads

Currently, on a few lines in https://github.com/marcmengel/jobsub_lite/blob/master/lib/tarfiles.py, such as this:

https://github.com/marcmengel/jobsub_lite/blob/041160f5702f7fe0206d22873844114c49abd3d2/lib/tarfiles.py#L97

jobsub_lite uses a proxy to query and upload files to the Rapid Code Distribution Service (RCDS) at Fermilab. Now that RCDS is token-enabled, we need to switch jobsub_lite to using tokens for RCDS interaction.

It looks like @marcmengel already built into the various RCDS-interacting functions the ability to use tokens, so we should just need to switch some function calls.

jobsub_lite should make sure that any role is lower-cased even if the user gives a different capitalization

In the tokens world, a lot of roles are lowercased (like "production", etc). We need to make sure that however the user capitalizes the role, we generate the correct capitalization.

Whoever does this should check with the tokens folks about other roles, like Data, Calibration, MARS, and the like and make sure that the convention will be lower-casing the role.

Implement role support for get_creds()

https://github.com/marcmengel/jobsub_lite/blob/8d22209b58a004bd17acc6c44e47686a765a5b60/lib/creds.py#L20

Right now, get_creds() will always call fake_ifdh with no arguments, which will allow the role-selection in fake_ifdh to rely on the environment. We want to make this more intelligent, and allow for us to pass a role in. Either check for it in args, or specifically allow for a role argument that gets passed into fake_ifdh, depending on if it's the default "Analysis" role or not.

If we know the schedd, we should set _condor_CREDD_HOST to it in the environment

Currently, jobsub_submit automatically sets the environment variable _condor_CREDD_HOST based on the schedd it picks for submission. We need to ensure that for lightly-wrapped condor commands like jobsub_transfer_data, jobsub_rm, etc., where we know the schedd a priori because the user has to specify it, that we set _condor_CREDD_HOST to the schedd. Otherwise, we run into OIDC issues with the token where a user has to reauthenticate because condor_vault_storer (run by all condor commands) thinks there's no credentials stored. I believe the line referenced here is where we would have to set the environment variable:

https://github.com/marcmengel/jobsub_lite/blob/dbd77b3bb04a0c98b00aa0f32812cea478582a1e/bin/jobsub_cmd#L76

Reproduce issue:

On a managed tokens node:

[dunepro@hostname ~]$ export _condor_COLLECTOR_HOST=<collector_host>
[dunepro@hostname ~]$ export HTGETTOKENOPTS="--credkey=<credkey>"
[dunepro@hostname ~]$ condor_vault_storer -v dune_production

Checking if -s dune_production credentials exist
Account: <current> (dunepro)
CredType: oauth

Operation failed.
    Make sure your ALLOW_WRITE setting includes this host.
Removing /tmp/vt_u50762-dune_production because there are no dune_production credentials stored
Attempting to get tokens for dune_production
Initializing kerberos client for host@<vault_server>
Negotiating kerberos with <vault_server>
  at path auth/kerberos-dune_production
Attempting to get bearer token from <vault_server>
  at path <path>
Read token from <URL> failed: HTTPError: HTTP Error 403: Forbidden: 1 error occurred:
        * permission denied


No ssh-agent keys found
htgettoken: Failure getting token from <vault server>
Authentication needed for dune_production
Initializing kerberos client for <vault server>
Negotiating kerberos with <vault server>
  at path auth/kerberos-dune_production
Attempting to get bearer token from <vault server>
  at path <path>
Read token from <path> failed: HTTPError: HTTP Error 403: Forbidden: 1 error occurred:
        * permission denied


No ssh-agent keys found
Attempting OIDC authentication with <vault server>

Complete the authentication at:
    https://cilogon.org/device/?user_code=<code>
No web open command defined, please copy/paste the above to any web browser
Waiting for response in web browser
^C

But this works:

[dunepro@hostname ~]$ export _condor_COLLECTOR_HOST=<collector host>
[dunepro@hostname ~]$ export HTGETTOKENOPTS="<credkey>"
[dunepro@hostname ~]$ export _condor_CREDD_HOST=<schedd>
[dunepro@hostname ~]$ condor_vault_storer -v dune_production

Checking if -s dune_production credentials exist
Attempting to get tokens for dune_production
Attempting to get bearer token from <vault server>
  using vault token from /tmp/vt_u50762-dune_production
  at path <path>
Storing bearer token in /run/user/50762/bt_u50762-dune_production
Copying bearer token to /run/user/50762/bt_u50762

Bug in submit_dag where if we have pre-existing .sub file, we can't run condor_submit_dag

https://github.com/marcmengel/jobsub_lite/blob/8d22209b58a004bd17acc6c44e47686a765a5b60/lib/condor.py#L172

On that line, if the previous if clause isn't run, then cmd is never populated. Then, the subprocess.run bit will fail with a NameError.

Indent entire try-except clause, because we only run condor_submit_dag command if the .sub file doesn't exist.

jobsub_q @jobsubdevgpvm01.fnal.gov does not give detailed queue information

Output from jobsub_q @jobsubdevgpvm01.fnal.gov is not complete. The specifics are missing.

Here is an example of the output that one gets:
-- Schedd: jobsubdevgpvm01.fnal.gov : <131.225.240.23:9615?... @ 08/24/22 14:52:10
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 28 jobs; 9 completed, 0 removed, 17 idle, 0 running, 2 held, 0 suspended

Here is an example of the output that one SHOULD get:
-- Schedd: jobsubdevgpvm01.fnal.gov : <131.225.240.23:9615?... @ 08/24/22 14:51:34
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
goodenou ID: 736 8/23 13:54 _ _ _ _ 1 736.0
goodenou ID: 737 8/23 15:16 _ _ _ _ 1 737.0
goodenou ID: 739 8/24 14:51 _ _ 1 _ 1 739.0

Total for query: 3 jobs; 2 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for goodenou: 3 jobs; 2 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 28 jobs; 8 completed, 0 removed, 18 idle, 0 running, 2 held, 0 suspended

jobsub_q/condor_q wrapper needs to be converted to python

Currently, jobsub_q is a shell script. We should convert this to Python for better future maintainability.

Integration tests needed for jobsub_lite

Now that most of the code is finished, we should start writing integration tests for jobsub_lite. @marcmengel indicated that he'd prefer this happen before unit tests are written.

jobsub_lite should default to fnal_wn SL7 singularity container

Jobsub_lite should default to adding the following singularity image for all jobs:

+SingularityImage="/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:osg3.

When we move to production, it should be the following (add a commented line or something like that)

+SingularityImage="/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest"

This image should be overridable with a flag like "--singularity-container"

Traceability needed in jobsub_lite

Eventually, we'd like to build traceability (using Jaeger or some similar framework) into jobsub_lite. This will help us in future troubleshooting.

Get rid of bin/fake_ifdh

We've now duplicated fake_ifdh - in bin/ and lib/. We want to keep the lib/ version, so just get rid of bin/fake_ifdh.

Dune global pool no longer transferring tokens to job

We noticed on 3/17/22 that even for regular user jobs, tokens weren't getting transferred to the job. That is, BEARER_TOKEN_FILE was set

But httokendecode couldn't find the file. Thus, later file transfers failed.

Implement DESIRED_USAGE_MODEL flags

Currently, jobs don't run in jobsub_lite unless a flag like --resources_provides=DESIRED_USAGE_MODEL=DEDICATED (I'm going from memory here - the assignee should check for the actual correct flag) is provided. We need to implement two flags:

--onsite, which will add DEDICATED,OPPORTUNISTIC
--offsite, which will add OFFSITE

And have no flag default to DESIRED_USAGE_MODEL=DEDICATED,OPPORTUNISTIC,OFFSITE.

The actual flag names and implementation here are a suggestion - the assignee is of course free to come up with a better scheme.

Move tarball discovery to the wrapper script

Need to have the wrapper script do tarball discovery using the token, much like the current jobsub wrapper does. Right now in jobsub_lite, submission will have to wait for a tarball to publish to complete.

See here:
https://github.com/marcmengel/jobsub_lite/blob/fe436ffa5eda1df2308a0f8e00617b1d777bde7e/lib/tarfiles.py#L100

We try to publish a tarball, and then wait for the RCDS server to give us a location before moving on. We want to publish and move on, and then have the wrapper script find the location.

Managed Tokens-Related Changes

This issue/document follows a large discussion with @DrDaveD, who maintains htgettoken and condor_vault_storer, both of which we use to get credentials.

Managed Tokens Idea

We want to have a script/service running on a crontab that generates vault tokens for production users (e.g. dunepro, mu2epro, etc.). It will then transfer these tokens to the experiments' submit nodes, from which they can run jobsub_submit and other grid tools.

Issue with extra authentication

While testing small-scale versions of this service for DUNE, Ken Herner and I have found that we, acting as simulated end-users, often have to reauthenticate via CILogon website and our SERVICES credentials to obtain a bearer token. The way that htgettoken works, since we already would have a valid vault token, we should not have to authenticate to get a bearer token from the vault.

Root cause

Currently, our Managed Tokens script uses a special Kerberos group principal and invokes condor_vault_storer, which generates two copies of the vault token:

/tmp/vt_u$(id -u)
/tmp/vt_u$(id -u)-SERVICE where SERVICE is simply a concatenation of issuer_role. So the full path would be something like /tmp/vt_u12345-dune or tmp/vt_u12345-dune_production

Crucially, condor_vault_storer looks in the second location to figure out whether or not it needs to re-generate a vault token and store it in the condor_credmon. This will become important shortly.

One of the first actions jobsub_submit does is to run htgettoken (via get_creds()):

https://github.com/marcmengel/jobsub_lite/blob/250859b2c8066d375d398bf0850bbbcbc22a86c9/bin/jobsub_submit#L174

By default, htgettoken will only create the first copy of the vault token above. Thus, when we go to submit the job, condor_submit, invoked by jobsub_submit, runscondor_vault_storer, which will not see a vault token in the second location and will try to regenerate another vault token, even though one exists in the first location.

Further complication

In the troubleshooting of this issue, Ken and I would often regenerate the vault token manually by rerunning the Managed Proxy protoype script. This script currently clears the kerberos cache of our user kerberos ticket, and puts the special group kerberos ticket into the cache. If we then try to run jobsub_submit, this causes the obtaining of an X509 proxy to fail.

Solution, and the work item for this issue

We think we can solve both issues by simply having jobsub_lite run condor_vault_storer with the appropriate argument instead of htgettoken on https://github.com/marcmengel/jobsub_lite/blob/master/lib/fake_ifdh.py#L76. This would result in condor_vault_storer being invoked twice on each condor_submit (first manually, second automatically by condor_submit), which is OK (we currently similarly invoke htgettoken twice).

The Managed Tokens service would run as is, generating both vault tokens, and transferring them both to the end user's submit machine. When the user then runs jobsub_submit:

For an Analysis user, condor_vault_storer will get the correct vault and bearer tokens, and when condor_submit is invoked, this second invocation of condor_vault_storer will see the correct vault token and proceed.
For a production user, the first invocation of condor_vault_storer will see that the vault tokens are already in place (due to the Managed Tokens service), and simply get the bearer token. The second invocation of condor_vault_storer will simply do the same as in the Analysis case.

Additional Notes

The kerberos ticket the end user has (their own) when logging into the production account should not matter for the purpose of getting a token. The way I describe the use case here, the vault token along with the correct credkey passed to $HTGETTOKENOPTS should be sufficient for authentication to get the proper bearer token.
The Managed Tokens service should possibly NOT create the bearer token when it runs condor_vault_storer. This can also be done by passing --nobearertoken in $HTGETTOKENOPTS.
I tested the end-effect of doing this using fifeutilgpvm01 and my dev node, fermicloud525. On fifeutilgpvm01, I ran the Managed Tokens prototype script, and transferred the files to fermicloud525 with the same paths. I then logged into fermicloud525 as dunepro, and after setting up the environment variables ($HTGETTOKENOPTS and $_condor_COLLECTOR_HOST, was able to submit jobs using those vault tokens.
We should explore reverting to the Managed Tokens service running as a service account (rexbatch) and operating from there, rather than needing to su as the various users.
For the case of managed tokens only (We'd have to figure out how to discern that), @DrDaveD also suggested having --noidc added to the HTGETTOKENOPTS for jobsub_submit.

Implement clearing out of troublesome environment variables

There are a number of known environment variables that cause issues when running grid jobs in singularity containers. We should, by default, clear those, and provide a flag to keep those variables in place.

This list of variables should be well-documented.

When a non-Analysis role is given, make sure we set BEARER_TOKEN_FILE (and thus BEARER_TOKEN) correctly in wrapper scripts

When a non-Analysis role is given, make sure we set BEARER_TOKEN_FILE (and thus BEARER_TOKEN) correctly in wrapper scripts. For example, for role production, experiment dune, the jobsub_submit line will be something like:

jobsub_submit -G dune --role production <args> file:///<path>/<to>/<executable>

In this case, in the wrapper scripts, (like https://github.com/marcmengel/jobsub_lite/blob/master/templates/simple/simple.sh#L10), we need to make sure that not only is the group templated, but the role. The convention is that the bearer token file for a production token will be named something like dune_production.use. So in general, _.use.

This should be pretty similar to the changes that had to be made for #13 .

Condor wrappers should use condor CLI style schedd specification

Currently, our condor wrappers that query the schedd (condor_q, condor_rm, etc.) use the jobsub-style convention for specifying a schedd: cluster.job@schedd. To allow our condor wrappers to look as similar to condor as possible, we should use the same convention as HTCondor (-name <schedd> jobid).