dmwm / wmcore

Core workflow management components for CMS.

License: Apache License 2.0

Languages: Python 92.97%, Shell 0.84%, HTML 1.51%, CSS 0.38%, JavaScript 4.28%, Erlang 0.03%
Topics: dmwm, cms, workload-management, python

wmcore's Introduction

WMCore

WMCore is a meta-project providing multiple software components for the CMS Workload Management system, such as WMAgent, ReqMgr2, Global WorkQueue, WMStats and the ReqMgr2 MicroServices. It also provides core libraries to other CMS Data Management and Workload Management (DMWM) services, such as CRABServer, DBS3, Tier0 and Unified.

Documentation

General information, as well as development/testing documentation, can be found in our wiki pages.

Contributing

For guidance on setting up a development/testing environment and how to contribute to WMCore, please refer to the CONTRIBUTING guidelines.

Getting Support

For project support and discussions, please join us in our Slack support channel, or send an email to our public mailing list, cms-oc-dmwm AT cern DOT ch.

wmcore's People

Contributors

alexanderrichards, amaltaro, anpicci, bbockelm, belforte, cinquo, d-ylee, dballesteros7, dtnrm, emaszs, ericvaandering, evansde77, germanfgv, giffels, goughes, hufnagel, jha2, juztas, khurtado, lucacopa, mapellidario, mmascher, perilousapricot, samircury, stuartw, ticoann, todor-ivanov, tsarangi, vkuznet, yuyiguo


wmcore's Issues

Develop workflow summary monitoring page

We need to have something that summarizes the result of a workflow:

Files/jobs/lumi sections that failed, and the associated errors
Run time and stage-out time
Output datasets
etc.

I'll write a command line tool and develop the couch views; Seangchan can port that to the webtools stuff.

Fix duplicate requests in WorkQueue

2010-08-16 07:44:47,231:DEBUG:WorkQueueManagerReqMgrPoller:[('Batavia_Rickrollers', 'cmsdataops_100815_212516', 'http://cmssrv113.fnal.gov:1025/download?filepath=WMWorkload/ReReco/flatulent/cmsdataops/cmsdataops_100815_212516-WMWorkload.py'), ('Batavia_Rickrollers', 'cmsdataops_100815_213343', 'http://cmssrv113.fnal.gov:1025/download?filepath=WMWorkload/ReReco/nonflatulent/cmsdataops/cmsdataops_100815_213343-WMWorkload.py'), ('Batavia_Rickrollers', 'cmsdataops_100815_213008', 'http://cmssrv113.fnal.gov:1025/download?filepath=WMWorkload/ReReco/flatulent/cmsdataops/cmsdataops_100815_213008-WMWorkload.py')]
2010-08-16 07:44:47,231:INFO:WorkQueueManagerReqMgrPoller:Processing request cmsdataops_100815_212516
2010-08-16 07:44:47,825:ERROR:WorkQueueManagerReqMgrPoller:Error: There are duplicate wmspec: cmsdataops_100815_212516
2010-08-16 07:44:47,825:INFO:WorkQueueManagerReqMgrPoller:Processing request cmsdataops_100815_213343
2010-08-16 07:44:48,221:ERROR:WorkQueueManagerReqMgrPoller:Error: There are duplicate wmspec: cmsdataops_100815_213343
2010-08-16 07:44:48,221:INFO:WorkQueueManagerReqMgrPoller:Processing request cmsdataops_100815_213008
2010-08-16 07:44:48,818:ERROR:WorkQueueManagerReqMgrPoller:Error: There are duplicate wmspec: cmsdataops_100815_213008
2010-08-16 07:44:48,818:INFO:WorkQueueManagerReqMgrPoller:0 element(s) obtained from RequestManager
2010-08-16 07:44:58,842:INFO:WorkQueueManagerReqMgrPoller:Contacting Request manager for more work

Fileset closing and WMBSMergeBySize don't handle partial failures well

We need to handle the case where we're doing split-by-event or split-by-lumi processing and one of the jobs that runs over a file fails. The merge algorithm needs to mark the partial files as acquired so that the file closeout code will actually close out the fileset.

The merge algorithm does the right thing in not creating jobs for the files.

Starting components fails with the latest WMCore.

https://savannah.cern.ch/bugs/index.php?70454

We are getting the following type of error message when starting components; it seems that new tables are now required for starting components. This must be optional.

2010-07-21 13:44:35,505:CRITICAL:Harness:
PostMortem: choked when initializing with error: (ProgrammingError) (1146, "Table 'CMS_DBS3_ANZ_3.wm_workers' doesn't exist") 'DELETE FROM wm_workers \n WHERE component_id = (SELECT id FROM wm_components \n WHERE name = %s)\n ' ('DBSInsertBuffer',)
File "/uscms/home/anzar/devDBS3/external/WMCORE/src/python/WMCore/Agent/Harness.py", line 494, in startComponent
self.prepareToStart()
File "/uscms/home/anzar/devDBS3/external/WMCORE/src/python/WMCore/Agent/Harness.py", line 368, in prepareToStart
self.heartbeatAPI.registerComponent()
File "/uscms/home/anzar/devDBS3/external/WMCORE/src/python/WMCore/Agent/HeartbeatAPI.py", line 40, in registerComponent
transaction = self.existingTransaction())
File "/uscms/home/anzar/devDBS3/external/WMCORE/src/python/WMCore/Agent/Database/MySQL/InsertComponent.py", line 31, in execute
transaction = transaction)
File "/uscms/home/anzar/devDBS3/external/WMCORE/src/python/WMCore/Database/DBCore.py", line 179, in processData
returnCursor = returnCursor)
File "/uscms/home/anzar/devDBS3/external/WMCORE/src/python/WMCore/Database/MySQLCore.py", line 127, in executebinds
return DBInterface.executebinds(self, s, b, connection, returnCursor)
File "/uscms/home/anzar/devDBS3/external/WMCORE/src/python/WMCore/Database/DBCore.py", line 65, in executebinds
resultProxy = connection.execute(s, b)
File "/usr/local/lib/python2.6/site-packages/SQLAlchemy-0.5.5-py2.6.egg/sqlalchemy/engine/base.py", line 824, in execute
return Connection.executors[c](self, object, multiparams, params)
File "/usr/local/lib/python2.6/site-packages/SQLAlchemy-0.5.5-py2.6.egg/sqlalchemy/engine/base.py", line 888, in _execute_text
return self.__execute_context(context)
File "/usr/local/lib/python2.6/site-packages/SQLAlchemy-0.5.5-py2.6.egg/sqlalchemy/engine/base.py", line 896, in __execute_context
self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context)
File "/usr/local/lib/python2.6/site-packages/SQLAlchemy-0.5.5-py2.6.egg/sqlalchemy/engine/base.py", line 950, in _cursor_execute
self._handle_dbapi_exception(e, statement, parameters, cursor, context)
File "/usr/local/lib/python2.6/site-packages/SQLAlchemy-0.5.5-py2.6.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)

Add step run time to fwjrs

We should add the start time and end time to the fwjr for each step so we can easily calculate how long the step ran for.

WMAgent needs kill switch

We need, to the extent possible, the ability to stop and kill the remnants of a failing request in the likely event that something wrong is discovered after a request has been approved/started.

Currently data ops runs in a mode with PA where they hold off putting data into global DBS/PhEDEx until a workflow is in the 90-95% complete range, to avoid the pain of having to clean up the mess when they find that a large percentage of the workflow fails. With WMAgent, though, we're not running that way, so we'll need the ability to clean up the mess in the all-too-likely event that a request goes haywire. This would be a permanent, last-resort sort of operation.

Probably needs a follow-up requestor kill switch & air strike as well, but that'd be frosting.

Request Kill Switch

We need the ability to kill a request at the request manager and have it propagate down through the global workqueue and the local workqueue, and then into the agent.

Unit tests for WorkQueue splitting policy

We need unit tests verifying that splitting works correctly in the WorkQueue when creating requests with the different top-level splitting algorithms: FileBased, LumiBased and EventBased.

Task Kill Switch

We need the ability to shut off the processing of a single task in the agent. The request to kill the task would originate from the RequestManager.

Creating large filesets is slow

This affects CRAB. Eric's code snippet:

    fileList = pubdata.getListFiles()
    thefiles = Fileset(name='FilesToSplit')
    counter = 0
    for jobFile in fileList:
        if not counter % 1000:
            print "jobFile %s seconds %s" % (counter, time.time() - t1)
        counter += 1
        block = jobFile['Block']['Name']
        try:
            jobFile['Block']['StorageElementList'].extend(blockSites[block])
        except:
            continue
        wmbsFile = File(jobFile['LogicalFileName'])
        if not blockSites[block]:
            wmbsFile['locations'].add('Nowhere')
        [wmbsFile['locations'].add(x) for x in blockSites[block]]
        wmbsFile['block'] = block
        for lumi in lumisPerFile[jobFile['LogicalFileName']]:
            wmbsFile.addRun(Run(lumi[0], lumi[1]))
        thefiles.addFile(wmbsFile)

Improve Job Mask Data structure

https://savannah.cern.ch/bugs/index.php?68977

The current job mask data structure:

{FirstEvent, LastEvent, FirstLumi, LastLumi, FirstRun, LastRun}

isn't flexible enough to support analysis use cases where users want to process noncontiguous lumi sections in a single file/job. We'll need something more like the following:

{firstEvent, lastEvent, runAndLumis = {}}

Where firstEvent and lastEvent will be integers, as they are now, and runAndLumis is a dictionary keyed by run number. Each key will map to a list of lumi IDs.

We can reuse the wmbs_job_mask table and change the queries so that a job can have more than one mask, which would support the new data structure.

WMTaskSpace caches module imports. Could be shambles?

Hey,

the stepSpace() function currently caches imports. This is okay as long as we only look at the stepSpaces within Python processes that terminate shortly afterward (which is what we do), but for things like testing or future uses (none that I can think of offhand), this means that the wrong stepSpace could get loaded.

Just a placeholder ticket if something arises down the line.

best,
Andrew

JobSubmitter fails if credentials expire

https://savannah.cern.ch/bugs/index.php?65182

I just ran into a problem with the JobSubmitter, or whatever component assigns locations to jobs. It seems that my credentials had expired and the submitter had become stuck. It assigned locations to several jobs but did not update their state to "Submitted", I assume because my credentials had expired. I fixed this and restarted things, but ResourceControl thought that 30 jobs were running, mostly due to the jobs having a location. The jobs were still in the "Created" state. Is this something that should be fixed in ResourceControl, or what?

WMAgent should be able to submit to multiple different batch systems

https://savannah.cern.ch/bugs/index.php?61655

Currently we only really support submitting to one type of batch system, whatever is specified in the JobSubmitter's configuration. I think it would be useful to support submitting to multiple batch systems: Condor, GLite, whatever.

To do this we could (the first option is sketched after this list):

  • Create a JobSubmitter plugin that looks at the location and does the mapping and submitting itself
  • Add another field to the resource control db that determines which plugin to use for a site.

FailInput queries broken for multiple failures on same file

Jobs.FailInput does not consider the case where you get multiple failures on the same input file for the same subscription. You get a crash with a unique constraint violation because it tries to insert into wmbs_sub_files_failed again with the same PK.

Observed in PromptReco in the Tier0.

The fix is easy: just left outer join against wmbs_sub_files_failed in the discovery query, and only execute the insert and deletes if the discovery query returns something.

Add Sequences to MySQL/SQLite/Oracle for all WMBS Objects

We should add support for retrieving a list of sequence IDs to all of the WMBS database backends. This would make bulk object insertion much easier.

We probably need sequences for:

  • File
  • Fileset
  • Job
  • JobGroup
  • Probably not Subscriptions

We should stash the unused sequence ids in the threading storage area where we currently store the database handles.

JobCreator locks up occasionally

It looks to be a problem with ProcessPool. The poller gets hung up, while the worker thread looks like it has completed all its work.

JobSubmitter should submit older jobs first

Currently there is no ordering. This means that older workflows have a tendency to sit around instead of being closed out, as the closeout code waits for a handful of jobs that are stuck behind a large workflow.
