das's Issues

Queryspammer weighting

Weight queryspammer distributions in some quasi-real manner, so that if it is used to hammer the cache it triggers analytics appropriately.
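
A minimal sketch of one way to bias query selection, assuming the spammer currently draws uniformly from a pool of query templates; the pool and weights below are purely illustrative:
{{{
import random

# hypothetical pool of (query template, relative weight); heavier templates
# are generated more often, mimicking real usage skew
WEIGHTED_QUERIES = [
    ('dataset=/TTbar*', 10),
    ('block site=T1_*', 3),
    ('run=160915', 1),
]

def pick_query():
    """Return one query template, chosen in proportion to its weight."""
    total = sum(weight for _, weight in WEIGHTED_QUERIES)
    point = random.uniform(0, total)
    for query, weight in WEIGHTED_QUERIES:
        point -= weight
        if point <= 0:
            return query
    return WEIGHTED_QUERIES[-1][0]
}}}

Hammering the cache with pick_query() in a loop should then produce the skewed hit pattern the analytics are supposed to notice.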

Handle large docs via GridFS

It is possible that DAS will receive a doc whose size exceeds the MongoDB limit (4MB by default). In that case the bulk insert will fail for all docs in the insert sequence (due to generators). To avoid that, I need a new generator routine whose purpose is to scan each doc and pass it through if its size is < 4MB, or put it into GridFS otherwise.
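
A minimal sketch of such a routing generator, assuming pymongo/gridfs are available; the stub document layout and the 4MB constant are assumptions, not the actual DAS code:
{{{
import gridfs
from bson import BSON

MAX_DOC_SIZE = 4 * 1024 * 1024  # MongoDB default document limit

def route_docs(docs, mongodb):
    """Yield docs small enough for a bulk insert; park oversized ones in GridFS.

    'docs' is the generator normally fed to insert(); 'mongodb' is the target
    pymongo database.
    """
    fsystem = gridfs.GridFS(mongodb)
    for doc in docs:
        if len(BSON.encode(doc)) < MAX_DOC_SIZE:
            yield doc
        else:
            # store the oversized document in GridFS and yield a small stub
            # which points back to it
            fid = fsystem.put(BSON.encode(doc))
            yield {'das': doc.get('das', {}), 'gridfs_id': fid}
}}}

The bulk insert then never sees a record above the limit, so a single oversized doc no longer aborts the whole insert sequence.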

Review template quoting, sanitize templates

From #290

I didn't understand the addition of urllib quoting in, for example, das_table.tmpl. Shouldn't you use encodeURIComponent in javascript code / arguments, and urllib when quoting something originating from DAS server itself? To me it seems you are now sometimes quoting javascript itself, not the javascript variable value.

Also I note here that the quoting wasn't added universally everywhere - not in all templates, and not even systematically in the one example I happened to quote, das_table.tmpl. As I wrote before, it looks like every template needs to be sanitised. I can't easily tell which values are safe.

Configuration unification

DAS can be configured using either configparser or wmcore.configuration. The current config code has a few problems:

  • Variables with wrong type (eg mappingdb.attempts) which are only touched in rare situations.
  • Defaults are variously stored in das_readconfig, das_writeconfig and redundant dict.get statements when the config is used.
  • Defaults are provided for configparser input but not wmcore.

Provide a single layer performing validation/casting/defaults, which doesn't care whether it reads from an underlying configparser or wmcore config.
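
A sketch of what that single layer could look like; the option table entries and the raw_get callable are illustrative, not the current DAS defaults:
{{{
# one place that knows the type and default of every option:
# (section, name) -> (cast, default)
OPTIONS = {
    ('mappingdb', 'attempts'): (int, 3),
    ('mongodb', 'dbhost'): (str, 'localhost'),
    ('mongodb', 'dbport'): (int, 27017),
}

def read_option(raw_get, section, name):
    """Fetch one option through a backend-agnostic getter.

    raw_get(section, name) returns the raw value (a string for configparser,
    a native object for wmcore) or None if the option is absent.
    """
    cast, default = OPTIONS[(section, name)]
    value = raw_get(section, name)
    if value is None:
        return default
    try:
        return cast(value)
    except (TypeError, ValueError):
        raise ValueError('Invalid value for %s.%s: %r' % (section, name, value))
}}}

das_readconfig, das_writeconfig and the scattered dict.get defaults would then collapse into this one table, with the configparser vs wmcore difference confined to the raw_get callable.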

Optimise downloads

Investigate ways of optimising the transfer of large chunks of JSON, e.g. from the query "dataset", whether by socket configuration or by streaming the decoding.

Sanitize checkargs

Thank you for adding checkargs to verify parameters. It has a few flaws I'd like to see fixed (a sketch of stricter checking follows the list):

  • You don't use what you verify: some arguments are cast to strings (str(x)) before checking. You should instead verify what you will actually use.
  • You should type-check all arguments, for the reasons above. A keyword argument can be None (not given), a string (given once), or a list (if given several times).
  • The contents of many, but not all, arguments are checked. I didn't see any additional checking added for the remaining arguments elsewhere, so it looks like several vulnerabilities remain. You should always sanitise all arguments. Even if an argument is free-form input, you can often make sure it only consists of certain legitimate characters (e.g. letters only).
  • Failure to verify an argument should raise an exception.
  • Failure to check an argument should not return the argument value back to the caller. This is unsafe; you don't know what the value contains, and you have just determined it is not valid. Returning the value to the caller can be used to create XSS and other attacks. My general preference is to never return anything to the caller - simply return a suitable HTTP status code.
  • It is not sanitising the HTTP method; note that the 'method' keyword argument is not the same as the request method!
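
A sketch of what stricter checking could look like, built on a whitelist of known arguments with per-argument regular expressions; the argument names and patterns below are illustrative, not the real DAS set:
{{{
import re

# per-argument whitelists; anything not listed here is rejected outright
ALLOWED = {
    'input':  re.compile(r'^[a-zA-Z0-9_\-./*=\s]+$'),
    'idx':    re.compile(r'^[0-9]+$'),
    'limit':  re.compile(r'^[0-9]+$'),
    'format': re.compile(r'^(html|xml|json)$'),
}

def checkargs(kwargs):
    """Validate request keyword arguments; raise on anything suspect."""
    for key, value in kwargs.items():
        if key not in ALLOWED:
            raise ValueError('Unsupported argument: %s' % key)
        # a keyword can arrive as a string (given once) or a list of strings
        # (given several times); normalise before checking
        values = value if isinstance(value, list) else [value]
        for val in values:
            if not isinstance(val, str) or not ALLOWED[key].match(val):
                # deliberately do not echo the offending value back
                raise ValueError('Invalid value for argument: %s' % key)
}}}

The caller then works with exactly the values that were checked, and a failure surfaces as an exception (mapped to a suitable HTTP status code) rather than a returned value.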

Migrate init script into manage

Review note on DAS bin directory: start-up scripts should be folded into manage. We very much prefer to see everything inlined directly into the manage script without several layers of indirection, for simplicity, comprehension and transparency.

Investigate number of open connections for DAS/Mongo interaction

Follow up from #290.

Regarding open connections, they are connected sockets, i.e. sockets between DAS and MongoDB. Just ssh to cmsweb@… and run netstat -tanlp | grep ESTABLISHED | grep 27017 to see them. We currently have:
{{{
$ netstat -tanlp | grep ESTABLISHED | grep 27017 | awk '{print $NF}' | sort | uniq -c
212 4500/mongod
138 4860/python
74 4875/python
}}}
Why there are so many I can't answer. Maybe every DAS thread creates some number of connections? Note that half of the sockets are on the python side and the other half on the mongod side, as shown above.

Analytics Tasks

Better test the existing analytics tasks and add some new ones.

genkey output is not consistent

genkey() does not necessarily produce identical output for functionally identical input, which, given our reliance on qhash for finding records, is a problem.

From python reference:
"CPython implementation detail: Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions."

Example:

{{{
genkey({'fields': None, 'spec': [{'key': u'dataset.name', 'value': u'"/TTbar_1jet_Et30-alpgen/Winter09_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO"'}]})
'b255596fb3728afe13c5c078ad6f9105'
genkey({'fields': None, 'spec': [{'value': u'"/TTbar_1jet_Et30-alpgen/Winter09_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO"', 'key': u'dataset.name'}]})
'2c7d1cfc1244e5367eefe70dfeeeb321'
}}}

Here we have only trivially transposed the order of the "key" and "value" arguments, but the result is a different hash value. This problem shows up in analytics where running QueryMaintainer from the command line works but spawning it through the server doesn't, as far as I can tell just because the dictionary construction order differs. This is not because of unicode-ness of strings (tested, using json.dumps deals with this).

I will try and modify the genkey function to produce consistent output, but this is probably performance sensitive.
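
One way to make the hash independent of dictionary insertion order is to serialise with sorted keys before hashing; a sketch (not the current implementation), which would still need benchmarking given the performance concern above:
{{{
import hashlib
import json

def genkey(query):
    """Return an md5 hash that is stable under dict key reordering."""
    # sort_keys canonicalises dictionaries at every nesting level, so the two
    # transposed examples above would hash to the same value
    rep = json.dumps(query, sort_keys=True, default=str)
    return hashlib.md5(rep.encode('utf-8')).hexdigest()
}}}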

Implement DAS accumulation

DAS must support aggregation of information. Since the DAS cache server utilizes a REST model, this can be done as a series of steps:

  1. Request data via a POST request; this will create new records in the cache.
  2. Create a crontab entry and run it at a certain interval before data expiration, e.g. every 10 minutes:
  • GET the data.
  • Invalidate the expiration time and put the data back via a PUT request. At this point the cache will keep data which are expired. Set a new flag on those records to prevent them from deletion, e.g. 'accumulate':True.
  • Request the data again via a POST request.
  • Compare the data marked 'accumulate':True with the new data.
  • Do the diff/update.
  • Set a new expiration time.
  • Place the new data into the cache via a PUT request.

For that to work I need:
  • a new field in the document, e.g. accumulate;
  • the cleaner job should not delete expired documents in the cache whose accumulate flag is ON;
  • a new view in couch to select accumulated docs;
  • a new tool which uses the above logic to update docs in couch.

Replace types.XXX with isinstance

Review note on DAS: general comment, I would find the if not isinstance(x, dict) style of expression more readable than the if type(x) is not types.DictType style.

Pass dir as argument to das_map

Change this block to use an external dir parameter:
{{{
if [ "`hostname -d`" == "cern.ch" ]; then
    dir=/data/projects/das/config/maps
else
    dir=$DAS_ROOT/src/python/DAS/services/maps
fi
}}}
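
A sketch of the parameterised version, taking the maps directory as the first command-line argument and keeping the current hostname-based guess only as a fallback (argument handling is illustrative):
{{{
# usage: das_map [maps_dir]
dir=$1
if [ -z "$dir" ]; then
    if [ "`hostname -d`" == "cern.ch" ]; then
        dir=/data/projects/das/config/maps
    else
        dir=$DAS_ROOT/src/python/DAS/services/maps
    fi
fi
}}}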

Change port allocation for MongoDB

  1. Modify mongo_init script to use ports 8230-8239 for MongoDB
  2. Change DAS configuration to use this port slot for MongoDB
  3. Revisit pgrep part for sysboot in mongo_init

improve stats for DAS cli

Currently I only report stats on the init, sub-system call, and merge steps. I want to divide the sub-system stat into URL fetch time and actual DAS sub-system processing time. This can be accomplished by making a singleton DASTimer class instance and using it everywhere to collect various stats.
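
A sketch of such a singleton timer; the counter names in the usage comment are hypothetical, and the real class would need to be made thread-safe if DAS threads share one instance:
{{{
import time

class DASTimer(object):
    """Accumulate named timings in one process-wide instance."""
    def __init__(self):
        self.stats = {}    # name -> total elapsed seconds
        self._start = {}   # name -> start timestamp

    def start(self, name):
        self._start[name] = time.time()

    def stop(self, name):
        elapsed = time.time() - self._start.pop(name)
        self.stats[name] = self.stats.get(name, 0.0) + elapsed
        return elapsed

# module-level singleton, imported everywhere the stats are collected
TIMER = DASTimer()

# usage inside a data-service call (illustrative names):
#   TIMER.start('dbs.urlfetch'); ... fetch ...; TIMER.stop('dbs.urlfetch')
#   TIMER.start('dbs.process'); ... parse/merge ...; TIMER.stop('dbs.process')
}}}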

Add ability to learn data-service output keys

Add the ability to learn new maps, or reload existing ones, from the output of a data-provider. For example, by learning about the keys in the output of some query, I can record in DAS what this data-service is capable of providing. For instance, a user types

run=123

DAS queries RunSummary and gets output which contains L1Trigger. So DAS can learn from this output that RunSummary provides information about L1Trigger for the query run=123. If this info is captured, I can improve the DAS input fields. For example, I can store associative keys, together with the data-service, in a separate collection. Those keys can be used as "helpers" in a DAS input query, so a user can type

l1 trigger

and DAS can reply: ah, I know a data-service which provides this, and in order to get the l1 trigger you must supply your run number.

We can apply some word processing to allow different linguistic combinations.

This way DAS will gain knowledge of what each data-service can provide, which can improve search and allow suggestions.

Review overview session

Review note on DAS: in the overview plotfairy, the version and session arguments are unnecessary and can be omitted.

Comment 23 follow-up: I guess it wasn't clear enough, but "session arguments" meant "session" and "version". All you need is the actual data arguments. Also, I would prefer they were deleted, not just commented out.

Need custom DAS map-reduce for Oli use case

Oli wants to have custom views in DAS to get his data:

''Essentially the sum of data for each T1 site for each combination of
acq era, tier, custodial/non-custodial.
''

I think it can be accomplished as a 2-step procedure in DAS.

  1. DAS asks DBS3/Phedex for dataset/block info:
  • DAS asks DBS3 for the list of all datasets. This brings the tier/era info into DAS.
  • DAS asks Phedex for the list of all blocks. This brings the block info, which contains the replicas, into DAS.
  2. We develop a script which loops over all unique tier/era combinations and asks, for each of them, for the sum of replicas from the stored blocks (see the sketch below).
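
A sketch of what step 2 could look like as a MongoDB map-reduce over the merged records; db is a pymongo handle to the DAS cache, and the field names (site, era, tier, custodial, replica_size) are hypothetical placeholders for whatever the merged documents actually carry:
{{{
from bson.code import Code

mapper = Code("""
function() {
    // one key per (site, era, tier, custodial) combination
    emit({site: this.site, era: this.era, tier: this.tier,
          custodial: this.custodial},
         this.replica_size);
}""")

reducer = Code("""
function(key, values) {
    return Array.sum(values);
}""")

# the result collection holds the summed replica size per combination
result = db.merge.map_reduce(mapper, reducer, out="t1_summary")
for row in result.find():
    print(row['_id'], row['value'])
}}}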

Remove circular dependencies

utils/das_config.py calls

from DAS.utils.das_cms_config import read_wmcore

while utils/das_cms_config.py calls

from DAS.utils.das_config import DAS_OPTIONS

The remedy is to merge them together.

DAS parser cache

RE-based PLY parsing is easier than writing our own ad-hoc parser but is quite expensive. Add a (capped?) mongodb collection to store the parsed versions of string queries, and intercept new queries appropriately.
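
A sketch of such a cache, assuming a capped collection keyed by a hash of the raw query string and that the parsed form is BSON-serialisable; names are illustrative:
{{{
import hashlib
from pymongo import Connection  # pymongo API of that era (MongoClient in newer versions)

conn = Connection('localhost', 27017)
dasdb = conn['das']
if 'parserdb' not in dasdb.collection_names():
    # capped collection: old parses are rotated out automatically
    dasdb.create_collection('parserdb', capped=True, size=1024 * 1024)
parserdb = dasdb['parserdb']

def parse_query(query, ply_parser):
    """Return the parsed form of a DAS query string, consulting the cache first."""
    qhash = hashlib.md5(query.encode('utf-8')).hexdigest()
    record = parserdb.find_one({'qhash': qhash})
    if record:
        return record['parsed']
    parsed = ply_parser.parse(query)   # the expensive PLY step
    parserdb.insert({'qhash': qhash, 'query': query, 'parsed': parsed})
    return parsed
}}}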

Code audit: DAS

Done.
The %post section has been reviewed and cleaned up.

Fix problem with record's count

Right now, to get the total number of results, I invoke count. Since I added empty records for services which do not return results, I should exclude them from the count of results for a given query. Should be trivial, e.g.
db.merge.find(spec).count()
where spec contains the query and the non-existence of 'das.empty_record'.
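
A sketch in pymongo terms, assuming the merged records carry the query hash in a qhash field:
{{{
def count_results(merge_coll, qhash):
    """Count merged records for a query, skipping the empty placeholder records."""
    spec = {'qhash': qhash, 'das.empty_record': {'$exists': False}}
    return merge_coll.find(spec).count()
}}}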

Analytics web server help

We need a help section for the DAS analytics web server. It should describe the meaning of the sections, e.g. Main, Control. It should provide examples (some description and a png image) of how to submit certain tasks, and examples (png images) of what we should see when tasks are running, etc.

This will help train DAS operators.

AnalyticsDB atomic operations

Restore analyticsDB to using unique qhash records, with an array of hit times. Provide a workaround for the inability to $pull with conditions in mongodb<=1.6 (see the sketch below). Determine the interplay of capped collections and updating existing objects instead of inserting new ones.

Relatedly, consider making sure all related documents for a given query are removed from analytics together.
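
A sketch of the client-side workaround for the missing conditional $pull, assuming one analytics record per qhash with a hits array of timestamps; field names are illustrative:
{{{
import time

def prune_old_hits(analytics, qhash, max_age=86400):
    """Drop hit timestamps older than max_age from a query's analytics record.

    mongodb <= 1.6 cannot $pull with a comparison condition, so filter the
    array on the client side and write it back with $set.
    """
    record = analytics.find_one({'qhash': qhash})
    if not record:
        return
    cutoff = time.time() - max_age
    hits = [stamp for stamp in record.get('hits', []) if stamp >= cutoff]
    analytics.update({'qhash': qhash}, {'$set': {'hits': hits}})
}}}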

DAS aggregators need to show the record if possible

When using certain aggregators, e.g. max, min, I should be able to show the record itself rather than just the min/max value of the requested field. For instance, if a user types

find block | max(block.size)

I should not only show the max block.size, but also a link to the record with this value.

Need init script for DAS analytics web

Eventually we will need to add DAS analytics into the DAS manage/init script. I need to know how to start/stop the DAS analytics web server, how to check its status, etc. A basic skeleton of an init script would be useful.

MongoDB replication/sharding

Explore mongo replication. I can have two nodes: one used as a raw cache for user on-demand queries, while the other can be used by a populator to replicate data from data-services.
Explore mongo sharding, where we define a sharding key, e.g. block.

Review expiration timestamps for data-services

All APIs use a 3600 sec expiration timestamp (fine for testing) which needs to be adjusted to real-case scenarios. I think DBS/Phedex should have 10-15 minutes, SiteDB around 1 hour, etc.

New query handling

Currently, a query is a raw python dictionary. I propose that this be replaced by a wrapper class, with the following rationale:

  • It is often not clear in the code whether an argument should be encoded for storage {spec:[{key: name, value: value}]} or decoded for search {spec:{name: value}}. This can be replaced by query.encoded() and query.decoded(), which cache the results of the transformation in the object.
  • qhash is repeatedly calculated for the same query, sometimes in optimisable cases (some places in AbstractService where it is calculated in both the caller and the inner context of a function), but in many cases it is calculated separately; by caching the result in the object after one calculation we can save on this.
  • It is currently very difficult to pass flags around with a query without modifying every function it might pass through, or storing them in the query dictionary (and making everything that uses or hashes it aware). Flags I am currently thinking of are "hide_from_analytics" (currently a function argument) and "force_update" (currently not available; I've not had much success hacking this into AbstractService). Conceivably there are others.
  • Some currently standalone functions (bare query, looser query, query comparison) can be moved into the query class instead of being standalone functions. A sketch of such a wrapper follows this list.
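
A sketch of what such a wrapper could look like; the method and flag names follow the rationale above, and the encoding details are illustrative:
{{{
from DAS.utils.utils import genkey  # existing DAS helper (import location assumed)

class DASQuery(object):
    """Wrap a raw DAS query dict, caching its encoded/decoded forms and qhash."""
    def __init__(self, query, **flags):
        self._query = query        # decoded form: {'spec': {name: value}, ...}
        self._encoded = None
        self._qhash = None
        self.flags = flags         # e.g. hide_from_analytics=True, force_update=True

    def decoded(self):
        return self._query

    def encoded(self):
        if self._encoded is None:
            spec = [{'key': key, 'value': val}
                    for key, val in self._query.get('spec', {}).items()]
            self._encoded = {'fields': self._query.get('fields'), 'spec': spec}
        return self._encoded

    def qhash(self):
        if self._qhash is None:
            self._qhash = genkey(self.encoded())
        return self._qhash
}}}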

Analytics Web

Fix the analytics web so that there is:

  • an appropriate interface between the analytics daemon and the web, probably a capped collection
  • a proper web interface
  • templating of outputs
  • plotfairy integration?

Provide configurable location for parsertab.py

To avoid the creation of parsertab.py in the DAS install area, I need to make its location a configurable parameter. This will ease the issue on cmsweb and allow it to live in the /data/projects/das area instead of the DAS source code area.
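
PLY's yacc() already accepts an output directory, so the change could be as small as threading a config value through to it; the parserdir key below is a hypothetical config option:
{{{
import ply.yacc as yacc

def build_parser(grammar_module, config):
    # hypothetical config key, defaulting to the cmsweb project area
    outdir = config.get('parserdir', '/data/projects/das')
    return yacc.yacc(module=grammar_module, tabmodule='parsertab',
                     outputdir=outdir, debug=0)
}}}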

Respect expire timestamp from DASJSON headers

Add a new method to abstract_service to respect the DASJSON header. The new tier0 service is already DAS compliant: it ships data with a DASJSON header which contains the results as well as an expire timestamp. I need to parse this info correctly.

Request for python config files

DAS currently uses .ini style configuration files (das.cfg). For CMSWEB deployment we would strongly prefer python configuration files as they provide much greater ability to make the configuration location and user independent. This should also make development easier since the same configuration can be used unchanged.

For an example of location independence eased by python, please see the DQM GUI 'devtest' configuration, which works out of the box for any user on any computer system - P5, CERN GPN (lxplus, lxbuild), desktops, laptops, and outside CERN.

https://twiki.cern.ch/twiki/bin/view/CMS/DQMTest#Specific_details

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/DQM/Integration/config/

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/DQM/Integration/config/server-conf-devtest.py?revision=HEAD&view=markup

See specifically the use of BASEDIR and CONFIGDIR to achieve relocation. You can also see other, more complex host-specific adaptations for online in:

http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/DQM/Integration/config/server-conf-online.py?revision=1.56&view=markup
