geoscienceaustralia / hiperseis
High Performance Seismologic Data and Metadata Processing System
License: GNU General Public License v3.0
We have some additional stations in the ENGDAHL/ISC events that are not available in our seiscomp3.
For now, if possible, any workflow that requires station information should use the files that were emailed by ISC recently. Alternatively, for any statistical measure (like inputs to inversion) we can ignore those specific stations.
The script https://github.com/GeoscienceAustralia/passive-seismic/blob/pst-12/CWB/cwbingest1month.sh is currently running and is automatically ingesting the CWB waveform data into the seiscomp3 AWS instance "niket_pst_poc_latest". This script takes care of working around the hurdle described in Issue #17.
Will update this issue once the waveform data ingestion is complete.
Integration of obspy Pick object, with the following features:
Jira - PST-194
Arrivals that don't satisfy the max residual criteria for P and S waves (filter stage 1) are discarded, and they should not be considered for the observed travel time median filter (stage 2). A sketch of the two stages follows.
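A minimal sketch of the two-stage filter, assuming the arrivals live in a pandas dataframe; the column names ('phase', 'residual', 'observed_tt'), the thresholds, and the travel-time tolerance are all illustrative assumptions, not the real configuration:

import pandas as pd

# Illustrative thresholds only; the real limits live in the cluster config.
MAX_RESIDUAL = {'P': 5.0, 'S': 10.0}

def filter_arrivals(df, tt_tolerance=10.0):
    """Two-stage filter over an arrivals dataframe with hypothetical
    columns 'phase', 'residual' and 'observed_tt'."""
    # Stage 1: discard arrivals exceeding the per-phase max residual.
    # Phases other than P and S map to NaN and are dropped too.
    limits = df['phase'].map(MAX_RESIDUAL)
    stage1 = df[df['residual'].abs() <= limits]
    # Stage 2: median filter on observed travel times, computed only
    # from the arrivals that survived stage 1.
    median_tt = stage1['observed_tt'].median()
    return stage1[(stage1['observed_tt'] - median_tt).abs() <= tt_tolerance]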
My tests found that for some (or all?) antelope events the preferred origins are missing arrivals. We need to investigate.
pytest test_cluster.py
These are the preferred origins for the two events showing no arrivals.
/home/sudipta/repos/passive-seismic/tests/mocks/events/00974105.xml
Origin
resource_id: ResourceIdentifier(id="quakeml:ga.ga.gov.au/origin/1126705")
time: UTCDateTime(2015, 3, 3, 2, 44, 20, 169000)
longitude: 176.1633
latitude: -14.8737
depth: 10000.0
quality: OriginQuality(associated_phase_count=17, used_phase_count=13)
evaluation_mode: 'automatic'
evaluation_status: 'preliminary'
creation_info: CreationInfo(author='regpMwp ms', version='1425351689.25')
---------
comments: 1 Elements
/home/sudipta/repos/passive-seismic/tests/mocks/events/00967739.xml
Origin
resource_id: ResourceIdentifier(id="quakeml:ga.ga.gov.au/origin/1133422")
time: UTCDateTime(2015, 3, 21, 0, 18, 54, 304000)
longitude: 118.4494
latitude: -9.9875
depth: 0.0
quality: OriginQuality(associated_phase_count=29, used_phase_count=18)
evaluation_mode: 'automatic'
evaluation_status: 'preliminary'
creation_info: CreationInfo(author='rega', version='1426898089.18')
---------
comments: 1 Elements
There are many such stations that are present neither in the ISC stations list nor in our seiscomp3. We can ignore the arrivals from these stations since we have millions of arrivals, but this is also something we can contribute back to ISC in return for their cooperation.
This step is very slow and memory intensive in one process, as we can have many millions of groups:
https://github.com/GeoscienceAustralia/passive-seismic/blob/8507212e5e5e189ffa4d55e61cfba2a012db3107/seismic/cluster/cluster.py#L325
We can use a parallel pandas implementation (https://dask.readthedocs.io) to improve both time and memory requirements; see the sketch below.
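A sketch only: the real grouping lives in seismic/cluster/cluster.py, and the file name and columns ('source_block', 'station_block', 'residual') here are assumptions for illustration.

import dask.dataframe as dd

# dask reads the CSV in partitions and evaluates the groupby lazily,
# so the millions of groups never have to fit in memory at once.
ddf = dd.read_csv('matched_arrivals-*.csv', blocksize='64MB')
summary = (ddf.groupby(['source_block', 'station_block'])['residual']
              .agg(['mean', 'count'])
              .compute(scheduler='processes'))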
Enable/Configure fdsnws for querying the SC3 server for station metadata and waveform data.
Currently the iloc/rstt installation script is not checked via CI, and iloc/rstt has no tests.
We have 2 options for using the receiver function codebase at https://github.com/trichter/rf.git to generate the analysis over the temporary stations data:
Ingest waveform data, temporary station xmls and event data into a Seiscomp3 instance and then use the receiver function codebase as is by creating an FDSN client connected to the FDSN webservice running on the Seiscomp3 instance.
Directly call the receiver function code on the waveform miniseed and skip the ingestion process, as suggested by @alexgorb.
Both these approaches have their merits and demerits.
The first approach is easier, less error-prone and faster to implement because we would be using the receiver function code base without any tweaking. But it adds the extra step of ingesting everything into Seiscomp3.
The second approach will require tweaking the receiver function code base to a great extent, primarily how it extracts the waveform centred around the event and uses several routines related to rfstats, which in turn use others. The routines are scantily documented: only one line describes what each routine does, and the code inside the routines is largely undocumented.
Will have another discussion with @rh-downunder and @alexgorb before taking a call on which approach to pursue.
@alexgorb We have received a small subset of the temporary station survey dataset from Jason. But he mentioned that it's raw data and will need to be qa/qc-ed before consumption. He has also passed on a small subset of the 7D network which he says is better quality. I will load this data and analyse it.
By 41c83f7, we can process a small number of events, say up to 5k. This works in a single process, with in-memory sort/filter/join operations on pandas dataframes.
However, we need to process upwards of 500k events; from ISC and Engdahl alone we have 300k+ events.
This is how this can proceed:
1. slarchive (can use scart)
2. scevtstreams and scart combination
3. antelope
POC target: 1 month's worth of historical data for all primary stations in AWS.
Example of cwb query: query -h localhost -t ms -s ".*" -b "2005-02-08 05:37:52.98" -d 1000
Clarification on 3: an antelope virtualenv cannot be set up inside our sc3 image in AWS since antelope uses proprietary python libraries. Instead you will have to export the events in seiscomp3 xml format and copy them across to the sc3 image. This part will require some ingenuity to automate.
From the stations information shared by ISC in text files, extract station coordinates that can be used in many applications.
Specifically, this is required by #35.
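A minimal parsing sketch; the column layout assumed here (station code first, then latitude and longitude as the next two floats, whitespace separated) must be checked against the real ISC text files.

def read_isc_station_coords(path):
    coords = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank/short lines
            try:
                lat, lon = float(parts[1]), float(parts[2])
            except ValueError:
                continue  # header or malformed line
            coords[parts[0]] = (lat, lon)
    return coords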
Ingest via scart and verify that this data can be queried via sc3. This is dependent on the item above.
These are the events that @alexgorb passed on to us in text files.
@alexgorb says we need to have a provision to use the events as reported by ISC. We can use this service for that: http://www.isc.ac.uk/iscbulletin/search/bulletin/
We can download quakeml files, use obspy to convert them to sc3ml, and ingest into seiscomp3.
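A minimal conversion sketch, assuming obspy >= 1.1 (which added SC3ML write support) and a QuakeML file already downloaded from the ISC bulletin search page; the file names are placeholders.

from obspy import read_events

cat = read_events('isc_bulletin.xml', format='QUAKEML')
cat.write('isc_bulletin_sc3ml.xml', format='SC3ML')
# The sc3ml output can then be ingested into seiscomp3, e.g. with scdb.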
Jira - PST-193
Run receiver function analysis on a given AU station for ISC events for March 2015
We do this once the earthmon db can be moved into an aws PostgreSQL db.
Wendy is tracking progress of the db migration.
Testing found the following. In engdahl event 8881.xml we have these two picks. They are identical except for the pick id. Is that correctly parsed as reported by Engdahl?
<pick publicID="smi:engdahl.ga.gov.au/pick/2418177">
<time>
<value>2008-04-11T10:04:27.000900Z</value>
</time>
<waveformID networkCode="" stationCode="CHTO" channelCode="BHZ"/>
<backazimuth>
<value>85.91</value>
</backazimuth>
<onset>emergent</onset>
<phaseHint>P</phaseHint>
<evaluationMode>automatic</evaluationMode>
<creationInfo>
<agencyID>ga-engdahl</agencyID>
<agencyURI>smi:engdahl.ga.gov.au/ga-engdahl</agencyURI>
<author>niket_engdahl_parser</author>
<creationTime>2017-11-10T07:53:57.118115Z</creationTime>
</creationInfo>
</pick>
<pick publicID="smi:engdahl.ga.gov.au/pick/2418178">
<time>
<value>2008-04-11T10:04:27.000900Z</value>
</time>
<waveformID networkCode="" stationCode="CHTO" channelCode="BHZ"/>
<backazimuth>
<value>85.91</value>
</backazimuth>
<onset>emergent</onset>
<phaseHint>P</phaseHint>
<evaluationMode>automatic</evaluationMode>
<creationInfo>
<agencyID>ga-engdahl</agencyID>
<agencyURI>smi:engdahl.ga.gov.au/ga-engdahl</agencyURI>
<author>niket_engdahl_parser</author>
<creationTime>2017-11-10T07:53:57.118115Z</creationTime>
</creationInfo>
</pick>
The process needs to be fully automated and should work like:
executable/python_script event_id/event.xml -d seiscomp3_inventory_db(xml)/seiscomp3_db
The output should be all stations that satisfy the criteria, with their respective coordinates.
This is step 2 in this process.
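A sketch of this step, assuming the inventory xml is readable by obspy (it can parse SC3ML inventory files) and using a placeholder 95-degree distance cutoff in place of the real criteria, which are not spelled out here.

import sys
from obspy import read_events, read_inventory
from obspy.geodetics import locations2degrees

def stations_for_event(event_xml, inventory_xml, max_dist_deg=95.0):
    event = read_events(event_xml)[0]
    origin = event.preferred_origin() or event.origins[0]
    for net in read_inventory(inventory_xml):
        for sta in net:
            dist = locations2degrees(origin.latitude, origin.longitude,
                                     sta.latitude, sta.longitude)
            if dist <= max_dist_deg:
                print(net.code, sta.code, sta.latitude, sta.longitude)

if __name__ == '__main__':
    stations_for_event(sys.argv[1], sys.argv[2])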
.... in event objects during custom picking should contain associated amplitudes for the picks.
Required for location algorithms to run.
The amplitudes need to be associated with the corresponding picks as in the obspy/seiscomp3 datamodel.
Jira - PST-192
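A minimal sketch of the obspy datamodel association: the Amplitude points back at its Pick through pick_id. All values here are placeholders.

from obspy import UTCDateTime
from obspy.core.event import Amplitude, Pick, WaveformStreamID

pick = Pick(time=UTCDateTime('2015-03-03T02:45:00'),
            phase_hint='P',
            waveform_id=WaveformStreamID(network_code='AU',
                                         station_code='ARMA',
                                         channel_code='BHZ'))
amplitude = Amplitude(generic_amplitude=1.2e-6,  # placeholder value, metres
                      unit='m',
                      pick_id=pick.resource_id)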
We need to separate the clustered datafile into 3 files:
Region can be specified like the following:
upperlat = 90.
bottomlat= 50.
leftlon = 240.
rightlon = 280.
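A sketch of the split using the bounds above; the input file name and the 'source_lat'/'source_lon' column names are assumptions, and the criterion for the third file is not specified in this issue, so only the region/rest split is shown.

import pandas as pd

upperlat, bottomlat, leftlon, rightlon = 90., 50., 240., 280.
df = pd.read_csv('clustered_arrivals.csv')
lon = df['source_lon'] % 360  # the bounds above use 0-360 longitudes
in_region = (df['source_lat'].between(bottomlat, upperlat) &
             lon.between(leftlon, rightlon))
df[in_region].to_csv('region.csv', index=False)
df[~in_region].to_csv('global.csv', index=False)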
The following cwb query data issues have been analysed so far:
java -jar -Xmx1600m ~/CWBQuery/CWBQuery.jar -h 54.153.144.205 -t dcc -s "IUPET..BH200" -b "2015/03/01 00:00:00" -d 31d
The output generated is:
centos@proc:~/miniseed$ java -jar -Xmx1600m ~/CWBQuery/CWBQuery.jar -h 54.153.144.205 -t dcc -s "IUPET..BH200" -b "2015/03/01 00:00:00" -d 31d
04:13:25 Query on IUPET BH200 184868 mini-seed blks 2015 060:00:00:00.0499 2015 090:00:00:00.000 ns=51312360 #dups=8
04:15:25.869 Thread-2 ReadTimeout went off. close the socket 120602 waitlen=120000 sock.isCLosed()=false loop=162
And the size of the miniseed file being generated just blows up, reaching 100 GB plus. If I abort this query, the returned file's size drops to zero after some time, so I cannot analyse it either. Is this a known issue for some network/station/channel combinations?
For some net/sta/cha combinations, the number of blocks for a given time period is higher than for others. It's possible that the sampling frequency for those combinations is higher. Can we query the sampling frequency of such combinations, so we can better manage the query by dividing it into appropriate time periods? Otherwise, the response for such combinations takes too long (sometimes half an hour) and is returned in batches, which makes it difficult to manage. If you know any other pertinent information, please do share.
On querying CWB for a given time period, the response miniseed files have data that overflows the requested time window. For example, when I query as below:
centos@proc:~/miniseed$ java -jar -Xmx1600m ~/CWBQuery/CWBQuery.jar -h 54.153.144.205 -t dcc -s "GEFAKI.BHZ" -b "2015/03/15 10:00:00" -d 7200
04:34:01 Query on GEFAKI BHZ 000299 mini-seed blks 2015 074:09:59:42.8695 2015 074:12:00:15.370 ns=144651 #dups=0
299 Total blocks transferred in 463 ms 645 b/s 0 #dups=0
centos@proc:~/miniseed$ ls -la
total 53424
drwxrwxr-x 2 centos centos 126 Sep 19 04:34 .
drwx------. 7 centos centos 270 Sep 19 04:14 ..
-rw-rw-r-- 1 centos centos 139264 Sep 19 04:34 GEFAKI_BHZ__.msd
And when I look at the starttime and endtime for this miniseed:
In [1]: from obspy.core import read
In [2]: st = read('GEFAKI_BHZ__.msd')
In [3]: tr = st[0]
In [4]: tr.stats
Out[4]:
network: GE
station: FAKI
location:
channel: BHZ
starttime: 2015-03-15T09:59:42.869538Z
endtime: 2015-03-15T12:00:15.369538Z
sampling_rate: 20.0
delta: 0.05
npts: 144651
calib: 1.0
_format: MSEED
mseed: AttribDict({'record_length': 4096, 'encoding': u'STEIM2', 'filesize': 139264, u'dataquality': u'D', 'number_of_records': 34, 'byteorder': u'>'})
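One workaround sketch, independent of the server-side fix: trim the over-full response back to the requested 7200 s window after reading it with obspy.

from obspy import UTCDateTime
from obspy.core import read

st = read('GEFAKI_BHZ__.msd')
start = UTCDateTime('2015-03-15T10:00:00')
st.trim(starttime=start, endtime=start + 7200)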
Travis/CircleCI?
We need ellipticity correction for travel times before we use the travel times in inversion products.
We need: ftp://rses.anu.edu.au/pub/ak135/ellip
and that requires this: ftp://rses.anu.edu.au/pub/ak135/tau
@alexgorb Did I get these right? Can you fill the remaining ones?
nblock: block id of the source
nst: block id of the station
resid: time residual of arrival
nev: ?
vlon: source longitude
vlat: source latitude
dep: source depth
slon: station longitude
slat: station latitude
obstt: ?
The output of this exercise will be used in the 3D inversion model.
The format of the output needs to be the following with the column names being as above in that order:
17503 266751 0.3 592032 55.673 86.934 21.6 87.695 43.628 484.800 43.800 1
17503 272459 0.8 592032 55.673 86.934 21.6 74.620 42.637 490.800 44.500 1
17503 288465 -0.1 592032 55.673 86.934 21.6 116.175 39.850 522.400 48.700 1
17503 291184 -1.4 592032 55.673 86.934 21.6 75.980 39.266 516.100 47.900 1
17503 292720 -0.1 592032 55.673 86.934 21.6 99.814 39.221 522.200 48.600 1
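A formatting sketch for one output row in the column order above; the field widths are guesses from the sample rows, and the two unnamed trailing columns are omitted until their meaning is confirmed.

row_fmt = ('{nblock:6d} {nst:7d} {resid:5.1f} {nev:7d} {vlon:8.3f} '
           '{vlat:8.3f} {dep:5.1f} {slon:8.3f} {slat:8.3f} {obstt:8.3f}')
rec = dict(nblock=17503, nst=266751, resid=0.3, nev=592032,
           vlon=55.673, vlat=86.934, dep=21.6,
           slon=87.695, slat=43.628, obstt=484.800)
print(row_fmt.format(**rec))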
Changes in f5d811c bring most of convert_logs under coverage reporting. However, that was accomplished by avoiding system calls and multiprocessing calls during tests.
This can provide coverage reporting while using multiprocessing/system calls.
It may need obspy hacking and more exception handling.
We need a docker image corresponding to our seiscomp3 image from aws.
Currently, the ISC events downloaded from the bulletin website have the preferred origin missing. The code to add the most appropriate preferred origin is checked in at bdef9d5#diff-c276ddde798091163facc2a874f3553b, but it still has to be run on the ISC events that are backed up at s3://pyrobots-backup/niket/failed-attempts/. Also, the event type needs to be changed from "other" to "earthquake" for the events to be ingestible into Seiscomp3.
There are many ways of doing this; we can use an obspy client.
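A hedged sketch only; the checked-in code at bdef9d5 has the real logic. Taking the last origin as preferred is a placeholder heuristic, the file names are made up, and SC3ML writing assumes obspy >= 1.1.

from obspy import read_events

cat = read_events('isc_event.xml')
for event in cat:
    if event.preferred_origin() is None and event.origins:
        event.preferred_origin_id = event.origins[-1].resource_id
    event.event_type = 'earthquake'  # was "other"; needed for ingestion
cat.write('isc_event_fixed.xml', format='SC3ML')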
This is a derived issue from: #23
We need to be able to accept an ISF file and a P/S pick, and output a new ISF with the new pick appended.
Reference from @rh-downunder:
https://ds.iris.edu/media/workshop/2013/01/advanced-studies-institute-on-seismological-research/files/lecture_introrecf.pdf
References from @alexgorb:
Codes:
https://seiscode.iris.washington.edu/projects/rfsyn
https://github.com/trichter/rf
https://github.com/iwbailey/processRFmatlab
Reading: https://openresearch-repository.anu.edu.au/bitstream/1885/49353/6/04chapter03.pdf
Other:
https://academic.oup.com/gji/article/164/3/551/732954/On-the-use-of-teleseismic-receiver-functions-for
https://academic.oup.com/gji/article/168/1/171/578851/Teleseismic-wavefield-interpolation-and-signal
http://authors.library.caltech.edu/73702/2/410.full.pdf
https://sbgf.org.br/revista/index.php/rbgf/article/viewFile/189/59
https://link.springer.com/article/10.1007/s11589-003-0045-2
https://dl.acm.org/citation.cfm?id=3075895
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.905.466&rep=rep1&type=pdf
The goal of this task is to explore the receiver function concept, look at the available implementations and integrate with the waveform and event metadata pipeline hosted on the seiscomp3 POC.
@alexgorb, thanks for the chalk and talk session yesterday. I am also attaching the picture of the board.
My research indicates that all we need is IMS1.0/ISF1.0 or ISF2.0 input files and a station list file.
The isf_stationfile is a simple comma and space separated text file with stationcode, alternate_stationcode, lat, lon, elevation.
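A minimal parser sketch for the isf_stationfile described above; it accepts both comma and whitespace separation between the five fields, which is an assumption to verify against real files.

def read_isf_stations(path):
    stations = {}
    with open(path) as f:
        for line in f:
            fields = line.replace(',', ' ').split()
            if len(fields) < 5:
                continue  # skip headers/blank lines
            code, alt_code, lat, lon, elev = fields[:5]
            stations[code] = (alt_code, float(lat), float(lon), float(elev))
    return stations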
This is desirable for many reasons including:
Analyse the data available at /g/data/ha3/Passive/_ANU/7D(2012-2013)/ASDF and extract metadata for all temp stations for receiver function analysis
Need to work with historical events in seiscomp3.