xmidt-org / talaria
The Xmidt routing agent.
License: Apache License 2.0
When I try to start the talaria service, I get the following error:
"Unable to initialize Viper environment: Config File 'talaria' Not Found in '[/etc/talaria / /home/ddp/.talaria /home/ddp/Projects/talaria/src/talaria]".
I assume a configuration file is necessary to start the service correctly, but I didn't find any documentation for it. Could you please provide some examples of a configuration file?
Dependent on xmidt-org/wrp-go#53.
If the device sends Talaria a WRP with an empty Device ID field, Talaria should fill it with what it believes to be the Device ID of that device.
Prometheus query: sort((xmidt_talaria_device_count{instance=~"talaria-prod-.+", stage="prod", region="hoa"}))
Returns data from 123 talaria. There are 126 talaria in the region.
In addition the values returned are inconsistent as some show 123, 125 and only 2 show 126.
Possible areas to look at: no data is being recorded for the metric being checked, a config error in Prometheus, or talaria is doing something that makes the data incorrect or not logged.
Talaria does not prevent devices from connecting that would not normally be hashed to it. For example, in a 2 talaria cluster a device with mac:112233445566 could connect directly (or be misdirected) to either one and would remain there until a rehash event. This affects addressability of the device and performance of the cluster.
The talaria(s) to which the device isn't hashed should not allow the connection in the first place.
Send websocket close frame from websocket server talaria to client with reason for disconnect.
This will help differentiate cases where the server forcefully closed the connection versus cases when the client feels the socket on the other end closed due to unknown reasons.
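As a rough illustration of what the close frame would carry, here is a minimal sketch of the close payload layout from RFC 6455 (a 2-byte big-endian status code followed by a UTF-8 reason). The function names and the "ping miss" reason are assumptions for illustration; Talaria's websocket library would normally build this frame itself.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf8"
)

// Close status codes defined by RFC 6455, section 7.4.1.
const (
	closeNormal    uint16 = 1000
	closeGoingAway uint16 = 1001
	closePolicy    uint16 = 1008
)

// closePayload builds the body of a websocket close frame. Control frame
// payloads are limited to 125 bytes, so the reason is trimmed to at most
// 123 bytes, dropping whole runes so the text stays valid UTF-8.
func closePayload(code uint16, reason string) []byte {
	const maxReason = 123
	for len(reason) > maxReason {
		_, size := utf8.DecodeLastRuneInString(reason)
		reason = reason[:len(reason)-size]
	}
	buf := make([]byte, 2+len(reason))
	binary.BigEndian.PutUint16(buf, code)
	copy(buf[2:], reason)
	return buf
}

// parseClosePayload is the client-side counterpart. 1005 is the reserved
// code meaning "no status code was present".
func parseClosePayload(p []byte) (uint16, string) {
	if len(p) < 2 {
		return 1005, ""
	}
	return binary.BigEndian.Uint16(p), string(p[2:])
}

func main() {
	p := closePayload(closePolicy, "ping miss")
	code, reason := parseClosePayload(p)
	fmt.Println(code, reason)
}
```

With a reason in the frame, a client can tell a deliberate server-side close apart from a socket that simply went away.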
For scalability inside kubernetes, talaria should register twice with consul: once with the kubernetes dns for scytale and once with external dns so devices can connect to the instance via petasos.
The sync.Map construct in the Go standard library may be a good match for Talaria's connected devices. We should:
sync.Map
sync.Map
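To make the idea concrete, here is a minimal sketch of a device registry backed by sync.Map. The device and registry types are hypothetical stand-ins, not Talaria's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// device is a hypothetical stand-in for a connected device.
type device struct {
	ID string
}

// registry tracks connected devices by ID.
type registry struct {
	devices sync.Map // device ID (string) -> *device
}

// add stores the device unless one with the same ID is already present.
// It reports whether the device was newly added.
func (r *registry) add(d *device) bool {
	_, loaded := r.devices.LoadOrStore(d.ID, d)
	return !loaded
}

// get looks up a device by ID.
func (r *registry) get(id string) (*device, bool) {
	v, ok := r.devices.Load(id)
	if !ok {
		return nil, false
	}
	return v.(*device), true
}

// remove deletes a device by ID; deleting a missing ID is a no-op.
func (r *registry) remove(id string) {
	r.devices.Delete(id)
}

// count walks the whole map, so it is O(n).
func (r *registry) count() int {
	n := 0
	r.devices.Range(func(_, _ interface{}) bool {
		n++
		return true
	})
	return n
}

func main() {
	r := &registry{}
	r.add(&device{ID: "mac:112233445566"})
	r.add(&device{ID: "mac:4ca155000006"})
	fmt.Println(r.count())
}
```

The sync.Map documentation notes it is tuned for keys that are written once and read many times, or for goroutines working on disjoint key sets, so any switch would be worth benchmarking against a mutex-guarded map first.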
The connection information (SessionID, trust, and time connected) should be added to all wrps that go through talaria, which means talaria has to re-encode every message.
Is the connection ID unique from one talaria to the next? Otherwise, there could be a small chance that two devices of the same ID connect to two talaria in different data centers with the same connection ID.
No clear way to reproduce this yet, but it is seen during a decent % of API load against the /device WRP endpoint.
fatal error: concurrent map iteration and map write
goroutine 1277169 [running]:
runtime.throw(0xcf8f22, 0x26)
/usr/lib/golang/src/runtime/panic.go:774 +0x72 fp=0xc00e249320 sp=0xc00e2492f0 pc=0x430ee2
runtime.mapiternext(0xc00e249528)
/usr/lib/golang/src/runtime/map.go:858 +0x579 fp=0xc00e2493a8 sp=0xc00e249320 pc=0x410e89
github.com/spf13/viper.mergeMaps(0xc012966180, 0xc00c602840, 0x0)
/builddir/go/pkg/mod/github.com/spf13/[email protected]/viper.go:1672 +0x91 fp=0xc00e249598 sp=0xc00e2493a8 pc=0x90f181
github.com/spf13/viper.(*Viper).MergeConfigMap(0xc00a2d4fc0, 0xc012966180, 0x0, 0xc00a2d4fc0)
/builddir/go/pkg/mod/github.com/spf13/[email protected]/viper.go:1380 +0x66 fp=0xc00e2495c0 sp=0xc00e249598 pc=0x90bc06
github.com/xmidt-org/bascule.NewAttributesWithOptions(0x0, 0x0, 0xc012966180, 0xa, 0xc00948ece8)
/builddir/go/pkg/mod/github.com/xmidt-org/[email protected]/token.go:195 +0x65 fp=0xc00e249628 sp=0xc00e2495c0 pc=0xa37ea5
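The trace shows a map being iterated (inside viper's merge) while something else writes to it. The trace does not show which goroutine writes, so the following is only a sketch of one general remedy under that assumption: guard the shared map with a lock and hand a private copy to any code that iterates it. The safeAttributes type is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// safeAttributes is a hypothetical wrapper around a map shared by
// several goroutines.
type safeAttributes struct {
	mu sync.RWMutex
	m  map[string]interface{}
}

// set writes one key under the write lock.
func (s *safeAttributes) set(k string, v interface{}) {
	s.mu.Lock()
	s.m[k] = v
	s.mu.Unlock()
}

// snapshot returns a shallow copy taken under the read lock. The caller
// may iterate or merge the copy without racing against set.
func (s *safeAttributes) snapshot() map[string]interface{} {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]interface{}, len(s.m))
	for k, v := range s.m {
		out[k] = v
	}
	return out
}

func main() {
	attrs := &safeAttributes{m: map[string]interface{}{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			attrs.set(fmt.Sprintf("k%d", i), i)
			for range attrs.snapshot() {
				// Iterating the copy is safe while other goroutines write.
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(len(attrs.snapshot()))
}
```

The copy is shallow, so nested maps would still be shared and would need the same treatment.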
When you submit an api request Talaria requires the "X-Webpa-Device-Name" header, but it doesn't validate it against the message.
e.g.
curl -i -H "Authorization:$basicauth" -H "Content-Type:application/json" -H "Accept:application/json" --data-binary '@SimpleApiRequestMessage.json' -X POST https://$a:8080/api/v2/device/send
results in a 400
{"code": 400, "message": "Could extract device id: Missing device name header"}
whereas :
export b=mac:112233445566
curl -i -H "Authorization:$basicauth" -H "Content-Type:application/json" -H "Accept:application/json" --data-binary '@SimpleApiRequestMessage.json' -X POST https://$a:8080/api/v2/device/send -H "X-Webpa-Device-Name:$b"
will succeed even if $b does not match the device inside the SimpleApiRequestMessage.json, which is set to
"dest":"mac:4ca155000006/config",
It appears that talaria doesn't use or validate against this header. A request without the dest field will fail with a 400.
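A minimal sketch of the missing check, assuming the device ID is the part of dest before the first "/" and that the comparison should be case-insensitive (both are assumptions; the function names are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// deviceIDFromDest returns the device ID portion of a WRP dest such as
// "mac:4ca155000006/config", i.e. everything before the first "/".
func deviceIDFromDest(dest string) string {
	if i := strings.Index(dest, "/"); i >= 0 {
		return dest[:i]
	}
	return dest
}

// headerMatchesDest reports whether the X-Webpa-Device-Name header value
// names the same device as the message's dest field.
func headerMatchesDest(header, dest string) bool {
	return header != "" && strings.EqualFold(header, deviceIDFromDest(dest))
}

func main() {
	dest := "mac:4ca155000006/config"
	fmt.Println(headerMatchesDest("mac:112233445566", dest))
	fmt.Println(headerMatchesDest("mac:4ca155000006", dest))
}
```

A handler could reject the request with a 400 when this returns false, instead of silently routing on dest alone.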
Talaria service fails to start after upgrading from 0.1.1-153 to 0.1.1-165. Does any dependency need to be changed?
vendor/github.com/spf13/viper.(*Viper).Unmarshal(0x0, 0xaddba0, 0xc42000e900, 0xc420189790, 0x13)
/root/rpmbuild/BUILD/talaria-0.1.1/src/vendor/github.com/spf13/viper/viper.go:738 +0x2f
vendor/github.com/Comcast/webpa-common/service/servicecfg.NewEnvironment(0xd1ab40, 0xc4201ac500, 0xd1ae20, 0x0, 0x0, 0x0, 0xd1abc0, 0xc4201e0690)
/root/rpmbuild/BUILD/talaria-0.1.1/src/vendor/github.com/Comcast/webpa-common/service/servicecfg/environment.go:25 +0xd1
main.talaria(0xc42001e1e0, 0x1, 0x1, 0x0)
/root/rpmbuild/BUILD/talaria-0.1.1/src/talaria/talaria.go:120 +0xb1d
main.main.func1(0xc420076058)
/root/rpmbuild/BUILD/talaria-0.1.1/src/talaria/talaria.go:169 +0x49
main.main()
/root/rpmbuild/BUILD/talaria-0.1.1/src/talaria/talaria.go:172 +0x22
ts=2018-04-03T10:06:04.689560519Z caller=outbounder.go:233 level=info msg="Starting outbounder"
ts=2018-04-03T10:06:04.689575956Z level=info eventMap="unsupported value type"
ts=2018-04-03T10:06:04.689796743Z caller=webpa.go:483 level=info msg="starting server" name=talaria.health address=:8081
ts=2018-04-03T10:06:04.689812879Z caller=health.go:120 level=debug msg="Health Monitor Started"
ts=2018-04-03T10:06:04.69490963Z caller=webpa.go:506 level=info msg="starting server" name=talaria address=:8080
Is there any documentation on how to configure and stand up this WebPA Server component?
The feature for talaria to fan out via service discovery to caduceus does not work as expected.
eventEndpoints:
  default: "http://shouldnotwork.io"
Messages in the log indicate that talaria did not fan out, or fanned out incorrectly:
{"component":"consulWatcher","error":"no records available","level":"debug","msg":"looking up routes","routes":[],"ts":"2020-07-07T22:27:04.519010222Z"}
Hi Team,
We have upgraded talaria from version 0.1.1-153.el6 to 0.1.1-244.el6. Earlier it was working fine, and post-upgrade I am facing an issue with it. We have not made any changes to the config file, and we haven't patched our system post-upgrade either.
When I run service talaria start, the command executes successfully, but netstat -punta|grep 6000 shows nothing. Also, in the console.out file, I can see some error logs printed.
I also have an observation about something that often causes problems.
The default port of both Talaria and Scytale is "6000" in 0.1.1-244.el6, which was not the case earlier. Could this be changed in the final build so that the two applications do not conflict after installation and no manual input is needed after every new installation or upgrade?
We've been pointing fingers as to which component is responsible for being incorrect. I think we have the answer below, which was data taken during one of the inconsistent periods.
[clduser@talaria-prod-00101-184-204-1gv ~]$ curl -s 'http://localhost:8500/v1/health/service/talaria?passing=true&pretty' | grep -e '"Node": {' | wc -l
125
[clduser@talaria-prod-00101-184-204-1gv ~]$ curl -s http://localhost:9390/metrics | grep sd_instance_count
# HELP xmidt_talaria_sd_instance_count The current number of service instances of a given type
# TYPE xmidt_talaria_sd_instance_count gauge
xmidt_talaria_sd_instance_count{service="talaria[prod rum]{passingOnly=true}"} 115
[clduser@talaria-prod-00101-184-204-1gv ~]$
We can no longer blame the local agent, as it evidently has the same data as before the event window.
Currently, the receipt of a pong message is the only thing that increases the read timeout via SetReadDeadline. Given all the production issues we're having with ping misses, it might be a good idea to have the receipt of any websocket frame trigger a call to SetReadDeadline. Talaria would still send pings, but the lack of a pong would not disconnect devices so long as WRP messages were getting received.
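The proposal could be sketched roughly as follows. The deadlineSetter interface, frameKind type, and onFrame hook are hypothetical names; only SetReadDeadline itself comes from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// deadlineSetter is the one connection method this sketch needs.
type deadlineSetter interface {
	SetReadDeadline(t time.Time) error
}

// frameKind is a hypothetical classification of received frames.
type frameKind int

const (
	pongFrame frameKind = iota
	dataFrame
)

// onFrame pushes the read deadline forward on receipt of any frame,
// not only pongs, which is why the frame kind is ignored.
func onFrame(c deadlineSetter, _ frameKind, idle time.Duration) error {
	return c.SetReadDeadline(time.Now().Add(idle))
}

// fakeConn records the last deadline so the behavior can be observed
// without a real socket.
type fakeConn struct {
	deadline time.Time
}

func (f *fakeConn) SetReadDeadline(t time.Time) error {
	f.deadline = t
	return nil
}

func main() {
	c := &fakeConn{}
	if err := onFrame(c, dataFrame, 30*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("deadline extended:", !c.deadline.IsZero())
}
```

Under this scheme a device that keeps sending WRP messages stays connected even if an individual pong is lost.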
CentOS Linux release 7.6.1810 (Core)
Docker version 18.09.7, build 2d0083d
Console Log:
$ git clone https://github.com/Comcast/talaria.git talaria_docker
$ cd talaria_docker/
$ docker build -t talaria:local .
Sending build context to Docker daemon 646.1kB
Step 1/18 : FROM golang:alpine as builder
---> 4e4b7a8b9495
Step 2/18 : MAINTAINER Jack Murdock [email protected]
---> Using cache
---> fee8921fb0db
Step 3/18 : WORKDIR /go/src
---> Using cache
---> 0c1d12e154c2
Step 4/18 : RUN apk add --update --repository https://dl-3.alpinelinux.org/alpine/edge/testing/ git curl
---> Using cache
---> 305cb5bf0af9
Step 5/18 : RUN curl https://glide.sh/get | sh
---> Using cache
---> adc58bddb7f7
Step 6/18 : COPY src/ /go/src/
---> Using cache
---> e9c19d1f61e7
Step 7/18 : RUN glide -q install --strip-vendor
---> Running in 51e9adbccd42
[ERROR] Failed to set version on github.com/jtacoma/uritemplates to 307ae868f90f4ee1b73ebe4596e0394237dacce8: Unexpected error while defensively cleaning up after possible derelict nested submodule directories: exit status 128
[ERROR] Failed to set references: Unexpected error while defensively cleaning up after possible derelict nested submodule directories: exit status 128 (Skip to cleanup)
The command '/bin/sh -c glide -q install --strip-vendor' returned a non-zero code: 1
Here is an example of the changes required xmidt-org/svalinn#111 and xmidt-org/svalinn#112
According to the discussion on the XMiDT Discussion Board, talaria is not using consul to discover the available caducei.
talaria:
version: 0.5.3
go version: go1.14.3
built time: 2020-05-18 23:36:58
git commit: d4547aa
os/arch: linux/amd64
eventEndpoints and eventMap set to http://dummy:6000/api/v3/notify.
Including X-Xmidt-Transaction-Uuid and omitting the X-Xmidt-Source header in a direct request to Scytale can cause our simulators to time out and talaria to return a 504.
curl -X POST \
  "https://$scytaleurl:443/api/v2/device" \
  -H "authorization: $auth" \
  -H 'content-type: application/json' \
  -H "x-webpa-device-name: $testmac/config" \
  -H 'x-xmidt-content-type: application/json' \
  -H 'x-xmidt-message-type: 3' \
  -H 'x-xmidt-transaction-uuid: postman-1234' \
  -d '{"command":"GET","names":["Device.DeviceInfo.X_CISCO_COM_FirmwareName"]}'
From side discussion with Joel, link map is in talaria.
Note: if you included X-Xmidt-Source, but omit X-Xmidt-Transaction-Uuid then you get a 200 return code, but an empty body.
tl;dr:
have the uuid and missing source -> 504
missing uuid and have source -> 200 (no body)
have the uuid and have source -> 200 + body
missing uuid and missing source -> 200 (no body)
When a device connects to Talaria, Talaria should emit WRP Simple Events as defined below and apply the routing rules defined in the event-map configuration.
{
  "msg_type": "4",
  "source": "{talaria-fqdn}",
  "dest": "event:device-status/{device-id}/online",
  "content_type": "json",
  "partner_ids": [ "{partner-id}" ],
  "metadata": {
    "/boot-time" : "{boot-time}",
    "/hw-model" : "{hw-model}",
    "/hw-manufacturer" : "{hw-manufacturer}",
    "/hw-serial-number" : "{hw-serial-number}",
    "/hw-last-reboot-reason" : "{last-reboot-reason}",
    "/fw-name" : "{fw-name}",
    "/last-reconnect-reason" : "{last-reconnect-reason}",
    "/protocol" : "{parodus-version}",
    "/trust" : "{trust}",
    "/compliance" : "{compliance}"
  },
  "payload": {
    "id": "{device-id}",
    "ts": "2018-11-21T21:19:02.614191735Z"
  }
}
{
  "msg_type": "4",
  "source": "{talaria-fqdn}",
  "dest": "event:device-status/{device-id}/offline",
  "content_type": "json",
  "partner_ids": [ "{partner-id}" ],
  "metadata": {
    "/boot-time" : "{boot-time}",
    "/hw-model" : "{hw-model}",
    "/hw-manufacturer" : "{hw-manufacturer}",
    "/hw-serial-number" : "{hw-serial-number}",
    "/hw-last-reboot-reason" : "{last-reboot-reason}",
    "/fw-name" : "{fw-name}",
    "/last-reconnect-reason" : "{last-reconnect-reason}",
    "/protocol" : "{parodus-version}",
    "/trust" : "{trust}",
    "/compliance" : "{compliance}"
  },
  "payload": {
    "id": "{device-id}",
    "ts": "2018-11-21T21:19:02.614191735Z",
    "bytes-sent": 0,
    "messages-sent": 1,
    "bytes-received": 0,
    "messages-received": 0,
    "connected-at": "2018-11-21T21:19:02.614191735Z",
    "up-time": "16m46.6s",
    "reason-for-closure": "ping miss"
  }
}
{talaria-fqdn} - The fully qualified domain name for the Talaria sending this event (example: dns:talaria-1234.example.com).
{device-id} - The device id of the device that connected (example: mac:112233445566).
{partner-id} - The partner-id of the device that connected (from authorization JWT).
{boot-time} - The boot time of the device that connected (from convey header).
{hw-model} - The hardware model of the device that connected (from convey header).
{hw-manufacturer} - The hardware manufacturer of the device that connected (from convey header).
{hw-serial-number} - The hardware serial number of the device that connected (from convey header).
{last-reboot-reason} - The last reboot reason of the device that connected (from convey header).
{fw-name} - The firmware name of the device that connected (from convey header).
{last-reconnect-reason} - The last reconnection reason of the device that connected (from convey header).
{parodus-version} - The parodus version of the device that connected (from convey header).
{trust} - The trust value of the device that connected (from authorization JWT).
{compliance} - How well the client complies with sending the required information.
ts - The current timestamp on the Talaria machine.
connected-at - The timestamp on the Talaria machine when the device first connected.
reason-for-closure - The reason that Talaria closed the connection.
up-time - The total duration of time the connection was open from Talaria's perspective.
If a value listed above is unknown the item may be set to empty or omitted completely.
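For illustration, the online event above could be assembled along these lines. This sketch only mirrors the JSON shape shown in this issue; the struct and function names are hypothetical, and a real WRP message would be built with the project's WRP library instead.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// statusPayload mirrors the payload of the online event.
type statusPayload struct {
	ID string `json:"id"`
	TS string `json:"ts"`
}

// statusEvent mirrors the top-level fields of the event.
type statusEvent struct {
	MsgType     string            `json:"msg_type"`
	Source      string            `json:"source"`
	Dest        string            `json:"dest"`
	ContentType string            `json:"content_type"`
	PartnerIDs  []string          `json:"partner_ids"`
	Metadata    map[string]string `json:"metadata"`
	Payload     statusPayload     `json:"payload"`
}

// onlineEvent fills in the event for a device that just connected.
// The timestamp is always emitted as an RFC 3339 string.
func onlineEvent(fqdn, deviceID string, partners []string, metadata map[string]string) statusEvent {
	return statusEvent{
		MsgType:     "4",
		Source:      fqdn,
		Dest:        "event:device-status/" + deviceID + "/online",
		ContentType: "json",
		PartnerIDs:  partners,
		Metadata:    metadata,
		Payload: statusPayload{
			ID: deviceID,
			TS: time.Now().UTC().Format(time.RFC3339Nano),
		},
	}
}

func main() {
	e := onlineEvent(
		"dns:talaria-1234.example.com",
		"mac:112233445566",
		[]string{"example-partner"},
		map[string]string{"/hw-model": "example-model", "/trust": "0"},
	)
	out, err := json.MarshalIndent(e, "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Unknown metadata values can simply be left out of the map, which matches the rule that unknown items may be empty or omitted.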
As WRP messages come from the devices through Talaria, or messages about devices come from Talaria, the Talaria server needs to enforce the correct value of trust. To do this, Talaria must overwrite the WRP.metadata["trust"] field in the outgoing WRP message to reflect the trust level of the device. The trust level is based on the JWT token's trust field. If there is no JWT or it does not have a trust field, the value written SHALL be 0.
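A minimal sketch of that rule, assuming the JWT claims have already been decoded into a map (nil when there is no JWT) and that the trust claim is numeric. The enforceTrust name and the claim types handled are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// enforceTrust overwrites metadata["trust"] with the trust level from
// the JWT claims, ignoring whatever value the device supplied. With no
// JWT, or no usable trust claim, the value written is "0".
func enforceTrust(metadata map[string]string, claims map[string]interface{}) map[string]string {
	if metadata == nil {
		metadata = make(map[string]string)
	}
	trust := "0"
	// Reading from a nil claims map is safe and yields the zero value.
	switch t := claims["trust"].(type) {
	case float64: // encoding/json decodes JSON numbers as float64
		trust = strconv.FormatInt(int64(t), 10)
	case int:
		trust = strconv.Itoa(t)
	}
	metadata["trust"] = trust
	return metadata
}

func main() {
	fromDevice := map[string]string{"trust": "1000"}
	fmt.Println(enforceTrust(fromDevice, nil)["trust"])
	fmt.Println(enforceTrust(nil, map[string]interface{}{"trust": float64(100)})["trust"])
}
```

The key point is that the device-supplied value is never trusted: the field is always rewritten from the token or set to 0.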
When do we call watch? Just during initialization/bringing up the service? Or is it done at some interval?
Originally posted by @kristinaspring in https://github.com/xmidt-org/talaria/diffs
We want to add the following metrics:
message_propagation_time[ message_type, direction ] as a histogram
The message_type label is the wrp message type string.
The direction label is inbound or outbound, where inbound is defined as the message going to the connected device, and outbound is from the connected device.
Prometheus Query: sort((xmidt_talaria_device_count{instance=~"talaria-prod-.+", stage="prod", region="hoa"}))
This query shows results from all 126 talaria in the region. Three talaria:
talaria-prod-00101-184-204-4um
talaria-prod-00101-184-204-adb
talaria-prod-00101-184-204-hpf
appear with connected devices but do not appear in the service discovery metrics. Kind of looks like these may be orphaned devices?
If the Content Type in a wrp Simple Event is empty, we should populate it with the default application/octet-stream:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types
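The defaulting rule is small enough to sketch directly. The function name is hypothetical, and treating a whitespace-only value as empty is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultContentType is the fallback for events that carry no content type.
const defaultContentType = "application/octet-stream"

// contentTypeOrDefault returns ct unless it is empty or only whitespace.
func contentTypeOrDefault(ct string) string {
	if strings.TrimSpace(ct) == "" {
		return defaultContentType
	}
	return ct
}

func main() {
	fmt.Println(contentTypeOrDefault(""))
	fmt.Println(contentTypeOrDefault("application/json"))
}
```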
Also have to change references to the branch in .travis.yml, README, and CONTRIBUTING. Double check any other markdown files as well - sometimes links have the branch name in them.
It would be very helpful if we could see the type of IP address (IPv4 vs IPv6) used by xmidt_talaria_hardware_model.
Ideally we'd add a label ip_version of 4 or 6 to the xmidt_talaria_hardware_model metric.
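One way the label value might be derived from a connection's remote address is sketched below, using the standard net/netip package. Counting IPv4-mapped IPv6 addresses as 4 and returning "unknown" for unparseable input are both assumptions beyond what is requested above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipVersionLabel returns "4" or "6" for a remote address given either
// as a bare IP or as "ip:port", and "unknown" if it cannot be parsed.
func ipVersionLabel(remote string) string {
	addr, err := netip.ParseAddr(remote)
	if err != nil {
		ap, err := netip.ParseAddrPort(remote)
		if err != nil {
			return "unknown"
		}
		addr = ap.Addr()
	}
	// Unmap turns an IPv4-mapped IPv6 address (::ffff:a.b.c.d) into IPv4.
	if addr.Unmap().Is4() {
		return "4"
	}
	return "6"
}

func main() {
	for _, remote := range []string{
		"203.0.113.7:443",
		"[2001:db8::1]:443",
		"[::ffff:192.0.2.1]:80",
		"not-an-address",
	} {
		fmt.Println(remote, "->", ipVersionLabel(remote))
	}
}
```

Keeping the label to a tiny fixed set of values avoids inflating the number of series for the metric.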
Dockerfile and Dockerfile.local seem to be the exact same files.
I am guessing this was only needed in the glide days.
Context: b9596b8
https://github.com/Comcast/talaria/blob/bc16d08a8e1a4f8e7b952cd3f667013c5e9b0b17/src/talaria/deviceStatus.go#L43 the trailing comma should be removed
Add some metrics for socket close:
The reasons for close should probably be enumerated, if possible, to avoid blowing up the number of label values in prometheus.
When a request comes inbound for a device, validate that the partner id(s) given for the request match the device's metadata of valid partner id(s). Otherwise, send a 403 response.
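A minimal sketch of the check, assuming "match" means the request shares at least one partner id with the device (requiring all ids to match would be a stricter alternative). The function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// partnersOverlap reports whether any partner id on the request is also
// a valid partner id for the device.
func partnersOverlap(requestPartners, devicePartners []string) bool {
	valid := make(map[string]struct{}, len(devicePartners))
	for _, p := range devicePartners {
		valid[p] = struct{}{}
	}
	for _, p := range requestPartners {
		if _, ok := valid[p]; ok {
			return true
		}
	}
	return false
}

// statusForRequest returns 403 when the partner ids do not match and
// 200 otherwise.
func statusForRequest(requestPartners, devicePartners []string) int {
	if !partnersOverlap(requestPartners, devicePartners) {
		return http.StatusForbidden
	}
	return http.StatusOK
}

func main() {
	device := []string{"partner-a", "partner-b"}
	fmt.Println(statusForRequest([]string{"partner-a"}, device))
	fmt.Println(statusForRequest([]string{"partner-c"}, device))
}
```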
As part of streamlining the error reporting for api requests:
Codex is receiving some online/offline events either without the ts key in the payload (most likely option) or where the value for the ts key isn't a string.
We need the ability to drain devices based on partner-id.
We have upgraded from talaria-0.1.1-153.el6.x86_64 to talaria-0.1.3-1.el7.x86_64
The default configuration file no longer works and talaria exits with an error. We can also see a few "unsupported value type" errors in the log (attached below).
Are there any other parameters we need to add to the config file?
Consul running as "consul agent -dev -node machine"
talaria_log.txt
journalctl_talaria.txt
---
primary:
  address: ":8080"
health:
  address: ":8888"
pprof:
  address: ":9999"
discoveryClient:
  staticNodes: "https://localhost:8585"
log:
  file : "/var/run/talaria/talaria.log"
  level : "DEBUG"
  maxSize : 50
  maxBackup : 3
versions:
talaria-v0.5.1-1.el7.x86_64
caduceus-v0.2.7-1.el7.x86_64
Description:
Talaria isn't correctly delivering device-status events. This shows up as a log error when Caduceus attempts to deliver the event.
{"error":"Post http://[redacted]:8080/responder: net/http: invalid header field value "/trust=\x00" for key X-Xmidt-Metadata","level":"debug","msg":"retrying HTTP transaction","retry":1,"statusCode":0,"ts":"2020-05-11T22:15:59.728513299Z","url":"http://[redacted]:8080/responder"}
Steps:
Providing an invalid payload in a direct request to talaria causes it to restart. This is tracked internally as XMIDT-576
Currently Talaria does not perform any validation to ensure a WRP message sent from a device actually has the correct src attribute. This means a bad actor can masquerade as any other device once the initial connection authorization has been performed.
should be fixed with #69
Log device Convey header on every device connect event as it is required for debugging
This is a cross-post of the issue in caduceus, which is also applicable to talaria:
xmidt-org/caduceus#225
Devices that connect to talaria present a set of credentials about themselves during registration.
Some of that data includes information about which users are allowed to interact with them.
Before routing requests to devices on behalf of some user, talaria should first check that such user has all the right credentials.
There were security warnings about transitive dependencies in existing parts of go.sum during a previous PR.
#156 (comment)
We would like a metric gauge that shows the count of connections per convey hardware field (hw-model).
If there is no valid convey header hw-model, the value invalid should be used.
As connections are made, the gauge+hw-model label will be incremented.
As connections are disconnected/closed, the gauge+hw-model label will be decremented.
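The counting rules above can be sketched with a plain mutex-guarded map standing in for the real metric (in Talaria this would presumably be a labeled gauge from its metrics library). The modelGauge type is hypothetical, and treating an empty hw-model as the only invalid case is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// modelGauge counts open connections per hw-model label.
type modelGauge struct {
	mu     sync.Mutex
	counts map[string]int
}

func newModelGauge() *modelGauge {
	return &modelGauge{counts: make(map[string]int)}
}

// label maps a missing hw-model to the fixed value "invalid".
func label(hwModel string) string {
	if strings.TrimSpace(hwModel) == "" {
		return "invalid"
	}
	return hwModel
}

// connect increments the gauge for the device's hw-model.
func (g *modelGauge) connect(hwModel string) {
	g.mu.Lock()
	g.counts[label(hwModel)]++
	g.mu.Unlock()
}

// disconnect decrements the gauge for the device's hw-model.
func (g *modelGauge) disconnect(hwModel string) {
	g.mu.Lock()
	g.counts[label(hwModel)]--
	g.mu.Unlock()
}

// value returns the current count for a hw-model.
func (g *modelGauge) value(hwModel string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.counts[label(hwModel)]
}

func main() {
	g := newModelGauge()
	g.connect("example-model")
	g.connect("example-model")
	g.connect("")
	g.disconnect("example-model")
	fmt.Println(g.value("example-model"), g.value(""))
}
```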
Currently, there is no way to configure whether or not to allow devices that do not present a token. We'd like to create one that's off by default but which when enabled, allows devices to still connect to the cluster with a lower level of trust.
This work depends on the validator library implementing xmidt-org/bascule#58
https://github.com/Comcast/talaria/blob/bc16d08a8e1a4f8e7b952cd3f667013c5e9b0b17/src/talaria/deviceStatus.go#L73
the trailing comma needs to be removed.
https://github.com/xmidt-org/talaria/blob/master/outbounder.go#L74
https://github.com/xmidt-org/talaria/blob/master/dispatcher.go#L63
The dispatcher only uses the first auth key from the list given by the outbounder. We should make this clearer and just ask for one key, instead of providing the option for a slice of keys.