
talaria's Issues

Config File 'talaria' Not Found

When I try to start the talaria service, I get the following error:
"Unable to initialize Viper environment: Config File 'talaria' Not Found in '[/etc/talaria / /home/ddp/.talaria /home/ddp/Projects/talaria/src/talaria]".

I assume a configuration file is necessary to start the service correctly, but I couldn't find any documentation for it. Could you please provide some examples of a configuration file?

The metric xmidt_talaria_sd_instance_count reporting incorrect/inconsistent values

Prometheus query: sort((xmidt_talaria_device_count{instance=~"talaria-prod-.+", stage="prod", region="hoa"}))

Returns data from 123 talaria. There are 126 talaria in the region.
In addition, the values returned are inconsistent: some show 123, some 125, and only 2 show 126.

Possible areas to look at: no data is being recorded for the metric being checked, there is a config error in Prometheus, or talaria is doing something that makes the data incorrect or leaves it unlogged.

Devices are not filtered by hash on connect

Talaria does not prevent devices from connecting that would not normally be hashed to it. For example, in a 2 talaria cluster a device with mac:112233445566 could connect directly (or be misdirected) to either one and would remain there until a rehash event. This affects addressability of the device and performance of the cluster.

The talaria(s) to which the device isn't hashed should not allow the connection in the first place.
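
A minimal sketch of the check described above. It assumes a stand-in hash (FNV-1a modulo the instance count) rather than talaria's real hashing scheme, which this issue does not show; `ownerOf`, `shouldAccept`, and the instance names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ownerOf returns the instance a device ID maps to.
// Assumption: FNV-1a modulo the instance count stands in for the real ring.
func ownerOf(deviceID string, instances []string) string {
	if len(instances) == 0 {
		return ""
	}
	// Sort a copy so every instance computes the same answer.
	sorted := append([]string(nil), instances...)
	sort.Strings(sorted)

	h := fnv.New32a()
	h.Write([]byte(deviceID))
	return sorted[int(h.Sum32()%uint32(len(sorted)))]
}

// shouldAccept reports whether this instance is the one the device hashes to.
func shouldAccept(self, deviceID string, instances []string) bool {
	return ownerOf(deviceID, instances) == self
}

func main() {
	cluster := []string{"talaria-a", "talaria-b"}
	for _, self := range cluster {
		fmt.Println(self, shouldAccept(self, "mac:112233445566", cluster))
	}
}
```

With a check like this in the connect path, only one of the two instances would accept the device; the other could reject or redirect it.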

Send websocket close frame to client with reason for disconnect

Send websocket close frame from websocket server talaria to client with reason for disconnect.
This will help differentiate cases where the server forcefully closed the connection versus cases when the client feels the socket on the other end closed due to unknown reasons.
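
As an illustration of what such a frame carries: per RFC 6455, a close frame body is a 2-byte status code in network byte order followed by a UTF-8 reason, and control frames are limited to 125 bytes of payload. The sketch below builds that body with the standard library only; `closePayload` is a hypothetical helper, not talaria code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf8"
)

// maxReasonBytes is the 125-byte control frame limit minus the 2-byte code.
const maxReasonBytes = 123

// closePayload builds the body of a websocket close frame.
func closePayload(code uint16, reason string) []byte {
	if len(reason) > maxReasonBytes {
		reason = reason[:maxReasonBytes]
		// Drop trailing bytes if the cut landed inside a multi-byte rune.
		for len(reason) > 0 && !utf8.ValidString(reason) {
			reason = reason[:len(reason)-1]
		}
	}
	buf := make([]byte, 2+len(reason))
	binary.BigEndian.PutUint16(buf, code)
	copy(buf[2:], reason)
	return buf
}

func main() {
	// 1001 is the "going away" status code defined by RFC 6455.
	p := closePayload(1001, "ping miss")
	fmt.Printf("% x\n", p)
}
```

A client that reads this frame can log the code and reason, and treat a close without one as an unexplained drop.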

Improve talaria registration for K8s

For scalability inside Kubernetes, talaria should register twice with Consul: once with the Kubernetes DNS name for scytale, and once with the external DNS name so devices can connect to the instance via petasos.

Explore using sync.Map from go 1.9

The sync.Map type in the Go standard library may be a good match for Talaria's connected devices. We should:

  1. Profile what we have
  2. Convert to using sync.Map
  3. Profile with sync.Map
  4. See if the code beauty and performance are awesome & merge or trash
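
For a sense of what step 2 might look like, here is a small sketch of a device registry backed by sync.Map. The `device` and `registry` types are hypothetical stand-ins; talaria's real registry has more responsibilities than this.

```go
package main

import (
	"fmt"
	"sync"
)

// device is a hypothetical stand-in for a connected device.
type device struct{ id string }

// registry keeps connected devices in a sync.Map keyed by device ID.
type registry struct{ m sync.Map }

func (r *registry) add(d *device) { r.m.Store(d.id, d) }

func (r *registry) get(id string) (*device, bool) {
	v, ok := r.m.Load(id)
	if !ok {
		return nil, false
	}
	return v.(*device), true
}

func (r *registry) remove(id string) { r.m.Delete(id) }

// count walks the map; sync.Map has no length method, so this is O(n).
func (r *registry) count() int {
	n := 0
	r.m.Range(func(_, _ interface{}) bool {
		n++
		return true
	})
	return n
}

func main() {
	r := &registry{}
	r.add(&device{id: "mac:112233445566"})
	d, ok := r.get("mac:112233445566")
	fmt.Println(d.id, ok, r.count())
}
```

One trade-off worth profiling: counting requires a full Range, so a separate atomic counter may be needed if the device count is read often.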

Add Connection information and re-encode wrp

The connection information (SessionID, trust, and time connected) should be added to all wrps that go through talaria, which means talaria has to re-encode every message.

Is the connection ID unique from one talaria to the next? Otherwise, there could be a small chance that two devices of the same ID connect to two talaria in different data centers with the same connection ID.

fatal error: concurrent map iteration and map write

There is no clear way to reproduce this yet, but it has been seen during a decent percentage of API load against the /device WRP endpoint.

fatal error: concurrent map iteration and map write
goroutine 1277169 [running]:
runtime.throw(0xcf8f22, 0x26)
/usr/lib/golang/src/runtime/panic.go:774 +0x72 fp=0xc00e249320 sp=0xc00e2492f0 pc=0x430ee2
runtime.mapiternext(0xc00e249528)
/usr/lib/golang/src/runtime/map.go:858 +0x579 fp=0xc00e2493a8 sp=0xc00e249320 pc=0x410e89
github.com/spf13/viper.mergeMaps(0xc012966180, 0xc00c602840, 0x0)
/builddir/go/pkg/mod/github.com/spf13/[email protected]/viper.go:1672 +0x91 fp=0xc00e249598 sp=0xc00e2493a8 pc=0x90f181
github.com/spf13/viper.(*Viper).MergeConfigMap(0xc00a2d4fc0, 0xc012966180, 0x0, 0xc00a2d4fc0)
/builddir/go/pkg/mod/github.com/spf13/[email protected]/viper.go:1380 +0x66 fp=0xc00e2495c0 sp=0xc00e249598 pc=0x90bc06
github.com/xmidt-org/bascule.NewAttributesWithOptions(0x0, 0x0, 0xc012966180, 0xa, 0xc00948ece8)
/builddir/go/pkg/mod/github.com/xmidt-org/[email protected]/token.go:195 +0x65 fp=0xc00e249628 sp=0xc00e2495c0 pc=0xa37ea5
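
The trace shows a map being iterated (inside viper.mergeMaps) while another goroutine writes to it. One common way to avoid this class of crash, sketched below under the assumption that the map is shared between goroutines, is to hand the iterating code a private copy taken under a lock. `sharedClaims` is a hypothetical type, not part of bascule or viper.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedClaims guards a map that several goroutines read and write.
type sharedClaims struct {
	mu sync.RWMutex
	m  map[string]interface{}
}

func (s *sharedClaims) set(k string, v interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

// snapshot returns a shallow copy that is safe to iterate without the lock.
// Nested maps are still shared and would need a deep copy.
func (s *sharedClaims) snapshot() map[string]interface{} {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]interface{}, len(s.m))
	for k, v := range s.m {
		out[k] = v
	}
	return out
}

func main() {
	s := &sharedClaims{m: map[string]interface{}{"trust": 1000}}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // writer
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			s.set(fmt.Sprintf("k%d", i), i)
		}
	}()
	for i := 0; i < 1000; i++ { // reader iterates private copies only
		for range s.snapshot() {
		}
	}
	wg.Wait()
	fmt.Println(len(s.snapshot()))
}
```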

talaria requires header but does not validate/use it

When you submit an api request Talaria requires the "X-Webpa-Device-Name" header, but it doesn't validate it against the message.

e.g.

curl -i -H "Authorization:$basicauth" -H "Content-Type:application/json" -H "Accept:application/json" --data-binary '@SimpleApiRequestMessage.json' -X POST https://$a:8080/api/v2/device/send

results in a 400

{"code": 400, "message": "Could extract device id: Missing device name header"}

whereas :

export b=mac:112233445566
curl -i -H "Authorization:$basicauth" -H "Content-Type:application/json" -H "Accept:application/json" --data-binary '@SimpleApiRequestMessage.json' -X POST https://$a:8080/api/v2/device/send -H "X-Webpa-Device-Name:$b"

will succeed even if $b does not match the device inside the SimpleApiRequestMessage.json which is set to

"dest":"mac:4ca155000006/config",

It appears that talaria doesn't use or validate against this header. A request without the dest field will fail with a 400.
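
A sketch of the missing validation, assuming the device ID is the part of dest before the first "/" (as in "mac:4ca155000006/config"). `deviceIDFromDest` and `validateDeviceHeader` are hypothetical names, and the error text is illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// deviceIDFromDest returns the portion of dest before the first "/".
func deviceIDFromDest(dest string) string {
	if i := strings.IndexByte(dest, '/'); i >= 0 {
		return dest[:i]
	}
	return dest
}

// validateDeviceHeader checks the X-Webpa-Device-Name value against the
// device named in the message's dest field.
func validateDeviceHeader(header, dest string) error {
	if header == "" {
		return errors.New("missing device name header")
	}
	if want := deviceIDFromDest(dest); !strings.EqualFold(header, want) {
		return fmt.Errorf("header names %q but message is for %q", header, want)
	}
	return nil
}

func main() {
	fmt.Println(validateDeviceHeader("mac:112233445566", "mac:4ca155000006/config"))
	fmt.Println(validateDeviceHeader("mac:4ca155000006", "mac:4ca155000006/config"))
}
```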

service fails to start after upgrading to release 0.1.1-165

The talaria service fails to start after upgrading from 0.1.1-153 to 0.1.1-165. Do any dependencies need to be changed?

tail -f /var/log/talaria/console.out

vendor/github.com/spf13/viper.(*Viper).Unmarshal(0x0, 0xaddba0, 0xc42000e900, 0xc420189790, 0x13)
/root/rpmbuild/BUILD/talaria-0.1.1/src/vendor/github.com/spf13/viper/viper.go:738 +0x2f
vendor/github.com/Comcast/webpa-common/service/servicecfg.NewEnvironment(0xd1ab40, 0xc4201ac500, 0xd1ae20, 0x0, 0x0, 0x0, 0xd1abc0, 0xc4201e0690)
/root/rpmbuild/BUILD/talaria-0.1.1/src/vendor/github.com/Comcast/webpa-common/service/servicecfg/environment.go:25 +0xd1
main.talaria(0xc42001e1e0, 0x1, 0x1, 0x0)
/root/rpmbuild/BUILD/talaria-0.1.1/src/talaria/talaria.go:120 +0xb1d
main.main.func1(0xc420076058)
/root/rpmbuild/BUILD/talaria-0.1.1/src/talaria/talaria.go:169 +0x49
main.main()
/root/rpmbuild/BUILD/talaria-0.1.1/src/talaria/talaria.go:172 +0x22

tail -f /var/run/talaria/talariaLog.log

ts=2018-04-03T10:06:04.689560519Z caller=outbounder.go:233 level=info msg="Starting outbounder"
ts=2018-04-03T10:06:04.689575956Z level=info eventMap="unsupported value type"
ts=2018-04-03T10:06:04.689796743Z caller=webpa.go:483 level=info msg="starting server" name=talaria.health address=:8081
ts=2018-04-03T10:06:04.689812879Z caller=health.go:120 level=debug msg="Health Monitor Started"
ts=2018-04-03T10:06:04.69490963Z caller=webpa.go:506 level=info msg="starting server" name=talaria address=:8080

Talaria round robin to caduceus does not work as expected

The feature for talaria to fan out via service discovery to caduceus does not work as expected.

  1. In the configuration, change outbound.eventEndpoints.default to a server that does not exist or has the wrong port:
    eventEndpoints:
        default: "http://shouldnotwork.io"
  2. Verify that service discovery is configured correctly.
  3. Restart talaria.
  4. Connect/disconnect a device.

The log then shows messages indicating that talaria did not fan out, or fanned out incorrectly:
{"component":"consulWatcher","error":"no records available","level":"debug","msg":"looking up routes","routes":[],"ts":"2020-07-07T22:27:04.519010222Z"}

Talaria service is not starting after upgrade

Hi Team,

We upgraded talaria from version 0.1.1-153.el6 to 0.1.1-244.el6. It was working fine before, and since the upgrade I have been facing an issue with it. We have not made any change to the config file, and we have not patched the system since the upgrade either.

When I run service talaria start, the command executes successfully, but netstat -punta|grep 6000 shows nothing. I can also see some error logs printed in the console.out file.

I also have an observation about something that often causes problems.
The default port of both Talaria and Scytale is "6000" in 0.1.1-244.el6, which was not the case earlier. Could this be changed in the final build so that the two applications do not conflict after installation, and no manual input is needed after every new installation or upgrade?

console.out.txt

Service discovery data is inconsistent with local Consul agent

We've been pointing fingers as to which component is responsible for being incorrect. I think we have the answer below, which was data taken during one of the inconsistent periods.

[clduser@talaria-prod-00101-184-204-1gv ~]$ curl -s 'http://localhost:8500/v1/health/service/talaria?passing=true&pretty' | grep -e '"Node": {' | wc -l
125
[clduser@talaria-prod-00101-184-204-1gv ~]$ curl -s http://localhost:9390/metrics | grep sd_instance_count
# HELP xmidt_talaria_sd_instance_count The current number of service instances of a given type
# TYPE xmidt_talaria_sd_instance_count gauge
xmidt_talaria_sd_instance_count{service="talaria[prod rum]{passingOnly=true}"} 115
[clduser@talaria-prod-00101-184-204-1gv ~]$

We can no longer blame the local agent, as it evidently has the same data as before the event window.

SetReadDeadline might need to be called for all messages

Currently, the receipt of a pong message is the only thing that increases the read timeout via SetReadDeadline. Given all the production issues we're having with ping misses, it might be a good idea to have the receipt of any websocket frame trigger a call to SetReadDeadline. Talaria would still send pings, but the lack of a pong would not disconnect devices so long as WRP messages were getting received.
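
A rough sketch of the proposal, assuming the read loop can call a helper after every successfully read frame rather than only from the pong handler. `extendOnFrame`, `fakeConn`, `epoch`, and `idle` are hypothetical; the only real API assumed is a `SetReadDeadline(time.Time) error` method, which net.Conn provides.

```go
package main

import (
	"fmt"
	"time"
)

// deadlineSetter is the one method this sketch needs from a connection.
type deadlineSetter interface {
	SetReadDeadline(t time.Time) error
}

// extendOnFrame pushes the read deadline out by idlePeriod from now.
// The idea is to call it for any frame: ping, pong, or a WRP message.
func extendOnFrame(c deadlineSetter, now time.Time, idlePeriod time.Duration) error {
	return c.SetReadDeadline(now.Add(idlePeriod))
}

// fakeConn records the last deadline so the behavior can be inspected.
type fakeConn struct{ deadline time.Time }

func (f *fakeConn) SetReadDeadline(t time.Time) error {
	f.deadline = t
	return nil
}

var (
	epoch = time.Unix(0, 0)  // fixed "now" for the demonstration
	idle  = 2 * time.Minute  // hypothetical idle period
)

func main() {
	c := &fakeConn{}
	if err := extendOnFrame(c, epoch, idle); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.deadline.Sub(epoch))
}
```

With this in place, pings would still be sent, but a device that keeps delivering WRP messages would not be disconnected for a missed pong.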

Docker Build error

CentOS Linux release 7.6.1810 (Core)
Docker version 18.09.7, build 2d0083d

Console Log:
$ git clone https://github.com/Comcast/talaria.git talaria_docker
$ cd talaria_docker/
$ docker build -t talaria:local .
Sending build context to Docker daemon 646.1kB
Step 1/18 : FROM golang:alpine as builder
---> 4e4b7a8b9495
Step 2/18 : MAINTAINER Jack Murdock [email protected]
---> Using cache
---> fee8921fb0db
Step 3/18 : WORKDIR /go/src
---> Using cache
---> 0c1d12e154c2
Step 4/18 : RUN apk add --update --repository https://dl-3.alpinelinux.org/alpine/edge/testing/ git curl
---> Using cache
---> 305cb5bf0af9
Step 5/18 : RUN curl https://glide.sh/get | sh
---> Using cache
---> adc58bddb7f7
Step 6/18 : COPY src/ /go/src/
---> Using cache
---> e9c19d1f61e7
Step 7/18 : RUN glide -q install --strip-vendor
---> Running in 51e9adbccd42

[ERROR] Failed to set version on github.com/jtacoma/uritemplates to 307ae868f90f4ee1b73ebe4596e0394237dacce8: Unexpected error while defensively cleaning up after possible derelict nested submodule directories: exit status 128
[ERROR] Failed to set references: Unexpected error while defensively cleaning up after possible derelict nested submodule directories: exit status 128 (Skip to cleanup)
The command '/bin/sh -c glide -q install --strip-vendor' returned a non-zero code: 1

Caduceus RoundRobin via Consul is not working as expected.

Overview

According to the discussion on the XMiDT Discussion Board, talaria is not using Consul to discover the available caducei.

Version

talaria:
  version: 	0.5.3
  go version: 	go1.14.3
  built time: 	2020-05-18 23:36:58
  git commit: 	d4547aa
  os/arch: 	linux/amd64

How To Reproduce

Expected Results

including transaction uuid field to scytale causes device to timeout

Including X-Xmidt-Transaction-Uuid and omitting the X-Xmidt-Source header in a direct request to Scytale can cause our simulators to time out and talaria to return a 504.

curl -X POST \
  "https://$scytaleurl:443/api/v2/device" \
  -H "authorization: $auth" \
  -H 'content-type: application/json' \
  -H "x-webpa-device-name: $testmac/config" \
  -H 'x-xmidt-content-type: application/json' \
  -H 'x-xmidt-message-type: 3' \
  -H 'x-xmidt-transaction-uuid: postman-1234' \
  -d '{"command":"GET","names":["Device.DeviceInfo.X_CISCO_COM_FirmwareName"]}'

From side discussion with Joel, link map is in talaria.

Note: if you include X-Xmidt-Source but omit X-Xmidt-Transaction-Uuid, you get a 200 return code but an empty body.
TL;DR:
have the uuid and missing source -> 504
missing uuid and have source -> 200 (no body)
have the uuid and have source -> 200 + body
missing uuid and missing source -> 200 (no body)

Emit online/offline events

When a device connects to or disconnects from Talaria, Talaria should emit a WRP Simple Event as defined below and apply the routing rules defined in the event-map configuration.

Online

{
    "msg_type": "4",
    "source": "{talaria-fqdn}",
    "dest": "event:device-status/{device-id}/online",
    "content_type": "json",
    "partner_ids": [ "{partner-id}" ],
    "metadata": {
        "/boot-time" : "{boot-time}",
        "/hw-model" : "{hw-model}",
        "/hw-manufacturer" : "{hw-manufacturer}",
        "/hw-serial-number" : "{hw-serial-number}",
        "/hw-last-reboot-reason" : "{last-reboot-reason}",
        "/fw-name" : "{fw-name}",
        "/last-reconnect-reason" : "{last-reconnect-reason}",
        "/protocol" : "{parodus-version}",
        "/trust" : "{trust}",
        "/compliance" : "{compliance}"
    },
    "payload": {
        "id": "{device-id}",
        "ts": "2018-11-21T21:19:02.614191735Z"
    }
}

Offline

{
    "msg_type": "4",
    "source": "{talaria-fqdn}",
    "dest": "event:device-status/{device-id}/offline",
    "content_type": "json",
    "partner_ids": [ "{partner-id}" ],
    "metadata": {
        "/boot-time" : "{boot-time}",
        "/hw-model" : "{hw-model}",
        "/hw-manufacturer" : "{hw-manufacturer}",
        "/hw-serial-number" : "{hw-serial-number}",
        "/hw-last-reboot-reason" : "{last-reboot-reason}",
        "/fw-name" : "{fw-name}",
        "/last-reconnect-reason" : "{last-reconnect-reason}",
        "/protocol" : "{parodus-version}",
        "/trust" : "{trust}",
        "/compliance" : "{compliance}"
    },
    "payload": {
        "id": "{device-id}",
        "ts": "2018-11-21T21:19:02.614191735Z",
        "bytes-sent": 0,
        "messages-sent": 1,
        "bytes-received": 0,
        "messages-received": 0,
        "connected-at": "2018-11-21T21:19:02.614191735Z",
        "up-time": "16m46.6s",
        "reason-for-closure": "ping miss"
    }
}

Description of most fields

  • {talaria-fqdn} - The fully qualified domain name for the Talaria sending this event (example: dns:talaria-1234.example.com).
  • {device-id} - The device id of the device that connected (example: mac:112233445566)
  • {partner-id} - The partner-id of the device that connected (from authorization JWT).
  • {boot-time} - The boot time of the device that connected (from convey header).
  • {hw-model} - The hardware model of the device that connected (from convey header).
  • {hw-manufacturer} - The hardware manufacturer of the device that connected (from convey header).
  • {hw-serial-number} - The hardware serial number of the device that connected (from convey header).
  • {last-reboot-reason} - The last reboot reason of the device that connected (from convey header).
  • {fw-name} - The firmware name of the device that connected (from convey header).
  • {last-reconnect-reason} - The last reconnection reason of the device that connected (from convey header).
  • {parodus-version} - The parodus version of the device that connected (from convey header).
  • {trust} - The trust value of the device that connected (from authorization JWT).
  • {compliance} - How well the client complies with sending the required information.
  • ts - The current timestamp on the Talaria machine.
  • connected-at - The timestamp on the Talaria machine when the device first connected.
  • reason-for-closure - The reason that Talaria closed the connection.
  • up-time - The total duration of time the connection was open from Talaria's perspective.

If a value listed above is unknown the item may be set to empty or omitted completely.

Enforce WRP.metadata["trust"] for all messages

As WRP messages come from the devices through Talaria, or messages about devices come from Talaria, the Talaria server needs to enforce the correct value of trust. To do this, Talaria must overwrite the WRP.metadata["trust"] field in the outgoing WRP message to reflect the trust level of the device. The trust level is taken from the trust field of the JWT. If there is no JWT or it does not have a trust field, the value written SHALL be 0.
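
A small sketch of the overwrite rule, assuming metadata is a string map and the JWT trust value is available as an optional integer. `enforceTrust` is a hypothetical helper; how talaria actually extracts the claim is not shown here.

```go
package main

import (
	"fmt"
	"strconv"
)

// enforceTrust overwrites metadata["trust"] with the JWT's trust value,
// or with 0 when there is no JWT or the JWT has no trust field (jwtTrust == nil).
// Whatever the device put in the field is discarded.
func enforceTrust(metadata map[string]string, jwtTrust *int) map[string]string {
	if metadata == nil {
		metadata = make(map[string]string)
	}
	trust := 0
	if jwtTrust != nil {
		trust = *jwtTrust
	}
	metadata["trust"] = strconv.Itoa(trust)
	return metadata
}

func main() {
	// The device claims a trust level, but presented no JWT.
	md := enforceTrust(map[string]string{"trust": "1000"}, nil)
	fmt.Println(md["trust"])
}
```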

Add a few metrics

We want to add the following metrics:

message_propagation_time[ message_type, direction ] as a histogram

message_type label is the wrp message type string

direction is inbound or outbound where inbound is defined as the message is going to the connected device, and outbound is from the connected device

The metric xmidt_talaria_device_count is inconsistent with service discovery metrics

Prometheus Query: sort((xmidt_talaria_device_count{instance=~"talaria-prod-.+", stage="prod", region="hoa"}))

This query shows results from all 126 talaria in the region. Three talaria:
talaria-prod-00101-184-204-4um
talaria-prod-00101-184-204-adb
talaria-prod-00101-184-204-hpf

appear with connected devices but do not appear in the service discovery metrics. Kind of looks like these may be orphaned devices?

Rename master branch to main

Also have to change references to the branch in .travis.yml, README, and CONTRIBUTING. Double check any other markdown files as well - sometimes links have the branch name in them.

Socket Close Metrics

Add some metrics for socket close:

  • whether or not talaria initiated closing the socket
  • reason for close given by device
  • reason for close given by talaria

The reasons for close should be enumerated if possible, so that free-form reason strings do not blow up label cardinality in Prometheus.
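
One way to keep the label set bounded is to map free-form reasons onto a fixed list before using them as a label value. In the sketch below only "ping miss" comes from this page (the offline event example); the other entries, the "other" bucket, and `closeReasonLabel` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// knownCloseReasons is the fixed set of label values.
var knownCloseReasons = map[string]bool{
	"ping miss":   true, // appears in the offline event example
	"read error":  true, // hypothetical
	"write error": true, // hypothetical
}

// closeReasonLabel normalizes a raw reason and folds anything
// unrecognized into a single "other" bucket.
func closeReasonLabel(raw string) string {
	r := strings.ToLower(strings.TrimSpace(raw))
	if knownCloseReasons[r] {
		return r
	}
	return "other"
}

func main() {
	fmt.Println(closeReasonLabel(" Ping Miss "))
	fmt.Println(closeReasonLabel("device sent: reboot #8123"))
}
```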

Validate partners for a device request

When a request comes inbound for a device, validate that the partner id(s) given for the request match the device's metadata of valid partner id(s). Otherwise, send a 403 response.
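
A sketch of the check, under the assumption that "match" means at least one requested partner ID appears in the device's list; the issue does not say whether all IDs must match. `partnerStatus` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
)

// partnerStatus returns 200 when any requested partner ID is valid for
// the device, and 403 otherwise (including when no IDs are given).
func partnerStatus(requested, allowed []string) int {
	valid := make(map[string]struct{}, len(allowed))
	for _, p := range allowed {
		valid[p] = struct{}{}
	}
	for _, p := range requested {
		if _, ok := valid[p]; ok {
			return http.StatusOK
		}
	}
	return http.StatusForbidden
}

func main() {
	device := []string{"partner-a"}
	fmt.Println(partnerStatus([]string{"partner-a"}, device))
	fmt.Println(partnerStatus([]string{"partner-b"}, device))
}
```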

malformed online/offline events

Codex is receiving some online/offline events either without the ts key in the payload (most likely option) or where the value for the ts key isn't a string.
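
To tell the two failure modes apart, a consumer or a test could check the payload along these lines. This is a sketch assuming the payload is the JSON object shown in the online/offline event definitions and that ts is an RFC 3339 timestamp; `validateTS` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// validateTS checks that the event payload has a string "ts" key
// holding an RFC 3339 timestamp.
func validateTS(payload []byte) error {
	var p map[string]interface{}
	if err := json.Unmarshal(payload, &p); err != nil {
		return fmt.Errorf("payload is not a JSON object: %w", err)
	}
	raw, ok := p["ts"]
	if !ok {
		return errors.New("payload has no ts key")
	}
	s, ok := raw.(string)
	if !ok {
		return fmt.Errorf("ts is %T, want string", raw)
	}
	if _, err := time.Parse(time.RFC3339Nano, s); err != nil {
		return fmt.Errorf("ts is not RFC 3339: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(validateTS([]byte(`{"id":"mac:112233445566"}`)))
	fmt.Println(validateTS([]byte(`{"id":"mac:112233445566","ts":1542835142}`)))
	fmt.Println(validateTS([]byte(`{"id":"mac:112233445566","ts":"2018-11-21T21:19:02.614191735Z"}`)))
}
```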

Talaria doesn't run with default configuration

We have upgraded from talaria-0.1.1-153.el6.x86_64 to talaria-0.1.3-1.el7.x86_64.
The default configuration file no longer works and talaria exits with an error. We can also see a few "unsupported value type" errors in the log (attached below).
Are there any other parameters we need to add to the config file?

Consul running as "consul agent -dev -node machine"

talaria log & journalctl log attached.

talaria_log.txt
journalctl_talaria.txt

/etc/talaria/talaria.yaml

---
primary:
  address: ":8080"
health:
  address: ":8888"
pprof:
  address: ":9999"
discoveryClient:
  staticNodes: "https://localhost:8585"
log:
  file: "/var/run/talaria/talaria.log"
  level: "DEBUG"
  maxSize: 50
  maxBackup: 3

device-status events not being delivered

versions:
talaria-v0.5.1-1.el7.x86_64
caduceus-v0.2.7-1.el7.x86_64
Description:
Talaria isn't correctly delivering device-status events. This shows up as a log error when Caduceus attempts to deliver the event.

{"error":"Post http://[redacted]:8080/responder: net/http: invalid header field value "/trust=\x00" for key X-Xmidt-Metadata","level":"debug","msg":"retrying HTTP transaction","retry":1,"statusCode":0,"ts":"2020-05-11T22:15:59.728513299Z","url":"http://[redacted]:8080/responder"}

Steps:

  1. start listener & register webhook for device-status
  2. connect device
  3. expected to see device status event
    instead no event delivered
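
The "/trust=\x00" value in the error is what Go produces when an integer 0 is converted to a string as a code point instead of being formatted as a decimal number. Whether that is the actual cause in talaria is an assumption; the sketch below only shows how the two conversions differ. Both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// trustAsRune converts the integer to the Unicode code point with that
// value. For trust 0 this yields a NUL byte, which is not a legal HTTP
// header value.
func trustAsRune(trust int) string {
	return "/trust=" + string(rune(trust))
}

// trustAsDecimal formats the integer as decimal text.
func trustAsDecimal(trust int) string {
	return "/trust=" + strconv.Itoa(trust)
}

func main() {
	fmt.Printf("%q\n", trustAsRune(0))
	fmt.Printf("%q\n", trustAsDecimal(0))
}
```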

Message origin validation

Currently Talaria does not perform any validation to ensure a WRP message sent from a device actually has the correct src attribute. This means a bad actor can masquerade as any other device once the initial connection authorization has been performed.

Device Access Validator

Devices that connect to talaria present a set of credentials about themselves during registration.
Some of that data includes information about which users are allowed to interact with them.

Before routing requests to devices on behalf of some user, talaria should first check that such user has all the right credentials.

Add Convey Hardware Metric

We would like a metric gauge that shows the count of connections per convey hardware field (hw-model).

If the convey header has no valid hw-model, the value invalid should be used.

As connections are made, the gauge+hw-model label will be incremented.
As connections are disconnected/closed, the gauge+hw-model label will be decremented.
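
The counting rule can be sketched with a mutex-guarded map standing in for a labeled gauge. This is not Prometheus client code; `modelGauge` and its methods are hypothetical, and treating an empty or whitespace-only hw-model as invalid is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// modelGauge tracks open connections per hw-model label.
type modelGauge struct {
	mu     sync.Mutex
	counts map[string]int
}

func newModelGauge() *modelGauge {
	return &modelGauge{counts: make(map[string]int)}
}

// modelLabel maps a missing or blank hw-model to "invalid".
func modelLabel(hwModel string) string {
	if m := strings.TrimSpace(hwModel); m != "" {
		return m
	}
	return "invalid"
}

func (g *modelGauge) connected(hwModel string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.counts[modelLabel(hwModel)]++
}

func (g *modelGauge) disconnected(hwModel string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.counts[modelLabel(hwModel)]--
}

func (g *modelGauge) value(label string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.counts[label]
}

func main() {
	g := newModelGauge()
	g.connected("model-x") // hypothetical model name
	g.connected("")
	g.disconnected("model-x")
	fmt.Println(g.value("model-x"), g.value("invalid"))
}
```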
