Automatically migrated from Gitolite
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

55 lines
2.2 KiB

9 years ago
* allow comments in (parentheses) in units, and ignore these when matching against an alarm pattern...
* web interface (angularjs)
* separate alarm and IRC logic
* monitor inodes
9 years ago
* watchdog on slave and master -> should send WARN notifications
* notifications (text, arbitrary-serialized-data as attachment, DEBUG/INFO/WARN/ERR/CRIT)
* consider redundancy - can already connect multiple masters through pubsub, how to deal with duplicate processing checking?
-> subscribe to ccollectd
-> debug switch for outputting all to terminal
-> keep up/down state
-> keep last-value state (resource usage)
-> keep track of persistent downtimes (down for more than X time, as configured in config file)
-> alarms (move this from the IRC bot to cprocessd)
-> classify message importance
-> cprocessd-stream socket, PUB that just streams processed data
-> cprocessd-query socket, REP that responds to queries
-> server-status
-> down-list
-> last-value
-> server-list
-> service-list
-> use marrow.mailer
-> receives data from cprocessd-stream
-> sends e-mails for configured importance levels
-> currently named 'alert'
-> receives data from cprocessd-stream
-> IRC bot
-> posts alerts to specified IRC channels, depending on minimum severity level configured for that channel (ie. INFO for #cryto-network but ERR for #crytocc)
-> sends SMS for (critical) alerts
-> receives data from cprocessd-stream
-> Twilio? does a provider-neutral API exist? might need an extra abstraction...
-> offers web interface with streaming status data
-> publicly accessible and password-protected
-> streaming data from cprocessd-stream
-> on-pageload state from cprocessd-query (including 'current downtimes')
-> tornado+zmq ioloop,
-> web dashboard
-> AngularJS
-> fancy graphs (via AngularJS? idk if a directive exists for this)
-> show downtimes as well as live per-machine stats
-> also show overview of all machines in a grid, color-coded for average load of all resources
-> historical up/down data
-> sqlite storage? single concurrent write, so should work
-> perhaps letting people sign up for e-mail alerts is an option? to-inbox will be tricky here