Operator tooling: cluster repair #17

sanmiguel · 2016-08-08T18:17:17Z

Scheduler bugs have in the past led to orphaned executors. Currently the only way to deal with them is to forcibly kill them (e.g. using riak-mesos framework teardown) and start over.

We should investigate how we can provide tooling for an operator to manually bring a node back under control of a scheduler.

/cc @seanjensengrey

seanjensengrey · 2016-08-08T18:19:17Z

Not only scheduler bugs, but also ZK corruption, etc. If we are to support running Riak clusters on Mesos with the same kind of uptime and longevity we see on bare metal, we need ways to transition a node back to a normal operating state without killing it.

sanmiguel added the enhancement label Aug 8, 2016
