Node status and actions

The current list of status columns and some related information.

Node actions

For some node statuses, certain actions are allowed that help make changes to the node’s backing IaaS instance. These actions provide a way to handle errors per node, and they take care of data migration when an instance fails and of load balancing when a node is added.

Status	Allowed Actions	Description
To Be Added	Delete	This action can be taken if universe creation fails and the node is stuck in ‘To Be Added’. Once this action is performed, the node (and its underlying instance) is removed from the universe.
Live	Stop Processes, Remove Node	‘Stop Processes’ stops the server processes on the node, and the node status becomes ‘Stopped’. A Live node can also be marked as ‘Removed’, which moves the data off the node and stops the server processes running on it. Note that the backing instance is still under the control of the universe. Removing the node also clears its MASTER/TSERVER setting on the UI.
Stopped	Start Processes, Release Instance	The server processes that are stopped on the node can be restarted using the ‘Start Processes’ dropdown option. The other option for a ‘Stopped’ node is to release the backing instance to IaaS, which stops tracking the IP of this node in the universe.
Removed	Add Node, Release Instance	A removed node can be added back: this restarts the processes on that node, moves data onto it from other nodes, and marks it ‘Live’. The other option is to release the backing instance to IaaS; this stops tracking the IP of this node in the universe, and the node is marked ‘Decommissioned’.
Decommissioned	Add Node	A new instance is used/spawned to replace the released instance, server processes are restarted, and data is load balanced onto this node. The node becomes ‘Live’ after this operation.

The remaining status types do not have any user actions, as they are mostly transient and end up in one of the statuses above.
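The status-to-action mapping in the table can be written down as a small lookup, which is handy when scripting checks around node operations. This is only a sketch: the names `ALLOWED_ACTIONS` and `allowed_actions` are hypothetical and are not part of any YugaWare API, and `"Unknown Status"` is a placeholder for any status not listed in the table.

```python
# Allowed user actions per node status, transcribed from the table above.
# All names here are illustrative; this is not a YugaWare API.
ALLOWED_ACTIONS = {
    "To Be Added": ["Delete"],
    "Live": ["Stop Processes", "Remove Node"],
    "Stopped": ["Start Processes", "Release Instance"],
    "Removed": ["Add Node", "Release Instance"],
    "Decommissioned": ["Add Node"],
}


def allowed_actions(status: str) -> list:
    """Return the user actions for a status.

    Statuses that are not in the table (the mostly transient ones)
    have no user actions, so an empty list is returned for them.
    """
    return list(ALLOWED_ACTIONS.get(status, []))


if __name__ == "__main__":
    for status in ("Live", "Removed", "Unknown Status"):
        print(status, "->", allowed_actions(status))
```

Returning a copy of the list keeps callers from accidentally mutating the shared table.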

Note: Add Node only recreates a backing instance for an existing node in the universe or cluster. To add a completely new node (that is, to increase the number of nodes in the universe), use the Edit Universe option to expand the universe.

The rest of this page describes how to modify the state of each node in a universe/cluster. The UI provides different actions that can be taken against each node in the ACTION column dropdown.

There are two broad sets of use cases: replacing the backing instance of a node, and quick operations on an existing instance.

The following steps ensure that the underlying instance for a given node can be replaced with a new instance without any data loss.

Suppose, for example, that a VM or machine in the universe is reaching end of life, or has unrecoverable hardware or other system (OS, disk, etc.) problems. The machine crashes for good, so no processes are running on it. The UI detects this and shows the node as Unreachable. Raft ensures that other leaders are elected for the underlying data shards, but the universe is now partially under-replicated and cannot tolerate many more failures, so a quick remedy is needed.

Note that no MASTER/TSERVER is shown for that node. If the node was the Master Leader, the Raft-level election quickly moves leadership to another node. Similar leader elections happen for the tablets for which this node was the leader tablet server.

Since we know that the instance is dead, we can also release its IP using the ‘Release Instance’ dropdown option for the Removed node. The node then shows up as Decommissioned.

The node can be brought back to life on a new backing instance using the Add Node option from the dropdown for the Decommissioned node. For IaaS providers such as AWS and GCP, YugaWare spawns an instance with the existing instance type in the existing region and zone of that node. When the operation completes, the node has yb-master/yb-tserver processes running, along with some data load balanced onto it, and its status is marked ‘Live’. Note that the node name is reused and the node is now part of the healthy cluster.
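The replacement flow described here is a short walk through node statuses: a Removed node is released, becomes Decommissioned, and is then added back as Live. The sketch below models that walk as a transition table so that an invalid sequence fails early. The names `TRANSITIONS` and `apply_actions` are hypothetical, and the table contains only the transitions stated on this page.

```python
# (current status, action) -> resulting status, as described on this page.
# Illustrative names only; this is not a YugaWare API.
TRANSITIONS = {
    ("Live", "Remove Node"): "Removed",
    ("Removed", "Release Instance"): "Decommissioned",
    ("Removed", "Add Node"): "Live",
    ("Decommissioned", "Add Node"): "Live",
}


def apply_actions(status: str, actions: list) -> str:
    """Apply actions in order and return the final status.

    Raises ValueError as soon as an action is not valid for the
    status the node is in at that point.
    """
    for action in actions:
        key = (status, action)
        if key not in TRANSITIONS:
            raise ValueError(f"'{action}' is not allowed on a '{status}' node")
        status = TRANSITIONS[key]
    return status


if __name__ == "__main__":
    # Replace the backing instance of a node that is already Removed.
    print(apply_actions("Removed", ["Release Instance", "Add Node"]))
```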

Note: Do not REMOVE more than (RF - 1)/2 nodes at any given time. For example, on an RF=3 cluster with 3 server nodes, there can be only one removed node. The consensus algorithm requires this. We will be adding safeguards against this soon.

Note: REMOVE NODE will not work when the number of nodes equals the RF of the cluster, since there are no other nodes to which the load balancer can move the data.
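The two notes above amount to a small piece of arithmetic that can be checked before removing a node. The sketch below is one way to encode it, under two assumptions that the page does not state: the (RF - 1)/2 limit is rounded down for even RF values, and a universe with fewer nodes than RF is treated the same as one with exactly RF nodes. The function names are hypothetical.

```python
def max_nodes_out(rf: int) -> int:
    """Most nodes that may be out at once: (RF - 1) / 2, rounded down.

    Rounding down is an assumption; the page only gives the RF=3 example.
    """
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return (rf - 1) // 2


def can_remove_node(num_nodes: int, rf: int, already_removed: int = 0) -> bool:
    """Check both notes before removing one more node."""
    # With only RF nodes there is nowhere for the load balancer to move
    # the data (treating fewer-than-RF the same way is an assumption).
    if num_nodes <= rf:
        return False
    # Removing one more node must not exceed the (RF - 1)/2 limit.
    return already_removed + 1 <= max_nodes_out(rf)


if __name__ == "__main__":
    print(max_nodes_out(3))                          # limit for RF=3
    print(can_remove_node(num_nodes=4, rf=3))        # a spare node exists
    print(can_remove_node(num_nodes=3, rf=3))        # nodes == RF
```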

Quick operations, on an existing instance

The second scenario covers a ‘quick’ planned change performed on a node. For example, a DevOps engineer wants to mount a new disk on the node, or install and run a new security daemon. In that case the instance stays in use, but the running YugabyteDB processes might need to be stopped. The user can pick the Stop Processes option, perform the system task, and then pick Start Processes for that node.

The following two steps stop the server processes on the node and start them back up. No data is proactively moved off the node, but the data shard/tablet leaders could change as per Raft requirements.

Once yb-tserver (and yb-master, if applicable) is stopped, the node status is updated and the instance is ready for the planned system changes.

Note: Do not STOP processes on more than (RF - 1)/2 nodes at any given time. For example, on an RF=3 cluster with 3 nodes, only one node can have stopped processes, so that a majority of nodes can still perform consensus operations.

After the work is complete, the processes can be restarted via the same dropdown for that node.

The node will go back to ‘Live’ state once the processes are up and running.
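The stop, work, start sequence can be pictured as a scoped operation: the node is Stopped for the duration of the system task and returns to Live afterwards. The sketch below models only the status bookkeeping with a plain dictionary. The name `stopped_for_maintenance` is hypothetical, and the status assignments stand in for the Stop Processes and Start Processes actions in the UI; no real YugaWare call is made.

```python
from contextlib import contextmanager


@contextmanager
def stopped_for_maintenance(node: dict):
    """Keep a node 'Stopped' while a planned system task runs.

    `node` is a plain dict with a "status" key (an illustrative model).
    """
    if node["status"] != "Live":
        raise ValueError("only a Live node can have its processes stopped")
    node["status"] = "Stopped"   # stands in for Stop Processes
    yield node
    # Reached only if the task finished without raising. On failure the
    # node stays 'Stopped' so an operator can decide what to do next.
    node["status"] = "Live"      # stands in for Start Processes


if __name__ == "__main__":
    node = {"name": "example-node", "status": "Live"}
    with stopped_for_maintenance(node):
        print("during task:", node["status"])
    print("after task:", node["status"])
```

Leaving the node Stopped when the task fails is a deliberate choice in this sketch, since a Stopped node still has both Start Processes and Release Instance available.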

In the worst-case scenario, when the system runs into unrecoverable errors at this stage, the stopped node has a Release Instance option, which removes the backing instance as well, as described above.

To see a summary of all the actions run on this universe, check the ‘Tasks’ tab, which lists the remove/add and stop/start tasks.

Interaction with other operations

If there is a node in any of the in-transit states of Stopped, or Decommissioned in the universe, we disallow edit operations and . These operations are allowed once such a node comes out of that in-transit state.