Adds support for multiple managers running distributed fate #6139
keith-turner wants to merge 7 commits into apache:main
Conversation
Supports partitioning fate processing across multiple manager
processes. When multiple manager processes are started, one will become
primary and do mostly what the manager did before this change. However,
processing of user fate operations is now spread across all manager
processes. The following is a high-level guide to these changes.
* New ManagerAssistant class that supports running tasks assigned by the
  primary manager process. Currently it only supports fate. This class
  runs in every manager process and does the following.
* Gets its own lock in zookeeper separate from the primary manager
lock. This lock is at `/managers/assistants` in ZK. This lock is
like a tserver or compactor lock. Every manager process will be
involved in two locks, one to determine who is primary and one for
the assistant functionality.
* Starts a thrift server that can accept assignments of fate ranges
to process. This second thrift server was needed in the manager
because the primary thrift server is not started until after the
primary manager gets its lock.
* In the manager startup sequence this class is created and started
before the manager waits for its primary lock. This allows
non-primary managers to receive work from the primary via RPC.
* In the future, this lock and thrift service could be used by the
  primary manager to delegate more tasks like compaction coordination,
table management, balancing, etc. Each functionality could be
partitioned in its own way and have its own RPCs for delegation.
* This class creates its own ServerContext. This was needed because
  the server context has a reference to the server lock. Since a
  second lock was created, another server context was needed. Would
  like to try to improve this in a follow-on change as it seems
  likely to make the code harder to maintain and understand.
* Because this class does not extend AbstractServer, it does not get
  some of the benefits that class offers, like monitoring of its new
  lock. Would like to improve this in a follow-on issue.
* New FateWorker class. This runs in the ManagerAssistant and handles
  requests from the primary manager to adjust what range of the fate
  table it is currently working on.
* New FateManager class that is run by the primary manager and is
  responsible for partitioning fate processing across all assistant
  managers. As manager processes come and go, this will repartition
  the fate table evenly across the managers (a minimal partitioning
  sketch follows this list).
* Some new RPCs for best effort notifications. Before these changes
  there were in-memory notification systems that made the manager
  more responsive. These would allow a fate operation to signal the
  Tablet Group Watcher to take action sooner. FateWorkerEnv sends
  these notifications to the primary manager over a new RPC. It does
  not matter if they are lost; things will eventually happen (a
  fire-and-forget sketch appears after the summary below).
* Some adjustment of the order in which metrics were set up in the
  startup sequence was needed to make things work. Have not yet
  tested metrics w/ these changes.
* Broke the Fate class into Fate and FateClient classes. The FateClient
  class supports starting and checking on fate operations. Most
  code uses the FateClient. This breakup was needed as the primary
  manager will interact with the FateClient to start operations
  and check on their status. Fate extends FateClient to minimize
  code changes, but it does not need to. In a follow-on, would like
  to remove this extension.
* Fate operations update two in-memory data structures related
  to bulk import and running compactions. These updates are no
  longer done. Would like to reexamine the need for these in
  follow-on issues.
* Two new tests:
* MultipleManagerIT : tests starting and stopping managers and
ensures fate runs across all managers correctly.
* ComprehensiveMultiManagerIT : tests all accumulo APIs w/ three
managers running. Does not start and stop managers.
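
The partitioning done by FateManager#getDesiredPartitions() can be pictured with a small, self-contained sketch. This is only an illustration of the idea under assumed names: the `Partition` record, the hash-space size, and the splitting arithmetic below are hypothetical and are not the PR's actual FatePartition code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: evenly split a hash space of fate transaction ids into
// one contiguous range per manager process. The real FatePartition /
// FateManager#getDesiredPartitions() code in the PR may differ substantially.
public class FatePartitionSketch {

  // A partition is a half-open range [start, end) of the hash space.
  record Partition(int start, int end) {}

  static List<Partition> desiredPartitions(int numManagers, int hashSpace) {
    List<Partition> partitions = new ArrayList<>();
    int base = hashSpace / numManagers;
    int remainder = hashSpace % numManagers;
    int start = 0;
    for (int i = 0; i < numManagers; i++) {
      // Spread the remainder over the first partitions so sizes differ by at most one.
      int size = base + (i < remainder ? 1 : 0);
      partitions.add(new Partition(start, start + size));
      start += size;
    }
    return partitions;
  }

  public static void main(String[] args) {
    // Three managers over a 256-bucket hash space: [0,86), [86,171), [171,256)
    desiredPartitions(3, 256).forEach(System.out::println);
  }
}
```

When a manager process joins or leaves, recomputing the ranges and pushing the new assignments over RPC is all that repartitioning conceptually requires.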
This change needs some user-facing follow-on work to provide information
to the user. Need to update the service status command. Also need to
update the listing of running fate operations to show where they are
running.
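
Regarding the best-effort notification RPCs mentioned above, the design choice is that a dropped notification only delays work, it never loses it, because the primary manager's periodic processing will eventually notice the same condition. A minimal sketch of that fire-and-forget pattern follows; the `PrimaryManagerClient` interface and `notifyTabletGroupWatcher` method are hypothetical stand-ins for the new thrift RPC, not names from the PR.

```java
import java.util.logging.Logger;

// Hypothetical fire-and-forget notification. If the RPC to the primary manager
// fails, the failure is logged and ignored; the Tablet Group Watcher will still
// notice the condition on its next periodic pass, just a little later.
class BestEffortNotifier {
  private static final Logger log = Logger.getLogger(BestEffortNotifier.class.getName());

  // Stand-in for the new RPC to the primary manager.
  interface PrimaryManagerClient {
    void notifyTabletGroupWatcher(String tableId);
  }

  private final PrimaryManagerClient client;

  BestEffortNotifier(PrimaryManagerClient client) {
    this.client = client;
  }

  void notifyTableChanged(String tableId) {
    try {
      client.notifyTabletGroupWatcher(tableId);
    } catch (RuntimeException e) {
      // Best effort: losing the notification is acceptable, periodic processing covers it.
      log.fine(() -> "Dropped notification for table " + tableId + ": " + e);
    }
  }
}
```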
I went back and looked at how I implemented the locking in #3262, the Multiple Manager PR from almost 3 years ago (I can't believe it's been that long). This comment isn't a suggestion to change what you have here, just pointing out an alternative that may or may not help.
Do you think that the work partitioning algorithm can be generic enough to support splitting the other types of work that the Manager does? I'm wondering if the algorithm and RPC layer can be more generic. I did something like this in #3801, where a Task is a message sent from one server to another; the message included the instructions for what the receiver should do instead of the RPC being specific to the task.
Probably could be made more generic. The relevant code is FatePartition.java, FateManager#getDesiredPartitions(), and fate-worker.thrift. Would probably still need the code that generates fate-specific partitions, but those could be rolled into a more generic partition data class and related RPC.
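
To make that concrete, here is a sketch of what a more generic assignment could look like, with fate partitions as just one payload type. All names here (`WorkType`, `WorkAssignment`, `AssistantWorker`) are hypothetical and do not appear in the PR.

```java
import java.util.List;

// Hypothetical sketch of a more generic assignment message: instead of a
// fate-specific RPC, the primary manager sends typed tasks with opaque
// payloads the receiver knows how to decode.

// Kind of work being delegated; fate partitions are just one case.
enum WorkType { FATE_PARTITION, COMPACTION_COORDINATION, TABLE_MANAGEMENT }

// The payload carries the feature-specific detail, e.g. a serialized fate
// partition range for WorkType.FATE_PARTITION.
record WorkAssignment(WorkType type, byte[] payload) {}

interface AssistantWorker {
  // One generic RPC replaces per-feature RPCs.
  void assignWork(List<WorkAssignment> assignments);
}
```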
Seems like the new ScopedValue API in Java 25 would be a nice way to solve the multiple server context problem in this PR.
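
For reference, a minimal sketch of that idea, assuming the ScopedValue API finalized in Java 25 (JEP 506); the `ServerContext` record here is only a placeholder, not Accumulo's actual class wiring.

```java
// Minimal sketch assuming Java 25's ScopedValue (JEP 506). ServerContext is a
// stand-in record; the point is only that a per-lock context could be bound to
// a scope instead of constructing and threading through a second context object.
public class ScopedContextSketch {

  record ServerContext(String lockPath) {}

  private static final ScopedValue<ServerContext> CONTEXT = ScopedValue.newInstance();

  static void runFateWorker() {
    // Anywhere inside the bound scope, CONTEXT.get() sees the assistant's context.
    System.out.println("fate worker using lock " + CONTEXT.get().lockPath());
  }

  public static void main(String[] args) {
    ScopedValue.where(CONTEXT, new ServerContext("/managers/assistants"))
        .run(ScopedContextSketch::runFateWorker);
  }
}
```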