Adds support for multiple managers running distributed fate #6139

Open

keith-turner wants to merge 7 commits into apache:main from keith-turner:dist-fate2

Conversation

@keith-turner
Contributor

Supports partitioning fate processing across multiple manager processes. When multiple manager processes are started, one will become primary and do mostly what the manager did before this change. However, processing of user fate operations is now spread across all manager processes. The following is a high-level guide to these changes.

  • New ManagerAssistant class that supports running tasks assigned by the primary manager process. Currently it only supports fate. This class runs in every manager process. This class does the following:
    • Gets its own lock in ZooKeeper, separate from the primary manager lock. This lock is at /managers/assistants in ZK. This lock is like a tserver or compactor lock. Every manager process will be involved in two locks: one to determine who is primary and one for the assistant functionality.
    • Starts a thrift server that can accept assignments of fate ranges to process. This second thrift server was needed in the manager because the primary thrift server is not started until after the primary manager gets its lock.
    • In the manager startup sequence this class is created and started before the manager waits for its primary lock. This allows non-primary managers to receive work from the primary via RPC.
    • In the future, this lock and thrift service could be used by the primary manager to delegate more tasks like compaction coordination, table management, balancing, etc. Each functionality could be partitioned in its own way and have its own RPCs for delegation.
    • This class creates its own ServerContext. This was needed because the server context has a reference to the server lock, and since a second lock was created, a second server context was needed. Would like to try to improve this in a follow-on change, as it seems likely to make the code harder to maintain and understand.
    • Because this class does not extend AbstractServer, it does not get some of the benefits that class offers, like monitoring of its new lock. Would like to improve this in a follow-on issue.
  • New FateWorker class. This runs in the ManagerAssistant and handles requests from the primary manager to adjust what range of the fate table it is currently working on.
  • New FateManager class that is run by the primary manager and is responsible for partitioning fate processing across all assistant managers. As manager processes come and go, this will repartition the fate table evenly across the managers (see the sketch after this description).
  • Some new RPCs for best-effort notifications. Before these changes there were in-memory notification systems that made the manager more responsive; these allowed a fate operation to signal the Tablet Group Watcher to take action sooner. FateWorkerEnv sends these notifications to the primary manager over a new RPC. It does not matter if they are lost; things will eventually happen.
  • Some adjustment of the order in which metrics were set up in the startup sequence was needed to make things work. Have not yet tested metrics w/ these changes.
  • Broke the Fate class into Fate and FateClient classes. The FateClient class supports starting and checking on fate operations, and most code uses FateClient. This breakup was needed because the primary manager will interact with FateClient to start operations and check on their status. Fate extends FateClient to minimize code changes, but it does not need to. In a follow-on, would like to remove this extension.
  • Fate operations previously updated two in-memory data structures related to bulk import and running compactions. These updates are no longer done. Would like to reexamine the need for these in follow-on issues.
  • Two new tests:
    • MultipleManagerIT: tests starting and stopping managers and ensures fate runs across all managers correctly.
    • ComprehensiveMultiManagerIT: tests all Accumulo APIs w/ three managers running. Does not start and stop managers.

This change needs some user-facing follow-on work to provide information to the user. Need to update the service status command, and also need to update the listing of running fate operations to show where they are running.
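To make the repartitioning described above concrete, here is a minimal sketch of how a primary could split a fixed set of fate buckets evenly across the live assistant managers. The FatePartition and getDesiredPartitions names echo classes mentioned in this PR, but the bucket-based key space, the record shape, and the remainder handling are illustrative assumptions, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a fixed number of fate "buckets" evenly across
// the currently live assistant managers. Names like FatePartition and the
// bucket-based key space are illustrative assumptions, not this PR's code.
public class FatePartitionSketch {

  // A contiguous range of fate buckets assigned to one assistant manager.
  record FatePartition(String assistant, int startBucket, int endBucketExclusive) {}

  static List<FatePartition> getDesiredPartitions(List<String> liveAssistants, int numBuckets) {
    List<FatePartition> partitions = new ArrayList<>();
    int n = liveAssistants.size();
    if (n == 0) {
      return partitions; // no assistants, nothing to assign
    }
    int base = numBuckets / n;      // buckets every assistant gets
    int remainder = numBuckets % n; // the first `remainder` assistants get one extra
    int start = 0;
    for (int i = 0; i < n; i++) {
      int size = base + (i < remainder ? 1 : 0);
      partitions.add(new FatePartition(liveAssistants.get(i), start, start + size));
      start += size;
    }
    return partitions;
  }

  public static void main(String[] args) {
    // Example: 16 buckets split across 3 managers -> partition sizes 6, 5, 5
    getDesiredPartitions(List.of("mgr1:9997", "mgr2:9997", "mgr3:9997"), 16)
        .forEach(System.out::println);
  }
}
```

As managers come and go, recomputing this list and diffing it against the current assignments would tell the primary which ranges to move, which is the kind of repartitioning the description refers to.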

keith-turner added this to the 4.0.0 milestone Feb 20, 2026
@dlmarion
Contributor

dlmarion commented Feb 20, 2026

I went back and looked at how I implemented the locking in #3262, the Multiple Manager PR from almost 3 years ago (I can't believe it's been that long). This comment isn't a suggestion to change what you have here, just pointing out an alternative that may or may not help.

  1. In the Manager class I changed the locking such that each Manager got its own lock at the same path that it's using today, but it's not an exclusive lock. Then I added a new path for the "primary manager" (a minimal sketch of this general idea follows the list).
  2. I modified ManagerClient.getManagerConnection to return the address of the primary manager when the client wanted to communicate with specific RPC endpoints that only the primary manager handled.
  3. In the Manager class I added a property for the expected number of Managers, and blocked all of the Managers from fully starting until that number had been reached. Much like it does for tablet servers.
  4. There is a LiveManagerSet implementation that might be of use, it behaves just like LiveTServerSet to listen and notify for changes to the number of servers.
  5. One difference I see between this and #3262 is that in #3262 the Thrift RPC server is started before the manager gets its lock. Before the lock is obtained, some fake ServiceDescriptors are added to the ServiceLockData so that clients cannot connect. After the lock is obtained, the lock data is updated with the correct host and port information that we want to advertise for that type of Manager (primary vs active secondary).
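For reference, here is a minimal sketch of the general shape of that locking scheme, using raw ZooKeeper ephemeral sequential nodes: every manager registers a non-exclusive per-process node, and the primary is whichever process owns the lowest-sequence node under a separate election path. The paths and the lowest-node election rule are illustrative assumptions about the idea, not code from this PR or #3262.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: every manager registers itself (non-exclusive), and the
// primary is the manager holding the lowest sequence node under ELECTION_PATH.
public class ManagerRegistrationSketch {

  static final String MANAGERS_PATH = "/managers/all";      // assumed path
  static final String ELECTION_PATH = "/managers/primary";  // assumed path

  // Non-exclusive registration: one ephemeral node per live manager process.
  static String register(ZooKeeper zk, String hostPort)
      throws KeeperException, InterruptedException {
    return zk.create(MANAGERS_PATH + "/manager-", hostPort.getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
  }

  // Each manager also enters the primary election under a separate path.
  static String enterElection(ZooKeeper zk, String hostPort)
      throws KeeperException, InterruptedException {
    return zk.create(ELECTION_PATH + "/candidate-", hostPort.getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
  }

  // The candidate with the lowest sequence suffix is the primary.
  static boolean isPrimary(ZooKeeper zk, String myCandidateNode)
      throws KeeperException, InterruptedException {
    List<String> candidates = zk.getChildren(ELECTION_PATH, false);
    Collections.sort(candidates); // sequence suffix sorts lowest-first
    String lowest = ELECTION_PATH + "/" + candidates.get(0);
    return lowest.equals(myCandidateNode);
  }
}
```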

@dlmarion
Contributor

dlmarion commented Feb 20, 2026

Do you think that the work partitioning algorithm can be generic enough to support splitting the other types of work that the Manager does? I'm wondering if the algorithm and RPC layer can be more generic. I did something like this in #3801, where a Task is a message sent from one server to another; the message includes the instructions for what the receiver should do, instead of the RPC being specific to the task.

@keith-turner
Contributor Author

Do you think that the work partitioning algorithm can be generic enough to support splitting the other types of work that the Manager does? I'm wondering if the algorithm and RPC layer can be more generic.

Probably could be made more generic. The relevant code is FatePartition.java, FateManager#getDesiredPartitions(), and fate-worker.thrift. Would probably still need the code that generates fate-specific partitions, but those could be rolled into a more generic partition data class and related RPC.
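A rough sketch of what a more generic version could look like: one assignment message type that the primary sends over a single RPC, with the fate range carried as type-specific options. The WorkAssignment and WorkType names and the string-map encoding are hypothetical, not anything in this PR.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of rolling fate-specific partitions into a more generic
// assignment message: one payload type that can carry any kind of delegated
// work, with the fate range encoded as type-specific options.
public class GenericPartitionSketch {

  enum WorkType { FATE, COMPACTION_COORDINATION, TABLE_MANAGEMENT }

  // What the primary would send to an assistant over a single generic RPC.
  record WorkAssignment(WorkType type, Map<String, String> options) {}

  // Fate-specific partition generation stays, but produces the generic message.
  static WorkAssignment fateAssignment(String startRow, String endRow) {
    return new WorkAssignment(WorkType.FATE,
        Map.of("startRow", startRow, "endRow", endRow));
  }

  public static void main(String[] args) {
    List<WorkAssignment> assignments =
        List.of(fateAssignment("0", "8"), fateAssignment("8", "f"));
    assignments.forEach(System.out::println);
  }
}
```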

@keith-turner
Contributor Author

Seems like the new ScopedValue API in Java 25 would be a nice way to solve the multiple server context problem in this PR.
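For illustration, a minimal sketch of that idea using the ScopedValue API finalized in Java 25 (JEP 506): bind a context object for the duration of a unit of work instead of constructing a second long-lived context. The ServerContext record here is a placeholder, not Accumulo's actual class, and the primary lock path shown is illustrative.

```java
// Illustrative sketch: bind a per-scope ServerContext-like object with a
// ScopedValue (java.lang.ScopedValue, Java 25) instead of creating a second
// context object. ServerContext here is a stand-in, not Accumulo's class.
public class ScopedContextSketch {

  record ServerContext(String lockPath) {} // placeholder for the real context

  static final ScopedValue<ServerContext> CONTEXT = ScopedValue.newInstance();

  static void handleFateWork() {
    // Code running inside the bound scope sees that scope's context only.
    System.out.println("working with lock " + CONTEXT.get().lockPath());
  }

  public static void main(String[] args) {
    ScopedValue.where(CONTEXT, new ServerContext("/managers/assistants"))
        .run(ScopedContextSketch::handleFateWork);
    ScopedValue.where(CONTEXT, new ServerContext("/managers/primary"))
        .run(ScopedContextSketch::handleFateWork);
  }
}
```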
