Adds support for multiple managers running distributed fate #6139

Open

keith-turner wants to merge 7 commits into apache:main from keith-turner:dist-fate2

Conversation

@keith-turner
Contributor

Supports partitioning fate processing across multiple manager processes. When multiple manager processes are started, one will become primary and do mostly what the manager did before this change. However, processing of user fate operations is now spread across all manager processes. The following is a high-level guide to these changes.

  • New ManagerAssistant class that supports running tasks assigned by the primary manager process. Currently it only supports fate. This class runs in every manager process. This class does the following:
    • Gets its own lock in ZooKeeper, separate from the primary manager lock. This lock is at /managers/assistants in ZK. This lock is like a tserver or compactor lock. Every manager process will be involved in two locks: one to determine who is primary and one for the assistant functionality.
    • Starts a thrift server that can accept assignments of fate ranges to process. This second thrift server was needed in the manager because the primary thrift server is not started until after the primary manager gets its lock.
    • In the manager startup sequence this class is created and started before the manager waits for its primary lock. This allows non-primary managers to receive work from the primary via RPC.
    • In the future, this lock and thrift service could be used by the primary manager to delegate more tasks like compaction coordination, table management, balancing, etc. Each functionality could be partitioned in its own way and have its own RPCs for delegation.
    • This class creates its own ServerContext. This was needed because the server context has a reference to the server lock, and since a second lock was created, a second server context was needed. Would like to try to improve this in a follow-on change, as it seems likely to make the code harder to maintain and understand.
    • Because this class does not extend AbstractServer, it does not get some of the benefits that class offers, like monitoring of its new lock. Would like to improve this in a follow-on issue.
  • New FateWorker class. This runs in the ManagerAssistant and handles requests from the primary manager to adjust what range of the fate table it is currently working on.
  • New FateManager class that is run by the primary manager and is responsible for partitioning fate processing across all assistant managers. As manager processes come and go, this will repartition the fate table evenly across the managers (see the sketch after this description).
  • Some new RPCs for best-effort notifications. Before these changes there were in-memory notification systems that made the manager more responsive; these allowed a fate operation to signal the Tablet Group Watcher to take action sooner. FateWorkerEnv sends these notifications to the primary manager over a new RPC. It does not matter if they are lost; things will eventually happen.
  • Some adjustment of the order in which metrics were set up in the startup sequence was needed to make things work. Have not yet tested metrics w/ these changes.
  • Broke the Fate class into Fate and FateClient classes. The FateClient class supports starting and checking on fate operations, and most code uses FateClient. This breakup was needed because the primary manager will interact with FateClient to start operations and check on their status. Fate extends FateClient to minimize code changes, but it does not need to. In a follow-on, would like to remove this extension.
  • Fate operations previously updated two in-memory data structures related to bulk import and running compactions. These updates are no longer done. Would like to reexamine the need for these in follow-on issues.
  • Two new tests:
    • MultipleManagerIT: tests starting and stopping managers and ensures fate runs across all managers correctly.
    • ComprehensiveMultiManagerIT: tests all Accumulo APIs w/ three managers running. Does not start and stop managers.

This change needs some user-facing follow-on work to provide information to the user. Need to update the service status command, and also need to update the listing of running fate operations to show where they are running.
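To make the repartitioning described above concrete, here is a minimal sketch of how a primary could split a fixed set of fate buckets evenly across the live assistant managers. The FatePartition and getDesiredPartitions names echo classes mentioned in this PR, but the bucket-based key space, the record shape, and the remainder handling are illustrative assumptions, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a fixed number of fate "buckets" evenly across
// the currently live assistant managers. Names like FatePartition and the
// bucket-based key space are illustrative assumptions, not this PR's code.
public class FatePartitionSketch {

  // A contiguous range of fate buckets assigned to one assistant manager.
  record FatePartition(String assistant, int startBucket, int endBucketExclusive) {}

  static List<FatePartition> getDesiredPartitions(List<String> liveAssistants, int numBuckets) {
    List<FatePartition> partitions = new ArrayList<>();
    int n = liveAssistants.size();
    if (n == 0) {
      return partitions; // no assistants, nothing to assign
    }
    int base = numBuckets / n;      // buckets every assistant gets
    int remainder = numBuckets % n; // the first `remainder` assistants get one extra
    int start = 0;
    for (int i = 0; i < n; i++) {
      int size = base + (i < remainder ? 1 : 0);
      partitions.add(new FatePartition(liveAssistants.get(i), start, start + size));
      start += size;
    }
    return partitions;
  }

  public static void main(String[] args) {
    // Example: 16 buckets split across 3 managers -> partition sizes 6, 5, 5
    getDesiredPartitions(List.of("mgr1:9997", "mgr2:9997", "mgr3:9997"), 16)
        .forEach(System.out::println);
  }
}
```

As managers come and go, recomputing this list and diffing it against the current assignments would tell the primary which ranges to move, which is the kind of repartitioning the description refers to.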

keith-turner added this to the 4.0.0 milestone Feb 20, 2026
@dlmarion
Contributor

dlmarion commented Feb 20, 2026

I went back and looked at how I implemented the locking in #3262, the Multiple Manager PR from almost 3 years ago (I can't believe it's been that long). This comment isn't a suggestion to change what you have here, just pointing out an alternative that may or may not help.

  1. In the Manager class I changed the locking such that each Manager got its own lock at the same path that it's using today, but it's not an exclusive lock. Then I added a new path for the "primary manager" (a minimal sketch of this general idea follows the list).
  2. I modified ManagerClient.getManagerConnection to return the address of the primary manager when the client wanted to communicate with specific RPC endpoints that only the primary manager handled.
  3. In the Manager class I added a property for the expected number of Managers, and blocked all of the Managers from fully starting until that number had been reached. Much like it does for tablet servers.
  4. There is a LiveManagerSet implementation that might be of use, it behaves just like LiveTServerSet to listen and notify for changes to the number of servers.
  5. One difference I see between this and #3262 is that in #3262 the Thrift RPC server is started before the manager gets its lock. Before the lock is obtained, some fake ServiceDescriptors are added to the ServiceLockData so that clients cannot connect. After the lock is obtained, the lock data is updated with the correct host and port information that we want to advertise for that type of Manager (primary vs active secondary).
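For reference, here is a minimal sketch of the general shape of that locking scheme, using raw ZooKeeper ephemeral sequential nodes: every manager registers a non-exclusive per-process node, and the primary is whichever process owns the lowest-sequence node under a separate election path. The paths and the lowest-node election rule are illustrative assumptions about the idea, not code from this PR or #3262.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: every manager registers itself (non-exclusive), and the
// primary is the manager holding the lowest sequence node under ELECTION_PATH.
public class ManagerRegistrationSketch {

  static final String MANAGERS_PATH = "/managers/all";      // assumed path
  static final String ELECTION_PATH = "/managers/primary";  // assumed path

  // Non-exclusive registration: one ephemeral node per live manager process.
  static String register(ZooKeeper zk, String hostPort)
      throws KeeperException, InterruptedException {
    return zk.create(MANAGERS_PATH + "/manager-", hostPort.getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
  }

  // Each manager also enters the primary election under a separate path.
  static String enterElection(ZooKeeper zk, String hostPort)
      throws KeeperException, InterruptedException {
    return zk.create(ELECTION_PATH + "/candidate-", hostPort.getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
  }

  // The candidate with the lowest sequence suffix is the primary.
  static boolean isPrimary(ZooKeeper zk, String myCandidateNode)
      throws KeeperException, InterruptedException {
    List<String> candidates = zk.getChildren(ELECTION_PATH, false);
    Collections.sort(candidates); // sequence suffix sorts lowest-first
    String lowest = ELECTION_PATH + "/" + candidates.get(0);
    return lowest.equals(myCandidateNode);
  }
}
```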

@dlmarion
Contributor

dlmarion commented Feb 20, 2026

Do you think that the work partitioning algorithm can be generic enough to support splitting the other types of work that the Manager does? I'm wondering if the algorithm and RPC layer can be more generic. I did something like this in #3801, where a Task is a message sent from one server to another; the message includes the instructions for what the receiver should do, instead of the RPC being specific to the task.

@keith-turner
Contributor Author

Do you think that the work partitioning algorithm can be generic enough to support splitting the other types of work that the Manager does? I'm wondering if the algorithm and RPC layer can be more generic.

Probably could be made more generic. The relevant code is FatePartition.java, FateManager#getDesiredPartitions(), and fate-worker.thrift. Would probably still need the code that generates fate-specific partitions, but those could be rolled into a more generic partition data class and related RPC.
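A rough sketch of what a more generic version could look like: one assignment message type that the primary sends over a single RPC, with the fate range carried as type-specific options. The WorkAssignment and WorkType names and the string-map encoding are hypothetical, not anything in this PR.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of rolling fate-specific partitions into a more generic
// assignment message: one payload type that can carry any kind of delegated
// work, with the fate range encoded as type-specific options.
public class GenericPartitionSketch {

  enum WorkType { FATE, COMPACTION_COORDINATION, TABLE_MANAGEMENT }

  // What the primary would send to an assistant over a single generic RPC.
  record WorkAssignment(WorkType type, Map<String, String> options) {}

  // Fate-specific partition generation stays, but produces the generic message.
  static WorkAssignment fateAssignment(String startRow, String endRow) {
    return new WorkAssignment(WorkType.FATE,
        Map.of("startRow", startRow, "endRow", endRow));
  }

  public static void main(String[] args) {
    List<WorkAssignment> assignments =
        List.of(fateAssignment("0", "8"), fateAssignment("8", "f"));
    assignments.forEach(System.out::println);
  }
}
```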

@keith-turner
Contributor Author

Seems like the new ScopedValue API in Java 25 would be a nice way to solve the multiple server context problem in this PR.
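For illustration, a minimal sketch of that idea using the ScopedValue API finalized in Java 25 (JEP 506): bind a context object for the duration of a unit of work instead of constructing a second long-lived context. The ServerContext record here is a placeholder, not Accumulo's actual class, and the primary lock path shown is illustrative.

```java
// Illustrative sketch: bind a per-scope ServerContext-like object with a
// ScopedValue (java.lang.ScopedValue, Java 25) instead of creating a second
// context object. ServerContext here is a stand-in, not Accumulo's class.
public class ScopedContextSketch {

  record ServerContext(String lockPath) {} // placeholder for the real context

  static final ScopedValue<ServerContext> CONTEXT = ScopedValue.newInstance();

  static void handleFateWork() {
    // Code running inside the bound scope sees that scope's context only.
    System.out.println("working with lock " + CONTEXT.get().lockPath());
  }

  public static void main(String[] args) {
    ScopedValue.where(CONTEXT, new ServerContext("/managers/assistants"))
        .run(ScopedContextSketch::handleFateWork);
    ScopedValue.where(CONTEXT, new ServerContext("/managers/primary"))
        .run(ScopedContextSketch::handleFateWork);
  }
}
```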
