This topic describes network partitioning scenarios and what happens to the partitioned sides of the distributed system.
In a network partitioning scenario, the "losing side" is the partition of the cluster in which the membership coordinator has detected that too few members remain to form a quorum.
The membership coordinator calculates the change in membership weight after sending out its view preparation message. If a quorum of members does not remain after the view preparation phase, the coordinator on the "losing side" declares a network partition event and sends a network-partition-detected UDP message to the members. The coordinator then closes its distributed system with a ForcedDisconnectException. If a member fails to receive this message before the coordinator closes the connection, the member is responsible for detecting the event on its own.
When the losing side discovers that a network partition event has occurred, all peer members receive a RegionDestroyedException with Operation: FORCED_DISCONNECT.
[info 2008/05/01 11:14:51.853 PDT <CloserThread> tid=0x4a] Invoked splitBrain.SBListener: afterRegionDestroy in client1
whereIWasRegistered: 14291
event.isReinitializing(): false
event.getDistributedMember(): thor(14291):40440/34132
event.getCallbackArgument(): null
event.getRegion(): /TestRegion
event.isDistributed(): false
event.isExpiration(): false
event.isOriginRemote(): false
Operation: FORCED_DISCONNECT
Operation.isDistributed(): false
Operation.isExpiration(): false
Peers still actively performing operations on the cache may see ShutdownExceptions or CacheClosedExceptions with Caused by: ForcedDisconnectException.
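Application code that catches these exceptions may want to tell a forced disconnect apart from an ordinary cache close. The following is a minimal sketch of one way to do that, by walking the exception's cause chain. The class `ForcedDisconnectCheck`, the helper `hasCause`, and the nested `ForcedDisconnectException` are hypothetical stand-ins so that the example compiles without GemFire on the classpath. They are not part of the product API.

```java
class ForcedDisconnectCheck {

    // Hypothetical stand-in for the product's exception type, defined here
    // only so the sketch is self-contained.
    static class ForcedDisconnectException extends RuntimeException {
        ForcedDisconnectException(String message) {
            super(message);
        }
    }

    // Returns true if the throwable, or any throwable in its cause chain,
    // has the given simple class name. The depth bound guards against
    // cyclic cause chains.
    static boolean hasCause(Throwable t, String simpleName) {
        int depth = 0;
        while (t != null && depth < 32) {
            if (t.getClass().getSimpleName().equals(simpleName)) {
                return true;
            }
            t = t.getCause();
            depth++;
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulates an exception seen by a peer that was still performing
        // cache operations when the partition was detected.
        Throwable failure = new IllegalStateException("cache closed",
                new ForcedDisconnectException("network partition detected"));

        if (hasCause(failure, "ForcedDisconnectException")) {
            System.out.println("Forced disconnect: stop retrying cache operations");
        } else {
            System.out.println("Cache closed for another reason");
        }
    }
}
```

Matching on the simple class name keeps the sketch free of product dependencies. In an application that has the GemFire classes available, an `instanceof` check against the real exception type would be the more conventional choice.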
When a member is isolated from all locators, it cannot receive membership view changes. It cannot tell whether the current coordinator is still present or, if the coordinator has left, whether other members are available to take over that role. In this condition, the member eventually detects the loss of all other members and uses the loss threshold to determine whether it should shut itself down. For example, in a distributed system with two locators and two cache servers, losing communication with the non-lead cache server and both locators produces this situation, and the remaining cache server eventually shuts itself down.
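The arithmetic behind the scenario above can be sketched as follows. The weights (3 per locator, 10 per cache server, 15 for the lead cache server) and the 0.51 loss threshold are assumptions made for illustration. This topic does not state them, so check the values that apply to your product version. The class and method names are hypothetical.

```java
class PartitionWeightSketch {

    // Assumed member weights, for illustration only.
    static final int LOCATOR = 3;
    static final int CACHE_SERVER = 10;
    static final int LEAD_CACHE_SERVER = 15;

    static int sum(int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        return total;
    }

    // Fraction of the previous view's total weight that is no longer reachable.
    static double lostFraction(int[] previousView, int[] unreachable) {
        int total = sum(previousView);
        return total == 0 ? 0.0 : (double) sum(unreachable) / total;
    }

    // A member shuts down when the lost fraction reaches the loss threshold.
    static boolean shouldShutDown(double lostFraction, double lossThreshold) {
        return lostFraction >= lossThreshold;
    }

    public static void main(String[] args) {
        // Two locators and two cache servers, one of which is the lead member.
        int[] previousView = {LOCATOR, LOCATOR, LEAD_CACHE_SERVER, CACHE_SERVER};
        // The lead cache server loses contact with both locators and the other server.
        int[] unreachable = {LOCATOR, LOCATOR, CACHE_SERVER};

        double lost = lostFraction(previousView, unreachable);
        System.out.println("lost weight: " + sum(unreachable) + " of " + sum(previousView));
        System.out.println("shut down: " + shouldShutDown(lost, 0.51));
    }
}
```

Under these assumed weights, the remaining cache server has lost 16 of 31 weight units, or roughly 52 percent. That exceeds the assumed threshold, which is consistent with the shutdown described above.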