|Managing Pivotal GemFire / Troubleshooting and System Recovery|
The safest response to a network outage is to restart all the processes and bring up a fresh data set.
When the network connecting members of a distributed system goes down, system members treat this like a machine crash. Members on each side of the network failure respond by removing the members on the other side from the membership list. If network partitioning detection is enabled, one partition containing the lead member and a locator will continue to run, and other data hosts will shut down; otherwise, the system will behave as explained below.
The members recreate the data as they return to active work. For details, see Recovering from Application and Cache Server Crashes.
Both sides of the distributed system continue to run as though the members on the other side were not running. If the members that participate in a partitioned region are on both sides of the network failure, both sides of the partitioned region also continue to run as though the data stores on the other side did not exist. In effect, you now have two partitioned regions.
When the network recovers, the members may be able to see each other again, but they are not able to merge back together into a single distributed system and combine their buckets back into a single partitioned region. You can be sure that the data is in an inconsistent state. Whether you are configured for data redundancy or not, you don’t really know what data was lost and what wasn’t. Even if you have redundant copies and they survived, different copies of an entry may have different values reflecting the interrupted workflow and inaccessible data.
By default, both sides of the distributed system continue to run as though the members on the other side were not running. For distributed regions, however, the regions’s reliability policy configuration can change this default behavior.
When the network recovers, the members may be able to see each other again, but they are not able to merge back together into a single distributed system.
If a client loses contact with all of its servers, the effect is the same as if it had crashed. You need to restart the client. See Recovering from Client Failure. If a client loses contact with some servers, but not all of them, the effect on the client is the same as if the unreachable servers had crashed. See Recovering from Server Failure.
Servers, like applications, are members of a distributed system, so the effect of network failure on a server is the same as for an application. Exactly what happens depends on the configuration of your site.