Commit 557a3d1

Update resiliency page
#14252, #7572, #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
1 parent bef38a4 commit 557a3d1

1 file changed: +23 -21 lines changed

docs/resiliency/index.asciidoc

Lines changed: 23 additions & 21 deletions
@@ -94,16 +94,35 @@ space. The following issues have been identified:
 
 Other safeguards are tracked in the meta-issue {GIT}11511[#11511].
 
+
+[float]
+=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)
+
+Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node
+while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719]
+
+[float]
+=== Jepsen Test Failures (STATUS: ONGOING)
+
+We have increased our test coverage to include scenarios tested by Jepsen. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce new error conditions. You can follow the work on the master branch of the https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class], where we will add more tests as time progresses.
+
+[float]
+=== Document guarantees and handling of failure (STATUS: ONGOING)
+
+This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch, and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code and an explicit PASS or FAIL status for each simulated case.
+
+== Unreleased
+
 [float]
-=== Loss of documents during network partition (STATUS: ONGOING)
+=== Loss of documents during network partition (STATUS: UNRELEASED, v5.0.0)
 
 If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions, for example, silently dropping requests without breaking the socket can take longer (up to 3x30s using current defaults).
 
 If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in {GIT}2488[split-brain] before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master ({GIT}7572[#7572]).
 To prevent this situation, the primary needs to wait for the master to acknowledge replica shard failures before acknowledging the write to the client. {GIT}14252[#14252]
 
 [float]
-=== Safe primary relocations (STATUS: ONGOING)
+=== Safe primary relocations (STATUS: UNRELEASED, v5.0.0)
 
 When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As
 cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the
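
Reviewer note on the "Loss of documents during network partition" entry in the hunk above: the following is a minimal Python sketch of two points that entry makes, not Elasticsearch code. It computes the worst-case detection window from the "3x30s" figure quoted in the text, and it models the rule that a primary acknowledges a write only after the master has acknowledged every replica failure. The names `worst_case_detection_window`, `Primary`, `master_acked_failures`, and `can_ack_write` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


def worst_case_detection_window(ping_retries: int = 3, ping_timeout_s: int = 30) -> int:
    """Upper bound, in seconds, on how long a silent partition can go unnoticed.

    The defaults mirror the "3x30s" figure from the text; the parameter
    names are assumptions made for this sketch.
    """
    return ping_retries * ping_timeout_s


@dataclass
class Primary:
    """Toy model of a primary shard and its allocated replicas (hypothetical)."""

    replicas: set
    # Replicas whose failure the master has already acknowledged.
    master_acked_failures: set = field(default_factory=set)

    def can_ack_write(self, succeeded_on: set) -> bool:
        """Return True only if every replica either applied the write
        or has had its failure acknowledged by the master."""
        failed = self.replicas - succeeded_on
        return failed <= self.master_acked_failures


primary = Primary(replicas={"r1", "r2"})
window = worst_case_detection_window()
ack_all_ok = primary.can_ack_write({"r1", "r2"})
ack_before_master = primary.can_ack_write({"r1"})
primary.master_acked_failures.add("r2")
ack_after_master = primary.can_ack_write({"r1"})
```

In this model a write that missed `r2` is held back until the master has acknowledged the failure of `r2`, which is the condition the text gives for preventing the loss of acknowledged documents when a replica is later promoted.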
@@ -117,23 +136,7 @@ on the relocation target, each of the nodes believes the other to be the active
 chasing the primary being quickly sent back and forth between the nodes, potentially making them both go OOM. {GIT}12573[#12573]
 
 [float]
-=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)
-
-Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node
-while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719]
-
-[float]
-=== Jepsen Test Failures (STATUS: ONGOING)
-
-We have increased our test coverage to include scenarios tested by Jepsen. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce new error conditions. You can follow the work on the master branch of the https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class], where we will add more tests as time progresses.
-
-[float]
-=== Document guarantees and handling of failure (STATUS: ONGOING)
-
-This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch, and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code and an explicit PASS or FAIL status for each simulated case.
-
-[float]
-=== Do not allow stale shards to automatically be promoted to primary (STATUS: ONGOING, v5.0.0)
+=== Do not allow stale shards to automatically be promoted to primary (STATUS: UNRELEASED, v5.0.0)
 
 In some scenarios, after the loss of all valid copies, a stale replica shard can be automatically assigned as a primary, preferring old data
 to no data at all ({GIT}14671[#14671]). This can lead to a loss of acknowledged writes if the valid copies are not lost but are rather
@@ -143,7 +146,7 @@ for one of the good shard copies to reappear. In case where all good copies are
 stale shard copy.
 
 [float]
-=== Make index creation resilient to index closing and full cluster crashes (STATUS: ONGOING, v5.0.0)
+=== Make index creation resilient to index closing and full cluster crashes (STATUS: UNRELEASED, v5.0.0)
 
 Recovering an index requires a quorum (with an exception for 2) of shard copies to be available to allocate a primary. This means that
 a primary cannot be assigned if the cluster dies before enough shards have been allocated ({GIT}9126[#9126]). The same happens if an index
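
Reviewer note on the quorum rule quoted in the hunk above ("a quorum (with an exception for 2) of shard copies"): the sketch below shows one way to read that rule, in Python. It assumes that the exception for 2 means a single available copy is enough when an index has exactly two copies, and that a quorum otherwise means a strict majority. The function names are hypothetical and this is not the allocator's actual code.

```python
def copies_needed_for_primary(total_copies: int) -> int:
    """Shard copies that must be available before a primary is allocated.

    Assumption for this sketch: with exactly two copies, one is enough
    (the "exception for 2"); otherwise a strict majority is required.
    """
    if total_copies < 1:
        raise ValueError("an index needs at least one shard copy")
    if total_copies == 2:
        return 1
    return total_copies // 2 + 1


def can_assign_primary(available_copies: int, total_copies: int) -> bool:
    """True if enough copies are available to allocate a primary."""
    return available_copies >= copies_needed_for_primary(total_copies)


# Three copies configured, but the cluster died after only one was allocated.
stuck = not can_assign_primary(available_copies=1, total_copies=3)
```

Under these assumptions, an index configured with three copies that crashed after allocating only one cannot get a primary on recovery, which is the failure mode described for {GIT}9126[#9126].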
@@ -153,7 +156,6 @@ recover an index in the presence of a single shard copy. Allocation IDs can also
 but none of the shards have been started. If such an index was inadvertently closed before at least one shard could be started, a fresh
 shard will be allocated upon reopening the index.
 
-== Unreleased
 
 [float]
 === Use two phase commit for Cluster State publishing (STATUS: UNRELEASED, v5.0.0)
