Skip to content

Commit 1abc652

Browse files
committed
import timeouts page
1 parent e199fc5 commit 1abc652

File tree

2 files changed

+171
-0
lines changed

2 files changed

+171
-0
lines changed

content/troubleshooting/_index.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,3 +3,5 @@ title = 'Troubleshooting'
33
date = 2024-09-02T11:18:44-07:00
44
weight = 5
55
+++
6+
7+
[Troubleshooting Client Timeouts](/troubleshooting/timeouts)

content/troubleshooting/timeouts.md

Lines changed: 169 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,169 @@
1+
+++
2+
title = 'Client Timeouts'
3+
date = 2024-09-05T11:13:11-07:00
4+
weight = 1
5+
+++
6+
7+
## Troubleshooting Timeouts
8+
9+
Client complaining about "timeout errors", but not sure how to track it down?
10+
Here's a simple utility for examining your situation.
11+
12+
### First, check listen_disabled_num
13+
14+
Before you go ahead with troubleshooting, you'll want to telnet to your
15+
memcached instance and run `stats`, then look for "listen_disabled_num". This
16+
is a poorly named counter which describes how many times you've reached
17+
maxconns. Each time memcached hits maxconns it will delay new connections,
18+
which means you'll possibly get timeouts.
19+
20+
Also, disable or tune any firewalls you may have in the way.
21+
22+
### Then, carefully check the usual suspects
23+
24+
Is the machine in swap? You will see random lag bubbles if your OS is swapping
25+
memcached to disk periodically.
26+
27+
Is the machine overloaded? 0% CPU idle with a load of 400 and memcached
28+
probably isn't getting enough CPU time. You can try `nice` or `renice`, or
29+
just run less on the machine. If you're severely overloaded on CPU, you might
30+
notice the mc_conn_tester below reporting very high wait times for `set`
31+
commands.
32+
33+
Is the memcached server 32bit? 32bit hosts have less memory available to the
34+
kernel for TCP sockets and friends. We've observed some odd behavior under
35+
large numbers of open sockets and high load with 32bit systems. Strongly
36+
consider going 64bit, as it may help some hard to trace problems go away,
37+
including segfaults due to the 2/4g memory limit.
38+
39+
### Next, mc_conn_tester.pl
40+
41+
Fetch this:
42+
43+
https://www.memcached.org/files/mc_conn_tester.pl
44+
45+
```
46+
$ ./mc_conn_tester.pl -s memcached-host -p 11211 -c 1000 --timeout 1
47+
Averages: (conn: 0.00081381) (set: 0.00001603) (get: 0.00040122)
48+
$ ./mc_conn_tester.pl --help
49+
Usage:
50+
mc_conn_tester.pl [options]
51+
52+
Options:
53+
-s --server hostname
54+
Connect to an alternate hostname.
55+
[...etc...]
56+
```
57+
58+
This is a minimal utility for testing a quick routine with a memcached
59+
instance. It will connect, attempt a couple sets, attempt a few gets, then loop and
60+
repeat.
61+
62+
The utility does not use any memcached client and instead does minimal, raw
63+
commands with the ASCII protocol. Thus helping to rule out client bugs.
64+
65+
If it reaches a timeout, you can see how far along in the cycle it was:
66+
67+
```
68+
Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
69+
Fail: (timeout: 1) (elapsed: 1.00133896) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
70+
Fail: (timeout: 1) (elapsed: 1.00135303) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
71+
Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
72+
```
73+
74+
In the above line, it has a total elapsed time of the test, and then the times
75+
at which each sub-test succeeded. In the above scanario it wasn't able to
76+
connect to memcached, so all tests failed.
77+
78+
```
79+
Fail: (timeout: 1) (elapsed: 0.00121498) (conn: 0.00114512) (set: 1.00002694) (get: 0.00000000)
80+
Fail: (timeout: 1) (elapsed: 0.00368810) (conn: 0.00360799) (set: 1.00003314) (get: 0.00000000)
81+
Fail: (timeout: 1) (elapsed: 0.00128603) (conn: 0.00117397) (set: 1.00004005) (get: 0.00000000)
82+
Fail: (timeout: 1) (elapsed: 0.00115108) (conn: 0.00108099) (set: 1.00002789) (get: 0.00000000)
83+
```
84+
85+
In this case, it failed waiting for "get" to complete.
86+
87+
If you want to log all of the tests mc_conn_tester.pl runs, open the file and
88+
change the line:
89+
90+
```
91+
my $debug = 0;
92+
```
93+
94+
to
95+
96+
```
97+
my $debug = 1;
98+
```
99+
100+
You will then see normal lines begin with `loop:` and failed tests will start
101+
with `Fail:` as usual.
102+
103+
### You're probably dropping packets.
104+
105+
In most cases, where listen_disabled_num doesn't apply, you're likely dropping
106+
packets for some reason. Either a firewall is in the way and has run out of
107+
stateful tracking slots, or your network card or switch is dropping packets.
108+
109+
You'll most likely see this manifest as:
110+
111+
```
112+
Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
113+
```
114+
115+
... where `conn:` and the rest are all zero. So the test was not able to
116+
connect to memcached.
117+
118+
On most systems SYN retries are 3 seconds, which is awfully long. Losing a
119+
single SYN packet will certainly mean a timeout. This is easily proven:
120+
121+
```
122+
$ ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 1 > log_one_second
123+
&& ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 4 > log_three_seconds
124+
&& ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 8 > log_eight_seconds
125+
```
126+
127+
... Run 5000 tests each round (you can adjust this if you wish). The first one
128+
having a timeout of 1s, which is often the client default. Then next with 4s,
129+
which would allow for one SYN packet to be lost but still pass the test. Then
130+
finally 8s, which allows two SYN packets to be lost in a row and yet still
131+
succeed.
132+
133+
If you see the number of `Fail:` lines in each log file *decrease*, then your
134+
network is likely dropping SYN packets.
135+
136+
Fixing that, however, is beyond the scope of this document.
137+
138+
### TIME_WAIT buckets or local port exhaustion
139+
140+
If mc_conn_tester.pl is seeing connection timeouts (conn: is 0), you may be
141+
running out of local ports, firewall states, or TIME_WAIT buckets. This can
142+
happen if you are opening and closing connections quicker than the sockets can
143+
die off.
144+
145+
Use netstat to see how many you have open, and if the number is high enough
146+
that it may be problematic. `netstat -n | grep -c TIME_WAIT`.
147+
148+
Details of how to tune these variables are outside the scope of this document,
149+
but google for "Linux TCP network tuning TIME_WAIT" (or whatever OS you have) will
150+
usually give you good results. Look for the variables below and understand
151+
their meaning before tuning.
152+
153+
```
154+
!THESE ARE EXAMPLES, NOT RECOMMENDED VALUES!
155+
net.ipv4.ip_local_port_range = 16384 65534
156+
net.ipv4.tcp_max_tw_buckets = 262144
157+
net.ipv4.tcp_fin_timeout = 30
158+
net.ipv4.tcp_tw_reuse = 1
159+
net.ipv4.tcp_tw_recycle = 1
160+
```
161+
162+
Also read up on iptables and look up information on managing conntrack states
163+
or conntrack buckets. If you find some links you love e-mail us and we'll link
164+
them here.
165+
166+
### But your utility never fails!
167+
168+
Odds are good your client has a bug :( Try reaching out to the client author
169+
for help.

0 commit comments

Comments
 (0)