|
| 1 | ++++ |
| 2 | +title = 'Client Timeouts' |
| 3 | +date = 2024-09-05T11:13:11-07:00 |
| 4 | +weight = 1 |
| 5 | ++++ |
| 6 | + |
| 7 | +## Troubleshooting Timeouts |
| 8 | + |
| 9 | +Client complaining about "timeout errors", but not sure how to track it down? |
| 10 | +Here's a simple utility for examining your situation. |
| 11 | + |
| 12 | +### First, check listen_disabled_num |
| 13 | + |
| 14 | +Before you go ahead with troubleshooting, you'll want to telnet to your |
| 15 | +memcached instance and run `stats`, then look for "listen_disabled_num". This |
| 16 | +is a poorly named counter which describes how many times you've reached |
| 17 | +maxconns. Each time memcached hits maxconns it will delay new connections, |
| 18 | +which means you'll possibly get timeouts. |
| 19 | + |
| 20 | +Also, disable or tune any firewalls you may have in the way. |
| 21 | + |
| 22 | +### Then, carefully check the usual suspects |
| 23 | + |
| 24 | +Is the machine in swap? You will see random lag bubbles if your OS is swapping |
| 25 | +memcached to disk periodically. |
| 26 | + |
| 27 | +Is the machine overloaded? 0% CPU idle with a load of 400 and memcached |
| 28 | +probably isn't getting enough CPU time. You can try `nice` or `renice`, or |
| 29 | +just run less on the machine. If you're severely overloaded on CPU, you might |
| 30 | +notice the mc_conn_tester below reporting very high wait times for `set` |
| 31 | +commands. |
| 32 | + |
| 33 | +Is the memcached server 32bit? 32bit hosts have less memory available to the |
| 34 | +kernel for TCP sockets and friends. We've observed some odd behavior under |
| 35 | +large numbers of open sockets and high load with 32bit systems. Strongly |
| 36 | +consider going 64bit, as it may help some hard to trace problems go away, |
| 37 | +including segfaults due to the 2/4g memory limit. |
| 38 | + |
| 39 | +### Next, mc_conn_tester.pl |
| 40 | + |
| 41 | +Fetch this: |
| 42 | + |
| 43 | +https://www.memcached.org/files/mc_conn_tester.pl |
| 44 | + |
| 45 | +``` |
| 46 | +$ ./mc_conn_tester.pl -s memcached-host -p 11211 -c 1000 --timeout 1 |
| 47 | +Averages: (conn: 0.00081381) (set: 0.00001603) (get: 0.00040122) |
| 48 | +$ ./mc_conn_tester.pl --help |
| 49 | +Usage: |
| 50 | + mc_conn_tester.pl [options] |
| 51 | +
|
| 52 | +Options: |
| 53 | + -s --server hostname |
| 54 | + Connect to an alternate hostname. |
| 55 | +[...etc...] |
| 56 | +``` |
| 57 | + |
| 58 | +This is a minimal utility for testing a quick routine with a memcached |
| 59 | +instance. It will connect, attempt a couple sets, attempt a few gets, then loop and |
| 60 | +repeat. |
| 61 | + |
| 62 | +The utility does not use any memcached client and instead does minimal, raw |
| 63 | +commands with the ASCII protocol. Thus helping to rule out client bugs. |
| 64 | + |
| 65 | +If it reaches a timeout, you can see how far along in the cycle it was: |
| 66 | + |
| 67 | +``` |
| 68 | +Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000) |
| 69 | +Fail: (timeout: 1) (elapsed: 1.00133896) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000) |
| 70 | +Fail: (timeout: 1) (elapsed: 1.00135303) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000) |
| 71 | +Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000) |
| 72 | +``` |
| 73 | + |
| 74 | +In the above line, it has a total elapsed time of the test, and then the times |
| 75 | +at which each sub-test succeeded. In the above scanario it wasn't able to |
| 76 | +connect to memcached, so all tests failed. |
| 77 | + |
| 78 | +``` |
| 79 | +Fail: (timeout: 1) (elapsed: 0.00121498) (conn: 0.00114512) (set: 1.00002694) (get: 0.00000000) |
| 80 | +Fail: (timeout: 1) (elapsed: 0.00368810) (conn: 0.00360799) (set: 1.00003314) (get: 0.00000000) |
| 81 | +Fail: (timeout: 1) (elapsed: 0.00128603) (conn: 0.00117397) (set: 1.00004005) (get: 0.00000000) |
| 82 | +Fail: (timeout: 1) (elapsed: 0.00115108) (conn: 0.00108099) (set: 1.00002789) (get: 0.00000000) |
| 83 | +``` |
| 84 | + |
| 85 | +In this case, it failed waiting for "get" to complete. |
| 86 | + |
| 87 | +If you want to log all of the tests mc_conn_tester.pl runs, open the file and |
| 88 | +change the line: |
| 89 | + |
| 90 | +``` |
| 91 | +my $debug = 0; |
| 92 | +``` |
| 93 | + |
| 94 | +to |
| 95 | + |
| 96 | +``` |
| 97 | +my $debug = 1; |
| 98 | +``` |
| 99 | + |
| 100 | +You will then see normal lines begin with `loop:` and failed tests will start |
| 101 | +with `Fail:` as usual. |
| 102 | + |
| 103 | +### You're probably dropping packets. |
| 104 | + |
| 105 | +In most cases, where listen_disabled_num doesn't apply, you're likely dropping |
| 106 | +packets for some reason. Either a firewall is in the way and has run out of |
| 107 | +stateful tracking slots, or your network card or switch is dropping packets. |
| 108 | + |
| 109 | +You'll most likely see this manifest as: |
| 110 | + |
| 111 | +``` |
| 112 | +Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000) |
| 113 | +``` |
| 114 | + |
| 115 | +... where `conn:` and the rest are all zero. So the test was not able to |
| 116 | +connect to memcached. |
| 117 | + |
| 118 | +On most systems SYN retries are 3 seconds, which is awfully long. Losing a |
| 119 | +single SYN packet will certainly mean a timeout. This is easily proven: |
| 120 | + |
| 121 | +``` |
| 122 | +$ ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 1 > log_one_second |
| 123 | +&& ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 4 > log_three_seconds |
| 124 | +&& ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 8 > log_eight_seconds |
| 125 | +``` |
| 126 | + |
| 127 | +... Run 5000 tests each round (you can adjust this if you wish). The first one |
| 128 | +having a timeout of 1s, which is often the client default. Then next with 4s, |
| 129 | +which would allow for one SYN packet to be lost but still pass the test. Then |
| 130 | +finally 8s, which allows two SYN packets to be lost in a row and yet still |
| 131 | +succeed. |
| 132 | + |
| 133 | +If you see the number of `Fail:` lines in each log file *decrease*, then your |
| 134 | +network is likely dropping SYN packets. |
| 135 | + |
| 136 | +Fixing that, however, is beyond the scope of this document. |
| 137 | + |
| 138 | +### TIME_WAIT buckets or local port exhaustion |
| 139 | + |
| 140 | +If mc_conn_tester.pl is seeing connection timeouts (conn: is 0), you may be |
| 141 | +running out of local ports, firewall states, or TIME_WAIT buckets. This can |
| 142 | +happen if you are opening and closing connections quicker than the sockets can |
| 143 | +die off. |
| 144 | + |
| 145 | +Use netstat to see how many you have open, and if the number is high enough |
| 146 | +that it may be problematic. `netstat -n | grep -c TIME_WAIT`. |
| 147 | + |
| 148 | +Details of how to tune these variables are outside the scope of this document, |
| 149 | +but google for "Linux TCP network tuning TIME_WAIT" (or whatever OS you have) will |
| 150 | +usually give you good results. Look for the variables below and understand |
| 151 | +their meaning before tuning. |
| 152 | + |
| 153 | +``` |
| 154 | +!THESE ARE EXAMPLES, NOT RECOMMENDED VALUES! |
| 155 | +net.ipv4.ip_local_port_range = 16384 65534 |
| 156 | +net.ipv4.tcp_max_tw_buckets = 262144 |
| 157 | +net.ipv4.tcp_fin_timeout = 30 |
| 158 | +net.ipv4.tcp_tw_reuse = 1 |
| 159 | +net.ipv4.tcp_tw_recycle = 1 |
| 160 | +``` |
| 161 | + |
| 162 | +Also read up on iptables and look up information on managing conntrack states |
| 163 | +or conntrack buckets. If you find some links you love e-mail us and we'll link |
| 164 | +them here. |
| 165 | + |
| 166 | +### But your utility never fails! |
| 167 | + |
| 168 | +Odds are good your client has a bug :( Try reaching out to the client author |
| 169 | +for help. |
0 commit comments