In a variety of different subsystems here at Rapleaf, we’ve noticed some peculiar behavior from the Ruby socket libraries. Under some circumstances, a blocking IO call on a socket appears to go on forever, even in the face of the underlying socket closing.
Now, obviously, a blocking IO call is not expected to return until it’s done. If there’s an error in the underlying socket, like the connection being reset or closed, you’d expect the blocking call to fail and throw an exception, or at least return *something* that indicates that it failed. In most situations, this is what happens. For instance:
value = @sock.read(4)
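will return nil once the other end has closed the connection and there is nothing left to read, and will raise an exception (Errno::ECONNRESET, for example) if the read itself fails. With this kind of pattern, you can code robustly around socket communication.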
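To make the possible outcomes of such a read concrete, here is a minimal sketch. It assumes a Unix-like system (it uses UNIXSocket.pair as a stand-in for a real network connection), and classify_read is a hypothetical helper name, not part of any library. In current Ruby versions, IO#read(length) returns nil at end of file, while failures such as a connection reset surface as exceptions.

```ruby
require 'socket'

# Hypothetical helper: perform one blocking read and label the outcome.
# Returns [:data, bytes], [:eof, nil] or [:error, exception].
def classify_read(sock, nbytes)
  data = sock.read(nbytes)
  data.nil? ? [:eof, nil] : [:data, data]
rescue SystemCallError, IOError => e
  # Errno::ECONNRESET and friends are SystemCallError subclasses.
  [:error, e]
end

# Demo on a local socket pair standing in for a client/server connection.
reader, writer = UNIXSocket.pair
writer.write("ping")
first = classify_read(reader, 4)    # four bytes are waiting
writer.close
second = classify_read(reader, 4)   # peer closed, nothing left to read
```

The point of the sketch is only that a well-behaved read always ends in one of these three outcomes; the trouble described in this post is the fourth case, where it never returns at all.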
However, we’ve seen some circumstances where this doesn’t work as expected. One of those is a DRb server under a high number of connections: 1024, to be precise. When the 1025th connection comes in, the process essentially goes berserk. It goes from moderate load to 100% CPU and stops responding to new connection requests. Some poking around with ThreadDump showed the DRb server waiting forever on a #read call. Using netstat, we were able to verify that the connection had already been closed on the client side, but the server was still waiting to do its read. Moreover, nothing will *ever* cause that connection to go away on its own. Wrapping the read in a timeout call appears to alleviate the problem. (I believe that the underlying problem of going nuts at 1025 sockets is an issue with C’s select function, whose descriptor sets are commonly limited to FD_SETSIZE, typically 1024, file descriptors.) Note: you can usually only get this behavior to occur when you’ve raised the limit on open files beyond the default of 1024. Otherwise, you’ll get “too many open files” errors long before 1024 sockets are open.
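A timeout around a blocking read might look roughly like the sketch below. This is not the actual DRb patch; read_with_timeout and the :timed_out marker are hypothetical names chosen for illustration, and the example again assumes a Unix-like system for UNIXSocket.pair.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper: blocking read that gives up after `seconds`.
# Returns the data, nil at end of file, or :timed_out.
def read_with_timeout(sock, nbytes, seconds)
  Timeout.timeout(seconds) { sock.read(nbytes) }
rescue Timeout::Error
  :timed_out
end

# Demo: a peer that stays connected but never sends anything.
reader, writer = UNIXSocket.pair
result = read_with_timeout(reader, 4, 0.2)   # gives up after about 0.2 seconds
```

After a timeout the state of the stream is unknown (part of a message may have been consumed), so it is usually safest to close the socket rather than reuse it.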
Another place where an infinite wait caused us trouble was within the memcache-client gem. We run three web servers, each with a memcached instance on it. During some maintenance, we rebooted one of the web servers, expecting that the memcache-clients in all our Mongrels would happily go about their business. Wrong! Trusty ThreadDump allowed us to find the server waiting for memcache-client to make a connection to the downed memcached server. Even more peculiarly, the problem only occurred when the web server hosting memcached was rebooted without first stopping the memcached process. If the memcached process was shut down normally, all the other servers recovered and went along happily. It seems that attempting to open a socket to a downed server will hang forever in some circumstances, even without a large amount of load on it (unlike the previous example). Again, introducing a timeout around each get/set/add/delete/get_multi method of memcache-client has solved the problem.
Finally, even the stalwart Net/HTTP standard library appears to be a victim of this behavior. Periodically, when we attempt to connect to various web services, the call to TCPSocket.open will hang forever. The real irony of this one is that Net/HTTP even has the code for connection timeouts built into the class, but the standard #start method doesn’t give you a way to pass a value. (This particular issue is treated in much greater detail here).
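One way to use the timeouts that are already built into the class is to set them on the instance before starting the session. The sketch below assumes a Ruby version in which Net::HTTP exposes the open_timeout and read_timeout accessors; http_with_timeouts and get_or_nil are hypothetical helper names, and "example.com" is only a placeholder host.

```ruby
require 'net/http'
require 'timeout'

# Hypothetical helper: build (but do not start) a Net::HTTP object
# with explicit timeouts, both in seconds.
def http_with_timeouts(host, port, open_timeout, read_timeout)
  http = Net::HTTP.new(host, port)
  http.open_timeout = open_timeout   # limit on establishing the connection
  http.read_timeout = read_timeout   # limit on each wait for response data
  http
end

# Hypothetical helper: GET a path, returning nil if either timeout fires.
# In current Ruby versions the timeout errors raised by Net::HTTP are
# subclasses of Timeout::Error.
def get_or_nil(http, path)
  http.start { |conn| conn.get(path) }
rescue Timeout::Error
  nil
end

http = http_with_timeouts("example.com", 80, 2, 5)
# response = get_or_nil(http, "/")   # uncomment to make a real request
```

Treating a timeout as nil is a choice made for this sketch; a caller that needs to tell a slow service apart from a missing one would let the exception propagate instead.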
So what’s the lesson here? If you’re doing blocking IO over the network, Ruby’s socket library cannot be trusted to fail well. It’s up to you to add your own timeouts to survive potentially webserver-crushing infinite waits.
This does, however, make me wonder about two things: first, the quality of the socket implementation in Ruby, and second, whether some effort should be made to rewrite the standard library code, as well as various gems, with nonblocking IO methods. That is, use the *_nonblock style methods and IO.select to do single-threaded, timed-out IO. This would be preferable to Timeout::timeout, since all that does is spawn a second thread that interrupts the work with an exception after a fixed amount of time.
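A single-threaded timed read along those lines could be sketched as follows. It assumes a Ruby version that provides IO::WaitReadable (1.9.2 or later) and a Unix-like system for the UNIXSocket.pair demo; timed_read is a hypothetical helper name. Note that read_nonblock may return fewer than max_bytes, unlike a blocking read.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper: read up to max_bytes, waiting at most `seconds`.
# Returns the data, or nil at end of file; raises Timeout::Error if
# nothing becomes readable in time. No extra thread is involved.
def timed_read(io, max_bytes, seconds)
  deadline = Time.now + seconds
  begin
    io.read_nonblock(max_bytes)
  rescue IO::WaitReadable
    # Nothing to read yet: wait for readability, but only until the deadline.
    remaining = deadline - Time.now
    if remaining <= 0 || IO.select([io], nil, nil, remaining).nil?
      raise Timeout::Error, "no data within #{seconds}s"
    end
    retry
  rescue EOFError
    nil
  end
end

# Demo: data that is already waiting is returned immediately.
reader, writer = UNIXSocket.pair
writer.write("pong")
reply = timed_read(reader, 4, 0.5)
```

Timeout::Error is reused here only so that callers can rescue the same exception they would with Timeout::timeout; a dedicated error class would work equally well.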
UPDATE 12/08/2007:
Some have asked for our memcache-client extension code. Here it is:
require 'timeout'

# Reopens MemCache (memcache-client must already be loaded) and wraps each
# operation in a timeout. The optional trailing timeout argument is in seconds.
class MemCache
  alias_method :old_get, :get
  alias_method :old_set, :set
  alias_method :old_incr, :incr
  alias_method :old_add, :add
  alias_method :old_delete, :delete
  alias_method :old_get_multi, :get_multi

  # A timed-out get is reported as nil, i.e. it looks like a cache miss.
  def get(key, raw = false, timeout = 1.0)
    Timeout::timeout(timeout) do
      old_get(key, raw)
    end
  rescue Timeout::Error
    nil
  end

  # The write operations below let Timeout::Error propagate to the caller.
  def set(key, value, expiry = 0, raw = false, timeout = 1.0)
    Timeout::timeout(timeout) do
      old_set(key, value, expiry, raw)
    end
  end

  def incr(key, amount = 1, timeout = 1.0)
    Timeout::timeout(timeout) do
      old_incr(key, amount)
    end
  end

  def add(key, value, expiry = 0, raw = false, timeout = 1.0)
    Timeout::timeout(timeout) do
      old_add(key, value, expiry, raw)
    end
  end

  def delete(key, expiry = 0, timeout = 1.0)
    Timeout::timeout(timeout) do
      old_delete(key, expiry)
    end
  end

  # A trailing Float or Integer argument is assumed to be the timeout;
  # otherwise the default is 15 seconds. A timed-out call returns {}.
  def get_multi(*args)
    last = args.last
    timeout = (last.is_a?(Float) || last.is_a?(Integer)) ? args.pop : 15
    Timeout::timeout(timeout) do
      old_get_multi(*args)
    end
  rescue Timeout::Error
    {}
  end
end

5 Comments
Hey, I’ve been looking for this solution all morning! I figured the memcache-client needed socket timeouts because I couldn’t see any in there. Would you mind sharing the code?
–scott
Btw, thanks for the blog! We’re trying to keep similar notes over at http://geekblog.vodpod.com.
Rewrite the standard libraries? Yes! Python’s standard web libraries are in a terrible shape and I’m sure Ruby’s are similar.
The problem is that there’s -so much- of them (and so many places they are used) that gaining the momentum needed to fix everything up is impossible.
I maintain memcache-client. Socket timeouts and massive speed increases are in the 1.6.x releases and Rails 2.3 will contain this code. Rails 2.2 contains the old 1.5.0 code. See my website for more details.
Yes, the socket libs in Ruby core have been known for a while to be flaky.
We had also recently debugged a similar issue though we found it to be timeout’s fault.
http://blog.kineticweb.com/articles/2009/02/09/when-timeouts-arent-timeouts