Date: January 5, 2006
Author: Daniel Stenberg

Status of project Hiper - high performance libcurl modifications
================================================================

What is Hiper

You won't find such a description in this document. See http://curl.haxx.se/libcurl/hiper/ for further details.

Live Progress Info

During my work, I've posted occasional updates on the curl-library mailing list, but more importantly I've frequently updated http://curl.haxx.se/libcurl/hiper/schedule.html

Schedule

I took time off my regular job during December 2005 and the first week of January 2006 to work on hiper full-time.

Step 1 - Measure the Existing Solution

I started full-time work on project Hiper on December 1st 2005. I began by putting together a test application that used the existing API, so that I could accurately measure execution and transfer speeds when doing a large number of transfers.

I soon discovered that it was impossible to do any sensible measurements using live URLs, since the transfers were too unreliable and uncontrolled. I then enhanced the existing HTTP server in the curl test suite to support a large number of transfers and some extra magic "commands" that make the server either just sit "idle" or "stream" (continuously sending data in a never-ending stream).

I then wrote two files in the curl test suite file format. By accessing the properly formatted URLs on my localhost, the HTTP server would either run "idle" or run "stream".

With this working, I patched libcurl to always recv() only a single byte off the network each time, just to make sure that the time spent on reading data is constant and never very long.

I adjusted the test application (actually called 'hiper') to create Y idle transfers and Z stream transfers, run for N seconds and then quit and produce a summary on stdout. Now I got very solid and repeatable results.
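
The one-byte read mentioned above can be pictured with a small sketch. This is not the actual patch applied to libcurl; the function name recv_capped is hypothetical, and the only point is that every read asks the socket for at most one byte so that each read costs about the same.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical stand-in for a patched read path: no matter how large the
   caller's buffer is, never ask the socket for more than a single byte. */
ssize_t recv_capped(int sockfd, void *buf, size_t len)
{
  size_t want = (len > 1) ? 1 : len;
  return recv(sockfd, buf, want, 0);
}
```

With three bytes queued on a socket, three calls are needed to drain them, which is what keeps the per-read time short and constant in the measurements.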

I started to run repeated tests and save the results when I ran into the dreaded 1024 socket maximum limit.

One side of the problem is that the fd_set type only allows 1024 file descriptors (on my Linux), which I solved by simply making my own type with room for more connections and doing ugly typecasts in the code.

The other side of the problem is that the system imposes a limit on the maximum number of file descriptors a user application can have open. I worked around that by writing a special tool that runs setuid root, increases the limit, downgrades to a normal user again and then runs the command line of your choice.

This second approach has to be used for both 'hiper' and the test HTTP server. (You need to build the HTTP server with CURL_SWS_FORK_ENABLED defined to have it fork, since forking isn't desirable when running the normal curl tests.)

Now I could run my test program without problems. I decided to run the tests with 1 stream connection and a varying number of idle ones. I did 1001, 2001, 3001, 5001 and 9001 connections and measured how long select() and curl_multi_perform() (including the curl_multi_fdset() call) would take on average, over a period of 20 seconds. I ran each test 5-6 times and used the average time of all the runs.

The times, in microseconds:

Connections  multi_perform  select
       1001           3504     951
       2001           7606    1988
       3001          11045    2715
       5001          16406    4024
       9001          32147    8030

Test system

CPU:     Athlon XP 2800
RAM:     1 GB
Linux:   2.6
glibc:   2.3.5
libcurl: 7.15.1

The only reason I stopped at 9001 connections is that my test machine ran out of available memory by then, as I ran the test server on the same machine, and I didn't want to risk the accuracy of the results by having it start swapping during the tests.

This means that at roughly 9000 connections we spend about 40 ms (32147 + 8030 microseconds) for each socket action, even when only one socket ever has any action.
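
For the per-process descriptor limit discussed earlier in this step, the following is a minimal sketch of how a program can inspect and raise its own limit with the standard getrlimit() and setrlimit() calls. It is not the setuid tool described above: the function name ensure_fd_limit is hypothetical, and without root privileges the soft limit can only be raised as far as the hard limit, which is the reason the real tool had to run as root first.

```c
#include <sys/resource.h>

/* Make sure the soft limit on open file descriptors is at least 'wanted',
   as far as the hard limit allows. Returns the soft limit in effect
   afterwards, or 0 if the limit could not be read or changed. */
rlim_t ensure_fd_limit(rlim_t wanted)
{
  struct rlimit rl;

  if(getrlimit(RLIMIT_NOFILE, &rl) != 0)
    return 0;

  if(rl.rlim_cur >= wanted)
    return rl.rlim_cur;      /* already high enough, leave it alone */

  /* An unprivileged process cannot go above the hard limit. */
  rl.rlim_cur = (wanted > rl.rlim_max) ? rl.rlim_max : wanted;

  if(setrlimit(RLIMIT_NOFILE, &rl) != 0)
    return 0;

  return rl.rlim_cur;
}
```

A caller that needs, say, 10000 descriptors would compare the returned value against 10000 and fall back to a privileged helper if the hard limit turned out to be lower. Note that this does nothing about the separate fd_set size limit.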

With the 32000 microseconds curl_multi_perform() takes for 9000 connections, it loops 18000 laps, which makes less than 2 microseconds per lap. (Of course counting time/laps is an oversimplification, but anyway.) Hopefully we should achieve less than 10 microseconds for each call to curl_multi_socket() for an active connection.

The timing graph displayed on the libevent site (duplicated on the hiper project page) suggests that libevent is pretty much fixed at 50 microseconds. (I don't know what test box was used in their testing, but comparing with the select() times from my tests shows that they are at least reasonably close.)

Summing up, the current ~40 ms spent at 9000 connections could then possibly be lowered to something around 60 us (10 + 50)!

Step 2 - Implement curl_multi_socket API

Most of the design decisions and debates about this new API were held on the curl-library mailing list a long time ago, so I had a basic idea of what approach to use. The main ideas of the new API are simply:

1 - The application can use whatever event system it likes, as it gets info from libcurl about which file descriptors libcurl waits for which action on. (The previous API returns fd_sets, which is very select()-centric.)

2 - When the application discovers action on a single socket, it calls libcurl and informs it that there was action on this particular socket. libcurl can then act on that socket/transfer only and not care about any other transfers. (The previous API always had to scan through all the existing transfers.)

The idea is that curl_multi_socket() calls a given callback with information about which socket to wait for which action on, and the callback only gets called if the status of that socket has changed.

In the earlier API draft, we had a timeout argument on a per socket basis, and we also allowed curl_multi_socket() to be passed an 'easy handle' instead of a socket, to allow libcurl to shortcut a lookup and work on the affected easy handle right away.
Both of these turned out to be bad ideas.

The timeout argument was removed from the socket callback because, after much thinking, I came to the conclusion that we really don't want to handle timeouts on a per socket basis. We need them on a per transfer (easy handle) basis, and thus we can't provide them in the callbacks in a nice way. Instead, we have to offer a curl_multi_timeout() that returns the largest amount of time we should wait before we call the "timeout action" of libcurl, to trigger the proper internal timeout action on the affected transfer. To get this to work, I added a struct to each easy handle in which we store an "expire time" (if any). The structs are then "splay sorted" so that we can add and remove times from the linked list and yet somewhat swiftly figure out 1 - how much time there is until the next timer expires and 2 - which timer (handle) we should take care of now. Of course, the upside of all this is that we get a curl_multi_timeout() that should also work with old-style applications that use curl_multi_perform().

The easy handle argument was removed from the curl_multi_socket() function because having it there would require the application to do a socket to easy handle conversion on its own. I find it very unlikely that applications would want to do that, and libcurl would need such a lookup of its own anyway since we didn't want to force applications to write that translation code (it would be optional), so it seemed like an unnecessary option. I also realized that when we use underlying libraries such as c-ares (for asynchronous DNS resolving), there might in fact be more than one transfer waiting for action on the same socket, which makes the lookup even trickier and even less likely to ever get done by applications. Instead I created an internal "socket to easy handles" hash table that, given a socket (file descriptor), returns a list of the easy handles that wait for some action on that socket.
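
As a rough picture of the "socket to easy handles" lookup described above, here is a minimal sketch of a chained hash table in C. It is not libcurl's internal code: the names sockhash, sh_add, sh_lookup and sh_remove, the void pointer standing in for an easy handle, and the bucket count of 97 are all assumptions made for illustration.

```c
#include <stdlib.h>

#define SH_SLOTS 97 /* assumed bucket count, a prime picked for illustration */

/* One (socket, handle) pair. Several handles may wait on the same socket. */
struct sh_entry {
  int sockfd;
  void *handle;           /* stands in for an easy handle */
  struct sh_entry *next;
};

/* Fixed-size array of singly linked lists. Zero-initialize before use. */
struct sockhash {
  struct sh_entry *slot[SH_SLOTS];
};

/* The hash only selects which list to scan. */
static unsigned int sh_index(int sockfd)
{
  return (unsigned int)sockfd % SH_SLOTS;
}

/* Record that 'handle' waits for action on 'sockfd'.
   Returns 0 on success, -1 if memory allocation fails. */
int sh_add(struct sockhash *h, int sockfd, void *handle)
{
  unsigned int i = sh_index(sockfd);
  struct sh_entry *e = malloc(sizeof(*e));
  if(!e)
    return -1;
  e->sockfd = sockfd;
  e->handle = handle;
  e->next = h->slot[i];   /* push at the head of the list */
  h->slot[i] = e;
  return 0;
}

/* Store up to 'max' handles waiting on 'sockfd' in 'out'.
   Returns how many handles wait on that socket in total. */
size_t sh_lookup(const struct sockhash *h, int sockfd, void **out, size_t max)
{
  size_t found = 0;
  const struct sh_entry *e;
  for(e = h->slot[sh_index(sockfd)]; e; e = e->next) {
    if(e->sockfd == sockfd) {
      if(found < max)
        out[found] = e->handle;
      found++;
    }
  }
  return found;
}

/* Forget one (socket, handle) pair. Returns 1 if removed, 0 if not found. */
int sh_remove(struct sockhash *h, int sockfd, void *handle)
{
  struct sh_entry **pp = &h->slot[sh_index(sockfd)];
  while(*pp) {
    struct sh_entry *e = *pp;
    if(e->sockfd == sockfd && e->handle == handle) {
      *pp = e->next;
      free(e);
      return 1;
    }
    pp = &e->next;
  }
  return 0;
}
```

Because sockets that share a bucket end up in the same list, the cost of a lookup grows with the number of entries per bucket, so the bucket count matters once there are thousands of connections.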

To make libcurl able to report plain sockets in the socket callback, I had to reorganize the internals of curl_multi_fdset() etc. so that the conversion from sockets to fd_sets for that function is only done in the last step before the data is returned. I also had to extend c-ares with a function that can return plain sockets, as that library too returned only fd_sets and that is no longer good enough. The changes to c-ares have been committed and are available in the c-ares CVS repository, destined to be included in the upcoming c-ares 1.3.1 release.

The 'shiper' tool is the test application I wrote that uses the new curl_multi_socket() in its current state. It seems to be working, and it uses the API as it is documented and supposed to work. It still uses select(), because I needed that during development (until I had the socket hash implemented etc.) and because I haven't yet learned how to use libevent or similar.

The hiper/shiper tools are very simple: they initiate lots of connections, keep them running for the test period and then kill them all.

Since I wasn't done with the implementation until early January, I haven't had time to run very many measurements and checks, but I have done a few runs with up to a few hundred connections (with a single active one). The curl_multi_socket() call then takes 3-6 microseconds on average (using the read-only-1-byte-at-a-time hack). If this number does not increase a lot when we add connections, it certainly matches my (in my opinion very ambitious) goal. We are now below the 60 microseconds "per socket action" goal.

It is destined to be somewhat higher the more connections we have, since the hash table gets more populated, the splay tree will grow, etc.

Some tests at 7000 and 9000 connections showed that the socket hash lookup is somewhat of a bottleneck. Its current implementation may be a bit too limiting.
It simply has a fixed-size array, and each entry in the array holds a linked list of entries. So the hash only selects which list to scan through. The code I had used so far had an array with merely 7 slots (as that is what the DNS hash uses), but with 7000 connections that makes an average of 1000 nodes in each list to run through. I upped that to 97 slots (I believe a prime is suitable) and noticed a significant speed increase. I need to reconsider the hash implementation or use a rather large default value like this. At 9000 connections I was still below 10 us per call.

Status Right Now

The curl_multi_socket() API is implemented according to how it is documented. The man pages for curl_multi_socket and curl_multi_timeout are both committed to CVS and are available online for easy browsing:

http://curl.haxx.se/libcurl/c/curl_multi_socket.html
http://curl.haxx.se/libcurl/c/curl_multi_timeout.html

The hiper-5.patch I made available early in the morning of January 5th, 2006 should apply fine on a recent CVS checkout. (At the time of this writing curl 7.15.1 is the latest public curl release, but the hiper patch does not apply cleanly on that.)

What is Left for the curl_multi_socket API

1 - More measuring with more extreme numbers of connections

2 - More testing with actual URLs and complete transfers from start to end.

I'm quite sure we don't set expire times properly everywhere in the code, so there are bound to be some timeout bugs left.

What it really takes is for me to commit the code and make an official release with it, so that we get people "out there" to help test it.

What is Left for project Hiper

1 - Add HTTP pipelining support

2 - Add a zero (or at least close to zero) copy interface

Neither of these points has been planned, nor has it been detailed exactly how they will be implemented.

Roadmap Ahead

I plan and hope to return to full-time hiper work later this spring, or possibly summer, to continue where I pause now.
Of course some spare time might also be spent until then to get us moving forward.