DRAFT: Enable TCP advice service
-------------------------------------------------------------------

Purpose:

  To make it easy for clients to find the optimal TCP buffer size for
  a given data server. The service is called "enable" because it
  enables clients to achieve much higher throughput from the server.
  The idea is that there is an "enable server" collocated with each
  data service (e.g.: ftp, http, hpss, dpss).

Architecture:

  A simple service that does the following:

  + Monitors log files (e.g.: ftp logs, http logs) for connections
    from clients, and runs network tests based on this list of
    clients. It can also read a list of hosts to monitor from a
    config file.
  + When Enable detects a new client, it runs pipechar
    (http://www-didc.lbl.gov/pipechar/) and ping, and stores the
    results in a database.
  + Clients can query the enable server to get this data, which is
    keyed on client hostname.
  + The server can also be configured to re-run the tests
    periodically (e.g.: every 24 hours).

Implementation:

  + server written in Python
  + use BerkeleyDB for the database
  + use XML-RPC for the client/server messaging protocol
    (http://www.xml-rpc.com/), migrating to SOAP and/or the GF GMA
    messaging protocol as these become standardized and as tools
    become available
  + provide Java, Python, and C client APIs to query the DB
  + provide a data server API that lets data services tell enable
    which clients to monitor
  + provide the ability to run multiple types of tests (e.g.:
    pipechar, iperf, etc.)
  + config file that specifies:
      + list of log files to monitor
      + how often to re-run tests
      + list of clients to always monitor, even if not in the log
        files
      + other stuff too, I'm certain

Client API:

  Provide blocking and non-blocking calls: if the client is not in
  the DB, it may request that the test be run on demand, and that the
  server send the results at the end of the test. A typical pipechar
  test takes 3-5 minutes.

Open Issues/Questions:

  + To reduce the amount of redundant testing, maybe we should have
    only 1 entry per subnet?
    How to tell if 2 hosts are on the same subnet? Something in the
    config file to specify this?
  + Should we time out clients from the DB? I.e.: if the server does
    not get a connection from a given client for 30 days, should it
    be removed from the test list?

Future:

  Provide a client API to send back to the server the throughput
  actually seen by the client, and compare this to the results
  predicted by pipechar. Use this data to somehow improve pipechar's
  estimate. Possibly link this to the NWS prediction models. Lots of
  possibilities for interesting things here.

API:

  getNetInfo(host):    returns latency and bottleneck bandwidth;
                       results must already be in the server DB
                       (server runs ping and pipechar tests)
  testHost(host):      tells server to add this host to the list of
                       hosts to test, and returns the results of the
                       first test. Note: this may take several
                       minutes to return.
  addHost(host):       just add host to the list of hosts to test,
                       and return
  removeHost(host):    remove this host from the test list
  getCurrentBW(host):  server runs an iperf test
  listHosts():         list all hosts currently being monitored, and
                       their attributes (i.e.: which tests are being
                       run, etc.)

Other random ideas:

  The idea of "detects a new client" might need to be refined a
  little to avoid running too many tests in the case of servers which
  happen to have a lot of really transient clients. At the least, a
  parameter (per log file, and overall) specifying the minimum
  connect time needed to count as a "new client" might be useful.

  Another thing that might reduce redundant testing is to give
  servers "peers" that they can redirect queries to first, before
  running a test themselves. Maybe even a simple "routing table" for
  requests could specify the ranges of network addresses that are
  known by certain peers? For this to work, it would be important to
  include a timestamp of the test time in the response, but that
  would be a good thing to have in general anyway.
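
The peer "routing table" idea above could look roughly like the sketch
below. Everything named here is hypothetical and only for illustration:
the PEER_ROUTES table, the find_peer and is_fresh helpers, the
"tested_at" result field, the example.org peer URLs, and the
documentation-only address ranges. The sketch assumes a query is sent
to the peer with the most specific matching range, and that a peer's
answer is used only if its test timestamp is recent enough.

```python
import ipaddress
import time

# Hypothetical routing table: address range -> peer enable server.
# The ranges and URLs are placeholders, not real deployments.
PEER_ROUTES = [
    (ipaddress.ip_network("198.51.100.0/24"), "http://enable-a.example.org:8000"),
    (ipaddress.ip_network("203.0.113.0/24"), "http://enable-b.example.org:8000"),
]


def find_peer(client_ip, routes=PEER_ROUTES):
    """Return the peer URL whose range contains client_ip, or None.

    If several ranges match, the most specific one (longest prefix) wins.
    """
    addr = ipaddress.ip_address(client_ip)
    best = None
    for network, peer in routes:
        if addr in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, peer)
    return best[1] if best else None


def is_fresh(result, max_age_seconds, now=None):
    """True if the result's test timestamp is no older than max_age_seconds.

    Assumes the response carries a 'tested_at' field in seconds since
    the epoch; that field name is an assumption of this sketch.
    """
    if now is None:
        now = time.time()
    return (now - result["tested_at"]) <= max_age_seconds
```

A server would call find_peer() before scheduling its own pipechar
run, and fall back to a local test when no peer matches or when
is_fresh() rejects the peer's answer as too old.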

  I think the periodic measurements could be a show-stopper, so I
  think that your timeout idea could be extended by keeping track of
  how many queries there were for various clients, and automatically
  re-running only those that go over a threshold (say, 3). An
  implementation trick would be to decrement the count for all the
  ones that got 0 requests, and then kick out the ones whose count
  falls below some negative number (say -7).

  If the hosts can report both their IP address and subnet as part of
  the testing process, you can get the extent of the subnet and draw
  conclusions later. Another trick might be a lookup into BGP to find
  the AS number and use that, although that would not work for
  internal networks (within DOE, for example).
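
As a minimal sketch of the subnet idea above: if each host reports its
IP address and netmask, the extent of its subnet follows directly, and
hosts can be grouped so that only one test per subnet is needed. The
function names (subnet_of, group_by_subnet) and the report format (a
host, ip, netmask tuple) are assumptions made for illustration; the
addresses are documentation-only examples.

```python
import ipaddress
from collections import defaultdict


def subnet_of(ip, netmask):
    """Return the network implied by a host's self-reported IP and netmask."""
    return ipaddress.ip_interface(f"{ip}/{netmask}").network


def group_by_subnet(reports):
    """Group hostnames by subnet.

    reports is an iterable of (host, ip, netmask) tuples; the result
    maps the subnet in CIDR notation to the hosts that reported it.
    """
    groups = defaultdict(list)
    for host, ip, netmask in reports:
        groups[str(subnet_of(ip, netmask))].append(host)
    return dict(groups)


reports = [
    ("a.example.org", "192.0.2.17", "255.255.255.0"),
    ("b.example.org", "192.0.2.200", "255.255.255.0"),
    ("c.example.org", "198.51.100.5", "255.255.255.128"),
]
groups = group_by_subnet(reports)
```

Here a.example.org and b.example.org land in the same group
(192.0.2.0/24), so a single database entry could serve both. This only
works if the hosts report their netmask honestly, and it says nothing
about hosts that share a bottleneck link but sit on different subnets.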