DOE/MICS Mid-Year Project Report
Date: June 1, 2003
Project Title: Net100
Project Type: Base
PI: Brian Tierney
Institution: LBNL
1. Executive Summary
The Net100 Collaboration (PSC, NCAR, UT, LBNL, and ORNL) is developing a model for
network-aware operating systems using Web100 as the means for incorporating network
information and its analysis into host operating systems to improve performance. To
investigate how effective network-aware operating systems can be, we are using a
three-phase approach. First, we will use the network-aware, Web100-based operating
system to tune a simple, bulk-transport application and demonstrate its use over high
performance network links. We will then extend this model to support more advanced
and complex applications, moving from point-to-point optimization to optimizations for
fully distributed environments. Finally, as proof that a network-aware operating
system can tune and optimize performance on behalf of applications, we will also develop
application-internal tools (based on NetLogger) to monitor the efficiency of application
support, and provide an external monitoring methodology to gauge the impact this system
has on the rest of the network.
Significant Net100 accomplishments to date (all sites)
- systematic testing of WAD using NTAF
- WAD tuning of parallel and single stream GridFTP
- demonstrated continuous flow tuning by WAD
- added Sally Floyd's high-speed extensions (HS-TCP) to WAD and Linux kernel
- PSC submitted TCP MIB to IETF (Web100)
- C WAD daemon (ORNL) and Python WAD (LBNL)
- WAD daemons developed and demonstrated 13x speedup (LBNL, ORNL)
- PSC developed new tool (pathprobe) based on Web100
- network probes and database deployed (LBNL)
- combined kernel mods for Web100 and LANL's DRS (ORNL)
- WAD tuned TCP AIMD parameters, "virtual MSS", 6x faster recovery from TCP loss (PSC, ORNL)
- event notification extensions to Web100 (ORNL, PSC)
- NetLogger/Web100 extensions to iperf (LBNL) and ttcp (ORNL)
- added Tom Kelly's Scalable TCP; testing and comparison with HS-TCP are in progress (ORNL/LBNL)
Publications:
- T. Dunigan, M. Mathis and B. Tierney, A TCP Tuning Daemon, Proceedings of IEEE Supercomputing 2002 Conference, Nov. 2002, LBNL-51022.
- B. Tierney, Using NetLogger and Web100 for TCP Analysis, Invited Paper, First International Workshop on Protocols for Fast Long-Distance Networks, LBNL-51776.
- A. Antony, J. Blom, C. de Laat, Jason Lee, W. Sjouw, Microscopic Examination of TCP flows over transatlantic Links, iGrid2002 special issue, Future Generation Computer Systems, volume 19 issue 6.
Submitted the following to IEEE Supercomputing:
- Brian L. Tierney, Jason R. Lee, Dan Gunter, Martin Stoufer, Improving Distributed Application Performance Using TCP Instrumentation
2. Recent LBNL Accomplishments (Dec 2002 - June 2003)
NTAF progress:
- Completed production deployment of NTAF: the development phase has concluded, and NTAF is stable and robust enough to run in an automated environment. This involved the following tasks:
- code refactoring
- Added new version of NetLogger, which provided better overall stability and fault tolerance. This integration also allowed for streamlining of the development code base.
- Added a pyGMA interface to NTAF
- wrote a NetLogger output module for the new versions of pathrate and pathload (C. Dovrolis), and incorporated into NTAF
- finished adding netest and NCSC (Guojun Jin's tools) to NTAF
- added parallel stream iperf testing
- started working on collecting SCNM results as part of the NTAF iperf tests
- rewrote netarchd to use the new NetLogger
- added GGF standard naming (from the Network Measurements Working Group) to all results. This allows us to easily compare the results from multiple tools
pyWAD progress:
- Wrote a HOWTO for pyWAD+NLV.
- Released a new version of pyWAD with many new features, such as:
  - ability to monitor dynamic ports (such as GridFTP)
  - ability to monitor other, non-Web100 attributes (such as interrupts or CPU use), triggered by port monitoring
  - ability to dump all Web100 counters at the end of a connection
- Wrote a modified version of pyWAD called pyAIMD for easily performing automatic testing using various AIMD parameters
Progress on the web interface to netarchd:
- converted everything from cgi to webware for increased performance and flexibility
- converted everything from gnuplot to "R", which gave us a much richer set of plotting options
- added lots of features, including the ability to compare single stream results with multi-stream results
- added a Data Mining Interface (DMI) and an Object-Oriented Database Interface (OODB). These allow for higher-level tools and faster access to the underlying data.
- some bug fixes
Protocol Analysis Work included:
- Performed a large number of simulations of Floyd's High-Speed TCP, and submitted a paper on the results to Supercomputing. See: http://www-itg.lbl.gov/~evandro/hstcp/index.html
- Performed many tests of HS-TCP and AIMD tuning over 10 Gbps transatlantic links. The results are at: http://www.nikhef.nl/~jason/UvA/
- Performed a number of tests of HS-TCP with GridFTP, the results of which are included in the submission to SC2003.
- Worked with ORNL on experiments with Tom Kelly's scalable TCP
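To illustrate why AIMD tuning and the alternative protocols matter on long, fast paths, the following is a minimal sketch (not code from pyWAD or the C WAD) that estimates how many round-trip times a flow needs to regain its pre-loss window after a single loss. It assumes the textbook AIMD model (additive increase per RTT, multiplicative decrease on loss), models a "virtual MSS" simply as a larger additive increase, and uses the commonly cited Scalable TCP constants a = 0.01 and b = 0.125. The function names and the 8000-segment window are hypothetical values chosen for illustration.

```python
import math


def aimd_recovery_rtts(cwnd_segments, increase_per_rtt=1.0, decrease_factor=0.5):
    """RTTs for an AIMD flow to grow back to cwnd_segments after one loss.

    increase_per_rtt: segments added to the window each RTT (1 for standard TCP;
                      a larger value is an assumed model of a "virtual MSS").
    decrease_factor:  fraction of the window removed on loss (0.5 for standard TCP).
    """
    reduced = cwnd_segments * (1.0 - decrease_factor)
    return math.ceil((cwnd_segments - reduced) / increase_per_rtt)


def scalable_recovery_rtts(a=0.01, b=0.125):
    """RTTs for Scalable TCP to regain its window after one loss.

    The window grows by a per ACK, i.e. by a factor of (1 + a) per RTT, and is
    cut by a factor of (1 - b) on loss, so recovery time does not depend on
    the window size.
    """
    return math.ceil(math.log(1.0 / (1.0 - b)) / math.log(1.0 + a))


if __name__ == "__main__":
    window = 8000  # hypothetical pre-loss window, in segments
    print("standard AIMD:", aimd_recovery_rtts(window), "RTTs")
    print("AIMD, increase of 6 segments/RTT:",
          aimd_recovery_rtts(window, increase_per_rtt=6), "RTTs")
    print("Scalable TCP:", scalable_recovery_rtts(), "RTTs")
```

Under this simple model, recovery time for standard AIMD grows linearly with the window, while raising the additive increase shortens it proportionally and Scalable TCP recovers in a fixed number of RTTs. Real flows also depend on slow start, delayed ACKs, and repeated losses, which the sketch ignores.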
Other work:
- Wrote and submitted a paper on Grid uses of TCP instrumentation to SC2003
- Continued to work with ORNL to evaluate (simulate, emulate, deploy) various TCP tuning options, including evaluating other auto-tuning proposals (Web100 auto-tuning, Linux 2.4, Feng's dynamic right sizing) and TCP Vegas
- Continued developing the tuning daemon and applying tuning to bulk transfer applications (GridFTP), parallel flows, and AIMD parameters, and continued working with Sally Floyd to test and validate High Speed TCP
- Continued monitoring, probing, and analyzing ESnet links, and informed ESnet of several performance problems.
- Continued to develop and apply more involved data mining techniques to analyze the results generated by NTAF. We hope these will yield a deeper understanding of the individual tests and of the meaning of synthesized metrics and multi-metric correlations.
- Began comparing actual flow behavior against the predictions of tools like iperf, pipechar, and pathrate
- Continued to test and gather performance data from all Net100 sites (SLAC, ORNL, NCSA, NCAR, PSC, NERSC)
- Began working on integrating passive network monitoring from LBNL's SCNM project
- Continued to work closely with the EU DataTAG project to use Net100 in their research
Some Sample Results:
3. LBNL Plans for the Coming 6 Months
The NTAF monitoring infrastructure is now completely in place, with new data being archived daily. The main task for the next few months will be to analyze the data in the archive, and to look for correlation between TCP settings and throughput. We will also continue to monitor the infrastructure for robustness, and continue adding more tools to NTAF. We are also evaluating the network monitoring publication schema being defined by the GGF Network Measurement working group.
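As a rough illustration of the planned correlation analysis, the sketch below computes a Pearson correlation coefficient between one numeric TCP setting and the measured throughput of each test. It is a minimal example under stated assumptions: the record layout, the function name `pearson`, and the sample values are hypothetical and do not reflect the actual netarchd archive schema or any real measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


if __name__ == "__main__":
    # Hypothetical (setting value, throughput in Mbit/s) pairs, for illustration only.
    records = [(1, 110.0), (2, 190.0), (4, 340.0), (8, 520.0)]
    settings = [r[0] for r in records]
    throughput = [r[1] for r in records]
    print("correlation:", pearson(settings, throughput))
```

A coefficient near +1 or -1 would suggest a strong linear relationship worth closer study, while a value near 0 would not rule out a nonlinear one, so plots of the raw data remain useful alongside the number.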
Specific tasks include:
- evolve NTAF to act as an OGSA Web Service for use in Grid network monitoring
- integrate SCNM results with NTAF, and analyze the combined results
- continue to test the GGF Network Measurement publication schemas, and provide feedback to GGF.
- continue to evaluate network analysis tools for possible inclusion in the Net100 NTAF (pathprobe, moping, pathrate)
- continue to work with ORNL to evaluate (simulate, emulate, deploy) various TCP tuning options, including evaluating other auto-tuning proposals (Web100 auto-tuning, Linux 2.4, Feng's dynamic right sizing) and TCP Vegas
- continue developing the tuning daemon and applying tuning to bulk transfer applications (GridFTP), parallel flows, and AIMD parameters, and continue working with Sally Floyd to test and validate High Speed TCP
- continue to test Net100 TCP modifications over 10 Gbps transatlantic links
- continue monitoring, probing, and analyzing ESnet links
- continue to enhance web interfaces with new statistical graphics. This will help with the exploration and discovery of more subtle correlations between NTAF measurements.
- start running multi-stream tests, and enhance the archive schema to better handle them
- develop and apply more involved data mining techniques to analyze the results generated by NTAF. We hope these will yield a deeper understanding of the individual tests and of the meaning of synthesized metrics and multi-metric correlations.
- compare actual flow behavior against the predictions of tools like iperf, NWS, pipechar, and pathrate
- continue to test and gather performance data from all Net100 sites (SLAC, ORNL, NCSA, NCAR, PSC, NERSC)
4. Research Interactions
We have ongoing interactions with:
- Cees de Laat, University of Amsterdam
- Sally Floyd, modified slow-start, modified AIMD
- Linda Winkler, Europe-US OC48 WAD testing
- Thomas Hacker, parallel TCP flows
- Wu Feng, Dynamic Right Sizing and TCP Vegas work
- LBL self-configuring network monitoring project (tcpdump server)
- kc claffy and Les Cottrell on INCITE and PingER data collection
- C. Dovrolis pathrate/pathload work
- R. Wolski and NWS
- NCS/pipechar project (Guojun Jin)
- various data grid projects (SciDAC)
- Probe/HPSS projects at NERSC/ORNL
- GGF Network Measurements working group
- Internet2 end-to-end projects (Surveyor, NIMI, etc.)
5. Remarks
Detailed information on progress for the LBNL portion of Net100 is maintained at:
http://www-didc.lbl.gov/net100/
Detailed status is at:
http://www-didc.lbl.gov/~jason/net100/
The full project web page is:
http://www.net100.org