Enabling High Performance Data Transfers
Posted by root in the Network section on Wed 21 Dec 2005 at 03:36
System Specific Notes for System Administrators (and Privileged Users)
This (DRAFT) page is currently under active revision. Please send any suggestions, additions or corrections to us at nettune@ncne.org so we can keep the information here as up-to-date as possible.
Introduction

Today, the majority of university users have physical network connections that are at least 100 megabits per second all the way through the Internet to every important data center in the world (as well as to every other university user). For many users, that connection might be 1 gigabit per second or faster. In some countries (e.g. Korea and Japan) the same statement applies to every home connection as well: 100 Mb/s from home to all important web servers, data centers and to each other.

To put these data rates into perspective, consider this: 100 Mb/s is more than 10 megabytes in one second, or 600 megabytes (an entire CD-R image) in one minute. Clearly very few people see these data rates. However, some experts can get very high data rates (for example see the Land Speed Records). Why?

The biggest strength of the Internet is the way in which the TCP/IP "hourglass" hides the details of the network from the application and vice versa. An unfortunate but direct consequence of the hourglass is that it also hides all flaws everywhere. Network performance debugging (often euphemistically called "TCP tuning") is extremely difficult because nearly all flaws have exactly the same symptom: reduced performance. For example, insufficient TCP buffer space is indistinguishable from excess packet loss (silently repaired by TCP retransmissions) because both flaws just slow the application, without any specific identifying symptoms.

Flaws fall into three broad areas: the applications themselves, the computer system (including the operating system and TCP tuning) and the network path. Each of these areas requires a very different approach to performance debugging. This page is focused on helping users and system administrators optimize the TCP/IP on their computer systems.
The objectives of this page are to summarize all of the end system network tuning issues, provide easy configuration checks for non-experts, and maintain a repository of operating system specific advice and information about getting the best possible network performance on these platforms. In the Tutorial we briefly explain the issues and define some terms. Under High Performance Networking Options we describe each of the optional TCP features that may have to be configured, without addressing the details of any specific operating system. The section "Detailed Procedures" provides step-by-step directions on making the necessary changes for several operating systems.

Note that today most TCP implementations are pretty good. The primary flaws are default configurations that are ideal for Internet back roads: many millions of relatively low speed home users.

Acknowledgments

Jamshid Mahdavi maintained this page for many years, both at PSC and later, remotely from Novell. We are greatly indebted to his vision and persistence in establishing this resource. Thanks Jamshid!

Many, many people have helped us compile this information. We want to thank everyone who sent us updates, additions and corrections. We have decided to include attributions for all future contributors. (Sorry not to be able to give full credit where credit is due for past contributors.)

This material has been maintained as a sideline of many different projects, nearly all of which have been funded by the National Science Foundation. It was started under NSF-9415552, but also supported under Web100 (NSF-0083285) and is currently supported under the NPAD project (ANI-0334061).

Tutorial

The dominant protocol used on the Internet today is TCP, a "reliable" "window-based" protocol. Under ideal conditions, best possible network performance is achieved when the network pipe between the sender and the receiver is kept full of data.
Bandwidth*Delay Products (BDP)

The amount of data that can be in transit in the network, termed "Bandwidth-Delay-Product," or BDP for short, is simply the product of the bottleneck link bandwidth and the Round Trip Time (RTT). BDP is a simple but important concept in a window based protocol such as TCP. Some of the issues discussed below arise because the BDP of today's networks has increased far beyond what it was when the TCP/IP protocols were initially designed. In order to accommodate the large increases in BDP, some high performance extensions have been proposed and implemented in the TCP protocol. But these high performance options are sometimes not enabled by default and will have to be explicitly turned on by the system administrators.

Buffers

In a "reliable" protocol such as TCP, the importance of BDP described above is that this is the amount of buffering that will be required in the end hosts (sender and receiver). The largest buffer that the original TCP (without the high performance options) supports is 64 KBytes. If the BDP is small, either because the link is slow or because the RTT is small (in a LAN, for example), the default configuration is usually adequate. But for paths that have a large BDP, and hence require large buffers, it is necessary to enable the high performance options discussed in the next section.

Computing the BDP

To compute the BDP, we need to know the speed of the slowest link in the path and the Round Trip Time (RTT). The peak bandwidth of a link is typically expressed in Mbit/s (or more recently in Gbit/s). The round-trip delay (RTT) for wide area links is typically between 10 msec and 100 msec, which can be measured with ping or traceroute.

As an example, for two hosts with GigE cards, communicating across a coast-to-coast link over an Abilene connection (assuming a 2.4 Gbps OC-48 link), the bottleneck link will be the GigE card itself.
The actual round trip time (RTT) can be measured using ping, but we will use 70 msec in this example. Knowing the bottleneck link speed and the RTT, the BDP can be calculated as follows:
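A minimal sketch of this arithmetic in Python, using the 1 Gbit/s bottleneck and the 70 msec RTT from the example. The function names bdp_bytes and max_throughput_bits_per_sec are illustrative and not part of any tool mentioned on this page. The second function assumes the usual rule of thumb that a connection can have at most one window of data in flight per round trip.

```python
def bdp_bytes(bandwidth_bits_per_sec: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    bits_in_flight = bandwidth_bits_per_sec * rtt_ms // 1000
    return bits_in_flight // 8


def max_throughput_bits_per_sec(window_bytes: int, rtt_ms: int) -> float:
    """Throughput ceiling when at most one window is in flight per RTT."""
    return window_bytes * 8 * 1000 / rtt_ms


GIGE_BITS_PER_SEC = 1_000_000_000  # bottleneck link: the GigE card
RTT_MS = 70                        # round trip time assumed in the text

# Buffer needed to keep the path full: 8750000 bytes (about 8.75 MBytes).
print(bdp_bytes(GIGE_BITS_PER_SEC, RTT_MS))

# Ceiling with a 64 KByte window on the same path: roughly 7.5 Mbit/s.
print(max_throughput_bits_per_sec(64 * 1024, RTT_MS))
```

Under these assumptions the path needs roughly 8.75 MBytes of buffering, while a 64 KByte window caps the transfer at well under 1% of the GigE rate.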
Based on these calculations, it is easy to see why the typical default buffer size of 64 KBytes would be far too small for this connection. The next section presents a brief overview of the high performance options. Specific details on how to enable these options in various operating systems are provided in a later section.

High Performance Networking Options

The options below are presented in the order that they should be checked and adjusted.
Using Web Based Network Diagnostic Servers

There are some diagnostic servers that are useful for troubleshooting some networking problems. We point out a couple of them here.
TCP Features Supported by Various Operating Systems

Information for other older operating systems has been moved to the historical operating system page.
Detailed procedures for system tuning under various operating systems

See the specific instructions for each system:
Procedure for raising network limits under FreeBSD 2.1.5

You can't modify the maximum socket buffer size in FreeBSD 2.1.0-RELEASE, but in 2.2-CURRENT you can use

sysctl -w kern.maxsockbuf=524288

to make it 512kB (for example). You can also set the TCP and UDP default buffer sizes using the variables

net.inet.tcp.sendspace
net.inet.tcp.recvspace
net.inet.udp.recvspace

MTU discovery is on by default in FreeBSD past 2.1.0-RELEASE. If you wish to disable MTU discovery, the only way that we know is to lock an interface's MTU, which disables MTU discovery on that interface.

Tuning TCP for Linux 2.4 and 2.6

The maximum buffer sizes for all sockets can be set with /proc variables:

/proc/sys/net/core/rmem_max - maximum receive window
/proc/sys/net/core/wmem_max - maximum send window

These determine the maximum acceptable values for SO_SNDBUF and SO_RCVBUF (arguments to the setsockopt() system call). The kernel sets the actual memory limit to twice the requested value (effectively doubling rmem_max and wmem_max) to provide for sufficient memory overhead.

The per-connection memory space defaults are set with two 3-element arrays:

/proc/sys/net/ipv4/tcp_rmem - memory reserved for TCP rcv buffers
/proc/sys/net/ipv4/tcp_wmem - memory reserved for TCP snd buffers

These are arrays of three values (minimum, default and maximum) that are used to bound autotuning and balance memory usage while under global memory stress. The following values would be reasonable for a path with a 4MB BDP (you must be root):
echo 2500000 > /proc/sys/net/core/wmem_max
echo 2500000 > /proc/sys/net/core/rmem_max
echo "4096 5000000 5000000" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 5000000" > /proc/sys/net/ipv4/tcp_wmem
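As a rough sanity check on settings like these, the sketch below parses a three-value string in the tcp_rmem / tcp_wmem format (minimum, default, maximum) and tests whether the maximum is at least as large as a target BDP. The helper names parse_triple and max_covers_bdp are hypothetical, and treating the 4MB BDP as 4 * 1024 * 1024 bytes is an assumption made for illustration.

```python
from typing import Tuple


def parse_triple(text: str) -> Tuple[int, int, int]:
    """Parse 'min default max' as found in tcp_rmem or tcp_wmem."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError("expected three whitespace-separated integers")
    minimum, default, maximum = (int(p) for p in parts)
    return minimum, default, maximum


def max_covers_bdp(text: str, bdp_bytes: int) -> bool:
    """True if the maximum buffer size is at least the path BDP."""
    _, _, maximum = parse_triple(text)
    return maximum >= bdp_bytes


FOUR_MB = 4 * 1024 * 1024  # assumed byte count for the 4MB BDP in the text

# The values written by the echo commands above.
print(parse_triple("4096 5000000 5000000"))
print(max_covers_bdp("4096 65536 5000000", FOUR_MB))
```

The same functions can be fed the current contents of /proc/sys/net/ipv4/tcp_rmem or /proc/sys/net/ipv4/tcp_wmem on a Linux host to compare the live settings against a measured BDP.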
All Linux 2.4 and 2.6 versions include sender side autotuning, so the actual sending socket buffer (wmem value) will be dynamically updated for each connection. You can check to see if receiver side autotuning is present and enabled by looking at the file:

/proc/sys/net/ipv4/tcp_moderate_rcvbuf

If it is present and enabled (value 1) the receiver socket buffer size (rmem value) will be dynamically updated for each connection. If it is not present you may want to get a newer kernel. Generally autotuning should not be disabled unless there is a specific need, e.g. comparison studies of TCP performance.

If you do not have autotuning (Linux 2.4 before 2.4.27 or 2.6 before 2.6.7) you may want to set the default tcp_rmem value (the middle value) to a more accurate estimate of the actual path BDP, to minimize possible interactions with other applications.

Do not adjust tcp_mem unless you know exactly what you are doing. This array determines how the system balances the total network memory usage against other memory usage, such as disk buffers. It is initialized at boot time to appropriate fractions of total system memory.

You do not need to adjust rmem_default or wmem_default (at least not for TCP tuning). These are the default buffer sizes for non-TCP sockets (e.g. unix domain sockets, UDP, etc).

All standard advanced TCP features are on by default. You can check them by cat'ing the following /proc files:

/proc/sys/net/ipv4/tcp_timestamps
/proc/sys/net/ipv4/tcp_window_scaling
/proc/sys/net/ipv4/tcp_sack

Linux supports both /proc and sysctl (using alternate forms of the variable names, e.g. net.core.rmem_max) for inspecting and adjusting network tuning parameters. The following is a useful shortcut for inspecting all tcp parameters:

sysctl -a | fgrep tcp

For additional information on kernel variables, look at the documentation
included with your kernel source, typically in some location such as
/usr/src/linux-

If you would like these changes to be preserved across reboots, you can add the tuning commands to your /etc/rc.d/rc.local file.

Autotuning was prototyped under the Web100 project. Web100 also provides complete TCP instrumentation and some additional features to improve performance on paths with very large BDP.

Contributors: John Heffner
Checked for Linux 2.6.13, 9/19/2005

Tuning TCP for Mac OS X

Mac OS X has a single sysctl parameter, kern.ipc.maxsockbuf, to set the maximum combined buffer size for both sides of a TCP (or other) socket. In general, it can be set to at least twice the BDP. E.g.:

sysctl -w kern.ipc.maxsockbuf=8000000

The default send and receive buffer sizes can be set using the following sysctl variables:

sysctl -w net.inet.tcp.sendspace=4000000
sysctl -w net.inet.tcp.recvspace=4000000

If you would like these changes to be preserved across reboots you can edit /etc/sysctl.conf.

RFC1323 features are supported and on by default. SACK is not available at this time.

Although we have never tested it, there is a commercial product to tune TCP on Macintoshes. The URL is http://www.sustworks.com/products/prod_ottuner.html. I don't endorse the product they are selling (since I've never tried it). However, it is available for a free trial, and they appear to do an excellent job of describing perf-tune issues for Macs.

Tested for 10.3, MBM 5/15/05

Procedure for raising network limits under NetBSD

RFC1323 is on by default in NetBSD 1.1 and above. Under NetBSD 1.2, it can be verified to be on by typing:

sysctl net.inet.tcp.rfc1323

The maximum socket buffer size can be modified by changing SB_MAX in /usr/src/sys/sys/socketvar.h.

The default socket buffer sizes can be modified by changing TCP_SENDSPACE and TCP_RECVSPACE in /usr/src/sys/netinet/tcp_usrreq.c.

It may also be necessary to increase the number of mbufs, NMBCLUSTERS in /usr/src/sys/arch/*/include/param.h.
Update: It is also possible to set these parameters in the kernel configuration file.

options SB_MAX=1048576 # maximum socket buffer size
options TCP_SENDSPACE=65536 # default send socket buffer size
options TCP_RECVSPACE=65536 # default recv socket buffer size
options NMBCLUSTERS=1024 # maximum number of mbuf clusters

Procedure for raising network limits under Microsoft Windows 2000 and Windows XP

New: The following URL: http://rdweb.cns.vt.edu/public/notes/win2k-tcpip.htm appears to be a pretty good summary of the procedure for TCP tuning under Windows 2000. It also has the URL for the Windows 2000 TCP tuning document from Microsoft.

We are not sure if it is still necessary to set DefaultReceiveWindow even after setting the parameters indicated in the URL above. If your machine does a lot of large outbound transfers, it will be necessary to set DefaultSendWindow in addition to the suggestions mentioned above.

Procedure for raising network limits under Microsoft Windows 98

New: Some folks at NLANR/MOAT in SDSC have written a tool to guide you through some of this stuff. It can be found at http://moat.nlanr.net/Software/TCPtune/.

Even newer: I've updated some sending window information that was inaccurate. See below.

Several folks have recently helped me to figure out how to accomplish the necessary tuning under Windows98, and the features do appear to exist and work. Thanks to everyone for the assistance! The new description below should be useful to even the complete Windows novice (such as me :-).

Windows98 includes implementation of RFC1323 and RFC2018. Both are on by default. (However, with a default buffer size of only about 8kB, window scaling doesn't do much.)

Windows stores the tuning parameters in the Windows Registry. In the registry are settings to toggle on/off Large Windows, Timestamps, and SACK. In addition, default socket buffer sizes can be specified in the registry. In order to modify registry variables, do the following steps:
TCP/IP Stack Variables

Support for TCP Large Windows (TCPLW)

Win98 TCP/IP supports TCP large windows as documented in RFC 1323. TCP large windows can be used for networks that have large bandwidth delay products such as high-speed trans-continental connections or satellite links. Large windows support is controlled by a registry key value in:

HKLM\system\currentcontrolset\services\VXD\MSTCP

The registry key Tcp1323Opts is a string value type. The values for Tcp1323Opts are
The default value for Tcp1323Opts is 3: Window Scaling and Time stamp options. Large window support is enabled if an application requests a Winsock socket to use buffer sizes greater than 64K. The current default value for TCP receive window size in Memphis TCP is 8196 bytes. In previous implementations the TCP window size was limited to 64K; this limit is raised to 2**30 through the use of TCP large window support.

Support for Selective Acknowledgements (SACK)

Win98 TCP supports Selective Acknowledgements as documented in RFC 2018. Selective acknowledgements allow TCP to recover from IP packet loss without resending packets that were already received by the receiver. Selective Acknowledgements are most useful when employed with TCP large windows. SACK support is controlled by a registry key value in:

HKLM\system\currentcontrolset\services\VXD\MSTCP

The registry key SackOpts is a string value type. The values for SackOpts are
Support for Fast Retransmission and Fast Recovery

Win98 TCP/IP supports Fast Retransmission and Fast Recovery of TCP connections that are encountering IP packet loss in the network. These mechanisms allow a TCP sender to quickly infer a single packet loss by reception of duplicate acknowledgements for a previously sent and acknowledged TCP/IP packet. This mechanism is useful when the network is intermittently congested. The reception of 3 (default value) successive duplicate acknowledgements indicates to the TCP sender that it can resend the last unacknowledged TCP/IP packet (fast retransmit) and not go into TCP slow start due to a single packet loss (fast recovery). Fast Retransmission and Recovery support is controlled by a registry key value in:

HKLM\system\currentcontrolset\services\VXD\MSTCP\Parameters

The registry key MaxDupAcks is a DWORD taking integer values from 2 to N. If MaxDupAcks is not defined, the default value is 3.

Update: If you wish to set the default receiver window for applications, you should set the following key:

DefaultRcvWindow
HKLM\system\currentcontrolset\services\VXD\MSTCP

DefaultRcvWindow is a string type and the value describes the default receive window size for the TCP stack. Otherwise the window size has to be programmed in apps with setsockopt.

For a long time, I had the following sentence on this page:

Matt Mathis (with help from many others, especially Jamshid Mahdavi)