Troubleshooting 10 Gbps Networks

By Dan Joe Barry

The impending arrival of 40 Gbps Ethernet and the promise of 100 Gbps Ethernet in the near future have commanded headline coverage over the last year, but 2009 is more likely to be remembered as the year when 10 Gbps network deployments finally took off. One of the driving forces behind this development is the increased use of IP networks for a broad range of services, from Internet access to IPTV to hosted services, all with very different requirements and characteristics.

For this reason, management of 10 Gbps networks will face new challenges as the network becomes more critical and availability is no longer a goal, but a guarantee!

From Availability to Reliability
Troubleshooting IP networks has traditionally focused on ensuring availability and has relied on simple but well-known tools. In the past, this was sufficient because of the original design intent for IP: packets will reach their destination even if there is a problem in the network. This is perfect for "bulk transfer", where information is exchanged in bursts of packets in a coordinated manner. If packets do not reach the destination or are delayed, they are simply resent.

But with IP networks now being used as a converged platform for almost all communication services, the requirements are quite different. "Streaming" of real-time data for voice or video is sensitive to delays and lost packets. Re-sending will not help, especially for real-time VoIP telephone conversations or when watching high-definition IPTV. The IP network can still guarantee that the stream will reach its destination through Quality of Service preferential treatment and very fast switchover to backup routes in the event of an error, but how good is the backup path? Does it have the bandwidth or performance required to handle the stream?

The key challenge for high-speed networks is to adopt a strategy that moves beyond ensuring availability to assuring reliability and high performance for demanding applications. This will be crucial in supporting the general trend towards converged managed services, triple-play residential services and cloud computing.

Availability Is No Longer Enough
Ensuring availability in IP networks can be a time-consuming and painful task once errors occur. When referring to IP, we are not just referring to layer 3 of the Internet protocol suite, but to the full set of protocols supporting IP at every layer, including Ethernet at layer 2 and UDP and TCP at layer 4. We have all become accustomed to referring to this collection of protocols as "IP", but from a troubleshooting perspective one needs to be aware of the various layers and protocols involved and how they interact with each other.

For example, while one might have isolated a problem to IP at layer 3, the fix can include clearing the layer 2 ARP cache to purge stale IP-address-to-Ethernet-MAC-address bindings before traffic can start flowing again.
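
As an illustration of how such a fix might be applied, the minimal sketch below flushes the ARP cache by calling the operating system's own tools from a short Python script. The command names (ip neigh flush on Linux, arp -d on Windows) and the use of Python are illustrative assumptions only; the article does not prescribe any particular tooling.

    import platform
    import subprocess

    def flush_arp_cache():
        """Purge stale IP-to-MAC bindings so repaired routes can be used again."""
        if platform.system() == "Windows":
            # Deletes all dynamic ARP entries (typically requires an elevated prompt).
            subprocess.run(["arp", "-d", "*"], check=True)
        else:
            # Flushes the neighbour (ARP) table on Linux (typically requires root).
            subprocess.run(["ip", "neigh", "flush", "all"], check=True)

    if __name__ == "__main__":
        flush_arp_cache()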

Troubleshooting techniques have followed a well-trodden path:

  1. Detect the error - usually through complaints of lost traffic or connectivity
  2. Isolate the error - using ipconfig, ping and traceroute (or similar tools)
  3. Analyze and develop hypotheses on the cause of the error
  4. Test the various hypotheses until connectivity is restored

The first aspect of this approach that should be noted is that it is reactive - an error has occurred and steps are being taken to resolve it. Notice also that this is a case where connectivity has been lost; in other words, all possible paths to the destination are unavailable to the end-user, despite the fact that IP networks are designed to provide multiple paths. In most cases this is simply a misconfiguration of the user's client PC, but the case also reveals that problems with individual routes in the network will probably go undetected, because alternative routes can be used. The end-user will not notice a problem until all alternative routes are unavailable, which is unlikely to happen.

Another aspect of this approach is that it relies on simple (albeit effective) tools for retrieving information. ipconfig (and its Unix counterpart, ifconfig) provides useful information on Ethernet/IP interfaces that can tell the user if there is an issue with the availability of interfaces and dynamic addressing. Ping can be used to test connectivity from both the client and the server, while traceroute (tracert on Windows and tracepath on some Linux versions) can help determine the available routes and paths to the destination.
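
A minimal sketch of how step 2 (isolating the error) might be scripted is shown below, assuming a Python environment; the target host name is purely illustrative, and the ping and traceroute invocations simply wrap the classic tools just described.

    import platform
    import socket
    import subprocess

    target = "www.example.com"   # hypothetical destination being investigated

    # Check name resolution and basic TCP reachability without relying on ICMP.
    try:
        address = socket.gethostbyname(target)
        socket.create_connection((address, 80), timeout=3).close()
        print(f"{target} ({address}) is reachable on TCP port 80")
    except OSError as error:
        print(f"Connectivity problem towards {target}: {error}")

    # Fall back to the classic tools for loss and path information.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    subprocess.run(["ping", count_flag, "4", target])
    trace_tool = "tracert" if platform.system() == "Windows" else "traceroute"
    subprocess.run([trace_tool, target])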

However, there is one characteristic of these tools that should be borne in mind: they are intrusive. They inject traffic into the same connections that are carrying services. Usually the bandwidth consumed is insignificant, but it is something to bear in mind.

There can be many reasons why there is no connectivity or availability ranging from simple misconfiguration of the client PC, cabling and interface card issues to complex network, router, switch and firewall configuration issues. All of these potential sources of error need to be analyzed and tested.

This leads to the final and clear observation about this approach: it is extremely time- and resource-consuming.

This is why ensuring availability is no longer enough: the approach is reactive, rather simplistic and consumes time and resources. It also requires that excess capacity be available in the network so that services can still be maintained and delivered over alternative routes while problems are being solved. With more services moving to IP at higher bandwidths and line rates, is this still a viable strategy?

What about Network Management?
Network management systems have been used for decades to provide a central view of the network for Fault, Configuration, Accounting, Performance and Security (FCAPS) management. While telecom networks and protocols such as ATM and SONET/SDH have included extensive network management overhead information, IP networks have been designed to be lightweight and free of this administrative burden. While this has made operation of IP networks simpler, it has equally made central management of IP networks more difficult.

Management systems exist that allow central configuration, fault and, to a certain extent, performance management of IP networks, but these rely on Simple Network Management Protocol (SNMP) interfaces to network equipment, allowing simple get and set commands against the equipment's Management Information Base (MIB), which (hopefully) reflects the true status of the equipment. SNMP traps can be used to inform the management system of events, such as faults detected by the network equipment. Statistics can be gathered and exported, and thresholds set for performance.
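
As a concrete illustration, the sketch below performs a single SNMP get against a device's MIB. It assumes the third-party pysnmp library and an SNMPv2c read community of "public"; the device address is a placeholder, since the article does not name any specific toolkit or equipment.

    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    # Read the standard sysDescr object (system description) from the device's MIB.
    error_indication, error_status, error_index, var_binds = next(
        getCmd(SnmpEngine(),
               CommunityData("public"),                  # read-only community (example)
               UdpTransportTarget(("192.0.2.1", 161)),   # placeholder device address
               ContextData(),
               ObjectType(ObjectIdentity("SNMPv2-MIB", "sysDescr", 0)))
    )

    if error_indication or error_status:
        print("SNMP query failed:", error_indication or error_status.prettyPrint())
    else:
        for oid, value in var_binds:
            print(oid.prettyPrint(), "=", value.prettyPrint())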

This is fine for monitoring the network and for alerting on potential problems, which helps with detecting the error. However, unless fault-correlation intelligence is employed, a single problem in the network can result in a blizzard of fault alarms, which makes it difficult to determine the root cause of the problem.

Once again, this is a reactive model. Performance thresholds can possibly alert us to impending issues, but require a great deal of insight into the network to determine the correct threshold levels.

What about Built-in Router Capabilities?
Routers often provide monitoring and debugging tools. For example, Cisco provides a packet debugging facility in some of its routers via the "debug ip packet" command. While this facility provides packet capture with time-stamping, which can be extremely useful in troubleshooting routing problems, it can be dangerous to use. Packet debugging is compute-intensive and can potentially lock up the router, requiring a reboot. It is therefore best treated as a last resort.

Cisco also provides NetFlow information, which is detailed information on "flows" of packets entering the router. The equivalent facility from Juniper is called J-Flow. There is also sFlow, which provides measurements based on samples of incoming traffic.

These facilities are more commonly used to provide real-time information on packet flows, which is not only useful in troubleshooting a routed network but can also provide early warning of impending issues.

However, as in the packet debugging case, collecting this information is computationally intensive and detracts from the router's main task of forwarding packets.

A more common approach today is to use independent NetFlow probes that tap and monitor the links entering the router and provide NetFlow data to collectors, which can forward this information to a central manager. This can provide a full overview of what is occurring in the network.
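
To make the probe-to-collector idea concrete, the sketch below is a toy collector that receives NetFlow export packets over UDP and decodes a handful of fields. It assumes NetFlow version 5 and the conventional (but not mandatory) UDP port 2055; a production collector would also handle v9/IPFIX templates, sampling and sequence checking.

    import socket
    import struct

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 2055))               # conventional NetFlow export port

    while True:
        datagram, exporter = sock.recvfrom(65535)
        version, count = struct.unpack("!HH", datagram[:4])
        if version != 5:
            continue                           # this toy only understands NetFlow v5
        for i in range(count):                 # 24-byte header, then 48-byte flow records
            record = datagram[24 + i * 48 : 24 + (i + 1) * 48]
            src, dst = struct.unpack("!4s4s", record[:8])
            packets, octets = struct.unpack("!II", record[16:24])
            src_port, dst_port = struct.unpack("!HH", record[32:36])
            print(f"{socket.inet_ntoa(src)}:{src_port} -> "
                  f"{socket.inet_ntoa(dst)}:{dst_port} "
                  f"{packets} pkts / {octets} bytes (from {exporter[0]})")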

Once one has taken the mental step of accepting the need for independent probes, a world of new possibilities opens. With dedicated network probes and other network appliances, it is possible to establish an effective network performance monitoring solution based on packet flows that provides information not only on the traffic traversing the network, but also on the applications used to send that traffic.

This brings us beyond troubleshooting availability and connectivity to troubleshooting reliability and application performance.

From Network to Application Performance Monitoring
Each Ethernet frame sent in an IP network carries a wealth of information in the encapsulated IP, TCP or UDP headers. This reveals who sent the data and for whom it is intended, as well as the application used to send it.
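
As a small illustration of the header fields involved, the sketch below decodes one raw Ethernet frame carrying TCP over IPv4 and extracts the sender, the receiver and the port numbers that usually identify the application. It is an illustrative sketch only: it ignores VLAN tags, IPv6 and every protocol other than TCP.

    import socket
    import struct

    def describe_frame(frame):
        """Return 'src_ip:port -> dst_ip:port' for an IPv4/TCP Ethernet frame."""
        dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
        if ethertype != 0x0800:                 # not IPv4: out of scope for this sketch
            return None
        ip_header = frame[14:34]
        header_len = (ip_header[0] & 0x0F) * 4  # IP header length in bytes
        protocol = ip_header[9]
        src_ip = socket.inet_ntoa(ip_header[12:16])
        dst_ip = socket.inet_ntoa(ip_header[16:20])
        if protocol != 6:                       # only TCP ports are decoded here
            return f"{src_ip} -> {dst_ip} (IP protocol {protocol})"
        tcp_start = 14 + header_len
        src_port, dst_port = struct.unpack("!HH", frame[tcp_start:tcp_start + 4])
        return f"{src_ip}:{src_port} -> {dst_ip}:{dst_port} (TCP)"

In practice the port number is only a first approximation of the application; monitoring solutions typically combine it with deeper inspection of the payload.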

Network and application performance monitoring solutions use this information to provide graphical representations of network usage. They can show which applications and users are using bandwidth and whether utilization on certain links is reaching a critical level. Application response time can also be measured.

With these tools in hand, it is possible to proactively plan network changes to balance network loads, provide higher quality-of-service levels to critical application data flows, and take action to reduce the impact of recreational applications (such as peer-to-peer traffic), thereby improving application response time.

Now it is possible to converse with users in a language they understand: in terms of applications rather than packets. For users, packet loss and link utilization are virtually meaningless. The issues they face are a database query that is too slow or a website that is not responding. Network and application performance monitoring tools provide the means to address these complaints directly.

In other words, we are now capable of troubleshooting application performance rather than just availability.

Troubleshooting Application Performance
While network and application performance monitoring solutions provide the insight into the IP network in terms of applications and can help to pre-empt network problems, these problems can still occur.

Network forensic tools help us to analyze real-time data to determine if there are anomalies in the network. Network probes provide the captured packet data, which is then analyzed centrally. Many network and application performance monitoring solutions include this capability.

An increasingly popular approach is to store all captured packet data to disk, so that it can be analyzed after the event to pinpoint the cause of a past problem. This allows network forensic tools to be applied when and where they are needed.
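
A heavily simplified sketch of the capture-to-disk idea is shown below, assuming Linux raw sockets and Python; the interface name is an assumption. It only illustrates a record format (timestamp, length, raw frame); the article's point is precisely that doing this at 10 Gbps without loss requires dedicated capture hardware rather than a script like this.

    import socket
    import struct
    import time

    ETH_P_ALL = 0x0003                          # capture every protocol on the wire

    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sock.bind(("eth0", 0))                      # interface name is an assumption

    with open("capture.bin", "wb") as out:
        for _ in range(1000):                   # capture a small burst of frames
            frame = sock.recv(65535)
            timestamp_ns = time.time_ns()       # per-packet timestamp for later forensics
            # Record layout: 8-byte timestamp, 4-byte length, then the raw frame bytes.
            out.write(struct.pack("!QI", timestamp_ns, len(frame)))
            out.write(frame)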

To enable such a solution, there are a number of requirements that need to be met:

  • All packet data needs to be captured in real-time with zero packet loss
  • Each packet needs to be time-stamped to be useful for later forensic analysis
  • A full-throughput capture-to-disk solution is required that can store all packet data in real-time
  • It must be possible to replay all packet data as it was recorded with the same inter-frame gaps in order to fully re-create the conditions at the time of the network event.

At 10 Gbps this is challenging, but solutions exist that address it. For example, at 10 Gbps, up to 15 million packets per second per port need to be captured, analyzed and stored to disk. That is a packet every 67 ns. Needless to say, this requires high-performance network probes or appliances.
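
The quoted figures follow from simple arithmetic on minimum-size Ethernet frames, the worst case for packet rate. The sketch below reproduces the calculation; the frame size and per-frame overhead are standard Ethernet values, not figures from the article.

    # Back-of-the-envelope check of the packet rate quoted above.
    LINE_RATE_BPS = 10e9                        # 10 Gbps
    MIN_FRAME_BYTES = 64                        # minimum Ethernet frame size
    OVERHEAD_BYTES = 8 + 12                     # preamble/SFD plus inter-frame gap

    bits_per_frame = (MIN_FRAME_BYTES + OVERHEAD_BYTES) * 8      # 672 bits
    packets_per_second = LINE_RATE_BPS / bits_per_frame          # ~14.9 million
    ns_per_packet = 1e9 / packets_per_second                     # ~67 ns

    print(f"{packets_per_second / 1e6:.2f} Mpps, one packet every {ns_per_packet:.0f} ns")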

There are large, proprietary hardware solutions available that meet these requirements, but these can be expensive. It is also possible to find more affordable packet capture and replay solutions that address the needs above based on standard PC server hardware using intelligent real-time network adapters designed for high-speed packet capture and replay.

With these tools in hand, a network administrator's capabilities are significantly extended, allowing not just troubleshooting of connectivity but also troubleshooting of service or application performance degradation.

Looking to the Future
The sheer bandwidth and number of packets at 10 Gbps in themselves necessitate more advanced troubleshooting capabilities. But the added dimension of more sensitive, real-time service requirements, as well as the move to cloud computing, underlines the need for a new way of thinking about network troubleshooting and management. Ensuring availability is no longer enough - guaranteeing service and application performance should now be the focus of all network administrators. The good news is that the tools needed to do this are available and improving every day, which should allow us all to look to the future with more confidence in our network infrastructure.

About the Author
Dan Joe Barry works for Napatech. Napatech develops and markets the world's most advanced programmable network adapters for network traffic analysis and application off-loading. Napatech is the leading OEM supplier of Ethernet network acceleration adapter hardware with an installed base of more than 40,000 ports.


© Copyright 2010 Auerbach Publications