Multi-side traces / auto-correlation

IT performance analysis is always a comparison task, usually between at least two states: a "good" state that is estimated, precalculated, guessed, experienced before or elsewhere, or simply expected, and the experienced "bad" state.

Such comparisons can cover a multitude of scenarios, for example:

  • yesterday (good) vs. today (bad)
  • my guess or understanding of "should be" vs. the experienced bad state
  • location A vs. location B
  • application A vs. application B
  • user A vs. user B

One of the most reliable ways to understand performance is the analysis of packet data.

It is like a blood test, compared to asking the patient or measuring the temperature.

Traces cannot lie, because they provide measured, uninterpreted and unchanged data.

In the following, an approach is described that uses pcap data from two sides for easy and fast troubleshooting of performance issues of IT services.

In every IT environment, at least three areas are responsible for performance:

  • The client side
  • The network
  • The application / server side (which can of course be very complex)

For a client, the time between request and reply determines the experienced performance.
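
As a rough illustration of how this request-to-reply time can be read from a client-side capture, the sketch below pulls the HTTP response-time field out of a pcap with tshark; the file name client.pcap and the assumption of an HTTP service are placeholders:

    import statistics
    import subprocess

    # Pull the time between each HTTP request and its reply from a
    # client-side capture. "client.pcap" is a placeholder; for non-HTTP
    # services another field (e.g. tcp.analysis.ack_rtt) could be used.
    out = subprocess.run(
        ["tshark", "-r", "client.pcap",
         "-Y", "http.time",                    # frames that carry a response time
         "-T", "fields", "-e", "http.time"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    times = [float(t) for t in out]
    print(f"{len(times)} replies, median {statistics.median(times):.3f} s, "
          f"max {max(times):.3f} s")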

Performance issues

A client establishes a session to a server, and all data of this session should be in sync; for example, the number of packets sent by the client should equal the number of packets received on the server side.
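
A minimal sketch of such a sync check, assuming one capture file per side (client.pcap, server.pcap) and an illustrative session filter:

    import subprocess

    def count(pcap: str, display_filter: str) -> int:
        """Number of packets in a capture matching a Wireshark display filter."""
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", display_filter,
             "-T", "fields", "-e", "frame.number"],
            capture_output=True, text=True, check=True,
        ).stdout
        return len(out.split())

    # Illustrative session filter: packets the client sends towards the service.
    flt = "ip.src == 10.0.0.10 && tcp.dstport == 443"
    sent = count("client.pcap", flt)   # seen leaving the client
    got  = count("server.pcap", flt)   # seen arriving at the server
    print(f"client sent {sent}, server received {got}, missing {sent - got}")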

  • Under optimal conditions, network time would not change and performance would be predictable.
  • Under normal conditions, network performance does change, and service performance changes with it.

Latency can rise through delays on components or through rerouting (for example from Berlin to Munich via NYC). Packets are not received and acknowledged, so retransmissions triggered by sequencing or timeout may occur.

And the service response time depends on server resources, application design, architecture, backend systems and a wide variety of other components.

If performance issues occur

Which side is causing the performance issues experienced by the clients?

  • If the client can rule out a local error, the cause should be the network or the server.
  • If the server receives a request and processes it instantly and fast, the cause is at the client or in the network.
  • If the network can guarantee fast and error-free end-to-end throughput, the cause must be the client or the server.

It often happens that all sides declare their innocence based on the information available to them, but the problem persists!

To resolve such issues, people need to understand quickly and precisely which side is responsible. This is not a blame game if it is done with the right methods.

Method of analysis using packet capture

One of the most precise methods for such a situation is packet capture & analytics.


 

Typical questions to clarify

  • A client sends a request: was it received at the server?
  • Was it received correctly?
  • What is the delay between send and receive?
  • A client retransmits a request: why was it retransmitted?
  • Was the first request not received by the server, or did the server reply and the reply never reached the client?
  • Were both requests received at the server?
  • If packet loss occurs, where does it happen?
  • Are there any differences between send order and receive order?
  • Was the client request received out of order?
  • Is the TTL constant at the receiver side?

These are essential questions, and their answers can point to the cause of the delay.

But with single-trace-file tools this means a huge amount of manual work, comparing packet by packet in Wireshark.
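
To make this concrete, the sketch below matches the client's packets against the server-side trace by IP ID and absolute TCP sequence number and reports, per packet, whether it arrived and how long it took. It only illustrates the manual approach (it is not SharkMon's implementation) and assumes the capture clocks on both sides are reasonably synchronized:

    import subprocess

    def index(pcap: str, display_filter: str) -> dict:
        """Map (ip.id, absolute tcp.seq) -> capture timestamp."""
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", display_filter,
             "-o", "tcp.relative_sequence_numbers:FALSE",   # absolute sequence numbers
             "-T", "fields",
             "-e", "ip.id", "-e", "tcp.seq", "-e", "frame.time_epoch"],
            capture_output=True, text=True, check=True,
        ).stdout
        packets = {}
        for line in out.splitlines():
            ip_id, seq, ts = line.split("\t")
            packets[(ip_id, seq)] = float(ts)
        return packets

    # Illustrative filter for one direction of one session.
    flt = "ip.src == 10.0.0.10 && tcp.dstport == 443"
    client = index("client.pcap", flt)
    server = index("server.pcap", flt)

    for key, t_sent in client.items():
        if key in server:
            print(f"{key}: one-way delay {server[key] - t_sent:.6f} s")
        else:
            print(f"{key}: not seen at the server capture point (lost or rerouted)")

Missing keys indicate loss before the server capture point, repeated keys on the client side indicate retransmissions, and comparing the order of timestamps on both sides reveals reordering.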


Automation by multi-side traces / auto-correlation (MTAC)

… a feature developed for SharkMon, our solution for analyzing large numbers of pcap files with Wireshark metrics to provide ongoing monitoring.

SharkMon enables users to define packet analysis scenarios for thousands of trace files, providing constant monitoring, and to correlate and compare both sides based on freely definable metrics of packet content, using the same variety and syntax as Wireshark.
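
As a rough sketch of what such a freely definable, filter-based metric could look like outside the product (the names, files and structure below are illustrative and not SharkMon's configuration format):

    import subprocess

    # Each metric is just a name, a Wireshark display filter and the field
    # whose values are collected - the same syntax Wireshark itself uses.
    METRICS = [
        ("retransmissions",    "tcp.analysis.retransmission", "frame.number"),
        ("http_response_time", "http.time",                   "http.time"),
        ("ttl",                "ip",                          "ip.ttl"),
    ]

    def evaluate(pcap: str, display_filter: str, field: str) -> list[str]:
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", display_filter, "-T", "fields", "-e", field],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.split()

    for name, flt, field in METRICS:
        print(name, len(evaluate("client.pcap", flt, field)), "values")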

 

If traces / pcap files of the sender (client PC) and the receiver (server-side tcpdump, datacenter capture probe) exist for the same time frame, they can be synchronized, and issues of data integrity, packet loss, timing, route changes, application performance etc. can be identified easily.

 

Process

First check: measure service time to identify issues.

This should be done on the client side to make the difference clear: how fast is the reply received at the client, and how fast at the providing site / server at the same time.
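
A simple way to obtain this two-sided view, sketched under the same assumptions as above (an HTTP service and one capture file per side), is to compute the service time per trace and compare the two; the difference is the network share:

    import statistics
    import subprocess

    def response_times(pcap: str) -> list[float]:
        """HTTP request-to-reply times seen in one capture file."""
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", "http.time", "-T", "fields", "-e", "http.time"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [float(t) for t in out.split()]

    client = statistics.median(response_times("client.pcap"))   # as experienced by the user
    server = statistics.median(response_times("server.pcap"))   # pure server processing time
    print(f"client {client:.3f} s, server {server:.3f} s, network share {client - server:.3f} s")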

 

The picture below shows the results from the two capture locations:

  • Client side - the green line
  • Server-side - the red line

The gap between the two lines is the network time.

 

You can identify a spike on both lines at 1:10 pm, which makes clear that this is a server / backend issue: the response time measured directly on the server via tcpdump shows the same outage, so this is local server performance.

But people still want to understand the variance of network effects here.

The following picture shows the variation of network performance on the two sides:

  • TTL - the number of routing hops end-to-end; a variation would point to rerouting and changed latency
  • iRTT - the initial round-trip time, the latency derived from the TCP three-way handshake
  • RTO - the retransmission timeout, showing retransmission effects
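
These three values map to standard Wireshark fields and can be pulled per side from the traces; a minimal sketch, with placeholder file names:

    import subprocess

    # Extract TTL, initial RTT and RTO values from one side's capture.
    FIELDS = ["ip.ttl", "tcp.analysis.initial_rtt", "tcp.analysis.rto"]

    def network_metrics(pcap: str) -> list[list[str]]:
        cmd = ["tshark", "-r", pcap, "-Y", "tcp", "-T", "fields"]
        for f in FIELDS:
            cmd += ["-e", f]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return [line.split("\t") for line in out.splitlines()]

    for side in ("client.pcap", "server.pcap"):   # placeholder file names
        rows  = network_metrics(side)
        ttls  = {r[0] for r in rows if r and r[0]}
        irtts = [r[1] for r in rows if len(r) > 1 and r[1]]
        rtos  = [r[2] for r in rows if len(r) > 2 and r[2]]
        print(f"{side}: TTLs seen {ttls}, iRTT samples {len(irtts)}, RTO events {len(rtos)}")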

The results here are clear:

  • TTL did not change: the route between both sides is constant.
  • iRTT is constant on both sides: not much variation in latency.
  • RTO is constant at the client side and higher than at the server side: the experienced networking issue is not on the server side.

Summary

Using two-side trace correlation, client / service effects are identified fast and precisely.

Finger-pointing is avoided and downtime is reduced.

 

It can show clearly whether the incident was caused in the transfer network or in the server-side infrastructure.

In case of network issues, it can clearly identify the really important parameters.

 

SharkMon by interview network solutions

SharkMon can collect packet data from multiple locations and entities:

  • directly on the service using tcpdump / tshark (see the sketch after this list)
  • large capture probes via API
  • manual upload of trace files
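
For the first option, a rotating capture directly on the service host could look like the sketch below; interface, port filter, rotation interval and output path are placeholders, and the process runs until it is stopped:

    import subprocess

    # Rotating capture on the service host. Each file covers 5 minutes and can
    # later be uploaded or fetched for correlation.
    subprocess.run([
        "tcpdump",
        "-i", "eth0",                                   # capture interface (placeholder)
        "-s", "0",                                      # do not truncate packets
        "-G", "300",                                    # start a new file every 300 s
        "-w", "/var/tmp/service_%Y%m%d_%H%M%S.pcap",    # strftime pattern per file
        "port", "443",                                  # limit to the service traffic (placeholder)
    ], check=True)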

 

Users can import thousands of trace files to provide a constant monitoring history.

Trace files are organized in scenarios which can be easily correlated, allowing analysis scenarios like:

  • client vs. server
  • location A vs. location B
  • user A vs. user B
  • application A vs. application B
  • leaving country A vs. entering country B (geo-political scenario)

 

It can use any metric available in Wireshark for monitoring, allowing the deepest monitoring capability in the industry.

 

This allows usage in networking environments such as:

  • WAN / network
  • Datacenter
  • Cloud - IaaS / PaaS
  • Wireless internet access
  • VPN
  • Industry / industrial Ethernet
  • User endpoints