Multi-side traces / auto-correlation

IT performance analysis is always a comparison task, usually between at least two states: a "good" state that is estimated, precalculated, guessed, experienced before or elsewhere, or simply expected, and the experienced bad state.

Comparisons can cover a multitude of scenarios, such as:

  • yesterday good vs. today bad
  • my guess or understanding of "should be" vs. the experienced bad state
  • location A vs. location B
  • application A vs. application B
  • user A vs. user B

One of the most reliable ways to understand performance is the analysis of packet data.

It is like a blood test, compared to asking the patient or measuring their temperature.

Traces cannot lie: they provide measured, uninterpreted, unchanged data.

Here, an approach is described that uses pcap data from two sides for easy and fast troubleshooting of performance issues of IT services.

In every IT environment, at least three areas are responsible for performance:

  • The client side
  • The network
  • The application/server side (which can, of course, be very complex)

For a client, the time between request and reply determines the perceived external performance.
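
As a concrete illustration, the client-observed response time can be read directly from a client-side trace. The following is a minimal sketch using scapy; the file name, the server address, and the assumption of a single request/reply exchange are placeholders to adapt:

    from scapy.all import rdpcap, IP, TCP

    SERVER = "10.0.0.5"                        # assumption: the service address

    t_request = t_reply = None
    for p in rdpcap("client.pcap"):            # assumption: client-side capture
        if not (p.haslayer(IP) and p.haslayer(TCP)):
            continue
        has_payload = len(p[TCP].payload) > 0
        if t_request is None and p[IP].dst == SERVER and has_payload:
            t_request = float(p.time)          # first request byte leaves the client
        elif t_request is not None and p[IP].src == SERVER and has_payload:
            t_reply = float(p.time)            # first reply byte arrives back
            break

    if t_request and t_reply:
        print(f"client-observed response time: {(t_reply - t_request) * 1000:.1f} ms")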

Performance issues

A client establishes a session to a server, and all data of this session should be in sync; for example, the number of packets sent by the client should equal the number of packets received on the server side.

  • Under optimal conditions, network time would not change and performance would be predictable.
  • Under "normal" conditions, network performance does change, and service performance changes with it.

Latency can rise through delays on components or through rerouting (say, from Berlin to Munich via NYC); packets are not received and acknowledged, so retransmissions triggered by sequencing or timeout may occur.
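
Such retransmissions leave a visible fingerprint in a trace. Below is a minimal sketch of a simple heuristic with scapy that flags repeated, payload-carrying sequence numbers; the file name is an assumption, and Wireshark's own tcp.analysis.retransmission logic is considerably more thorough:

    from scapy.all import rdpcap, IP, TCP

    seen = set()
    for p in rdpcap("client.pcap"):            # assumption: any single-side capture
        if not (p.haslayer(IP) and p.haslayer(TCP)) or len(p[TCP].payload) == 0:
            continue
        key = (p[IP].src, p[TCP].sport, p[IP].dst, p[TCP].dport, p[TCP].seq)
        if key in seen:                        # same flow, same seq, payload again
            print(f"possible retransmission at t={float(p.time):.6f}, seq={p[TCP].seq}")
        seen.add(key)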

And the service response time depends on server resources, application design, architecture, backend systems, and a wide variety of components.

If performance issues occur

Which side is causing the performance issues experienced by the clients?

  • If the clients can rule out local errors, the cause should be the network or the server.
  • If the server receives a request and processes it instantly and fast, the cause is at the client or in the network.
  • If the network can guarantee fast and error-free end-to-end throughput, the cause must be the client or the server.

It happens often that all sides declare their innocence based on the information available to them, yet the problem persists!

To resolve issues, people need to understand quickly and precisely which side is responsible. Done with the right methods, this is not a blame game.

Method of analysis using packet capture

One of the most precise methods for such situations is packet capture and analytics.


 

Typical questions to clarify

  • A client sends a request – was it received at the server?
  • Was it received correctly?
  • What is the delay between sending and receiving?
  • A client retransmits a request – why was it retransmitted?
  • Was the first request not received by the server, or did the server reply and the reply was not received at the client?
  • Were both requests received at the server?
  • If packet loss occurs, where does it happen?
  • Are there any differences between the send and the receive order?
  • Were client requests received out of order?
  • Is the TTL constant at the receiver side?

These are essential questions, and their answers can point to the cause of the delay.

But with single-trace-file tools this sounds like a huge amount of manual work: comparing packet by packet in Wireshark.
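
To make that concrete, here is a minimal sketch of such a packet-by-packet comparison across two traces, matching payload-carrying client packets to the server-side capture via IP ID and TCP sequence number. File names and the client address are assumptions, and both capture hosts are assumed to have synchronized clocks:

    from scapy.all import rdpcap, IP, TCP

    CLIENT = "192.168.1.10"                    # assumption: the client address

    def index_payload_packets(pcap_file, src):
        """Map (ip.id, tcp.seq) to packet, for payload-carrying TCP packets sent by src."""
        idx = {}
        for p in rdpcap(pcap_file):
            if (p.haslayer(IP) and p.haslayer(TCP)
                    and p[IP].src == src and len(p[TCP].payload) > 0):
                idx[(p[IP].id, p[TCP].seq)] = p
        return idx

    at_client = index_payload_packets("client.pcap", CLIENT)
    at_server = index_payload_packets("server.pcap", CLIENT)

    for key, cp in at_client.items():
        sp = at_server.get(key)
        if sp is None:
            print(f"ip.id={key[0]} seq={key[1]}: not seen at the server")
        else:
            delay_ms = (float(sp.time) - float(cp.time)) * 1000
            ttl_drop = cp[IP].ttl - sp[IP].ttl  # a constant value means a constant route
            print(f"ip.id={key[0]} seq={key[1]}: one-way delay {delay_ms:.1f} ms, "
                  f"TTL decreased by {ttl_drop}")

A packet present at the client but absent at the server localizes the loss; a changing TTL difference points to rerouting.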


Automation through multi-side traces / auto-correlation (MTAC)

… a feature developed for SharkMon, our solution for analyzing large numbers of pcap files with Wireshark metrics to provide ongoing monitoring.

SharkMon enables users to define packet analysis scenarios for thousands of trace files, providing constant monitoring, and to correlate and compare both sides based on freely definable metrics of packet content, using the same variety and syntax as Wireshark.
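
As an illustration of that metric syntax (this is the underlying Wireshark vocabulary, not SharkMon's API), tshark exposes every Wireshark field and display filter on the command line, so per-packet metrics can be pulled from any pcap file like this:

    import subprocess

    cmd = [
        "tshark", "-r", "client.pcap",         # assumption: file name
        "-Y", "tcp.analysis.retransmission",   # any Wireshark display filter works here
        "-T", "fields",
        "-e", "frame.time_epoch", "-e", "ip.src", "-e", "tcp.seq",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)                       # one line per matching packet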

 

If traces/pcap files of the sender (client PC) and the receiver (server-side tcpdump, datacenter capture probe) exist for the same time window, they can be synchronized, and issues of data integrity, packet loss, timing, route changes, application performance, etc. can be identified easily.
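
Synchronization is the critical step. As a rough sanity check of the clock alignment between the two capture points, the offset can be estimated from matched packets in both directions, under the assumption of a roughly symmetric path. A minimal sketch; addresses and file names are placeholders:

    from scapy.all import rdpcap, IP, TCP

    CLIENT, SERVER = "192.168.1.10", "10.0.0.5"    # assumptions

    def times_by_key(pcap_file, src):
        """Map (ip.id, tcp.seq) to capture time, for packets sent by src."""
        return {
            (p[IP].id, p[TCP].seq): float(p.time)
            for p in rdpcap(pcap_file)
            if p.haslayer(IP) and p.haslayer(TCP) and p[IP].src == src
        }

    # Apparent one-way delays in both directions, client->server and back.
    c2s_sent = times_by_key("client.pcap", CLIENT)
    c2s_rcvd = times_by_key("server.pcap", CLIENT)
    s2c_sent = times_by_key("server.pcap", SERVER)
    s2c_rcvd = times_by_key("client.pcap", SERVER)

    fwd = [c2s_rcvd[k] - c2s_sent[k] for k in c2s_sent if k in c2s_rcvd]
    rev = [s2c_rcvd[k] - s2c_sent[k] for k in s2c_sent if k in s2c_rcvd]

    if fwd and rev:
        offset = (min(fwd) - min(rev)) / 2         # close to 0 when the clocks agree
        print(f"estimated clock offset between the captures: {offset * 1000:.1f} ms")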

 

Process

First check: measure service time to identify issues.

This should be done on the client side and at the providing site/server at the same time, to make the difference clear: how fast is the reply received at the client, compared to how fast it leaves the server?

 

The picture below shows the results from the two capture locations:

  • Client side – the green line
  • Server side – the red line

The gap between the two lines is the network time.
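
For example, with purely illustrative numbers: if the client-side trace shows the reply arriving 120 ms after the request was sent, while the server-side trace shows the reply leaving only 80 ms after the request arrived, then roughly 40 ms of the total response time are spent on the network path.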

 

You can identify a spike on both lines at 1:10 pm, which makes clear that this is a server/backend issue: the response time measured directly on the server via tcpdump shows the same outage, so this is local server performance.

But people still want to understand the variance of the network effects here.

The following picture shows the variation of the network performance on the two sides:

  • TTL – for measuring the number of routing hops end-to-end; a variation would point to rerouting and latency changes
  • iRTT – the initial latency based on the TCP three-way handshake (see the sketch after this list)
  • RTO – the time for retransmission effects
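
The iRTT mentioned in the list can be reproduced from a single trace: it is the gap between a client's SYN and the matching SYN/ACK. A minimal sketch with scapy; the file name is an assumption:

    from scapy.all import rdpcap, TCP

    syn_time = {}
    for p in rdpcap("client.pcap"):            # assumption: file name
        if not p.haslayer(TCP):
            continue
        f = p[TCP].flags
        if f.S and not f.A:                    # SYN: remember when it was sent
            syn_time[(p[TCP].sport, p[TCP].dport)] = float(p.time)
        elif f.S and f.A:                      # SYN/ACK: look up the matching SYN
            t0 = syn_time.get((p[TCP].dport, p[TCP].sport))
            if t0 is not None:
                print(f"iRTT: {(float(p.time) - t0) * 1000:.1f} ms")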

The results here are clear:

  • TTL did not change – the route between both sides is constant.
  • iRTT is constant on both sides – not much variation in latency.
  • RTO is constant at the client side and higher than on the server side – the experienced networking issue will not be on the server side.

Summary

Using two-side trace correlation, client and service effects are identified fast and precisely.

Finger-pointing is clearly avoided, and downtime is reduced.

 

It can show clearly whether the incident was caused in the transfer network or in the server-side infrastructure.

In case of network issues, it can clearly identify the parameters that really matter.

 

SharkMon by interview network solutions

SharkMon can collect packet data from multiple locations and entities:

  • directly on the service using tcpdump/tshark (see the sketch below)
  • large capture probes via API
  • manual upload of trace files
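
For the first option, a plain rotating capture on the service host is enough to produce importable files. A minimal sketch; interface, path, and port are assumptions, and the command typically requires root privileges:

    import subprocess

    subprocess.run([
        "tcpdump", "-i", "eth0", "-s", "0",            # assumption: interface
        "-w", "/var/tmp/service-%Y%m%d-%H%M%S.pcap",   # assumption: target path
        "-G", "300",                                   # rotate the file every 5 minutes
        "-W", "48",                                    # stop after 48 rotated files
        "tcp", "port", "443",                          # assumption: service port
    ], check=True)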

 

Users can import thousands of trace files to provide a constant monitoring history.

Trace files are organized in scenarios which can be easily correlated, allowing analysis setups like:

  • Client vs. server
  • Location A vs. location B
  • User A vs. user B
  • Application A vs. application B
  • Leaving country A vs. entering country B (geopolitical scenario)

 

It can use any metric that can be found in Wireshark for monitoring, allowing the deepest monitoring capability in the industry.

 

This allows usage in networking environments such as:

  • WAN / Network
  • datacenter
  • Cloud – IaaS/PaaS
  • WLAN
  • VPN
  • Industry / industrial Ethernet
  • User endpoints