Multi-side traces / auto-correlation

IT performance analysis is always a comparison task, usually between at least two states: a "good" state that is estimated, precalculated, guessed, experienced before or elsewhere, or simply expected, and the experienced "bad" state.

Such comparisons can cover a multitude of scenarios, for example:

  • yesterday (good) vs. today (bad)
  • my guess or understanding of "should be" vs. the experienced bad state
  • location A vs. location B
  • application A vs. application B
  • user A vs. user B

One of the most reliable ways to understand performance is the analysis of packet data.

It is like a blood test, compared to asking the patient or measuring the temperature.

Traces cannot lie, because they provide measured, uninterpreted and unchanged data.

In the following, an approach is described that uses pcap data from two sides for easy and fast troubleshooting of performance issues of IT services.

In every IT environment, at least three areas are responsible for performance:

  • The client side
  • The network
  • The application / server side (which can of course be very complex)

For a client, the time between request and reply determines the experienced performance.
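
As a rough illustration of how this request-to-reply time can be read from a client-side capture, the sketch below pulls the HTTP response-time field out of a pcap with tshark; the file name client.pcap and the assumption of an HTTP service are placeholders:

    import statistics
    import subprocess

    # Pull the time between each HTTP request and its reply from a
    # client-side capture. "client.pcap" is a placeholder; for non-HTTP
    # services another field (e.g. tcp.analysis.ack_rtt) could be used.
    out = subprocess.run(
        ["tshark", "-r", "client.pcap",
         "-Y", "http.time",                    # frames that carry a response time
         "-T", "fields", "-e", "http.time"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    times = [float(t) for t in out]
    print(f"{len(times)} replies, median {statistics.median(times):.3f} s, "
          f"max {max(times):.3f} s")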

Performance issues

A client establishes a session to a server, and all data of this session should be in sync; for example, the number of packets sent by the client should equal the number of packets received on the server side.
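
A minimal sketch of such a sync check, assuming one capture file per side (client.pcap, server.pcap) and an illustrative session filter:

    import subprocess

    def count(pcap: str, display_filter: str) -> int:
        """Number of packets in a capture matching a Wireshark display filter."""
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", display_filter,
             "-T", "fields", "-e", "frame.number"],
            capture_output=True, text=True, check=True,
        ).stdout
        return len(out.split())

    # Illustrative session filter: packets the client sends towards the service.
    flt = "ip.src == 10.0.0.10 && tcp.dstport == 443"
    sent = count("client.pcap", flt)   # seen leaving the client
    got  = count("server.pcap", flt)   # seen arriving at the server
    print(f"client sent {sent}, server received {got}, missing {sent - got}")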

  • Under optimal conditions, network time would not change and performance would be predictable.
  • Under normal conditions, network performance does change, and service performance changes with it.

Latency can rise through delays on components or through rerouting (for example from Berlin to Munich via NYC). Packets are not received and acknowledged, so retransmissions triggered by sequencing or timeout may occur.

And the service response time depends on server resources, application design, architecture, backend systems and a wide variety of other components.

If performance issues occur

Which side is causing the performance issues experienced by the clients?

  • If the client can rule out a local error, the cause should be the network or the server.
  • If the server receives a request and processes it instantly and fast, the cause is at the client or in the network.
  • If the network can guarantee fast and error-free end-to-end throughput, the cause must be the client or the server.

It often happens that all sides declare their innocence based on the information available to them, but the problem persists!

To resolve such issues, people need to understand quickly and precisely which side is responsible. This is not a blame game if it is done with the right methods.

Method of analysis using packet capture

One of the most precise methods for such a situation is packet capture & analytics.


 

Typical questions to clarify

  • A client sends a request: was it received at the server?
  • Was it received correctly?
  • What is the delay between send and receive?
  • A client retransmits a request: why was it retransmitted?
  • Was the first request not received by the server, or did the server reply and the reply never reached the client?
  • Were both requests received at the server?
  • If packet loss occurs, where does it happen?
  • Are there any differences between send order and receive order?
  • Was the client request received out of order?
  • Is the TTL constant at the receiver side?

These are essential questions, and their answers can point to the cause of the delay.

But with single-trace-file tools this means a huge amount of manual work, comparing packet by packet in Wireshark.
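
To make this concrete, the sketch below matches the client's packets against the server-side trace by IP ID and absolute TCP sequence number and reports, per packet, whether it arrived and how long it took. It only illustrates the manual approach (it is not SharkMon's implementation) and assumes the capture clocks on both sides are reasonably synchronized:

    import subprocess

    def index(pcap: str, display_filter: str) -> dict:
        """Map (ip.id, absolute tcp.seq) -> capture timestamp."""
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", display_filter,
             "-o", "tcp.relative_sequence_numbers:FALSE",   # absolute sequence numbers
             "-T", "fields",
             "-e", "ip.id", "-e", "tcp.seq", "-e", "frame.time_epoch"],
            capture_output=True, text=True, check=True,
        ).stdout
        packets = {}
        for line in out.splitlines():
            ip_id, seq, ts = line.split("\t")
            packets[(ip_id, seq)] = float(ts)
        return packets

    # Illustrative filter for one direction of one session.
    flt = "ip.src == 10.0.0.10 && tcp.dstport == 443"
    client = index("client.pcap", flt)
    server = index("server.pcap", flt)

    for key, t_sent in client.items():
        if key in server:
            print(f"{key}: one-way delay {server[key] - t_sent:.6f} s")
        else:
            print(f"{key}: not seen at the server capture point (lost or rerouted)")

Missing keys indicate loss before the server capture point, repeated keys on the client side indicate retransmissions, and comparing the order of timestamps on both sides reveals reordering.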


Automation by multi-side traces / auto-correlation (MTAC)

… a feature developed for SharkMon, our solution for analyzing large numbers of pcap files with Wireshark metrics to provide ongoing monitoring.

SharkMon enables users to define packet analysis scenarios for thousands of trace files, providing constant monitoring, and to correlate and compare both sides based on freely definable metrics of packet content, using the same variety and syntax as Wireshark.
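
As a rough sketch of what such a freely definable, filter-based metric could look like outside the product (the names, files and structure below are illustrative and not SharkMon's configuration format):

    import subprocess

    # Each metric is just a name, a Wireshark display filter and the field
    # whose values are collected - the same syntax Wireshark itself uses.
    METRICS = [
        ("retransmissions",    "tcp.analysis.retransmission", "frame.number"),
        ("http_response_time", "http.time",                   "http.time"),
        ("ttl",                "ip",                          "ip.ttl"),
    ]

    def evaluate(pcap: str, display_filter: str, field: str) -> list[str]:
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", display_filter, "-T", "fields", "-e", field],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.split()

    for name, flt, field in METRICS:
        print(name, len(evaluate("client.pcap", flt, field)), "values")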

 

If traces / pcap files of the sender (client PC) and the receiver (server-side tcpdump, datacenter capture probe) exist for the same time frame, they can be synchronized, and issues of data integrity, packet loss, timing, route changes, application performance etc. can be identified easily.

 

Process

First check: measure service time to identify issues.

This should be done on the client side to make the difference clear: how fast is the reply received at the client, and how fast at the providing site / server at the same time.
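
A simple way to obtain this two-sided view, sketched under the same assumptions as above (an HTTP service and one capture file per side), is to compute the service time per trace and compare the two; the difference is the network share:

    import statistics
    import subprocess

    def response_times(pcap: str) -> list[float]:
        """HTTP request-to-reply times seen in one capture file."""
        out = subprocess.run(
            ["tshark", "-r", pcap, "-Y", "http.time", "-T", "fields", "-e", "http.time"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [float(t) for t in out.split()]

    client = statistics.median(response_times("client.pcap"))   # as experienced by the user
    server = statistics.median(response_times("server.pcap"))   # pure server processing time
    print(f"client {client:.3f} s, server {server:.3f} s, network share {client - server:.3f} s")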

 

The picture below shows the results from the two capture locations:

  • Client side - the green line
  • Server-side - the red line

The gap between the two lines is the network time.

 

You can identify a spike on both lines at 1:10 pm, which makes clear that this is a server / backend issue: the response time measured directly on the server via tcpdump shows the same outage, so this is local server performance.

But people still want to understand the variance of network effects here.

The following picture shows the variation of network performance on the two sides:

  • TTL - the number of routing hops end-to-end; a variation would point to rerouting and changed latency
  • iRTT - the initial round-trip time, the latency derived from the TCP three-way handshake
  • RTO - the retransmission timeout, showing retransmission effects
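
These three values map to standard Wireshark fields and can be pulled per side from the traces; a minimal sketch, with placeholder file names:

    import subprocess

    # Extract TTL, initial RTT and RTO values from one side's capture.
    FIELDS = ["ip.ttl", "tcp.analysis.initial_rtt", "tcp.analysis.rto"]

    def network_metrics(pcap: str) -> list[list[str]]:
        cmd = ["tshark", "-r", pcap, "-Y", "tcp", "-T", "fields"]
        for f in FIELDS:
            cmd += ["-e", f]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return [line.split("\t") for line in out.splitlines()]

    for side in ("client.pcap", "server.pcap"):   # placeholder file names
        rows  = network_metrics(side)
        ttls  = {r[0] for r in rows if r and r[0]}
        irtts = [r[1] for r in rows if len(r) > 1 and r[1]]
        rtos  = [r[2] for r in rows if len(r) > 2 and r[2]]
        print(f"{side}: TTLs seen {ttls}, iRTT samples {len(irtts)}, RTO events {len(rtos)}")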

The results here are clear:

  • TTL did not change: the route between both sides is constant.
  • iRTT is constant on both sides: not much variation in latency.
  • RTO is constant at the client side and higher than at the server side: the experienced networking issue is not on the server side.

Summary

Using two-side trace correlation, client / service effects are identified fast and precisely.

Finger-pointing is avoided and downtime is reduced.

 

It can show clearly whether the incident was caused in the transfer network or in the server-side infrastructure.

In case of network issues, it can clearly identify the really important parameters.

 

SharkMon by interview network solutions

SharkMon can collect packet data from multiple locations and entities:

  • directly on the service using tcpdump / tshark (see the sketch after this list)
  • large capture probes via API
  • manual upload of trace files
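
For the first option, a rotating capture directly on the service host could look like the sketch below; interface, port filter, rotation interval and output path are placeholders, and the process runs until it is stopped:

    import subprocess

    # Rotating capture on the service host. Each file covers 5 minutes and can
    # later be uploaded or fetched for correlation.
    subprocess.run([
        "tcpdump",
        "-i", "eth0",                                   # capture interface (placeholder)
        "-s", "0",                                      # do not truncate packets
        "-G", "300",                                    # start a new file every 300 s
        "-w", "/var/tmp/service_%Y%m%d_%H%M%S.pcap",    # strftime pattern per file
        "port", "443",                                  # limit to the service traffic (placeholder)
    ], check=True)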

 

Users can import thousands of trace files to provide a constant monitoring history.

Trace files are organized in scenarios which can be easily correlated, allowing analysis scenarios like:

  • client vs. server
  • location A vs. location B
  • user A vs. user B
  • application A vs. application B
  • leaving country A vs. entering country B (geo-political scenario)

 

It can use any metric available in Wireshark for monitoring, allowing the deepest monitoring capability in the industry.

 

This allows usage in networking environments such as:

  • WAN / network
  • Datacenter
  • Cloud - IaaS / PaaS
  • Wireless internet access
  • VPN
  • Industry / industrial Ethernet
  • User endpoints