Filtering bad, sampling good
This post shows how to use adaptive sampling instead of filtering for “noisy” telemetry, and explains the benefits of this approach.
The idea for this post comes from two conversations. The first, on Twitter, was about how to filter out fast requests and dependency calls:
The second is a design discussion about sampling per endpoint URL path for OpenCensus:
I’ll give the examples in C#, since this post adds a scenario to the MSDN article. The same techniques can be used in any other language supported by Application Insights.
Idea
Application Insights collects information about every incoming and outgoing call to your app in the form of the C# objects RequestTelemetry and DependencyTelemetry. Each object is passed to the processing pipeline, which can enrich it with additional details or filter it out. This approach requires allocating an object and populating its properties for every request, but it makes writing filters a straightforward task.
Let’s implement a processor that aggressively samples fast incoming and outgoing calls. First, let’s handle fast incoming requests. A processor is a class that implements the interface ITelemetryProcessor. It should have a constructor that takes another ITelemetryProcessor as an argument. When the pipeline is instantiated, the constructor is called and the next processor is passed in as a parameter. In its Process method, the processor then decides whether to call next.Process. If it does, the next processor decides whether to pass the item further; if next.Process is not called, the telemetry object is lost.
The idea is to create a new AdaptiveSamplingTelemetryProcessor (one of the standard processors) and pass next to its constructor. That processor will then decide what to do with the telemetry object: either filter it out or pass it on to next.
Basically, all telemetry forks: fast calls go through the aggressive sampling processor, while slow calls go directly to the next one.
Implementation
This is how the code looks:
internal class AggressivelySampleFastRequests : ITelemetryProcessor
{
    private readonly ITelemetryProcessor next;
    private readonly AdaptiveSamplingTelemetryProcessor samplingProcessor;

    public AggressivelySampleFastRequests(ITelemetryProcessor next)
    {
        this.next = next;
        this.samplingProcessor = new AdaptiveSamplingTelemetryProcessor(this.next);
    }

    public void Process(ITelemetry item)
    {
        // check the telemetry type and duration
        if (item is RequestTelemetry request &&
            request.Duration < TimeSpan.FromMilliseconds(500))
        {
            // let the sampling processor decide what to do
            // with this fast incoming request
            this.samplingProcessor.Process(item);
            return;
        }

        // in all other cases simply call next
        this.next.Process(item);
    }
}
Further customization of the processor might include adjusting AdaptiveSamplingTelemetryProcessor parameters like this:
this.samplingProcessor = new AdaptiveSamplingTelemetryProcessor(this.next)
{
    ExcludedTypes = "Event", // exclude custom events from being sampled
    MaxTelemetryItemsPerSecond = 1, // default: 5 calls/sec
    SamplingPercentageIncreaseTimeout = TimeSpan.FromSeconds(1), // default: 2 min
    SamplingPercentageDecreaseTimeout = TimeSpan.FromSeconds(1), // default: 30 sec
    EvaluationInterval = TimeSpan.FromSeconds(1), // default: 15 sec
    InitialSamplingPercentage = 25, // default: 100%
};
Another customization would be making the threshold TimeSpan.FromMilliseconds(500) configurable.
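One way to do that is to pass the threshold through the constructor; a minimal sketch, where the parameter name slowCallThreshold is my own, not from the SDK:

```csharp
internal class AggressivelySampleFastRequests : ITelemetryProcessor
{
    private readonly ITelemetryProcessor next;
    private readonly AdaptiveSamplingTelemetryProcessor samplingProcessor;
    private readonly TimeSpan slowCallThreshold;

    // slowCallThreshold is a hypothetical parameter; wire it to
    // whatever configuration system your app uses
    public AggressivelySampleFastRequests(ITelemetryProcessor next, TimeSpan slowCallThreshold)
    {
        this.next = next;
        this.slowCallThreshold = slowCallThreshold;
        this.samplingProcessor = new AdaptiveSamplingTelemetryProcessor(this.next);
    }

    public void Process(ITelemetry item)
    {
        // fast requests go to the aggressive sampler
        if (item is RequestTelemetry request && request.Duration < this.slowCallThreshold)
        {
            this.samplingProcessor.Process(item);
            return;
        }

        // everything else goes straight to the next processor
        this.next.Process(item);
    }
}
```

Registration would then pass the value explicitly, e.g. .Use((next) => new AggressivelySampleFastRequests(next, TimeSpan.FromMilliseconds(500))).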
Similar code can be implemented for processing outgoing calls. Those two processors can then be chained together:
configuration.TelemetryProcessorChainBuilder
.Use((next) => { return new AggressivelySampleFastRequests(next); })
.Use((next) => { return new AggressivelySampleFastDependencies(next); })
.Build();
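The AggressivelySampleFastDependencies processor is not shown in this post; a sketch, assuming it simply mirrors the requests processor but checks DependencyTelemetry with the same 500 ms threshold:

```csharp
internal class AggressivelySampleFastDependencies : ITelemetryProcessor
{
    private readonly ITelemetryProcessor next;
    private readonly AdaptiveSamplingTelemetryProcessor samplingProcessor;

    public AggressivelySampleFastDependencies(ITelemetryProcessor next)
    {
        this.next = next;
        this.samplingProcessor = new AdaptiveSamplingTelemetryProcessor(this.next);
    }

    public void Process(ITelemetry item)
    {
        // let the sampling processor decide what to do
        // with this fast outgoing call
        if (item is DependencyTelemetry dependency &&
            dependency.Duration < TimeSpan.FromMilliseconds(500))
        {
            this.samplingProcessor.Process(item);
            return;
        }

        // slow dependencies and all other telemetry go to next
        this.next.Process(item);
    }
}
```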
Interestingly, you can chain those two processors with the regular adaptive sampling processor. Fast calls will not be processed by the final AdaptiveSamplingTelemetryProcessor, as they will already be marked as sampled-in. Only slow requests and slow dependencies will be analyzed by the global processor.
configuration.TelemetryProcessorChainBuilder
.Use((next) => { return new AggressivelySampleFastRequests(next); })
.Use((next) => { return new AggressivelySampleFastDependencies(next); })
.Use((next) => { return new AdaptiveSamplingTelemetryProcessor(next); })
.Build();
Results and benefits
So how will the collected telemetry look? To illustrate, let’s assume that the fast-requests and fast-dependencies processors preserved only 5% of the telemetry representing fast calls, and the global sampler preserved 20% of telemetry. I included this demo in the Azure-Samples repository.
When you analyze telemetry for this app, you will probably start with slow requests. You will find that every collected incoming request has information about all of its slow outgoing calls. You may also discover, while browsing the Azure portal, that in many cases you will see examples of slow incoming requests with both slow and fast calls collected.
The reason for this is that adaptive sampling uses a probability sampling algorithm to make the sampling decision. If the sampling score is lower than 20%, the slow request and its slow dependencies will be collected. If the sampling score is lower than 5%, the fast dependencies will also be collected.
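To illustrate how the two percentages compose (the score value here is a stand-in; the SDK derives a score per operation, which is why all items of one operation share the same decision):

```csharp
// hypothetical score assigned to the whole operation, in the range 0-100
double samplingScore = 3.7;

// the 20% global sampler keeps the slow requests and slow dependencies
bool slowCallsCollected = samplingScore < 20;

// the 5% aggressive samplers additionally keep the fast calls;
// any operation that passes the 5% bar also passes the 20% bar,
// which is why fast-call examples arrive with their slow calls attached
bool fastCallsCollected = samplingScore < 5;
```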
The Azure portal highlights examples with the lowest sampling score as the most interesting to analyze. This is why you will get many useful samples with all the details. When you scroll down or open examples at random, you will see many of them with only the slow dependency collected.
Under these conditions, all collected fast requests will have the details of all their fast dependency calls. Again, browsing those may be useful.
Here I should note that the end-to-end transaction view that spans many components follows the same principles. If the sampling percentages of the collected telemetry are generally the same, and the algorithms for deciding on cohorts (the thresholds separating slow from fast) are similar, there will always be examples of end-to-end execution. I might write more about this next time.
There are many other benefits of aggressive sampling over filtering beyond preserving 5% of end-to-end transaction views.
The application map will be more complete. By filtering out fast dependencies, you may filter out all knowledge of an entire service you depend on, simply because it runs fast.
Endpoint URL paths will be discoverable on the performance page in the Azure portal, so you will be able to catch major performance degradations there. If calls that used to take 10 ms start taking 200 ms, the change will be visible on the chart even with only 5% of telemetry, and it will likely be caught by the smart detectors of Application Insights.
Latency problems caused by multiple calls to a fast dependent service will also be caught; there will probably be examples of those in the collected 5%.
Summary
The ROI of monitoring is not a simple topic. You can collect tons of logs and traces and still have no clue when something goes wrong. Or a single trace may give you the root cause of a major app slowdown. Every time you filter out telemetry, you are losing a signal that might have saved your day.
Application Insights allows a lot of flexibility in its data processing configuration, so you can save on COGS and still preserve some of the signal. Use it!