APM tips blog

Blog about application monitoring.

Clone Application Insights Dashboard for a Different Application

You may have many environments where your application is running. For every environment you will create a separate Application Insights resource so you can set up access and billing for production telemetry differently from the QA environment. However, you may want to have the same dashboard for every environment. You may even want to deploy dashboard updates alongside the application deployment, so when your application exposes new telemetry, the dashboard will visualize it.

This blog post explains how to clone a dashboard and retarget it to a different Application Insights component using Azure Resource Manager (ARM).

Let’s say you have Dashboard A for component A and you want to create the same dashboard for component B. In my example I simply pinned the servers chart to the dashboard, but it may be far more advanced in your case.

In order to clone the dashboard you need to share it first. Sharing places the dashboard definition into a resource group, so you can see it in the Azure Resource Explorer.

Once shared, the URL for your dashboard will look like this: https://portal.azure.com/#dashboard/arm/subscriptions/6b984a40-aa54-452b-b975-acc3bf105fa7/resourcegroups/dashboards/providers/microsoft.portal/dashboards/7a2a64c5-a661-47c1-a1a3-afae823d7533. It includes the subscription, the resource group, and the dashboard's unique name. Copy the dashboard's unique name (in this case 7a2a64c5-a661-47c1-a1a3-afae823d7533) and find it at https://resources.azure.com.

The direct URL to your dashboard definition will look like this: https://resources.azure.com/subscriptions/6b984a40-aa54-452b-b975-acc3bf105fa7/resourceGroups/dashboards/providers/Microsoft.Portal/dashboards/7a2a64c5-a661-47c1-a1a3-afae823d7533

Now you can copy the dashboard definition:

{
  "properties": {
    "lenses": {
      "0": {
        "order": 0,
        "parts": {
          "0": {
            "position": {
              "x": 0,
              "y": 0,
              "rowSpan": 5,
              "colSpan": 6
            },
            "metadata": {
              "inputs": [
                {
                  "name": "ComponentId",
                  "value": {
                    "SubscriptionId": "6b984a40-aa54-452b-b975-acc3bf105fa7",
                    "ResourceGroup": "A",
                    "Name": "A"
                  }
                },
                {
                  "name": "MetricsExplorerJsonDefinitionId",
                  "value": "pinJson:?name={\n  \"version\": \"1.4.1\",\n  \"isCustomDataModel\": false,\n  \"items\": [\n    {\n      \"id\": \"b2f8708b-4a48-4b35-b96e-7622caca21ce\",\n      \"chartType\": \"Area\",\n      \"chartHeight\": 4,\n      \"metrics\": [\n        {\n          \"id\": \"performanceCounter.percentage_processor_time.value\",\n          \"metricAggregation\": \"Avg\",\n          \"color\": \"msportalfx-bgcolor-g0\"\n        }\n      ],\n      \"priorPeriod\": false,\n      \"clickAction\": {\n        \"defaultBlade\": \"SearchBlade\"\n      },\n      \"horizontalBars\": true,\n      \"showOther\": true,\n      \"aggregation\": \"Avg\",\n      \"percentage\": false,\n      \"palette\": \"blueHues\",\n      \"yAxisOption\": 0\n    },\n    {\n      \"id\": \"093583d1-bc86-4c2e-91d8-527a2411910b\",\n      \"chartType\": \"Area\",\n      \"chartHeight\": 1,\n      \"metrics\": [\n        {\n          \"id\": \"performanceCounter.available_bytes.value\",\n          \"metricAggregation\": \"Avg\",\n          \"color\": \"msportalfx-bgcolor-j1\"\n        }\n      ],\n      \"priorPeriod\": false,\n      \"clickAction\": {\n        \"defaultBlade\": \"SearchBlade\"\n      },\n      \"horizontalBars\": true,\n      \"showOther\": true,\n      \"aggregation\": \"Avg\",\n      \"percentage\": false,\n      \"palette\": \"greenHues\",\n      \"yAxisOption\": 0\n    },\n    {\n      \"id\": \"03fd5488-b020-417b-97e2-bf7564568d3b\",\n      \"chartType\": \"Area\",\n      \"chartHeight\": 1,\n      \"metrics\": [\n        {\n          \"id\": \"performanceCounter.io_data_bytes_per_sec.value\",\n          \"metricAggregation\": \"Avg\",\n          \"color\": \"msportalfx-bgcolor-g0\"\n        }\n      ],\n      \"priorPeriod\": false,\n      \"clickAction\": {\n        \"defaultBlade\": \"SearchBlade\"\n      },\n      \"horizontalBars\": true,\n      \"showOther\": true,\n      \"aggregation\": \"Avg\",\n      \"percentage\": false,\n      \"palette\": 
\"blueHues\",\n      \"yAxisOption\": 0\n    },\n    {\n      \"id\": \"c31fd4cc-be41-449e-a657-d16d2e9c8487\",\n      \"chartType\": \"Area\",\n      \"chartHeight\": 1,\n      \"metrics\": [\n        {\n          \"id\": \"performanceCounter.number_of_exceps_thrown_per_sec.value\",\n          \"metricAggregation\": \"Avg\",\n          \"color\": \"msportalfx-bgcolor-d0\"\n        }\n      ],\n      \"priorPeriod\": false,\n      \"clickAction\": {\n        \"defaultBlade\": \"SearchBlade\"\n      },\n      \"horizontalBars\": true,\n      \"showOther\": true,\n      \"aggregation\": \"Avg\",\n      \"percentage\": false,\n      \"palette\": \"fail\",\n      \"yAxisOption\": 0\n    },\n    {\n      \"id\": \"8b942f02-ef58-46ac-877a-2f4c16a17a4f\",\n      \"chartType\": \"Area\",\n      \"chartHeight\": 1,\n      \"metrics\": [\n        {\n          \"id\": \"performanceCounter.requests_per_sec.value\",\n          \"metricAggregation\": \"Avg\",\n          \"color\": \"msportalfx-bgcolor-b2\"\n        }\n      ],\n      \"priorPeriod\": false,\n      \"clickAction\": {\n        \"defaultBlade\": \"SearchBlade\"\n      },\n      \"horizontalBars\": true,\n      \"showOther\": true,\n      \"aggregation\": \"Avg\",\n      \"percentage\": false,\n      \"palette\": \"warmHues\",\n      \"yAxisOption\": 0\n    }\n  ],\n  \"title\": \"Servers\",\n  \"currentFilter\": {\n    \"eventTypes\": [\n      10\n    ],\n    \"typeFacets\": {},\n    \"isPermissive\": false\n  },\n  \"jsonUri\": \"MetricsExplorerPinJsonDefinitionId - Dashboard.f9bfee41-bd32-47a7-ae11-7d2038cd3c44 - Pinned from 'AspNetServersMetrics'\"\n}"
                },
                {
                  "name": "BladeId",
                  "value": "Dashboard.f9bfee41-bd32-47a7-ae11-7d2038cd3c44"
                },
                {
                  "name": "TimeContext",
                  "value": {
                    "durationMs": 86400000,
                    "createdTime": "2017-03-23T19:54:01.552Z",
                    "isInitialTime": false,
                    "grain": 1,
                    "useDashboardTimeRange": false
                  }
                },
                {
                  "name": "Version",
                  "value": "1.0"
                },
                {
                  "name": "DashboardTimeRange",
                  "value": {
                    "relative": {
                      "duration": 1440,
                      "timeUnit": 0
                    }
                  },
                  "isOptional": true
                }
              ],
              "type": "Extension/AppInsightsExtension/PartType/MetricsExplorerOutsideMEBladePart",
              "settings": {},
              "viewState": {
                "content": {}
              },
              "asset": {
                "idInputName": "ComponentId",
                "type": "ApplicationInsights"
              }
            }
          }
        }
      }
    },
    "metadata": {
      "model": {
        "timeRange": {
          "value": {
            "relative": {
              "duration": 24,
              "timeUnit": 1
            }
          },
          "type": "MsPortalFx.Composition.Configuration.ValueTypes.TimeRange"
        }
      }
    }
  },
  "id": "/subscriptions/6b984a40-aa54-452b-b975-acc3bf105fa7/resourceGroups/dashboards/providers/Microsoft.Portal/dashboards/7a2a64c5-a661-47c1-a1a3-afae823d7533",
  "name": "7a2a64c5-a661-47c1-a1a3-afae823d7533",
  "type": "Microsoft.Portal/dashboards",
  "location": "centralus",
  "tags": {
    "hidden-title": "Dashboard A"
  }
}

In order to retarget the dashboard, find all mentions of your Application Insights component and replace them with the new component. In my example there was only one mention:

"inputs": [
{
    "name": "ComponentId",
    "value": {
        "SubscriptionId": "6b984a40-aa54-452b-b975-acc3bf105fa7",
        "ResourceGroup": "B",
        "Name": "B"
    }
},

Then rename the dashboard:

"id": "/subscriptions/6b984a40-aa54-452b-b975-acc3bf105fa7/resourceGroups
                    /dashboards/providers/Microsoft.Portal/dashboards/DashboardB",
"name": "DashboardB",
"type": "Microsoft.Portal/dashboards",
"location": "centralus",
"tags": {
    "hidden-title": "Dashboard B"
}

Now you can create the new dashboard in the Azure Resource Explorer. Use “DashboardB” as the {Resource Name} and the updated JSON as the definition.

Then start using your dashboard in the portal. Note that one perk of creating the dashboard manually is that the unique name of the dashboard you created is human-readable, not a GUID: https://portal.azure.com/#dashboard/arm/subscriptions/6b984a40-aa54-452b-b975-acc3bf105fa7/resourcegroups/dashboards/providers/microsoft.portal/dashboards/dashboardb

With Azure Resource Manager you can automate this process and configure dashboard updates/deployments alongside the application, so the monitoring configuration becomes part of your service definition.
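For example, the retargeted definition can be wrapped in an ARM deployment template so that the target component becomes a parameter. This is a minimal sketch; the parameter names are illustrative, and the empty properties.lenses stands in for the lenses object copied from the shared dashboard (with the hardcoded ResourceGroup and Name in each ComponentId input replaced by [parameters('componentResourceGroup')] and [parameters('componentName')]):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "dashboardName": { "type": "string" },
    "componentName": { "type": "string" },
    "componentResourceGroup": { "type": "string" }
  },
  "resources": [
    {
      "type": "Microsoft.Portal/dashboards",
      "apiVersion": "2015-08-01-preview",
      "name": "[parameters('dashboardName')]",
      "location": "centralus",
      "tags": { "hidden-title": "[parameters('dashboardName')]" },
      "properties": {
        "lenses": { }
      }
    }
  ]
}
```

A template like this can then be deployed with az group deployment create or New-AzureRmResourceGroupDeployment as part of the application release.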

When 404 Is Not Tracked by Application Insights

Sometimes Application Insights doesn’t track web requests made to bad routes that result in response code 404, and the reason may not be clear initially. However, once you open the application from localhost and see the standard IIS error page, it becomes clearer: without a default route set up in your application, the 404 is returned by the StaticFile handler, not by the managed handler. This is what the error page says:

The easiest and most straightforward workaround is to change web.config according to this blog post: add runAllManagedModulesForAllRequests="true" and remove preCondition="managedHandler":

<modules runAllManagedModulesForAllRequests="true">
  <remove name="ApplicationInsightsWebTracking" />
  <add name="ApplicationInsightsWebTracking"
   type="Microsoft.ApplicationInsights.Web.ApplicationInsightsHttpModule, Microsoft.AI.Web"/>
</modules>

This way the Application Insights HTTP module runs on every request and you’ll capture all requests made to the bad routes.

Enable Application Insights Live Metrics From Code

Small tip on how to enable Application Insights Live Metrics from code.

Application Insights lets you view telemetry like CPU and memory in real time. The feature is called Live Metrics; we also call it Quick Pulse. You’d typically use it when something is happening with your application: deploying a new version, investigating an ongoing incident, or scaling it out. You can use it free of charge, as traffic to the Live Stream endpoint is not counted towards the bill.

The feature is implemented in the NuGet package Microsoft.ApplicationInsights.PerfCounterCollector. If you are using ApplicationInsights.config to configure monitoring, you add a telemetry module and a telemetry processor like you normally would:

<TelemetryModules>
  <Add Type="Microsoft.ApplicationInsights.Extensibility.PerfCounterCollector.
    QuickPulse.QuickPulseTelemetryModule, Microsoft.AI.PerfCounterCollector"/>
</TelemetryModules>

<TelemetryProcessors>
  <Add Type="Microsoft.ApplicationInsights.Extensibility.PerfCounterCollector.
    QuickPulse.QuickPulseTelemetryProcessor, Microsoft.AI.PerfCounterCollector"/>
</TelemetryProcessors>

However, simply adding them from code as you’d expect doesn’t work:

TelemetryConfiguration configuration = new TelemetryConfiguration();
configuration.InstrumentationKey = "9d3ebb4f-7a11-4fb1-91ac-7ca8a17a27eb";

configuration.TelemetryProcessorChainBuilder
    .Use((next) => { return new QuickPulseTelemetryProcessor(next); })
    .Build();

var QuickPulse = new QuickPulseTelemetryModule();
QuickPulse.Initialize(configuration);

You need to “connect” the module and the processor: store the processor while constructing the chain and register it with the telemetry module. The code will look like this:

TelemetryConfiguration configuration = new TelemetryConfiguration();
configuration.InstrumentationKey = "9d3ebb4f-7a11-4fb1-91ac-7ca8a17a27eb";

QuickPulseTelemetryProcessor processor = null;

configuration.TelemetryProcessorChainBuilder
    .Use((next) =>
    {
        processor = new QuickPulseTelemetryProcessor(next);
        return processor;
    })
    .Build();

var QuickPulse = new QuickPulseTelemetryModule();
QuickPulse.Initialize(configuration);
QuickPulse.RegisterTelemetryProcessor(processor);

Now, with a few lines of code, you can start monitoring your application in real time for free.

Fast OPTIONS Response Using Url Rewrite

Imagine you run a high-load web application. If this application should be accessible from different domains, you need to configure your server to respond correctly to OPTIONS requests. With IIS it is easy to configure a UrlRewrite rule that replies with preconfigured headers without any extra processing cost.

You need to configure an inbound rule that matches {REQUEST_METHOD} and replies 200 immediately. You also need a set of outbound rules that set the proper response headers like Access-Control-Allow-Methods. It will look like this:

<rewrite>
    <outboundRules>
        <rule name="Set Access-Control-Allow-Methods for OPTIONS response" preCondition="OPTIONS" patternSyntax="Wildcard">
            <match serverVariable="RESPONSE_Access-Control-Allow-Methods" pattern="*" negate="false" />
            <action type="Rewrite" value="POST" />
        </rule>
        <rule name="Set Access-Control-Allow-Headers for OPTIONS response" preCondition="OPTIONS" patternSyntax="Wildcard">
            <match serverVariable="RESPONSE_Access-Control-Allow-Headers" pattern="*" negate="false" />
            <action type="Rewrite" value="Origin, X-Requested-With, Content-Name, Content-Type, Accept" />
        </rule>
        <rule name="Set Access-Control-Allow-Origin for OPTIONS response" preCondition="OPTIONS" patternSyntax="Wildcard">
            <match serverVariable="RESPONSE_Access-Control-Allow-Origin" pattern="*" negate="false" />
            <action type="Rewrite" value="*" />
        </rule>
        <rule name="Set Access-Control-Max-Age for OPTIONS response" preCondition="OPTIONS" patternSyntax="Wildcard">
            <match serverVariable="RESPONSE_Access-Control-Max-Age" pattern="*" negate="false" />
            <action type="Rewrite" value="3600" />
        </rule>
        <rule name="Set X-Content-Type-Options for OPTIONS response" preCondition="OPTIONS" patternSyntax="Wildcard">
            <match serverVariable="RESPONSE_X-Content-Type-Options" pattern="*" negate="false" />
            <action type="Rewrite" value="nosniff" />
        </rule>
        <preConditions>
            <preCondition name="OPTIONS">
                <add input="{REQUEST_METHOD}" pattern="OPTIONS" />
            </preCondition>
        </preConditions>
    </outboundRules>
    <rules>
    <rule name="OPTIONS" patternSyntax="Wildcard" stopProcessing="true">
        <match url="*" />
        <conditions logicalGrouping="MatchAny">
            <add input="{REQUEST_METHOD}" pattern="OPTIONS" />
        </conditions>
        <action type="CustomResponse" statusCode="200" subStatusCode="0" statusReason="OK" statusDescription="OK" />
    </rule>
    </rules>
</rewrite>

I did some measurements locally and found that this simple rule saves a lot of CPU under high load. You can add this rule to your site’s web.config, or for Azure Web Apps you can configure these rules using an applicationHost.xdt file.
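For the Azure Web Apps case, the transform might look like the sketch below. It assumes no <rewrite> section exists yet at the server level (InsertIfMissing leaves an existing one intact), and the body is the rewrite section shown above:

```xml
<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <system.webServer>
    <!-- Insert the rewrite configuration from above at the server level. -->
    <rewrite xdt:Transform="InsertIfMissing">
      <!-- paste the <rules> and <outboundRules> sections shown above here -->
    </rewrite>
  </system.webServer>
</configuration>
```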

Now that you’ve configured it, how will you make sure it is working in production? Application Insights lets you run multi-step availability tests. Configuring one for OPTIONS required two hacks.

First, Visual Studio doesn’t let you pick the OPTIONS HTTP method, only GET and POST. To work around this issue I simply opened my .webtest file in a text editor and manually set the method to the value I needed:

<Request Method="OPTIONS" Version="1.1" Url="https://dc.services.visualstudio.com/v2/track"..

Second, there is no built-in response-header value validator, so I configured the web test to issue a deliberately “bad” request when the value of the extracted response header doesn’t match the expected value.

After I configured my web test I can see the test results in the standard UI, or simply run a query in Application Insights Analytics:

availabilityResults
| where timestamp > ago(1d)
| where name == "OPTIONS"
| summarize percentile(duration, 99) by location, bin(timestamp, 15m)

Deployments, Scale Ups and Downs

I once lost track of what was deployed in the staging slot of a cloud service. I was also wondering how other people were deploying that service. This post shows how you can answer questions like these using Application Insights Analytics queries.

The service I am looking at is deployed as two cloud services in different regions. It uses automatic code versioning via the BuildInfo.config file. A new version is deployed to the staging slot and then VIP-swapped into production.

As I said, Application Insights is configured to report the application version with every telemetry item, so you can group by application version and find when a new version got deployed.

performanceCounters
| where timestamp >= ago(5d)
| where name == "Requests/Sec" 
| summarize dcount(cloud_RoleInstance) by application_Version, bin(timestamp, 15m)

The query above detects deployments to staging, but it will not detect the VIP swap accurately. When a VIP swap happens, the same computers are running the same code, so the number of role instances reporting a specific application version in the query above does not change. The only thing that changes during the VIP swap is the virtual IP address of those computers.

I posted before how Application Insights associates the IP address of the incoming connection with a telemetry item if the item doesn’t have one specified. So all the performance counters will have the client_IP field of the incoming connection; in the case of a cloud service it will be the IP address of the slot sending telemetry. Let’s use this fact and extend application_Version with client_IP.

let interval = 5d;
performanceCounters
| where timestamp >= ago(interval)
| where name == "Requests/Sec" 
| extend deployment = strcat(application_Version, " ", client_IP)
| summarize dcount(cloud_RoleInstance) by deployment, bin(timestamp, 5m)
| render areachart

This query gave me this picture:

There are two regions this application is deployed to, hence two general areas: 5 instances and 3 instances. You can also see the spikes when deployments were happening, and you can notice that the staging slot doesn’t last long; the spike is very short. It turns out the staging computers are shut down as part of the release procedure. Typically you would see a scaled-down number of staging computers running all the time to speed up rollback when it’s needed.

Let’s zoom into the single deployment:

let fromDate = datetime(2017-01-18 21:50:00z);
let toDate = datetime(2017-01-18 22:15:00z);
performanceCounters
| where timestamp >= fromDate
| where timestamp <= toDate
| where name == "Requests/Sec" 
| extend deployment = strcat(application_Version, " ", client_IP)
| summarize dcount(cloud_RoleInstance) by deployment, bin(timestamp, 1m)
| render areachart  

The result is quite interesting:

You can see the new version of the application deployed into the staging environment in one region and running for ~10 minutes. The same version was deployed in the staging of a different region for a much shorter time. It seems that the production traffic started the application initialization after the VIP swap, which, by the way, is typically a bad practice: at least some smoke tests need to be run against the staging slot to validate the configuration.

Dig deeper

Analyzing the picture is not easy. Let’s modify the query to print out every deployment, scale-up, and scale-down. Basically, we need to query for every time interval where the previous interval had a different number of role instances reporting the same application version.

Here is a query that returns the number of instances per minute:

let query = (_fromDate:datetime, _toDate:datetime) 
{ 
performanceCounters
| where timestamp >= _fromDate
| where timestamp <= _toDate
| where name == "Requests/Sec" 
| summarize num_instances = dcount(cloud_RoleInstance) 
    by application_Version, client_IP, bin(timestamp, 1m) };

You can call this function as query(fromDate, toDate). Now let’s join it with the same results shifted one minute back:

let fromDate = datetime(2017-01-18 21:50:00z);
let toDate = datetime(2017-01-18 22:15:00z);
let query = (_fromDate:datetime, _toDate:datetime) 
{ 
  performanceCounters
    | where timestamp >= _fromDate
    | where timestamp <= _toDate
    | where name == "Requests/Sec" 
    | summarize num_instances = dcount(cloud_RoleInstance) 
        by application_Version, client_IP, bin(timestamp, 1m) 
};
query(fromDate, toDate) | extend ttt = timestamp | join kind=leftouter 
(
  query(fromDate - 1m, toDate + 1m) | extend ttt = timestamp + 1m
) on ttt, application_Version, client_IP

Note the use of the leftouter join in the query. The only thing left is to filter the results and make them more human-readable:

let fromDate = datetime(2017-01-18 21:50:00z);
let toDate = datetime(2017-01-18 22:15:00z);
let query = (_fromDate:datetime, _toDate:datetime) 
{ 
performanceCounters
| where timestamp >= _fromDate
| where timestamp <= _toDate
| where name == "Requests/Sec" 
| summarize num_instances = dcount(cloud_RoleInstance) by application_Version, client_IP, bin(timestamp, 1m) };
query(fromDate, toDate) | extend ttt = timestamp | join kind=leftouter (
query(fromDate - 1m, toDate + 1m) | extend ttt = timestamp + 1m
) on ttt, application_Version, client_IP
| project timestamp, before = num_instances1, after = num_instances, application_Version, client_IP
| where after != before
| extend name = 
  strcat( 
      iff(isnull(before), "Deployment", iff(after > before, "Scale Up", "Scale Down")),
      " in ",
      iff(client_IP == "52.175.18.0" or client_IP == "13.77.108.0", "Production", "Staging")
  )
| order by timestamp 

The resulting table will look like this:

| timestamp | before | after | application_Version | client_IP | name |
| --- | --- | --- | --- | --- | --- |
| 2017-01-18T21:54:00Z | null | 2 | vstfs:///Build/Build/3562348 | 13.77.107.0 | Deployment in Staging |
| 2017-01-18T21:59:00Z | 2 | 3 | vstfs:///Build/Build/3562348 | 13.77.107.0 | Scale Up in Staging |
| 2017-01-18T22:06:00Z | 3 | 2 | vstfs:///Build/Build/3555787 | 52.175.18.0 | Scale Down in Production |
| 2017-01-18T22:07:00Z | 2 | 3 | vstfs:///Build/Build/3555787 | 52.175.18.0 | Scale Up in Production |
| 2017-01-18T22:07:00Z | 5 | 1 | vstfs:///Build/Build/3555787 | 13.77.108.0 | Scale Down in Production |
| 2017-01-18T22:07:00Z | null | 3 | vstfs:///Build/Build/3555787 | 13.77.107.0 | Deployment in Staging |
| 2017-01-18T22:08:00Z | null | 3 | vstfs:///Build/Build/3562348 | 13.77.108.0 | Deployment in Production |
| 2017-01-18T22:09:00Z | 3 | 5 | vstfs:///Build/Build/3562348 | 13.77.108.0 | Scale Up in Production |
| 2017-01-18T22:09:00Z | 3 | 2 | vstfs:///Build/Build/3555787 | 52.175.18.0 | Scale Down in Production |
| 2017-01-18T22:09:00Z | null | 1 | vstfs:///Build/Build/3555787 | 168.63.221.0 | Deployment in Staging |
| 2017-01-18T22:10:00Z | null | 3 | vstfs:///Build/Build/3562348 | 52.175.18.0 | Deployment in Production |

Using ad hoc analytical queries I found that deployments of this service can be improved: smoke tests should be added for the staging deployment, and staging machines should keep running for some time after deployment in case you need to VIP-swap back.

Automatically detecting deployments, scale-ups, and scale-downs may be useful in other scenarios. You may want to notify the service owner by writing a connector for your favorite chat platform. Or you can list the latest deployments to production and staging to know what was deployed and when. You can even report those deployments back to Application Insights as release annotations to see markers on charts. With the power of analytical queries in Application Insights it is easy to automate any of these scenarios.

Alerting Over Analytics Queries

This is a DIY post on how you can use availability tests and the Data Access API together to enable some of the most popular requests on the Application Insights UserVoice.

The Application Insights UserVoice has four very popular items. It is not hard to implement them yourself using Application Insights extensibility points.

Let’s start with an alert on a segmented metric. Say I want to receive an alert when nobody opens any posts on this site. Posts differ from the default and about pages by the /blog/ substring in the URL. You can go to Application Insights Analytics and write a query like this to get the number of viewed posts:

pageViews
| where timestamp > ago(10min)
| where timestamp < ago(5min)
| where url contains "/blog/" 
| summarize sum(itemCount)

Note also that I’m querying 5 minutes in the past to allow some time for data to arrive. Typical telemetry latency is under a minute; I’m being on the safe side here.

In order to convert this query into a Pass/Fail statement I can do something like this:

pageViews
| where timestamp > ago(10min)
| where timestamp < ago(5min)
| where url contains "/blog/" 
| summarize isPassed = (sum(itemCount) > 1)
| project iff(isPassed, "PASSED", "FAILED")

This query will return a single value PASSED or FAILED.

Now I can go to the query API explorer at dev.applicationinsights.io and enter the application ID, the API key, and the query. You will get a URL like this:

GET /beta/apps/cbf775c7-b52e-4533-8673-bd6fbd7ab04a/query?query=pageViews%7C%20where%20timestamp%20%3E%20ago(10min)%7C%20where%20timestamp%20%3C%20ago(5min)%7C%20where%20url%20contains%20%22%2Fblog%2F%22%20%7C%20summarize%20isPassed%20%3D%20(sum(itemCount)%20%3E%201)%7C%20project%20iff(isPassed%2C%20%22PASSED%22%2C%20%22FAILED%22) HTTP/1.1
Host: api.applicationinsights.io
x-api-key: 8083guxbvatm4bq7kruraw8p8oyj7yd2i2s4exnr

Instead of a header you can pass the API key as a query string parameter; use the parameter name api_key. The resulting URL will look like this:

https://api.applicationinsights.io/beta/apps/cbf775c7-b52e-4533-8673-bd6fbd7ab04a/query
?query=pageViews%7C%20where%20timestamp%20%3E%20ago(10min)%7C%20where%20timestamp%20%3C%20ago(5min)%7C%20where%20url%20contains%20%22%2Fblog%2F%22%20%7C%20summarize%20isPassed%20%3D%20(sum(itemCount)%20%3E%201)%7C%20project%20iff(isPassed%2C%20%22PASSED%22%2C%20%22FAILED%22)
&api_key=8083guxbvatm4bq7kruraw8p8oyj7yd2i2s4exnr
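If you build this URL from code rather than the explorer, the query has to be URL-encoded. A minimal C# sketch; the application ID and API key are the sample values from above, and the query here is just a placeholder:

```csharp
using System;

class BuildQueryUrl
{
    static void Main()
    {
        // Sample values from above - replace with your own.
        var appId = "cbf775c7-b52e-4533-8673-bd6fbd7ab04a";
        var apiKey = "8083guxbvatm4bq7kruraw8p8oyj7yd2i2s4exnr";

        // Any Analytics query; this one just counts page views.
        var query = "pageViews | where timestamp > ago(1h) | summarize sum(itemCount)";

        // Uri.EscapeDataString percent-encodes the query so it is safe
        // to place in the query string.
        var url = "https://api.applicationinsights.io/beta/apps/" + appId
            + "/query?query=" + Uri.EscapeDataString(query)
            + "&api_key=" + apiKey;

        Console.WriteLine(url);
    }
}
```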

The final step is to set up a ping test that queries this URL, with a content match success criterion searching for the keyword PASSED.

You can change the queries to satisfy other requests. You can query customEvents by name the same way I queried pageViews by url. You can set an alert for when CPU is very high on at least one instance, instead of the standard average across all instances:

performanceCounters
| where timestamp > ago(10min) and timestamp < ago(5min)
| where category == "Process" and counter == "% Processor Time"
| summarize cpu_per_instance = avg(value) by cloud_RoleInstance
| summarize isPassed = (max(cpu_per_instance) < 80)
| project iff(isPassed, "PASSED", "FAILED")

You can also join multiple metrics or tables:

exceptions
| where timestamp > ago(10min) and timestamp < ago(5min)
| summarize exceptionsCount = sum(itemCount) | extend t = "" | join
(requests 
| where timestamp > ago(10min) and timestamp < ago(5min)
| summarize requestsCount = sum(itemCount) | extend t = "") on t
| project isPassed = 1.0 * exceptionsCount / requestsCount < 0.5
| project iff(isPassed, "PASSED", "FAILED")

Some thoughts about this implementation:

  • Availability tests run once every 5 minutes from a single location. With 5 locations the analytics query will run roughly every minute.
  • The limit on the number of analytics queries is 1500 per day. That allows a single ping test running once a minute, or more tests running less frequently.
  • If the query is too long you may need to use POST instead of GET. You can implement POST as a multi-step test, but multi-step tests cost money, so you may be better off implementing a simple proxy that runs the queries, the same way I set up certificate expiration monitoring.

Update to the Last Post - Set the Name in MVC Web API

Answering the question in this comment: how to set the name of the request with attribute-based MVC Web API routing. It can be done as an extension of the previous post; something like this would work:

public class ApplicationInsightsCorrelationHttpActionFilter : System.Web.Http.Filters.ActionFilterAttribute, ITelemetryInitializer
{
    private static AsyncLocal<RequestTelemetry> currentRequestTelemetry = new AsyncLocal<RequestTelemetry>();

    public override Task OnActionExecutingAsync(HttpActionContext actionContext, CancellationToken cancellationToken)
    {
        var template = actionContext.RequestContext.RouteData.Route.RouteTemplate;

        var request = System.Web.HttpContext.Current.GetRequestTelemetry();
        request.Name = template;
        request.Context.Operation.Name = request.Name;

        currentRequestTelemetry.Value = request;

        return base.OnActionExecutingAsync(actionContext, cancellationToken);
    }
}

Update: a more complete version of this filter is posted by @snboisen on GitHub.

This is an action filter for Web API. At the beginning of action execution the name can be taken from the route data.

An action filter will not run when execution doesn’t reach the controller, so you may need to duplicate the logic in the telemetry initializer’s Initialize method itself. However, in that case you’d need to get the currently executing request, and it may not always be available.
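A minimal sketch of that fallback inside Initialize, assuming the same GetRequestTelemetry extension used above; the null checks matter because telemetry may be emitted outside any request:

```csharp
public void Initialize(ITelemetry telemetry)
{
    // Fallback for cases where the action filter never ran
    // (for example, a 404 on a bad route).
    var context = System.Web.HttpContext.Current;
    if (context == null)
    {
        return; // telemetry emitted outside of a request
    }

    var request = context.GetRequestTelemetry();
    if (request == null || string.IsNullOrEmpty(request.Name))
    {
        return; // no request telemetry available to copy the name from
    }

    if (string.IsNullOrEmpty(telemetry.Context.Operation.Name))
    {
        telemetry.Context.Operation.Name = request.Name;
    }
}
```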

Manual Correlation in ASP.NET MVC Apps

I already wrote that correlation does not work well in ASP.NET MVC applications. Here is how you can fix it manually.

Assuming you are using the Microsoft.ApplicationInsights.Web NuGet package, you will have access to the RequestTelemetry stored in HttpContext.Current. You can store it in an AsyncLocal (for .NET 4.5 you can use CallContext) so it will be available to all telemetry, sync and async, produced inside the action.

This is an example implementation that uses the same class as both an action filter and a telemetry initializer.

namespace ApmTips
{
    public class ApplicationInsightsCorrelationActionFilter : ActionFilterAttribute, ITelemetryInitializer
    {
        private static AsyncLocal<RequestTelemetry> currentRequestTelemetry = new AsyncLocal<RequestTelemetry>();

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            var request = HttpContext.Current.GetRequestTelemetry();
            currentRequestTelemetry.Value = request;

            base.OnActionExecuting(filterContext);
        }

        public override void OnActionExecuted(ActionExecutedContext filterContext)
        {
            currentRequestTelemetry.Value = null;

            base.OnActionExecuted(filterContext);
        }

        public override void OnResultExecuting(ResultExecutingContext filterContext)
        {
            var request = HttpContext.Current.GetRequestTelemetry();
            currentRequestTelemetry.Value = request;

            base.OnResultExecuting(filterContext);
        }

        public override void OnResultExecuted(ResultExecutedContext filterContext)
        {
            currentRequestTelemetry.Value = null;

            base.OnResultExecuted(filterContext);
        }

        public void Initialize(ITelemetry telemetry)
        {
            var request = currentRequestTelemetry.Value;

            if (request == null)
                return;

            if (string.IsNullOrEmpty(telemetry.Context.Operation.Id) && !string.IsNullOrEmpty(request.Context.Operation.Id))
            {
                telemetry.Context.Operation.Id = request.Context.Operation.Id;
            }

            if (string.IsNullOrEmpty(telemetry.Context.Operation.ParentId) && !string.IsNullOrEmpty(request.Id))
            {
                telemetry.Context.Operation.ParentId = request.Id;
            }

            if (string.IsNullOrEmpty(telemetry.Context.Operation.Name) && !string.IsNullOrEmpty(request.Name))
            {
                telemetry.Context.Operation.Name = request.Name;
            }

            if (string.IsNullOrEmpty(telemetry.Context.User.Id) && !string.IsNullOrEmpty(request.Context.User.Id))
            {
                telemetry.Context.User.Id = request.Context.User.Id;
            }

            if (string.IsNullOrEmpty(telemetry.Context.Session.Id) && !string.IsNullOrEmpty(request.Context.Session.Id))
            {
                telemetry.Context.Session.Id = request.Context.Session.Id;
            }
        }
    }
}

Here is how you’d register it in Global.asax.cs:

var filter = new ApplicationInsightsCorrelationActionFilter();
GlobalFilters.Filters.Add(filter);
TelemetryConfiguration.Active.TelemetryInitializers.Add(filter);

You can always use one of the community-supported MVC monitoring NuGet packages, which do similar things to enable this correlation.

Build Information in Different Environments

| Comments

I wrote before about the automatic telemetry versioning you can implement for ASP.NET apps. With a single-line change in the project file you can generate the BuildInfo.config file. This file contains basic build information, including the build id.

Note that when you build an application locally, BuildInfo.config will be generated under the bin/ folder and will have an AutoGen_<GUID> build id. With the new VSTS build infrastructure, the same AutoGen_ id appears in production builds as well.

The reason is that the VSTS build infrastructure defined new build property names. Specifically, BuildUri was renamed to Build.BuildUri. Here is the list of all predefined variables in VSTS builds. So the fix for BuildInfo.config generation is easy:

<BuildUri Condition="$(BuildUri) == ''">$(Build_BuildUri)</BuildUri>
<GenerateBuildInfoConfigFile>true</GenerateBuildInfoConfigFile>

You can review the file C:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v14.0\BuildInfo\Microsoft.VisualStudio.ReleaseManagement.BuildInfo.targets for other properties that got broken. For instance, you may want to fix BuildLabel as well. The fix above will make BuildLabel use the BuildId:

<BuildLabel kind="label">vstfs:///Build/Build/3497900</BuildLabel>
<BuildId kind="id">vstfs:///Build/Build/3497900</BuildId>

instead of

build id: 3497900
build name: 20161214.1

You can use the same trick for Azure Web Apps. When you set up continuous integration for an Azure Web App from GitHub, Kudu will download the sources and build them locally. Every deployment is identified by a commit ID, so you can set the BuildId as I did in this commit in Glimpse.ApplicationInsights:

<BuildId Condition="$(BuildId) == ''">$(SCM_COMMIT_ID)</BuildId>
<GenerateBuildInfoConfigFile>true</GenerateBuildInfoConfigFile>

Once implemented, I can see the deployment id as an application version in Glimpse:

You can also filter by it in the Azure portal:

Using this deployment id you can query deployment information via the link https://%WEBSITE_HOSTNAME%.scm.azurewebsites.net/api/deployments/<deployment id> to see something like this:

{
  "id": "7e5aeb37764b195a721d193be2b3ab8601276ef4",
  "status": 4,
  "status_text": "",
  "author_email": "SergKanz@microsoft.com",
  "author": "Sergey Kanzhelev",
  "deployer": "GitHub",
  "message": "commit ID\n",
  "progress": "",
  "received_time": "2016-12-14T21:59:50.8705503Z",
  "start_time": "2016-12-14T21:59:51.0919654Z",
  "end_time": "2016-12-14T22:05:29.940095Z",
  "last_success_end_time": "2016-12-14T22:05:29.940095Z",
  "complete": true,
  "active": true,
  "is_temp": false,
  "is_readonly": false,
  "url": "https://ai-glimpse-web-play-develop.scm.azurewebsites.net/api/deployments/7e5aeb37764b195a721d193be2b3ab8601276ef4",
  "log_url": "https://ai-glimpse-web-play-develop.scm.azurewebsites.net/api/deployments/7e5aeb37764b195a721d193be2b3ab8601276ef4/log",
  "site_name": "ai-glimpse-web-play"
}

As mentioned in this issue, you may also override BuildId for other platforms. For AppVeyor it seems this property will work: APPVEYOR_BUILD_VERSION.
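Following the pattern of the Kudu fix above, the AppVeyor override would presumably look like this (a sketch; I haven't verified it on AppVeyor itself):

```xml
<BuildId Condition="$(BuildId) == ''">$(APPVEYOR_BUILD_VERSION)</BuildId>
<GenerateBuildInfoConfigFile>true</GenerateBuildInfoConfigFile>
```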

Request Success and Response Code

| Comments

Application Insights monitors web application requests. This article explains the difference between two fields representing the request: success and responseCode.

There are many ways to use an application monitoring tool. You can use it for a daily status check, bug triage, or deep diagnostics. For the daily status check you want to know quickly whether anything unusual is going on. The commonly used chart is the number of failed requests. When this number is higher than yesterday, it is time for triage and deep diagnostics. You want to know how exactly these requests failed.

For web applications, Application Insights defines a request as successful when the response code is less than 400 or equal to 401. Quite straightforward. So why are there two fields being sent, responseCode and success? Wouldn't it be easier to map the response code to the success status on the backend?

Response code 401 is marked as "successful" as it is part of a normal authentication handshake. Marking it as "failed" could trigger an alert in the middle of the night when people on a different continent have just come to work and are logging in to the application. However, this logic is oversimplified. You probably want to get notified when all these people who just came to work cannot log in to the application because of some recent code change. Those 401 responses would be legitimate "failures".

So you may want to override the default success = true value for the 401 response code when the authentication has actually failed.

There are other cases when the response code does not map directly to request success.

Response code 404 may indicate "no records", which can be part of the regular flow. It may also indicate a broken link. For broken links you can even implement logic that marks them as failures only when those links are located on the same web page (by analyzing urlReferrer) or accessed from the company's mobile application. Similarly, 301 and 302 indicate a failure when accessed from a client that doesn't support redirects.

Partially accepted content 206 may indicate a failure of the overall request. For instance, the Application Insights endpoint allows sending a batch of telemetry items as a single request. It will return 206 when some items sent in the request were not processed successfully. An increasing rate of 206 indicates a problem that needs to be investigated. Similar logic applies to 207 Multi-Status, where the success may be the worst of the separate response codes.
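As a sketch, the same telemetry initializer pattern shown below can surface 206 replies from downstream calls as failures. Treating every 206 as a failure is an assumption; it only makes sense for dependencies, like the Application Insights endpoint, where 206 means partial rejection:

```csharp
public class SetFailedForPartialContent : ITelemetryInitializer
{
    public void Initialize(ITelemetry telemetry)
    {
        // A 206 from a dependency such as the Application Insights endpoint
        // means some items in the batch were rejected - count it as failed
        var dependency = telemetry as DependencyTelemetry;
        if (dependency != null && dependency.ResultCode == "206")
        {
            dependency.Success = false;
        }
    }
}
```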

You may want to set success = false for 200 responses representing an error page.

And definitely set success = true for 418 I'm a teapot (RFC 2324), as a request for coffee should never fail.

Here is how you can set the success flag for the request telemetry.

Implement telemetry initializer

You can write a simple telemetry initializer that overrides the default behavior:

public class SetFailedFor401 : ITelemetryInitializer
{
    public void Initialize(ITelemetry telemetry)
    {
        if (telemetry is RequestTelemetry)
        {
            var r = (RequestTelemetry)telemetry;

            if (r.ResponseCode == "401")
            {
                r.Success = false;
            }
        }
    }
}

You can make a telemetry initializer configurable. This telemetry initializer will set success to true for 404 requests referred from external sites:

public class SetSuccesFor404FromExternalSite : ITelemetryInitializer
{
    public string ApplicationHost { get; set; }

    public void Initialize(ITelemetry telemetry)
    {
        if (telemetry is RequestTelemetry)
        {
            var r = (RequestTelemetry)telemetry;

            // HttpContext.Current may be null for telemetry reported
            // outside of a request thread
            var context = HttpContext.Current;

            if (r.ResponseCode == "404" &&
                context != null &&
                context.Request.UrlReferrer != null &&
                !context.Request.UrlReferrer.Host.Contains(this.ApplicationHost))
            {
                r.Success = true;
            }
        }
    }
}

You’d need to configure it like this:

<Add Type="SetSuccesFor404FromExternalSite, WebApplication1" >
    <ApplicationHost>apmtips.com</ApplicationHost>
</Add>

From Code

From anywhere in the code you can set the success status of the request. This value will not be overridden by the standard request telemetry collection code.

if (returnEmptyCollection)
{
    HttpContext.Current.GetRequestTelemetry().Success = true;
}