Analyzing Failures with Build Analysis and Known Issues

Triaging errors seen in CI

Summary

Passing Build Analysis is required to merge into the runtime repo.

To resolve failures, do the following, in order:

  1. Fix the problem if your PR is the cause.
  2. For all failures not in the “Known test errors” section, try to file a Known Build Error issue.
  3. If all else fails, perform a manual bypass.

Details

When validation fails, any PR in the runtime repo will have a failed GitHub check - PR Build Analysis - which summarizes all failures, including a list of matching known issues as well as any regressions introduced to the build or the tests. This tab should be your first stop for analyzing the PR failures.

Build analysis check

This check tries to bubble up as much useful information as possible about all failures for any given PR and the pipelines it runs. It tracks both build and test failures and provides quick links to the build/test legs, the logs, and other supplemental information that Azure DevOps may provide. The idea is to minimize the number of links to follow and to surface well-known issues that have already been identified. It also adds a link to the Helix Artifacts tab of a failed test, as it often contains more detailed logs of the execution or a dump that was collected at fault time.

Validation may fail for several reasons, and for each one we have a different recommended action:

Option 1: You have a defect in your PR

Option 3: The state of the main branch HEAD is bad.

Additional information:

What to do if you determine the failure is unrelated

An issue that has not been reported before will look like this in the Build Analysis check tab:

failed test

You can use the console log, any potential attached dumps in the artifacts section, or any other piece of information printed to help you decide if it’s a regression caused by the change. Similarly, for runtime tests we will try to print the crashing stacks to aid in the investigation.

If you have considered all the diagnostic artifacts and determined the failure is definitely not caused by changes in your PR, please do this:

  1. Identify a string from the logs that uniquely identifies the issue at hand. A good example of this is the string The system cannot open the device or file specified. : 'NuGet-Migrations' for issue https://github.com/dotnet/runtime/issues/80619.
  2. On the test failure in the tab you can select Report repository issue. This will prepopulate an issue with the appropriate tags and with a body similar to:
     Build Information
     Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=242380
     Build error leg or test failing: Build / linux-arm64 Release AllSubsets_Mono_Minijit_RuntimeTests minijit / Build Tests
     Pull request: https://github.com/dotnet/runtime/pull/84716
     <!-- Error message template  -->
     ## Error Message
     Fill the error message using [known issues guidance](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssues.md#how-to-fill-out-a-known-issue-error-section).
    
     ```json
     {
         "ErrorMessage": "",
         "BuildRetry": false,
         "ErrorPattern": "",
         "ExcludeConsoleLog": false
     }
     ```
    

    It already contains most of the essential information, but it is very important that you fill out the JSON blob.

    • You can now use the Build Analysis Known Issue Helper to create an issue. It helps add the right set of labels, fills in the necessary paths in the JSON blob, and validates that the blob matches the text presented for the issue found in the logs.
    • You can put the string that uniquely identifies the issue into the ErrorMessage field. If you need a regex, use the ErrorPattern field instead. It is limited to a single-line, non-backtracking regex as described here. The regex also needs to be appropriately escaped. Check the arcade known issues documentation for a good guide on proper regex and JSON escaping.
    • The ExcludeConsoleLog field controls whether the execution (console) logs are excluded from matching, leaving only the individual test results. For most cases, set it to true, as the failure will happen within a single test. Setting it to false means all failures within an xUnit set of tests will also be attributed to this particular error, since there is one log describing all the problems. Due to limitations in Known Issues around rate limiting and xUnit resiliency, setting ExcludeConsoleLog=false is necessary in two scenarios:
      • Nested tests as reported to Azure DevOps. Essentially this means theory failures, which appear as nested results when reported in Azure DevOps (screenshot: xUnit theory seen in Azure DevOps). Adding support for this requires too many API calls, so using the console log here is necessary.
      • Native crashes in libraries. The crash corrupts the test results that would be reported to Azure DevOps, so only the console logs are left.
    • Optionally, add specifics as needed, such as the leg, configuration parameters, or available dump links.
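
Regex and JSON escaping are easy to get wrong by hand. The following is a minimal sketch in Python, assuming you prefer to build the blob programmatically: `re.escape` handles regex metacharacters and `json.dumps` handles JSON escaping. The log line is hypothetical, and Python's `re` module is only an approximation of the regex engine used by Known Issues, so treat the local match as a sanity check rather than a guarantee.

```python
import json
import re

# Hypothetical log line, used only for illustration.
log_line = "Assertion failed: value[0] == expected (attempt 3)"

# Escape regex metacharacters in the fixed parts, then generalize the part that varies.
pattern = (
    re.escape("Assertion failed: value[0] == expected (attempt ")
    + r"\d+"
    + re.escape(")")
)

# Same fields as the issue template; only ErrorPattern is filled in here.
blob = {
    "ErrorMessage": "",
    "BuildRetry": False,
    "ErrorPattern": pattern,
    "ExcludeConsoleLog": False,
}

# json.dumps applies the JSON-level escaping (each backslash is doubled).
blob_text = json.dumps(blob, indent=4)
print(blob_text)

# Sanity check: the pattern survives a JSON round trip and still matches the line.
round_tripped = json.loads(blob_text)["ErrorPattern"]
matches_sample = re.search(round_tripped, log_line) is not None
```

The printed text can be pasted into the issue body in place of the empty template blob. The Known Issue Helper mentioned above performs its own validation against the real logs, so this sketch does not replace it.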

Once the issue is open, rerun the Build Analysis check. If everything was filed correctly, the issue should be recognized as known, and you are ready to merge once all unrelated issues are marked as known. However, there are some known limitations to the system, as previously described. Additionally, the system only looks at the error message and the stack trace fields of an Azure DevOps test result, and at the console log in the Helix queue.
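
To make the scope of matching concrete, here is a simplified model in Python. It is not the actual Build Analysis implementation: the function name, the dictionary keys of the test result, and the sample strings are all hypothetical, and plain substring matching is assumed for `ErrorMessage`. Only the `ErrorMessage` and `ExcludeConsoleLog` field names come from the issue template.

```python
def matches_known_issue(issue, test_result, console_log):
    """Return True if the issue's ErrorMessage appears in any inspected text."""
    needle = issue.get("ErrorMessage", "")
    if not needle:
        return False  # an empty message should never match everything

    # Only these sources are inspected: the error message and stack trace
    # of the test result, plus the console log unless it is excluded.
    haystacks = [
        test_result.get("error_message", ""),
        test_result.get("stack_trace", ""),
    ]
    if not issue.get("ExcludeConsoleLog", False):
        haystacks.append(console_log)

    return any(needle in text for text in haystacks)


# Hypothetical native crash: the test result fields are empty,
# so the signature is only visible in the console log.
crashed_result = {"error_message": "", "stack_trace": ""}
console = "... hypothetical fatal error marker ..."

with_console = {"ErrorMessage": "hypothetical fatal error marker", "ExcludeConsoleLog": False}
without_console = {"ErrorMessage": "hypothetical fatal error marker", "ExcludeConsoleLog": True}

print(matches_known_issue(with_console, crashed_result, console))
print(matches_known_issue(without_console, crashed_result, console))
```

In this model the same issue matches the crash only when the console log is included, which is why a string that appears nowhere in those three sources cannot be picked up no matter how the issue is filed.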

The Build Analysis requests are sent to a queue. In certain scenarios, this queue can have many items to process and it can take a while for the status to be updated. If you do not see the status getting updated, be patient and wait at least 10 minutes before investigating further.

If rerunning the check doesn’t pick up the known issue and you feel it should, tag @dotnet/runtime-infrastructure to ask the infrastructure team for help.

After you do this, if the failure is occurring frequently as per the data captured in the recently opened issue, please disable the failing test(s) with the corresponding tracking issue link in a follow-up Pull Request.

There are plenty of intermittent failures that won’t manifest again on a retry. Therefore these steps should be followed for every iteration of the PR build, e.g. before retrying/rebuilding.

Bypassing build analysis

To unconditionally bypass the build analysis check (turn it green), you can add a comment to your PR with the following text:

/ba-g <reason>

The Build Analysis requests are sent to a queue. In certain scenarios, this queue can have many items to process and it can take a while for the status to be updated. If you do not see the status getting updated, be patient and wait at least 10 minutes before investigating further.

For more information, see https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/EscapeMechanismforBuildAnalysis.md

Examples of Build Analysis

Good usage examples

{
  "ErrorPattern": "The system cannot open the device or file specified. : (&#39;|')NuGet-Migrations(&#39;|')",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

This is a case where the issue is tied to the machine the work item lands on. Everything would fail in that test group, so setting ExcludeConsoleLog to false isn’t harmful, and the string is specific to the issue. Proper usage provides useful insight, such as an accurate count of the impact of the issue, without blocking other devs:

issue impact with data for investigation
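
As a quick local check of the pattern above, the sketch below loads the same JSON in Python and tries it against a few lines. The plain-quote line is the string quoted earlier in this document; the HTML-encoded line and the unrelated line are hypothetical, added on the assumption that the `(&#39;|')` alternation exists to cover logs that surface the quote in encoded form. Python's `re` stands in for the real matcher here.

```python
import json
import re

known_issue = json.loads(r"""
{
  "ErrorPattern": "The system cannot open the device or file specified. : (&#39;|')NuGet-Migrations(&#39;|')",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
""")

pattern = re.compile(known_issue["ErrorPattern"])

# Same message with a literal quote and with an HTML-encoded quote (the latter is assumed).
plain = "The system cannot open the device or file specified. : 'NuGet-Migrations'"
encoded = "The system cannot open the device or file specified. : &#39;NuGet-Migrations&#39;"
# Hypothetical unrelated failure that should not be attributed to this issue.
unrelated = "The system cannot find the file specified."

for line in (plain, encoded, unrelated):
    print(bool(pattern.search(line)), line)
```

The first two lines match and the third does not, which is the property that makes the string specific enough to use with the console log included.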

Bad usage examples