[8.0] Update CI OSes #115502

Open
richlander wants to merge 14 commits into base: release/8.0-staging

Conversation

richlander
Member

@richlander richlander commented May 13, 2025

@Copilot Copilot AI review requested due to automatic review settings May 13, 2025 00:30
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR updates the CI pipeline configuration for Helix queues by modifying the OS image definitions used in various Linux job conditions.

  • Added new entries for AzureLinux.3.0.Amd64.Open
  • Adjusted OS image selections for different conditional branches
  • Reordered some entries, including reintroducing the Centos.9.Amd64.Open image in one branch
Comments suppressed due to low confidence (2)

eng/pipelines/libraries/helix-queues-setup.yml:62

  • The OS image tag in this line uses 'open' in lowercase, while other similar entries use 'Open'. Standardize the casing to ensure consistency.
- (AzureLinux.3.0.Amd64.Open)Ubuntu.2204.Amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:azurelinux-3.0-helix-amd64

eng/pipelines/libraries/helix-queues-setup.yml:71

  • The casing of 'open' in the OS image tag does not match the uppercase pattern seen in other entries. Consider using 'Open' to maintain consistency.
- (AzureLinux.3.0.Amd64.Open)Ubuntu.2204.Amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:azurelinux-3.0-helix-amd64

@richlander
Member Author

/azp run runtime-libraries-coreclr outerloop-linux


Azure Pipelines successfully started running 1 pipeline(s).

@richlander
Member Author

/azp run runtime-libraries-coreclr outerloop-linux


Azure Pipelines successfully started running 1 pipeline(s).

@richlander
Member Author

These failures all look existing.

@carlossanlop

Member

@mthalman mthalman left a comment

@richlander
Member Author

/azp run runtime-libraries-coreclr outerloop-linux


Azure Pipelines successfully started running 1 pipeline(s).

@richlander
Member Author

/azp run runtime-extra-platforms


Azure Pipelines successfully started running 1 pipeline(s).

@richlander
Member Author

/azp run runtime-extra-platforms


Azure Pipelines successfully started running 1 pipeline(s).

@richlander
Member Author

richlander commented Jul 2, 2025

Linux-x64 extra platforms has this failure. It doesn't seem to be in the rolling build. I'll run it again to see if it is just flakiness, since the error is a timeout. Everything else looks like failures that also appear in other branches or across multiple OSes.

Note: Azure Linux 3 is/was tested in extra platforms before and after this PR. However, it transitioned from being container-based to VM-based (which we also did in main). Perhaps there is an issue there?

Libraries Test Run release coreclr linux x64 Release

Console log: 'System.Net.Sockets.Tests' from job d391c3db-4624-42f9-ad04-d66836807c35 workitem e364940e-0f90-4302-82c3-e866a22447ee (azurelinux.3.amd64.open.svc) executed on machine a0004S4 running Linux-6.6.92.2-1.azl3-x86_64-with-glibc2.38
+ ./RunTests.sh --runtime-path /datadisks/disk1/work/A9B908EB/p
----- start Tue Jul 1 09:09:24 PM UTC 2025 =============== To repro directly: =====================================================
pushd .
/datadisks/disk1/work/A9B908EB/p/dotnet exec --runtimeconfig System.Net.Sockets.Tests.runtimeconfig.json --depsfile System.Net.Sockets.Tests.deps.json xunit.console.dll System.Net.Sockets.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/datadisks/disk1/work/A9B908EB/w/9D8108B8/e /datadisks/disk1/work/A9B908EB/w/9D8108B8/e
  Discovering: System.Net.Sockets.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Sockets.Tests (found 1447 of 1812 test cases)
  Starting:    System.Net.Sockets.Tests (parallel test collections = on, max threads = 2)
    System.Net.Sockets.Tests.CreateSocket.Ctor_Raw_Supported_Success [SKIP]
      Condition(s) not met: "SupportsRawSockets"
    System.Net.Sockets.Tests.SocketOptionNameTest.MulticastInterface_Set_AnyInterface_Succeeds [FAIL]
      System.TimeoutException : The operation has timed out.
      Stack Trace:
        /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/SocketOptionNameTest.cs(106,0): at System.Net.Sockets.Tests.SocketOptionNameTest.MulticastInterface_Set_Helper(Int32 interfaceIndex)
        /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/SocketOptionNameTest.cs(72,0): at System.Net.Sockets.Tests.SocketOptionNameTest.MulticastInterface_Set_AnyInterface_Succeeds()
        --- End of stack trace from previous location ---
    System.Net.Sockets.Tests.SocketOptionNameTest.MulticastInterface_Set_IPv6_AnyInterface_Succeeds [FAIL]
      System.TimeoutException : The operation has timed out.
      Stack Trace:
        /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/SocketOptionNameTest.cs(213,0): at System.Net.Sockets.Tests.SocketOptionNameTest.MulticastInterface_Set_IPv6_Helper(Int32 interfaceIndex)
        /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/SocketOptionNameTest.cs(136,0): at System.Net.Sockets.Tests.SocketOptionNameTest.MulticastInterface_Set_IPv6_AnyInterface_Succeeds()
        --- End of stack trace from previous location ---
  Finished:    System.Net.Sockets.Tests
=== TEST EXECUTION SUMMARY ===
   System.Net.Sockets.Tests  Total: 2302, Errors: 0, Failed: 2, Skipped: 1, Time: 63.735s

@dotnet/ncl

@richlander
Member Author

/azp run runtime-extra-platforms


Azure Pipelines successfully started running 1 pipeline(s).

@richlander
Member Author

@wfurt
Member

wfurt commented Jul 2, 2025

It may be the OS configuration. Multicast is not that common. We could investigate the environment differences, make the test conditional, or both.

@richlander
Member Author

Except it seems to be passing on Azure Linux 3 already.

This is from yesterday's rolling run, including Azure Linux 3 (just in container not VM).

https://dev.azure.com/dnceng-public/public/_build/results?buildId=1082706&view=logs&j=52ad279b-b059-569b-870a-4d7b21a81589&t=2ee877e9-a459-5e35-8acf-a38e5009bee3

@wfurt
Member

wfurt commented Jul 2, 2025

Right. But there is a single kernel where routing and configuration happen, and it differs between Docker and a VM. For example, AZLinux in Docker does not use the AZLinux kernel, and the configuration is likely different as well. I don't think that would be difficult to fix, but fundamentally they are two different environments.

@richlander
Member Author

Got it. Yes of course. So, this is the first time these tests are seeing the AL3 kernel, which is indeed the goal. I just double checked ... all the containers are (prior to this change) using the Ubuntu 22.04 VM/kernel.
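
To make the container-versus-VM point concrete, here is a minimal shell sketch that classifies a kernel release string. The `azl3` marker comes from the Helix log earlier in this thread (`6.6.92.2-1.azl3`); the function name `is_azl3_kernel` and the `KERNEL_RELEASE` override variable are hypothetical, and treating the marker as a reliable identifier is an assumption.

```shell
#!/bin/sh
# Sketch only: returns success when the given kernel release string
# carries the "azl3" marker seen in the Helix log (assumed identifier).
is_azl3_kernel() {
    case "$1" in
        *.azl3*) return 0 ;;
        *) return 1 ;;
    esac
}

# A container shares its host's kernel, so `uname -r` inside a container
# reports the host's release, not the distro of the container image.
# KERNEL_RELEASE is a hypothetical override for trying other values.
release="${KERNEL_RELEASE:-$(uname -r)}"
if is_azl3_kernel "$release"; then
    echo "azl3 kernel: $release"
else
    echo "other kernel: $release"
fi
```

Run inside an Azure Linux 3 container hosted on an Ubuntu VM, a check like this would be expected to take the "other kernel" branch, which is the situation described above.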

@richlander
Member Author

I am not seeing this failure in main (which is running Azure Linux VMs) which suggests to me that there is something different about this branch, code or tests.

I made this change to double-validate: #117439

@richlander
Member Author

Scratch that. I can see this in main as well if I change the tests being run a bit.

This suggests to me (A) that we should change main to run Azure Linux more aggressively (as is being done in this branch), and (B) that this issue is blocking us from doing that. We should make the change in main first.

It seems like we're seeing a difference in behavior for this test when it runs in a VM rather than a container. The failure appears in the former case and not the latter (even with the VM being the host for the container). That doesn't make obvious sense, so there must be something additional at play.

@wfurt
Member

wfurt commented Jul 10, 2025

it looks like azurelinux.3 images have firewall rules to filter traffic.

helixbot@toweinfu-azl3 [ ~ ]$ sudo iptables -L
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:ssh
ACCEPT     icmp --  anywhere             anywhere             icmp time-exceeded
ACCEPT     icmp --  anywhere             anywhere             icmp destination-unreachable

When I disable them, all the networking tests pass

helixbot@toweinfu-azl3 [ /mnt/runtime/src/libraries/System.Net.Sockets/tests/FunctionalTests ]$  /mnt/runtime/dotnet.sh build /t:test

  Determining projects to restore...
  All projects are up-to-date for restore.
  TestUtilities -> /mnt/runtime/artifacts/bin/TestUtilities/Debug/net8.0/TestUtilities.dll
  StreamConformanceTests -> /mnt/runtime/artifacts/bin/StreamConformanceTests/Debug/net10.0/StreamConformanceTests.dll
  System.Net.Sockets.Tests -> /mnt/runtime/artifacts/bin/System.Net.Sockets.Tests/Debug/net10.0-unix/System.Net.Sockets.Tests.dll
  ========================= Begin custom configuration settings ==============================
  export __IsXUnitLogCheckerSupported=1
  export XUNIT_HIDE_PASSING_OUTPUT_DIAGNOSTICS=1
  ========================== End custom configuration settings ===============================
  ----- start Thu Jul 10 23:50:24 UTC 2025 =============== To repro directly: =====================================================
  pushd /mnt/runtime/artifacts/bin/System.Net.Sockets.Tests/Debug/net10.0-unix
  /mnt/runtime/artifacts/bin/testhost/net10.0-linux-Debug-x64/dotnet exec --runtimeconfig System.Net.Sockets.Tests.runtimeconfig.json --depsfile System.Net.Sockets.Tests.deps.json /home/helixbot/.nuget/packages/microsoft.dotnet.xunitconsolerunner/2.9.2-beta.25359.101/build/../tools/net/xunit.console.dll System.Net.Sockets.Tests.dll -xml testResults.xml -nologo -notrait category=OuterLoop -notrait category=failing
  popd
  ===========================================================================================================
  /mnt/runtime/artifacts/bin/System.Net.Sockets.Tests/Debug/net10.0-unix /mnt/runtime/src/libraries/System.Net.Sockets/tests/FunctionalTests
    Discovering: System.Net.Sockets.Tests (method display = ClassAndMethod, method display options = None)
    Discovered:  System.Net.Sockets.Tests (found 1511 of 1875 test cases)
    Starting:    System.Net.Sockets.Tests (parallel test collections = on [2 threads], stop on fail = off)
      System.Net.Sockets.Tests.CreateSocket.Ctor_Raw_Supported_Success [SKIP]
        Condition(s) not met: "SupportsRawSockets"
      System.Net.Sockets.Tests.ConnectEap.Connect_ExposeHandle_FirstAttemptSucceeds(connectMode: "multi") [SKIP]
        EAP does not support IPAddress[] connect
      System.Net.Sockets.Tests.ConnectEap.MultiConnect_LingerState_Preserved(dnsConnect: False) [SKIP]
        EAP does not support IPAddress[] connect
      System.Net.Sockets.Tests.ConnectEap.MultiConnect_KeepAliveOptionsPreserved(dnsConnect: False) [SKIP]
        EAP does not support IPAddress[] connect
      System.Net.Sockets.Tests.ConnectEap.MultiConnect_ExposeHandle_TerminatesAtFirstFailure(dnsConnect: False) [SKIP]
        EAP does not support IPAddress[] connect
      System.Net.Sockets.Tests.ConnectEap.MultiConnect_MiscProperties_Preserved(dnsConnect: False) [SKIP]
        EAP does not support IPAddress[] connect
      System.Net.Sockets.Tests.ConnectEap.MultiConnect_DualMode_Preserved [SKIP]
        EAP does not support IPAddress[] connect
    Finished:    System.Net.Sockets.Tests
  === TEST EXECUTION SUMMARY ===
     System.Net.Sockets.Tests  Total: 2416, Errors: 0, Failed: 0, Skipped: 7, Time: 28.312s
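
As a rough illustration of how the firewall state shown above could be inspected from a script, the following sketch extracts the default policy of the INPUT chain from `iptables -L` output. The function name `input_chain_policy` is hypothetical. A `DROP` policy is only a coarse signal: plain `-L` omits the interface columns, so the listing alone does not show which traffic the ACCEPT rules actually cover.

```shell
#!/bin/sh
# Sketch only: reads `iptables -L` style text on stdin and prints the
# default policy of the INPUT chain (for example DROP or ACCEPT).
# Prints nothing if no INPUT chain header is found.
input_chain_policy() {
    sed -n 's/^Chain INPUT (policy \([A-Z]*\).*/\1/p'
}

# Usage (requires root, as in the transcript above):
#   sudo iptables -L | input_chain_policy
```

Fed the listing from this comment, the function prints `DROP`.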

@wfurt
Member

wfurt commented Jul 11, 2025

We can try to detect that, but it may be tricky. It seems like disabling the firewall may be the best option. It would probably take until next week to get the updates out. Any thoughts on this, @dotnet/ncl?

@richlander
Member Author

richlander commented Jul 11, 2025

Thanks for investigating!

We want our default test environment to be as close as possible to the default environment used by customers.

Per: #115415 (comment)

We cannot change the iptables settings. We need to make some change to the tests. We can set an ENV in the Azure Linux Helix images to signal a generic test configuration (not IsAzureLinux).

When we make the change, let's make it in main first. We can do that in the branch I already created (PR above).
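
The environment-variable signal proposed above could take roughly the following shape. This is a sketch under stated assumptions: `HELIX_TEST_CONFIG`, the `no-multicast` flag, and the function `test_config_has` are all hypothetical names, not existing Helix settings, and the real tests would consume such a signal from the C# test infrastructure rather than from shell.

```shell
#!/bin/sh
# Sketch only. HELIX_TEST_CONFIG is a hypothetical variable that an image
# could set to a comma-separated list of generic configuration flags.
# Returns success when the named flag is present in that list.
test_config_has() {
    case ",${HELIX_TEST_CONFIG:-}," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The flag describes the environment, not the distro, so any image with
# the same restriction could set it.
if test_config_has "no-multicast"; then
    echo "skip multicast tests"
else
    echo "run multicast tests"
fi
```

Keying on a generic flag rather than the distro name matches the stated intent of signaling a test configuration instead of checking IsAzureLinux.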

@wfurt
Member

wfurt commented Jul 11, 2025

I'm not sure that comment is applicable. We can disable the tests, but I see no benefit for customers. I personally see more value in making sure multicast works when the system allows it. The other direction (testing that it breaks when blocked by a firewall) is not interesting, IMHO.

maybe @karelz and @jkotas can comment

@richlander
Member Author

I think it applies the same. It is possible if we disable the firewall that some tests (at some point) will "false positive pass" in that configuration. It plays both ways.

My take is that we test in the default configuration. If we get sufficient signal, we can fund a second Azure Linux configuration.

This feature is still being tested, generally. We can accept that we are not getting coverage for this feature on Azure Linux (in the absence of more user signal).

@jkotas
Member

jkotas commented Jul 13, 2025

> maybe @karelz and @jkotas can comment

We want our tests to be passing on the default OS configuration (for supported OSes at least).

If the test is not compatible with a given OS's default configuration, it should be disabled. It can be disabled either by detecting the incompatible configuration (preferred; an example of prior art is PlatformDetection.SupportsSha3) or by a distro-specific check (e.g. PlatformDetection.IsAzureLinux).

There are a number of options for non-default configurations, with different cost/benefit tradeoffs. In this case, I think it is fine to depend on indirect coverage via other distros.

@richlander
Member Author

@wfurt @dotnet/ncl

Ping. Can we make resolving this issue, in main and then in two release branches a priority this week?

@wfurt
Member

wfurt commented Jul 16, 2025

BTW, there are other modifications in the Helix code, like increasing the number of file descriptors. Should we also change that back to the OS default?

@richlander
Member Author

Thanks for asking. No. Slightly hypocritical, but we need that for our tests to run at all.

Context: dotnet/dnceng#5728 (comment)

I am hoping that AL4 matches the other distros.

@ManickaP
Member

> Ping. Can we make resolving this issue, in main and then in two release branches a priority this week?

@richlander #117694 is merged in main. Lmk if you want this in 9.0 and 8.0 and I'll put up backports.

5 participants