Summary
Much like issue 2416, the Windows_Server-2022-English-Full-ECS_Optimized AMIs seem to have a problem where the ECS agent sometimes fails to connect to the ECS cluster because a virtual network adapter is missing (the VMNetwork adapter cannot be found). As in the other issue, the failure appears random and occurs sporadically on our Windows image.
Description
Using Packer, we build our own AMIs based on the Windows_Server-2022-English-Full-ECS_Optimized AMIs. During the build we install SSH, pull our Windows Docker images, and finish by installing EC2Launch v2. Once the AMI is ready, we use it in our ECS cluster with the following user data:
# configure ecs cluster
[Environment]::SetEnvironmentVariable("ECS_CLUSTER", "cluster-x86_64-windows","Machine")
[Environment]::SetEnvironmentVariable("ECS_IMAGE_PULL_BEHAVIOR","prefer-cached","Machine")
[Environment]::SetEnvironmentVariable("ECS_AWSVPC_BLOCK_IMDS","true ","Machine")
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE","true","Machine")
# init ecs agent
Import-Module ECSTools
Initialize-ECSAgent -EnableTaskIAMRole -EnableTaskENI -LoggingDrivers "['json-file','awslogs']"
Periodically, one of the instances in the ASG fails to register with the ECS cluster, and the agent initialization log shows the following:
2024-11-25T10:07:52Z - [INFO]:ScheduledTask Initialize-ECSHostReboot created.
2024-11-25T10:07:52Z - [INFO]:Configuring ECS Host for Task IAM Roles...
2024-11-25T10:07:52Z - [INFO]:Server Edition: Microsoft Windows Server 2022 Datacenter
2024-11-25T10:07:55Z - [INFO]:Attempt#: 10, Adapters:
2024-11-25T10:07:55Z - [INFO]:VMNetwork adapter 'vEthernet (nat)*' not found
2024-11-25T10:07:55Z - [INFO]:Retrying after sleeping 1sec
This error makes the instance unusable to the cluster, so the ASG launches a new one while the old one keeps running unused.
Expected Behavior
The ECS-Agent reliably connects to the ECS cluster without errors.
Observed Behavior
The ECS agent sometimes fails to initialize. The instance is never registered with the ECS cluster and simply keeps running. Rebooting the instance fixes the issue, and the agent no longer produces the error.
Before the reboot we get:
PS C:\Windows\system32> Get-NetAdapter
Name            InterfaceDescription              ifIndex Status MacAddress        LinkSpeed
----            --------------------              ------- ------ ----------        ---------
Ethernet 4      Amazon Elastic Network Adapter #2       8 Up     06-D5-5D-A5-67-E1 5.0 Gbps
After the reboot, when it starts to work:
PS C:\Windows\system32> Get-NetAdapter
Name            InterfaceDescription              ifIndex Status MacAddress        LinkSpeed
----            --------------------              ------- ------ ----------        ---------
Ethernet 4      Amazon Elastic Network Adapter #2       6 Up     06-4C-57-A5-89-89 5.0 Gbps
vEthernet (nat) Hyper-V Virtual Ethernet Adapter       12 Up     00-15-5D-03-19-C0 10 Gbps
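Since a reboot brings the adapter back, one possible stopgap is to check for it in the user data before initializing the agent, and to reboot when it is missing. The following is only a sketch under assumptions, not something we have confirmed as a fix: the attempt count is arbitrary, the guard is not part of ECSTools, and it assumes the user data is configured to run again after the reboot (for example with `<persist>true</persist>` in EC2Launch v2). If the adapter never appears, this would reboot repeatedly, so a real version would need a limit.

```powershell
# Hypothetical guard (not part of ECSTools): wait for the Docker NAT adapter
# and reboot the instance if it never shows up.
$adapterPattern = 'vEthernet (nat)*'   # adapter name taken from the agent log
$maxAttempts    = 30                   # assumed value, tune as needed
$adapter        = $null
$attempt        = 0

while (-not $adapter -and $attempt -lt $maxAttempts) {
    $attempt++
    # Returns nothing (instead of throwing) when no adapter matches the pattern
    $adapter = Get-NetAdapter -Name $adapterPattern -ErrorAction SilentlyContinue
    if (-not $adapter) { Start-Sleep -Seconds 1 }
}

if (-not $adapter) {
    Write-Output "Adapter '$adapterPattern' not found after $attempt attempts, rebooting."
    Restart-Computer -Force
    return
}

# Adapter is present: continue with Import-Module ECSTools and
# Initialize-ECSAgent exactly as in the user data shown earlier.
```

This would go before the `Import-Module ECSTools` line of the user data. It only automates the manual reboot and does not address why the adapter is missing on first boot.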
Environment Details:
PS C:\Windows\system32> docker info
Client:
Version: 25.0.6.m
Context: default
Debug Mode: false
Server:
ERROR: error during connect: in the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect: Get "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.44/info": open //./pipe/docker_engine: The system cannot find the file specified.
errors pretty printing info
PS C:\Windows\system32> Invoke-WebRequest -Uri http://localhost:51678/v1/metadata -UseBasicParsing
StatusCode : 200
StatusDescription : OK
Content : {"Cluster":"x86_64-windows-2022","ContainerInstanceArn":"arn:aws:ecs:eu-west-1:123456789011:container-instance/cluster-x86_64-windows-2022/a4c4329a0392450ba9e659b6b...
RawContent : HTTP/1.1 200 OK
Content-Length: 259
Content-Type: application/json
Date: Mon, 02 Dec 2024 10:02:18 GMT
{"Cluster":"x86_64-windows-2022","ContainerInstanceArn":"arn:aws:ecs:...
Forms :
Headers : {[Content-Length, 259], [Content-Type, application/json], [Date, Mon, 02 Dec 2024 10:02:18 GMT]}
Images : {}
InputFields : {}
Links : {}
ParsedHtml :
RawContentLength : 259