
microsoft / service-fabric-poa


Patch Orchestration Application (POA) is an Azure Service Fabric application that automates operating system patching on a Service Fabric cluster without downtime.

License: MIT License

PowerShell 4.55% C# 94.66% Batchfile 0.79%
servicefabric patching windowsupdate

service-fabric-poa's Introduction

services: service-fabric
platforms: .NET, windows
author: raunakpandya, brkhande

Service-Fabric-POA


Patch Orchestration Application (POA) is an Azure Service Fabric application that automates operating system patching on a Service Fabric cluster without downtime. This repo contains only the code for orchestrating Windows operating system updates.

The patch orchestration app provides the following features:

  • Automatic operating system update installation. Operating system updates are automatically downloaded and installed. Cluster nodes are rebooted as needed without cluster downtime.

  • Cluster-aware patching and health integration. While applying updates, the patch orchestration app monitors the health of the cluster nodes. Cluster nodes are upgraded one node at a time or one upgrade domain at a time. If the health of the cluster goes down due to the patching process, patching is stopped to prevent aggravating the problem.
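The cluster-aware behavior described above can be sketched roughly as follows. This is a minimal illustration in Python, not POA's actual implementation: the function name, the `patch_node` and `cluster_is_healthy` callbacks, and the dictionary of upgrade domains are all hypothetical.

```python
from typing import Callable, Dict, List


def patch_by_upgrade_domain(
    domains: Dict[str, List[str]],
    patch_node: Callable[[str], None],
    cluster_is_healthy: Callable[[], bool],
) -> List[str]:
    """Patch one upgrade domain at a time and stop when health degrades.

    Returns the names of the nodes that were patched.
    """
    patched: List[str] = []
    for domain in sorted(domains):
        # Check cluster health before touching the next upgrade domain.
        if not cluster_is_healthy():
            break
        for node in domains[domain]:
            patch_node(node)
            patched.append(node)
    return patched
```

The health check sits at the domain boundary, so a domain that has already started is finished before patching stops. A stricter variant could check before every node.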

Internal details of the app

The patch orchestration app is composed of the following subcomponents:

  • Coordinator Service: This stateful service is responsible for:
    • Coordinating the Windows Update job on the entire cluster.
    • Storing the result of completed Windows Update operations.
  • Node Agent Service: This stateless service runs on all Service Fabric cluster nodes. The service is responsible for:
    • Bootstrapping the Node Agent NTService.
    • Monitoring the Node Agent NTService.
  • Node Agent NTService: This Windows NT service runs at a higher-level privilege (SYSTEM). In contrast, the Node Agent Service and the Coordinator Service run at a lower-level privilege (NETWORK SERVICE). The service is responsible for performing the following Windows Update jobs on all the cluster nodes:
    • Disabling automatic Windows Update on the node.
    • Downloading and installing Windows Update according to the policy the user has provided.
    • Restarting the machine post Windows Update installation.
    • Uploading the results of Windows updates to the Coordinator Service.
    • Reporting health reports in case an operation has failed after exhausting all retries.

For more details, please visit this link

Developer Help & Documentation

Build Application

To build the application, first set up the machine for Service Fabric application development.

Set up your development environment with Visual Studio 2017.

Once set up, clone this repo. Then open a PowerShell command prompt, change to the root of the repo, and run the build.ps1 script.

PS E:\SF-POS> .\build.ps1

It should produce output like the following.

Source root is E:\SF-POS\src\PatchOrchestrationApplication\PatchOrchestrationApplication
Restoring NuGet package Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.6.7.
Adding package 'Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.6.7' to folder 'E:\SF-POS\packages'
Added package 'Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.6.7' to folder 'E:\SF-POS\packages'

NuGet Config files used:
    C:\Users\brkhande\AppData\Roaming\NuGet\NuGet.Config

Feeds used:
    C:\Users\brkhande\AppData\Local\NuGet\Cache
    C:\Users\brkhande\.nuget\packages\
    https://api.nuget.org/v3/index.json

Installed:
    1 package(s) to packages.config projects
Changing the working directory to E:\SF-POS\src\PatchOrchestrationApplication\PatchOrchestrationApplication
Using msbuild from C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\MSBuild\15.0\Bin\MSBuild.exe
  Restoring packages for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\NodeAgentSFUtility.csproj...
  Restoring packages for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentService\src\NodeAgentService.csproj...
  Restoring packages for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentNTService\src\NodeAgentNTService.csproj...
  Restoring packages for E:\SF-POS\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.csproj...
  Restoring packages for E:\SF-POS\src\PatchOrchestrationApplication\TelemetryLib\src\TelemetryLib.csproj...
  Installing Newtonsoft.Json 10.0.2.
  Installing Microsoft.VisualStudio.Azure.Fabric.MSBuild 1.6.7.
  Installing Microsoft.ServiceFabric 5.4.145.
  Installing Microsoft.ApplicationInsights 2.2.0.
  Installing Microsoft.ServiceFabric.Data 2.4.145.
  Installing Microsoft.ServiceFabric.Services 2.4.145.
  Installing Microsoft.AspNet.WebApi.Client 5.2.3.
  Installing Unity 4.0.1.
  Installing Microsoft.AspNet.WebApi.Core 5.2.3.
  Installing Unity.WebAPI 5.2.3.
  Installing Owin 1.0.
  Installing Microsoft.Owin.Hosting 2.0.2.
  Installing Microsoft.Owin 2.0.2.
  Installing CommonServiceLocator 1.3.
  Installing Microsoft.AspNet.WebApi.Owin 5.2.3.
  Installing Microsoft.Owin.Host.HttpListener 2.0.2.
  Installing Newtonsoft.Json 6.0.4.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentService\src\obj\NodeAgentService.csproj.nuget.g.props.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\obj\NodeAgentSFUtility.csproj.nuget.g.props.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentNTService\src\obj\NodeAgentNTService.csproj.nuget.g.props.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\CoordinatorService\src\obj\CoordinatorService.csproj.nuget.g.props.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\TelemetryLib\src\obj\TelemetryLib.csproj.nuget.g.props.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentService\src\obj\NodeAgentService.csproj.nuget.g.targets.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\obj\NodeAgentSFUtility.csproj.nuget.g.targets.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\CoordinatorService\src\obj\CoordinatorService.csproj.nuget.g.targets.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\TelemetryLib\src\obj\TelemetryLib.csproj.nuget.g.targets.
  Generating MSBuild file E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentNTService\src\obj\NodeAgentNTService.csproj.nuget.g.targets.
  Restore completed in 3.77 sec for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentService\src\NodeAgentService.csproj.
  Restore completed in 3.77 sec for E:\SF-POS\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.csproj.
  Restore completed in 3.77 sec for E:\SF-POS\src\PatchOrchestrationApplication\TelemetryLib\src\TelemetryLib.csproj.
  Restore completed in 3.77 sec for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentNTService\src\NodeAgentNTService.csproj.
  Restore completed in 3.77 sec for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\NodeAgentSFUtility.csproj.
  Restore completed in 6.53 ms for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\NodeAgentSFUtility.csproj.
  Restore completed in 6.57 ms for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentService\src\NodeAgentService.csproj.
  Restore completed in 6.12 ms for E:\SF-POS\src\PatchOrchestrationApplication\TelemetryLib\src\TelemetryLib.csproj.
  Restore completed in 6.56 ms for E:\SF-POS\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.csproj.
  Restore completed in 6.59 ms for E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentNTService\src\NodeAgentNTService.csproj.
  TelemetryLib -> E:\SF-POS\src\PatchOrchestrationApplication\TelemetryLib\src\bin\Release\TelemetryLib.dll
  CoordinatorService -> E:\SF-POS\src\PatchOrchestrationApplication\CoordinatorService\src\bin\Release\CoordinatorService.exe
  NodeAgentNTService -> E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentNTService\src\bin\Release\NodeAgentNTService.exe
  NodeAgentSFUtilityDll -> E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\bin\Release\CommandProcessor.dll
  NodeAgentSFUtility -> E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\bin\Release\NodeAgentSFUtility.exe
  NodeAgentService -> E:\SF-POS\src\PatchOrchestrationApplication\NodeAgentService\src\bin\Release\NodeAgentService.exe
  PatchOrchestrationApplication -> E:\SF-POS\src\PatchOrchestrationApplication\PatchOrchestrationApplication\pkg\Release

By default, the script creates a release package of the application in the out\Release folder.

Deploy Application

Choose one of the following methods for deployment. Service Fabric best practice is to use ARM templates for application deployments. Example template and argument files are located in the /arm directory.

Click the button below to deploy using the Azure ARM application template.

Deploy to Azure

Using PowerShell

  • Open a PowerShell command prompt and go to the root of the repository.

  • Connect to the Service Fabric cluster where you want to deploy the application by using the Connect-ServiceFabricCluster PowerShell command.

  • Deploy the application using the following PowerShell command.

    . out\Release\Deploy.ps1 -ApplicationPackagePath 'out\Release\PatchOrchestrationApplication' -ApplicationParameter @{ }
  • To override the default values of the application parameters, deploy the application using the following PowerShell command instead:

    . out\Release\Deploy.ps1 -ApplicationPackagePath 'out\Release\PatchOrchestrationApplication'  -ApplicationParameter @{ 'WURescheduleCount'='10'; 'WUFrequency'= 'Weekly, Tuesday, 12:22:32'; }

Note

Use the above deployment procedure only to test changes made to this application. For production or test environments, always use the officially released version of the application. The application and its installation scripts can be downloaded from the release section of this repo. Deployment steps for this application can be found here

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Ideas and Improvements

We encourage community feedback and contributions to improve this application. To contribute, please follow our contribution guidelines. If you have a new idea, please file an issue for it.

Contribution Guidelines:

Please create a branch, push your changes to it, and then create a pull request for the change. The following checklist must be completed before your change can be merged into the master branch.

  1. Create a Service Fabric cluster on Azure. You can find the steps for creating a Service Fabric cluster on Azure here
  2. Build the application with your change.
  3. Deploy the application and validate that the application is healthy and working as expected.
  4. Resolve all the comments from owners.

service-fabric-poa's People

Contributors

anantshankar17, dependabot[bot], jagilber, khandelwalbrijesh, microsoft-github-policy-service[bot], microsoftopensource, msftgits, raunakpandya, yosal


service-fabric-poa's Issues

POA reporting invalid DateTime setting after upgrade to 1.4.1

After upgrading to 1.4.1, POA keeps reporting this error. I upgraded POA to 1.4.1 on our beta cluster, and that cluster has not had the problem. I tried rolling back to 1.3.2, but the error remains. I unprovisioned the application type and tried a clean deployment, with the same warning. The only overridden setting was WUFrequency, set to run "Daily,14:00:00".

Warning
'Patch Orchestration Node Agent Service' reported Warning for property 'WUOperationSetting-_sfsandbox_0'.
Exception while starting timer. System.FormatException: String was not recognized as a valid DateTime.
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentNTService.Manager.TimerManager.ReadCheckpointFile() in D:\a\1\s\src\PatchOrchestrationApplication\NodeAgentNTService\src\Manager\TimerManager.cs:line 453
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentNTService.Manager.TimerManager.LoadSettings() in D:\a\1\s\src\PatchOrchestrationApplication\NodeAgentNTService\src\Manager\TimerManager.cs:line 376
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentNTService.Manager.TimerManager.StartTimer() in D:\a\1\s\src\PatchOrchestrationApplication\NodeAgentNTService\src\Manager\TimerManager.cs:line 64
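The stack trace points at a timestamp that could not be parsed while reading the checkpoint file. One common defensive pattern, sketched below in Python, is to write timestamps in a locale-independent format and to treat an unreadable checkpoint as missing instead of failing. The contents of POA's checkpoint file are not shown in this report, so the single-timestamp layout and both function names are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def format_checkpoint(moment: datetime) -> str:
    """Serialize a timestamp as ISO 8601 in UTC, independent of locale."""
    return moment.astimezone(timezone.utc).isoformat()


def parse_checkpoint(text: str) -> Optional[datetime]:
    """Return the stored timestamp, or None when the text is unreadable."""
    try:
        return datetime.fromisoformat(text.strip())
    except ValueError:
        # An unreadable checkpoint is treated as absent so the caller
        # can fall back to a fresh schedule instead of crashing.
        return None
```

With this approach a checkpoint written under one regional setting still parses under another, and a corrupt file degrades to "no checkpoint" rather than a recurring warning.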

WUFrequency - Week, Day, HH:MM:SS not working (v1.4.9)

Hi,

Recently I installed POA v1.4.9 and configured "WUFrequency - for the week of every month" as per the documentation:

<Parameter Name="WUFrequency" DefaultValue="3,Sunday,5:00:00" />

Based on the above parameter, POA should be scheduled for the 3rd Sunday of every month; instead, POA is scheduled weekly on Sunday.
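For reference, the expected "3rd Sunday of every month" date can be computed as in the Python sketch below. The function name is hypothetical and this is not POA's scheduling code; it only shows the calendar arithmetic the setting implies.

```python
import calendar
from datetime import date


def nth_weekday_of_month(year: int, month: int, weekday: int, n: int) -> date:
    """Return the n-th (1-based) occurrence of a weekday in a month.

    weekday uses the calendar module constants
    (calendar.MONDAY == 0 ... calendar.SUNDAY == 6).
    """
    first = date(year, month, 1)
    # Days from the 1st of the month to the first requested weekday.
    offset = (weekday - first.weekday()) % 7
    day = 1 + offset + 7 * (n - 1)
    if day > calendar.monthrange(year, month)[1]:
        raise ValueError("the month has no such occurrence")
    return date(year, month, day)
```

For example, `nth_weekday_of_month(2024, 1, calendar.SUNDAY, 3)` gives 21 January 2024, whereas a weekly schedule would also fire on the 7th, 14th and 28th.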

TimeManager.cs uses non-FIPS approved MD5 algorithm.

The MD5 algorithm used in TimeManager.cs is not FIPS-approved and will throw cryptographic exceptions on clusters with FIPS mode enabled (see below).

I've seen the Azure Storage SDK use a native MD5 implementation to help with this issue:

https://github.com/Azure/azure-storage-net/blob/4bd9460e0b9b007bdb34eaa1329de2257f7a34dc/Lib/ClassLibraryCommon/Core/Util/NativeMD5.cs

Patch Orchestration support for specifying day-of-week for monthly frequency

We use POA to update on a monthly frequency (WUFrequency). Ideally, we would like to start patching Friday night on the week of Patch Tuesday and let the patching continue over the weekend.

Currently, POA only supports specifying the day (1-28) of the month for patching. Could support be added for specifying the day of the week on a monthly frequency, e.g. 2nd week / Friday 10 pm, or for starting auto-patching on the weekend after Patch Tuesday?

POA is throwing warning that the repair task with id “POS__nodeId_9f62fad1-9f3b-485b-bd63-f8a1e19742b8” is not completed after timeout.

POA got stuck because the node named "nodeId" was unhealthy for a long time and then got deleted from the cluster as a resolution for an unrelated issue. So the node named "nodeId" no longer exists in the cluster.
However, its repair task remained in the Executing state with the sub-state restartRequested. When this happens, POA cannot garbage-collect the repair task and gets stuck.
In this case, POA keeps raising a warning on the cluster saying that the repair task did not complete within the timeout. Because the node no longer exists in the cluster, the repair task will never complete.
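One way to detect this situation is to compare unfinished repair tasks against the nodes that still exist. The Python sketch below is only an illustration: the task dictionaries with "id", "node" and "state" keys are a hypothetical shape, not the Repair Manager's data model.

```python
from typing import Dict, Iterable, List, Set


def find_orphaned_tasks(
    tasks: Iterable[Dict[str, str]], live_nodes: Set[str]
) -> List[str]:
    """Return IDs of unfinished repair tasks whose target node is gone."""
    return [
        task["id"]
        for task in tasks
        if task["state"] != "Completed" and task["node"] not in live_nodes
    ]
```

Tasks returned by such a check could then be cancelled or cleaned up instead of being waited on indefinitely.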

Exception in NodeAgentNTService.exe: The process was terminated due to an unhandled exception.

Dear,

We keep getting the following exception. We just upgraded to version 1.5.0 and the problem remains. The Service Fabric cluster shows no issues, but this Windows service does.

What could be the reason for this, and how can it be solved?

Thank you for any help.

Application: NodeAgentNTService.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Exception
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentNTService.Utility.NodeAgentSfUtility.ReportHealth(System.String, System.String, Microsoft.ServiceFabric.PatchOrchestration.Common.HealthState, Int64, System.TimeSpan)
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentNTService.Manager.TimerManager.StartTimer()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.ThreadHelper.ThreadStart()

CoordinatorService.exe consumes all memory

We had an incident, ultimately tracked down to an issue with the certificate trust chain for the cluster (an intermediate certificate had been replaced such that the configured issuer thumbprint no longer matched). This caused communication between the Service Fabric resource provider and the cluster, and between internal clients and the cluster, to fail.

During the incident we noticed that the CoordinatorService.exe process started to grow its memory consumption uncontrollably, and in roughly 20 hours it had allocated 11 GB of RAM. I'm not sure whether the patching itself triggered the incident or was only the first thing to be hit by it, but the RAM consumption was disrupting other services on the machines.

At that time we had version 1.4.5 running in the cluster.

Reboot loop on failed KB5034439. Possible to skip?

Our VMSS cluster gets stuck in a reboot loop due to the failure of problematic update KB5034439. Is it possible to skip individual updates, or even prevent the nodes from rebooting unless required by the update?

I have run "wushowhide" and configured the nodes to hide that update, but POA ignores the setting and tries to install it anyway. I can manually run Windows Updates and it does not try to install KB5034439.

Patching issue with 1.4.1

Hi,
If the Azure Service Fabric cluster is unhealthy because some application is in an error state, the cluster halts patching, but once the error is fixed the cluster starts patching immediately. Is it possible to define a maintenance window, for example 0100 - 0400 weekly or daily, so that outside that window patching is not performed until the next window? This is for monthly Windows updates, not the cluster image swap.
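The requested maintenance-window check could look roughly like the Python sketch below. It is an illustration of the idea only; the function name is hypothetical and no such setting is described in this issue as existing in POA.

```python
from datetime import time


def in_maintenance_window(now: time, start: time, end: time) -> bool:
    """True when now falls in [start, end); windows may cross midnight."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, for example 23:00 to 02:00.
    return now >= start or now < end
```

Patching that becomes possible outside the window (for example, right after the cluster recovers) would then be deferred until the check returns true again.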
Thanks!

POA doesn't fetch updates for other Microsoft products

Windows Update has an option, "Give me updates for other Microsoft products when I update Windows". POA doesn't enable that setting or pull those updates by default. If one enables that setting manually today, POA pulls those updates fine, so the request is only for POA to enable that setting itself based on some flag in settings.xml.

Some options to enable this programmatically:

Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU

AllowMUUpdateService REG_DWORD 1

or

Script RegisterMicrosoftUpdateForAutoUpdates
{
GetScript = {
$isMuRegisteredForAU = $false
foreach ($updateService in (new-object -ComObject "microsoft.update.servicemanager").Services) {if ($updateService.Name -eq "Microsoft Update"){ $isMuRegisteredForAU=$updateService.IsRegisteredWithAU }}
Return @{'Result' = "Microsoft Update registration status with Auto Updates: $isMuRegisteredForAU"}
}
TestScript = {
$isMuRegisteredForAU = $false
foreach ($updateService in (new-object -ComObject "microsoft.update.servicemanager").Services) {if ($updateService.Name -eq "Microsoft Update"){ $isMuRegisteredForAU=$updateService.IsRegisteredWithAU }}
Return $isMuRegisteredForAU
}
SetScript = {(new-object -ComObject "microsoft.update.servicemanager").addservice2("7971f918-a847-4430-9279-4a52d1efe18d",7,"")}
}

Timeout repair task tries to complete the Repair task and this leads to crash in PatchOrchestrationApplication

TimeoutRepairTask in the Coordinator Service tries to move the repair task from the Executing state to the Completed state. This operation is not supported by the Repair Manager, and it leads to a crash in the Coordinator Service.
Application: CoordinatorService.exe
Framework Version: v4.0.30319
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: RunAsync failed due to an unhandled exception causing the host process to crash: System.InvalidOperationException: The state transition is not allowed: (Executing,Completed) ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071BFD
at System.Fabric.Interop.NativeClient.IFabricRepairManagementClient2.EndUpdateRepairExecutionState(IFabricAsyncOperationContext context)
at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
--- End of inner exception stack trace ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.PatchOrchestration.CoordinatorService.RepairManagerHelper.d__36.MoveNext() in D:\a\1\s\src\PatchOrchestrationApplication\CoordinatorService\src\RepairManagerHelper.cs:line 709
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.PatchOrchestration.CoordinatorService.RepairManagerHelper.d__35.MoveNext() in D:\a\1\s\src\PatchOrchestrationApplication\CoordinatorService\src\RepairManagerHelper.cs:line 635
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.PatchOrchestration.CoordinatorService.CoordinatorService.d__9.MoveNext() in D:\a\1\s\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.cs:line 109
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.PatchOrchestration.CoordinatorService.CoordinatorService.d__8.MoveNext() in D:\a\1\s\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.cs:line 89
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.d__23.MoveNext()
Stack:
at System.Environment.FailFast(System.String)
at System.Threading.Tasks.Task.Execute()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
at System.Threading.Tasks.Task.ExecuteEntry(Boolean)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
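The error message says that the (Executing, Completed) transition is rejected. The Python sketch below shows one way a caller could validate a transition before requesting it. The strictly linear lifecycle is a simplifying assumption for illustration; it is not the Repair Manager's authoritative rule set, and both function names are hypothetical.

```python
from typing import List

# Assumed linear lifecycle, for illustration only.
LIFECYCLE = [
    "Created",
    "Claimed",
    "Preparing",
    "Approved",
    "Executing",
    "Restoring",
    "Completed",
]


def is_allowed(current: str, target: str) -> bool:
    """Allow only a move to the immediately following state."""
    return LIFECYCLE.index(target) == LIFECYCLE.index(current) + 1


def steps_to(current: str, target: str) -> List[str]:
    """List the single-step transitions needed to reach target."""
    start, stop = LIFECYCLE.index(current), LIFECYCLE.index(target)
    if stop < start:
        raise ValueError("cannot move backwards")
    return LIFECYCLE[start + 1 : stop + 1]
```

Under this assumption, a timed-out task in Executing would first be moved to Restoring and only then reach Completed, instead of jumping directly.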

POA should check for chained updates by looking for updates immediately after installing one

WU releases some updates that are chained, which means that if update A isn't installed, update B won't be visible. Immediately after update A is installed, update B becomes visible and available. So when POA runs, update A gets installed in the first update cycle and update B in the next one. If the cluster is unhealthy for some time, it falls behind on such updates. Having the capability to check for new updates immediately after an update installation would probably be a good idea.
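The proposal amounts to repeating the scan after each installation until nothing new appears. A minimal Python sketch, with hypothetical `scan` and `install` callbacks standing in for the Windows Update calls and an assumed cap on the number of rounds:

```python
from typing import Callable, List


def install_until_current(
    scan: Callable[[], List[str]],
    install: Callable[[List[str]], None],
    max_rounds: int = 5,
) -> List[str]:
    """Scan and install repeatedly so chained updates land in one cycle.

    Stops when a scan finds nothing or after max_rounds rounds.
    """
    installed: List[str] = []
    for _ in range(max_rounds):
        pending = scan()
        if not pending:
            break
        install(pending)
        installed.extend(pending)
    return installed
```

The round limit guards against an update that keeps reappearing after a failed installation.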

HMACSHA1 broke verification of settings

The latest update broke file comparison. Here is sample code to test it; as you can see, comparing a file with itself returns false with the new hash.

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

namespace ThePlaceToRunABitOfCode
{
    class Program
    {
        /// <summary>
        /// Outputs
        /// f1 vs f2 Using HMACSHA1: False
        /// f1 vs f1 Using HMACSHA1: False
        /// f2 vs f2 Using HMACSHA1: False
        /// 
        /// f1 vs f2 Using MD5: False
        /// f1 vs f1 Using MD5: True
        /// f2 vs f2 Using MD5: True
        /// 
        /// </summary>
        static void Main(string[] args)
        {
            const string f1 = @"C:\temp\f1.xml";
            const string f2 = @"C:\temp\f2.xml";

            var random = new Random(42);
            var randomBytes = new byte[1024];
            random.NextBytes(randomBytes);

            File.WriteAllText(f2,Encoding.Default.GetString(randomBytes));
            random.NextBytes(randomBytes);
            File.WriteAllText(f1,Encoding.Default.GetString(randomBytes));

            Console.WriteLine($"f1 vs f2 Using HMACSHA1: {AreFilesEqual(f1,f2)}");
            Console.WriteLine($"f1 vs f1 Using HMACSHA1: {AreFilesEqual(f1, f1)}");
            Console.WriteLine($"f2 vs f2 Using HMACSHA1: {AreFilesEqual(f2, f2)}");
            Console.WriteLine();
            Console.WriteLine($"f1 vs f2 Using MD5: {AreFilesEqual(f1, f2, true)}");
            Console.WriteLine($"f1 vs f1 Using MD5: {AreFilesEqual(f1, f1, true)}");
            Console.WriteLine($"f2 vs f2 Using MD5: {AreFilesEqual(f2, f2, true)}");
        }

        private static bool AreFilesEqual(string f1, string f2, bool useMd5 = false)
        {
            string f1Hash; 
            string f2Hash;
            if (useMd5)
            {
                f1Hash = GetMd5(f1);
                f2Hash = GetMd5(f2);
            }
            else
            {
                f1Hash = GetHMACSHA1(f1);
                f2Hash = GetHMACSHA1(f2);
            }
            return f1Hash.Equals(f2Hash);
        }

        private static string GetHMACSHA1(string filename)
        {
            using (var hmac = HMACSHA1.Create())
            {
                using (var stream = File.OpenRead(filename))
                {
                    return Encoding.Default.GetString(hmac.ComputeHash(stream));
                }
            }
        }
        private static string GetMd5(string filename)
        {
            using (var hmac = MD5.Create())
            {
                using (var stream = File.OpenRead(filename))
                {
                    return Encoding.Default.GetString(hmac.ComputeHash(stream));
                }
            }
        }

    }
}
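A plausible explanation for the output above, assuming `HMACSHA1.Create()` returns an instance with a freshly generated random key, is that each hash is computed with a different key, so even identical input yields different digests. The Python sketch below reproduces that effect and contrasts it with an unkeyed SHA-256 digest; the function names are illustrative only.

```python
import hashlib
import hmac
import os


def keyed_digest(data: bytes) -> bytes:
    """HMAC-SHA1 with a new random key on every call."""
    return hmac.new(os.urandom(64), data, hashlib.sha1).digest()


def plain_digest(data: bytes) -> bytes:
    """Unkeyed SHA-256, which depends only on the data."""
    return hashlib.sha256(data).digest()
```

For file comparison, either an unkeyed hash or an HMAC with one fixed, shared key gives stable results. SHA-256 is also a FIPS-approved algorithm, which is relevant on clusters with FIPS mode enabled.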

Support for POA in Ephemeral OS Disks

Currently, POA is not supported in Azure on VMs with ephemeral OS disks, because POA depends on data on the C drive. We need to work around this in order to support POA on ephemeral OS disks.

Environment considerations during windows update

Hi there,

We are starting to look at POA for our environment and are wondering how other people cope with making changes to load balancers, especially to prevent any hiccups there.

Prior to a reboot of a node, it would make sense to have a hook to inform the load balancer that the node is going out of load, and then add it back in when the Service Fabric services have restarted.

Also, how do people cope with installing other updates (.NET Core X runtime) that may or may not require a reboot of the node?

Can you share any thoughts?

Dave

NodeAgentService instance stops for an indefinite time

Hello!

I have a standalone cluster with some microservices and have POA installed so that it makes my life easier and does the Windows updates monthly for me.
What I found was that, for some reason, some instances of NodeAgentService just stop for an indefinite time, and this month I noticed that only half of the repair jobs were created.

Does anyone have any idea what happened?


NodeAgentNTService crashes while reporting health.

Call Stack:
NodeAgentNTService!Microsoft.ServiceFabric.PatchOrchestration.NodeAgentNTService.Utility.NodeAgentSfUtility.ReportHealth+0x31e [E:\bt\891126\repo\src\PatchOrchestrationApplication\NodeAgentNTService\src\Utility\NodeAgentSfUtility.cs @ 194]
NodeAgentNTService!Microsoft.ServiceFabric.PatchOrchestration.NodeAgentNTService.Manager.TimerManager.StartTimer+0x10a [E:\bt\891126\repo\src\PatchOrchestrationApplication\NodeAgentNTService\src\Manager\TimerManager.cs @ 74]
mscorlib_ni!System.Threading.ExecutionContext.RunInternal+0x163 [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
mscorlib_ni!System.Threading.ExecutionContext.Run+0x14 [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
mscorlib_ni!System.Threading.ExecutionContext.Run+0x52 [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 891]
mscorlib_ni!System.Threading.ThreadHelper.ThreadStart+0x52 [f:\dd\ndp\clr\src\BCL\system\threading\thread.cs @ 111]

Add ability to space out the update timing for services with long startup time.

When POA starts the update sequence for the nodes, it doesn’t take into consideration the time it takes for the
service to become “ready to serve requests”. For updates that take a long time, this might be OK.
But for very quick updates that require a restart, this might cause the entire cluster to become non-functional.

Right after a node restarts and completes its updates, and without waiting for the service to be available and ready to serve, the next node (which might host the same service) starts its update and restart.
This means that fewer service instances are up and ready to serve requests, and the number of unavailable instances keeps growing as long as the time to update and restart is shorter than the time the service needs to load its data into cache and become ready to serve.
This causes a big loss in compute power, and for services with a very high RPS rate this is a big problem.

Adding a setting for a minimum time between node updates would solve this issue.
It would allow us to configure POA with a constant timespan that POA waits between the last completed repair task
and the next repair task that needs to be approved.

In case it is acceptable, I can add this functionality and create a PR.
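The proposed setting reduces to a simple time check before the next repair task is approved. A minimal Python sketch, in which the function name and the `min_gap` parameter are hypothetical:

```python
from datetime import datetime, timedelta
from typing import Optional


def may_approve_next(
    last_completed: Optional[datetime], now: datetime, min_gap: timedelta
) -> bool:
    """Approve the next repair task only after min_gap has elapsed."""
    if last_completed is None:
        # Nothing has completed yet, so there is nothing to wait for.
        return True
    return now - last_completed >= min_gap
```

With `min_gap` set to the service's warm-up time, the next node would not restart until the previous one has had time to become ready.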

POA should not trigger downloading of the update all at the same time

It seems that POA coordinates the installation of patches by update domain, but it downloads the updates at the same time across all nodes. In theory, this process should be (?) lightweight, but this last patch caused CPU to spike on all of the nodes in every region.

image

Based on POA logs, the initial spike that happened on all nodes corresponds to the time POA started to download the update. After that, you can see CPU spikes in phases, probably due to POA installing the patches update domain by update domain.

I don't know why the CPU spiked during the download phase (maybe the system was scanning for updates or malware?). Regardless of the cause, this could bring the entire system down if something bad happens, for example if all of the CPU is consumed.

To avoid this, should the download of the patches also happen in phases?
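Phasing the downloads could be as simple as giving each domain a start delay. The Python sketch below illustrates the idea only; the function name, the `gap` parameter and the dictionary of domains are hypothetical and do not reflect an existing POA option.

```python
from datetime import timedelta
from typing import Dict, List


def download_offsets(
    domains: Dict[str, List[str]], gap: timedelta
) -> Dict[str, timedelta]:
    """Assign each node a download delay based on its domain's position."""
    offsets: Dict[str, timedelta] = {}
    for index, domain in enumerate(sorted(domains)):
        for node in domains[domain]:
            # Nodes in the same domain start together; domains are staggered.
            offsets[node] = gap * index
    return offsets
```

Nodes in the first domain start immediately and each later domain starts one `gap` after the previous one, so a CPU spike during download would affect only part of the cluster at a time.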

Retries should be added in POA FabricClient Calls.

RunAsync failed due to an unhandled FabricException causing replica to fault: System.Fabric.FabricTransientException: Operation canceled. ---> System.Runtime.InteropServices.COMException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT))
at System.Fabric.Interop.NativeClient.IFabricRepairManagementClient2.EndGetRepairTaskList(IFabricAsyncOperationContext context)
at System.Fabric.FabricClient.RepairManagementClient.GetRepairTaskListAsyncEndWrapper(IFabricAsyncOperationContext context)
at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
--- End of inner exception stack trace ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.PatchOrchestration.CoordinatorService.RepairManagerHelper.d__21.MoveNext() in E:\bt\926921\repo\src\PatchOrchestrationApplication\CoordinatorService\src\RepairManagerHelper.cs:line 266
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.PatchOrchestration.CoordinatorService.CoordinatorService.d__9.MoveNext() in E:\bt\926921\repo\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.cs:line 110
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.PatchOrchestration.CoordinatorService.CoordinatorService.d__8.MoveNext() in E:\bt\926921\repo\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.cs:line 94
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.d__12.MoveNext()
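
A retry wrapper of the kind the title asks for might look like the sketch below. It is written in Python as a language-neutral illustration, not as POA's C# code: `TransientError` is a stand-in for a retryable failure such as `FabricTransientException`, and `call_with_retries`, its parameters, and the backoff values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure such as FabricTransientException."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the failure.
                raise
            # Double the delay on every attempt and add jitter to avoid synchronized retries.
            delay = base_delay_seconds * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

Only the transient exception type is retried; any other exception propagates immediately, so genuine bugs are not masked by the retry loop.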

v1.4.6 release sfpkg is failing to deploy

Azure Service Fabric version: 7.1.458.9590
OS: Windows Server 2016 Datacenter
OS version: 10.0.14393
POA version: 1.4.6

CoordinatorService.exe and NodeAgentNTService.exe crash with an unhandled exception.

From Application event log:

.NET Runtime:

Application: CoordinatorService.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: exception code e0434352, exception address 00007FFD60DF4F38
Stack:

Application: NodeAgentNTService.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: exception code e0434352, exception address 00007FFD60DF4F38
Stack:

Application Error
Faulting application name: NodeAgentNTService.exe, version: 0.0.0.0, time stamp: 0x5f718322
Faulting module name: KERNELBASE.dll, version: 10.0.14393.3659, time stamp: 0x5e9140ed
Exception code: 0xe0434352
Fault offset: 0x0000000000034f38
Faulting process id: 0x4520
Faulting application start time: 0x01d69b012ccc90ed
Faulting application path: C:\PatchOrchestrationApplication\NodeAgentNTService\NodeAgentNTService.exe
Faulting module path: C:\windows\System32\KERNELBASE.dll
Report Id: c441e4c2-1431-4233-a9ac-c624697177fc
Faulting package full name:
Faulting package-relative application ID:

POA version 1.4.5 works just fine

Image based patching

In one of the Q&A sessions it was mentioned that the Patch Orchestration Application is not the recommended approach, and that "image based patching" should be used instead.

Could you clarify what was meant by this? Is POA deprecated? By image based patching, did you simply mean this?

NodeAgentNTService is not able to update its CopyOfSettings.xml file when user passes corrupt settings while deploying the application

If the user passes a corrupt configuration by mistake while deploying the application, NodeAgentNTService keeps crashing with an exception saying that the settings are not valid. The NT service cannot update its settings even after the user updates the app with correct settings.

Mitigation:
Delete the local CopyOfSettings.xml file on all the nodes.

The fix is already checked in and will be released in the upcoming refresh of the application.

NodeAgentNTService is stuck in stopping state and NodeAgentService is not able to come up on few nodes.

NodeAgentService's setup script fails with the following output:

D:\SvcFab_App\PatchOrchestrationApplicationType_App18\NodeAgentServicePkg.Code.1.3.2>REM Stop the service and uninstall the current version

D:\SvcFab_App\PatchOrchestrationApplicationType_App18\NodeAgentServicePkg.Code.1.3.2>sc stop POSNodeSvc
[SC] ControlService FAILED 1061:

The service cannot accept control messages at this time.

D:\SvcFab_App\PatchOrchestrationApplicationType_App18\NodeAgentServicePkg.Code.1.3.2>sc delete POSNodeSvc
[SC] DeleteService FAILED 1072:

The specified service has been marked for deletion.

Cannot register PatchOrchestrationApplication_v1.4.5.sfpkg with SF Cluster - invalid path

The .sfpkg file from the 1.4.5 Release is adding a sub-folder called "PatchOrchestrationApplication" that prevents SF from registering the .sfpkg because it cannot find the ApplicationManifest.xml:

The Application Manifest file 'POA\ApplicationManifest.xml' is not found in the store.

Upon inspecting the package and comparing to the previous version v1.4.3 that did work, it seems that the latest version added a subfolder that shouldn't be there:
image

It is looking for the ApplicationManifest.xml file to be at the \POA\ApplicationManifest.xml path instead.

Reference: v1.4.5.sfpkg release
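
Because an .sfpkg file is a zip archive, the layout can be checked before registering the package. The following is a minimal sketch in Python; the helper name `manifest_at_root` is hypothetical, and the archives are built in memory purely for illustration.

```python
import io
import zipfile
from typing import BinaryIO, Union


def manifest_at_root(package: Union[str, BinaryIO]) -> bool:
    """Return True if ApplicationManifest.xml is a top-level entry of the archive."""
    with zipfile.ZipFile(package) as archive:
        # Normalize separators in case the archive was created with backslashes.
        names = [name.replace("\\", "/") for name in archive.namelist()]
    return "ApplicationManifest.xml" in names


def build_archive(entry_name: str) -> io.BytesIO:
    """Create a tiny in-memory zip with one entry, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(entry_name, "<ApplicationManifest />")
    buffer.seek(0)
    return buffer


good = build_archive("ApplicationManifest.xml")
nested = build_archive("PatchOrchestrationApplication/ApplicationManifest.xml")
print(manifest_at_root(good), manifest_at_root(nested))
```

Passing the path of a downloaded .sfpkg file instead of an in-memory buffer works the same way, since `zipfile.ZipFile` accepts either.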

CoordinatorService crash if no completed repair tasks

On version 1.4.2:

If there are no completed repair tasks and the approval policy is set to NodeWise, the Coordinator Service crashes on startup.

Aggregate throws "System.InvalidOperationException: Sequence contains no elements":

RepairTask lastCompletedTask = (await this.GetCompletedRepairTasks(nodeList, cancellationToken))?.Aggregate(
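
For context: in C#, `Enumerable.Aggregate` without a seed throws on an empty sequence, and the `?.` operator only guards against null, not against an empty list. The same pitfall and a possible guard can be shown with Python's `functools.reduce`, which also fails on an empty input when no initial value is given. This is an analogy, not the project's code: the tuple layout and the comparison by completion time are assumptions, since the original statement is truncated.

```python
from functools import reduce
from typing import List, Optional, Tuple

# Hypothetical task shape: (task id, completion time as a number).
Task = Tuple[str, int]


def latest_completed(tasks: Optional[List[Task]]) -> Optional[Task]:
    """Return the most recently completed task, or None when there is none."""
    if not tasks:
        # Covers both None and an empty list, so reduce never sees an empty input.
        return None
    return reduce(lambda best, cur: cur if cur[1] > best[1] else best, tasks)


print(latest_completed([]))
print(latest_completed([("a", 10), ("b", 30), ("c", 20)]))
```

In C#, the equivalent guard would be an explicit emptiness check before calling `Aggregate`, or an overload that takes a seed value.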

POA constantly creates repair items when non-Windows OS updates fail to install.

SF repair tasks are constantly disabling our nodes when certain non-Windows OS updates fail to install. In this case, there was a driver that was failing. Is there a way to limit the number of retries for a specific type of update (non-Windows OS)? The WURescheduleCount is set to 5, however, that does not seem to impact driver updates.

Fiddler Logs Attached with PA Update History
sf-update-history.zip

Specific Failure

{
    "OperationResult": 2,
    "NodeName": "_awet201_2",
    "OperationTime": "2020-03-27T02:03:12.240683Z",
    "OperationStartTime": "2020-03-27T01:58:02.8892952Z",
    "UpdateDetails": [
        {
            "UpdateId": "f9fcbc6f-5349-4f1f-bc13-2f098ddf9622",
            "Title": "Microsoft driver update for Generic / Text Only",
            "Description": "This driver was provided by Microsoft for support of Generic / Text Only",
            "ResultCode": 2,
            "HResult": -2145124329
        }
    ],
    "OperationType": 1,
    "WindowsUpdateQuery": "IsInstalled=0",
    "WindowsUpdateFrequency": "Daily,01:00:00",
    "RebootRequired": false
}
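
The `HResult` in this record is a signed 32-bit value, which makes it hard to look up. A small Python helper (the name `hresult_to_hex` is just for illustration) converts it to the usual unsigned hexadecimal form:

```python
def hresult_to_hex(value: int) -> str:
    """Format a signed 32-bit HRESULT as an unsigned hexadecimal string."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return f"0x{value & 0xFFFFFFFF:08X}"


print(hresult_to_hex(-2145124329))  # prints 0x80240017
```

The resulting code, 0x80240017, appears to correspond to WU_E_NOT_APPLICABLE in the Windows Update error code reference, though that mapping is worth double-checking against the official list.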

Patch Orchestration Application (POA) does not install updates in gMSA security cluster

I have a standalone cluster (6.5.664.9590) on Windows Server 2019 with gMSA security.
I successfully deployed Patch Orchestration Application (POA) v1.4.1.
In the GPO, I set "Notify to download updates".
POA successfully finds and downloads updates but does not install them. The Node Agent NTService does not create repair tasks for installing updates on the nodes.
Get-ServiceFabricRepairTask returns nothing.
Every ~3 minutes the System log shows "The Windows Modules Installer service entered the running state." followed a few seconds later by "The Windows Modules Installer service entered the stopped state."
Scr_1
Scr_2
In an unsecured dev cluster, POA works fine.

Patch Orchestration Application is not able to come up on some of the nodes and fails with the error that the ServiceType is not registered within the configured timeout

image

This happens because POA tries to execute logman stop in its setup scripts, which cannot complete because the Task Scheduler on the node is hung.

To confirm this issue, check that:
1. The nodes run Windows Server 2016.
2. Defender version 1.1.157000.* is installed on them.
3. The Task Scheduler is in a hung state.
For more information, see https://blogs.msdn.microsoft.com/azureservicefabric/2019/03/08/known-issue-for-service-fabric-windows-server-2016-clusters/

POA fails when WindowsUpdate registry key is missing

On some fresh Windows Server installs, the following registry key is missing:

HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU

This missing registry key appears to be a widespread issue. Incidentally on these machines, Windows Update appears to be by default configured to auto-download and install.

This then causes WindowsAutoUpdateUtility\LogCurrentAUValues to crash with a NullReferenceException on line 43, as auKey is null.

I've fixed this in my cluster by rolling out a Group Policy forcing Automatic Updates to "2 - Notify Download and Install", which subsequently creates the above registry key.

This could also be fixed if the POA creates the "WindowsUpdate\AU" subkeys under "HKLM\SOFTWARE\Policies\Microsoft\Windows".

At the very least, if the key is not expected to exist unless Windows Update is manually configured, the error message should be helpful, as opposed to crashing on a Null Reference Exception.
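
The kind of guard requested here can be sketched as follows. This is a language-neutral illustration in Python, not the `WindowsAutoUpdateUtility` code: `describe_au_values` is a hypothetical helper, and the registry key is modeled as an optional mapping so that the missing-key case is explicit.

```python
from typing import Mapping, Optional

AU_KEY = r"HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"


def describe_au_values(au_key: Optional[Mapping[str, object]]) -> str:
    """Describe the AU policy values, or explain clearly that the key is missing."""
    if au_key is None:
        # The key does not exist: report it instead of dereferencing a null value.
        return f"Registry key {AU_KEY} not found; no Windows Update policy is configured."
    return ", ".join(f"{name}={value}" for name, value in sorted(au_key.items()))


print(describe_au_values(None))
print(describe_au_values({"NoAutoUpdate": 0, "AUOptions": 2}))
```

The point is only that the missing key is handled as a normal, reportable state rather than surfacing as a null reference failure.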

Build.ps1 fails with 401 unauthorized error.

PS E:\service-fabric-POA> .\build.ps1
Source root is E:\service-fabric-POA\src\PatchOrchestrationApplication\PatchOrchestrationApplication
Restoring NuGet package Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.6.7.
Adding package 'Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.6.7' to folder 'E:\service-fabric-POA\packages'
Added package 'Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.6.7' to folder 'E:\service-fabric-POA\packages'

NuGet Config files used:
E:\service-fabric-POA\nuget.config

Feeds used:
C:\Users\brkhande\AppData\Local\NuGet\Cache
C:\Users\brkhande\.nuget\packages
https://api.nuget.org/v3/index.json

Installed:
1 package(s) to packages.config projects
Changing the working directory to E:\service-fabric-POA\src\PatchOrchestrationApplication\PatchOrchestrationApplication
Using msbuild from C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\MSBuild\15.0\Bin\MSBuild.exe
Restoring packages for E:\service-fabric-POA\src\PatchOrchestrationApplication\TelemetryLib\src\TelemetryLib.csproj...
Restoring packages for E:\service-fabric-POA\src\PatchOrchestrationApplication\NodeAgentService\src\NodeAgentService.csproj...
Restoring packages for E:\service-fabric-POA\src\PatchOrchestrationApplication\NodeAgentNTService\src\NodeAgentNTService.csproj...
Restoring packages for E:\service-fabric-POA\src\PatchOrchestrationApplication\NodeAgentSFUtility\src\NodeAgentSFUtility.csproj...
Restoring packages for E:\service-fabric-POA\src\PatchOrchestrationApplication\CoordinatorService\src\CoordinatorService.csproj...
C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\CommonExtensions\Microsoft\NuGet\NuGet.targets(114,5): warning : The plugin credential provider could not acquire credentials. Authentication may require manual action. Consider re-running the command with --interactive for dotnet, /p:NuGetInteractive="true" for MSBuild or removing the -NonInteractive switch for NuGet [E:\service-fabric-POA\src\PatchOrchestrationApplication\PatchOrchestrationApplication\PatchOrchestrationApplication.sfproj]
C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\CommonExtensions\Microsoft\NuGet\NuGet.targets(114,5): error : Unable to load the service index for source https://msazure.pkgs.visualstudio.com/_packaging/ServiceFabricESTools/nuget/v3/index.json. [E:\service-fabric-POA\src\PatchOrchestrationApplication\PatchOrchestrationApplication\PatchOrchestrationApplication.sfproj]
C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\CommonExtensions\Microsoft\NuGet\NuGet.targets(114,5): error : Response status code does not indicate success: 401 (Unauthorized). [E:\service-fabric-POA\src\PatchOrchestrationApplication\PatchOrchestrationApplication\PatchOrchestrationApplication.sfproj]

Patch operation getting stuck in the stages after installation is completed and restart is attempted

This happens because the Windows system service (POSNodeSvc) that orchestrates installation and restart on the nodes does not come up after the restart. To mitigate this, the repair task associated with it needs to be completed manually.

POSNodeSvc is started from the folder D:\PatchOrchestrationApplication, and the D drive is temporary. Our theory is that in the few cases where we hit this issue, the D drive was wiped and the system service could not come up.
