Skip to main content

Web App outage during maintenance window - File Server is down. What should I do?

If you want to host your web application or a REST API inside Microsoft Azure then App Services should be on your shortlist. The SLA of the service is 99.95% availability, but we need to be careful on how to handle maintenance windows. 

PaaS and Maintenance Window
Like any other PaaS service, there are specific maintenance windows that are used by Microsoft to upgrade the software and the hardware that is behind the Azure Services. When the availability of your system is important, you need to have the application deployed in two different App Services Plans (ideally from different Azure Regions).
You might ask yourself why this is important? You might want to be protected for situations when things don't go as planned. For example, there are isolated situations when during maintenance windows things don't go as planned and your web application is unavailable until you restart it.

What happens when File Server of App Service goes down?
Let's analyze the following case and see what are the options that we have. The File Server that connects the App Service Worker to the Blob storage where the files are hosted crash. This might happen when:
  • An update to the File Server during a maintenance window does not end with success and the crash is not detected by the health monitoring system
  • The file access latency increase and the File Server is replaced with another one. During the replacement procedure, something happens and the switch is not done with success
In these situations, the end customer will not able to access the application that is hosted by App Service. In the monitoring dashboard of the App Service, there are no errors, you will just see that there is no traffic. No logs, no data, no error - nothing.
You might ask why this is happening, why I cannot see anything? It is kind of normal because the application package was not loaded from the blob storage. There is no application running anymore, no files of the system or of your application. The application cannot be loaded fully and crash during the initialization phase. 


Possible solutions
There are two action points that you can do to mitigate this situation.
1. Local Cache
The first one is to activate the local cache of the application. When you have the local cache active, a local copy of the web application content is copied on the worker role itself. In the case of a failure of the storage, you would have the content cached locally. 
There are multiple benefits of having the local cache active like:
  • Fewer application restart because of shared storage changes
  • Low latency of storage access because you don't need to access the Azure Storage anymore
  • During a maintenance window, the disruptions caused by an upgrade of the storage layer (e.g. File Server) are reduced drastically 
The default size of the cache is 1000MB. I would recommend to start with this value and only if necessary to increase it. A higher value of the cache size can increase the loading time of the application. You could reduce the cache size to improve the loading time. 
The local cache can be activated if you:
  • Per web application by adding to App Settings the WEBSITE_LOCAL_CACHE_OPTION setting with value 'Always'
  • Per App Service by adding WEBSITE_LOCAL_CACHE_OPTION to the properties list of the ARM template of the App Service. 

2. Enable stdout logs
To be able to identify why your application is not starting, because of some error that might occur during the process start, you should ensure that the stdout logs are logged. Using Kudu or other similar tools you could visualize the output.
You should ensure that these logs are written on the Worker as file not pushed to an endpoint (e.g. Application Insights) because the application might not be able to load the required packages to do an HTTP request.
      <handlers>
        <add name="aspNetCore" path="*" verb="*" modules="AspNetCoreModule" resourceType="Unspecified" />
      </handlers>
      <aspNetCore processPath="dotnet" arguments=".\"<project name="">.dll" stdoutLogEnabled="true" stdoutLogFile="\\?\%home%\LogFiles\stdout" />  
As you can see above, you can had a new handler in the system.webServer that enables the stdout for all assemblies and redirect the output to a local file that can accessed using FTP or Kudu. The logs would contain the errors related to file not found or loading issues. 

Remarks
You might ask yourself why not to define an Azure Functions that check if the application is alive every 5 minutes and force a restart of the web application. NO, you shall not do such a thing because the situation described above is not common. Even if I was unlucky to have the same issue twice in a 3 months time interval, this is an isolated exception.

Conclusion
When you use a cloud service and a specific technology stack (e.g. .NET, Java), you should all the time check what are the best practices and recommendations. The issue related to the failure of File Server that is part of an App Service during an update window could be avoided if we would activate the local cache of the website and ensure that stdout logs are persisted. 

Comments

Popular posts from this blog

Windows Docker Containers can make WIN32 API calls, use COM and ASP.NET WebForms

After the last post , I received two interesting questions related to Docker and Windows. People were interested if we do Win32 API calls from a Docker container and if there is support for COM. WIN32 Support To test calls to WIN32 API, let’s try to populate SYSTEM_INFO class. [StructLayout(LayoutKind.Sequential)] public struct SYSTEM_INFO { public uint dwOemId; public uint dwPageSize; public uint lpMinimumApplicationAddress; public uint lpMaximumApplicationAddress; public uint dwActiveProcessorMask; public uint dwNumberOfProcessors; public uint dwProcessorType; public uint dwAllocationGranularity; public uint dwProcessorLevel; public uint dwProcessorRevision; } ... [DllImport("kernel32")] static extern void GetSystemInfo(ref SYSTEM_INFO pSI); ... SYSTEM_INFO pSI = new SYSTEM_INFO(...

ADO.NET provider with invariant name 'System.Data.SqlClient' could not be loaded

Today blog post will be started with the following error when running DB tests on the CI machine: threw exception: System.InvalidOperationException: The Entity Framework provider type 'System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer' registered in the application config file for the ADO.NET provider with invariant name 'System.Data.SqlClient' could not be loaded. Make sure that the assembly-qualified name is used and that the assembly is available to the running application. See http://go.microsoft.com/fwlink/?LinkId=260882 for more information. at System.Data.Entity.Infrastructure.DependencyResolution.ProviderServicesFactory.GetInstance(String providerTypeName, String providerInvariantName) This error happened only on the Continuous Integration machine. On the devs machines, everything has fine. The classic problem – on my machine it’s working. The CI has the following configuration: TeamCity .NET 4.51 EF 6.0.2 VS2013 It see...

Navigating Cloud Strategy after Azure Central US Region Outage

 Looking back, July 19, 2024, was challenging for customers using Microsoft Azure or Windows machines. Two major outages affected customers using CrowdStrike Falcon or Microsoft Azure computation resources in the Central US. These two outages affected many people and put many businesses on pause for a few hours or even days. The overlap of these two issues was a nightmare for travellers. In addition to blue screens in the airport terminals, they could not get additional information from the airport website, airline personnel, or the support line because they were affected by the outage in the Central US region or the CrowdStrike outage.   But what happened in reality? A faulty CrowdStrike update affected Windows computers globally, from airports and healthcare to small businesses, affecting over 8.5m computers. Even if the Falson Sensor software defect was identified and a fix deployed shortly after, the recovery took longer. In parallel with CrowdStrike, Microsoft provi...