
Web App outage during maintenance window - File Server is down. What should I do?

If you want to host your web application or a REST API inside Microsoft Azure, then App Service should be on your shortlist. The SLA of the service is 99.95% availability, but we need to be careful about how maintenance windows are handled.

PaaS and Maintenance Window
Like any other PaaS offering, App Service has specific maintenance windows that Microsoft uses to upgrade the software and hardware behind the Azure services. When the availability of your system is important, you need to have the application deployed in two different App Service Plans (ideally in different Azure Regions), as sketched below.
You might ask yourself why this is important. You want to be protected for situations when things don't go as planned. For example, there are isolated situations when something goes wrong during a maintenance window and your web application is unavailable until you restart it.
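As a rough sketch only (the plan names, locations, SKU, and apiVersion below are placeholders I chose for illustration, not values from this post), the resources section of an ARM template describing two such App Service Plans in two regions could look like this, each plan hosting its own copy of the application:
      [
        {
          "type": "Microsoft.Web/serverfarms",
          "apiVersion": "2018-02-01",
          "name": "myapp-plan-westeurope",
          "location": "West Europe",
          "sku": { "name": "S1", "tier": "Standard" }
        },
        {
          "type": "Microsoft.Web/serverfarms",
          "apiVersion": "2018-02-01",
          "name": "myapp-plan-northeurope",
          "location": "North Europe",
          "sku": { "name": "S1", "tier": "Standard" }
        }
      ]
Traffic between the two deployments would then be distributed by a global routing service (for example Azure Traffic Manager) placed in front of them.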

What happens when the File Server of an App Service goes down?
Let's analyze the following case and see what options we have. The File Server that connects the App Service Worker to the Blob Storage where the application files are hosted crashes. This might happen when:
  • An update to the File Server during a maintenance window does not complete successfully and the crash is not detected by the health monitoring system
  • The file access latency increases and the File Server is replaced with another one, but something goes wrong during the replacement procedure and the switch does not complete successfully
In these situations, the end customer will not be able to access the application that is hosted by the App Service. In the monitoring dashboard of the App Service there are no errors; you will just see that there is no traffic. No logs, no data, no errors - nothing.
You might ask why this is happening and why you cannot see anything. It is actually expected, because the application package was never loaded from Blob Storage. There is no application running anymore and no files, either of the platform or of your application. The application cannot be loaded fully and crashes during the initialization phase.


Possible solutions
There are two actions you can take to mitigate this situation.
1. Local Cache
The first one is to activate the local cache of the application. When the local cache is active, a copy of the web application content is stored on the worker instance itself. In the case of a storage failure, you still have the content cached locally.
There are multiple benefits to having the local cache active:
  • Fewer application restarts caused by shared storage changes
  • Lower storage access latency, because you no longer need to access Azure Storage
  • During a maintenance window, the disruptions caused by an upgrade of the storage layer (e.g. File Server) are reduced drastically
The default size of the cache is 1000 MB. I recommend starting with this value and increasing it only if necessary. A larger cache can increase the loading time of the application, so you can also reduce the cache size to improve the loading time.
The local cache can be activated:
  • Per web application, by adding the WEBSITE_LOCAL_CACHE_OPTION setting with the value 'Always' to the App Settings
  • Per App Service, by adding WEBSITE_LOCAL_CACHE_OPTION to the app settings of the App Service's ARM template, as sketched below
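For illustration, here is a minimal ARM template fragment that enables the local cache through app settings. It is only a sketch: the apiVersion and the parameter names are assumptions of mine, not taken from an existing template. The optional WEBSITE_LOCAL_CACHE_SIZEINMB setting controls the size of the cache discussed above (1000 MB by default).
      {
        "type": "Microsoft.Web/sites",
        "apiVersion": "2018-02-01",
        "name": "[parameters('webAppName')]",
        "location": "[resourceGroup().location]",
        "properties": {
          "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', parameters('appServicePlanName'))]",
          "siteConfig": {
            "appSettings": [
              { "name": "WEBSITE_LOCAL_CACHE_OPTION", "value": "Always" },
              { "name": "WEBSITE_LOCAL_CACHE_SIZEINMB", "value": "1000" }
            ]
          }
        }
      }
The same two settings can also be added from the portal, under the application settings of the web app.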

2. Enable stdout logs
To be able to identify why your application is not starting, because of some error that might occur during process start, you should ensure that the stdout output is logged. Using Kudu or similar tools you can then inspect the output.
You should ensure that these logs are written to a file on the Worker and not pushed to an endpoint (e.g. Application Insights), because the application might not be able to load the packages required to make an HTTP request.
      <configuration>
        <system.webServer>
          <handlers>
            <add name="aspNetCore" path="*" verb="*" modules="AspNetCoreModule" resourceType="Unspecified" />
          </handlers>
          <aspNetCore processPath="dotnet" arguments=".\<project name>.dll" stdoutLogEnabled="true" stdoutLogFile="\\?\%home%\LogFiles\stdout" />
        </system.webServer>
      </configuration>
As you can see above, you add a new handler inside system.webServer and configure the aspNetCore element to enable stdout logging, redirecting the output to a local file that can be accessed using FTP or Kudu. The logs will contain the errors related to missing files or assembly loading issues.

Remarks
You might ask yourself why not define an Azure Function that checks every 5 minutes whether the application is alive and forces a restart of the web application. No, you should not do such a thing, because the situation described above is not common. Even if I was unlucky enough to hit the same issue twice in a three-month interval, it remains an isolated exception.

Conclusion
When you use a cloud service with a specific technology stack (e.g. .NET, Java), you should always check the best practices and recommendations. The issue caused by the failure of the File Server that is part of an App Service during an update window can be mitigated by activating the local cache of the website and ensuring that the stdout logs are persisted.
