Web App outage during maintenance window - File Server is down. What should I do?

If you want to host your web application or a REST API inside Microsoft Azure then App Services should be on your shortlist. The SLA of the service is 99.95% availability, but we need to be careful on how to handle maintenance windows.

PaaS and Maintenance Window

Like any other PaaS service, there are specific maintenance windows that are used by Microsoft to upgrade the software and the hardware that is behind the Azure Services. When the availability of your system is important, you need to have the application deployed in two different App Services Plans (ideally from different Azure Regions).

You might ask yourself why this is important? You might want to be protected for situations when things don't go as planned. For example, there are isolated situations when during maintenance windows things don't go as planned and your web application is unavailable until you restart it.

What happens when File Server of App Service goes down?

Let's analyze the following case and see what are the options that we have. The File Server that connects the App Service Worker to the Blob storage where the files are hosted crash. This might happen when:

An update to the File Server during a maintenance window does not end with success and the crash is not detected by the health monitoring system
The file access latency increase and the File Server is replaced with another one. During the replacement procedure, something happens and the switch is not done with success

In these situations, the end customer will not able to access the application that is hosted by App Service. In the monitoring dashboard of the App Service, there are no errors, you will just see that there is no traffic. No logs, no data, no error - nothing.

You might ask why this is happening, why I cannot see anything? It is kind of normal because the application package was not loaded from the blob storage. There is no application running anymore, no files of the system or of your application. The application cannot be loaded fully and crash during the initialization phase.

Possible solutions

There are two action points that you can do to mitigate this situation.

1. Local Cache

The first one is to activate the local cache of the application. When you have the local cache active, a local copy of the web application content is copied on the worker role itself. In the case of a failure of the storage, you would have the content cached locally.

There are multiple benefits of having the local cache active like:

Fewer application restart because of shared storage changes
Low latency of storage access because you don't need to access the Azure Storage anymore
During a maintenance window, the disruptions caused by an upgrade of the storage layer (e.g. File Server) are reduced drastically

The default size of the cache is 1000MB. I would recommend to start with this value and only if necessary to increase it. A higher value of the cache size can increase the loading time of the application. You could reduce the cache size to improve the loading time.

The local cache can be activated if you:

Per web application by adding to App Settings the WEBSITE_LOCAL_CACHE_OPTION setting with value 'Always'
Per App Service by adding WEBSITE_LOCAL_CACHE_OPTION to the properties list of the ARM template of the App Service.

2. Enable stdout logs

To be able to identify why your application is not starting, because of some error that might occur during the process start, you should ensure that the stdout logs are logged. Using Kudu or other similar tools you could visualize the output.

You should ensure that these logs are written on the Worker as file not pushed to an endpoint (e.g. Application Insights) because the application might not be able to load the required packages to do an HTTP request.

</handlers>

<aspNetCore processPath="dotnet" arguments=".\"<project name="">.dll" stdoutLogEnabled="true" stdoutLogFile="\\?\%home%\LogFiles\stdout" />

As you can see above, you can had a new handler in the system.webServer that enables the stdout for all assemblies and redirect the output to a local file that can accessed using FTP or Kudu. The logs would contain the errors related to file not found or loading issues.

Remarks

You might ask yourself why not to define an Azure Functions that check if the application is alive every 5 minutes and force a restart of the web application. NO, you shall not do such a thing because the situation described above is not common. Even if I was unlucky to have the same issue twice in a 3 months time interval, this is an isolated exception.

Conclusion

When you use a cloud service and a specific technology stack (e.g. .NET, Java), you should all the time check what are the best practices and recommendations. The issue related to the failure of File Server that is part of an App Service during an update window could be avoided if we would activate the local cache of the website and ensure that stdout logs are persisted.

Cloud as a Story - Vunvulea Radu

Search This Blog

Web App outage during maintenance window - File Server is down. What should I do?

Comments

Post a Comment

Popular posts from this blog

How to audit an Azure Cosmos DB

Cloud Myths: Cloud is Cheaper (Pill 1 of 5 / Cloud Pills)

Cloud Myths: Migrating to the cloud is quick and easy (Pill 2 of 5 / Cloud Pills)