Debugging in production

Abstract
This article shows how to create a memory dump and introduces the basic tools for analyzing it. Dump files give us access to information we cannot normally reach; some data is available only through a dump and not by other means, such as debugging in Visual Studio.
These tools are very powerful, but they are rather difficult to use because they demand a good deal of background knowledge.

How many times have you had a problem in production or in the testing environment that you could not reproduce on the development machine? When this happens, things can go off track, and we try out different ways of remote debugging. Helpful tools may be close at hand without our knowing it, but we ignore them or simply don’t know how to use them.
In this article I will present different ways in which we can debug without using Visual Studio.
Why not use Visual Studio?
Though Visual Studio is an extremely good product that helps us find and debug problems, it won’t be of great help in production. The moment we have a bug in production, the rules of the game change. In production, the application is compiled for release, so the step-by-step debugging we are used to is no longer an option.
When do we need these tools?
We need them the moment we cannot reproduce the problem on our development machines. Since we cannot reproduce it, finding the cause is like looking for a needle in a haystack.
If the problem does show up by chance but we have no reproduction scenario, we are back in the same situation.
Another case is when the memory used by our application grows over time, and this happens only on the production machines. We can only guess at the problem without knowing its exact cause, which is why we may end up “fixing” totally unrelated areas of the code.
What solutions do we have?
Generally, there are two possibilities at hand. The first one is based entirely on logs. Through logs we can identify the areas of the application that do not work properly. But logs are a double-edged sword. We need to know exactly what must appear in the logs and how often. Otherwise, we may end up with thousands of pages of logs that are useless and almost impossible to analyze. Too much logging can also catch us off guard by changing the behavior of the application itself.
When possible, we can also deploy the PDB files to the production machine. This way we get the complete stack trace of an exception, including source file names and line numbers.
Logs can be of great help in solving problems that emerge in production, but they won’t help every time. Some problems are extremely difficult to identify from logs alone. A deadlock, for example, is almost impossible to identify by means of logs.
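
The point about deciding how often something gets logged can be illustrated with a small sketch. It is written in Python only to keep it short; the class name RateLimitFilter and its parameters are hypothetical, and the same idea can be applied with any logging framework.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Suppress repeats of the same log message inside a time window.

    Hypothetical helper: it keeps the first occurrence of a message and
    drops identical ones until min_interval_seconds have passed.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        super().__init__()
        self.min_interval_seconds = min_interval_seconds
        self.clock = clock
        self._last_emitted = {}  # (logger, level, template) -> last emit time

    def filter(self, record):
        now = self.clock()
        # record.msg is the unformatted template, so "queue full: %s"
        # counts as one message regardless of its arguments.
        key = (record.name, record.levelno, record.msg)
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.min_interval_seconds:
            return False  # too soon, drop this repeat
        self._last_emitted[key] = now
        return True


# Usage: at most one identical warning per minute from this logger.
logger = logging.getLogger("orders")
logger.addFilter(RateLimitFilter(min_interval_seconds=60))
```

A filter like this keeps the log readable during an incident, at the cost of hiding how many times the message actually occurred.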
The second possibility is to create memory dumps and analyze them.
What is a memory dump?
A memory dump is a snapshot of the process at a certain moment. Besides information on memory allocation, the snapshot also contains the state of the threads, objects and code. From it we can obtain very valuable information about the running process. The snapshot represents the image of the memory in 32-bit or 64-bit format, depending on the system.
Generally, there are two types of memory dump. The first one is the minidump. This is the simplest memory dump that can be created; it contains only basic information such as the state of the process and the call stacks of its threads.

The second type is the full dump. It contains all of the information that can be obtained, including a snapshot of the process memory. A full dump takes much longer to create than a minidump, and the dump file itself is much bigger.
How can we generate a memory dump?
There are different applications that allow us to do this. Some of them can generate a dump automatically, based on different parameters.
When we need to generate a memory dump, the easiest solution is Task Manager. All we have to do is right-click a process and select “Create dump file”. We can do the same thing with Visual Studio or with “adplus.exe”. The latter ships with Debugging Tools for Windows, a package that can be installed on almost any machine on which Windows runs.
In the following example, we tell adplus to create a memory dump right now:

adplus -hang -o C:\myDump -pn MyApp.exe

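The -pn option specifies the name of the process for which we want to create a dump. If we want the dump to be created automatically when the process crashes, we can use the -crash option instead of -hang. The second command below uses -sc, which makes adplus launch the application itself instead of attaching to a process that is already running.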

adplus -crash -o C:\myDump -pn MyApp.exe
adplus -crash -o C:\myDump -sc MyApp.exe
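
When these commands have to be issued from a deployment or monitoring script, it can help to assemble them in one place. The following is only a sketch in Python: the function build_adplus_command is hypothetical, and it covers only the option combinations shown above (-hang or -crash, -o, and either -pn or -sc).

```python
def build_adplus_command(mode, output_dir, process_name=None, spawn_command=None):
    """Return the adplus argument list for the three variants shown above.

    Hypothetical helper: it only assembles the arguments, it does not run
    adplus. Pass the result to subprocess.run on a machine that has
    Debugging Tools for Windows installed.
    """
    if mode not in ("hang", "crash"):
        raise ValueError("mode must be 'hang' or 'crash'")
    if (process_name is None) == (spawn_command is None):
        raise ValueError("pass exactly one of process_name or spawn_command")
    if mode == "hang" and spawn_command is not None:
        # The article only shows -sc together with -crash.
        raise ValueError("spawn_command is only supported with mode='crash'")

    args = ["adplus", f"-{mode}", "-o", output_dir]
    if process_name is not None:
        args += ["-pn", process_name]   # attach to a running process by name
    else:
        args += ["-sc", spawn_command]  # let adplus start the process
    return args


# Usage: the first command from the text.
print(build_adplus_command("hang", r"C:\myDump", process_name="MyApp.exe"))
```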

If we need a dump to be created automatically, besides “adplus.exe” we can use DebugDiag and “clrdmp.dll”. The three options are rather similar. DebugDiag, for example, can be set up to generate a memory dump automatically when CPU usage stays above X% for a certain time span.
Besides these tools there are many others on the market. Depending on your requirements, you can use any tool of this type.
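
The CPU rule mentioned above ("higher than X% within a certain time span") boils down to a simple check over recent samples. The sketch below, in Python, shows one possible reading of that rule; it is not how any of the tools above implement it, and the function name and parameters are assumptions made for illustration.

```python
import math


def should_create_dump(cpu_samples, threshold_percent, required_seconds,
                       sample_interval_seconds):
    """Return True when CPU usage stayed above the threshold for the whole span.

    cpu_samples holds CPU percentages taken every sample_interval_seconds,
    oldest first. Only the most recent samples covering required_seconds
    are inspected.
    """
    needed = math.ceil(required_seconds / sample_interval_seconds)
    if needed <= 0 or len(cpu_samples) < needed:
        return False  # not enough history yet
    return all(sample > threshold_percent for sample in cpu_samples[-needed:])


# Usage: samples every 10 seconds, trigger after 30 seconds above 90%.
print(should_create_dump([10, 95, 96, 97], 90, 30, 10))
```

Requiring every sample in the window to exceed the threshold avoids dumps caused by a single short spike.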
How do we analyze a dump?
The native debugger for a dump is Windbg. This is a powerful tool that can extract very valuable information. Its only problem is that it is not very user-friendly. We will see a little later what the alternatives to Windbg are. Keep in mind that almost all of these alternatives use this debugger behind the scenes; they just put a friendlier and more useful interface on top of it.
One alternative to Windbg is Visual Studio 2010 or any later version. Beginning with Visual Studio 2010, we can analyze dumps of .NET 4.0+ applications. What we can do in Visual Studio is not as advanced as what Windbg allows, but it is generally sufficient.
Windbg

The first step after opening Windbg is to load a dump (CTRL+D). Once loaded, a dump can be inspected in different ways. For example, we can analyze the threads, the memory, the allocated resources and so on.
To do more, for instance to inspect and analyze managed code, we need to load additional extensions such as Son of Strike (SOS) or Son of Strike Extension (SOSEX). These two extensions open new doors for us, as they present the data from the dump in an extremely useful way.
Son of Strike (SOS)
SOS lets us look inside the process itself. It gives us access to the objects, the threads and the garbage collector information. We can even see the names of variables and their values.
All the information that can be accessed this way is part of the managed memory. SOS is therefore tightly coupled to the CLR and its version. When we load the SOS module, we must make sure we load the one that corresponds to the .NET version of our application.

.loadby sos mscorwks
.loadby sos clr

In the examples above, the first command loads the SOS module for .NET 3.5 and earlier, and the second one loads SOS for .NET 4.0+.
All SOS commands start with “!”. The basic command is “!help”. If we want to see the list of threads, we can use the “!threads” command, whose output is similar to the following:

0:000> !threads
ThreadCount: 5
UnstartedThread: 0
BackgroundThread: 2
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
                             Lock
     ID    OSID    ThreadOBJ Count Apt Exception
…

Debug a crash
So far we have seen that there are many tools available for creating and analyzing a dump. Now it is time to see what we have to do to analyze a crash.

1. Launch the process
2. Before it crashes, tell adplus to create a dump the moment the process crashes
adplus -crash -pn [processName]
3. Launch Windbg (after the crash)
3.1 Load the dump
3.2 Load SOS
3.3 !threads (to see which thread has crashed)
3.4 !PrintException (on the thread that has crashed, to see the exception)
3.5 !clrstack (to see the call stack)
3.6 !clrstack -a (to see the call stack together with the parameters and local variables)
3.7 !DumpHeap -type Exception (lists all the exception objects still on the managed heap, that is, not yet collected by the GC).
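
The Windbg part of these steps can also be run unattended. The sketch below, in Python, assumes that cdb.exe (the console debugger from the same Debugging Tools for Windows package) is available, that its -z option opens a dump file and its -c option runs an initial command string, and that these SOS commands can be chained with semicolons. The function build_cdb_command is hypothetical and only assembles the command line; it does not start the debugger.

```python
def build_cdb_command(dump_path, clr_module="clr", cdb_exe="cdb.exe"):
    """Return a cdb argument list that replays steps 3.1 to 3.7 on a dump.

    clr_module is "clr" for .NET 4.0+ and "mscorwks" for older runtimes.
    Assumption: the debugger's current thread after loading a crash dump
    is the one that crashed, so !PrintException needs no thread switch.
    """
    commands = [
        f".loadby sos {clr_module}",   # 3.2 load the matching SOS module
        "!threads",                    # 3.3 list the managed threads
        "!PrintException",             # 3.4 exception on the current thread
        "!clrstack -a",                # 3.5 / 3.6 call stack with parameters
        "!DumpHeap -type Exception",   # 3.7 exception objects on the heap
        "q",                           # quit the debugger when done
    ]
    return [cdb_exe, "-z", dump_path, "-c", "; ".join(commands)]


# Usage: print the command line for a hypothetical dump file.
print(build_cdb_command(r"C:\myDump\crash.dmp"))
```

Redirecting the output of such a run to a file gives a first report that can be read without opening the debugger.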

Keep in mind that the results depend on the way the application was compiled, for instance on whether the code was optimized during compilation. Moreover, the exception list may be quite long, because commands such as !DumpHeap return every exception object found, even the ones the runtime pre-creates, such as ThreadAbortException.
How do we identify a deadlock?
A deadlock emerges when two or more threads wait for each other: each one holds a resource that another one needs, so none of them can continue. In these cases, a part of the application, if not the entire application, gets blocked.
In this case, the first step is to create a dump using the command:

adplus -hang -o C:\myDump -pn [ProcessName]

Then we need to analyze the stack trace of each thread and see whether it is blocked (in Monitor.Enter, in a ReaderWriterLock acquire call…). Once we have identified these threads, we can find the resources each of them is waiting for, together with the thread that keeps these resources locked.
For these final steps, the “!syncblk” command comes to our help. It lists the sync blocks (the locks taken on objects) together with the thread that owns each of them.
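
Once we have written down, for each blocked thread, the lock it is waiting for and, for each lock, the thread that owns it, confirming the deadlock means finding a cycle in that information. The sketch below does this in Python; the function find_deadlock and the two dictionaries are hypothetical and would be filled in by hand from the stack traces and the “!syncblk” output.

```python
def find_deadlock(waiting_for, owned_by):
    """Return the threads that form a wait cycle, or [] if there is none.

    waiting_for maps a blocked thread to the lock it is waiting for.
    owned_by maps a lock to the thread that currently holds it.
    """
    for start in waiting_for:
        path = []
        seen = set()
        thread = start
        # Follow "thread waits for lock, lock is held by thread" links
        # until we reach a running thread or come back to a visited one.
        while thread in waiting_for and thread not in seen:
            seen.add(thread)
            path.append(thread)
            lock = waiting_for[thread]
            thread = owned_by.get(lock)
        if thread is not None and thread in seen:
            return path[path.index(thread):]  # only the threads in the cycle
    return []


# Usage: thread 1 waits for lock A held by thread 2, which waits for
# lock B held by thread 1.
print(find_deadlock({1: "A", 2: "B"}, {"A": 2, "B": 1}))
```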

This article was written by Radu Vunvulea for Today Software Magazine.
