There’s a famous Sherlock Holmes story whose resolution depends on Holmes recognizing that a dog not barking in the night during a break-in was a critical piece of information: the dog must’ve known the intruder.
I could’ve used Sherlock the last few days dealing with one of the most subtle — but still simple to solve — bugs I’ve encountered in a long time.
Here’s the context. I’d written my first WinUI 3 app. This is Microsoft’s latest framework — at v1 but still under intense development — for writing Windows desktop apps. It’s the successor to Windows Forms, WPF and UWP. So a good thing to learn :).
The app ran flawlessly inside the Visual Studio 2022 debugger. It also ran flawlessly after I published it as an MSIX package and installed it on my development machine.
And it failed, silently and almost instantly, on the two test Windows machines I installed it on. It wasn’t an installation problem: the install succeeded. But run? Almost nada. Worse yet, the log file contained only a single entry, written during startup while the dependency injection framework was initializing.
The Windows event log wasn’t of any use. Yes, it recorded the failure. But the details were ambiguous and, in the end, misleading (which, as it turns out, is pretty common in the situation I was in).
So how do you debug something on a machine that doesn’t have a debugger installed?
Well, you could install Visual Studio. But that’s a PITA and, more importantly, it creates a complex environment that isn’t going to be duplicated on most client machines. Who wants to tell people “install my cool app, but first install 20 GB of stuff so it will work”? Not me.
There are other debuggers you can install, though, which are much simpler. I used WinDbg (preview edition). It lets you run both executables and MSIX install packages. Unfortunately, while it worked, the exceptions it caught were impossible for me to interpret. I’m not dissing the software — it’s valuable for gathering information under difficult circumstances. But the result is really raw…which may mean something to a more skilled programmer. But not to me.
I also tried running ProcMon (Process Monitor). It, too, identified issues. But the information was also pretty raw, and the volume of actions it logged (basically every DLL the app opened, every time it touched one) was overwhelming. So no joy there, either.
It was time for desperate measures.
So I started by commenting out code lines, one by one. Compile. Package. Sign package. Install on test machine. See what happens. Repeat. Not an efficient way to debug. But it let me make progress.
I discovered that, for some reason, calls to my logging subsystem (which is based on Serilog, a great library) were causing crashes. It wasn’t clear that was the ultimate root cause. But it led me to remove logging calls, allowing me to move forward. In retrospect that should’ve been a huge tell. It was the dog in the night: the fact that “turning off” the logging system let the app get further should’ve told me the problem, or at least a problem, was in the logging subsystem itself.
I knew my app was able to access the local file system. So in desperation I wrote a simple logger — basically something that just appended whatever you gave it to a file in the app’s local storage area — and duplicated all those logging calls I’d commented out. Which eventually led me to the exception that was causing the crash. Which I finally realized was being thrown by a utility function used by the logging subsystem.
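For illustration, a fallback logger of that kind might look something like the sketch below. The class name `FallbackLog`, the file name and the folder choice are all assumptions made for the example. A packaged app might prefer `Windows.Storage.ApplicationData.Current.LocalFolder.Path` as the folder.

```csharp
using System;
using System.IO;

public static class FallbackLog
{
    private static readonly object Gate = new object();

    // Assumption: the per-user local app data folder is writable.
    // The path is settable so it can be pointed somewhere else.
    public static string LogPath { get; set; } = Path.Combine(
        Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
        "fallback.log");

    // Appends one timestamped line to the log file.
    public static void Write(string message)
    {
        try
        {
            lock (Gate)
            {
                File.AppendAllText(
                    LogPath,
                    $"{DateTime.Now:O} {message}{Environment.NewLine}");
            }
        }
        catch (Exception)
        {
            // Deliberately swallowed: the logger of last resort
            // must never be the thing that crashes the app.
        }
    }
}
```

Calling `FallbackLog.Write("reached step 3")` at each spot where a real logging call used to be gives a crude but dependable trail of how far the app got before it died.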
Once I figured out where the exception was occurring, recognizing the mistake took about two seconds. My logging subsystem includes information about the exact object type, source code file and source code line number where the log event is recorded. That’s very helpful during development (and is why I chose to extend Serilog, to provide all that)…but the source code paths get really long. So I wrote a utility function to trim the absolute path back to being a path relative to the project directory.
It works great on my development machine, either within Visual Studio or when the app is running standalone. But it can’t possibly work on any machine which doesn’t have my source code tree installed. Which, of course, no production machine will.
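Why would a path from the development machine show up on a test machine at all? One common way to capture the source file and line number in C# is the caller info attributes, which the compiler fills in at build time. Whether the logging extension described here uses them is an assumption, and `CallerInfoDemo` and `Describe` are made-up names, but the sketch shows the effect:

```csharp
using System;
using System.Runtime.CompilerServices;

public static class CallerInfoDemo
{
    // The compiler replaces the two optional arguments with the caller's
    // source file path and line number as they were when the code was built.
    public static string Describe(
        string message,
        [CallerFilePath] string sourceFile = "",
        [CallerLineNumber] int line = 0)
    {
        return $"{message} ({sourceFile}:{line})";
    }
}
```

A call such as `CallerInfoDemo.Describe("starting up")` carries the build machine's absolute path inside the compiled app, so that path travels to every machine the app is installed on, whether or not the folder exists there.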
The fix was simple: wrap the trimmer in a try/catch block, return a default value if you’re not on my development machine, and voilà![1] Problem solved!
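Here is a rough sketch of what a guarded trimmer could look like. It is a guess at the shape of the problem, not the original code: it assumes the project directory is found by walking up from the source file's folder until a `.csproj` turns up, and `SourcePathTrimmer`, `FindProjectDirectory` and `Trim` are hypothetical names. The default value returned here is just the bare file name.

```csharp
using System;
using System.IO;
using System.Linq;

public static class SourcePathTrimmer
{
    // Hypothetical lookup: walk up from the source file's folder until a
    // folder containing a .csproj file is found. Enumerating a folder that
    // does not exist throws DirectoryNotFoundException, which is what
    // happens on a machine without the source tree.
    private static string FindProjectDirectory(string sourceFilePath)
    {
        var dir = Path.GetDirectoryName(sourceFilePath);
        while (!string.IsNullOrEmpty(dir))
        {
            if (Directory.EnumerateFiles(dir, "*.csproj").Any())
                return dir;
            dir = Path.GetDirectoryName(dir);
        }
        throw new DirectoryNotFoundException("No project directory found.");
    }

    // Returns the path relative to the project directory, or just the
    // file name when the project directory cannot be located.
    public static string Trim(string sourceFilePath)
    {
        try
        {
            var projectDir = FindProjectDirectory(sourceFilePath);
            return Path.GetRelativePath(projectDir, sourceFilePath);
        }
        catch (Exception)
        {
            // Not on the development machine (or the lookup failed for
            // some other reason): fall back to a safe default.
            return Path.GetFileName(sourceFilePath);
        }
    }
}
```

The catch is deliberately broad. A helper that only decorates log messages should never be able to take the app down, so any failure in the lookup simply degrades to a shorter label.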
Too bad about the 2+ days of development time I won’t get back, though :).
[1] I’ll probably look for a way to simply tell if I’m in a production environment. But for now things work.