CASA Crash Reporting Reference

Overview

Like many software packages, CASA and its related applications occasionally crash; that is, the program stops abruptly, for example because of a segmentation fault. A crash can lose some of the user’s work and undermines the user’s confidence in the program. Unfortunately, when a crash occurs there is currently little information available to diagnose the problem. If the crash occurs frequently and reproducibly, it will likely result in a helpdesk ticket, which may in turn lead to the problem being investigated and fixed.

The purpose of the Crash Reporting System (CRS) is to capture useful debugging information when a crash occurs and automatically forward that information to the CASA development team. The crash report will be stored and, if a pattern is detected, a JIRA issue may be created and worked.

The CRS involves several steps. First, the crash must be caught by the software. Second, a set of potentially relevant information is captured (e.g., stack trace, platform configuration, user environment, and CASA configuration). Third, the report is posted to an NRAO web site by a crash posting feature developed for this purpose.

CASA Integration

Breakpad for Crash Dump Generation

The Google open-source package Breakpad is used to capture crash information when the application terminates. A Breakpad initialization routine is called at application startup; it installs handlers for fatal signals (e.g., SIGSEGV and SIGABRT) that the program might generate and that would terminate it unless handled. Since an unhandled C++ exception results in a SIGABRT, Breakpad catches those as well. When a fatal signal is routed to Breakpad's signal handler, crash handling begins. A separate thread (whose stack was preallocated during Breakpad initialization) is created to copy the process's stacks (the stacks of all threads are captured); this crash thread also captures other useful information, such as shared library information. Once the crash thread has finished, a fork creates a new, uncrippled process to finalize and post the crash report, while the original process is allowed to terminate. The actions of the posting application are discussed below.
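The fork-then-exec hand-off at the end of that sequence can be sketched in a few lines. This is only an illustration of the pattern, not Breakpad's or CASA's actual code: the function name spawn_poster is hypothetical, the signal handling and stack capture are omitted entirely, and the "poster" here is a stand-in Python one-liner.

```python
import os
import sys


def spawn_poster(argv):
    """Fork; the child execs the poster command, the parent gets the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: exec replaces the damaged process image with a fresh one.
        try:
            os.execv(argv[0], argv)
        finally:
            # Only reached if exec failed; exit without running cleanup handlers.
            os._exit(127)
    return pid


if __name__ == "__main__":
    # Stand-in for the poster application (assumption): a no-op Python command.
    child = spawn_poster([sys.executable, "-c", "pass"])
    # The sketch waits so the result can be shown; per the description above,
    # the crashing process would simply terminate instead.
    _, status = os.waitpid(child, 0)
    print("poster exit code:", os.WEXITSTATUS(status))
```

The point of the exec is that the new process does not depend on whatever state the crash left behind in the original one.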

Using the Crash Reporter

In 5.0 the crash reporter feature was released as opt-in, and selected users were provided with instructions on how to enable and test the feature. The above hyperlink goes to a copy of the documentation, since I'm not sure how to link to the document as it appears on Plone.

CASA Integration Details

The casapy application uses a Python interface to initialize Breakpad. A new method, _crash_reporter_initialize, was added to the utilstool. This method is called from one of the Python modules executed during casapy startup. The tool's C++ layer calls routines in stdcasa/StdCasa/CrashReporter.cc.

The plotms application calls the routines in CrashReporter.cc directly, since it is a C++ application.

Web Interface for Receiving Reports

Crash Poster Application

The Crash Poster application is exec'd from a fork of the crashing CASA application (the source is located in the code/crash/apps/reporter directory). As command-line parameters it expects the complete path to the crash dump file created by Breakpad, the URL to which the crash report is to be posted, and optionally the path to the casapy log. The poster app captures some additional information about the host using various shell facilities (e.g., uname). All of these files are then collected into a compressed tar file, which is sent to the given URL with an HTTP POST. The crash dump is written to the platform's temporary storage (e.g., /tmp on Linux), and all the additional files created by the poster app are created there as well. When the crash report has been posted successfully, all these files are removed from the disk. The poster appends a record of its operations to a log file, CrashReporter.log, located in the temporary directory.
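As a rough illustration of the bundling and posting steps, here is a minimal Python sketch. It is not the poster's actual source: the function names, the archive name crash-report.tar.gz, and the Content-Type header are all assumptions, only one host-information file is produced, and the request is prepared but never sent because the real upload format is not described here.

```python
import platform
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def build_report_archive(dump_path, work_dir, log_path=None):
    """Bundle the dump, a host-details file, and an optional log into a .tar.gz."""
    work_dir = Path(work_dir)
    uname_file = work_dir / "unameinfo.txt"
    # platform.uname() stands in for running the uname shell command.
    uname_file.write_text(" ".join(platform.uname()) + "\n")

    members = [Path(dump_path), uname_file]
    if log_path is not None:
        members.append(Path(log_path))

    archive = work_dir / "crash-report.tar.gz"  # hypothetical archive name
    with tarfile.open(archive, "w:gz") as tar:
        for member in members:
            tar.add(member, arcname=member.name)
    return archive


def make_post_request(archive, url):
    """Prepare (but do not send) an HTTP POST carrying the archive bytes."""
    return urllib.request.Request(
        url,
        data=Path(archive).read_bytes(),
        method="POST",
        headers={"Content-Type": "application/gzip"},  # assumed content type
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dump = Path(tmp) / "example.dmp"
        dump.write_bytes(b"placeholder dump contents")
        archive = build_report_archive(dump, tmp)
        with tarfile.open(archive) as tar:
            print(sorted(tar.getnames()))
        request = make_post_request(archive, "https://example.invalid/upload")
        print(request.get_method(), request.full_url)
```

Sending the request would be a call to urllib.request.urlopen(request); the sketch stops short of that so it can be run without touching any server.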

Web Application

The web application that accepts the HTTP POST was written by Stephan Witz and services the URL https://casa.nrao.edu/cgi-bin/crash-report.pl. When it accepts a crash report, it puts the file into a directory of the form /home/casa.nrao.edu/upload/$DATETIME, where DATETIME is a timestamp of the reception time (e.g., "2017-01-18_13:45:09"). The directory will normally contain just the compressed archive file. Currently, the files are owned by user apache and belong to the group casaweb. Below are the contents of a typical crash report archive (newer ones will also contain a casalog when the failing process is casapy):

0bdbf403-e526-0518-1b809f7b-557415b7.dmp
cpuinfo.txt
meminfo.txt
mountinfo.txt
lsbinfo.txt
unameinfo.txt
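The upload directory naming described above can be reproduced with a standard timestamp format string. The sketch below is in Python purely for illustration (the real web application is a Perl CGI script), and the function name upload_directory is hypothetical; which time zone the server uses is not stated here.

```python
from datetime import datetime
from pathlib import PurePosixPath

UPLOAD_ROOT = PurePosixPath("/home/casa.nrao.edu/upload")


def upload_directory(received):
    """Return the per-report directory for a given reception time."""
    return UPLOAD_ROOT / received.strftime("%Y-%m-%d_%H:%M:%S")


print(upload_directory(datetime(2017, 1, 18, 13, 45, 9)))
# prints /home/casa.nrao.edu/upload/2017-01-18_13:45:09
```

Because the timestamp has one-second resolution and sorts lexically, listing the upload directory shows the reports in order of arrival.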

Designated people (me as of this writing) will receive an email message each time a crash report is received.

Crash Report Analysis Process

Once the crash reporter feature is fully deployed, the incoming crash reports will need to be analyzed. The various .txt files are immediately readable, but the dump (.dmp) file requires further processing to be useful. The dump file needs to be combined with the debug symbols extracted from the binaries that make up the application (e.g., the executable and the shared libraries). Extracting the symbols into separate files means they need not be included in the delivered binaries. The Breakpad application dump_syms is used to extract the symbols. The symbol file contains a hash code uniquely identifying the built binary. More details can be found in "linux_starter_guide.md" in the Breakpad distribution. To decode the dump file, the Breakpad application minidump_stackwalk is used; it takes the path to the dump file and the root directory of the folder containing the symbol files (see the "linux_starter_guide.md" mentioned above).
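For anyone scripting the decoding step, a small wrapper might look like the following. It is a sketch under assumptions: the helper names are hypothetical, minidump_stackwalk is assumed to be on the PATH, and the only arguments used are the two named above (dump file, then symbol root).

```python
import shutil
import subprocess


def stackwalk_command(dump_path, symbols_root):
    """Build the argv for minidump_stackwalk: dump file first, then symbol root."""
    return ["minidump_stackwalk", str(dump_path), str(symbols_root)]


def decode_dump(dump_path, symbols_root):
    """Return the tool's text report, or None if the tool is not installed."""
    if shutil.which("minidump_stackwalk") is None:
        return None
    result = subprocess.run(
        stackwalk_command(dump_path, symbols_root),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the argument order.
    print(" ".join(stackwalk_command("/tmp/example.dmp", "./symbols")))
```

Keeping the command construction separate from the subprocess call makes it easy to log or inspect exactly what would be run.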

Required Build Support

Building Breakpad

Breakpad is distributed as a source archive that must be built on the local platform. Logic was added to the CASA CMake system to download the package, build it, and make the binaries available during the CASA build process.

Extracting Symbols

The crash feature should only be used on released software (this includes interim and pre-releases), so the normal build process need not extract the symbols. For release builds, logic will need to be added to the build system so that debug symbols are generated during the build and the Breakpad application dump_syms is then run on all of the CASA and CasaCore binaries. The extracted symbols will need to be retained in an orderly fashion so that they can be used to analyze incoming crash dumps.

Because the symbol extraction needs to be made part of the release process, it is not currently implemented.
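One candidate for an "orderly fashion" is the directory layout that, as I understand it, Breakpad's Linux starter guide uses: the first line of each symbol file has the form "MODULE <os> <arch> <id> <name>", and the file is stored as <root>/<name>/<id>/<name>.sym. The Python sketch below computes that location; the function name and the identifier in the example are hypothetical, and nothing like this exists in the CASA build yet.

```python
from pathlib import Path


def symbol_store_path(module_line, symbols_root):
    """Map a 'MODULE <os> <arch> <id> <name>' header to its symbol-store location."""
    fields = module_line.split(maxsplit=4)
    if len(fields) != 5 or fields[0] != "MODULE":
        raise ValueError("not a Breakpad MODULE line: %r" % module_line)
    build_id = fields[3]
    name = fields[4].strip()  # the last field keeps any trailing newline
    return Path(symbols_root) / name / build_id / (name + ".sym")


if __name__ == "__main__":
    # Made-up identifier, for illustration only.
    header = "MODULE Linux x86_64 0123456789ABCDEF0123456789ABCDEF0 libexample.so\n"
    print(symbol_store_path(header, "symbols"))
```

Since the identifier is part of the path, symbols from many releases can share one root directory without overwriting each other.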

Future work

As the feature matures, it might be desirable to turn the poster application into a GUI application which would allow the user to enter comments about the nature of the crash before it is posted to CASA.

-- JimJacobs - 2017-02-03
Topic revision: r2 - 2017-02-03, JimJacobs