CASA Crash Reporting Reference
Overview
Like many software packages, CASA and its related applications occasionally crash; that is, the program stops abruptly, for example because of a segmentation fault. A crash can lose some of the user's work and undermines the user's confidence in the program. Unfortunately, when a crash occurs there is currently little information available to diagnose the problem. If a crash occurs frequently and reproducibly, it is likely to produce a helpdesk ticket, which may lead to the problem being investigated and fixed.
The purpose of the Crash Reporting System (CRS) is to capture useful debugging information when a crash occurs and automatically forward that information to the CASA development team. The crash report will be stored and, if a pattern is detected, a JIRA issue may be created and worked.
The CRS involves several steps. First, the crash must be caught by the software. Second, a set of potentially relevant information is captured (e.g., stack trace, platform configuration, user environment, CASA configuration). Third, the crash poster application posts the report to an NRAO web site.
CASA Integration
Breakpad for Crash Dump Generation
The Google open-source package breakpad was used to capture crash information when the application terminates. A Breakpad initialization routine, called at application startup, installs signal handlers for the fatal signals (e.g., SIGSEGV, SIGABRT) that the program might generate and that would terminate it unless handled. Since unhandled C++ exceptions result in a SIGABRT, breakpad catches those as well. When a fatal signal is routed to breakpad's signal handler, the crash handling process begins. A separate thread (its stack was preallocated during breakpad initialization) is created to copy the process's stacks (the stacks of all threads are captured); the crash thread also captures other useful information, such as shared library information. Once the crash thread has finished, a fork is performed to create a new, uncrippled process that finalizes and posts the crash report while the original process is allowed to terminate. The actions of the posting application are discussed below.
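The fork-then-exec handoff described above can be illustrated with a minimal Python sketch. This is only an analogue of the pattern, not the CASA or Breakpad code (which is C++ and runs from a signal handler); the function name spawn_reporter and the stand-in reporter command are hypothetical.

```python
import os
import sys


def spawn_reporter(argv):
    """Fork, then exec a separate reporter program in the child.

    Returns the child's PID to the parent, which is then free to terminate.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the reporter program.
        try:
            os.execv(argv[0], argv)
        finally:
            # Only reached if exec failed; exit without running cleanup handlers.
            os._exit(127)
    return pid


if __name__ == "__main__":
    # Stand-in reporter: the Python interpreter running a one-line script.
    child = spawn_reporter([sys.executable, "-c", "print('posting report')"])
    _, status = os.waitpid(child, 0)
    print("reporter exit code:", os.WEXITSTATUS(status))
```

Because the child immediately replaces its image with a fresh program, the reporter does not depend on the heap or other state of the crashed process, which is presumably why the text calls the new process "uncrippled".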
Using the Crash Reporter
In 5.0 the crash reporter feature was released as opt-in, and selected users were provided with instructions on how to enable and test the feature. The above hyperlink goes to a copy of the documentation since I'm not sure how to link to the document as it appears on Plone.
CASA Integration Details
The casapy application uses a Python interface to initialize breakpad. A new method, _crash_reporter_initialize, was added to the utils tool. This method is called from one of the Python modules executed during casapy startup. The tool's C++ layer calls routines in stdcasa/StdCasa/CrashReporter.cc.
The plotms application calls the routines in CrashReporter.cc directly since it is a C++ application.
Web Interface for Receiving Reports
Crash Poster Application
The Crash Poster application is exec'd from a fork of the crashing CASA application (the source is located in the code/crash/apps/reporter directory). As command-line parameters, the application expects the complete path to the crash dump file created by breakpad, the URL to which the crash report is to be posted, and optionally the path to the casapy log. The poster app captures some additional information about the host using various shell facilities (e.g., uname). All of these files are then collected into a compressed tar file, which is sent to the given URL with an HTTP POST. The crash dump is located in the appropriate temporary storage for the platform (e.g., /tmp on Linux), and all the additional files created by the poster app are also created there. When the crash report is successfully posted, all these files are removed from the disk. The poster appends a record of its operations to a log file, CrashReporter.log, located in the temporary directory.
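The bundling and posting steps above might look roughly like the following Python sketch. It is not the poster's actual source: the function names, the archive name, and the request body format (raw gzip bytes) are assumptions, and only unameinfo.txt is generated here to stand in for the host information files.

```python
import platform
import tarfile
import urllib.request
from pathlib import Path


def build_report_archive(dump_path, work_dir, log_path=None):
    """Bundle the dump, host information, and optional log into a .tar.gz."""
    work_dir = Path(work_dir)
    dump_path = Path(dump_path)

    # Host details; the real poster shells out to tools such as uname.
    uname_file = work_dir / "unameinfo.txt"
    uname_file.write_text(" ".join(platform.uname()) + "\n")

    members = [dump_path, uname_file]
    if log_path is not None:
        members.append(Path(log_path))

    archive = work_dir / (dump_path.stem + ".tar.gz")  # hypothetical name
    with tarfile.open(archive, "w:gz") as tar:
        for member in members:
            # Store each file flat, under its base name only.
            tar.add(member, arcname=member.name)
    return archive


def post_archive(archive, url):
    """POST the archive bytes to url (the body format is an assumption)."""
    request = urllib.request.Request(
        url,
        data=Path(archive).read_bytes(),
        headers={"Content-Type": "application/gzip"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the archive step separate from the network step means a failed post leaves the archive on disk, which matches the behavior described above of removing files only after a successful post.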
Web Application
The web application accepting the HTTP POST operation was written by Stephan Witz and services the URL https://casa.nrao.edu/cgi-bin/crash-report.pl. When it accepts a crash report, it puts the file into a directory of the form /home/casa.nrao.edu/upload/$DATETIME, where DATETIME is a timestamp for the reception time (e.g., "2017-01-18_13:45:09"). The directory normally contains just the compressed archive file. Currently, the files are owned by user apache and belong to the group casaweb. Below are the contents of a typical crash report archive (newer ones also contain a casalog when the failing process is casapy):
0bdbf403-e526-0518-1b809f7b-557415b7.dmp
cpuinfo.txt
meminfo.txt
mountinfo.txt
lsbinfo.txt
unameinfo.txt
Designated people (me as of this writing) will receive an email message each time a crash report is received.
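The timestamped upload directory naming described above can be sketched in a few lines of Python. The real web application is a Perl CGI script whose source is not shown here, so this only illustrates the naming convention; the function name upload_dir_for is hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath

UPLOAD_ROOT = PurePosixPath("/home/casa.nrao.edu/upload")


def upload_dir_for(received):
    """Return the directory for a report received at the given time."""
    return UPLOAD_ROOT / received.strftime("%Y-%m-%d_%H:%M:%S")


print(upload_dir_for(datetime(2017, 1, 18, 13, 45, 9)))
# /home/casa.nrao.edu/upload/2017-01-18_13:45:09
```

Because the name has one-second resolution, two reports received within the same second would map to the same directory; how the real application handles that case is not stated here.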
Crash Report Analysis Process
Once the crash reporter feature is fully deployed, the incoming crash reports will need to be analyzed. The various .txt files are immediately readable, but the dump (.dmp) file requires further processing to be useful. The dump file needs to be combined with the debug symbols extracted from the binaries that make up the application (e.g., the executable and the shared libraries). Extracting the symbols into separate files means that they need not be included in the delivered binaries. The breakpad application dump_syms is used to extract the symbols. The symbol file contains a hashcode uniquely identifying the built binary. More details can be found in "linux_starter_guide.md", contained in the breakpad distribution. To decode the dump file, the breakpad application minidump_stackwalk is used; it takes the path to the dump file and the root directory of the folder containing the symbol files (see the "linux_starter_guide.md" mentioned above).
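As a small illustration of the decoding step, the following Python sketch wraps the minidump_stackwalk invocation described above (dump file first, then the symbol root). It assumes the tool is on the PATH; the helper names stackwalk_command and decode_dump are hypothetical.

```python
import subprocess


def stackwalk_command(dump_path, symbols_root, tool="minidump_stackwalk"):
    """Build the argument list: the dump file first, then the symbol root."""
    return [tool, str(dump_path), str(symbols_root)]


def decode_dump(dump_path, symbols_root):
    """Run minidump_stackwalk and return its text report."""
    result = subprocess.run(
        stackwalk_command(dump_path, symbols_root),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout
```

For example, decode_dump("0bdbf403-e526-0518-1b809f7b-557415b7.dmp", "symbols") would return the decoded stack traces as text, given a symbols directory populated from dump_syms output.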
Required Build Support
Building Breakpad
Breakpad is distributed as a source archive that must be built on the local platform. Logic was added to the CASA cmake system to download the package, build it, and make the binaries available during the CASA build process.
The crash feature should only be used on released software (this includes interim and pre-releases). Thus the normal build process need not extract the symbols. For release builds, logic will need to be added to the build system so that symbols are created during the build and the breakpad application dump_syms is then run on all of the CASA and CasaCore binaries. The extracted symbols will need to be retained in an orderly fashion so that they can be used in analyzing incoming crash dumps.
Because the symbol extraction needs to be made part of the release process, it is not currently implemented.
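One way to retain the extracted symbols in an orderly fashion is the directory layout given in Breakpad's linux_starter_guide.md, where each symbol file is stored under the module name and the identifier found on its first (MODULE) line. The Python sketch below files dump_syms output that way; it is not part of the CASA build, and the function names are hypothetical.

```python
from pathlib import Path


def symbol_file_path(sym_text, symbols_root):
    """Return <root>/<module>/<id>/<module>.sym from the MODULE header line."""
    lines = sym_text.splitlines()
    # Header format: MODULE <os> <arch> <id> <module name>
    header = lines[0].split(maxsplit=4) if lines else []
    if len(header) != 5 or header[0] != "MODULE":
        raise ValueError("not a Breakpad symbol file")
    _, _os_name, _arch, module_id, module_name = header
    return Path(symbols_root) / module_name / module_id / (module_name + ".sym")


def store_symbols(sym_text, symbols_root):
    """Write dump_syms output into the layout minidump_stackwalk searches."""
    target = symbol_file_path(sym_text, symbols_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(sym_text)
    return target
```

Since the identifier differs for each built binary, symbols from successive releases can share one root directory without overwriting each other.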
Future work
As the feature matures, it might be desirable to turn the poster application into a GUI application which would allow the user to enter comments about the nature of the crash before it is posted to CASA.
--
JimJacobs - 2017-02-03