Grail Development Notes

12/13/2006

Field Callback improvements

Grail has been modified to improve the field level callback services.

In addition to providing callbacks where an entire parameter or sampler may be returned, Grail now accepts callback subscriptions for individual sampler/parameter fields. This can significantly simplify the processing of the callback, and also significantly reduce the Grail client's processing overhead, especially when all that is needed is one or two fields of a complex sampler.

All examples that follow use the quadrant detector and assume the following:

>>> from gbt.ygor import GrailClient
>>> from pprint import pprint
>>> cl = GrailClient("goauld", 18000, cb_port=19592)
>>> # Call-back function:
>>> def cat(device, sampler, value):
...      pprint(value)
...
>>> qd = cl.create_manager('QuadrantDetector')

To register a callback for a sampler/parameter field, one provides the desired field(s) -- minus the root sampler/parameter name -- as an extra parameter to reg_sampler() or reg_param(), in both versions of GrailClient. To illustrate, this call registers a callback for the entire 'monitorData' sampler:

>>> qd.reg_sampler('monitorData', cat)

but this call registers a callback only for the 'n12VOK' field of the 'monitorData' sampler:

>>> qd.reg_sampler('monitorData', cat, 'n12VOK')

The value will be returned just as if qd.get_sampler_value('monitorData,n12VOK') had been called:

'1'

(this indicates that this power supply is OK).

Multiple fields for the sampler/parameter may be specified in the same subscription. These are separated by semicolons:

>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK')

The values returned will now be a semicolon delimited list of values, with the ordering matching the order in which they were subscribed:

'1;1;1'

(Note that the callback still will be called every time the sampler is updated, not when the field is updated. So if a sampler updates, but the registered field has not changed, callbacks are received nonetheless.)
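
Because a multi-field subscription delivers one semicolon-delimited string, the callback usually has to pair each value with its field name. The following is a minimal sketch of one way to do that, assuming the callback knows the field list it subscribed with. The helpers split_fields and make_field_callback are hypothetical and are not part of GrailClient.

```python
def split_fields(fields, value):
    """Pair a ';'-delimited value string with the ';'-delimited field list."""
    names = fields.split(';')
    values = value.split(';')
    if len(names) != len(values):
        raise ValueError("expected %d values, got %d" % (len(names), len(values)))
    return dict(zip(names, values))


def make_field_callback(fields, handler):
    """Build a callback with the (device, sampler, value) signature used above.

    Hypothetical helper: 'handler' receives a dict instead of the raw string.
    """
    def callback(device, sampler, value):
        handler(device, sampler, split_fields(fields, value))
    return callback


# Usage with the session above (needs a live Grail, so it is left commented out):
# fields = 'n12VOK;n15VOK;p5VOK'
# qd.reg_sampler('monitorData', make_field_callback(fields, cat), fields)
```

This sketch assumes that no field value itself contains a semicolon; a string-valued field that does would need different handling.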

Using field callbacks for samplers/parameters with no fields

The field callback syntax requires a field name to specify a field. What of parameters/samplers that have only one root field? Up to now, there was no way to specify this. Grail has been modified to allow for this. For such samplers/parameters, one specifies the root field using '.'. The following example registers a field callback for the quadrant detector's 'state' parameter:

>>> qd.reg_param('state', cat, '.')

This is desirable because now there is no need to parse the entire 'state' parameter to get the value; the value is returned just as if qd.get_value('state') had been called:

>>> qd.off()
'OK'
>>> 'Off'
'Off'

>>> qd.on()
'OK'
>>> 'Activating'
'Activating'
'Aborting'
'Ready'

(Note that the 'cat' callback function above does nothing but print the value.)

Time stamps and field callbacks

In Ygor, samplers are returned with a time stamp. The time stamp is not an integral part of the sampler structure itself, however; it is merely returned with the sampler structure. Therefore, the normal sampler callback mechanism does not return one (yet). However, the time stamp values may be obtained using the field callback mechanism. Two special fields are recognized: TS:MJD and TS:seconds. If these fields are provided during the field callback subscription, the MJD and seconds-in-the-day values will be returned along with the values of any other subscribed fields:

>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK;TS:MJD;TS:seconds')
>>>
'1;1;1;54082;69355.437475'

Here, returned with the three requested values, are the MJD and number of seconds elapsed during the day (UT).

NOTE: Grail can return time stamps this way for both samplers and parameters. It should be noted however that Ygor only supports time stamps for samplers; Grail adds the parameter time stamp when it receives the parameter callback from the manager.
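
The two time stamp fields can be combined into a conventional UTC date and time on the client side. The sketch below is one way to do so, assuming only the standard definition of the Modified Julian Date (MJD 0 is 1858-11-17 00:00 UTC). The function name timestamp_to_utc is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# MJD 0 is 1858-11-17 00:00 UTC, by definition of the Modified Julian Date.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def timestamp_to_utc(mjd, seconds):
    """Convert TS:MJD and TS:seconds (delivered as strings) to a datetime."""
    return MJD_EPOCH + timedelta(days=int(mjd), seconds=float(seconds))


# The last two fields of the callback value shown above.
mjd, seconds = '1;1;1;54082;69355.437475'.split(';')[-2:]
stamp = timestamp_to_utc(mjd, seconds)
print(stamp.isoformat())
```

For the value shown above this gives 13 December 2006 at about 19:15:55 UT, which matches the date of this entry.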

05/30/2006

Grail Call Execution Speed

Occasionally, questions arise about the speed with which Grail can handle an individual SOAP request. This appears to be prompted by a perception that SOAP is slow. Therefore I have conducted a series of tests to lay this issue to rest.

In order to differentiate between the performance of Grail and the performance of the Python clients, I completed the C++ Grail client library, at least enough to conduct the tests. I assumed that the tests using the C++ Grail client would provide a minimum baseline time. The call times would be shared between both client and server, but would be the smallest possible times that could reasonably be expected to perform an RPC call to Grail. If RPC calls using Python clients show a significant time difference, these then could be ascribed to the Python client, and not to Grail.

The test was conducted with two versions of Grail: non-multithreaded SOAP, with HTTP 1.0 protocol, and multithreaded SOAP, which can support HTTP 1.1 (SOAPpy clients use HTTP 1.0 only). I wrote a test program using the C++ Grail client which did the following:
  • Gets a list of managers, through Grail::show_managers()
  • Sends down a different projectId value 30 times, in 30 separate calls to Grail::set_value()
  • Shows all the parameters for the MotorRack using Grail::show_params()
  • Shows all the samplers for the MotorRack using Grail::show_samplers()
This amounts to 33 RPC calls to Grail in all. The calls were made from a client running on colossus to a Grail instance running on goauld. In the Python clients, RPC get and set calls were timed separately (more on this below). The times are as follows:
  • C++ client, on the single-threaded, HTTP 1.0 SOAP Grail: 151 ms total, or 4.58 ms per RPC call.
  • C++ client, on the multi-threaded, HTTP 1.1 SOAP Grail: 55 ms total, or 1.67 ms per RPC call.
  • Python (SOAPpy, Ygor GrailClient) client, HTTP 1.0 SOAP Grail: 23 ms per RPC call
  • Python (SOAPpy, Sparrow GrailClient) client, HTTP 1.0 SOAP Grail: 23 ms per RPC get call, 63 ms per RPC set call!
These RPC calls were fairly simple, but most calls to Grail are. The big difference between the two C++ times is that for an HTTP 1.0 call, a new connection is made for every RPC call; for HTTP 1.1, the same connection is reused for every RPC call. (This is why the single-threaded version of Grail's SOAP server must be HTTP 1.0; otherwise, one client would lock out all others.)

The cause of the difference observed between the times obtained with the Python clients vs. the C++ client is quite different. Python is inherently slower, which accounts for the base difference between the C++ and Python client calls. But note the difference in the set calls between Ygor GrailClient.py and Sparrow GrailClient.py. Set RPC calls are the ones that modify some parameter or state in a manager. The Sparrow GrailClient runs these calls through the Sparrow security module, which checks the gateway file for permissions on every RPC set call. A different paradigm may have to be used here to improve response times.
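
The per-call figures quoted above follow directly from the totals and the call count. The short calculation below reproduces them. The last line assumes that the whole gap between Sparrow get and set calls is due to the per-call permission check, which is an inference from the explanation above rather than a separate measurement.

```python
# show_managers, 30 x set_value, show_params, show_samplers
RPC_CALLS = 1 + 30 + 1 + 1

totals_ms = {
    'C++ client, single-threaded HTTP 1.0': 151.0,
    'C++ client, multi-threaded HTTP 1.1': 55.0,
}
per_call_ms = {name: total / RPC_CALLS for name, total in totals_ms.items()}
for name, ms in per_call_ms.items():
    print('%s: %.2f ms per call' % (name, ms))

# Sparrow set calls versus get calls, both quoted per RPC call.
# Assumption: the whole difference is the security module's gateway check.
security_overhead_ms = 63 - 23
```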

11/22/2005

  • Found and fixed a race condition during initialization that was exposed when running Grail on Red Hat Enterprise Linux 4 machines. The problem occurs because all RPC servers must be created in the same thread that runs the polling loop, and because DeviceClientMap, an RPCserver, is a singleton, which means the first function to call DeviceClientMap::instance() will create it. When initializing, Grail first creates the rpc_task thread to make sure DeviceClientMap gets built in the right thread. On the RHEL4 machines, this thread did not run far enough before DeviceClientMap::instance() was called somewhere else in Grail's initialization procedure, causing DeviceClientMap to fail to respond to connections. Since the function of this service is to enable the creation of DeviceClients in the RPC service thread, this meant that no DeviceClient objects could be created when Grail runs on RHEL4 machines. The solution was to place a condition variable that allowed the Grail initialization to hold until rpc_task finishes creating the DeviceClientMap object, thus ensuring that it gets built in the right thread.
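
The fix described above follows a common pattern: the initializing thread waits on a condition variable until the worker thread has built the object that must belong to it. The sketch below illustrates that pattern in Python. It is an analogy only, since Grail itself is C++; Registry is a hypothetical stand-in for DeviceClientMap, and rpc_task here merely borrows the name of the Grail thread.

```python
import threading


class Registry:
    """Stand-in for a singleton that must be built in one specific thread."""

    def __init__(self):
        self.owner = threading.current_thread().name


_ready = threading.Condition()
_registry = None


def rpc_task():
    global _registry
    with _ready:
        _registry = Registry()      # built in the RPC thread
        _ready.notify_all()
    # ... the polling loop would run here ...


def initialize():
    worker = threading.Thread(target=rpc_task, name='rpc_task')
    worker.start()
    with _ready:
        # Hold initialization until rpc_task has built the object.
        # wait_for() returns at once if the object already exists.
        _ready.wait_for(lambda: _registry is not None)
    worker.join()
    return _registry
```

Without the wait, whichever thread first touched the singleton would construct it, which is the race that appeared on the RHEL4 machines.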

11/18/2005

  • Removed 2 sleep() calls in ProcessParameterList(), in soap_handlers.cc, which together amounted to 3 seconds per configured manager.
  • Added finer-grained control to ProcessParameterList(): through the 'prepare' flag, the Grail client can now specify whether values are sent to the managers and whether the managers are then 'prepared':
    1. A value of 0 sends no values to managers and does not 'prepare' the manager
    2. A value of 1 sends all values to the managers and does 'prepare' the manager
    3. A value of 2 sends all values to the manager but does not 'prepare' the manager

These changes help speed the telescope configuration by removing the delays and by allowing the config_tool to use the set_values_array() interface as needed, rather than having to make individual set_value calls.
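
The three 'prepare' values listed above decide two independent things: whether values are sent, and whether the manager is prepared afterwards. The sketch below restates that mapping as a small table in code. Only the numeric values 0, 1 and 2 come from these notes; the member names and the function plan are hypothetical.

```python
from enum import IntEnum


class Prepare(IntEnum):
    # Member names are illustrative; only the numeric values come from the notes.
    NO_SEND = 0
    SEND_AND_PREPARE = 1
    SEND_ONLY = 2


def plan(prepare):
    """Return (send_values, prepare_manager) for a 'prepare' flag value."""
    flag = Prepare(prepare)  # raises ValueError for anything other than 0, 1, 2
    return (flag != Prepare.NO_SEND, flag == Prepare.SEND_AND_PREPARE)
```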

08/16/2005

  • Found the problem that was allowing sampler subscribers to be dropped. The symptom was that Dynamic Corrections did not update. AntennaCharacterization was using samplers to obtain the values needed to compute corrections, and these values stopped updating. I changed the code in SamplerCache.cc, which used reference counts to keep track of whether a sampler stream should be stopped or started. Reference counts were lost if the new sampler stream did not respond with a value, causing an exception prior to the reference count increment. The new code drops the reference count idea, instead deciding whether to start a sampler stream by checking directly with the sampler stream to see if it is already started. This required modifications to the Monitor library. The code also now checks to see if there are any subscribers to the sampler before stopping the sampler stream by checking the EventManager. This required changes to the EventManager library. This is much more positive and does not result in a sampler stream being shut off while there are subscribers. For polls to sampler values, the call will re-start the stream if it has been shut down before, since it does not rely on reference counts. Finally, the problem where the sampler does not return a value when data is started has been solved in the newest sampler/monitor libraries.

  • Tim has been upgraded to provide some additional utility:
    • Two new commands, sampler and parameter, can now return internal information on how Grail is handling the specified sampler or parameter. Type help sampler or help parameter on the tim command line for more information.
    • Commands manager, samplers, and parameters now accept a switch, -r, which specifies that only registered managers, samplers and parameters, respectively, should be displayed.

02/24/2005

An issue arose concerning the total number of threads that Grail can spawn. When Ron DuPlain made an attempt to register every manager in the GBT system, Grail failed to allocate more than 253 threads. Since Grail starts at least 3 threads per Manager client, and there are currently 142 managers in the current GBT Telescope system (release 5.1), Grail was attempting to create some 426 threads. The problem turns out to be that the default stack size per thread is 8MB. Since the process space is 3GB, this works out to about 380 threads before process space is exhausted, if the process space was only given over to thread stacks. In reality, other parts of the process get process space, so this number is closer to 250. Note that this memory is not actually used! The process space is simply the amount of memory that the process could conceivably address given the memory addressing hardware of the machine's architecture. Only portions of memory actually used are mapped to physical memory by the hardware. Thus, one can conceivably exhaust the entire process space by reserving portions of it for potential use (as happened here) but only be actually using a few megabytes of physical memory.

I tested to see if this was the problem by using ulimit to temporarily set the default thread stack size to 128K and ran Grail on leeloo. This time Grail was able to create all the needed threads. I am implementing a more permanent fix by explicitly setting the stack size for threads created in Grail by using the pthread_attr_setstacksize() call before every call to pthread_create(). I am also modifying the Thread template class in ygor/libraries/Threads to allow this size to be specified (if it is not, the default set by ulimit applies). I am leaving alone threads spawned by other libraries, such as RPC++ (in ShVxClient), to minimize impact on any other applications. Though one of these is created for every manager client Grail creates, the savings elsewhere should allow Grail to meet its needs.
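
The thread-count arithmetic above can be checked with a few lines of code. The sketch below also shows the Python analogue of setting a per-thread stack size, threading.stack_size(), which plays the role that pthread_attr_setstacksize() plays in Grail. The helper run_with_stack is hypothetical, and the sizes are the ones quoted in this entry.

```python
import threading

MB = 1024 * 1024
PROCESS_SPACE = 3 * 1024 * MB      # 3 GB of address space, as stated above
DEFAULT_STACK = 8 * MB             # default per-thread stack reservation
SMALL_STACK = 128 * 1024           # the 128K value used in the test

threads_needed = 3 * 142           # at least 3 threads per manager, 142 managers
ceiling_default = PROCESS_SPACE // DEFAULT_STACK
ceiling_small = PROCESS_SPACE // SMALL_STACK
print(threads_needed, ceiling_default, ceiling_small)


def run_with_stack(size, target):
    """Run target in a thread whose stack reservation is `size` bytes."""
    previous = threading.stack_size(size)   # applies to threads created afterwards
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)      # restore the earlier setting
```

With 8MB stacks the theoretical ceiling (384) is already below the 426 threads Grail wanted; with 128K stacks the ceiling is far above it.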

02/22/2005

Toney Minter found a bug in the way Grail handles dynamic String parameters. Ygor has two kinds of String parameters: static and dynamic. The static kind are the most common. These parameters use a programmer-defined hard limit on their length. The dynamic kind are much rarer. Their size is set by the user when the user sets the parameter. Grail was mishandling these so that they could not be set at all (for example, the polycoDatFile parameter in the SpectralProcessor). Grail was only allowing access to dynamic parameters by requiring an index to access their individual elements, but the individual string elements are not of interest; the entire string is (thus no index is needed to get/set the parameter).

Fixed this by testing the parameter to see if it is a BasicType::String, and if so, handling it differently. Dynamic non-String parameters are still handled the same as before.

01/06/2005

  • Grail now supports callbacks by field, for both parameter and sampler callbacks. This feature is invoked by making the standard reg_sampler() or reg_parameter() SOAP interface call, except that these now support an extra parameter, a ';' delimited list of fields for the sampler or parameter. On callback, a ';' delimited list of values will be returned, in the same order as was specified in the registration call.

12/22/2004

  • Refactored the soap_handlers.cc file to remove the GrailCallback and GrailCallbackMap classes into their own files, GrailCallback.[h,cc]. This was done to achieve a more logical code layout: these two classes had little bearing on SOAP interface handlers.
  • Fixed the SOAP cleanup in the GrailCallback class.
  • Separated the RPC Grail Status Server from the DeviceClientMap class. The latter retains the RPC call used to construct DeviceClient objects. This was done because adding status functionality to the Grail Status Server required knowledge of components in the Grail namespace, and this was inappropriate in the DeviceClient library, which has no dependencies on anything in the Grail namespace. This does not affect the SOAP or M&C interfaces, or tim, the status client.
  • Added the RPC procedure CB_CLIENTS to the Status Server. This allows tim to display the callback URLs of all clients that have registered a callback. Useful for investigating problems.
  • Fixed a compiler dependency in DeviceClient::send_values(). This function was making a call to PanelRemote::newParam() with two Parameter member functions as arguments, Parameter::data() and Parameter::data_length(). The order of execution of the two arguments was significant, and is implementation dependent when placed in an argument list. On the Solaris version, the calls executed in the expected order, but on the Linux version, the order was reversed. This meant that the newParam() call was being made (on the Linux version) with new data but with the old data size. Fixed this by removing the significance of the order of execution.

12/16/2004

Grail Quality of Service Improvements

Grail suffered from not being able to handle large loads, and also did not insulate itself and its clients from a problem client. This means that heavy use of callbacks, or a misbehaving client, could freeze Grail and any other Grail clients. (See also Grail Test Plan.)

The following has been done to fix this problem:

  • Grail uses the new RPC++ Quality Of Service poll routine (ServerBase::poll_with_qos()). This means that no matter how busy Grail gets, the polling loop will still operate correctly, ensuring reconnection to re-started managers.
  • Grail now uses two separate threads for each DeviceClient, one each to handle Parameter and Sampler callbacks. The RPC++ thread does no more than post the index of the Sampler/Parameter to queues (one each for Samplers and Parameters) and then moves on. The callback threads read the head of the queues, then process any Parameters/Samplers indicated. Thus, no matter how long it takes the callback threads to process a callback, the RPC++ thread will not be impacted and Grail will remain responsive to the M&C side.
  • Grail now uses one callback thread per Grail client. This thread takes a copy of the Parameter data provided by the DeviceClient callback threads mentioned in the previous item and then makes the SOAP call to the Client's callback server. This insulates all other clients from a misbehaving client; if the client does not process its callback, only its own thread will be blocked. All other client threads will go on processing.
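
The per-client callback thread described in the last item can be illustrated with a queue and a worker thread per client. The sketch below is a Python analogy of that design, not Grail's C++ code; ClientWorker and the two delivery functions are hypothetical, and a plain function call stands in for the SOAP call to the client's callback server.

```python
import queue
import threading


class ClientWorker:
    """One queue and one delivery thread per client."""

    def __init__(self, deliver):
        self._deliver = deliver              # stands in for the SOAP callback call
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, value):
        self._queue.put(value)               # never waits on the client itself

    def _run(self):
        while True:
            value = self._queue.get()
            if value is None:                # sentinel: stop the worker
                break
            self._deliver(value)             # may block; only this client stalls

    def close(self):
        self._queue.put(None)
        self._thread.join()


# Demonstration: 'slow' blocks in its callback, 'fast' still receives everything.
gate = threading.Event()
fast_seen = []
slow_seen = []


def slow_deliver(value):
    gate.wait()                              # a client that does not respond
    slow_seen.append(value)


fast = ClientWorker(fast_seen.append)
slow = ClientWorker(slow_deliver)

for value in ('Off', 'Activating', 'Ready'):
    slow.post(value)                         # posted first, and it stalls
    fast.post(value)

fast.close()                                 # returns once 'fast' has drained
delivered_while_blocked = list(fast_seen)
pending_for_slow = list(slow_seen)           # still empty: 'slow' is blocked
gate.set()                                   # the slow client recovers
slow.close()
```

Because posting only enqueues, the thread that produces values is never held up by a client, and a stalled client delays nothing but its own queue.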

One issue remains: SOAP Cleanup on the server side if a client dies is not yet working properly. This should be completed soon.

11/04/2004

Grail memory leaks fixed

The following Grail SOAP interface functions produced memory leaks:
  • get_parameter()
  • get_sampler()
  • get_values_array()
  • show_managers()
  • show_params()
  • show_samplers()
In addition, callbacks from samplers and parameters also produced memory leaks.

These have all been fixed, and in the process I've streamlined the SOAP data structures I use, along with some of the code in these functions. None of these changes require any changes to any Grail clients. All have been tested by sending up to 100,000 requests to each of these functions and observing Grail memory use with top. All changes apply to both the Linux and Solaris Grail.

The problem was that I misunderstood how gSOAP manages memory in the SOAP serializer. After a careful reading of the gSOAP manual (a comprehensive though not very direct document, and short on meaningful examples) I was able to figure out how to manage memory in dynamic gSOAP arrays. More details about gSOAP memory management can be found here.

10/28/2004

Grail ported to Linux!

Grail has been ported to Linux. In the Solaris Grail the SOAP interface thread would create the DeviceClients (including the recipient RPC server file handles) while handling a client request, and a different thread, the M&C RPC++ recipient server thread, would service the DeviceClient recipient servers. In Solaris, file handles are global and can be used in any thread like this. Under Linux, file handles to be used by a select() call must have been created in the same thread that runs the select() call. Somehow the SOAP client thread had to be able to notify the RPC select() thread (waiting on socket file handles) to create the DeviceClient for it. The solution was for Grail to make an RPC call to itself, one thread to the other, to have the RPC thread jump out of the select() call and create DeviceClient on behalf of the SOAP service thread. This was done by adding a CREATE_CLIENT RPC method to the Grail Status Service RPC service. The GrailStatusService class in turn has been merged into the DeviceClientMap class to allow this to be easily done.

Other Grail enhancements

  • Grail now supports the auto-registration of samplers if no callbacks are needed. Thus, a call to Grail::get_sampler_value('path') or Grail::get_sampler('sampler_name') will start the monitor running without the need to call Grail::reg_sampler(). This last is only required when specifying callbacks.
  • The string length bug has been fixed. Grail was treating string parameters incorrectly, with the result that as the string parameter was used, it would shorten to the shortest string used and could then not accommodate a longer string.
  • Grail has been modified to support the M&C Secure Panel Server. All Grail interface functions that interact with the M&C system now have a 'user' and 'host' parameter that will be used for this authentication. These default to NULL, in which case Grail behaves just as before. However, these parameters must be set by the client if a secure panel server manager is to accept a request.
  • Grail now supports a get_values_array() call. Melinda Mello added this feature to cut down on the network traffic to Grail. Clients that must make repeated periodic calls to Grail for sampler and parameter values may now use get_values_array() to batch these requests into one single call. The result is far fewer sockets left in a TIME_WAIT state.

08/03/2004

Grail now has a multithreaded SOAP server. Grail no longer blocks out service requests until the previous service request is handled. Instead, it starts a new thread to handle the request and immediately resumes listening for more requests. This allows Grail to support by default HTTP 1.1 connections, which keep the socket open by default until the client closes it. It also allows Grail to be more responsive to requests. Now a client will block only if the client desires access to a device that another client is currently accessing. In this case, as soon as the previous client finishes with the device, the next client can access it, even if the previous client keeps its connection open.

I have also modified the SOAP interface for Grail to fully support WSDL and also to support anonymous (as before) and named parameters (new) simultaneously. I am trying to bring the Grail SOAP interface up to the latest standards while getting it ready to migrate to gSOAP 2.6. WSDL has been successfully tested with SOAPpy and Grail built with gSOAP 2.3 and 2.6.

Part of the motivation to use HTTP 1.1 Keep Alive connections was to improve throughput and turn-around times for clients who wish to periodically and repeatedly make requests to Grail, as noted in an earlier entry. I found that SOAPpy does not support this, but does allow different transport classes to be specified in the SOAPProxy class. So I wrote an HTTP 1.1 transport class (based largely on the older one) to do some timing tests. The results were surprising. First, here is the test code:

from grailclient import *
from time import time

cl = GrailClient("titan", 18000, cb_port=19591)

def test():
    # Time a single get_value() round trip to Grail.
    begin = time()
    cl.get_value('Accelerometer', 'state')
    end = time()
    print("Call took %s seconds" % (end - begin))

On executing test() with the old HTTP 1.0 transport class, this call typically took about 10 ms. Using the new HTTP11Transport class, the client behaved as expected: the connection to Grail remained open between calls. However, the timing went up by an order of magnitude, taking approximately 100 ms per call!

This counter-intuitive result led me to try another Python SOAP client, ZSI, with the same results. I also tried to test this using Perl and SOAP::Lite, but gave up (for now) because of my lack of Perl knowledge. I measured the time Grail was taking to process the requests, hoping to find something. Grail was spending most of its time waiting in the soap_recv() call. This means that the client still may be at fault, if it does not finish the transmission in good time. Building Grail with gSOAP 2.6 did not improve things. The bottom line is that with respect to execution times Grail will behave as before when called from an unmodified SOAPpy library.

07/22/2004

Optional verbosity re-introduced

Amy Shelton requested that I add back into Grail some of the verbosity that I eliminated for release 4.4. The reason she wanted this is for feedback during testing of turtle on the Antenna simulator. For this, she runs Grail interactively, and this feedback is useful in ensuring that Grail requests are going to this Grail and not to the real system on vortex.

To accommodate this request, I added the command line switches

    -v, --verbose

to Grail. This allows Grail to remain quiet when being run by TaskMaster, but print out useful information when run interactively.

Inconsistent state bug fixed

Bug found, fixed, and patched in Grail 4.4 on 7/20. The bug manifested itself whenever an M&C device (say, DCR) went down and came back on-line again. The DeviceClient on Grail for that device would be left by this in a state where it believed it was subscribed to some parameters on the device, but was not. Thus values for 'state' would eventually become inconsistent with the actual value on the device and could conflict with values reported by CLEO, for example. Control of the device was unaffected, but, as far as previously registered parameters were concerned, Grail was blind (new parameter registrations worked OK).

The problem was caused by the recovery method not having been refactored to the new model of Grail parameter handling. This was fixed by having the recovery method re-register all subscribed parameters with the new recipient client on the resurrected device.
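
The idea behind the fix is to keep the list of wanted subscriptions separate from the connection, so that it can be replayed on a new connection. The sketch below illustrates that idea in Python under stated assumptions: DeviceSession, FakeConnection and their methods are hypothetical names, and the real recovery code lives in Grail's C++ DeviceClient.

```python
class DeviceSession:
    """Tracks which parameters are subscribed on the current connection."""

    def __init__(self, connect):
        self._connect = connect      # callable returning a fresh connection object
        self._wanted = set()         # parameters clients have asked for
        self._conn = connect()

    def subscribe(self, name):
        self._wanted.add(name)
        self._conn.register(name)

    def recover(self):
        """Call when the device comes back: reconnect and replay subscriptions."""
        self._conn = self._connect()
        for name in sorted(self._wanted):
            self._conn.register(name)


class FakeConnection:
    """Test double that just records registrations."""

    def __init__(self):
        self.registered = []

    def register(self, name):
        self.registered.append(name)
```

The bug corresponds to a recover() that replaces the connection but skips the replay loop: the session still believes it is subscribed, but the new connection has no registrations.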

The TIME_WAIT problem

The current Grail has a single threaded SOAP server, and requires clients to connect, transact, and disconnect for every transaction made with Grail (This is the default model for HTTP 1.0).

This causes a problem that was noted recently when Paul Marganian started working on a lightweight M&C status screen that uses Grail. Paul's code makes many requests of Grail every second. Because of the nature of TCP connections, each Grail connection leaves a file handle on the host machine unused and unusable for a period set in the networking stack of that machine (on Solaris, this is 240 seconds by default). This phenomenon can be seen by running the following command on a command line on the Grail host, after making a series of Grail requests:

[rcreager@titan rcreager]$netstat -a | grep 18000
      *.18000              *.*                0      0 33232      0 LISTEN
titan.18000          lycaste.gb.nrao.edu.4619 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4700 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4717 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4732 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4745 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4758 16126      0 33580      0 TIME_WAIT

Here, 6 requests were made in quick succession on a Grail running on the Solaris host titan, port 18000. One can easily see that, by continuously making many requests a second, titan might eventually run out of file handles, and no new requests to any services on titan will work until some of these sockets have finally timed out (this depends on how heavily titan was loaded to begin with, and the rate at which requests are made of Grail or any other titan services). This is not a bug, and using tricks to avoid this is Not A Good Thing! (See the "Programming Unix Sockets in C" FAQ for a detailed explanation of why the TIME_WAIT state is necessary for any newly closed TCP socket.)

There are two possible solutions to this problem:

  1. Change the TIME_WAIT time-out period on the host machine to something much lower than 240 seconds
  2. Keep the socket open when the same client is making multiple transactions with Grail

The first one is easy to do, if you are the sysadmin for that machine. The downside is that this is a system variable, and thus affects all TCP communications on that machine. Lowering it to something like 10-15 seconds should not break anything though, and indeed Joe Bandt did do this to virgo when he ran into the same problem with the Antenna Characterization SOAP interface he uses on the Antenna manager.

The second solution is somewhat more involved, but has several things going for it. The solution involves making Grail's SOAP interface support the HTTP Keep-Alive option. This also requires making the SOAP interface multithreaded: Keep-Alive would lock out other clients if the interface were to remain single-threaded. Internally, Grail is already multi-threaded, so no other changes would be required.

There are a few good reasons to pursue this second approach:

  1. Using a multithreaded SOAP server would make Grail scale better; a client wishing to make a Grail transaction would not have to wait for another client to finish first, provided both clients weren't after the same device.
  2. Using Keep-Alive would improve the transaction turnaround time, as now the transaction would not involve opening a socket and then closing it (on both ends). Keep-Alive in conjunction with asynchronous messaging would greatly improve data throughput for sampler callbacks as well. (The use of Keep-Alive is recommended by the gSOAP Manual to improve performance.)
  3. The number of sockets left in the TIME_WAIT state on the server now will be dependent on the recent (within 4 minutes) number of Grail clients, not the number of transactions. This will be a far lower and less variable number.

Of course, the client end must also support Keep-Alive, otherwise the system will behave as before. I have yet to figure out how to make SOAPpy do this, but SOAPpy is based on Python's httplib and httplib can do this. I have emailed one of the SOAPpy maintainers, asking if there is a ready way to do this; lacking an answer, I will have to go through the SOAPpy source. Fortunately, it is not very big.
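
The effect of Keep-Alive on the number of sockets can be demonstrated with Python's standard library alone (httplib is named http.client in Python 3). The sketch below is not Grail or SOAPpy code: a small local HTTP 1.1 server stands in for the SOAP server and records the client port of each request. The server is threaded because, as noted above, a single-threaded server holding a Keep-Alive connection would lock out other clients.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ports_seen = []            # client port observed by the server for each request


class Handler(BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'          # enables persistent connections

    def do_GET(self):
        ports_seen.append(self.client_address[1])
        body = b'OK'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))  # required for reuse
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):          # keep the demonstration quiet
        pass


server = ThreadingHTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# One connection, two transactions: the socket stays open in between.
persistent = http.client.HTTPConnection('127.0.0.1', port)
for _ in range(2):
    persistent.request('GET', '/')
    persistent.getresponse().read()

# A second connection, opened while the first is still open.
other = http.client.HTTPConnection('127.0.0.1', port)
other.request('GET', '/')
other.getresponse().read()

persistent.close()
other.close()
server.shutdown()
server.server_close()
```

The first two requests arrive from the same client port, so they share one socket and leave at most one TIME_WAIT entry behind when the connection closes; the third request comes from a different port because it used a new connection.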

06/30/2004

Grail verbosity reduced

The old Grail was fairly verbose, making it vulnerable to being aborted on a SIGXFSZ signal (file size exceeded) as its log file approached the limit set in vortexProc.conf.

All non-error output has been wrapped in #if defined(DEBUG)/#endif conditional directives, which means that if compiled with no DEBUG defined Grail is a whole lot quieter.

New 'loConfig' bug fixed.

Why did this problem reappear? This is actually a new bug, with the same symptoms as the old 'loConfig' bug. Grail used the virtual function Panel::reportComplete() to indicate that all registered parameter values have been loaded from the manager. Since the old Grail registered all parameters up front, this was OK. Only one reportComplete() is received on registering all parameters, and therefore there is a guarantee that all parameters registered have valid data. (In the old 'loConfig' bug, the test to see if reportComplete() had been received fell through prematurely; see earlier notes on 'loConfig'.)

The new Grail registers parameters on demand. On creating a manager client, Grail registers 'state' and 'status' and requests values for those. When configuring, this is rapidly followed by a series of new parameter requests. This results in multiple reportComplete() calls being received. A race condition could set in where the new parameter requests could see the original 'state' and 'status' reportComplete() (or reportComplete() for a previous parameter request) and think that values had been received for the new parameters. The fix was to give each parameter its own condition variable and wait for it to be broadcast in Panel::reportParameter(), when the actual data is received. This is much more positive: either there is real data or there is a real error.
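
Giving each parameter its own condition variable means that a waiter can only be released by data for that parameter. The sketch below shows the pattern in Python as an analogy to the C++ fix. ParameterSlot and the parameter name 'other' are hypothetical; report() plays the role of Panel::reportParameter().

```python
import threading


class ParameterSlot:
    """Holds one parameter's value and the condition that announces it."""

    def __init__(self):
        self._cond = threading.Condition()
        self._has_value = False
        self._value = None

    def report(self, value):
        """Called when data for this parameter actually arrives."""
        with self._cond:
            self._value = value
            self._has_value = True
            self._cond.notify_all()

    def wait(self, timeout):
        """Return the value, or raise if nothing arrives within `timeout` seconds."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._has_value, timeout):
                raise TimeoutError('Device failed to respond')
            return self._value


slots = {'state': ParameterSlot(), 'other': ParameterSlot()}
slots['state'].report('Ready')      # data arriving for an earlier parameter
```

A waiter on slots['other'] is not released by the report for 'state'; it either receives its own data or times out with an error.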

fixGoHang helps Grail too!

The new Grail (4.4) is vulnerable to an interesting Manager/Panel interaction issue: If a parameter has no value (such as a dynamic array with 0 elements), no reportParameter() is received after issuing a getValue/getValues (and no reportComplete() either, if this was the only parameter registered). This is the issue fixGoHang addresses: it actually sets these parameters to have at least one element. Joe is looking into the proper fix for this. One alternative is to not allow parameters with no values; another is for the Manager to send reportParameter()/reportComplete() even if no value exists, with some way to tell the Panel user that there is no value associated with that parameter (by setting the 'len' parameter in reportParameter() to 0, for example).

In the current release, this problem manifests itself when either (or both) of the Antenna and LO1 Managers is restarted. Grail may report problems with either of these two devices, or with the ScanCoordinator, which also may deal with these two devices. The config tool will throw an exception or generate an error message with the following text in it:

Device failed to respond

A look at the Grail log (/home/gbt/etc/log/vortex/Grail.<pid>.<date&time>) will show an entry like this:

53187 12:44:53
Caught DeviceClientException:   Device: ScanCoordinator.ScanCoordinator
                                Problem: Device failed to respond
                                Location: ParameterCache::subscribe(int)

The short-term fix is to run fixGoHang.
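
When checking whether a given failure is this problem, the log entries shown above can be picked out mechanically. The following is a minimal sketch that assumes every entry has the three-field layout (Device, Problem, Location) seen in the excerpt; the function name is made up here.

```python
import re

# Matches one "Caught DeviceClientException" entry, assuming the layout above.
ENTRY = re.compile(
    r"Caught DeviceClientException:\s*Device:\s*(?P<device>\S+)\s*"
    r"Problem:\s*(?P<problem>.+?)\s*"
    r"Location:\s*(?P<location>\S+)",
    re.DOTALL,
)

def find_exceptions(log_text):
    """Return one dict per DeviceClientException entry found in log_text."""
    return [m.groupdict() for m in ENTRY.finditer(log_text)]

sample = """53187 12:44:53
Caught DeviceClientException:   Device: ScanCoordinator.ScanCoordinator
                                Problem: Device failed to respond
                                Location: ParameterCache::subscribe(int)
"""

entries = find_exceptions(sample)
no_response = [e for e in entries if e['problem'] == 'Device failed to respond']
```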

04/01/2004

I have started work refactoring Grail. The biggest functional change from the 4.2 version is that parameters will no longer be automatically registered for callbacks and cached. Instead, this will occur on demand as parameters are needed. This will considerably reduce network traffic between Grail and the M&C system.
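
The difference can be sketched with a toy cache in Python. FakeManager is a hypothetical stand-in that only counts registrations, and the parameter values are invented; the real exchange with the M&C system is asynchronous and is not modelled here.

```python
class FakeManager:
    """Hypothetical stand-in for an M&C manager; records what gets registered."""

    def __init__(self, values):
        self._values = dict(values)
        self.registered = []

    def register(self, name):
        self.registered.append(name)   # one round of network traffic per call
        return self._values[name]

class OnDemandCache:
    """Registers a parameter the first time it is asked for, then caches it."""

    def __init__(self, manager):
        self._manager = manager
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._manager.register(name)
        return self._cache[name]

manager = FakeManager({'state': 'Ready', 'status': 'Clear', 'loConfig': 'TrackA'})
cache = OnDemandCache(manager)
cache.get('loConfig')
cache.get('loConfig')   # second call is served from the cache
```

Only the parameters that are actually used get registered, which is where the saving in network traffic comes from.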

Other refactoring:

  • Responsibility for parameter caching, handling and callbacks has been removed from DeviceClient and placed with the GenericRecipient class, which now becomes the ParameterCache class.
  • An analogous change has been made with respect to Samplers. Though the resulting SamplerCache class is functionally very similar to the ParameterCache class, it could not be made the same class. The two classes use Monitor and Recipient, respectively, to communicate with the M&C system; Monitor and Recipient are similar, but differ enough that SamplerCache and ParameterCache have to be two different classes. More sophisticated refactoring (multiple inheritance?) could reduce this code duplication.
  • Both ParameterCache and SamplerCache now handle callbacks through a publish/subscribe mechanism, handled by a template class EventDispatcher, based on a similar class used extensively in the Metrology software. This template class will reside in the ygor/libraries/Util directory as EventDispatcher.h.
  • DeviceClientMap has been moved to its own compilation unit, DeviceClientMap.cc.

These changes have resulted in a considerable slimming down of DeviceClient accompanied by only a very modest increase in complexity in the ParameterCache (formerly GenericRecipient) class, and the creation of a similar companion SamplerCache class.
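
The publish/subscribe mechanism mentioned in the list above can be illustrated with a few lines of Python. This is a generic sketch of the pattern only; the interface of the real EventDispatcher template in EventDispatcher.h is not described in these notes, so the method names here are assumptions.

```python
from collections import defaultdict

class EventDispatcher:
    """Minimal publish/subscribe hub keyed by parameter or sampler name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, key, callback):
        self._subscribers[key].append(callback)

    def unsubscribe(self, key, callback):
        self._subscribers[key].remove(callback)

    def publish(self, key, value):
        # Iterate over a copy so callbacks may unsubscribe themselves safely.
        for callback in list(self._subscribers.get(key, ())):
            callback(key, value)

received = []
dispatcher = EventDispatcher()
dispatcher.subscribe('loConfig', lambda key, value: received.append((key, value)))
dispatcher.publish('loConfig', 'TrackA')   # delivered to the subscriber
dispatcher.publish('state', 'Ready')       # nobody subscribed; silently dropped
```

A cache class then only has to call publish() when new data arrives, and does not need to know who is listening.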

03/25/2004

'loConfig' bug

Caused by fresh data buffer containing bogus data.

Symptoms

Sometimes Grail returns an error message on an attempt to read or set an enum parameter. Because this was first, and is most often, observed with the LO1 parameter loConfig, I call this the 'loConfig bug'.

In the Python based GrailClient, the error looks like this:
>>> LO1.get_value("loConfig")
<Fault SOAP-ENV:Client: Device error: LO1.LO1: No value returned for parameter loConfig>
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "grailclient.py", line 421, in get_value
    return self.cl.get_value(self.dev, path)
  File "grailclient.py", line 249, in get_value
    return self.cl.get_value(device, path)
  File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 362, in __call__
    return self.__r_call(*args, **kw)
  File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 384, in __r_call
    self.__hd, self.__ma)
  File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 306, in __call
    raise p
SOAPpy.Types.faultType: <Fault SOAP-ENV:Client: Device error: LO1.LO1: No value returned for parameter loConfig>

A look through Grail's logs reveals this:

1 - 17:1:32.8962: accepted connection from IP = 192.33.116.175 socket = 5
LO1.LO1 initialized correctly
DataNamedValues::value2Name(-1431655766, 1627e0, 200): EnumerationParser::findName(1627e0, 200, -1431655766) failed, error code -3
Parameter::get_value(): data = aa aa aa aa
Parameter::get_value() failure: 0 = _ddp->getFieldValueStr(loConfig, 18ee18, 4, 1627e0, 0)
Caught DeviceClientException:   Device: LO1.LO1
                                Problem: No value returned for parameter loConfig
                                Location: DeviceClient::get_value()
 request served in 0:0:0.252247

If the attempt to read/set the value had succeeded, it would look like this:

2 - 17:2:21.6822: accepted connection from IP = 192.33.116.175 socket = 5
 request served in 0:0:0.005856

Finally, under the wrong conditions, this can cause Grail to hang, because it exposes a synchronization error: the exception thrown by the failed read unwinds the stack and leaves a mutex locked. (This has since been fixed in the latest Grail, but the loConfig bug itself remains.)

Cause of the loConfig bug:

The bug will occur when a read or set operation is requested of a DeviceClient which has not yet been constructed. The sequence of events is as follows:

  1. Request is made to set or get a parameter (say loConfig) from a DeviceClient object (say LO1) which does not yet exist.
  2. Grail creates the LO1 client
  3. Grail instructs the LO1 client to getValues(), to populate parameter cache
  4. Grail waits for LO1 to notify reportComplete
  5. Grail gets/sets the value

The problem occurs between steps 4 and 5. The test for notification of reportComplete was prematurely falling through. Thus, Grail was attempting to use a value before it was actually received from the manager. When this happens, what follows depends on the type of parameter. For all numeric parameters, an improbable value might be returned, such as a NaN for a voltage, etc. But no exception will be thrown. For an enum, an exception will be thrown, because a bad value means that the findName routine (of class EnumerationParser in the DataDescription library) will fail. Thus the problem is not confined to enums, just made very visible by them.
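
The numbers in the log above bear this out. The fresh buffer holds the bytes aa aa aa aa, and reading those four bytes as a signed 32-bit integer gives exactly the value that value2Name() complained about. The short Python check below shows this; the enum table in it is hypothetical, since the real loConfig choices are not listed in these notes.

```python
import struct

raw = bytes([0xAA] * 4)                  # "data = aa aa aa aa" from the log
(as_int,) = struct.unpack('<i', raw)     # the same bytes as a signed 32-bit int
(as_float,) = struct.unpack('<f', raw)   # the same bytes as a 32-bit float

# Hypothetical enum table, for illustration only.
ENUM_NAMES = {0: 'ChoiceA', 1: 'ChoiceB'}

def value_to_name(value):
    """Enum lookup: a value with no matching name is an error."""
    if value not in ENUM_NAMES:
        raise LookupError("no enum name for value %d" % value)
    return ENUM_NAMES[value]

print(as_int)   # -1431655766, the value in the value2Name() log line
```

Read as a number, the bogus buffer simply yields a meaningless but legal value (as_float above is a tiny negative number), so nothing complains. Only the enum lookup has a way to notice that the value is nonsense.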

Solution

This problem was fixed by switching to the new TCondition<> condition variable template class in Ygor/libraries/Threads. This condition variable behaves correctly, so Grail now always waits until the reportComplete has actually been received.
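
For readers unfamiliar with the pattern, here is a small Python sketch of a condition variable that carries a value and only releases a waiter once that value matches what the waiter expects. It is a guess at the general idea only: the actual interface of TCondition<> is not shown in these notes, and the names below are invented.

```python
import threading

class ValueCondition:
    """A value guarded by a condition variable (illustrative, not TCondition<>)."""

    def __init__(self, initial):
        self._cond = threading.Condition()
        self._value = initial

    def set(self, value):
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait_for_value(self, expected, timeout=None):
        """Return True once the value equals expected, or False on timeout."""
        with self._cond:
            # The predicate is re-checked after every wakeup, so a stray
            # wakeup cannot make the wait fall through early.
            return self._cond.wait_for(lambda: self._value == expected, timeout)

complete = ValueCondition(False)
threading.Timer(0.05, complete.set, args=(True,)).start()
got_complete = complete.wait_for_value(True, timeout=1.0)
```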

-- RamonCreager - 19 Mar 2004
Topic revision: r20 - 2006-12-13, RamonCreager