FAQ

What does a typical ESM deployment look like?

An ESM deployment consists of one or more ESM Agent instances, a single instance of the ESM Server application, and any number of client machines that access the ESM application interface through the thin-client, browser-based front end.

An ESM Agent runs on each node that runs analytic workload (for example, a SAS Application Server) and is responsible for collecting process and server events and metrics and submitting them to the ESM Server application. The ESM Server application collects metric data from each ESM Agent instance and stores it in the ESM Database. It also serves the web-based front end for the ESM application interface and performs a minimal level of scheduled database maintenance.

ESM Overview

How does ESM get its metrics from SAS?

In order to maximise compatibility and stability and eliminate any possibility of interfering with the user processes being monitored, ESM's integration with traditional SAS Foundation sessions relies on SAS DATA Step and a common filesystem location for inter-process communication.

Each ESM agent instance is configured with its own events directory, which it continually monitors for event files generated by processes that need to communicate with it. As soon as an event trigger file appears in this location, the ESM agent reads it and, if the data it contains is valid, removes the file. The agent then interprets the information or instruction in the event file and acts on it, for example by starting to monitor a new process or by forwarding the data in the event file to the ESM Server.
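
The read, validate and remove cycle described above can be sketched as a single pass over an events directory. This is an illustration only, not the agent's implementation: the on-disk format of an event file is not specified here, so the sketch assumes (purely for illustration) that each file holds one JSON object, and the function name drain_events is hypothetical.

```python
import json
from pathlib import Path


def drain_events(events_dir):
    """Make one pass over events_dir and return the events that parsed.

    ASSUMPTION: event files are treated as JSON objects for illustration;
    the real ESM event file format may differ.
    """
    handled = []
    for path in sorted(Path(events_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            event = json.loads(path.read_text())
        except (OSError, ValueError):
            # Unreadable or invalid content: leave the file where it is.
            continue
        if not isinstance(event, dict):
            continue
        # The data is valid, so the file is removed before the event is used.
        path.unlink()
        handled.append(event)
    return handled
```

A real agent would call something like this repeatedly, or rely on filesystem notifications, and would then dispatch each returned event according to its type.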

How are events communicated to the ESM Server?

Deploying the ESM Agent onto a shared filesystem allows for an instance of an Agent to be started on each node in the cluster without requiring multiple installations. To avoid potential conflicts that may arise from non-unique PIDs, each node has its own dedicated events directory. In such a multi-node or GRID installation, where the filesystem is shared across multiple nodes, the layout of the event directories would look something like this:

opt
   ESM
     esm-agent
       events
         node1_hostname
         node2_hostname
           esm_eventfile_1
           esm_eventfile_2
           esm_eventfile_3
         node3_hostname
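
As a rough sketch of how a process might locate the events directory for its own node under the layout above: the function name node_events_dir is hypothetical, and the fallback order (explicit argument, then the ESMNODENAME environment variable, then the system hostname) is an assumption rather than documented ESM behaviour.

```python
import os
import socket
from pathlib import PurePosixPath


def node_events_dir(events_root, hostname=None):
    """Return the per-node events directory beneath a shared events root.

    ASSUMPTION: the node name comes from the argument, else ESMNODENAME,
    else the local hostname. Check the agent configuration for the real rule.
    """
    node = hostname or os.environ.get("ESMNODENAME") or socket.gethostname()
    return PurePosixPath(events_root) / node


# Example: the second node in the layout shown above.
print(node_events_dir("/opt/ESM/esm-agent/events", "node2_hostname"))
```

Because every node writes only into its own subdirectory, two processes with the same PID on different nodes cannot collide.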

Multiple types of events are supported by ESM.

new process events

A new event tells ESM to begin monitoring a process and communicates the relevant attributes of that process. The following properties are typically communicated:

  • pid - the ID of the process to be monitored
  • hostname - the hostname of the machine (must match configured agent hostname / ESMNODENAME environment variable)
  • owner - the name of the user that the process can be attributed to
  • sasUuid - a unique identifier for this 'session'. Session-provided UUIDs are useful when this value needs to be propagated to sub-sessions as an environment variable for the purposes of reconciliation with parent jobs/sessions. If a value is not provided here it will be automatically generated by the agent
  • queue - the name of the queue that this session / job belongs to
  • jobName - the identifier of the session. For jobs this is typically the job name.
  • workFolder - the temporary directory attributed to the session as transient WORK storage (SAS specific). Can be an array of directory locations*.
  • utilFolder - the temporary directory attributed to the session as transient UTIL storage (SAS specific). Can be an array of directory locations*.
  • logFile - the logfile to be attributed to the session and parsed in real time for events.
  • logs - a list of logFiles, where a job generates more than one logfile or requires more than one log to be followed
  • esmType - the 'session type' is an ESM attribute. Typically it is one of WS, PWS, STP, Batch, GRID, LASR, JVM or SYS, but categories can be added dynamically. SYS sessions are not shown by default, and cannot be acted on (terminated) by users.
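
To make the property list concrete, the following sketch assembles the attributes of a new event as a plain dictionary. It uses the property names listed above, but the helper new_process_event is hypothetical, treating pid and hostname as mandatory is an assumption, and the way the result is serialised into an event file is not covered here.

```python
# Optional properties taken from the list above.
OPTIONAL_PROPERTIES = {
    "owner", "sasUuid", "queue", "jobName", "workFolder",
    "utilFolder", "logFile", "logs", "esmType",
}


def new_process_event(pid, hostname, **properties):
    """Build the attribute set for a new event (illustrative helper only).

    ASSUMPTION: pid and hostname are treated as required; everything else
    is optional and is omitted when not supplied.
    """
    unknown = set(properties) - OPTIONAL_PROPERTIES
    if unknown:
        raise ValueError("unknown properties: " + ", ".join(sorted(unknown)))
    event = {"pid": int(pid), "hostname": hostname}
    # Skip unset values, e.g. a sasUuid that the agent should generate itself.
    event.update({k: v for k, v in properties.items() if v is not None})
    return event


event = new_process_event(
    4242, "node2_hostname",
    owner="analyst1", jobName="nightly_load", esmType="Batch", sasUuid=None,
)
```

Here sasUuid is left out of the result, which corresponds to the case where the agent generates the identifier automatically.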

tag events

A tag event is a basic event which is attributed to a process at a given time and contains contextually relevant information. It is intended to be used by programmers to help identify progress between code blocks or functions, but can be extended for any purpose where overlaying contextual data flags onto the time series is beneficial.

A basic tag event has the following properties:

  • text - the title of the tag event, searchable from the Tag Search
  • tooltip - detailed information about the event, shown when the user hovers over the flag to display the tooltip
  • color - the colour of the tag flag, in HTML colour notation

highlightStart and highlightEnd events

Highlight events are called highlightStart and highlightEnd for legacy reasons; they are better described as jobStart and jobEnd events. They are a special type of tag event used to communicate information specific to jobs, such as job return codes and job flow information.

A highlightStart event requires the following:

  • pid - the process ID of the job in question
  • hostname - the hostname of the machine the job is executing on (must match configured agent hostname / ESMNODENAME environment variable)
  • uuid - the code-generated unique ID for the job in question. The purpose of this ID is to reconcile the data communicated in the highlightEnd tag with the PID of the job

A highlightEnd event requires the following:

  • hostname - the hostname of the machine the job is executing on (must match configured agent hostname / ESMNODENAME environment variable)
  • uuid - the code-generated unique ID for the job in question. The purpose of this ID is to reconcile the data communicated in the highlightEnd tag with the PID of the job
  • text - the identifier for the job, typically the job name matching the jobName identifier in the new event
  • returnCode - the exit code, or completion status, with which the job terminated (i.e. 0 = success, 1 = warning, 2+ = error). Return codes of 3 and 6 (ABORT exits and internal errors) are treated as errors
  • flow - a colon-separated string of identifiers describing the job's position within the LSF flow hierarchy. This expects the verbatim value of the LSB_JOBNAME environment variable, from which superfluous values such as the user name or LSF job ID are stripped
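
The return code mapping and the colon-separated flow string described above can be illustrated as follows. Both function names are hypothetical, and the sketch only splits the flow string; it does not attempt to reproduce how ESM strips the user name or LSF job ID, since that logic is not described here.

```python
def classify_return_code(return_code):
    """Map a job return code to a status: 0 success, 1 warning, else error."""
    if return_code == 0:
        return "success"
    if return_code == 1:
        return "warning"
    # Everything else, including 2 and above (so also 3 and 6), is an error.
    return "error"


def split_flow(flow):
    """Split a colon-separated flow string into its identifiers.

    No stripping of user names or job IDs is attempted in this sketch.
    """
    return flow.split(":")


# The flow value below is made up for illustration.
print(classify_return_code(6), split_flow("flowA:subflowB:job1"))
```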

A note on UUIDs and highlight tags

The highlight tag mechanism may appear convoluted, but it exists to reconcile job PIDs with return (exit) codes. When a SAS 'Job' is launched, the executing script (i.e. sasbatch.sh) spawns a SAS process as a child, and when the job finishes, that script collects the return code of the SAS subprocess. To guarantee a unique relationship between the session being monitored and the exit code returned on termination, the uuid must be generated and exported by the parent context (the sasbatch.sh process). The highlightStart tag, which the job process generates at startup once the subprocess ID is known, can then be linked to the exit code reported back to the parent process.
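
The parent and child relationship described above can be sketched in a few lines. This is a simplified stand-in for what a wrapper such as sasbatch.sh does, not its actual contents: the environment variable name ESM_JOB_UUID and the function run_job are hypothetical, and a small Python command stands in for the SAS job.

```python
import os
import subprocess
import sys
import uuid


def run_job(command, uuid_var="ESM_JOB_UUID"):
    """Generate a UUID in the parent, export it to the child, collect the exit code.

    ASSUMPTION: uuid_var is an illustrative name, not a documented ESM variable.
    """
    job_uuid = str(uuid.uuid4())
    # The child inherits the UUID, so it can emit it in its highlightStart tag.
    env = dict(os.environ, **{uuid_var: job_uuid})
    completed = subprocess.run(command, env=env)
    # Only the parent sees the exit code; it pairs the code with the same UUID,
    # which is the information a highlightEnd event needs.
    return {"uuid": job_uuid, "returnCode": completed.returncode}


# Stand-in job that exits with return code 1 (a warning).
result = run_job([sys.executable, "-c", "import sys; sys.exit(1)"])
```

The key design point is that the UUID is created before the job starts, so it is known both to the job process (which knows its own PID) and to the parent (which learns the return code).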