Archive Database
Use the Archive Database command to delete a previously archived database or to get instructions for archiving.
The SYSTEM INFORMATION category provides information about the system on which the Single-system Manager is running.
Use the commands in this category to display the following types of system information:
Use this command to display the hardware configuration of the system, which existed at a specific time on a specific date.
Hardware configuration information is available for the following systems:
If you are interested in hardware information for a specific date/time, enter the desired date/time in the appropriate field. Otherwise, the information displayed is as of 23:59:59 for the current date or the latest available.
Note: The current date means the date that your system displays.
You must select a database that corresponds to the date that you specified.
For a Group of Systems, choose the system whose hardware information you want.
If you have changed the query date or time, this report displays a table of the hardware components that form the selected configuration. The information is displayed in a hierarchical manner. If information is not available or not applicable, "N/A" is displayed.
The first column of the report table can include one or more of the following symbols: a double arrow, a single arrow that points to the right, or a single arrow that points down.
Clicking on a double arrow fully expands a particular entry and all of the components and subcomponents below it. For the Origin2000 and Onyx2 series, a double arrow expands only one module and its components. Clicking on a right arrow expands the table to display the subcomponents that compose the selected component. Clicking on a down arrow collapses the subcomponent display.
The other columns of the table contain the following information:
| Column | Description |
| NAME | The name of the component. For the topmost component(s), numbers indicate the module number(s) (Origin2000/Onyx2). |
| LOCATION | The location of the component |
| PART_NUMBER | The part number of the component |
| SERIAL_NUMBER | The serial number of the component |
| REVISION | The revision level of the component |
Use this command to display the software configuration of the system and version information that existed at a specific time on a specific date.
If you are interested in software information for a specific date/time, enter the desired date/time in the appropriate field. Otherwise, the displayed information defaults to 23:59:59 for the current date or the latest available.
Note: The current date means the date that your system displays.
You must select the database that corresponds to the date that you specified.
For a Group of Systems, choose the system whose software information you want.
This report lists the software that was installed on the system at the time you specify.
The installed software is listed 10 items per page. Navigation controls let you list the next 10 pages, go to the last page, list the previous 10 pages, and return to the first page.
The report table provides the following information:
| Column | Description |
| NAME | The name of the software application |
| VERSION | The version number of the software application |
| INSTALL_DATE | The date on which the software application was installed |
| DESCRIPTION | A description of the software |
Use this command to view any system changes that occurred within the range of dates that you specify.
If you want to see information for a specific period of time, change the entries in the "From" and "To" fields. By default, the dates are set to the current day.
Note: The current date means the date that your system displays.
You must select the database that corresponds to the dates that you specified.
System change information can be collected from only one database at a time.
For a Group of Systems, you must also choose the system whose change information you want.
The SGI Embedded Support Partner tracks the following types of system changes:
The software table describes all software changes that occurred during the period of time that you specified. The table provides the following information:
| Column | Description |
| Name | The name of the software application |
| Version | The version number of the software application |
| Install Date | The date on which the software application was installed |
| Deinstall Date | The date on which the software application was deinstalled |
| Description | A description of the software |
The hardware table describes all hardware changes that occurred during the period of time that you specified. The table provides the following information:
| Column | Description |
| Name | The name of the part |
| Location | The location of the part |
| Part Number | The part number for the part |
| Serial Number | The serial number of the part |
| Revision | The revision level of the part |
| Install Time | The date on which the component was installed |
| Deinstall Time | The date on which the component was deinstalled. |
The system changes table describes all system changes (for example, hostname, IP address change, and so on) that occurred during the period of time that you specified. The table provides the following information:
| Column | Description |
| System Changes | Current/Previous system |
| System ID | Numeric System ID |
| System Type | System IP type |
| System Serial Number | The serial number of a system |
| Hostname | The hostname of a system |
| IP Address | IP address of a system |
Use this command to view the transaction history of a part.
You must enter the component serial number. (If necessary, use SYSTEM INFORMATION > Hardware to locate a serial number.)
You must choose a database to view the history of the component whose serial number you entered above.
For a Group of Systems, you also must choose the system whose part transaction history you want to view.
The report table lists the name of the component, the module number in which the component was installed, the part number of the component, the serial number of the module, the revision number of the part, and the slot number in which the component was installed.
Use this command to view information about events that SGI Embedded Support Partner has registered.
Enter a range of dates for the events that you want to view. For a Group of Systems, you must choose the system whose registered event information you want to see. Then, choose the type of event information that you want to view. The following options are available:
| Option | Description |
| All System Events | A view of all events that occurred between a range of dates that you specify |
| Specific System Event | A view of a specific event |
| System Events by Class | A view of a selected class of events |
All System Events
The report table provides the following information about events that were registered within the selected range of dates:
| Column | Description |
| Event Number | The chronological order of the event within the event list |
| Event Class | The class in which the event belongs (for example, Availability) |
| Event Description | A brief description of the event |
| Event ID | The unique identification number assigned to this event. You can use this number to find this event via SYSLOG |
| First Occurrence | The date and time that the event first occurred |
| Last Occurrence | The date and time that the event last occurred. If Number of Occurrences is 1, the time value of the First Occurrence and the time value of the Last Occurrence will be identical |
| Number of Occurrences | The number of times that the event occurred. This number corresponds to the number of events that must occur before registration begins. By default, this number is 1. |
Specific System Event
Use this report to track a specific event that is associated with an actual or suspected system problem. Choose an event class from the list that appears.
Use this page to specify the event that you want to view. Choose the event from the list of events in the class that you have already specified. The report table provides the following information about the event registrations between the selected range of dates:
| Column | Description |
| First Occurrence | The date and time that the event first occurred |
| Last Occurrence | The date and time that the event last occurred. If Number of Occurrences is 1, the time value of the First Occurrence and the time value of the Last Occurrence will be identical. |
| Number of Events | The number of times that the event occurred. This number corresponds to the number of events that must occur before registration begins. By default, this number is 1. |
System Events by Class
Use this report when you need information about events that are associated with a specific class. For example, use Memory class to track various memory events. Choose the appropriate class for the event that you want to view.
The report table provides the following information about events that were registered between the selected range of dates:
| Column | Description |
| Event Number | The chronological order of the event within the event list |
| Event Description | A brief description of the event |
| Event ID | The unique identification number assigned to this event |
| First Event Occurrence | The date and time that the event first occurred |
| Last Event Occurrence | The date and time that the event last occurred. If Number of Occurrences is 1, the time value of the First Occurrence and the time value of the Last Occurrence will be identical |
| Number of Events | The number of times that the event occurred. This number corresponds to the number of events that must occur before registration begins. By default, this number is 1. |
Use this command to display information about actions that have been performed by SGI Embedded Support Partner.
Specify the range of dates for which you want to report actions taken. If you do not enter a date, this option defaults to the current date.
For a Group of Systems, you must specify the system for which you want to generate a report. Note: the report shows actions taken by Systems Group Manager for the system that you specify.
You must choose one of the two available types of reports:
| Report Type | Description |
| All Actions Taken | Displays all actions that were taken on the system and the events that triggered those actions |
| Actions Taken for a Specific Event | Displays actions taken for a specific event only |
All Actions Taken
This option displays the actions that the SGI Embedded Support Partner performed within the range of dates that you specified. The report table provides the following information about actions that were taken for all events between the selected range of dates:
| Column | Description |
| Event Class | The class in which the event belongs (for example, Availability) |
| Event Description | A brief description of the event |
| Event ID | The unique identification number assigned to this event |
| Action Description | A brief description of the action |
| Action Taken | The action that SGI Embedded Support Partner performed in response to the event |
| Time of Action | The date and time that SGI Embedded Support Partner performed the action |
Actions Taken for a Specific Event
Use this option when you want to view actions taken for specific events. Choose an event class that contains the event that you want to select.
From the list of events, choose the event that you want to research. The report table provides the following information about actions that were taken for the specified event between the selected range of dates:
| Column | Description |
| Action Description | A brief description of the action |
| Action Taken | The action that the SGI Embedded Support Partner performed in response to the event |
| Time of Action | The date and time that the SGI Embedded Support Partner performed the action |
This command displays the results of the diagnostics that you run on the system.
You must specify the range of dates for which you want to view diagnostics results. For a Group of Systems, you must also specify a system for which you want to view diagnostics results.
The top portion of the diagnostic report contains the information that pertains to the system from which you requested the report.
The diagnostics results table provides the following information for all diagnostics that were run on the system during the period of time that you specified:
| Column | Description |
| Diagnostic Name | The name of the diagnostic. In cases where multiple tests run as a group under one program (for example, under SVP), the total number of tests is indicated in parentheses next to the name of the diagnostic. |
| Diagnostic Status | Diagnostic status can be PASS, FAIL, or COMPLETE. |
| Diagnostic Result Time | The time when the diagnostic test completed. When multiple tests run under one program, the Diagnostic Result Time indicates the time when the entire program completed. |
SYSTEM INFORMATION > Availability
This command displays system availability statistics. The upper portion of this page displays the total availability percentage and the mean time between interrupts (MTBI) in minutes.
You must specify the range of dates and type of availability information that you want to view. For a Group of Systems, you must also specify a system or a set of systems for which you want to view availability information.
| Option | Description |
| Overall Availability | Summary of overall availability information for the system |
| Availability Events List | Information about individual availability events that the system has registered |
Overall Availability
The Overall Availability option covers the aggregation of events for the given system. Events are grouped as either "Unscheduled" or "Service Action" (controlled shutdown) events. Events are further classified by categories within these two groups. For each category, the overall availability report includes the count of events in that category, the total downtime (in minutes), the MTBI (mean time between interrupts, in minutes) and the availability as a percentage. MTBI and availability per category are computed for events within the category as applied to the entire time period of the report. Count, total downtime, MTBI, and availability are also displayed for the two groups, as well as the final total of all the events.
The average, least, and most uptimes and downtimes are also included in the report in addition to logging start time and the duration of system uptime since the last boot.
On a Group of Systems, the above statistics are calculated for all systems in the group.
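The per-category figures described above can be sketched as follows. This is a minimal illustration only: the formulas used here (availability as uptime divided by the report period, and MTBI as uptime divided by the event count) are assumptions, not the product's documented computation, and the names `CategoryStats` and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CategoryStats:
    count: int                 # number of events in the category
    downtime_min: float        # total downtime, in minutes
    mtbi_min: Optional[float]  # mean time between interrupts; None if no events
    availability_pct: float    # availability as a percentage


def summarize(downtimes_min: List[float], period_min: float) -> CategoryStats:
    """Summarize one category over the report period (all values in minutes).

    Assumed formulas (not taken from the product):
      availability = (period - downtime) / period * 100
      MTBI         = (period - downtime) / count
    """
    if period_min <= 0:
        raise ValueError("report period must be positive")
    count = len(downtimes_min)
    downtime = sum(downtimes_min)
    uptime = period_min - downtime
    mtbi = uptime / count if count else None
    return CategoryStats(count, downtime, mtbi, uptime / period_min * 100.0)


# Example: two unscheduled events (30 and 90 minutes) in a 7-day report period.
stats = summarize([30, 90], period_min=7 * 24 * 60)
```

Under these assumptions, the group rows and the final total would be produced by calling the same function on the combined list of events.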
The Overall Availability table summarizes the overall availability of the system:
For a Single System, use the Event Availability Information link at the bottom of the page to access information about the individual availability events that the system has registered.
For a Group of Systems, use the Availability Summary For All Hosts link at the bottom of the page to access information about the summary of all availability events that the system(s) has registered.
Availability Summary For All Hosts
This report is applicable only for a Group of Systems. It displays the summary of events and availability of each individual system that has been subscribed to. Clicking on Host Overall Availability will display the Overall Availability for that particular system. Please refer to the Overall Availability for more information.
Event Availability Information
In the events list display, the fields shown are the Start Time (when the system was previously booted), the Incident Time (when the event occurred), the uptime and downtime in minutes, and a very brief description of the event type or cause of the event. The Event Summary displays the event information in more detail, including a complete event type description.
The report provides a summary of an event that includes the following information:
If a system panic occurs, this report also includes a brief summary of why the system panicked.
Use the SETUP menu to set or change the following parameters that control the operation of SGI Embedded Support Partner:
This command configures the Web server that SGI Embedded Support Partner uses. Use this command to perform the following functions:
The upper portion of this page displays the following information:
| Name | Description |
| Server Identification | The name of the Web server software in use |
| Server Version | The version level of the Web server software and its installation date |
| Server Port | The Web server connection port in use |
The lower portion of this page displays the following selectable options:
| Option | Description |
| Server Access Permissions | Enables or restricts access by external systems |
| User Name & Password Change | Enables you to change the current username and password |
Server Access Permissions Option
Use this page to specify which systems can access the SGI Embedded Support Partner Web server. Any change that you make to the server access list takes effect immediately.
You can specify the exact IP address or IP address mask using a wildcard. For example, 197.23.14.5, or 135.*.*.5, or *.*.*.*, and so on.
IMPORTANT: If the "Restrict access to the systems with the following IP addresses" list is empty, all systems are allowed to connect to the SGI Embedded Support Partner server. To restrict access, add "*.*.*.*" to the restriction list. All IP addresses are allowed to connect to the server by default. Only the presence of "*.*.*.*" in the restriction list enables the filtering mechanism of the server. Combinations of different IP addresses in the "restricted" and "allowed" lists can create a complex and flexible filtering mechanism for incoming IP addresses. Be very cautious when updating the "restricted" and "allowed" lists to avoid locking yourself out of the Embedded Support Partner facilities.
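To check how a wildcard mask behaves before adding it to a list, the matching rule can be approximated as below. This is a sketch that assumes each "*" stands for exactly one octet. The function name `matches_mask` is hypothetical, and the way the server combines the "restricted" and "allowed" lists is not modeled here.

```python
def matches_mask(address: str, mask: str) -> bool:
    """Return True if a dotted-quad address matches a mask in which
    any octet may be the wildcard "*" (assumed per-octet semantics)."""
    addr_octets = address.split(".")
    mask_octets = mask.split(".")
    if len(addr_octets) != 4 or len(mask_octets) != 4:
        return False
    # Each mask octet must either be the wildcard or equal the address octet.
    return all(m == "*" or m == a for a, m in zip(addr_octets, mask_octets))


# The masks below are the examples given in the text.
for mask in ("197.23.14.5", "135.*.*.5", "*.*.*.*"):
    print(mask, matches_mask("135.10.20.5", mask))
```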
User Name and Password Change Option
Use this page to change a current username or password that enables access to SGI Embedded Support Partner. Any change that you make to a username or password takes effect immediately.
The username and password must each contain between 1 and 128 characters. Characters like "*", "&", and ":" are not allowed in the username and password strings.
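The length and character rules above can be expressed as a small check. This sketch covers only the three characters that the text names. The wording "characters like" suggests that the real list of disallowed characters may be longer, so treat `FORBIDDEN` and the function name as assumptions.

```python
# Only the characters named in the text; the actual set may be larger.
FORBIDDEN = set("*&:")


def is_valid_credential(value: str) -> bool:
    """Check a username or password against the documented rules:
    1 to 128 characters, none of the known forbidden characters."""
    return 1 <= len(value) <= 128 and not (FORBIDDEN & set(value))


print(is_valid_credential("administrator"))  # default username from the text
print(is_valid_credential("user:name"))      # contains a forbidden ":"
```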
The default username administrator and the default password partner must be changed immediately after installation.
An event is a happening or an occurrence that takes place on the system that SGI Embedded Support Partner is monitoring. A few examples of events follow: parity errors, disk full, nonmaskable interrupts (NMI), and even activities of the SGI Embedded Support Partner itself.
Use this page if you want to reset the following parameters for all events on the system. In the case of a Group of Systems, use this option to reset parameters for all systems.
Note: Refer to the SETUP > Events and the SETUP > Actions menus for additional information about events and actions.
Note: The Global Configuration setting overrides individual event settings.
Because the number of events can be extensive, events are divided into sets called classes. This scheme simplifies the management of events, enables more efficient use of displays, and facilitates navigation within the program.
The following options are available:
View Event
This option is available only for a Single System. Use this option to determine the current setting of an individual event. This option allows you to view:
View Event List
This option is available for a Single System and for a Group of Systems. Use this option when you want to obtain a list of all events compatible with the SGI Embedded Support Partner. The report allows you to view:
View Classes
This option is available for a Single System and for a Group of Systems. Use this option when you want to view all classes available on the system. The report allows you to view:
* A Member system is a system that is subscribed to the Systems Group Manager.
Use this command to update existing events. For a Group of Systems, you must choose the system whose event you want to update.
You must select the class that contains the event(s) that you want to update.
Specify the event(s) that you want to update by one of the following methods:
Single Event Update displays the current parameters for a chosen event:
Note: You can select more than one action. If you cannot locate an action that you need, use the SETUP > Actions > Add command to add it.
2. Multiple Event Update
This option is available for a Single System only. When you update multiple events, you must remember that any changes you specify on this page will affect all of the selected events:
To replace, add, or delete actions for multiple events, use SETUP > Events > Update Event Actions.
SETUP > Events > Update Event Actions
An event/action assignment defines the action that the SGI Embedded Support Partner performs when it registers a specific event. An event/action is a cause-and-effect relationship between an event and an ensuing action. Use this command to modify an event/action assignment; that is, to replace, add, or delete event/action assignments.
You can select the event/action relationship that you want to update by two methods:
Updating Event Action Assignments (Method 1)
1. Choose the action that you want to update.
2. Select the events for which you want to update the action assignment.
3. Click on Replace, Add, or Delete.
Replace Option: The Replace option deletes the current action from the event and assigns a new action to the event. Choose the action with which you want to replace the current action.
Add Option: The Add option assigns the selected action to one or more events. Select one or more actions that you want to assign to the selected event(s) in addition to the existing ones.
Delete Option: The Delete option deletes the action from the events that you selected.
Updating Event Action Assignments (Method 2)
1. Choose the class that you want to search.
2. Choose the event(s) in the class.
3. Choose the action that you want to add. (This method does not provide Replace or Delete options.)
Use this command to add new events for the SGI Embedded Support Partner to monitor. This option is available for a Single System only. To add an event to a Group of Systems, refer to SETUP > Events > Subscribe. The following options are available:
| Option | Description |
| Either select a class name for the new event | Specifies the existing class to which you want to add the event. If you want to add a new class, leave this option unselected and enter a new class description. |
| Or create a new class name for the new event | Specifies a new class of events. Use this option if you want to add an event to a new class of events. If you want to add an event to an existing class of events, select the class from the existing classes and leave this option blank. |
| Enter a name for the new event | Specifies a description of the event that is shown in the interface |
You may set the following parameters:
SETUP > Events > Delete
Use this command to delete a custom event or custom class from the SGI Embedded Support Partner. All records and information associated with these classes/events will also be deleted. This option is available for a Single System only. To delete an event from a Group of Systems, refer to SETUP > Events > Subscribe.
If you want to delete a custom class, choose the class that you want to delete. Click on Delete Class. The class will be deleted with all associated events and event data. You are not allowed to delete System classes.
If you want to delete a custom event, choose the class to which the event belongs. Click on Delete Event.
From the list of the events for the selected class, choose the event that you want to delete.
Note: All event data associated with this event will be deleted.
SETUP > Events > Subscribe
This option is available for Group of Systems only. Subscription is the process by which a Systems Group Manager requests a remote system that is running SGI Embedded Support Partner to forward events that occur on the remote system. Subscription is done based on Events that are recognized on the remote host. Events can be individually subscribed with some exceptions. For example, all Availability class events are subscribed together. This is done to provide accurate availability statistics. Once an event is subscribed, the remote host forwards any occurrences of the event to Group Manager, which enables the Systems Group Manager to act as a central repository of information for different remote systems.
Unsubscription is the reverse of Subscription. It is the process by which Systems Group Manager informs a remote system that it is no longer interested in the events that were subscribed earlier.
Once you enter a hostname and choose subscribe or unsubscribe, you must select a class that you want to operate upon. For subscription, the list of classes is obtained from the remote host. For unsubscription, the list of classes is obtained from the SGI Embedded Support Partner database, which runs as Systems Group Manager. After a class is selected, all the events that are available for that class are presented. You may subscribe or unsubscribe certain classes of events only in full. For these classes, the list of events will not be presented. Instead, an entry that says 'All Events' is presented.
For subscription, the list of events is obtained from the remote host. If the same class was subscribed before, events that were already subscribed before will not appear in the list.
For unsubscription, the list of events is obtained from the Systems Group Manager database. If the same class was unsubscribed before, events that were already unsubscribed will not appear in the list.
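The filtering of the subscription list described above amounts to a set difference. The sketch below illustrates the idea only. The function name and the event names are hypothetical, and nothing here reflects how the product stores or transfers these lists.

```python
from typing import Iterable, List


def events_to_offer(remote_events: Iterable[str],
                    already_subscribed: Iterable[str]) -> List[str]:
    """Events presented for subscription: those reported by the remote host,
    minus those that were already subscribed, in the remote host's order."""
    subscribed = set(already_subscribed)
    return [event for event in remote_events if event not in subscribed]


# Hypothetical event names for one class on a remote host.
remote = ["disk full", "parity error", "NMI"]
print(events_to_offer(remote, already_subscribed=["parity error"]))
```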
Use this command to view the current configuration of actions. The following options are available:
| Option | Description |
| View Action Setup | Displays the configuration information for a specific action |
| View Available Actions List | Displays a table of all actions that are currently available |
View Action Setup
You must choose an action whose information you want to view.
This option allows you to view the following action information:
View Available Actions List
This report displays all actions that are currently available. The table includes the following information:
Use this command to update an existing action.
Select an action that you want to update. You can modify all of the action parameters, except the action description:
| Option | Description |
| Actual action command string | Specifies the command that the action executes |
| A username to execute the action as | Specifies the user account that the SGI Embedded Support Partner uses to execute the command. Default = nobody. Note: The username cannot be set to "root" or to any other account that has root privileges. |
| Action timeout | Specifies the time period for which the action can run without being killed. The value that you specify must be a multiple of 5. Default = 600 seconds |
| The number of times that the event must be registered before an action will be taken | Specifies how many times the event must be registered before the SGI Embedded Support Partner performs this action |
| The number of retry times | Specifies the number of times that the SGI Embedded Support Partner attempts to execute the action before it stops. The value cannot exceed 23; however, it is not recommended that you set it greater than 4. Default = 0 |
For example: action is to run diagnostic
Use this command to add a new action. The following options are available:
| Option | Description |
| Action description | Provides a description of the action. Example: page to John Dow |
| Action command string | Specifies the exact action command to execute. Example: /usr/bin/espnotify -p 1234567 |
| Username to execute the action as (default = nobody) | Specifies the user account that the SGI Embedded Support Partner uses to execute the command. Default = nobody. Note: The username cannot be set to "root" or to any other account that has root privileges. |
| Action timeout | Specifies the time period for which the action can run without being killed. The value that you specify must be a multiple of 5. Default = 600 seconds |
| The number of times an event must be registered before an action will be taken | Specifies how many times the event must be registered before the SGI Embedded Support Partner performs this action. Default = 1 |
| The number of retry times | Specifies the number of times that the SGI Embedded Support Partner attempts to execute the action before it stops. The value cannot exceed 23; however, it is not recommended that you set it greater than 4. Default = 0 |
For example: action is to run diagnostic
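The limits listed in the options table can be collected into a simple pre-submission check. The sketch below is illustrative only. `ActionSettings` and `check_action` are hypothetical names, the defaults are the ones stated in the table, and the requirements that the timeout be positive and the event count be at least 1 are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionSettings:
    """Hypothetical container; field names are illustrative only."""
    command: str
    username: str = "nobody"  # documented default
    timeout_s: int = 600      # documented default, in seconds
    event_threshold: int = 1  # documented default
    retries: int = 0          # documented default


def check_action(settings: ActionSettings) -> List[str]:
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    # Only the literal name is checked; other accounts with root
    # privileges cannot be detected from the name alone.
    if settings.username == "root":
        problems.append("username cannot be root")
    # Assumption: the timeout must be positive as well as a multiple of 5.
    if settings.timeout_s <= 0 or settings.timeout_s % 5 != 0:
        problems.append("timeout must be a positive multiple of 5")
    # Assumption: an event must be registered at least once.
    if settings.event_threshold < 1:
        problems.append("event threshold must be at least 1")
    if not 0 <= settings.retries <= 23:
        problems.append("retries must be between 0 and 23")
    elif settings.retries > 4:
        problems.append("retries above 4 are not recommended")
    return problems


# The command string is the example given in the options table.
print(check_action(ActionSettings(command="/usr/bin/espnotify -p 1234567")))
```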
Examples of notification options:
For more information regarding notification options, refer to the espnotify man page.
The following list includes the accepted user format strings and any action-specific options:
For example: /usr/bin/espnotify -D system_name.sgi.com:0.0 -c %D
Use this command to delete an action. Choose an action that you want to delete.
Note: The action will be deleted from the SGI Embedded Support Partner database. If this action is assigned to some events, the list of all affected events is displayed. You have a choice to cancel or proceed with deletion. Use the Proceed with deletion button to delete the action and remove the selected action from all events to which it is assigned. Use the Stop deletion button to abort the deletion and leave the action in the SGI Embedded Support Partner.
If you need to assign a different action to an event, use SETUP > Events > Update or SETUP > Events > Update Event Actions.
Use the espnotify action to deliver a text/numeric message to a pager by specifying the appropriate command line options. You can obtain more information about espnotify by using the man espnotify command.
To work properly, paging must be configured. The SGI Embedded Support Partner provides the user interface needed to set the required configuration parameters. All the parameters are written to the /etc/qpage.cf file.
Paging requires that a modem be connected to the system to dial the paging service provider to deliver a page. The Modem/Admin section enables modem configuration. The Service section enables configuration of the parameters of the Paging Service Provider(s). Because the service provider normally identifies each individual pager by means of a pager ID (which does not have to be the pager Touch-tone number), a pager ID must be provided in order to deliver the page. The Pager section enables you to configure different pagers that are associated with the Service.
Use this command to display the current values of the paging parameters and the following types of information:
You can configure the following Modem setup parameters:
| Parameter | Description |
| Modem name | Specifies a unique name that the SGI Embedded Support Partner uses to identify a modem. Entering an existing modem name will update the modem name. No spaces are allowed. |
| Modem device | Specifies the device to which the modem is connected (for example, /dev/ttya) |
| Modem initialization command | Specifies the command that the SGI Embedded Support Partner should use to initialize the modem before dialing the Service Provider. These initialization commands are modem specific and are available in your modem manual. For example, many paging services require that error correction be turned off on your modem. For some modems, this can be done by including &A0&K0&M0 in the modem initialization command |
You can configure the following Administration Setup parameters:
| Parameter | Description |
| Administrator's e-mail address | Specifies the e-mail address of the person to contact if Paging fails to deliver a page |
| The time interval for retrying | Specifies the amount of time that espnotify should wait between retries |
Use this command to set up information about a paging service.
You can configure the following parameters:
| Parameter | Description |
| Service name | Specifies the unique name that the SGI Embedded Support Partner uses to identify the paging service provider. Entering an existing service name will update the service name. No spaces are allowed. |
| Device | Specifies the device (modem name) that the SGI Embedded Support Partner should use to dial the service provider. Use SETUP > Paging > Modem/Admin to set up any modems. |
| Maximum number of retries | Specifies the maximum number of times the SGI Embedded Support Partner should attempt to access this service before it quits trying. |
| Maximum length of the message | Specifies the maximum number of characters that can be sent using this service. This depends on your service provider. |
| Phone number of the paging service | Specifies the IXO/TAP telephone number of the Service Provider. Do not confuse your pager's Touch-tone telephone number with the service provider's IXO/TAP telephone number. They are never the same. The telephone number should contain at least 7 numbers and should not include any spaces, "-", or other symbols. |
Use this command to set up a specific pager.
You can configure the following parameters:
| Parameter | Description |
| Pager Name | Specifies a unique name to identify this pager |
| Pager ID | Specifies a unique number (ID) that is used by the paging service provider to identify the pager. The ID may or may not be the touch-tone phone number that you dial to access the pager. Please contact your service provider to get this information. |
| Service Name | Specifies the paging service (service name) to which espnotify should deliver the page for this pager. Use SETUP > Paging > Service to set up any paging services that you want to use. |
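The Modem, Service, and Pager settings refer to one another by name: a pager names its service, and a service names the modem that it dials through. The following is a minimal sketch of that relationship and of the naming and phone number rules stated in the tables above. It does not reflect the format of /etc/qpage.cf. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


def is_valid_name(name: str) -> bool:
    """Modem and service names must be non-empty and contain no spaces."""
    return bool(name) and not any(ch.isspace() for ch in name)


def is_valid_service_number(number: str) -> bool:
    """At least 7 digits, with no spaces, '-', or other symbols."""
    return number.isascii() and number.isdigit() and len(number) >= 7


@dataclass(frozen=True)
class Modem:
    name: str
    device: str        # for example, /dev/ttya
    init_command: str  # modem specific; see your modem manual


@dataclass(frozen=True)
class Service:
    name: str
    device: str              # the modem NAME, not the device path
    max_retries: int
    max_message_length: int
    phone_number: str        # IXO/TAP number of the service provider


@dataclass(frozen=True)
class Pager:
    name: str
    pager_id: str
    service_name: str


def resolve_route(
    pager: Pager,
    services: Dict[str, Service],
    modems: Dict[str, Modem],
) -> Tuple[Service, Modem]:
    """Follow pager -> service -> modem.

    Raises KeyError if a pager names an unknown service or a service
    names an unknown modem.
    """
    service = services[pager.service_name]
    modem = modems[service.device]
    return service, modem
```

Resolving a pager this way makes the configuration order explicit: the modem has to exist before a service can use it, and the service has to exist before a pager can be associated with it.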
Availability Monitoring is a set of tools that collectively monitor and report the availability of system(s) and diagnose system crashes. Availability monitoring tools gather information from diagnostic programs such as ICRASH, FRU Analyzer, and SYSLOG and identify the cause of system shutdowns. The system configuration information comes from configmon, hinv, and versions. Availability monitoring tools can report data to various locations based on the Availability MailList setting.
SETUP > Availability Monitoring > View Current Setup
Use this command to view the current values of the availability monitor parameters. It displays the following information:
Use this command to set up the availability monitor component of the SGI Embedded Support Partner.
You can configure the following parameters:
| Parameter | Possible Values | Description |
| Automatic e-mail distribution | Enable or Disable | Specifies whether availability monitor should automatically distribute reports by e-mail |
| Display of shutdown reason | Enable or Disable | Specifies whether availability monitor should display the reason for a shutdown |
| Include HINV information into e-mail | Yes or No | Specifies whether availability monitor should include HINV information in the diagnostic e-mail messages that it generates |
| Start uptime daemon | Yes or No | Specifies whether availability monitor should start the uptime daemon |
| Number of days between status updates (default = 60) | 0 - 300 | This value specifies the number of days after which a status report should be sent. (Availability monitor with the help of eventmond sends a status report periodically if the system is up for an extended period of time.) |
| Interval in seconds between uptime check (default = 300 seconds) | User specified | Specifies the number of seconds that event monitor should wait before it performs an uptime check on the system |
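The two numeric parameters in the table above have documented defaults, and the status update interval has a documented range. The following is a minimal sketch of how those limits could be checked before the values are saved. The function and constant names are hypothetical. The requirement that the uptime check interval be positive is an assumption, because the table lists that value only as user specified.

```python
DEFAULT_STATUS_INTERVAL_DAYS = 60   # documented default
DEFAULT_UPTIME_CHECK_SECONDS = 300  # documented default


def validate_status_interval_days(days: int = DEFAULT_STATUS_INTERVAL_DAYS) -> int:
    """Accept only the documented range of 0 to 300 days."""
    if not 0 <= days <= 300:
        raise ValueError("days between status updates must be 0 - 300")
    return days


def validate_uptime_check_seconds(seconds: int = DEFAULT_UPTIME_CHECK_SECONDS) -> int:
    """Assumption: the interval must be a positive number of seconds."""
    if seconds <= 0:
        raise ValueError("uptime check interval must be positive")
    return seconds
```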
Use this command to set up the e-mail lists for availability information reports.
You can set up e-mail lists for the following reports:
The availability report contains computed system availability metrics.
The diagnostic report includes all of the availability report data and diagnostic data for troubleshooting.
System Monitoring is available only for a Group of Systems. It is a facility that is packaged with SGI Performance Co-Pilot software tools (pcp_eoe). It enables monitoring of selected services on a remote machine from Embedded Support Partner. In order to monitor a service, a hostname and a command must be provided. This command, when it is executed on the machine that is running Embedded Support Partner, obtains information about the selected service on the remote machine.
The System Monitoring facility of Performance Co-Pilot can be configured via the SGI Embedded Support Partner User Interface. This option is available only if SGI Embedded Support Partner is running as a Group Manager.
The Embedded Support Partner user interface provides two different screens for configuring System Monitoring. The Service section of System Monitoring allows you to add a new service, update an existing service, or delete an existing service. The Service section provides more details on how these operations can be performed. After a service has been added, you can add this service to a host by using the Hosts screen of System Monitoring. This action enables monitoring of that particular service for the host. The Hosts section provides more information on how services can be associated with hosts.
SETUP > System Monitoring > View Current Setup
Use this command to display the current values of System Monitoring parameters and the following types of information:
Use this command to set up services that need to be monitored by System Monitor.
You can add a Service by using the top section of the screen. You can configure the following Service setup parameters:
| Parameter | Description |
| New Service Name | Specifies a unique name that System Monitor uses to identify the service |
| Command to Execute | Specifies the command to be executed for monitoring this service on the remote machine. Please note that the command must contain the HOST keyword, which is replaced by the actual hostname during execution. For example, if you want to find out whether a machine is responding to ICMP requests, you can enter the following command: |
You can update or delete a Service by using the bottom section of the screen.
| Parameter | Description |
| Service name | You can choose an existing service to update or delete from the list of services provided. |
| Command to Execute | Specifies the command to be executed for monitoring this service on the remote machine (see the description of adding a Service above). This option is applicable only when you update an existing service. No command is required when a service is being deleted. |
The default setup comes with the following services:
| Service Name | Service Command | Service Description |
| icmp | /usr/etc/ping -c 3 -f -i 4 HOST | ICMP Echo Request |
| dns | nslookup - HOST | DNS Server |
| x-server | DISPLAY=HOST:0 /usr/bin/X11/xhost | X Server |
| rpcbind | /usr/etc/rpcinfo -p HOST | RPC Services |
| smtp | ( echo "expn root" ; echo quit ) | telnet HOST 25 | cat | Mail Server |
| nntp | ( echo "listgroup comp.sys.sgi"; echo quit ) | telnet HOST 119 | cat | News Server |
| autofsd | /usr/pcp/bin/autofsd-probe -h HOST | Autofs functionality |
| pmcd | /usr/pcp/bin/pmcd_wait -h HOST | Performance metrics collector daemon |
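Each service command is a template in which the HOST keyword is replaced by the name of the monitored host. The following is a minimal sketch of that substitution, not the System Monitor implementation. The function name is hypothetical, and matching HOST only as a whole word is an assumption.

```python
import re

# Assumption: HOST is matched as a whole word, so that a longer
# identifier that merely contains "HOST" is left untouched.
_HOST_KEYWORD = re.compile(r"\bHOST\b")


def build_probe_command(template: str, hostname: str) -> str:
    """Return the service command with HOST replaced by hostname.

    Raises ValueError if the template does not contain the HOST keyword,
    because such a command could not target the remote machine.
    """
    if not _HOST_KEYWORD.search(template):
        raise ValueError("service command must contain the HOST keyword")
    # A function is used as the replacement so that the hostname is
    # inserted literally, without backslash escape processing.
    return _HOST_KEYWORD.sub(lambda _match: hostname, template)
```

For example, passing the default icmp command and a host name yields the same command line with the host name in place of HOST. The resulting string runs on the machine that is running Embedded Support Partner, not on the remote machine.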
Use this command to set up hosts that need service monitoring by System Monitor.
You can add/update/delete Services for any host by choosing the appropriate options:
| Parameter | Description |
| Host | You can choose an existing host from the list of hosts provided. Please note that in order for a host to appear in this box, it must be subscribed first (see Subscribe in the Events section). |
| Service(s) | You can choose any number of existing Services provided. |
All performance rules can be enabled or disabled via the user interface. Use this command to display the status of the performance rules.
The report table displays the following information:
There is a set of rules available to set up for performance monitoring.
The table below provides a short description for each rule:
| PMIE Rule Name | PMIE Rule Description | PMIE Rule's Action |
| cpu.context_switch | High aggregate context switch rate | Average number of context switches per CPU per second exceeded threshold over the past sample interval. |
| cpu.excess_fpe | Possible high floating point exception rate | This predicate attempts to detect processes generating very large numbers of floating point exceptions (FPEs). Characteristic of this situation is heavy system time coupled with low system call rates (exceptions are delivered through the kernel to the process, taking some system time, but no system call is serviced on the application's behalf). |
| cpu.load_average | High 1-minute load average | The current 1-minute load average is higher than the larger of min_load and ( per_cpu_load times the number of CPUs ). The load average measures the number of processes that are running, runnable or soon to be runnable (i.e. in short term sleep). |
| cpu.low_util | Low average processor utilization | The average processor utilization over all CPUs was below threshold percent during the last sample interval. This rule is effectively the opposite of cpu.util and is disabled by default - it is only useful in specialized environments where, for example, processing is batch oriented and low processor utilization is indicative of poor use of system resources. In such a situation the cpu.low_util rule should be enabled, and cpu.util disabled. |
| cpu.syscall | High aggregate system call rate | Average number of system calls per CPU per second exceeded threshold over the past sample interval. |
| cpu.system | Busy executing in system mode | Over the last sample interval, the average utilization per CPU was busy percent or more, and the ratio of system time to busy time exceeded threshold percent. |
| cpu.util | High average processor utilization | The average processor utilization over all CPUs exceeded threshold percent during the last sample interval. |
| craylink.node_cb_errs | CrayLink checkbit errors on Origin node | For some Origin 2000 node, at least one checkbit error was observed on the node (CrayLink) interface and/or the I/O interface in the last sample interval. Use the command |
| craylink.router_cb_errs | CrayLink checkbit errors on Origin router | For some CrayLink router port, at least one checkbit error was observed in the last sample interval. Use the command |
| filesys.buffer_cache | Low buffer cache read hit ratio | Some filesystem read activity (at least min_lread Kbytes per second of logical reads), and the read hit ratio in the buffer cache is below threshold percent. Note: It is possible for the read hit ratio to be negative (more physical reads than logical reads) - this can be a result of: |
| filesys.dnlc_miss | High directory name cache miss rate | With at least min_lookup directory name cache (DNLC) lookups per second being performed, threshold percent of lookups result in cache misses. |
| filesys.filling | File system is filling up | Filesystem is at least threshold percent full and the used space is growing at a rate that would see the file system full within lead_time. |
| memory.exhausted | Severe demand for real memory | The system is swapping modified pages out of main memory to the swap partitions, and has been doing this at the rate of at least threshold pages swapped out per second for at least pct of the last 10 samples, i.e., sustained page out activity. |
| memory.swap_low | Low free swap space | There is only threshold percent swap space remaining - the system may soon run out of virtual memory. Reduce the number and size of the running programs or add more swap(1) space before it completely runs out. |
| network.buffers | Serious demand for network buffers | During the last sample interval the rate at which processes tried to acquire network buffers (mbufs) and either failed or were stalled waiting for a buffer to be freed is greater than threshold times per second. |
| network.tcp_drop_connects | High ratio of TCP connections dropped | There is some TCP connection activity (at least min_close connections closed per minute) and the ratio of TCP dropped connections to all closed connections exceeds threshold percent during the last sample interval. High drop rates indicate either network congestion (check the packet retransmission rate) or an application like a Web browser that is prone to terminating TCP connections prematurely, perhaps due to sluggish response or user impatience. |
| network.tcp_retransmit | High number of TCP packet retransmissions | There is some network output activity (at least 100 TCP packets per second) and the average ratio of retransmitted TCP packets to output TCP packets exceeds threshold percent during the last sample interval. High retransmission rates are suggestive of network congestion, or long latency between the end-points of the TCP connections. |
| per_cpu.context_switch | High per CPU context switch rate | The number of context switches per second for at least one CPU exceeded threshold over the past sample interval. This rule only applies to multi-processor systems; for single-processor systems refer to the cpu.context_switch rule. For Origin 200 and Origin 2000 systems, use the command |
| per_cpu.many_util | High number of saturated processors | The processor utilization for at least pct percent of the CPUs exceeded threshold percent during the last sample interval. Only applies to multi-processor systems having more than min_cpu_count processors - for single-processor systems refer to the cpu.util rule, for multi-processor systems with less than min_cpu_count processors refer to the per_cpu.some_util rule. |
| per_cpu.some_util | High per CPU processor utilization | The processor utilization for at least one CPU exceeded threshold percent during the last sample interval. Only applies to multi-processor systems with less than max_cpu_count processors - for single-processor systems refer to the cpu.util rule, and for multi-processor systems with more than max_cpu_count processors refer to the per_cpu.many_util rule. For Origin 200 and Origin 2000 systems, use the command |
| per_cpu.syscall | High per CPU system call rate | The number of system calls per second for at least one CPU exceeded threshold over the past sample interval. This rule only applies to multi-processor systems; for single-processor systems refer to the cpu.syscall rule. For Origin 200 and Origin 2000 systems, use the command |
| per_cpu.system | Some CPU busy executing in system mode | Over the last sample interval, at least one CPU was active for busy percent or more, and the ratio of system time to busy time exceeded threshold percent. Only applies to multi-processor systems; for single-processor systems refer to the cpu.system rule. For Origin 200 and Origin 2000 systems, use the command |
| per_disk.util | High per spindle disk utilization | For at least one spindle, disk utilization exceeded threshold percent during the last sample interval. |
| per_netif.collisions | High collision rate in packet sends | More than threshold percent of the packets being sent across an interface are causing a collision, and packets are being sent across the interface at packet_rate packets per second. Ethernet interfaces expect a certain number of packet collisions, but a high ratio of collisions to packet sends is indicative of a saturated network. |
| per_netif.errors | High network interface error rate | For at least one network interface, the error rate exceeded threshold errors per second during the last sample interval. |
| per_netif.packets | High network interface packet transfers | For at least one network interface, the average rate of packet transfers (in and/or out) exceeded the threshold during the last sample interval. This rule is disabled by default because the per_netif.util rule is more generally useful as it takes into consideration each network interface's reported bandwidth. However, there are some situations in which this value is zero, in which case an absolute threshold-based rule like this one will make more sense (for this reason it should typically be applied to some network interfaces, but not others - use the "interfaces" variable to filter this). |
| per_netif.util | High network interface utilization | For at least one network interface, the average transfer rate (in and/or out) exceeded threshold percent of the peak bandwidth of the interface during the last sample interval. |
| rpc.bad_network | RPC network transmission failure | More than threshold percent of sent client remote procedure call (RPC) packets are timing out before the server responds and the number of timeouts is significantly more than the number of duplicate packets being received (indicating lost packets). The networked file system (NFS) utilizes the RPC protocol for its client-server communication needs. This high failure rate when sending RPC packets may be due to faulty network hardware or inappropriately sized NFS packets (packets possibly too large). |
| rpc.slow_response | RPC server response is slow | More than threshold percent of sent client remote procedure call (RPC) packets are timing out before the server responds and the number of timeouts is roughly equivalent to the number of duplicate packets being received. The network file system (NFS) utilizes the RPC protocol for its client-server communication needs. This high timeout rate when sending RPC packets may be because the NFS server is processing duplicate requests from the clients which were sent after the original requests timed out. |
| espping.response | System Group Manager slow service response | A service being monitored by the SGI Embedded Support Partner Group Manager has taken more than threshold milliseconds to complete, during the last sample interval. The hosts parameter specifies hosts running the espping PMDA, not hosts being monitored by this PMDA. The latter are encoded in the "instances" for each espping PMDA metric - run |
| espping.status | System Group Manager service probe failure | A service being monitored by the SGI Embedded Support Partner Group Manager has either failed, or not responded within a timeout period (as defined by espping.control.timeout) during the last sample interval. The hosts parameter specifies hosts running the espping PMDA, not hosts being monitored by this PMDA. The latter are encoded in the "instances" for each espping PMDA metric - run |
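Several of the rules in the table above combine a threshold with a second condition. The following is a minimal sketch of three of them (cpu.load_average, filesys.filling, and memory.exhausted), written as plain predicates to make the logic explicit. These are not the PMIE rule definitions. The function names, argument names, and units are hypothetical, and no default threshold values are implied.

```python
from typing import Sequence


def load_average_exceeded(
    load_1min: float, min_load: float, per_cpu_load: float, ncpu: int
) -> bool:
    """cpu.load_average: the load is higher than the larger of min_load
    and (per_cpu_load times the number of CPUs)."""
    return load_1min > max(min_load, per_cpu_load * ncpu)


def filesystem_filling(
    percent_full: float,
    growth_pct_per_s: float,
    threshold: float,
    lead_time_s: float,
) -> bool:
    """filesys.filling: at least threshold percent full, and growing fast
    enough to be full within lead_time.

    Assumption: growth is expressed in percentage points per second.
    """
    if percent_full < threshold or growth_pct_per_s <= 0:
        return False
    seconds_to_full = (100.0 - percent_full) / growth_pct_per_s
    return seconds_to_full <= lead_time_s


def memory_exhausted(
    swapout_rates: Sequence[float], threshold: float, pct: float
) -> bool:
    """memory.exhausted: the page-out rate reached threshold in at least
    pct percent of the last 10 samples.

    Assumption: with fewer than 10 samples the missing ones count as
    below threshold.
    """
    recent = swapout_rates[-10:]
    hits = sum(1 for rate in recent if rate >= threshold)
    return 100.0 * hits / 10 >= pct
```

The second condition in each predicate is what keeps the rule from firing on a brief spike: a large file system that is nearly full but not growing, or a single burst of page-out activity, does not satisfy it.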