(All) Performs analysis of the fabric.
Syntax
opafabricanalysis [-b|-e] [-s] [-d dir] [-c file] [-t portsfile]
[-p ports] [-T topology_input]
Options
- -- help
- Produces full help text.
- -b
- Specifies the baseline mode, default is compare/check mode.
- -e
- Evaluates health only, default is compare/check mode.
- -s
- Saves history of failures (errors/differences).
- -d dir
- Specifies the top-level directory for saving baseline and history
of failed checks. Default =
/var/usr/lib/opa/analysis
- -c file
- Specifies the error thresholds config file.
- Default =
/etc/opa/opamon.conf
- -t portsfile
- Specifies the file with list of local HFI ports used to access
fabric(s) for analysis. Default =
/etc/opa/ports
- -p ports
-
Specifies the list of local HFI ports used to access fabrics for
analysis.
Default is first active port. The first HFI in the
system is 1. The first port on an HFI is 1. Uses the format hfi:port, for
example:
- 0:0
-
First active port in system.
- 0:y
-
Port
y within system.
- x:0
-
First active port on HFI
x.
- x:y
-
HFI
x, port
y.
- -T
topology_input
- Specifies the name of topology input file to use. Any
%P markers in this filename are replaced with the
HFI:port being operated on (such as
0:0 or
1:2). Default =
/etc/opa/topology.%P.xml. If
-T NONE is specified, no topology input file is
used. See
Details
and
opareport
for more information.
Example
opafabricanalysis
opafabricanalysis -p '1:1 1:2 2:1 2:2'
The fabric analysis tool checks the following:
- Fabric links (both internal to switch chassis and
external cables)
- Fabric components (nodes, links, SMs, systems, and
their SMA configuration)
- Fabric PMA error counters and link speed mismatches
Note: The comparison
includes components on the fabric. Therefore, operations such as shutting down
a server cause the server to no longer appear on the fabric and are flagged as
a fabric change or failure by
opafabricanalysis.
Environment Variables
The following environment variables are also used by this command:
- PORTS
- List of ports, used in absence of
-t and
-p.
- PORTS_FILE
- File containing list of ports, used in absence of
-t and
-p.
- FF_TOPOLOGY_FILE
- File containing
topology_input (may have
%P marker in filename), used in absence of
-T.
- FF_ANALYSIS_DIR
- Top-level directory for baselines and failed health checks.
Details
For simple fabrics, the
Intel® Omni-Path Fabric Suite FastFabric Toolset host is connected to a single fabric. By
default, the first active port on the
FastFabric Toolset host is used to analyze
the fabric. However, in more complex fabrics, the
FastFabric Toolset host may be connected
to more than one fabric or subnet. In this case, you can specify the ports or
HFIs to use with one of the following methods:
- On the command line using
the
-p option.
- In a file specified using
the
-t option.
- Through the environment
variables
PORTS or
PORTS_FILE.
- Using the
PORTS_FILE configuration option in
opafastfabric.conf.
If the specified port does not exist or is empty, the first active
port on the local system is used. In more complex configurations, you must
specify the exact ports to use for all fabrics to be analyzed.
You can specify the
topology_input file to be used with one of the
following methods:
- On the command line using
the
-T option.
- In a file specified
through the environment variable
FF_TOPOLOGY_FILE.
- Using the
ff_topology_file configuration option in
opafastfabric.conf.
If the specified file does not exist, no
topology_input file is used. Alternately the
filename can be specified as
NONE to prevent use of an input file.
For more information, refer to
opareport.
By default, the error analysis includes PMA counters and slow links
(that is, links running below enabled speeds). You can change this using the
FF_FABRIC_HEALTH configuration parameter in
opafastfabric.conf. This parameter specifies the
opareport options and reports to be used for the
health analysis. It also can specify the PMA counter clearing behavior
(-I
seconds,
-C, or none at all).
When a
topology_input file is used, it can also be useful
to extend
FF_FABRIC_HEALTH to include fabric topology
verification options such as
-o verifylinks.
The thresholds for PMA counter analysis default to
/etc/opa/opamon.conf. However, you can specify an
alternate configuration file for thresholds using the
-c option. The
opamon.si.conf file can also be used to check for
any non-zero values for signal integrity (SI) counters.
All files generated by
opafabricanalysis start with
fabric in their file name. This is followed by the
port selection option identifying the port used for the analysis. Default is
0:0.
The
opafabricanalysis tool generates files such as the
following within
FF_ANALYSIS_DIR:
Baseline
During a baseline run, the following files are also created in
FF_ANALYSIS_DIR/latest.
- baseline/fabric.0:0.snapshot.xml
opareport snapshot of complete fabric components
and SMA configuration.
-
baseline/fabric.0:0.comps
opareport summary of fabric components and basic
SMA configuration.
-
baseline/fabric.0.0.links
opareport summary of internal and external links.
Full Analysis
- latest/fabric.0:0.snapshot.xml
opareport snapshot of complete fabric components
and SMA configuration.
-
latest/fabric.0:0.snapshot.stderr
stderr of
opareport during snapshot.
-
latest/fabric.0:0.errors
stdout of
opareport for errors encountered during fabric
error analysis.
-
latest/fabric.0.0.errors.stderr
stderr of
opareport during fabric error analysis.
-
latest/fabric.0:0.comps
stdout of
opareport for fabric components and SMA
configuration.
-
latest/fabric.0:0.comps.stderr
stderr of
opareport for fabric components.
-
latest/fabric.0:0.comps.diff
diff of baseline and latest fabric components.
-
latest/fabric.0:0.links
stdout of
opareport summary of internal and external links.
-
latest/fabric.0:0.links.stderr
stderr of
opareport summary of internal and external links.
-
latest/fabric.0:0.links.diff
diff of baseline and latest fabric internal and
external links.
-
latest/fabric.0:0.links.changes.stderr
stderr of
opareport comparison of links.
-
latest/fabric.0:0.links.changes
opareport comparison of links against baseline.
This is typically easier to read than the
links.diff file and contains the same information.
-
latest/fabric.0:0.comps.changes.stderr
stderr of
opareport comparison of components.
-
latest/fabric.0:0.comps.changes
opareport comparison of components against
baseline. This is typically easier to read than the
comps.diff file and contains the same
information.
The
.diff and
.changes files are only created if differences are
detected.
If the
-s option is used and failures are detected, files
related to the checks that failed are also copied to the time-stamped directory
name under
FF_ANALYSIS_DIR.
Fabric Items Checked Against the Baseline
Based on
opareport -o links:
- Unconnected/down/missing cables
- Added/moved cables
- Changes in link width and speed
- Changes to Node GUIDs in fabric (replacement of HFI
or Switch hardware)
- Adding/Removing Nodes [FI, Virtual FIs, Virtual
Switches, Physical Switches, Physical Switch internal switching cards
(leaf/spine)]
- Changes to server or switch names
Based on
opareport -o comps:
- Overlap with items from links report
- Changes in port MTU, LMC, number of VLs
- Changes in port speed/width enabled or supported
- Changes in HFI or switch device
IDs/revisions/VendorID (for example, ASIC HW changes)
- Changes in port Capability mask (which
features/agents run on port/server)
- Changes to ErrorLimits and PKey enforcement per
port
- Changes to IOUs/IOCs/IOC Services provided
Note: Only
applicable if IOUs in fabric (such as Virtual IO cards, native storage, and
others).
Location (port, node) and number of SMs in fabric. Includes:
- Primary and backups
- Configured priority for SM
Fabric Items Also Checked During Health Check
Based on
opareport -s -C -o errors -o slowlinks:
- PMA error counters on all
Intel® Omni-Path Fabric ports (HFI, switch external and switch internal) checked against
configurable thresholds.
- Counters are cleared each time a health check
is run. Each health check reflects a counter delta since last health check.
- Typically identifies potential fabric errors,
such as symbol errors.
- May also identify transient congestion,
depending on the counters that are monitored.
- Link active speed/width as compared to Enabled
speed.
- Identifies links whose active speed/width is
< min (enabled speed/width on each side of link).
- This typically reflects bad cables or bad ports
or poor connections.
- Side effect is the verification of SA health.