PCP

Performance Co-Pilot (PCP) provides historical system metrics. PCP stores metrics in archives under /var/log/pcp/pmlogger/$(hostname).

All metrics are identified by a PMID (Performance Metric Identifier). Each metric belongs to an instance domain (typedef unsigned long pmInDom;), except for single-value metrics, whose instance domain is always PM_INDOM_NULL.

Example multi-value metric (instances):

$ pminfo -f filesys.free

filesys.free
    inst [0 or "/dev/mapper/system"] value 472018336
    inst [1 or "/dev/nvme0n1p1"] value 371764

Single value metric:

$ pminfo -f mem.freemem

mem.freemem
    value 3015252

Metrics are obtained from an archive by creating a "handle" (a context) with pmNewContext. The collection time can be set to an arbitrary time with pmSetMode. The instances to be fetched can be restricted with pmAddProfile and pmDelProfile.

Performance metric description

The metadata of a metric is held in the pmDesc struct, which describes the format and semantics of its values.

/* Performance Metric Descriptor */
typedef struct {
    pmID    pmid;   /* unique identifier */
    int     type;   /* base data type (see below) */
    pmInDom indom;  /* instance domain */
    int     sem;    /* semantics of value (see below) */
    pmUnits units;  /* dimension and units (see below) */
} pmDesc;

The types

/* pmDesc.type - data type of metric values */
#define PM_TYPE_NOSUPPORT -1   /* not in this version */
#define PM_TYPE_32        0    /* 32-bit signed integer */
#define PM_TYPE_U32       1    /* 32-bit unsigned integer */
#define PM_TYPE_64        2    /* 64-bit signed integer */
#define PM_TYPE_U64       3    /* 64-bit unsigned integer */
#define PM_TYPE_FLOAT     4    /* 32-bit floating point */
#define PM_TYPE_DOUBLE    5    /* 64-bit floating point */
#define PM_TYPE_STRING    6    /* array of char */
#define PM_TYPE_AGGREGATE 7    /* arbitrary binary data */
#define PM_TYPE_AGGREGATE_STATIC 8 /* static pointer to aggregate */
#define PM_TYPE_EVENT     9    /* packed pmEventArray */
#define PM_TYPE_UNKNOWN   255  /* used in pmValueBlock not pmDesc */

Cockpit-pcp does not support PM_TYPE_AGGREGATE or PM_TYPE_EVENT.
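
Each supported type ends up in a different member of the pmAtomValue union once a value is extracted. A minimal sketch of that mapping, using the numeric constants listed above. ATOM_FIELD and atom_field are hypothetical names for illustration and are not taken from cockpit-pcp:

```python
# Numeric values copied from the pmDesc.type defines above.
PM_TYPE_32, PM_TYPE_U32, PM_TYPE_64, PM_TYPE_U64 = 0, 1, 2, 3
PM_TYPE_FLOAT, PM_TYPE_DOUBLE, PM_TYPE_STRING = 4, 5, 6
PM_TYPE_AGGREGATE, PM_TYPE_EVENT = 7, 9

# pmAtomValue union member that holds a value of each type.
# (Hypothetical helper table, only covering the types Cockpit handles.)
ATOM_FIELD = {
    PM_TYPE_32: "l",
    PM_TYPE_U32: "ul",
    PM_TYPE_64: "ll",
    PM_TYPE_U64: "ull",
    PM_TYPE_FLOAT: "f",
    PM_TYPE_DOUBLE: "d",
    PM_TYPE_STRING: "cp",
}


def atom_field(pm_type):
    """Return the pmAtomValue member name, or None for unsupported types."""
    return ATOM_FIELD.get(pm_type)
```

With this table, a U32 metric reads its value from atom.ul, while an aggregate or event metric yields None and can be skipped.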

Semantics describe how Cockpit should represent the data:

/* pmDesc.sem - semantics of metric values */
#define PM_SEM_COUNTER  1  /* cumulative count, monotonic increasing */
#define PM_SEM_INSTANT  3  /* instantaneous value continuous domain */
#define PM_SEM_DISCRETE 4  /* instantaneous value discrete domain */

The C code doesn't do anything with this information except return it to the client in the meta message. However, the derive: "rate" option requires the bridge to calculate the rate of change from the last value and the provided interval.
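
The rate derivation can be sketched roughly as follows. This is an illustration of the idea, not the bridge's actual code: derive_rate is a hypothetical helper, and it assumes the interval is given in milliseconds and the rate is reported per second.

```python
def derive_rate(previous, current, interval_ms):
    """Rate of change per second between two consecutive samples.

    Returns None when there is no previous sample to compare with.
    Assumption: interval_ms is the sample interval in milliseconds.
    """
    if previous is None or current is None:
        return None
    return (current - previous) / (interval_ms / 1000.0)


# Feed consecutive counter samples through the helper,
# remembering the last value between iterations.
last = None
rates = []
for sample in [100, 150, 250]:
    rates.append(derive_rate(last, sample, 1000))
    last = sample

print(rates)  # [None, 50.0, 100.0]
```

The first sample has no predecessor, so no rate can be derived for it; every later sample is compared against the one before it.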

PCP Archive source

The metrics1 channel supports passing source=pcp-archive or source=/path/to/archive; the latter was likely introduced for testing. Archive-specific options from docs/protocol.md:

"metrics" (array): Descriptions of the metrics to use. See below.
"instances" (array of strings, optional): When specified, only the listed instances are included in the reported samples.
"omit-instances" (array of strings, optional): When specified, the listed instances are omitted from the reported samples. Only one of "instances" and "omit-instances" can be specified.
"interval" (number, optional): The sample interval in milliseconds. Defaults to 1000.
"timestamp" (number, optional): The desired time of the first sample. This is only used when accessing archives of samples. This is either the number of milliseconds since the epoch, or (when negative) the number of milliseconds in the past. The first sample will be from a time not earlier than this timestamp, but it might be from a much later time.
"limit" (number, optional): The number of samples to return. This is only used when accessing an archive.
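
How the "timestamp", "instances" and "omit-instances" options could be interpreted is sketched below. resolve_timestamp and select_instances are hypothetical helpers written for these notes, not functions from Cockpit; the now_ms parameter only exists to make the example deterministic.

```python
import time


def resolve_timestamp(timestamp_ms, now_ms=None):
    """Turn the "timestamp" option into milliseconds since the epoch.

    Negative values are interpreted as milliseconds in the past.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if timestamp_ms < 0:
        return now_ms + timestamp_ms
    return timestamp_ms


def select_instances(available, instances=None, omit_instances=None):
    """Apply "instances" or "omit-instances" to the available instance names."""
    if instances is not None and omit_instances is not None:
        raise ValueError('only one of "instances" and "omit-instances" can be specified')
    if instances is not None:
        return [name for name in available if name in instances]
    if omit_instances is not None:
        return [name for name in available if name not in omit_instances]
    return list(available)


# One minute before a fixed "now".
print(resolve_timestamp(-60000, now_ms=1688162400000))  # 1688162340000
print(select_instances(["lo", "eth0"], omit_instances=["lo"]))  # ['eth0']
```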

Reading data from archive

from pcp import pmapi
import cpmapi as c_api

# Open an archive context. If the path is a directory such as
# /var/log/pcp/pmlogger/hostname, it can contain multiple archives.
context = pmapi.pmContext(c_api.PM_CONTEXT_ARCHIVE, '/path/to/archive')

# Get the internal metric ids (PMIDs) for the user-provided metric names
pmids = context.pmLookupName('mock.value')

# Get the descriptions; these are used for scaling values if required
descs = context.pmLookupDescs(pmids)

# Fetch one sample and extract the first instance value of each metric
results = context.pmFetch(pmids)
for i in range(results.contents.numpmid):
    atom = context.pmExtractValue(results.contents.get_valfmt(i),
                                  results.contents.get_vlist(i, 0),
                                  descs[i].contents.type,  # descriptor of metric i
                                  c_api.PM_TYPE_U32)
    print(f"#mock.value={atom.ul}")

Debugging

cpf open metrics1 source="/tmp/pytest-of-jelle/pytest-current/timestamps-archives0/" metrics='[{ "name": "mock.value" }]' timestamp=1688162400000 limit=1 : wait | G_MESSAGES_DEBUG=none  /usr/lib/cockpit/cockpit-pcp | /usr/bin/cat

Unit tests

Test limiting the data (limit option in the metrics1 channel), so generate a 1000-record archive
Test different types of data; currently only U32 is tested. Cockpit requests "kernel.all.cpu.nice" (with derive: "rate"), "mem.physmem", "swap.pagesout"
Test omit-instances { name: "network.interface.total.bytes", derive: "rate", "omit-instances": ["lo"] }
Test multi-value metrics (which have "instances", like network.interface.total.bytes)
Test passing timestamps, i.e. load from a given timestamp
Test passing instances
Test sample interval changes
Add unit tests for corrupted / broken archives

Questions

Why do we need to read archive by archive? The API supports reading all of them for us.
- Is it because of error handling?
- Is it because of limiting?
- Is it because of the start timestamp?
Does the Meta message ever need to change? => yes
We have metrics twice
- metrics_descriptions per archive
- self.metrics == the parsed requested metrics
Meta message with instances issue: do we need results to obtain the vset?

References

Programming PCP

Notes