The Sawmills Collector exposes OpenTelemetry internal metrics in Prometheus format. Any Prometheus-compatible scraper (Prometheus, VictoriaMetrics, Grafana Agent, Datadog Agent, etc.) can collect them. This guide lists the endpoint, the metrics worth watching, and sample alert rules. A kube-prometheus-stack example is included at the end.

Scrape Endpoint

Scrape every collector pod at:
  • Port: 19465
  • Path: /metrics
  • Recommended interval: 15s
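For a plain Prometheus deployment (without the Operator), a minimal scrape_config might look like the following sketch. The job name and the pod-label relabeling are illustrative assumptions — adjust them to how the collector is deployed in your cluster:

```yaml
scrape_configs:
  - job_name: sawmills-collector   # illustrative; pick your own job name
    scrape_interval: 15s
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only collector pods — label value assumed from the Helm chart
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: sawmills-collector-chart
        action: keep
      # Point the scrape at the Prometheus telemetry port
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:19465
```

Whatever `job_name` you choose here is the value to substitute for `<job>` in the alert rules below.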

Collector Configuration

The Prometheus port is set by the Helm chart — no collector-config edits are needed. To override the default, set the value under managedChartsValues.sawmills-collector in your remote-operator values.yaml (see Updating Collector Values):
managedChartsValues:
  sawmills-collector:
    telemetry:
      prometheus:
        port: 19465  # default

Metrics to Monitor

The collector emits per-signal counters with the suffix _log_records_total, _metric_points_total, or _spans_total. Where this guide shows <sig>, substitute the signal you ingest. All counters carry the _total suffix in OpenTelemetry’s Prometheus exposition.
  1. Liveness
    • up{job="<job>"} — scrape target reachable (the job label depends on your scraper config)
    • otelcol_process_uptime_total — monotonic uptime; resets on restart
  2. Resource Usage
    • otelcol_process_cpu_seconds_total — cumulative CPU seconds; apply rate() for per-pod CPU usage
    • otelcol_process_memory_rss — resident memory per pod
    • otelcol_process_runtime_heap_alloc_bytes — Go heap (use to detect leaks)
  3. Ingestion (Receiver Side)
    • otelcol_receiver_accepted_<sig>_total — successful ingest, broken down by receiver
    • otelcol_receiver_refused_<sig>_total — input-side rejections; common causes: malformed data, incompatible protocol versions, rate limiting
  4. Egress (Exporter Side)
    • otelcol_exporter_sent_<sig>_total — successful sends, broken down by exporter
    • otelcol_exporter_send_failed_<sig>_total — downstream send errors; common causes: network issues, authentication failures, backend overload
    • otelcol_exporter_enqueue_failed_<sig>_total — drops at the queue boundary (data lost before it could be sent)
  5. Backpressure
    • otelcol_exporter_queue_size / otelcol_exporter_queue_capacity — pair these for utilization ratio. Sustained high utilization indicates the collector cannot keep up with the destination.
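As PromQL starting points, the counters above translate into rate expressions like the following. The job label is an illustrative assumption, and the examples use the log-record suffix — swap it for your signal:

```promql
# Log throughput per receiver
sum by (receiver) (rate(otelcol_receiver_accepted_log_records_total{job="sawmills-collector"}[5m]))

# Send failures per exporter
sum by (exporter) (rate(otelcol_exporter_send_failed_log_records_total{job="sawmills-collector"}[5m]))

# Queue utilization per exporter and signal (0.0–1.0)
max by (exporter, data_type) (
  otelcol_exporter_queue_size{job="sawmills-collector"}
  / clamp_min(otelcol_exporter_queue_capacity{job="sawmills-collector"}, 1)
)
```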

Label Conventions

Sawmills names pipeline component instances with the format <type>/<uuid>. Real receiver and exporter label values look like:
  • receiver="otlp/collector-backend/otlp/72cfa7f8-00ae-40b4-a4eb-dc6f59932c29"
  • exporter="datadog/ec03b655-369f-4273-b255-8ed7dd33366f"
  • exporter="awss3/sampling-destination-1876f24b-3bdb-4fe5-93e1-fa36cb89e864"
When writing alert rules, use regex matchers (exporter=~"datadog/.*") rather than literal equality. The otelcol_exporter_queue_size and _queue_capacity series also carry a data_type label (logs, metrics, traces) — if you export multiple signals, group or filter on it.
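For example, to sum send failures across all Datadog exporter instances regardless of UUID (the job label is an illustrative assumption):

```promql
sum(rate(otelcol_exporter_send_failed_log_records_total{job="sawmills-collector", exporter=~"datadog/.*"}[5m]))
```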

A Note on Lazy Emission

OpenTelemetry only exposes a counter after its first non-zero observation. otelcol_exporter_send_failed_<sig>_total and otelcol_exporter_enqueue_failed_<sig>_total will be entirely absent from the scrape until at least one failure has occurred — this is normal, not a misconfiguration. Alerts on these metrics evaluate as “no data” rather than zero, so use absent_over_time or OR vector(0) if you need to distinguish “healthy” from “metric missing because never failed”.
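Both techniques look like this in PromQL (job label illustrative, log-record suffix shown):

```promql
# Coerce "metric never emitted" to zero so the expression always returns a value
sum(rate(otelcol_exporter_enqueue_failed_log_records_total{job="sawmills-collector"}[5m])) or vector(0)

# Returns 1 only if the series has not appeared at all in the last hour
absent_over_time(otelcol_exporter_enqueue_failed_log_records_total{job="sawmills-collector"}[1h])
```

Note that `or vector(0)` only matches when the left-hand side is aggregated down to no labels; with a `by (exporter)` grouping, the zero vector cannot stand in for specific exporters.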

Sample Alert Rules

The following are starting points and should be adjusted based on your collector configuration, data volumes, and tolerance for refusals. Replace <job> with the actual job label your scraper assigns to the collector — open /targets in your Prometheus UI and copy the value shown on the collector rows.
groups:
- name: SawmillsCollector
  rules:
  - alert: SawmillsCollectorDown
    expr: absent(up{job="<job>"}) or up{job="<job>"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Sawmills collector scrape target down or missing"
      description: "Prometheus has been unable to scrape the collector for 2 minutes (or the target has disappeared from service discovery)."

  - alert: SawmillsReceiverRefusingLogs
    expr: sum by (receiver) (rate(otelcol_receiver_refused_log_records_total{job="<job>",receiver=~"otlp/.*"}[5m])) > 100
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Sawmills collector refusing logs at receiver {{ $labels.receiver }}"
      description: "The receiver is refusing logs, possibly due to high load or malformed data."

  - alert: SawmillsExporterSendFailingLogs
    expr: sum by (exporter) (rate(otelcol_exporter_send_failed_log_records_total{job="<job>"}[5m])) > 100
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Sawmills exporter {{ $labels.exporter }} failing to send logs to destination"
      description: "The exporter is failing to send logs, possibly due to backend connectivity issues."

  - alert: SawmillsExporterEnqueueFailingLogs
    expr: sum by (exporter) (rate(otelcol_exporter_enqueue_failed_log_records_total{job="<job>"}[5m])) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Sawmills exporter {{ $labels.exporter }} dropping logs at the export queue"
      description: "Records are being dropped before they can be sent — the exporter queue is full or rejecting writes. This is data loss."

  - alert: SawmillsExporterQueueSaturated
    expr: max by (exporter, data_type) (otelcol_exporter_queue_size{job="<job>"} / clamp_min(otelcol_exporter_queue_capacity{job="<job>"}, 1)) > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Sawmills exporter {{ $labels.exporter }} queue near capacity ({{ $labels.data_type }})"
      description: "Queue utilization above 80% sustained — the destination is slower than ingest. Enqueue failures (data loss) follow if this persists."

Appendix: Kube-Prometheus-Stack Example

If you run kube-prometheus-stack with Prometheus Operator, this ServiceMonitor wires up scraping directly. Adjust the release: label to match your Prometheus’s serviceMonitorSelector.

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sawmills-collector-monitor
  namespace: sawmills
  labels:
    release: prometheus  # Match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: sawmills-collector-chart
  namespaceSelector:
    matchNames:
      - sawmills
  endpoints:
    - port: prometheus
      interval: 15s
Apply:
kubectl apply -f sawmills-servicemonitor.yaml

Verification

Port-forward Prometheus and check /targets for the sawmills-collector-monitor scrape pool:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n <prometheus-namespace>
# http://localhost:9090/targets
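Once the targets show as UP, a quick sanity query in the Prometheus Graph tab confirms the collector series are flowing. These are illustrative spot checks, not alert rules:

```promql
# One series per collector pod; value = seconds since process start
otelcol_process_uptime_total

# Confirm ingest is non-zero (logs shown; swap the suffix for your signal)
sum(rate(otelcol_receiver_accepted_log_records_total[5m]))
```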
For more information about OpenTelemetry collector metrics, visit the official documentation.