---
title: Trace - Observability Command Center
description: When your app goes boom, we tell you why!
navigation:
  icon: i-lucide-search
---

"When your app goes boom, we tell you why!" - SigNoz

What's This All About?

SigNoz is your all-in-one observability platform! Think of it as having X-ray vision for your applications - see traces, metrics, and logs all in one place. It's like Datadog or New Relic, but open-source and running on YOUR infrastructure. When something breaks at 3 AM, SigNoz tells you exactly what, where, and why! :icon{name="lucide:siren"}

The Observability Avengers

:icon{name="lucide:target" :size="32"} SigNoz

Container: trace_app
Image: signoz/signoz:v0.96.1
Ports: 8080 (UI, internal), 7070 (published externally)
Home: http://localhost:7070

Your main dashboard and query engine:

  • :icon{name="lucide:bar-chart"} APM: Application Performance Monitoring
  • :icon{name="lucide:search"} Distributed Tracing: Follow requests across services
  • :icon{name="lucide:trending-up"} Metrics: CPU, memory, custom metrics
  • :icon{name="lucide:file-text"} Logs: Centralized log management
  • :icon{name="lucide:target"} Alerting: Get notified when things break
  • :icon{name="lucide:link"} Service Maps: Visualize your architecture
  • :icon{name="lucide:clock"} Performance: Find bottlenecks
  • :icon{name="lucide:bug"} Error Tracking: Catch and debug errors

:icon{name="lucide:database"} ClickHouse

Container: trace_clickhouse
Image: clickhouse/clickhouse-server:25.5.6

The speed demon database:

  • :icon{name="lucide:zap"} Columnar Storage: Insanely fast queries
  • :icon{name="lucide:bar-chart"} Analytics: Perfect for time-series data
  • :icon{name="lucide:hard-drive"} Compression: Stores LOTS of data efficiently
  • :icon{name="lucide:rocket"} Performance: Millions of rows/second
  • :icon{name="lucide:trending-up"} Scalable: Grows with your needs

:icon{name="simple-icons:postgresql"} ZooKeeper

Container: trace_zookeeper
Image: signoz/zookeeper:3.7.1

The coordinator:

  • :icon{name="lucide:drama"} Orchestration: Manages distributed systems
  • :icon{name="lucide:refresh-cw"} Coordination: Keeps ClickHouse in sync
  • :icon{name="lucide:clipboard"} Configuration: Centralized config management

:icon{name="lucide:satellite"} OpenTelemetry Collector

Container: trace_otel_collector
Image: signoz/signoz-otel-collector:v0.129.6

The data pipeline:

  • :icon{name="lucide:download"} Receives: Traces, metrics, logs from apps
  • :icon{name="lucide:refresh-cw"} Processes: Transforms and enriches data
  • :icon{name="lucide:upload"} Exports: Sends to ClickHouse
  • :icon{name="lucide:target"} Sampling: Smart data collection
  • :icon{name="lucide:plug"} Flexible: Supports many data formats

:icon{name="lucide:wrench"} Schema Migrators

Containers: trace_migrator_sync & trace_migrator_async

The database janitors:

  • :icon{name="lucide:folder-input"} Migrations: Set up database schema
  • :icon{name="lucide:refresh-cw"} Updates: Apply schema changes
  • :icon{name="lucide:square-dashed-mouse-pointer"} Initialization: Prepare ClickHouse

Architecture Overview

Your Application
    ↓ (sends telemetry)
OpenTelemetry Collector
    ↓ (stores data)
ClickHouse Database ← ZooKeeper (coordinates)
    ↓ (queries data)
SigNoz UI ← You (investigate issues)
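
If your app runs as another Compose service on the same network, point its exporter at the collector by service name instead of localhost; a hypothetical app service sketch (service name and wiring assumed, not part of this stack):

services:
  my-app:
    environment:
      # Collector service name is an assumption — match your compose file
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317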

The Three Pillars of Observability

1. :icon{name="lucide:bar-chart"} Metrics (The Numbers)

What's happening right now?

  • Request rate (requests/second)
  • Error rate (errors/second)
  • Duration (latency, response time)
  • Custom business metrics

Example: "API calls are up 200% but error rate is only 1%"

2. :icon{name="lucide:search"} Traces (The Journey)

How did a request flow through your system?

  • Distributed tracing across services
  • See exact path of each request
  • Identify slow operations
  • Find where errors occurred

Example: "User login → Auth service (50ms) → Database (200ms) → Session storage (10ms)"

3. :icon{name="lucide:file-text"} Logs (The Details)

What exactly happened?

  • Application logs
  • System logs
  • Error messages
  • Debug information

Example: "ERROR: Database connection timeout at 2024-01-15 03:42:17"

Configuration Breakdown

Ports

| Service | Internal | External | Purpose |
| --- | --- | --- | --- |
| SigNoz UI | 8080 | 7070 | Web interface |
| ClickHouse | 9000 | - | Database queries |
| ClickHouse HTTP | 8123 | - | HTTP interface |
| OTel Collector | 4317 | - | gRPC (OTLP) |
| OTel Collector | 4318 | - | HTTP (OTLP) |

Environment Variables

Telemetry:

TELEMETRY_ENABLED=true       # Send usage stats to SigNoz team
DOT_METRICS_ENABLED=true     # Support metric names containing dots

Database:

SIGNOZ_TELEMETRYSTORE_CLICKHOUSE_DSN=tcp://clickhouse:9000

Storage:

STORAGE=clickhouse           # Backend storage engine
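
In Compose terms these are ordinary environment entries on the SigNoz service; a rough sketch (variable values as above, service name and layout illustrative):

services:
  signoz:
    environment:
      TELEMETRY_ENABLED: "true"
      DOT_METRICS_ENABLED: "true"
      SIGNOZ_TELEMETRYSTORE_CLICKHOUSE_DSN: tcp://clickhouse:9000
      STORAGE: clickhouse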

First Time Setup

1. Ensure Dependencies Ready

# Init ClickHouse (runs automatically with the stack; shown here for manual runs)
docker compose up init-clickhouse

# Check if healthy
docker ps | grep trace

2. Start the Stack

docker compose up -d

This starts:

  • :icon{name="lucide:check"} ClickHouse (database)
  • :icon{name="lucide:check"} ZooKeeper (coordination)
  • :icon{name="lucide:check"} Schema migrations (database setup)
  • :icon{name="lucide:check"} SigNoz (UI and query engine)
  • :icon{name="lucide:check"} OTel Collector (data collection)

3. Access SigNoz

URL: http://localhost:7070

First login creates admin account!

4. Set Up Your First Service

Install OpenTelemetry SDK in your app:

Node.js:

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

Python:

pip install opentelemetry-distro opentelemetry-exporter-otlp

Go:

go get go.opentelemetry.io/otel

5. Instrument Your Application

Node.js Example:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4317', // OTel Collector
  }),
  serviceName: 'my-awesome-app',
});

sdk.start();
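
To avoid losing buffered spans on exit, shut the SDK down gracefully; sdk.shutdown() returns a promise:

// Flush pending telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => process.exit(0))
    .catch((err) => { console.error(err); process.exit(1); });
});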

Python Example:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)

6. Send Your First Trace

// Node.js — the tracer comes from the API package
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-app');
const span = tracer.startSpan('do-something');
// ... do work ...
span.end();

7. View in SigNoz

  1. Navigate to http://localhost:7070
  2. Go to "Services" tab
  3. See your service appear!
  4. Click on it to see traces

Using SigNoz Like a Pro

Services View

See all your microservices:

  • :icon{name="lucide:bar-chart"} Request rate
  • :icon{name="lucide:timer"} Latency (P50, P90, P99)
  • :icon{name="lucide:x"} Error rate
  • :icon{name="lucide:flame"} Top endpoints

Traces View

Debug individual requests:

  • :icon{name="lucide:search"} Search by service, operation, duration
  • :icon{name="lucide:trending-up"} Visualize request flow
  • :icon{name="lucide:timer"} See exact timings
  • 🐛 Find errors with full context

Metrics View (Dashboards)

Create custom dashboards:

  • :icon{name="lucide:bar-chart"} Application metrics
  • :icon{name="lucide:laptop"} Infrastructure metrics
  • :icon{name="lucide:trending-up"} Business KPIs
  • :icon{name="lucide:target"} Custom queries

Logs View

Query all your logs:

  • :icon{name="lucide:search"} Full-text search
  • :icon{name="lucide:tag"} Filter by attributes
  • :icon{name="lucide:clock"} Time-based queries
  • :icon{name="lucide:link"} Correlation with traces

Alerts

Set up notifications:

  • :icon{name="lucide:mail"} Email alerts
  • :icon{name="lucide:message-circle"} Slack notifications
  • :icon{name="lucide:phone"} PagerDuty integration
  • :icon{name="lucide:bell"} Custom webhooks

Common Queries & Dashboards

Find Slow Requests

Operation: GET /api/users
Duration > 1000ms
Time: Last 1 hour

Error Rate Alert

Metric: error_rate
Condition: > 5%
Duration: 5 minutes
Action: Send Slack notification

Top 10 Slowest Endpoints

Group by: Operation
Sort by: P99 Duration
Limit: 10

Service Dependencies

Auto-generated service map shows:

  • :icon{name="lucide:link"} Which services call which
  • :icon{name="lucide:bar-chart"} Request volumes
  • :icon{name="lucide:timer"} Latencies between services
  • :icon{name="lucide:x"} Error rates

Instrumenting Different Languages

Auto-Instrumentation

Node.js (Express, Fastify, etc.):

node --require @opentelemetry/auto-instrumentations-node/register app.js

Python (Flask, Django, FastAPI):

opentelemetry-instrument python app.py

Java (Spring Boot):

java -javaagent:opentelemetry-javaagent.jar -jar app.jar
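
All three rely on the standard OpenTelemetry environment variables to find the collector; a minimal shell setup (standard variable names, values matching this stack's defaults):

export OTEL_SERVICE_NAME="my-awesome-app"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"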

Manual Instrumentation

Create Custom Spans:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-app');

// Inside an async function; db is your database client
const span = tracer.startSpan('database-query');
span.setAttribute('db.statement', 'SELECT * FROM users');
try {
  const result = await db.query('SELECT * FROM users');
  span.setStatus({ code: SpanStatusCode.OK });
  return result;
} catch (error) {
  span.setStatus({
    code: SpanStatusCode.ERROR,
    message: error.message,
  });
  throw error;
} finally {
  span.end();
}

Custom Metrics
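
The snippets below assume a meter handle; a minimal way to obtain one from the global API (the NodeSDK from step 5 registers the provider when configured with a metric reader):

const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('my-app'); // name identifies the instrumentation scope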

Counter (things that increase):

const counter = meter.createCounter('api_requests');
counter.add(1, { endpoint: '/api/users', method: 'GET' });

Histogram (measure distributions):

const histogram = meter.createHistogram('request_duration');
histogram.record(duration, { endpoint: '/api/users' });

Gauge (current value):

const gauge = meter.createObservableGauge('active_connections');
gauge.addCallback((result) => {
  result.observe(getActiveConnections());
});

Health & Monitoring

Check Services Health

# SigNoz (via the published port)
curl http://localhost:7070/api/v1/health

# ClickHouse
docker exec trace_clickhouse clickhouse-client --query="SELECT 1"

# OTel Collector
curl http://localhost:13133/

View Logs

# SigNoz
docker logs trace_app -f

# ClickHouse
docker logs trace_clickhouse -f

# OTel Collector
docker logs trace_otel_collector -f

Volumes & Data

ClickHouse Data

clickhouse_data → /var/lib/clickhouse/

All traces, metrics, and logs are stored here. BACK UP REGULARLY!
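
A quick way to snapshot the volume with a throwaway container (assumes the volume is named clickhouse_data; stop the stack first for a consistent copy):

# Archive the ClickHouse volume into the current directory
docker run --rm -v clickhouse_data:/data -v "$(pwd)":/backup alpine \
  tar czf /backup/clickhouse_data.tar.gz -C /data .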

SigNoz Data

signoz_data → /var/lib/signoz/

SigNoz configuration and metadata.

ZooKeeper Data

zookeeper_data → /bitnami/zookeeper

Coordination state.

Performance Tuning

Sampling

Don't send ALL traces (too expensive):

# OTel Collector config
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of traces
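
The processor only takes effect once it is referenced in the traces pipeline; a sketch of the relevant service section (exporter name is an assumption — match your collector config):

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [clickhousetraces]  # assumed name; check the stack's config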

Data Retention

Configure how long to keep data:

-- In ClickHouse (illustrative; actual table names vary by SigNoz
-- version, e.g. tables under the signoz_traces database)
ALTER TABLE traces
MODIFY TTL timestamp + INTERVAL 30 DAY;
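
To confirm the change (same table-name caveat), inspect the table definition — the TTL clause shows up there:

SHOW CREATE TABLE traces;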

Resource Limits

# For ClickHouse
environment:
  MAX_MEMORY_USAGE: 10000000000  # 10GB
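
You can also cap the container itself at the Compose level; a sketch (standard Compose field, service name and value illustrative):

services:
  clickhouse:
    mem_limit: 10g   # hard memory cap for the container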

Troubleshooting

Q: No data appearing in SigNoz?

# Check the OTel Collector is receiving data (docker logs writes to stderr)
docker logs trace_otel_collector 2>&1 | grep -i "received"

# Verify the app is sending to the correct endpoint
# Default: http://localhost:4317

# Check ClickHouse is storing data (table name varies by SigNoz version)
docker exec trace_clickhouse clickhouse-client --query="SELECT count() FROM traces"

Q: SigNoz UI won't load?

# Check container status
docker ps | grep trace

# View logs
docker logs trace_app

# Verify ClickHouse connection (the HTTP interface answers on 8123)
docker exec trace_app curl http://clickhouse:8123/ping

Q: High memory usage?

  • Reduce data retention period
  • Lower the sampling percentage (keep fewer traces)
  • Allocate more RAM to ClickHouse

Q: Queries are slow?

  • Check ClickHouse indexes
  • Reduce query time range
  • Optimize your dashboards

Advanced Features

Distributed Tracing

Follow a request across multiple services:

Frontend → API Gateway → Auth Service → Database
  50ms   →    100ms    →     30ms     →   200ms
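
This works because trace context travels with each request, most commonly as the W3C traceparent HTTP header; auto-instrumentation injects and extracts it for you. The canonical format:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# version - trace-id (32 hex) - parent span-id (16 hex) - trace-flags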

Exemplars

Link metrics to traces:

  • Click on a spike in error rate
  • Jump directly to failing trace
  • Debug with full context

Service Level Objectives (SLOs)

Set and track SLOs:

  • 99.9% uptime
  • P95 latency < 200ms
  • Error rate < 0.1%

Real-World Use Cases

1. Performance Debugging 🐛

Problem: API endpoint suddenly slow
Solution:

  1. Check Traces view
  2. Filter by slow requests (>1s)
  3. See database query taking 950ms
  4. Optimize query
  5. Verify improvement in metrics

2. Error Investigation

Problem: Users reporting 500 errors
Solution:

  1. Check error rate dashboard
  2. Jump to failing traces
  3. See stack trace and logs
  4. Identify null pointer exception
  5. Deploy fix and monitor

3. Capacity Planning

Problem: Need to scale before Black Friday
Solution:

  1. Review historical metrics
  2. Identify bottlenecks
  3. Load test and observe traces
  4. Scale accordingly
  5. Monitor during event

4. Microservices Debugging 🕸️

Problem: Which service is causing timeouts?
Solution:

  1. View service map
  2. See latency between services
  3. Identify slow service
  4. Check its traces
  5. Find database connection pool exhausted

Why SigNoz is Awesome

  • :icon{name="lucide:smile"} Open Source: Free forever, no limits
  • :icon{name="lucide:rocket"} Fast: ClickHouse is CRAZY fast
  • :icon{name="lucide:target"} Complete: Metrics + Traces + Logs in one
  • :icon{name="lucide:bar-chart"} Powerful: Query anything, any way
  • :icon{name="lucide:lock"} Private: Your data stays on your server
  • :icon{name="lucide:dollar-sign"} Cost-Effective: No per-seat pricing
  • :icon{name="lucide:hammer"} Flexible: Customize everything
  • :icon{name="lucide:trending-up"} Scalable: Grows with your needs

Resources

  • SigNoz documentation: https://signoz.io/docs/
  • OpenTelemetry documentation: https://opentelemetry.io/docs/


"You can't fix what you can't see. SigNoz makes everything visible." - Observability Wisdom :icon{name="lucide:search"}:icon{name="lucide:sparkles"}