---
title: Trace - Observability Command Center
description: When your app goes boom, we tell you why!
navigation:
  icon: i-lucide-search
---

"When your app goes boom, we tell you why!" - SigNoz

What's This All About?

SigNoz is your all-in-one observability platform! Think of it as having X-ray vision for your applications - see traces, metrics, and logs all in one place. It's like Datadog or New Relic, but open-source and running on YOUR infrastructure. When something breaks at 3 AM, SigNoz tells you exactly what, where, and why! :icon{name="lucide:siren"}

The Observability Avengers

:icon{name="lucide:target" :size="32"} SigNoz

Container: trace_app
Image: signoz/signoz:v0.96.1
Ports: 8080 (UI, internal), 7070 (published externally)
Home: http://localhost:7070

Your main dashboard and query engine:

  • :icon{name="lucide:bar-chart"} APM: Application Performance Monitoring
  • :icon{name="lucide:search"} Distributed Tracing: Follow requests across services
  • :icon{name="lucide:trending-up"} Metrics: CPU, memory, custom metrics
  • :icon{name="lucide:file-text"} Logs: Centralized log management
  • :icon{name="lucide:target"} Alerting: Get notified when things break
  • :icon{name="lucide:link"} Service Maps: Visualize your architecture
  • :icon{name="lucide:clock"} Performance: Find bottlenecks
  • :icon{name="lucide:bug"} Error Tracking: Catch and debug errors

:icon{name="lucide:database"} ClickHouse

Container: trace_clickhouse
Image: clickhouse/clickhouse-server:25.5.6

The speed demon database:

  • :icon{name="lucide:zap"} Columnar Storage: Insanely fast queries
  • :icon{name="lucide:bar-chart"} Analytics: Perfect for time-series data
  • :icon{name="lucide:hard-drive"} Compression: Stores LOTS of data efficiently
  • :icon{name="lucide:rocket"} Performance: Millions of rows/second
  • :icon{name="lucide:trending-up"} Scalable: Grows with your needs

:icon{name="simple-icons:postgresql"} ZooKeeper

Container: trace_zookeeper
Image: signoz/zookeeper:3.7.1

The coordinator:

  • :icon{name="lucide:drama"} Orchestration: Manages distributed systems
  • :icon{name="lucide:refresh-cw"} Coordination: Keeps ClickHouse in sync
  • :icon{name="lucide:clipboard"} Configuration: Centralized config management

:icon{name="lucide:satellite"} OpenTelemetry Collector

Container: trace_otel_collector
Image: signoz/signoz-otel-collector:v0.129.6

The data pipeline:

  • :icon{name="lucide:download"} Receives: Traces, metrics, logs from apps
  • :icon{name="lucide:refresh-cw"} Processes: Transforms and enriches data
  • :icon{name="lucide:upload"} Exports: Sends to ClickHouse
  • :icon{name="lucide:target"} Sampling: Smart data collection
  • :icon{name="lucide:plug"} Flexible: Supports many data formats

:icon{name="lucide:wrench"} Schema Migrators

Containers: trace_migrator_sync & trace_migrator_async

The database janitors:

  • :icon{name="lucide:folder-input"} Migrations: Set up database schema
  • :icon{name="lucide:refresh-cw"} Updates: Apply schema changes
  • :icon{name="lucide:square-dashed-mouse-pointer"} Initialization: Prepare ClickHouse

Architecture Overview

Your Application
    ↓ (sends telemetry)
OpenTelemetry Collector
    ↓ (stores data)
ClickHouse Database ← ZooKeeper (coordinates)
    ↓ (queries data)
SigNoz UI ← You (investigate issues)
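
If your app runs as another Compose service on the same network, point its exporter at the collector by service name instead of localhost; a hypothetical app service sketch (service name and wiring assumed, not part of this stack):

services:
  my-app:
    environment:
      # Collector service name is an assumption — match your compose file
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317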

The Three Pillars of Observability

1. :icon{name="lucide:bar-chart"} Metrics (The Numbers)

What's happening right now?

  • Request rate (requests/second)
  • Error rate (errors/second)
  • Duration (latency, response time)
  • Custom business metrics

Example: "API calls are up 200% but error rate is only 1%"

2. :icon{name="lucide:search"} Traces (The Journey)

How did a request flow through your system?

  • Distributed tracing across services
  • See exact path of each request
  • Identify slow operations
  • Find where errors occurred

Example: "User login → Auth service (50ms) → Database (200ms) → Session storage (10ms)"

3. :icon{name="lucide:file-text"} Logs (The Details)

What exactly happened?

  • Application logs
  • System logs
  • Error messages
  • Debug information

Example: "ERROR: Database connection timeout at 2024-01-15 03:42:17"

Configuration Breakdown

Ports

| Service | Internal | External | Purpose |
| --- | --- | --- | --- |
| SigNoz UI | 8080 | 7070 | Web interface |
| ClickHouse | 9000 | - | Database queries |
| ClickHouse HTTP | 8123 | - | HTTP interface |
| OTel Collector | 4317 | - | gRPC (OTLP) |
| OTel Collector | 4318 | - | HTTP (OTLP) |

Environment Variables

Telemetry:

TELEMETRY_ENABLED=true       # Send usage stats to SigNoz team
DOT_METRICS_ENABLED=true     # Support metric names containing dots

Database:

SIGNOZ_TELEMETRYSTORE_CLICKHOUSE_DSN=tcp://clickhouse:9000

Storage:

STORAGE=clickhouse           # Backend storage engine
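
In Compose terms these are ordinary environment entries on the SigNoz service; a rough sketch (variable values as above, service name and layout illustrative):

services:
  signoz:
    environment:
      TELEMETRY_ENABLED: "true"
      DOT_METRICS_ENABLED: "true"
      SIGNOZ_TELEMETRYSTORE_CLICKHOUSE_DSN: tcp://clickhouse:9000
      STORAGE: clickhouse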

First Time Setup

1. Ensure Dependencies Ready

# Init ClickHouse (runs automatically with the stack; shown here for manual runs)
docker compose up init-clickhouse

# Check if healthy
docker ps | grep trace

2. Start the Stack

docker compose up -d

This starts:

  • :icon{name="lucide:check"} ClickHouse (database)
  • :icon{name="lucide:check"} ZooKeeper (coordination)
  • :icon{name="lucide:check"} Schema migrations (database setup)
  • :icon{name="lucide:check"} SigNoz (UI and query engine)
  • :icon{name="lucide:check"} OTel Collector (data collection)

3. Access SigNoz

URL: http://localhost:7070

First login creates admin account!

4. Set Up Your First Service

Install OpenTelemetry SDK in your app:

Node.js:

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

Python:

pip install opentelemetry-distro opentelemetry-exporter-otlp

Go:

go get go.opentelemetry.io/otel

5. Instrument Your Application

Node.js Example:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4317', // OTel Collector
  }),
  serviceName: 'my-awesome-app',
});

sdk.start();
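
To avoid losing buffered spans on exit, shut the SDK down gracefully; sdk.shutdown() returns a promise:

// Flush pending telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => process.exit(0))
    .catch((err) => { console.error(err); process.exit(1); });
});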

Python Example:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)

6. Send Your First Trace

// Node.js — the tracer comes from the API package
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-app');
const span = tracer.startSpan('do-something');
// ... do work ...
span.end();

7. View in SigNoz

  1. Navigate to http://localhost:7070
  2. Go to "Services" tab
  3. See your service appear!
  4. Click on it to see traces

Using SigNoz Like a Pro

Services View

See all your microservices:

  • :icon{name="lucide:bar-chart"} Request rate
  • :icon{name="lucide:timer"} Latency (P50, P90, P99)
  • :icon{name="lucide:x"} Error rate
  • :icon{name="lucide:flame"} Top endpoints

Traces View

Debug individual requests:

  • :icon{name="lucide:search"} Search by service, operation, duration
  • :icon{name="lucide:trending-up"} Visualize request flow
  • :icon{name="lucide:timer"} See exact timings
  • 🐛 Find errors with full context

Metrics View (Dashboards)

Create custom dashboards:

  • :icon{name="lucide:bar-chart"} Application metrics
  • :icon{name="lucide:laptop"} Infrastructure metrics
  • :icon{name="lucide:trending-up"} Business KPIs
  • :icon{name="lucide:target"} Custom queries

Logs View

Query all your logs:

  • :icon{name="lucide:search"} Full-text search
  • :icon{name="lucide:tag"} Filter by attributes
  • :icon{name="lucide:clock"} Time-based queries
  • :icon{name="lucide:link"} Correlation with traces

Alerts

Set up notifications:

  • :icon{name="lucide:mail"} Email alerts
  • :icon{name="lucide:message-circle"} Slack notifications
  • :icon{name="lucide:phone"} PagerDuty integration
  • :icon{name="lucide:bell"} Custom webhooks

Common Queries & Dashboards

Find Slow Requests

Operation: GET /api/users
Duration > 1000ms
Time: Last 1 hour

Error Rate Alert

Metric: error_rate
Condition: > 5%
Duration: 5 minutes
Action: Send Slack notification

Top 10 Slowest Endpoints

Group by: Operation
Sort by: P99 Duration
Limit: 10

Service Dependencies

Auto-generated service map shows:

  • :icon{name="lucide:link"} Which services call which
  • :icon{name="lucide:bar-chart"} Request volumes
  • :icon{name="lucide:timer"} Latencies between services
  • :icon{name="lucide:x"} Error rates

Instrumenting Different Languages

Auto-Instrumentation

Node.js (Express, Fastify, etc.):

node --require @opentelemetry/auto-instrumentations-node/register app.js

Python (Flask, Django, FastAPI):

opentelemetry-instrument python app.py

Java (Spring Boot):

java -javaagent:opentelemetry-javaagent.jar -jar app.jar
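
All three rely on the standard OpenTelemetry environment variables to find the collector; a minimal shell setup (standard variable names, values matching this stack's defaults):

export OTEL_SERVICE_NAME="my-awesome-app"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"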

Manual Instrumentation

Create Custom Spans:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-app');

// Inside an async function; db is your database client
const span = tracer.startSpan('database-query');
span.setAttribute('db.statement', 'SELECT * FROM users');
try {
  const result = await db.query('SELECT * FROM users');
  span.setStatus({ code: SpanStatusCode.OK });
  return result;
} catch (error) {
  span.setStatus({
    code: SpanStatusCode.ERROR,
    message: error.message,
  });
  throw error;
} finally {
  span.end();
}

Custom Metrics
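
The snippets below assume a meter handle; a minimal way to obtain one from the global API (the NodeSDK from step 5 registers the provider when configured with a metric reader):

const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('my-app'); // name identifies the instrumentation scope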

Counter (things that increase):

const counter = meter.createCounter('api_requests');
counter.add(1, { endpoint: '/api/users', method: 'GET' });

Histogram (measure distributions):

const histogram = meter.createHistogram('request_duration');
histogram.record(duration, { endpoint: '/api/users' });

Gauge (current value):

const gauge = meter.createObservableGauge('active_connections');
gauge.addCallback((result) => {
  result.observe(getActiveConnections());
});

Health & Monitoring

Check Services Health

# SigNoz (via the published port)
curl http://localhost:7070/api/v1/health

# ClickHouse
docker exec trace_clickhouse clickhouse-client --query="SELECT 1"

# OTel Collector
curl http://localhost:13133/

View Logs

# SigNoz
docker logs trace_app -f

# ClickHouse
docker logs trace_clickhouse -f

# OTel Collector
docker logs trace_otel_collector -f

Volumes & Data

ClickHouse Data

clickhouse_data → /var/lib/clickhouse/

All traces, metrics, and logs are stored here. BACK UP REGULARLY!
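
A quick way to snapshot the volume with a throwaway container (assumes the volume is named clickhouse_data; stop the stack first for a consistent copy):

# Archive the ClickHouse volume into the current directory
docker run --rm -v clickhouse_data:/data -v "$(pwd)":/backup alpine \
  tar czf /backup/clickhouse_data.tar.gz -C /data .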

SigNoz Data

signoz_data → /var/lib/signoz/

SigNoz configuration and metadata.

ZooKeeper Data

zookeeper_data → /bitnami/zookeeper

Coordination state.

Performance Tuning

Sampling

Don't send ALL traces (too expensive):

# OTel Collector config
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of traces
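
The processor only takes effect once it is referenced in the traces pipeline; a sketch of the relevant service section (exporter name is an assumption — match your collector config):

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [clickhousetraces]  # assumed name; check the stack's config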

Data Retention

Configure how long to keep data:

-- In ClickHouse (illustrative; actual table names vary by SigNoz
-- version, e.g. tables under the signoz_traces database)
ALTER TABLE traces
MODIFY TTL timestamp + INTERVAL 30 DAY;
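
To confirm the change (same table-name caveat), inspect the table definition — the TTL clause shows up there:

SHOW CREATE TABLE traces;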

Resource Limits

# For ClickHouse
environment:
  MAX_MEMORY_USAGE: 10000000000  # 10GB
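
You can also cap the container itself at the Compose level; a sketch (standard Compose field, service name and value illustrative):

services:
  clickhouse:
    mem_limit: 10g   # hard memory cap for the container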

Troubleshooting

Q: No data appearing in SigNoz?

# Check the OTel Collector is receiving data (docker logs writes to stderr)
docker logs trace_otel_collector 2>&1 | grep -i "received"

# Verify the app is sending to the correct endpoint
# Default: http://localhost:4317

# Check ClickHouse is storing data (table name varies by SigNoz version)
docker exec trace_clickhouse clickhouse-client --query="SELECT count() FROM traces"

Q: SigNoz UI won't load?

# Check container status
docker ps | grep trace

# View logs
docker logs trace_app

# Verify ClickHouse connection (the HTTP interface answers on 8123)
docker exec trace_app curl http://clickhouse:8123/ping

Q: High memory usage?

  • Reduce data retention period
  • Lower the sampling percentage (keep fewer traces)
  • Allocate more RAM to ClickHouse

Q: Queries are slow?

  • Check ClickHouse indexes
  • Reduce query time range
  • Optimize your dashboards

Advanced Features

Distributed Tracing

Follow a request across multiple services:

Frontend → API Gateway → Auth Service → Database
  50ms   →    100ms    →     30ms     →   200ms
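
This works because trace context travels with each request, most commonly as the W3C traceparent HTTP header; auto-instrumentation injects and extracts it for you. The canonical format:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# version - trace-id (32 hex) - parent span-id (16 hex) - trace-flags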

Exemplars

Link metrics to traces:

  • Click on a spike in error rate
  • Jump directly to failing trace
  • Debug with full context

Service Level Objectives (SLOs)

Set and track SLOs:

  • 99.9% uptime
  • P95 latency < 200ms
  • Error rate < 0.1%

Real-World Use Cases

1. Performance Debugging 🐛

Problem: API endpoint suddenly slow
Solution:

  1. Check Traces view
  2. Filter by slow requests (>1s)
  3. See database query taking 950ms
  4. Optimize query
  5. Verify improvement in metrics

2. Error Investigation

Problem: Users reporting 500 errors
Solution:

  1. Check error rate dashboard
  2. Jump to failing traces
  3. See stack trace and logs
  4. Identify null pointer exception
  5. Deploy fix and monitor

3. Capacity Planning

Problem: Need to scale before Black Friday
Solution:

  1. Review historical metrics
  2. Identify bottlenecks
  3. Load test and observe traces
  4. Scale accordingly
  5. Monitor during event

4. Microservices Debugging 🕸️

Problem: Which service is causing timeouts?
Solution:

  1. View service map
  2. See latency between services
  3. Identify slow service
  4. Check its traces
  5. Find database connection pool exhausted

Why SigNoz is Awesome

  • :icon{name="lucide:smile"} Open Source: Free forever, no limits
  • :icon{name="lucide:rocket"} Fast: ClickHouse is CRAZY fast
  • :icon{name="lucide:target"} Complete: Metrics + Traces + Logs in one
  • :icon{name="lucide:bar-chart"} Powerful: Query anything, any way
  • :icon{name="lucide:lock"} Private: Your data stays on your server
  • :icon{name="lucide:dollar-sign"} Cost-Effective: No per-seat pricing
  • :icon{name="lucide:hammer"} Flexible: Customize everything
  • :icon{name="lucide:trending-up"} Scalable: Grows with your needs

Resources

  • SigNoz documentation: https://signoz.io/docs/
  • OpenTelemetry documentation: https://opentelemetry.io/docs/


"You can't fix what you can't see. SigNoz makes everything visible." - Observability Wisdom :icon{name="lucide:search"}:icon{name="lucide:sparkles"}