Production Deployment
Complete guide to deploying RCrewAI applications in production with Docker, Kubernetes, monitoring, and enterprise features
This comprehensive tutorial covers deploying RCrewAI applications to production environments with enterprise-grade reliability, monitoring, scaling, and security. You’ll learn containerization, orchestration, monitoring, and operational best practices.
Table of Contents
- Production Readiness Checklist
- Containerization with Docker
- Kubernetes Deployment
- Configuration Management
- Monitoring and Observability
- Scaling and Load Balancing
- Security and Access Control
- CI/CD Pipeline
- Operational Procedures
- Troubleshooting and Recovery
Production Readiness Checklist
Before deploying to production, ensure your RCrewAI application meets these requirements:
✅ Code Quality
- Comprehensive test coverage (>90%)
- Code review process in place
- Static analysis and linting
- Performance benchmarks established
- Security vulnerability scanning
✅ Configuration
- Environment-based configuration
- Secrets management implemented
- Resource limits defined
- Timeout and retry logic configured
- Logging levels appropriate for production
✅ Monitoring
- Health check endpoints implemented
- Metrics collection configured
- Alerting rules defined
- Log aggregation configured
- Performance monitoring enabled
✅ Infrastructure
- Load balancing configured
- Auto-scaling policies defined
- Backup and disaster recovery plan
- Network security implemented
- Resource quotas established
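Several of these checks can be automated. Below is a minimal boot-time preflight sketch in Ruby; the script name and variable list are assumptions based on the configuration used throughout this guide.
# scripts/preflight.rb -- hypothetical helper, not part of RCrewAI
REQUIRED_ENV = %w[OPENAI_API_KEY LLM_PROVIDER EXECUTION_TIMEOUT MAX_RETRIES].freeze

# Fail fast when a required variable is unset or blank
missing = REQUIRED_ENV.reject { |name| ENV[name] && !ENV[name].strip.empty? }
unless missing.empty?
  warn "Preflight failed, missing environment variables: #{missing.join(', ')}"
  exit 1
end

# Sanity-check numeric settings so bad values fail here, not at runtime
timeout = Integer(ENV['EXECUTION_TIMEOUT'])
if timeout <= 0
  warn 'EXECUTION_TIMEOUT must be a positive number of seconds'
  exit 1
end

puts 'Preflight checks passed'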
Containerization with Docker
Basic Dockerfile
# Use official Ruby runtime as base image
FROM ruby:3.1-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy Gemfile and Gemfile.lock
COPY Gemfile Gemfile.lock ./
# Install Ruby dependencies
RUN bundle config set --local deployment 'true' && \
bundle config set --local without 'development test' && \
bundle install
# Copy application code
COPY . .
# Create non-root user for security (UID 1000 matches runAsUser in the Kubernetes manifests below)
RUN groupadd -g 1000 rcrewai && useradd -u 1000 -g rcrewai rcrewai
RUN chown -R rcrewai:rcrewai /app
USER rcrewai
# Expose port
EXPOSE 8080
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
# Default command
CMD ["ruby", "production_app.rb"]
Multi-stage Production Dockerfile
# Build stage
FROM ruby:3.1-slim AS builder
RUN apt-get update && apt-get install -y \
build-essential \
git \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle config set --local deployment 'true' && \
bundle config set --local without 'development test' && \
bundle install
# Production stage
FROM ruby:3.1-slim AS production
# Install only runtime dependencies
RUN apt-get update && apt-get install -y \
curl \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get autoremove -y
WORKDIR /app
# Copy gems from builder stage
COPY --from=builder /usr/local/bundle /usr/local/bundle
# Copy application code
COPY . .
# Create non-root user (UID 1000 matches runAsUser in the Kubernetes manifests)
RUN groupadd -g 1000 rcrewai && useradd -u 1000 -g rcrewai -d /app rcrewai
RUN chown -R rcrewai:rcrewai /app
# Switch to non-root user
USER rcrewai
# Environment variables
ENV RAILS_ENV=production
ENV RACK_ENV=production
ENV BUNDLE_DEPLOYMENT=true
ENV BUNDLE_WITHOUT="development:test"
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
CMD ruby health_check.rb || exit 1
EXPOSE 8080
CMD ["ruby", "production_app.rb"]
Production Application Structure
# production_app.rb
require 'rcrewai'
require 'sinatra'
require 'json'
require 'logger'
require 'securerandom' # SecureRandom.uuid is used for execution IDs
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'
class ProductionRCrewAI < Sinatra::Base
configure :production do
enable :logging
set :logger, Logger.new($stdout)
# Metrics collection
use Prometheus::Middleware::Collector
use Prometheus::Middleware::Exporter
# Configure RCrewAI for production
RCrewAI.configure do |config|
config.llm_provider = ENV.fetch('LLM_PROVIDER', 'openai').to_sym
config.openai_api_key = ENV.fetch('OPENAI_API_KEY')
config.temperature = ENV.fetch('LLM_TEMPERATURE', '0.1').to_f
config.max_tokens = ENV.fetch('LLM_MAX_TOKENS', '4000').to_i
config.timeout = ENV.fetch('LLM_TIMEOUT', '60').to_i
end
# Initialize crew registry. Note: configure blocks run while the class body
# loads, so CrewRegistry (defined later in this file) should live in its own
# file and be required at the top in a real project.
@@crew_registry = CrewRegistry.new
@@crew_registry.register_default_crews
end
# Health check endpoint
get '/health' do
content_type :json
begin
health_status = perform_health_check
status health_status[:status] == 'healthy' ? 200 : 503
health_status.to_json
rescue => e
status 503
{ status: 'unhealthy', error: e.message }.to_json
end
end
# Readiness check endpoint
get '/ready' do
content_type :json
begin
readiness_status = perform_readiness_check
status readiness_status[:ready] ? 200 : 503
readiness_status.to_json
rescue => e
status 503
{ ready: false, error: e.message }.to_json
end
end
# Metrics endpoint (Prometheus::Middleware::Exporter intercepts GET /metrics
# before this route runs, so the body is intentionally empty)
get '/metrics' do
end
# Main execution endpoint
post '/execute' do
content_type :json
begin
request_data = JSON.parse(request.body.read)
# Validate request
validate_execution_request(request_data)
# Get crew
crew_name = request_data['crew_name']
crew = @@crew_registry.get_crew(crew_name)
# Execute with monitoring
result = execute_with_monitoring(crew, request_data)
status 200
result.to_json
rescue JSON::ParserError
status 400
{ error: 'Invalid JSON in request body' }.to_json
rescue ValidationError => e
status 400
{ error: e.message }.to_json
rescue => e
logger.error "Execution failed: #{e.message}"
logger.error e.backtrace.join("\n")
status 500
{ error: 'Internal server error' }.to_json
end
end
private
def perform_health_check
checks = {
timestamp: Time.now.iso8601,
status: 'healthy',
checks: {}
}
# Check LLM provider connectivity
begin
# Quick LLM test
RCrewAI.client.chat(
messages: [{ role: 'user', content: 'test' }],
max_tokens: 1,
temperature: 0
)
checks[:checks][:llm] = { status: 'healthy' }
rescue => e
checks[:checks][:llm] = { status: 'unhealthy', error: e.message }
checks[:status] = 'unhealthy'
end
# Check memory usage
memory_usage = get_memory_usage
if memory_usage > 0.9
checks[:checks][:memory] = { status: 'warning', usage: memory_usage }
checks[:status] = 'degraded'
else
checks[:checks][:memory] = { status: 'healthy', usage: memory_usage }
end
checks
end
def perform_readiness_check
{
ready: true,
timestamp: Time.now.iso8601,
crews: @@crew_registry.crew_count,
uptime: Process.clock_gettime(Process::CLOCK_MONOTONIC).to_i # monotonic clock: seconds since system boot, not process start
}
end
def validate_execution_request(data)
required_fields = ['crew_name']
missing_fields = required_fields - data.keys
if missing_fields.any?
raise ValidationError, "Missing required fields: #{missing_fields.join(', ')}"
end
unless @@crew_registry.crew_exists?(data['crew_name'])
raise ValidationError, "Unknown crew: #{data['crew_name']}"
end
end
def execute_with_monitoring(crew, request_data)
start_time = Time.now
execution_id = SecureRandom.uuid
logger.info "Starting execution", {
execution_id: execution_id,
crew_name: crew.name,
request_id: request_data['request_id']
}
begin
# Execute crew
result = crew.execute(
timeout: ENV.fetch('EXECUTION_TIMEOUT', '300').to_i,
max_retries: ENV.fetch('MAX_RETRIES', '3').to_i
)
duration = Time.now - start_time
logger.info "Execution completed", {
execution_id: execution_id,
duration: duration,
success_rate: result[:success_rate]
}
{
execution_id: execution_id,
success: true,
duration: duration,
result: result
}
rescue => e
duration = Time.now - start_time
logger.error "Execution failed", {
execution_id: execution_id,
duration: duration,
error: e.message
}
raise
end
end
def get_memory_usage
# Simple memory usage check
memory_info = `cat /proc/meminfo`.split("\n")
total = memory_info.find { |line| line.start_with?('MemTotal:') }.split[1].to_i
available = memory_info.find { |line| line.start_with?('MemAvailable:') }.split[1].to_i
(total - available).to_f / total
rescue
0.0
end
end
class ValidationError < StandardError; end
class CrewRegistry
def initialize
@crews = {}
end
def register_crew(name, crew)
@crews[name] = crew
end
def get_crew(name)
crew = @crews[name]
raise ValidationError, "Crew not found: #{name}" unless crew
crew
end
def crew_exists?(name)
@crews.key?(name)
end
def crew_count
@crews.length
end
def register_default_crews
# Register your production crews here
support_crew = create_support_crew
register_crew('customer_support', support_crew)
analysis_crew = create_analysis_crew
register_crew('data_analysis', analysis_crew)
end
private
def create_support_crew
crew = RCrewAI::Crew.new("customer_support")
support_agent = RCrewAI::Agent.new(
name: "support_specialist",
role: "Customer Support Specialist",
goal: "Provide excellent customer support and resolve issues efficiently",
tools: [
RCrewAI::Tools::WebSearch.new(max_results: 5),
RCrewAI::Tools::FileReader.new
]
)
crew.add_agent(support_agent)
support_task = RCrewAI::Task.new(
name: "handle_support_request",
description: "Handle customer support request with empathy and expertise",
expected_output: "Professional support response with clear next steps"
)
crew.add_task(support_task)
crew
end
def create_analysis_crew
crew = RCrewAI::Crew.new("data_analysis")
analyst = RCrewAI::Agent.new(
name: "data_analyst",
role: "Senior Data Analyst",
goal: "Analyze data and provide actionable insights",
tools: [
RCrewAI::Tools::FileReader.new,
RCrewAI::Tools::WebSearch.new
]
)
crew.add_agent(analyst)
analysis_task = RCrewAI::Task.new(
name: "data_analysis",
description: "Perform comprehensive data analysis and generate insights",
expected_output: "Detailed analysis report with charts and recommendations"
)
crew.add_task(analysis_task)
crew
end
end
# Health check script for Docker
# health_check.rb
begin
require 'net/http'
uri = URI('http://localhost:8080/health')
response = Net::HTTP.get_response(uri)
exit(response.code == '200' ? 0 : 1)
rescue
exit 1
end
# Start the application
if __FILE__ == $0
ProductionRCrewAI.run!(
host: '0.0.0.0',
port: ENV.fetch('PORT', 8080).to_i
)
end
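With the service running, a client call against /execute might look like the following sketch (standard-library Net::HTTP; the crew name matches the registry above, and localhost:8080 is an assumption):
# examples/call_execute.rb -- illustrative client sketch
require 'net/http'
require 'json'
require 'uri'

uri = URI('http://localhost:8080/execute')
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = { crew_name: 'customer_support', request_id: 'demo-001' }.to_json

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts "HTTP #{response.code}"
puts JSON.pretty_generate(JSON.parse(response.body))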
Docker Compose for Development
# docker-compose.yml
version: '3.8'
services:
rcrewai:
build:
context: .
dockerfile: Dockerfile
target: production
ports:
- "8080:8080"
environment:
- RAILS_ENV=production
- OPENAI_API_KEY=${OPENAI_API_KEY}
- LLM_PROVIDER=openai
- LLM_TEMPERATURE=0.1
- EXECUTION_TIMEOUT=300
- MAX_RETRIES=3
depends_on:
- redis
- prometheus
restart: unless-stopped
healthcheck:
test: ["CMD", "ruby", "health_check.rb"]
interval: 30s
timeout: 10s
retries: 3
redis:
image: redis:7-alpine
command: redis-server --appendonly yes
volumes:
- redis_data:/data
restart: unless-stopped
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
restart: unless-stopped
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
- ./grafana/datasources:/etc/grafana/provisioning/datasources
restart: unless-stopped
volumes:
redis_data:
prometheus_data:
grafana_data:
Kubernetes Deployment
Deployment Configuration
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: rcrewai-app
labels:
app: rcrewai
version: v1.0.0
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
selector:
matchLabels:
app: rcrewai
template:
metadata:
labels:
app: rcrewai
version: v1.0.0
spec:
serviceAccountName: rcrewai-service-account
containers:
- name: rcrewai
image: your-registry/rcrewai:v1.0.0
ports:
- containerPort: 8080
env:
- name: RAILS_ENV
value: "production"
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: rcrewai-secrets
key: openai-api-key
- name: REDIS_URL
value: "redis://redis-service:6379"
- name: LLM_PROVIDER
value: "openai"
- name: LLM_TEMPERATURE
value: "0.1"
- name: EXECUTION_TIMEOUT
value: "300"
- name: MAX_RETRIES
value: "3"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
securityContext:
runAsNonRoot: true
runAsUser: 1000
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
volumeMounts:
- name: tmp-volume
mountPath: /tmp
volumes:
- name: tmp-volume
emptyDir: {}
imagePullSecrets:
- name: registry-secret
---
apiVersion: v1
kind: Service
metadata:
name: rcrewai-service
labels:
app: rcrewai
spec:
type: ClusterIP
ports:
- port: 80
targetPort: 8080
protocol: TCP
name: http
selector:
app: rcrewai
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: rcrewai-service-account
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: rcrewai-ingress
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/limit-rps: "100"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
tls:
- hosts:
- api.yourcompany.com
secretName: rcrewai-tls
rules:
- host: api.yourcompany.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: rcrewai-service
port:
number: 80
ConfigMap and Secrets
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: rcrewai-config
data:
LLM_PROVIDER: "openai"
LLM_TEMPERATURE: "0.1"
LLM_MAX_TOKENS: "4000"
EXECUTION_TIMEOUT: "300"
MAX_RETRIES: "3"
LOG_LEVEL: "INFO"
METRICS_ENABLED: "true"
---
apiVersion: v1
kind: Secret
metadata:
name: rcrewai-secrets
type: Opaque
data:
openai-api-key: <base64-encoded-api-key>
anthropic-api-key: <base64-encoded-api-key>
database-url: <base64-encoded-database-url>
Horizontal Pod Autoscaler
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: rcrewai-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: rcrewai-app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
Configuration Management
Environment-based Configuration
# config/production.rb
class ProductionConfig
def self.configure
RCrewAI.configure do |config|
# LLM Provider Configuration
config.llm_provider = ENV.fetch('LLM_PROVIDER', 'openai').to_sym
case config.llm_provider
when :openai
config.openai_api_key = ENV.fetch('OPENAI_API_KEY')
config.base_url = ENV['OPENAI_BASE_URL'] # Optional custom endpoint
when :anthropic
config.anthropic_api_key = ENV.fetch('ANTHROPIC_API_KEY')
when :azure
config.azure_api_key = ENV.fetch('AZURE_OPENAI_API_KEY')
config.base_url = ENV.fetch('AZURE_OPENAI_ENDPOINT')
config.api_version = ENV.fetch('AZURE_API_VERSION', '2023-05-15')
when :google
config.google_api_key = ENV.fetch('GOOGLE_API_KEY')
end
# Model Parameters
config.temperature = ENV.fetch('LLM_TEMPERATURE', '0.1').to_f
config.max_tokens = ENV.fetch('LLM_MAX_TOKENS', '4000').to_i
config.timeout = ENV.fetch('LLM_TIMEOUT', '60').to_i
# Production Settings
config.retry_limit = ENV.fetch('LLM_RETRY_LIMIT', '3').to_i
config.retry_delay = ENV.fetch('LLM_RETRY_DELAY', '2').to_i
config.max_concurrent_requests = ENV.fetch('MAX_CONCURRENT_REQUESTS', '10').to_i
# Logging
config.log_level = ENV.fetch('LOG_LEVEL', 'INFO').upcase
config.structured_logging = ENV.fetch('STRUCTURED_LOGGING', 'true') == 'true'
# Security
config.validate_ssl = ENV.fetch('VALIDATE_SSL', 'true') == 'true'
config.user_agent = "RCrewAI/#{RCrewAI::VERSION} (Production)"
end
end
def self.database_config
{
url: ENV.fetch('DATABASE_URL'),
pool_size: ENV.fetch('DB_POOL_SIZE', '5').to_i,
checkout_timeout: ENV.fetch('DB_CHECKOUT_TIMEOUT', '5').to_i,
reaping_frequency: ENV.fetch('DB_REAPING_FREQUENCY', '10').to_i
}
end
def self.redis_config
{
url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'),
timeout: ENV.fetch('REDIS_TIMEOUT', '5').to_i,
reconnect_attempts: ENV.fetch('REDIS_RECONNECT_ATTEMPTS', '3').to_i
}
end
def self.monitoring_config
{
metrics_enabled: ENV.fetch('METRICS_ENABLED', 'true') == 'true',
traces_enabled: ENV.fetch('TRACES_ENABLED', 'true') == 'true',
health_check_interval: ENV.fetch('HEALTH_CHECK_INTERVAL', '30').to_i,
performance_monitoring: ENV.fetch('PERFORMANCE_MONITORING', 'true') == 'true'
}
end
end
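A typical boot sequence calls this once before the web server starts; a minimal sketch (the file name is an assumption):
# config/boot.rb -- fail fast on missing configuration
require_relative 'production'

begin
  ProductionConfig.configure
rescue KeyError => e
  # ENV.fetch without a default raises KeyError when a required variable is unset
  abort "Configuration error: #{e.message}"
end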
Secrets Management with Vault
# config/vault_client.rb
require 'vault'
class VaultClient
def initialize
Vault.configure do |config|
config.address = ENV.fetch('VAULT_ADDR')
config.token = ENV['VAULT_TOKEN']
config.ssl_verify = ENV.fetch('VAULT_SSL_VERIFY', 'true') == 'true'
end
end
def get_secret(path)
secret = Vault.logical.read(path)
return nil unless secret
# KV v2 engines nest the payload under :data; fall back for KV v1
secret.data[:data] || secret.data
rescue Vault::VaultError => e
$logger.error "Vault error: #{e.message}"
raise
end
def get_database_credentials
get_secret('secret/data/database')
end
def get_llm_api_keys
get_secret('secret/data/llm_providers')
end
def refresh_secrets
# Implement secret rotation logic
new_secrets = get_llm_api_keys
if new_secrets
ENV['OPENAI_API_KEY'] = new_secrets[:openai_api_key]
ENV['ANTHROPIC_API_KEY'] = new_secrets[:anthropic_api_key]
# Reconfigure RCrewAI with new secrets
ProductionConfig.configure
end
end
end
# Periodic secret refresh
Thread.new do
vault_client = VaultClient.new
loop do
sleep(3600) # Refresh every hour
begin
vault_client.refresh_secrets
rescue => e
Rails.logger.error "Secret refresh failed: #{e.message}"
end
end
end
Monitoring and Observability
Prometheus Metrics
# lib/metrics.rb
require 'prometheus/client'
class RCrewAIMetrics
def initialize
@registry = Prometheus::Client.registry
setup_metrics
end
def setup_metrics
# Request counters
@request_total = @registry.counter(
:rcrewai_requests_total,
docstring: 'Total number of requests',
labels: [:method, :path, :status]
)
@execution_total = @registry.counter(
:rcrewai_executions_total,
docstring: 'Total number of crew executions',
labels: [:crew_name, :status]
)
# Duration histograms
@request_duration = @registry.histogram(
:rcrewai_request_duration_seconds,
docstring: 'Request duration in seconds',
labels: [:method, :path],
buckets: [0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0]
)
@execution_duration = @registry.histogram(
:rcrewai_execution_duration_seconds,
docstring: 'Crew execution duration in seconds',
labels: [:crew_name],
buckets: [1.0, 5.0, 10.0, 30.0, 60.0, 300.0, 600.0]
)
# Gauges
@active_executions = @registry.gauge(
:rcrewai_active_executions,
docstring: 'Number of active executions',
labels: [:crew_name]
)
@memory_usage = @registry.gauge(
:rcrewai_memory_usage_bytes,
docstring: 'Memory usage in bytes'
)
@llm_api_calls = @registry.counter(
:rcrewai_llm_api_calls_total,
docstring: 'Total LLM API calls',
labels: [:provider, :model, :status]
)
end
def record_request(method, path, status, duration)
@request_total.increment(labels: { method: method, path: path, status: status })
@request_duration.observe(duration, labels: { method: method, path: path })
end
def record_execution_start(crew_name)
@active_executions.increment(labels: { crew_name: crew_name })
end
def record_execution_complete(crew_name, status, duration)
@active_executions.decrement(labels: { crew_name: crew_name })
@execution_total.increment(labels: { crew_name: crew_name, status: status })
@execution_duration.observe(duration, labels: { crew_name: crew_name })
end
def record_llm_call(provider, model, status)
@llm_api_calls.increment(labels: { provider: provider, model: model, status: status })
end
def update_memory_usage
memory = get_memory_usage_bytes
@memory_usage.set(memory)
end
private
def get_memory_usage_bytes
`ps -o rss= -p #{Process.pid}`.to_i * 1024
rescue
0
end
end
# Initialize global metrics instance
$metrics = RCrewAIMetrics.new
# Middleware for automatic metrics collection
class MetricsMiddleware
def initialize(app)
@app = app
end
def call(env)
start_time = Time.now
method = env['REQUEST_METHOD']
path = env['PATH_INFO']
status, headers, body = @app.call(env)
duration = Time.now - start_time
$metrics.record_request(method, path, status.to_s, duration)
[status, headers, body]
end
end
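To tie crew-level metrics into the application, execute_with_monitoring from the production app can record start and completion around the crew call. A sketch, assuming $metrics is initialized at boot as above:
# Inside execute_with_monitoring (sketch)
$metrics.record_execution_start(crew.name)
start_time = Time.now
begin
  result = crew.execute
  $metrics.record_execution_complete(crew.name, 'success', Time.now - start_time)
  result
rescue
  $metrics.record_execution_complete(crew.name, 'failure', Time.now - start_time)
  raise
end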
Structured Logging
# lib/structured_logger.rb
require 'json'
require 'logger'
class StructuredLogger
def initialize(output = $stdout)
@logger = Logger.new(output)
@logger.level = Logger.const_get(ENV.fetch('LOG_LEVEL', 'INFO').upcase)
@logger.formatter = method(:json_formatter)
end
def info(message, context = {})
@logger.info(log_entry(message, context))
end
def warn(message, context = {})
@logger.warn(log_entry(message, context))
end
def error(message, context = {})
@logger.error(log_entry(message, context))
end
def debug(message, context = {})
@logger.debug(log_entry(message, context))
end
private
def log_entry(message, context)
{
timestamp: Time.now.utc.iso8601,
level: caller_locations(1, 1)[0].label.upcase, # name of the calling wrapper: info/warn/error/debug
message: message,
service: 'rcrewai',
version: RCrewAI::VERSION,
environment: ENV.fetch('RAILS_ENV', 'development'),
process_id: Process.pid,
thread_id: Thread.current.object_id
}.merge(context)
end
def json_formatter(severity, timestamp, progname, msg)
if msg.is_a?(Hash)
msg.to_json + "\n"
else
{
timestamp: timestamp.utc.iso8601,
level: severity,
message: msg.to_s,
service: 'rcrewai'
}.to_json + "\n"
end
end
end
# Global logger instance
$logger = StructuredLogger.new
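Each call emits a single JSON line that log aggregators can parse without extra pipeline configuration:
# Usage example
$logger.info('Crew execution queued', crew_name: 'customer_support', request_id: 'demo-001')
# => {"timestamp":"...","level":"INFO","message":"Crew execution queued","service":"rcrewai",...}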
Distributed Tracing
# lib/tracing.rb
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/jaeger'
require 'opentelemetry/instrumentation/all'
class TracingSetup
def self.configure
OpenTelemetry::SDK.configure do |c|
c.service_name = 'rcrewai'
c.service_version = RCrewAI::VERSION
c.add_span_processor(
OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
# Port 14268 is Jaeger's collector HTTP port, so use the CollectorExporter;
# the AgentExporter instead speaks UDP and takes host:/port: (default 6831)
OpenTelemetry::Exporter::Jaeger::CollectorExporter.new(
endpoint: ENV.fetch('JAEGER_ENDPOINT', 'http://localhost:14268/api/traces')
)
)
)
c.use_all() # Enable all instrumentations
end
end
def self.tracer
OpenTelemetry.tracer_provider.tracer('rcrewai', RCrewAI::VERSION)
end
end
# Initialize tracing
TracingSetup.configure if ENV.fetch('TRACING_ENABLED', 'true') == 'true'
# Tracing middleware
class TracingMiddleware
def initialize(app)
@app = app
@tracer = TracingSetup.tracer
end
def call(env)
@tracer.in_span("http_request") do |span|
span.set_attribute('http.method', env['REQUEST_METHOD'])
span.set_attribute('http.target', env['PATH_INFO']) # PATH_INFO is the request path, not the full URL
status, headers, body = @app.call(env)
span.set_attribute('http.status_code', status)
span.status = OpenTelemetry::Trace::Status.error if status >= 400
[status, headers, body]
end
end
end
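HTTP spans only cover the request boundary; crew executions are natural units of work to trace as well. A sketch that wraps crew.execute in a custom span inside the execution path:
# Wrap a crew execution in a custom span (sketch)
tracer = TracingSetup.tracer
tracer.in_span('crew_execution') do |span|
  span.set_attribute('crew.name', crew.name)
  span.set_attribute('execution.id', execution_id)
  crew.execute
end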
Scaling and Load Balancing
Auto-scaling Configuration
# k8s/vertical-pod-autoscaler.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: rcrewai-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: rcrewai-app
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: rcrewai
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 1
memory: 1Gi
Load Balancer Configuration
# nginx.conf
upstream rcrewai_backend {
least_conn;
server rcrewai-1:8080 max_fails=3 fail_timeout=30s;
server rcrewai-2:8080 max_fails=3 fail_timeout=30s;
server rcrewai-3:8080 max_fails=3 fail_timeout=30s;
}
# limit_req_zone must be declared in the http context, outside any server block
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
server {
listen 80;
server_name api.yourcompany.com;
# Rate limiting
limit_req zone=api burst=20 nodelay;
# Request timeout
proxy_read_timeout 300s;
proxy_connect_timeout 60s;
proxy_send_timeout 60s;
location / {
proxy_pass http://rcrewai_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Health check
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 60s;
}
location /health {
access_log off;
proxy_pass http://rcrewai_backend;
}
location /metrics {
access_log off;
allow 10.0.0.0/8;
allow 192.168.0.0/16;
deny all;
proxy_pass http://rcrewai_backend;
}
}
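nginx throttles at the edge, but an application-level limiter also protects pods that are reached directly. A minimal sketch using the rack-attack gem (an assumption; it is not part of RCrewAI):
# config/rack_attack.rb -- app-level throttling sketch (rack-attack gem)
require 'rack/attack'

# Allow at most 100 requests per IP per minute; health checks are exempt
Rack::Attack.throttle('requests/ip', limit: 100, period: 60) do |request|
  request.ip unless request.path == '/health'
end

# Then in production_app.rb: use Rack::Attack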
Security and Access Control
Network Policies
# k8s/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: rcrewai-network-policy
spec:
podSelector:
matchLabels:
app: rcrewai
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-system
- podSelector:
matchLabels:
app: load-balancer
ports:
- protocol: TCP
port: 8080
egress:
- to: []
ports:
- protocol: TCP
port: 443 # HTTPS
- protocol: TCP
port: 53 # DNS
- protocol: UDP
port: 53 # DNS
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
Pod Security Standards
Note: PodSecurityPolicy (policy/v1beta1) was removed in Kubernetes 1.25. On current clusters, enforce the equivalent restrictions with Pod Security Admission namespace labels (for example, pod-security.kubernetes.io/enforce: restricted); the legacy policy below applies only to clusters still serving the v1beta1 API.
# k8s/pod-security-policy.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: rcrewai-psp
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'persistentVolumeClaim'
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
fsGroup:
rule: 'RunAsAny'
Authentication and Authorization
# lib/auth.rb
require 'jwt'
class AuthenticationMiddleware
def initialize(app)
@app = app
@secret = ENV.fetch('JWT_SECRET')
end
def call(env)
# Skip auth for health checks
if env['PATH_INFO'] == '/health' || env['PATH_INFO'] == '/ready'
return @app.call(env)
end
auth_header = env['HTTP_AUTHORIZATION']
unless auth_header&.start_with?('Bearer ')
return unauthorized_response
end
token = auth_header.sub('Bearer ', '')
begin
payload = JWT.decode(token, @secret, true, algorithm: 'HS256')[0]
env['user_id'] = payload['user_id']
env['permissions'] = payload['permissions'] || []
@app.call(env)
rescue JWT::DecodeError
unauthorized_response
end
end
private
def unauthorized_response
[401, {'Content-Type' => 'application/json'}, [
{ error: 'Unauthorized' }.to_json
]]
end
end
class AuthorizationMiddleware
def initialize(app)
@app = app
end
def call(env)
permissions = env['permissions'] || []
path = env['PATH_INFO']
method = env['REQUEST_METHOD']
required_permission = determine_required_permission(method, path)
if required_permission && !permissions.include?(required_permission)
return forbidden_response
end
@app.call(env)
end
private
def determine_required_permission(method, path)
case [method, path]
when ['POST', '/execute']
'execute_crew'
when ['GET', '/metrics']
'view_metrics'
else
nil # No special permission required
end
end
def forbidden_response
[403, {'Content-Type' => 'application/json'}, [
{ error: 'Forbidden' }.to_json
]]
end
end
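To exercise the protected endpoints in testing, a token carrying the claims the middleware reads can be minted with the same secret; a sketch:
# Mint a test token (sketch; claims match what AuthenticationMiddleware expects)
require 'jwt'

payload = {
  user_id: 'test-user',
  permissions: ['execute_crew', 'view_metrics'],
  exp: Time.now.to_i + 3600 # the jwt gem verifies exp automatically on decode
}
token = JWT.encode(payload, ENV.fetch('JWT_SECRET'), 'HS256')
puts "Authorization: Bearer #{token}"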
CI/CD Pipeline
GitHub Actions Workflow
# .github/workflows/deploy.yml
name: Deploy to Production
on:
push:
branches: [main]
tags: ['v*']
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Ruby
uses: ruby/setup-ruby@v1
with:
ruby-version: 3.1
bundler-cache: true
- name: Run tests
run: bundle exec rspec
- name: Run security scan
run: |
bundle exec bundle-audit check --update
bundle exec brakeman -q -w2
- name: Check code style
run: bundle exec rubocop
build:
needs: test
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- uses: actions/checkout@v3
- name: Log in to Container Registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=ref,event=branch
type=ref,event=pr
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
- name: Build and push Docker image
uses: docker/build-push-action@v4
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
deploy:
needs: build
runs-on: ubuntu-latest
environment: production
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v3
- name: Configure kubectl
uses: azure/k8s-set-context@v1
with:
method: kubeconfig
kubeconfig: ${{ secrets.KUBE_CONFIG }}
- name: Deploy to Kubernetes
run: |
kubectl set image deployment/rcrewai-app \
rcrewai=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
kubectl rollout status deployment/rcrewai-app --timeout=600s
- name: Run smoke tests
run: |
kubectl wait --for=condition=ready pod -l app=rcrewai --timeout=300s
./scripts/smoke-tests.sh
Deployment Scripts
#!/bin/bash
# scripts/deploy.sh
set -euo pipefail
ENVIRONMENT=${1:-production}
IMAGE_TAG=${2:-latest}
echo "Deploying RCrewAI to $ENVIRONMENT with tag $IMAGE_TAG"
# Update deployment with new image
kubectl set image deployment/rcrewai-app \
rcrewai="ghcr.io/yourorg/rcrewai:$IMAGE_TAG" \
--namespace="$ENVIRONMENT"
# Wait for rollout to complete
kubectl rollout status deployment/rcrewai-app \
--namespace="$ENVIRONMENT" \
--timeout=600s
# Verify deployment
echo "Verifying deployment..."
kubectl get pods -l app=rcrewai --namespace="$ENVIRONMENT"
# Run health check
echo "Running health check..."
HEALTH_URL=$(kubectl get service rcrewai-service --namespace="$ENVIRONMENT" \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -f "http://$HEALTH_URL/health" || {
echo "Health check failed!"
exit 1
}
echo "Deployment successful!"
#!/bin/bash
# scripts/smoke-tests.sh
set -euo pipefail
SERVICE_URL=${SERVICE_URL:-http://localhost:8080}
echo "Running smoke tests against $SERVICE_URL"
# Test 1: Health check
echo "Testing health endpoint..."
response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/health")
if [[ $response != "200" ]]; then
echo "Health check failed: $response"
exit 1
fi
# Test 2: Ready check
echo "Testing readiness endpoint..."
response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/ready")
if [[ $response != "200" ]]; then
echo "Readiness check failed: $response"
exit 1
fi
# Test 3: Metrics endpoint
echo "Testing metrics endpoint..."
response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/metrics")
if [[ $response != "200" ]] && [[ $response != "403" ]]; then
echo "Metrics check failed: $response"
exit 1
fi
# Test 4: Basic execution (if auth allows)
echo "Testing basic execution..."
response=$(curl -s -X POST "$SERVICE_URL/execute" \
-H "Content-Type: application/json" \
-d '{"crew_name": "customer_support", "request_id": "test-123"}' \
-w "%{http_code}" -o /dev/null)
# Accept 401/403 for auth-protected endpoints
if [[ $response != "200" ]] && [[ $response != "401" ]] && [[ $response != "403" ]]; then
echo "Execution test failed: $response"
exit 1
fi
echo "All smoke tests passed!"
Operational Procedures
Monitoring Dashboard
# grafana/dashboards/rcrewai-dashboard.json
{
"dashboard": {
"title": "RCrewAI Production Dashboard",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(rcrewai_requests_total[5m])",
"legendFormat": " "
}
]
},
{
"title": "Response Times",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(rcrewai_request_duration_seconds_bucket[5m]))",
"legendFormat": "95th percentile"
},
{
"expr": "histogram_quantile(0.50, rate(rcrewai_request_duration_seconds_bucket[5m]))",
"legendFormat": "50th percentile"
}
]
},
{
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "rate(rcrewai_requests_total{status=~\"5..\"}[5m]) / rate(rcrewai_requests_total[5m]) * 100",
"legendFormat": "Error Rate %"
}
]
},
{
"title": "Crew Executions",
"type": "graph",
"targets": [
{
"expr": "rate(rcrewai_executions_total[5m])",
"legendFormat": " "
}
]
},
{
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "rcrewai_memory_usage_bytes / 1024 / 1024",
"legendFormat": "Memory MB"
}
]
},
{
"title": "Active Executions",
"type": "stat",
"targets": [
{
"expr": "sum(rcrewai_active_executions)",
"legendFormat": "Active"
}
]
}
]
}
}
Alerting Rules
# prometheus/alerts.yml
groups:
- name: rcrewai
rules:
- alert: HighErrorRate
expr: rate(rcrewai_requests_total{status=~"5.."}[5m]) / rate(rcrewai_requests_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is % for the last 5 minutes"
- alert: HighResponseTime
expr: histogram_quantile(0.95, rate(rcrewai_request_duration_seconds_bucket[5m])) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High response time detected"
description: "95th percentile response time is s"
- alert: ServiceDown
expr: up{job="rcrewai"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "RCrewAI service is down"
description: "RCrewAI service has been down for more than 1 minute"
- alert: HighMemoryUsage
expr: rcrewai_memory_usage_bytes / 1024 / 1024 / 1024 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Memory usage is GB"
- alert: TooManyActiveExecutions
expr: sum(rcrewai_active_executions) > 50
for: 2m
labels:
severity: warning
annotations:
summary: "Too many active executions"
description: " executions are currently active"
Backup and Recovery
#!/bin/bash
# scripts/backup.sh
set -euo pipefail
BACKUP_DIR=${BACKUP_DIR:-/backups}
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
echo "Starting backup at $TIMESTAMP"
# Backup configuration
kubectl get configmap rcrewai-config -o yaml > "$BACKUP_DIR/config_$TIMESTAMP.yaml"
kubectl get secret rcrewai-secrets -o yaml > "$BACKUP_DIR/secrets_$TIMESTAMP.yaml"
# Backup persistent data (if any)
if kubectl get pvc rcrewai-data 2>/dev/null; then
# No TTY here: -t would corrupt the binary stream on stdout
kubectl exec deployment/rcrewai-app -- tar czf - /data > "$BACKUP_DIR/data_$TIMESTAMP.tar.gz"
fi
# Cleanup old backups (keep last 30 days)
find "$BACKUP_DIR" -name "*.yaml" -mtime +30 -delete
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +30 -delete
echo "Backup completed successfully"
Troubleshooting and Recovery
Common Issues and Solutions
High Memory Usage
# Check memory usage
kubectl top pods -l app=rcrewai
# Check for memory leaks
kubectl exec -it deployment/rcrewai-app -- ps aux
# Restart pods if needed
kubectl rollout restart deployment/rcrewai-app
Slow Response Times
# Check CPU usage
kubectl top pods -l app=rcrewai
# Scale up if needed
kubectl scale deployment rcrewai-app --replicas=5
# Check database connections
kubectl logs deployment/rcrewai-app | grep -i "database\|connection"
Failed Deployments
# Check rollout status
kubectl rollout status deployment/rcrewai-app
# Check pod logs
kubectl logs deployment/rcrewai-app --previous
# Rollback if needed
kubectl rollout undo deployment/rcrewai-app
Recovery Procedures
Complete Service Recovery
#!/bin/bash
# scripts/disaster-recovery.sh
set -euo pipefail
echo "Starting disaster recovery procedure"
# 1. Restore configuration
kubectl apply -f backups/config_latest.yaml
kubectl apply -f backups/secrets_latest.yaml
# 2. Deploy application
kubectl apply -f k8s/
# 3. Wait for deployment
kubectl wait --for=condition=available deployment/rcrewai-app --timeout=600s
# 4. Restore data if needed
if [[ -f "backups/data_latest.tar.gz" ]]; then
# -i keeps stdin open for the archive; -t (TTY) would break the pipe
kubectl exec -i deployment/rcrewai-app -- tar xzf - -C / < backups/data_latest.tar.gz
fi
# 5. Verify service
./scripts/smoke-tests.sh
echo "Disaster recovery completed"
Best Practices Summary
1. Security
- Use non-root containers
- Implement network policies
- Manage secrets properly
- Enable authentication and authorization
- Regular security scans
2. Reliability
- Health and readiness checks
- Resource limits and requests
- Graceful shutdown handling (see the sketch after this list)
- Circuit breakers for external calls
- Comprehensive error handling
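Graceful shutdown deserves a concrete illustration: Kubernetes sends SIGTERM on pod termination and SIGKILL after terminationGracePeriodSeconds, so the process should stop accepting new work and drain in-flight executions in between. A sketch, where the in-flight counter is an assumption rather than an RCrewAI feature:
# lib/graceful_shutdown.rb -- sketch
$shutting_down = false
$inflight = 0
$inflight_lock = Mutex.new

# Only set a flag in the handler; anything heavier is unsafe in trap context
Signal.trap('TERM') { $shutting_down = true }

# Wrap each crew execution so in-flight work is counted
def track_execution
  $inflight_lock.synchronize { $inflight += 1 }
  yield
ensure
  $inflight_lock.synchronize { $inflight -= 1 }
end

# On shutdown: return 503 for new requests, then wait for in-flight work
def drain(grace_seconds = 25)
  deadline = Time.now + grace_seconds
  sleep 0.5 while $inflight.positive? && Time.now < deadline
end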
3. Scalability
- Horizontal pod autoscaling
- Load balancing
- Stateless application design
- Resource optimization
- Performance monitoring
4. Observability
- Structured logging
- Comprehensive metrics
- Distributed tracing
- Real-time alerting
- Dashboard visualization
5. Operations
- Automated deployments
- Blue-green deployments
- Backup and recovery procedures
- Incident response playbooks
- Regular performance reviews
This production deployment guide provides a comprehensive foundation for running RCrewAI applications at scale with enterprise-grade reliability, security, and observability.