Building a Distributed VPN Infrastructure with 99.9% Uptime

Deep dive into designing and deploying anti-censorship VPN protocols optimized for restrictive networks using Xray Core and Golang

As CTO at Vexonik, I designed and deployed a distributed VPN infrastructure achieving 99.9% uptime across multiple nodes. This article covers the technical architecture, advanced anti-censorship protocols, and lessons learned building production-grade VPN systems.

The Challenge

Building a VPN service that works reliably in restrictive networks (like China's Great Firewall) requires more than just encrypting traffic. You need:

  • Advanced protocols that can't be easily detected
  • High availability across multiple nodes
  • Low latency for good user experience
  • Scalability to handle thousands of concurrent connections
  • Monitoring for quick issue detection

Architecture Overview

Our infrastructure consists of:

  1. Control Plane: Management API and authentication
  2. Data Plane: Multiple VPN nodes distributed globally
  3. Monitoring System: Real-time health checks and alerts
  4. Load Balancer: Traffic distribution and failover

Traffic flows from the client through the load balancer to the nodes:

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│Load Balancer│
└──────┬──────┘
       │
       ├──────────┬──────────┬──────────┐
       ▼          ▼          ▼          ▼
   ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
   │Node1│    │Node2│    │Node3│    │Node4│
   └─────┘    └─────┘    └─────┘    └─────┘

Core Technology: Xray-Core

We chose Xray-Core as our foundation because it supports advanced protocols like VLESS, VMess, and Trojan with anti-censorship features.

1. Basic Xray Configuration

Here's a production-ready Xray config with VLESS + Reality protocol:

{
  "log": {
    "loglevel": "warning"
  },
  "inbounds": [
    {
      "port": 443,
      "protocol": "vless",
      "settings": {
        "clients": [
          {
            "id": "UUID-HERE",
            "flow": "xtls-rprx-vision"
          }
        ],
        "decryption": "none"
      },
      "streamSettings": {
        "network": "tcp",
        "security": "reality",
        "realitySettings": {
          "show": false,
          "dest": "www.microsoft.com:443",
          "xver": 0,
          "serverNames": [
            "www.microsoft.com"
          ],
          "privateKey": "PRIVATE-KEY-HERE",
          "shortIds": [
            "",
            "0123456789abcdef"
          ]
        }
      }
    }
  ],
  "outbounds": [
    {
      "protocol": "freedom",
      "tag": "direct"
    }
  ]
}
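
The shortIds entries in the config above are hex strings. As far as the Xray documentation describes them, they may be up to 16 hex characters long, with an even number of characters. A minimal sketch of generating one in Go follows; newShortID is a hypothetical helper name, not part of Xray:

```go
package main

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"
)

// newShortID returns n random bytes encoded as 2*n lowercase hex characters.
// The 1..8 byte limit reflects the assumed 16-character maximum for a shortId.
func newShortID(n int) (string, error) {
    if n < 1 || n > 8 {
        return "", fmt.Errorf("shortId length must be 1..8 bytes, got %d", n)
    }
    b := make([]byte, n)
    if _, err := rand.Read(b); err != nil {
        return "", err
    }
    return hex.EncodeToString(b), nil
}

func main() {
    id, err := newShortID(8)
    if err != nil {
        panic(err)
    }
    fmt.Println(id) // 16 hex characters, different on every run
}
```

The same value goes into the server's shortIds list and the client's shortId field.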

2. Reality Protocol Explained

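The Reality protocol is designed to make the server hard to tell apart from an ordinary HTTPS site under deep packet inspection (DPI) and active probing:

  • TLS Fingerprinting: The handshake mimics legitimate HTTPS traffic to the configured destination
  • SNI Routing: Connections that fail authentication, such as censor probes, are forwarded to the real website set in dest
  • Vision Flow: The xtls-rprx-vision flow pads the start of each connection to mask TLS-in-TLS patterns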

Building the Control API with Golang

We built the management API in Go for performance and concurrency:

package main
 
import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
    
    "github.com/gorilla/mux"
)
 
type VPNNode struct {
    ID          string    `json:"id"`
    IP          string    `json:"ip"`
    Location    string    `json:"location"`
    Status      string    `json:"status"`
    Load        float64   `json:"load"`
    LastChecked time.Time `json:"last_checked"`
}
 
type NodeManager struct {
    nodes map[string]*VPNNode
}
 
func NewNodeManager() *NodeManager {
    return &NodeManager{
        nodes: make(map[string]*VPNNode),
    }
}
 
func (nm *NodeManager) AddNode(node *VPNNode) {
    nm.nodes[node.ID] = node
}
 
func (nm *NodeManager) GetHealthyNodes() []*VPNNode {
    healthy := make([]*VPNNode, 0)
    
    for _, node := range nm.nodes {
        if node.Status == "healthy" && node.Load < 0.8 {
            healthy = append(healthy, node)
        }
    }
    
    return healthy
}
 
func (nm *NodeManager) SelectBestNode() *VPNNode {
    nodes := nm.GetHealthyNodes()
    
    if len(nodes) == 0 {
        return nil
    }
    
    // Select node with lowest load
    bestNode := nodes[0]
    for _, node := range nodes {
        if node.Load < bestNode.Load {
            bestNode = node
        }
    }
    
    return bestNode
}
 
// HTTP Handlers
func (nm *NodeManager) HandleGetNodes(w http.ResponseWriter, r *http.Request) {
    nodes := make([]*VPNNode, 0, len(nm.nodes))
    for _, node := range nm.nodes {
        nodes = append(nodes, node)
    }
    
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(nodes)
}
 
func (nm *NodeManager) HandleGetBestNode(w http.ResponseWriter, r *http.Request) {
    node := nm.SelectBestNode()
    
    if node == nil {
        http.Error(w, "No healthy nodes available", http.StatusServiceUnavailable)
        return
    }
    
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(node)
}
 
func main() {
    manager := NewNodeManager()
    
    // Initialize nodes
    manager.AddNode(&VPNNode{
        ID:       "node-1",
        IP:       "192.168.1.1",
        Location: "Singapore",
        Status:   "healthy",
        Load:     0.45,
    })
    
    // Setup HTTP server
    r := mux.NewRouter()
    r.HandleFunc("/api/nodes", manager.HandleGetNodes).Methods("GET")
    r.HandleFunc("/api/nodes/best", manager.HandleGetBestNode).Methods("GET")
    
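    fmt.Println("API Server starting on :8080")
    // ListenAndServe only returns on failure, so report the error
    if err := http.ListenAndServe(":8080", r); err != nil {
        fmt.Println("API server stopped:", err)
    }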
}

User Management & Authentication

We implemented a secure user authentication system:

package auth
 
import (
    "crypto/rand"
    "fmt"
    "time"
    
    "github.com/golang-jwt/jwt/v5"
    "golang.org/x/crypto/bcrypt"
)
 
type User struct {
    ID           string    `json:"id"`
    Email        string    `json:"email"`
    PasswordHash string    `json:"-"`
    UUID         string    `json:"uuid"` // For Xray
    CreatedAt    time.Time `json:"created_at"`
    ExpiresAt    time.Time `json:"expires_at"`
    IsActive     bool      `json:"is_active"`
}
 
type AuthService struct {
    jwtSecret []byte
}
 
func NewAuthService(secret string) *AuthService {
    return &AuthService{
        jwtSecret: []byte(secret),
    }
}
 
func (as *AuthService) HashPassword(password string) (string, error) {
    bytes, err := bcrypt.GenerateFromPassword([]byte(password), 14)
    return string(bytes), err
}
 
func (as *AuthService) CheckPassword(password, hash string) bool {
    err := bcrypt.CompareHashAndPassword([]byte(hash), []byte(password))
    return err == nil
}
 
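func (as *AuthService) GenerateUUID() (string, error) {
    b := make([]byte, 16)
    _, err := rand.Read(b)
    if err != nil {
        return "", err
    }
    
    // Set the version (4) and RFC 4122 variant bits so the result is a valid random UUID
    b[6] = (b[6] & 0x0f) | 0x40
    b[8] = (b[8] & 0x3f) | 0x80
    
    return fmt.Sprintf("%x-%x-%x-%x-%x",
        b[0:4], b[4:6], b[6:8], b[8:10], b[10:]), nil
}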
 
func (as *AuthService) GenerateJWT(userID string) (string, error) {
    claims := jwt.MapClaims{
        "user_id": userID,
        "exp":     time.Now().Add(time.Hour * 24 * 7).Unix(),
    }
    
    token := jwt.NewWithClaims(jwt.SigningMethodHS256, claims)
    return token.SignedString(as.jwtSecret)
}
 
func (as *AuthService) ValidateJWT(tokenString string) (*jwt.Token, error) {
    return jwt.Parse(tokenString, func(token *jwt.Token) (interface{}, error) {
        if _, ok := token.Method.(*jwt.SigningMethodHMAC); !ok {
            return nil, fmt.Errorf("unexpected signing method")
        }
        return as.jwtSecret, nil
    })
}

Health Monitoring System

Critical for maintaining 99.9% uptime:

package monitor
 
import (
    "context"
    "fmt"
    "net/http"
    "time"
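)
 
// VPNNode is the node type from the control API. nodeLatency and nodeStatus
// are assumed to be Prometheus gauge vectors defined elsewhere in this package.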
 
type HealthChecker struct {
    nodes    []*VPNNode
    interval time.Duration
}
 
func NewHealthChecker(nodes []*VPNNode, interval time.Duration) *HealthChecker {
    return &HealthChecker{
        nodes:    nodes,
        interval: interval,
    }
}
 
func (hc *HealthChecker) Start(ctx context.Context) {
    ticker := time.NewTicker(hc.interval)
    defer ticker.Stop()
    
    for {
        select {
        case <-ticker.C:
            hc.checkAllNodes()
        case <-ctx.Done():
            return
        }
    }
}
 
func (hc *HealthChecker) checkAllNodes() {
    for _, node := range hc.nodes {
        go hc.checkNode(node)
    }
}
 
func (hc *HealthChecker) checkNode(node *VPNNode) {
    start := time.Now()
    
    // Check HTTP endpoint
    client := &http.Client{
        Timeout: 5 * time.Second,
    }
    
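    resp, err := client.Get(fmt.Sprintf("http://%s/health", node.IP))
    latency := time.Since(start).Milliseconds()
    
    if err == nil {
        // Close the body so the connection is released
        defer resp.Body.Close()
    }
    
    if err != nil || resp.StatusCode != http.StatusOK {
        node.Status = "unhealthy"
        node.LastChecked = time.Now()
        // Reset the gauge so dashboards do not keep showing the node as up
        nodeStatus.WithLabelValues(node.ID).Set(0)
        hc.alertNodeDown(node)
        return
    }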
    
    node.Status = "healthy"
    node.LastChecked = time.Now()
    
    // Update metrics
    hc.updateMetrics(node, latency)
}
 
func (hc *HealthChecker) updateMetrics(node *VPNNode, latency int64) {
    // Update Prometheus metrics
    nodeLatency.WithLabelValues(node.ID).Set(float64(latency))
    nodeStatus.WithLabelValues(node.ID).Set(1)
}
 
func (hc *HealthChecker) alertNodeDown(node *VPNNode) {
    // Send alert to monitoring system (Slack, PagerDuty, etc.)
    fmt.Printf("ALERT: Node %s is down!\n", node.ID)
}

Traffic Obfuscation Techniques

1. uTLS Implementation

uTLS mimics the TLS ClientHello of mainstream browsers, which makes the client harder to single out by TLS fingerprinting:

package obfuscation
 
import (
    "net"
    
    utls "github.com/refraction-networking/utls"
)
 
func CreateObfuscatedConnection(serverName string) (*utls.UConn, error) {
    tcpConn, err := net.Dial("tcp", serverName+":443")
    if err != nil {
        return nil, err
    }
    
    // Use Chrome fingerprint
    config := &utls.Config{
        ServerName: serverName,
    }
    
    uconn := utls.UClient(tcpConn, config, utls.HelloChrome_Auto)
    
    err = uconn.Handshake()
    if err != nil {
        // Close the socket so a failed handshake does not leak it
        tcpConn.Close()
        return nil, err
    }
    
    return uconn, nil
}

2. Vision Flow Control

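Vision itself is switched on by the "flow": "xtls-rprx-vision" user setting shown earlier. On the client, it is paired with Reality stream settings such as: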

{
  "streamSettings": {
    "security": "reality",
    "realitySettings": {
      "fingerprint": "chrome",
      "serverName": "www.microsoft.com",
      "publicKey": "PUBLIC-KEY",
      "shortId": "0123456789abcdef",
      "spiderX": "/"
    }
  }
}

Performance Optimization

1. Connection Pooling

package pool
 
import (
    "sync"
)
 
// Connection is assumed to be defined elsewhere in this package,
// with IsAlive and Close methods.
 
type ConnectionPool struct {
    conns     chan *Connection
    factory   func() (*Connection, error)
    maxSize   int
    mu        sync.Mutex
}
 
func NewConnectionPool(maxSize int, factory func() (*Connection, error)) *ConnectionPool {
    return &ConnectionPool{
        conns:   make(chan *Connection, maxSize),
        factory: factory,
        maxSize: maxSize,
    }
}
 
func (p *ConnectionPool) Get() (*Connection, error) {
    select {
    case conn := <-p.conns:
        if conn.IsAlive() {
            return conn, nil
        }
        // Discard the dead connection before creating a new one
        conn.Close()
    default:
    }
    
    return p.factory()
}
 
func (p *ConnectionPool) Put(conn *Connection) {
    select {
    case p.conns <- conn:
    default:
        conn.Close()
    }
}

2. Load Balancing Algorithm

We use round-robin that skips unhealthy or heavily loaded nodes:

type LoadBalancer struct {
    nodes   []*VPNNode
    current int
    mu      sync.Mutex
}
 
func (lb *LoadBalancer) GetNext() *VPNNode {
    lb.mu.Lock()
    defer lb.mu.Unlock()
    
    // Find next healthy node
    attempts := 0
    for attempts < len(lb.nodes) {
        lb.current = (lb.current + 1) % len(lb.nodes)
        node := lb.nodes[lb.current]
        
        if node.Status == "healthy" && node.Load < 0.9 {
            return node
        }
        
        attempts++
    }
    
    return nil
}
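
GetNext treats every eligible node equally. If per-node weights are wanted, one option is smooth weighted round-robin, sketched below. The Weight field and the weightedNode and weightedBalancer types are assumptions for illustration, not part of the code above:

```go
package main

import (
    "fmt"
    "sync"
)

// weightedNode is a hypothetical node with a static weight (e.g. relative capacity).
type weightedNode struct {
    ID      string
    Weight  int
    Healthy bool
    current int // running score used by the algorithm
}

type weightedBalancer struct {
    mu    sync.Mutex
    nodes []*weightedNode
}

// Next raises every healthy node's score by its weight, picks the highest
// score, then lowers the winner's score by the total weight. Over time each
// node is picked in proportion to its weight, without long runs of one node.
func (b *weightedBalancer) Next() *weightedNode {
    b.mu.Lock()
    defer b.mu.Unlock()

    total := 0
    var best *weightedNode
    for _, n := range b.nodes {
        if !n.Healthy || n.Weight <= 0 {
            continue
        }
        n.current += n.Weight
        total += n.Weight
        if best == nil || n.current > best.current {
            best = n
        }
    }
    if best == nil {
        return nil // no healthy node available
    }
    best.current -= total
    return best
}

func main() {
    lb := &weightedBalancer{nodes: []*weightedNode{
        {ID: "node-1", Weight: 2, Healthy: true},
        {ID: "node-2", Weight: 1, Healthy: true},
    }}
    for i := 0; i < 3; i++ {
        fmt.Print(lb.Next().ID, " ")
    }
    fmt.Println() // node-1 node-2 node-1
}
```

With weights 2 and 1, node-1 receives two of every three selections.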

Deployment with Docker

Production-ready Dockerfile:

FROM golang:1.21-alpine AS builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o vpn-server ./cmd/server

FROM alpine:latest

RUN apk --no-cache add ca-certificates
WORKDIR /root/

# Install Xray-core
ADD https://github.com/XTLS/Xray-core/releases/latest/download/Xray-linux-64.zip /tmp/
RUN unzip /tmp/Xray-linux-64.zip -d /usr/local/bin/ && \
    chmod +x /usr/local/bin/xray

COPY --from=builder /app/vpn-server .
COPY config.json .

EXPOSE 443 8080

CMD ["./vpn-server"]

Docker Compose for multi-node setup:

version: '3.8'
 
services:
  vpn-node-1:
    build: .
    ports:
      - "443:443"
      - "8080:8080"
    environment:
      - NODE_ID=node-1
      - NODE_LOCATION=Singapore
    volumes:
      - ./config-node1.json:/root/config.json
    restart: unless-stopped
 
  vpn-node-2:
    build: .
    ports:
      - "444:443"
      - "8081:8080"
    environment:
      - NODE_ID=node-2
      - NODE_LOCATION=Germany
    volumes:
      - ./config-node2.json:/root/config.json
    restart: unless-stopped
 
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
 
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?set a strong admin password}

Monitoring Dashboard

Prometheus configuration:

global:
  scrape_interval: 15s
 
scrape_configs:
  - job_name: 'vpn-nodes'
    static_configs:
      - targets:
        - 'vpn-node-1:8080'
        - 'vpn-node-2:8080'

Results & Metrics

After 3 months in production:

  • Uptime: 99.93%
  • Avg Latency: 45ms (Singapore)
  • Peak Concurrent Connections: 5,000+
  • Successful Anti-Censorship: Works in 95% of tested restrictive networks
  • Zero Security Incidents

Key Takeaways

  1. Protocol Selection Matters: Reality + Vision combo is currently the best for anti-censorship
  2. Monitoring is Critical: Health checks every 30 seconds prevented many issues
  3. Geographic Distribution: Multiple node locations ensure better availability
  4. Go is Perfect for This: Concurrency model handles thousands of connections efficiently
  5. Security First: Always use strong encryption and authentication

Tech Stack

  • Core: Xray-Core with Reality protocol
  • Backend: Golang
  • Monitoring: Prometheus + Grafana
  • Infrastructure: Docker, Docker Compose
  • Protocols: VLESS, VMess, Trojan, Reality, Vision, uTLS

Conclusion

Building a production-grade VPN service requires careful attention to protocols, monitoring, and performance. The combination of Xray-Core's advanced protocols and Go's concurrency makes it possible to achieve high uptime and a great user experience.

The key is continuous monitoring and quick response to issues. With proper architecture, 99.9% uptime is achievable even when fighting against sophisticated censorship systems.


Building something similar? Feel free to reach out with questions!