Building a Distributed VPN Infrastructure with 99.9% Uptime
Deep dive into designing and deploying anti-censorship VPN protocols optimized for restrictive networks using Xray-Core and Golang
As CTO at Vexonik, I designed and deployed a distributed VPN infrastructure achieving 99.9% uptime across multiple nodes. This article covers the technical architecture, advanced anti-censorship protocols, and lessons learned building production-grade VPN systems.
The Challenge
Building a VPN service that works reliably in restrictive networks (like China's Great Firewall) requires more than just encrypting traffic. You need:
- Advanced protocols that can't be easily detected
- High availability across multiple nodes
- Low latency for good user experience
- Scalability to handle thousands of concurrent connections
- Monitoring for quick issue detection
Architecture Overview
Our infrastructure consists of:
- Control Plane: Management API and authentication
- Data Plane: Multiple VPN nodes distributed globally
- Monitoring System: Real-time health checks and alerts
- Load Balancer: Traffic distribution and failover
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│Load Balancer│
└──────┬──────┘
       │
       ├──────────┬──────────┬──────────┐
       ▼          ▼          ▼          ▼
    ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
    │Node1│    │Node2│    │Node3│    │Node4│
    └─────┘    └─────┘    └─────┘    └─────┘
Core Technology: Xray-Core
We chose Xray-Core as our foundation because it supports VLESS, VMess, and Trojan together with anti-censorship transports such as Reality and XTLS Vision.
1. Basic Xray Configuration
Here's a production-ready server-side Xray config using VLESS over TCP with Reality and the Vision flow (replace the UUID and private key placeholders):
{
  "log": {
    "loglevel": "warning"
  },
  "inbounds": [
    {
      "port": 443,
      "protocol": "vless",
      "settings": {
        "clients": [
          {
            "id": "UUID-HERE",
            "flow": "xtls-rprx-vision"
          }
        ],
        "decryption": "none"
      },
      "streamSettings": {
        "network": "tcp",
        "security": "reality",
        "realitySettings": {
          "show": false,
          "dest": "www.microsoft.com:443",
          "xver": 0,
          "serverNames": [
            "www.microsoft.com"
          ],
          "privateKey": "PRIVATE-KEY-HERE",
          "shortIds": [
            "",
            "0123456789abcdef"
          ]
        }
      }
    }
  ],
  "outbounds": [
    {
      "protocol": "freedom",
      "tag": "direct"
    }
  ]
}
2. Reality Protocol Explained
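Reality is designed to make connections hard to distinguish from ordinary HTTPS under deep packet inspection (DPI) and active probing:
- TLS Camouflage: The server completes a genuine TLS 1.3 handshake borrowed from the dest site (www.microsoft.com above), so there is no self-owned domain or certificate to fingerprint or block; on the client side, a uTLS fingerprint makes the ClientHello look like a browser's
- Probe Resistance: Connections that fail Reality authentication (key exchange plus a valid shortId) are transparently forwarded to dest, so an active prober sees the real website
- Vision Flow: "xtls-rprx-vision" pads the inner TLS handshake and passes inner TLS records through without encrypting them a second time, reducing the TLS-in-TLS patterns DPI can look for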
Building the Control API with Golang
We built the management API in Go for performance and concurrency:
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
	"time"

	"github.com/gorilla/mux"
)

type VPNNode struct {
	ID          string    `json:"id"`
	IP          string    `json:"ip"`
	Location    string    `json:"location"`
	Status      string    `json:"status"`
	Load        float64   `json:"load"`
	LastChecked time.Time `json:"last_checked"`
}

// NodeManager is used by concurrent HTTP handlers, so every access to the
// map goes through mu. Node fields are also written by the health checker;
// in production those writes should take the same lock.
type NodeManager struct {
	mu    sync.RWMutex
	nodes map[string]*VPNNode
}

func NewNodeManager() *NodeManager {
	return &NodeManager{
		nodes: make(map[string]*VPNNode),
	}
}

func (nm *NodeManager) AddNode(node *VPNNode) {
	nm.mu.Lock()
	defer nm.mu.Unlock()
	nm.nodes[node.ID] = node
}

// GetHealthyNodes returns nodes that are healthy and below 80% load.
func (nm *NodeManager) GetHealthyNodes() []*VPNNode {
	nm.mu.RLock()
	defer nm.mu.RUnlock()
	healthy := make([]*VPNNode, 0, len(nm.nodes))
	for _, node := range nm.nodes {
		if node.Status == "healthy" && node.Load < 0.8 {
			healthy = append(healthy, node)
		}
	}
	return healthy
}

// SelectBestNode returns the healthy node with the lowest load, or nil.
func (nm *NodeManager) SelectBestNode() *VPNNode {
	nodes := nm.GetHealthyNodes()
	if len(nodes) == 0 {
		return nil
	}
	bestNode := nodes[0]
	for _, node := range nodes[1:] {
		if node.Load < bestNode.Load {
			bestNode = node
		}
	}
	return bestNode
}

// writeJSON encodes v as the JSON response body and logs encoding failures.
func writeJSON(w http.ResponseWriter, v any) {
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(v); err != nil {
		log.Printf("encode response: %v", err)
	}
}

// HTTP Handlers
func (nm *NodeManager) HandleGetNodes(w http.ResponseWriter, r *http.Request) {
	nm.mu.RLock()
	nodes := make([]*VPNNode, 0, len(nm.nodes))
	for _, node := range nm.nodes {
		nodes = append(nodes, node)
	}
	nm.mu.RUnlock()
	writeJSON(w, nodes)
}

func (nm *NodeManager) HandleGetBestNode(w http.ResponseWriter, r *http.Request) {
	node := nm.SelectBestNode()
	if node == nil {
		http.Error(w, "No healthy nodes available", http.StatusServiceUnavailable)
		return
	}
	writeJSON(w, node)
}

func main() {
	manager := NewNodeManager()

	// Initialize nodes (example data)
	manager.AddNode(&VPNNode{
		ID:       "node-1",
		IP:       "192.168.1.1",
		Location: "Singapore",
		Status:   "healthy",
		Load:     0.45,
	})

	// Set up the HTTP server
	r := mux.NewRouter()
	r.HandleFunc("/api/nodes", manager.HandleGetNodes).Methods("GET")
	r.HandleFunc("/api/nodes/best", manager.HandleGetBestNode).Methods("GET")

	log.Println("API server starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", r))
}
User Management & Authentication
We implemented a secure user authentication system:
package auth

import (
	"crypto/rand"
	"fmt"
	"time"

	"github.com/golang-jwt/jwt/v5"
	"golang.org/x/crypto/bcrypt"
)

type User struct {
	ID           string    `json:"id"`
	Email        string    `json:"email"`
	PasswordHash string    `json:"-"`
	UUID         string    `json:"uuid"` // For Xray
	CreatedAt    time.Time `json:"created_at"`
	ExpiresAt    time.Time `json:"expires_at"`
	IsActive     bool      `json:"is_active"`
}

type AuthService struct {
	jwtSecret []byte
}

func NewAuthService(secret string) *AuthService {
	return &AuthService{
		jwtSecret: []byte(secret),
	}
}

// HashPassword uses bcrypt cost 14, deliberately slower than
// bcrypt.DefaultCost (10) to make brute-forcing stolen hashes more expensive.
func (as *AuthService) HashPassword(password string) (string, error) {
	bytes, err := bcrypt.GenerateFromPassword([]byte(password), 14)
	return string(bytes), err
}

func (as *AuthService) CheckPassword(password, hash string) bool {
	err := bcrypt.CompareHashAndPassword([]byte(hash), []byte(password))
	return err == nil
}

// GenerateUUID returns a random (version 4) UUID to use as the user's Xray ID.
func (as *AuthService) GenerateUUID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:]), nil
}

// GenerateJWT issues an HS256 token for userID that expires in 7 days.
func (as *AuthService) GenerateJWT(userID string) (string, error) {
	claims := jwt.MapClaims{
		"user_id": userID,
		"exp":     time.Now().Add(time.Hour * 24 * 7).Unix(),
	}
	token := jwt.NewWithClaims(jwt.SigningMethodHS256, claims)
	return token.SignedString(as.jwtSecret)
}

// ValidateJWT rejects tokens not signed with HMAC; jwt/v5 also rejects
// tokens whose "exp" claim has passed.
func (as *AuthService) ValidateJWT(tokenString string) (*jwt.Token, error) {
	return jwt.Parse(tokenString, func(token *jwt.Token) (interface{}, error) {
		if _, ok := token.Method.(*jwt.SigningMethodHMAC); !ok {
			return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"])
		}
		return as.jwtSecret, nil
	})
}
Health Monitoring System
Critical for maintaining 99.9% uptime:
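package monitor

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Gauges exported on /metrics (metric names are illustrative).
var (
	nodeLatency = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "vpn_node_latency_ms",
		Help: "Latency of the last health check in milliseconds.",
	}, []string{"node_id"})
	nodeStatus = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "vpn_node_up",
		Help: "1 if the last health check succeeded, 0 otherwise.",
	}, []string{"node_id"})
)

// HealthChecker probes each node's /health endpoint on a fixed interval.
// VPNNode is the struct from the control API, assumed to be shared with this
// package. Node fields are written from several goroutines; in production,
// guard them with the same lock the API uses.
type HealthChecker struct {
	nodes    []*VPNNode
	interval time.Duration
	client   *http.Client
}

func NewHealthChecker(nodes []*VPNNode, interval time.Duration) *HealthChecker {
	return &HealthChecker{
		nodes:    nodes,
		interval: interval,
		// One client reused for every check, with a 5s timeout per request.
		client: &http.Client{Timeout: 5 * time.Second},
	}
}

func (hc *HealthChecker) Start(ctx context.Context) {
	ticker := time.NewTicker(hc.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			hc.checkAllNodes()
		case <-ctx.Done():
			return
		}
	}
}

func (hc *HealthChecker) checkAllNodes() {
	for _, node := range hc.nodes {
		go hc.checkNode(node)
	}
}

func (hc *HealthChecker) checkNode(node *VPNNode) {
	start := time.Now()
	// Assumes node.IP includes the health endpoint's port when it is not 80.
	resp, err := hc.client.Get(fmt.Sprintf("http://%s/health", node.IP))
	latency := time.Since(start).Milliseconds()
	node.LastChecked = time.Now()
	if err == nil {
		resp.Body.Close() // release the connection back to the client
	}
	if err != nil || resp.StatusCode != http.StatusOK {
		node.Status = "unhealthy"
		nodeStatus.WithLabelValues(node.ID).Set(0)
		hc.alertNodeDown(node)
		return
	}
	node.Status = "healthy"
	hc.updateMetrics(node, latency)
}

func (hc *HealthChecker) updateMetrics(node *VPNNode, latency int64) {
	// Update Prometheus metrics
	nodeLatency.WithLabelValues(node.ID).Set(float64(latency))
	nodeStatus.WithLabelValues(node.ID).Set(1)
}

func (hc *HealthChecker) alertNodeDown(node *VPNNode) {
	// Send alert to monitoring system (Slack, PagerDuty, etc.)
	fmt.Printf("ALERT: Node %s is down!\n", node.ID)
}
Traffic Obfuscation Techniques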
1. uTLS Implementation
uTLS lets a Go client send a browser-identical TLS ClientHello, so its handshake is not fingerprinted as Go's default crypto/tls:
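package obfuscation

import (
	"net"

	utls "github.com/refraction-networking/utls"
)

// CreateObfuscatedConnection dials serverName:443 and performs a TLS
// handshake whose ClientHello mimics Chrome.
func CreateObfuscatedConnection(serverName string) (*utls.UConn, error) {
	tcpConn, err := net.Dial("tcp", net.JoinHostPort(serverName, "443"))
	if err != nil {
		return nil, err
	}
	config := &utls.Config{
		ServerName: serverName,
	}
	// HelloChrome_Auto picks the newest Chrome fingerprint bundled with uTLS.
	uconn := utls.UClient(tcpConn, config, utls.HelloChrome_Auto)
	if err := uconn.Handshake(); err != nil {
		tcpConn.Close() // don't leak the TCP connection on handshake failure
		return nil, err
	}
	return uconn, nil
}
2. Vision Flow Control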
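Vision is enabled per user with "flow": "xtls-rprx-vision", set on the server's client entry (as in the config above) and on the client's matching user entry. The client's Reality stream settings must match the server: publicKey is derived from the server's privateKey, and shortId must be one of the server's shortIds: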
{
  "streamSettings": {
    "security": "reality",
    "realitySettings": {
      "fingerprint": "chrome",
      "serverName": "www.microsoft.com",
      "publicKey": "PUBLIC-KEY",
      "shortId": "0123456789abcdef",
      "spiderX": "/"
    }
  }
}
Performance Optimization
1. Connection Pooling
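package pool

// Connection is assumed to be defined elsewhere with IsAlive() bool and
// Close() methods.

type ConnectionPool struct {
	conns   chan *Connection // idle connections; channels are goroutine-safe
	factory func() (*Connection, error)
	maxSize int
}

func NewConnectionPool(maxSize int, factory func() (*Connection, error)) *ConnectionPool {
	return &ConnectionPool{
		conns:   make(chan *Connection, maxSize),
		factory: factory,
		maxSize: maxSize,
	}
}

// Get returns a live idle connection if one exists, otherwise creates one.
func (p *ConnectionPool) Get() (*Connection, error) {
	for {
		select {
		case conn := <-p.conns:
			if conn.IsAlive() {
				return conn, nil
			}
			conn.Close() // discard the dead connection and try the next one
		default:
			return p.factory()
		}
	}
}

// Put returns conn to the pool, or closes it if the pool is full.
func (p *ConnectionPool) Put(conn *Connection) {
	select {
	case p.conns <- conn:
	default:
		conn.Close()
	}
}
2. Load Balancing Algorithm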
We use round-robin with health checks, skipping nodes that are unhealthy or above 90% load:
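// Requires the "sync" package; VPNNode is the struct from the control API.
type LoadBalancer struct {
	nodes   []*VPNNode
	current int
	mu      sync.Mutex
}

// GetNext advances round-robin through the nodes and returns the next one
// that is healthy and below 90% load, or nil if none qualifies.
func (lb *LoadBalancer) GetNext() *VPNNode {
	lb.mu.Lock()
	defer lb.mu.Unlock()
	for attempts := 0; attempts < len(lb.nodes); attempts++ {
		lb.current = (lb.current + 1) % len(lb.nodes)
		node := lb.nodes[lb.current]
		// Status and Load are written by the health checker; in production,
		// read them under the same lock the checker uses.
		if node.Status == "healthy" && node.Load < 0.9 {
			return node
		}
	}
	return nil
}
Deployment with Docker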
Production-ready Dockerfile:
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o vpn-server ./cmd/server
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
# Install Xray-core (pin a release tag instead of "latest" for reproducible builds)
ADD https://github.com/XTLS/Xray-core/releases/latest/download/Xray-linux-64.zip /tmp/
RUN unzip /tmp/Xray-linux-64.zip -d /usr/local/bin/ && \
chmod +x /usr/local/bin/xray
COPY --from=builder /app/vpn-server .
COPY config.json .
EXPOSE 443 8080
CMD ["./vpn-server"]
Docker Compose for multi-node setup:
version: '3.8'
services:
  vpn-node-1:
    build: .
    ports:
      - "443:443"
      - "8080:8080"
    environment:
      - NODE_ID=node-1
      - NODE_LOCATION=Singapore
    volumes:
      - ./config-node1.json:/root/config.json
    restart: unless-stopped
  vpn-node-2:
    build: .
    ports:
      - "444:443"
      - "8081:8080"
    environment:
      - NODE_ID=node-2
      - NODE_LOCATION=Germany
    volumes:
      - ./config-node2.json:/root/config.json
    restart: unless-stopped
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change before exposing Grafana
Monitoring Dashboard
Prometheus configuration:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'vpn-nodes'
    static_configs:
      - targets:
          - 'vpn-node-1:8080'
          - 'vpn-node-2:8080'
Results & Metrics
After 3 months in production:
- Uptime: 99.93%
- Avg Latency: 45ms (Singapore)
- Peak Concurrent Connections: 5,000+
- Successful Anti-Censorship: Works in 95% of tested restrictive networks
- Zero Security Incidents
Key Takeaways
- Protocol Selection Matters: Reality + Vision was the most resilient combination for anti-censorship in our testing
- Monitoring is Critical: Health checks every 30 seconds caught many issues before users noticed them
- Geographic Distribution: Multiple node locations ensure better availability
- Go Fits This Well: Its concurrency model (which Xray-Core itself is built on) handles thousands of connections efficiently
- Security First: Always use strong encryption and authentication
Tech Stack
- Core: Xray-Core with Reality protocol
- Backend: Golang
- Monitoring: Prometheus + Grafana
- Infrastructure: Docker, Docker Compose
- Protocols and Transports: VLESS, VMess, Trojan; Reality and XTLS Vision; uTLS fingerprinting
Conclusion
Building a production-grade VPN service requires careful attention to protocols, monitoring, and performance. The combination of Xray-Core's advanced protocols and Go's concurrency makes it possible to achieve high uptime and great user experience.
The key is continuous monitoring and quick response to issues. With proper architecture, 99.9% uptime is achievable even when fighting against sophisticated censorship systems.
Building something similar? Feel free to reach out with questions!