You’ve collected the data. You’ve structured it properly. Now comes the harder question: how do you actually run AI models on systems that can’t reach the cloud?
This isn’t a theoretical exercise. Many industrial environments—laboratories, manufacturing floors, cleanrooms—operate on air-gapped networks by design. The security and regulatory benefits are clear, but they create a fundamental constraint: any AI capability must run locally.
The Offline AI Stack
What You Need
Running AI locally requires assembling a complete inference stack:
| Component | Purpose | Options |
|---|---|---|
| Runtime | Execute model inference | ONNX Runtime, llama.cpp, TensorRT |
| Model | The actual AI | Quantized open-weight models |
| API Layer | Application integration | REST API, gRPC, direct embedding |
| Storage | Model and data persistence | Local filesystem, SQLite |
Hardware Considerations
Most industrial workstations aren’t AI-optimized. Here’s what’s typically available:
| Hardware Class | Typical Specs | AI Capability |
|---|---|---|
| Standard Workstation | 16GB RAM, Intel i7, no GPU | Small models (1-3B), CPU inference |
| Enhanced Workstation | 32GB RAM, dedicated GPU | Medium models (7-14B), GPU acceleration |
| Edge AI Device | Jetson Orin, Intel NUC with NPU | Optimized for continuous inference |
The key insight: you don’t need datacenter hardware for useful AI. Modern quantized models run surprisingly well on standard equipment.
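A rough back-of-envelope sketch makes this concrete: weight memory is roughly parameters × bits-per-weight ÷ 8, plus overhead for the KV cache and runtime buffers. The constants below are ballpark assumptions, not exact figures.

# ram_estimate.py - rough sizing check; constants are approximations only
def estimated_ram_gb(params_billions: float, bits_per_weight: float = 4.5, overhead: float = 1.3) -> float:
    # ~4.5 bits/weight approximates a 4-bit quantization format; 30% overhead covers KV cache and buffers
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"3B model @ ~4-bit: ~{estimated_ram_gb(3):.1f} GB")   # ≈ 2 GB, fits a 16GB workstation easily
print(f"7B model @ ~4-bit: ~{estimated_ram_gb(7):.1f} GB")   # ≈ 5 GB, comfortable on a 32GB machine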
Local Inference Engines
Option 1: Ollama (Simplest)
Ollama provides the easiest path to local LLM inference:
# Installation (one-time, can be done via USB transfer)
# Download installer from ollama.com on connected machine
# Transfer to air-gapped system
# Running a model
ollama run llama3.2:3b
# API access (default port 11434)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Analyze this error log..."
}'
Pros:
- Simple setup and management
- Built-in model quantization
- REST API included
- Cross-platform (Windows, Linux, macOS)
Cons:
- Limited customization
- Overhead for simple tasks
- Requires separate process
Best For: Quick deployment, prototyping, general-purpose LLM tasks
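For application code on the same machine, the REST API shown above can be called from any language. A minimal Python sketch using the requests package, with streaming disabled so a single JSON object comes back:

# ollama_client.py - minimal call against the local Ollama REST API (default port 11434)
import requests

def generate(prompt: str, model: str = "llama3.2:3b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("Analyze this error log: Temperature sensor timeout on Incubator_01"))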
Option 2: llama.cpp (Most Flexible)
Direct C++ inference with maximum control:
// Embedding in your application
#include "llama.h"

// params, ctx_params, and batch are assumed to be initialized elsewhere
// (llama.h provides default-parameter helpers and batch constructors)
llama_model* model = llama_load_model_from_file("model.gguf", params);
llama_context* ctx = llama_new_context_with_model(model, ctx_params);

// Run inference on a prepared token batch
llama_decode(ctx, batch);
Pros:
- Minimal dependencies
- Can embed directly in applications
- Highly optimized CPU inference
- Fine-grained control
Cons:
- More complex integration
- C/C++ expertise required
- Manual memory management
Best For: Embedded systems, performance-critical applications, custom integrations
Option 3: ONNX Runtime (Enterprise)
Microsoft’s cross-platform inference engine:
import onnxruntime as ort
# Load model
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
# Run inference
outputs = session.run(None, {"input": input_data})
Pros:
- Enterprise support available
- Multiple hardware accelerators (TensorRT, OpenVINO, DirectML)
- Language bindings (Python, C#, C++, Java)
- Production-proven
Cons:
- Model conversion required
- Larger deployment footprint
- Less LLM-focused
Best For: Enterprise deployments, mixed model types, hardware acceleration
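The conversion step noted under Cons typically happens on a connected development machine before transfer. A minimal sketch for a PyTorch source model; the stand-in network and input shape here are placeholders for your own trained model:

# convert_to_onnx.py - run on a connected development machine, then transfer model.onnx
import torch
import torch.nn as nn

# Stand-in network; in practice this is your trained PyTorch model loaded from a checkpoint
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
model.eval()

dummy_input = torch.randn(1, 4)  # one example with 4 input features
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
)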
Comparison Summary
| Factor | Ollama | llama.cpp | ONNX Runtime |
|---|---|---|---|
| Setup complexity | Low | Medium | Medium |
| Integration effort | Low | High | Medium |
| Performance | Good | Excellent | Excellent |
| Customization | Limited | Full | Full |
| Model support | LLMs | LLMs | All types |
| Enterprise ready | Developing | No | Yes |
Model Selection for Offline Use
Language Models
| Model | Parameters | Quantized Size | RAM Required | Capability |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | ~400MB | 1GB | Basic tasks |
| Llama 3.2-1B | 1B | ~700MB | 2GB | Simple reasoning |
| Llama 3.2-3B | 3B | ~2GB | 4GB | Good general use |
| Mistral-7B | 7B | ~4GB | 8GB | Balanced performance |
| Phi-4 | 14B | ~8GB | 12GB | Strong reasoning |
Recommendation: Start with 3B models. They offer the best balance of capability and resource requirements for most industrial workstations.
Vision Models
| Model | Size / Parameters | Input | Output | Use Case |
|---|---|---|---|---|
| YOLOv8n | 6MB | Image | Detections | Object detection |
| MobileNetV3 | 15MB | Image | Classification | Status classification |
| Moondream2 | 1.6B | Image + Text | Text | Visual Q&A |
| SmolVLM | 2B | Image + Text | Text | Visual understanding |
Specialized Models
| Task | Model Options | Size Range |
|---|---|---|
| Anomaly detection | Isolation Forest, Autoencoders | <10MB |
| Time series | Prophet, NeuralProphet | <50MB |
| Text embedding | all-MiniLM-L6-v2 | ~90MB |
| Classification | Gradient Boosting, Small NNs | <10MB |
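As one concrete example from this table, the all-MiniLM-L6-v2 embedding model can power offline semantic search over logs or procedures. A minimal sketch, assuming the sentence-transformers package and a model folder copied onto the air-gapped system ahead of time (the path and sample documents are illustrative):

# embed_search.py - offline semantic similarity with a locally stored embedding model
from sentence_transformers import SentenceTransformer, util

# Model folder transferred to the air-gapped system in advance (path is an assumption)
model = SentenceTransformer("models/all-MiniLM-L6-v2")

documents = [
    "Temperature sensor timeout on Incubator_01",
    "Barcode scanner disconnected from USB hub",
    "Centrifuge completed run without errors",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query_embedding = model.encode("incubator temperature fault", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Most similar log entry: {documents[best]}")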
Architecture Patterns
Pattern 1: Embedded Model
Model runs directly in your application process:
┌─────────────────────────────────────┐
│          Your Application           │
│   ┌─────────────────────────────┐   │
│   │      Inference Engine       │   │
│   │   ┌─────────────────────┐   │   │
│   │   │        Model        │   │   │
│   │   └─────────────────────┘   │   │
│   └─────────────────────────────┘   │
└─────────────────────────────────────┘
When to use:
- Single application needs AI
- Minimal latency required
- Simple deployment preferred
Implementation:
// C# example with ONNX Runtime
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public class EmbeddedInference
{
    private InferenceSession _session;

    public void Initialize(string modelPath)
    {
        _session = new InferenceSession(modelPath);
    }

    public float[] Predict(float[] input)
    {
        var inputTensor = new DenseTensor<float>(input, new[] { 1, input.Length });
        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input", inputTensor)
        };

        using var results = _session.Run(inputs);
        return results.First().AsEnumerable<float>().ToArray();
    }
}
Pattern 2: Local Service
Separate inference service on the same machine:
┌─────────────────────┐     ┌─────────────────────┐
│  Your Application   │────▶│  Inference Service  │
│                     │◀────│  (localhost:8080)   │
└─────────────────────┘     │  ┌───────────────┐  │
                            │  │   Model(s)    │  │
                            │  └───────────────┘  │
                            └─────────────────────┘
When to use:
- Multiple applications share models
- Need to update models independently
- Want to isolate AI resource usage
Implementation:
# FastAPI inference service
from fastapi import FastAPI
from pydantic import BaseModel
import ollama

app = FastAPI()

class AnalyzeRequest(BaseModel):
    log_entry: str

class ClassifyRequest(BaseModel):
    error_message: str

def parse_category(text: str) -> str:
    """Pick the first known category named in the model output."""
    lowered = text.lower()
    return next((c for c in ("hardware", "software", "user_error") if c in lowered), "unknown")

@app.post("/analyze")
async def analyze(request: AnalyzeRequest):
    response = ollama.generate(
        model="llama3.2:3b",
        prompt=f"Analyze this log entry: {request.log_entry}"
    )
    return {"analysis": response["response"]}

@app.post("/classify")
async def classify(request: ClassifyRequest):
    # Use smaller specialized model for classification
    response = ollama.generate(
        model="qwen3:0.6b",
        prompt=f"Classify this error: {request.error_message}\nCategories: hardware, software, user_error, unknown"
    )
    return {"category": parse_category(response["response"])}
Pattern 3: Edge Gateway
Dedicated AI device serves multiple workstations:
┌──────────────┐
│ Workstation 1│───┐
└──────────────┘   │     ┌─────────────────────┐
                   ├────▶│   Edge AI Server    │
┌──────────────┐   │     │  (Jetson/NUC/GPU)   │
│ Workstation 2│───┤     │  ┌───────────────┐  │
└──────────────┘   │     │  │    Models     │  │
                   │     │  └───────────────┘  │
┌──────────────┐   │     └─────────────────────┘
│ Workstation 3│───┘
└──────────────┘
            Local Network (Air-gapped)
When to use:
- Workstations lack GPU/resources
- Centralized model management needed
- Higher-capability models required
Implementation considerations (see the sketch after this list):
- Use message queue for async requests
- Implement request prioritization
- Handle service unavailability gracefully
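A minimal client-side sketch of those considerations, assuming the requests package and a hypothetical edge endpoint at http://edge-ai.local:8080/analyze. The queueing, priorities, and fallback behavior here are illustrative, not prescriptive:

# edge_client.py - illustrative workstation-side client for an edge AI gateway
import itertools
import queue
import requests

EDGE_URL = "http://edge-ai.local:8080/analyze"  # hypothetical edge gateway endpoint

class EdgeClient:
    def __init__(self, timeout_s: float = 5.0):
        self.timeout_s = timeout_s
        self.pending = queue.PriorityQueue()   # entries: (priority, sequence, payload)
        self._sequence = itertools.count()     # tie-breaker so payload dicts are never compared

    def submit(self, payload: dict, priority: int = 10):
        """Queue a request; lower numbers are served first (0 = urgent, 10 = routine)."""
        self.pending.put((priority, next(self._sequence), payload))

    def process_pending(self) -> list[dict]:
        """Send queued requests to the edge server, degrading gracefully if it is unreachable."""
        results = []
        while not self.pending.empty():
            _, _, payload = self.pending.get()
            try:
                resp = requests.post(EDGE_URL, json=payload, timeout=self.timeout_s)
                resp.raise_for_status()
                results.append(resp.json())
            except requests.RequestException:
                # Edge server unavailable: hand back to a local rule-based path instead of blocking
                results.append({"status": "degraded", "payload": payload})
        return results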
Practical Implementation Examples
Example 1: Log Analysis Assistant
Goal: Automatically analyze error logs and suggest solutions
# log_analyzer.py
import json
import ollama
from dataclasses import dataclass

@dataclass
class LogAnalysis:
    severity: str
    category: str
    likely_cause: str
    suggested_action: str

def parse_json(text: str) -> dict:
    """Extract the first JSON object from the model output."""
    start, end = text.find("{"), text.rfind("}") + 1
    return json.loads(text[start:end])

class LogAnalyzer:
    def __init__(self, model: str = "llama3.2:3b"):
        self.model = model
        self.system_prompt = """You are a log analysis assistant for laboratory automation software.
Analyze log entries and provide:
1. Severity (critical, warning, info)
2. Category (hardware, software, network, user)
3. Likely cause
4. Suggested action
Respond in JSON format with keys: severity, category, likely_cause, suggested_action."""

    def analyze(self, log_entry: str) -> LogAnalysis:
        response = ollama.generate(
            model=self.model,
            system=self.system_prompt,
            prompt=f"Analyze this log entry:\n{log_entry}"
        )
        # Parse the JSON response from the model
        result = parse_json(response["response"])
        return LogAnalysis(**result)

# Usage
analyzer = LogAnalyzer()
analysis = analyzer.analyze("2025-12-26 10:23:45 ERROR: Temperature sensor timeout on Incubator_01")
print(f"Severity: {analysis.severity}")
print(f"Suggested: {analysis.suggested_action}")
Example 2: Smart Autocomplete
Goal: Suggest parameter values based on context
# autocomplete.py
import ollama

def parse_suggestions(text: str) -> list[str]:
    """Split the model's comma-separated reply into clean suggestions."""
    return [s.strip() for s in text.split(",") if s.strip()]

class SmartAutocomplete:
    def __init__(self):
        self.model = "qwen3:0.6b"  # Small model for speed
        self.history_db = HistoryDatabase()  # your existing store of past parameter choices

    def suggest_parameters(self, context: dict) -> list[str]:
        # First, check historical patterns
        historical = self.history_db.get_similar(context)
        if historical:
            return historical[:5]  # Return top 5 historical matches

        # Fall back to LLM generation
        prompt = f"""Given this experimental context:
Sample type: {context.get('sample_type')}
Equipment: {context.get('equipment')}
Previous step: {context.get('previous_step')}

Suggest appropriate values for: {context.get('parameter_name')}
Return as comma-separated list."""
        response = ollama.generate(model=self.model, prompt=prompt)
        return parse_suggestions(response["response"])
Example 3: Anomaly Detection Service
Goal: Detect unusual patterns in equipment telemetry
# anomaly_detector.py
import numpy as np
from sklearn.ensemble import IsolationForest
import joblib

class AnomalyDetector:
    def __init__(self, model_path: str = "models/anomaly_detector.joblib"):
        self.model_path = model_path
        self.model = joblib.load(model_path)
        self.feature_names = ["temperature", "pressure", "vibration", "cycle_time"]

    def detect(self, readings: dict) -> dict:
        # Prepare features in a fixed order, defaulting missing readings to 0
        features = np.array([[
            readings.get(f, 0) for f in self.feature_names
        ]])

        # Predict (-1 for anomaly, 1 for normal)
        prediction = self.model.predict(features)[0]
        score = self.model.score_samples(features)[0]

        return {
            "is_anomaly": prediction == -1,
            "confidence": abs(score),
            "readings": readings
        }

    def retrain(self, historical_data: np.ndarray):
        """Periodic retraining with new data"""
        self.model.fit(historical_data)
        joblib.dump(self.model, self.model_path)

# Usage in a background service (telemetry_stream and alert_operator
# come from your own monitoring code)
detector = AnomalyDetector()
for reading in telemetry_stream:
    result = detector.detect(reading)
    if result["is_anomaly"]:
        alert_operator(result)
Example 4: Visual Equipment Status
Goal: Read equipment displays using vision model
# display_reader.py
import base64
import ollama

class DisplayReader:
    def __init__(self):
        self.model = "moondream:1.8b"  # Small VLM

    def read_display(self, image_path: str) -> dict:
        # Encode image as base64 for the Ollama API
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode()

        response = ollama.generate(
            model=self.model,
            prompt="Read all text and numbers visible on this equipment display. Report any error indicators or warnings.",
            images=[image_data]
        )

        return {
            "raw_text": response["response"],
            "values": self.parse_values(response["response"]),
            "warnings": self.detect_warnings(response["response"])
        }

    def parse_values(self, text: str) -> dict:
        # Extract numeric values with units
        # Implementation depends on display format
        pass

    def detect_warnings(self, text: str) -> list:
        warning_keywords = ["error", "warning", "fault", "alarm"]
        return [w for w in warning_keywords if w.lower() in text.lower()]
Model Updates in Air-Gapped Environments
The Update Challenge
Models need occasional updates, but you can’t pull from the internet. Solutions:
Approach 1: USB Transfer
Connected Machine              Air-Gapped System
┌─────────────┐                ┌─────────────┐
│  Download   │   USB Drive    │   Verify    │
│ model from  │───────────────▶│  checksum   │
│   source    │                │   Install   │
└─────────────┘                └─────────────┘
Process:
- Download model on connected machine
- Generate checksum (SHA-256)
- Transfer via approved media
- Verify checksum on target
- Install and validate
Approach 2: Scheduled Sync
# model_updater.py
import hashlib
import json
import shutil
from pathlib import Path

class SecurityError(Exception):
    """Raised when a model file fails checksum verification."""

class ModelUpdater:
    def __init__(self, model_dir: Path, manifest_path: Path):
        self.model_dir = model_dir
        self.manifest = self.load_manifest(manifest_path)

    def load_manifest(self, manifest_path: Path) -> dict:
        """Manifest assumed to be a JSON mapping of filename -> expected SHA-256."""
        return json.loads(manifest_path.read_text())

    def checksum(self, path: Path) -> str:
        """SHA-256 of a model file, read in chunks (model files can be several GB)."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def check_for_updates(self, update_source: Path) -> list:
        """Check USB/network share for new models"""
        updates = []
        for model_file in update_source.glob("*.gguf"):
            if self.needs_update(model_file):
                updates.append(model_file)
        return updates

    def needs_update(self, new_model: Path) -> bool:
        """Compare checksums to determine if update needed"""
        current = self.model_dir / new_model.name
        if not current.exists():
            return True
        return self.checksum(new_model) != self.checksum(current)

    def install_update(self, model_file: Path) -> bool:
        """Safely install model update"""
        # Verify checksum against manifest
        expected = self.manifest.get(model_file.name)
        actual = self.checksum(model_file)
        if expected != actual:
            raise SecurityError(f"Checksum mismatch for {model_file.name}")

        # Backup current model
        current = self.model_dir / model_file.name
        if current.exists():
            current.rename(current.with_suffix(".backup"))

        # Copy new model
        shutil.copy(model_file, self.model_dir)

        # Validate new model works (validate_model and rollback are your own
        # smoke-test and restore hooks, e.g. load the model and run one prompt)
        if not self.validate_model(current):
            self.rollback(current)
            return False

        return True
Approach 3: Model Versioning
Track model versions like software versions:
| Model | Version | Checksum | Validated | Active |
|---|---|---|---|---|
| llama3.2-3b | 1.0.0 | abc123… | 2025-12-01 | Yes |
| llama3.2-3b | 1.1.0 | def456… | 2025-12-15 | No |
| anomaly-detector | 2.3.1 | ghi789… | 2025-12-20 | Yes |
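A minimal sketch of how such a version table might be represented on disk; the field names and JSON layout here are illustrative assumptions, not a standard format:

# model_registry.py - illustrative version tracking for locally deployed models
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModelVersion:
    name: str
    version: str
    checksum: str
    validated: str   # date of last validation, empty string if never validated
    active: bool

def load_registry(path: Path) -> list[ModelVersion]:
    """Registry assumed to be a JSON list of version records."""
    return [ModelVersion(**entry) for entry in json.loads(path.read_text())]

def active_model(registry: list[ModelVersion], name: str) -> ModelVersion:
    """Return the active, validated version of a model, or raise if none exists."""
    for entry in registry:
        if entry.name == name and entry.active and entry.validated:
            return entry
    raise LookupError(f"No active validated version of {name}")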
Performance Optimization
Inference Speed Tips
| Technique | Impact | Implementation |
|---|---|---|
| Quantization | 2-4x speedup | Use Q4_K_M or Q5_K_M formats |
| Batch processing | 2-10x throughput | Group similar requests |
| Caching | Instant for repeats | Cache common queries |
| Prompt optimization | 20-50% speedup | Shorter, focused prompts |
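For the caching row in particular, even a small in-memory cache keyed on the prompt eliminates repeated inference for identical queries. A minimal sketch, assuming the Ollama Python client used elsewhere in this post:

# response_cache.py - prompt-level caching around local inference
import hashlib
import ollama

_cache: dict[str, str] = {}

def cached_generate(model: str, prompt: str) -> str:
    """Return a cached response for identical (model, prompt) pairs, otherwise run inference."""
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = ollama.generate(model=model, prompt=prompt)["response"]
    return _cache[key]

# The second identical query returns instantly from the cache
print(cached_generate("llama3.2:3b", "Summarize: Temperature sensor timeout on Incubator_01"))
print(cached_generate("llama3.2:3b", "Summarize: Temperature sensor timeout on Incubator_01"))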
Memory Management
# Efficient model loading: keep at most N models in memory, evicting the least recently used
class ModelManager:
    def __init__(self, max_loaded: int = 2):
        self.loaded_models = {}
        self.max_loaded = max_loaded
        self.usage_order = []

    def get_model(self, model_name: str):
        if model_name in self.loaded_models:
            # Move to end of usage order (most recently used)
            self.usage_order.remove(model_name)
            self.usage_order.append(model_name)
            return self.loaded_models[model_name]

        # Evict least recently used if at capacity
        if len(self.loaded_models) >= self.max_loaded:
            evict = self.usage_order.pop(0)
            del self.loaded_models[evict]

        # Load new model (load_model wraps whatever runtime you use, e.g. llama.cpp or ONNX Runtime)
        model = self.load_model(model_name)
        self.loaded_models[model_name] = model
        self.usage_order.append(model_name)
        return model
Fallback Strategies
AI should enhance, not break, your application:
import logging

log = logging.getLogger(__name__)

# Known error codes mapped to canned analyses (populate from your own support knowledge base)
KNOWN_ERRORS: dict = {}

class ResilientAIService:
    def __init__(self):
        # check_ai_status() and ai_analyze() wrap whichever inference engine you deploy
        self.ai_available = self.check_ai_status()

    def analyze_with_fallback(self, data: dict) -> dict:
        if self.ai_available:
            try:
                return self.ai_analyze(data)
            except Exception as e:
                log.warning(f"AI analysis failed: {e}")
                return self.rule_based_analyze(data)
        else:
            return self.rule_based_analyze(data)

    def rule_based_analyze(self, data: dict) -> dict:
        """Deterministic fallback when AI unavailable"""
        # Simple rule-based logic
        if data.get("error_code") in KNOWN_ERRORS:
            return KNOWN_ERRORS[data["error_code"]]
        return {"status": "unknown", "suggestion": "Contact support"}
Closing Thoughts
Offline AI is not a limitation—it’s a different deployment model. The capabilities are real: modern quantized models running on standard hardware can provide genuine value for log analysis, anomaly detection, smart suggestions, and visual understanding.
The key is matching your AI ambitions to your infrastructure reality:
- Start with small models (1-3B parameters)
- Use appropriate inference engines for your platform
- Build fallback paths for when AI is unavailable
- Plan for model updates from the beginning
The models will keep improving. The deployment patterns you establish now will serve you well as more capable models become available in smaller sizes.
Implementation details will vary based on your specific platform, but the architectural patterns described here apply broadly to air-gapped industrial environments.