Working with LLMs
Deploy and run large language models on the Jetson Orin Nano Field Kit
Introduction
The Jetson Orin Nano Field Kit provides powerful capabilities for running large language models (LLMs) locally. With its GPU acceleration and optimized software stack, you can deploy models ranging from small chatbots to sophisticated vision-language models.
Model Selection
RAM Constraints
Important: The Jetson Orin Nano Field Kit has 8GB of total RAM, of which only ~6GB is practically available for inference after accounting for system overhead.
Recommendation: Keep an LLM's memory usage to roughly 3-4GB so inference stays stable and other processes have room to run. This means:
- Stick to smaller models (1-3B parameters) for best results
- Use quantized models (4-bit, 8-bit) to reduce memory footprint
- Monitor memory usage during inference to avoid OOM errors (see the check after this list)
- Avoid running multiple large models simultaneously
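On Jetson you can watch memory with tegrastats or free -h. The sketch below is a minimal pre-flight check from Python; it assumes psutil is installed (pip3 install psutil), which is not part of the pre-installed stack.
import psutil

def has_memory_headroom(required_gb=4.0):
    """Return True if enough RAM is free to load a model of the given size.
    The Orin Nano uses unified memory, so CPU and GPU share the same pool."""
    available_gb = psutil.virtual_memory().available / 1024**3
    print(f"Available RAM: {available_gb:.1f} GB")
    return available_gb >= required_gb

if not has_memory_headroom(4.0):
    print("Not enough free memory; pick a smaller or more aggressively quantized model.")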
Choosing the Right Model
Consider these factors when selecting an LLM:
- Memory Footprint: Must fit within the 3-4GB RAM constraint (see the rough estimate after this list)
- Quantization: Use quantized models (4-bit, 8-bit) for better performance and lower memory usage
- Use Case: Chat, code generation, vision-language, or specialized tasks
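A quick way to sanity-check the first two factors: weight memory is roughly parameter count times bytes per weight, plus some overhead for the KV cache and activations. The helper below is a rough rule of thumb, not a measurement.
def estimate_model_memory_gb(params_billion, bits_per_weight, overhead=1.2):
    """Weights-only estimate plus ~20% for KV cache and activations."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight * overhead / 1024**3

print(f"3B @ 4-bit: {estimate_model_memory_gb(3, 4):.1f} GB")   # ~1.7 GB, comfortable
print(f"7B @ 4-bit: {estimate_model_memory_gb(7, 4):.1f} GB")   # ~3.9 GB, at the limit
print(f"3B @ FP16:  {estimate_model_memory_gb(3, 16):.1f} GB")  # ~6.7 GB, too large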
Installation
Pre-installed LLM Infrastructure
The Field Kit comes with LLM infrastructure pre-installed and ready to use:
Ollama - Local LLM server with pre-installed models:
# Check if Ollama is installed
which ollama
# Start Ollama service
ollama serve
# List pre-installed models
ollama list
# Pre-installed models include:
# - qwen3:1.7B (1.7B parameters, optimized for the Field Kit)
# - ministral-3:3B (3B parameters, efficient and capable)
# Use a pre-installed model
ollama run qwen3:1.7b "Hello, how are you?"
# Pull additional models if needed (be mindful of RAM constraints)
ollama pull <model-name>
llama.cpp - Pre-built with CUDA support:
# Check installation
which llama-cli
# llama.cpp binaries are in ~/Workspace/llama.cpp/build/bin
# Pre-installed models work with llama.cpp as well
PyTorch - Jetson-optimized version:
# PyTorch 2.5.0 for Jetson is pre-installed
python3 -c "import torch; print(torch.__version__)"
# Note: Uses numpy<2 for compatibility
Install Additional Dependencies
# Install transformers and related libraries
pip3 install transformers accelerate bitsandbytes
# Install additional utilities
pip3 install sentencepiece protobuf
Open WebUI (Pre-installed and Configured)
The Field Kit includes Open WebUI pre-installed and configured for a web-based chat interface:
# Open WebUI is pre-configured and ready to use
# Check if it's running
docker ps | grep open-webui
# Start Open WebUI (if not already running)
cd jetson-orin-nano-field-kit/system/open-webui
docker compose up -d
# Access at http://<JETSON_IP>:8080
# The web interface is already configured to work with Ollama and the pre-installed models
Open WebUI provides a user-friendly web interface for chatting with the pre-installed models (qwen3:1.7B, ministral-3:3B) and any additional models you install via Ollama.
Using Hugging Face Transformers
Basic Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16, # Use FP16 for better performance
device_map="auto"
)
# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.7,
do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Using Quantized Models
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
# Load quantized model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=quantization_config,
device_map="auto"
)
# Use the model
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=150,
temperature=0.7
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Using llama.cpp
Command Line Usage
# llama.cpp binaries are pre-installed
cd ~/Workspace/llama.cpp/build/bin
# Download a model (GGUF format)
# Example: Download from Hugging Face
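# One option (assumes the huggingface_hub CLI; adjust repo and filename as needed):
pip3 install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models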
# Run inference
./llama-cli -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
-p "The future of AI is" \
-n 100 \
-ngl 35 # Offload 35 layers to GPU
Python Usage
from llama_cpp import Llama
# Load model (download GGUF format from Hugging Face)
llm = Llama(
model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
n_ctx=2048, # Context window
n_threads=4, # CPU threads
n_gpu_layers=35, # Layers to offload to GPU
verbose=False
)
# Generate text
prompt = "The future of AI is"
response = llm(
prompt,
max_tokens=100,
temperature=0.7,
top_p=0.9,
echo=False
)
print(response['choices'][0]['text'])
Using Ollama (Recommended)
Ollama provides the simplest interface for running LLMs on the Field Kit, which comes with qwen3:1.7B and ministral-3:3B pre-installed:
# List available models (includes pre-installed models)
ollama list
# Run inference with pre-installed model
ollama run qwen3:1.7b "The future of AI is"
# Or use ministral-3:3B
ollama run ministral-3:3b "Explain quantum computing"
# Or use the API
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:1.7b",
"prompt": "The future of AI is",
"stream": false
}'
Python Client:
import requests
response = requests.post('http://localhost:11434/api/generate', json={
'model': 'qwen3:1.7b',
'prompt': 'The future of AI is',
'stream': False
})
print(response.json()['response'])
Chat Interface
from llama_cpp import Llama
llm = Llama(
model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
n_ctx=2048,
n_gpu_layers=35,
verbose=False
)
def chat(prompt, system_prompt="You are a helpful assistant."):
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
]
response = llm.create_chat_completion(
messages=messages,
max_tokens=200,
temperature=0.7
)
return response['choices'][0]['message']['content']
# Use the chat function
while True:
user_input = input("You: ")
if user_input.lower() in ['quit', 'exit']:
break
response = chat(user_input)
print(f"Assistant: {response}")Vision-Language Models
LLaVA Setup
# LLaVA is installed from the official repository rather than from PyPI:
git clone https://github.com/haotian-liu/LLaVA.git && cd LLaVA && pip3 install -e .
Using LLaVA
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
import torch
# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
model_path=model_path,
model_base=None,
model_name=get_model_name_from_path(model_path),
load_4bit=True # Use 4-bit quantization
)
# Prepare image and prompt
image_file = "/path/to/image.jpg"
prompt = "What do you see in this image?"
# Run inference
args = type('Args', (), {
"model_path": model_path,
"model_base": None,
"model_name": get_model_name_from_path(model_path),
"query": prompt,
"conv_mode": None,
"image_file": image_file,
"sep": ",",
"temperature": 0.2,
"top_p": None,
"num_beams": 1,
"max_new_tokens": 512
})()
response = eval_model(args)
print(response)
Building a Chat Application
Simple Chat Interface
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
class ChatBot:
def __init__(self, model_name="microsoft/phi-2"):
print(f"Loading model: {model_name}")
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
self.conversation_history = []
def chat(self, user_input, max_tokens=150, temperature=0.7):
# Add user message to history
self.conversation_history.append(f"User: {user_input}")
# Build context from history
context = "\n".join(self.conversation_history[-5:]) # Last 5 exchanges
prompt = f"{context}\nAssistant:"
# Tokenize
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
# Generate
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=temperature,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id
)
# Decode response
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract only the new response
response = response.split("Assistant:")[-1].strip()
# Add to history
self.conversation_history.append(f"Assistant: {response}")
return response
def reset(self):
self.conversation_history = []
# Usage
bot = ChatBot()
while True:
user_input = input("\nYou: ")
    if user_input.lower() in ['quit', 'exit', 'reset']:
        if user_input.lower() == 'reset':
            bot.reset()
            print("Conversation reset.")
            continue
        break
response = bot.chat(user_input)
print(f"Assistant: {response}")Integration with Voice Assistant
Combine LLMs with voice capabilities:
import whisper
import pyttsx3
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import sounddevice as sd
import numpy as np
import queue
class VoiceLLMAssistant:
def __init__(self, llm_model="microsoft/phi-2"):
# Initialize Whisper for speech-to-text
print("Loading Whisper...")
self.whisper_model = whisper.load_model("base")
# Initialize TTS
self.tts_engine = pyttsx3.init()
self.tts_engine.setProperty('rate', 150)
# Initialize LLM
print(f"Loading LLM: {llm_model}")
self.tokenizer = AutoTokenizer.from_pretrained(llm_model)
self.llm = AutoModelForCausalLM.from_pretrained(
llm_model,
torch_dtype=torch.float16,
device_map="auto"
)
self.audio_queue = queue.Queue()
def listen(self, duration=5):
"""Listen and transcribe"""
audio_data = []
sample_rate = 16000
with sd.InputStream(
callback=lambda indata, *args: self.audio_queue.put(indata.copy()),
channels=1,
samplerate=sample_rate
):
sd.sleep(int(duration * 1000))
while not self.audio_queue.empty():
audio_data.append(self.audio_queue.get())
if not audio_data:
return None
        audio_array = np.concatenate(audio_data).flatten().astype(np.float32)  # Whisper expects 1-D float32 audio
result = self.whisper_model.transcribe(audio_array, fp16=False)
return result['text']
def generate_response(self, prompt):
"""Generate LLM response"""
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.llm.device)
with torch.no_grad():
outputs = self.llm.generate(
**inputs,
max_new_tokens=100,
temperature=0.7,
do_sample=True
)
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
return response.split(prompt)[-1].strip()
def speak(self, text):
"""Text-to-speech"""
self.tts_engine.say(text)
self.tts_engine.runAndWait()
def run(self):
"""Main loop"""
print("Voice LLM Assistant ready. Say 'quit' to exit.")
while True:
# Listen
user_input = self.listen(duration=3)
if user_input:
print(f"You: {user_input}")
if "quit" in user_input.lower():
self.speak("Goodbye!")
break
# Generate response
response = self.generate_response(user_input)
print(f"Assistant: {response}")
# Speak response
self.speak(response)
# Usage
assistant = VoiceLLMAssistant()
assistant.run()
Performance Optimization
Model Optimization Tips
- Use Quantization: 4-bit or 8-bit quantization reduces memory and improves speed
- Batch Processing: Process multiple prompts together when possible (see the sketch after this list)
- Context Window: Limit context size to what you need
- GPU Offloading: Offload as many layers to GPU as possible
- Model Caching: Keep models in memory between requests
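As a sketch of the batch-processing tip, here is batched generation with Transformers, reusing the same microsoft/phi-2 model as the earlier examples (any small causal LM works; left padding and an explicit pad token are the important parts):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "microsoft/phi-2"  # substitute the model you are actually running
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Causal LMs need left padding and a pad token for batched generation
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = ["The future of AI is", "Edge computing enables"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)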
Memory Management
import torch
import gc
def clear_cache():
"""Clear GPU cache"""
torch.cuda.empty_cache()
gc.collect()
# Use after model operations
response = model.generate(...)
clear_cache()
Model Serving
Simple API Server
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
app = Flask(__name__)
# Load model once at startup
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
@app.route('/generate', methods=['POST'])
def generate():
data = request.json
prompt = data.get('prompt', '')
max_tokens = data.get('max_tokens', 100)
temperature = data.get('temperature', 0.7)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=temperature,
do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
return jsonify({'response': response})
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Next Steps
- Combine with Voice Assistant for conversational AI
- Integrate with Computer Vision for multimodal AI
- Check Troubleshooting for common LLM issues
- Review Good Guidance for optimization tips