Working with LLMs

Deploy and run large language models on the Jetson Orin Nano Field Kit

Introduction

The Jetson Orin Nano Field Kit provides powerful capabilities for running large language models (LLMs) locally. With its GPU acceleration and optimized software stack, you can deploy models ranging from small chatbots to sophisticated vision-language models.

Model Selection

RAM Constraints

Important: The Jetson Orin Nano Field Kit has 8GB of total RAM, of which only ~6GB is practically available for inference after accounting for system overhead.

Recommendation: Keep LLM memory usage to 3-4GB at most to ensure stable operation and leave room for other processes. This means:

  • Stick to smaller models (1-3B parameters) for best results
  • Use quantized models (4-bit, 8-bit) to reduce memory footprint
  • Monitor memory usage during inference to avoid OOM errors (see the memory-check sketch after this list)
  • Avoid running multiple large models simultaneously
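
A minimal sketch for checking free memory from Python before loading a model (this assumes the psutil package is installed; on Jetson you can also watch tegrastats or jtop):

import psutil

# Report total and available system memory in GB
mem = psutil.virtual_memory()
print(f"Total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")

# The 4 GB threshold here is an illustrative cutoff, not a hard limit
if mem.available < 4e9:
    print("Warning: less than 4 GB free; loading a large model may trigger OOM")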

Choosing the Right Model

Consider these factors when selecting an LLM:

  • Memory Footprint: Must fit within the 3-4GB RAM constraint (a quick sizing sketch follows this list)
  • Quantization: Use quantized models (4-bit, 8-bit) for better performance and lower memory usage
  • Use Case: Chat, code generation, vision-language, or specialized tasks
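
As a rough rule of thumb, a model's weights need about (parameter count x bits per weight / 8) bytes, plus overhead for the KV cache and activations. A small sketch of that arithmetic (the overhead factor is a loose assumption, not a measured value):

def estimate_model_memory_gb(params_billions, bits_per_weight, overhead_factor=1.2):
    """Rough estimate of inference memory for a quantized model."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead_factor / 1e9

# Example: 3B and 7B models at common quantization levels
print(f"3B @ 4-bit: ~{estimate_model_memory_gb(3, 4):.1f} GB")
print(f"3B @ 8-bit: ~{estimate_model_memory_gb(3, 8):.1f} GB")
print(f"7B @ 4-bit: ~{estimate_model_memory_gb(7, 4):.1f} GB")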

Installation

Pre-installed LLM Infrastructure

The Field Kit comes with LLM infrastructure pre-installed and ready to use:

Ollama - Local LLM server with pre-installed models:

# Check if Ollama is installed
which ollama

# Start Ollama service
ollama serve

# List pre-installed models
ollama list

# Pre-installed models include:
# - qwen3:1.7B (1.7B parameters, optimized for the Field Kit)
# - ministral-3:3B (3B parameters, efficient and capable)

# Use a pre-installed model
ollama run qwen3:1.7b "Hello, how are you?"

# Pull additional models if needed (be mindful of RAM constraints)
ollama pull <model-name>

llama.cpp - Pre-built with CUDA support:

# Check installation
which llama-cli

# llama.cpp binaries are in ~/Workspace/llama.cpp/build/bin

# Pre-installed models work with llama.cpp as well

PyTorch - Jetson-optimized version:

# PyTorch 2.5.0 for Jetson is pre-installed
python3 -c "import torch; print(torch.__version__)"
# Note: Uses numpy<2 for compatibility
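
To confirm that the GPU is actually visible to PyTorch, a quick check from Python:

import torch

# Verify the Jetson-optimized build can see the Orin's integrated GPU
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))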

Install Additional Dependencies

# Install transformers and related libraries
pip3 install transformers accelerate bitsandbytes

# Install additional utilities
pip3 install sentencepiece protobuf

Open WebUI (Pre-installed and Configured)

The Field Kit includes Open WebUI pre-installed and configured for a web-based chat interface:

# Open WebUI is pre-configured and ready to use
# Check if it's running
docker ps | grep open-webui

# Start Open WebUI (if not already running)
cd jetson-orin-nano-field-kit/system/open-webui
docker compose up -d

# Access at http://<JETSON_IP>:8080
# The web interface is already configured to work with Ollama and the pre-installed models

Open WebUI provides a user-friendly web interface for chatting with the pre-installed models (qwen3:1.7B, ministral-3:3B) and any additional models you install via Ollama.

Using Hugging Face Transformers

Basic Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use FP16 for better performance
    device_map="auto"
)

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Using Quantized Models

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load quantized model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

# Use the model
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True  # temperature only takes effect when sampling is enabled
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Using llama.cpp

Command Line Usage

# llama.cpp binaries are pre-installed
cd ~/Workspace/llama.cpp/build/bin

# Download a model (GGUF format)
# Example: Download from Hugging Face

# Run inference
./llama-cli -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "The future of AI is" \
  -n 100 \
  -ngl 35  # Offload 35 layers to GPU
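
These llama.cpp examples assume a GGUF file under ./models. One way to fetch one is the huggingface_hub Python API; the repository ID and filename below are assumptions chosen to match the path used in these examples:

from huggingface_hub import hf_hub_download

# Download a 4-bit quantized Llama-2 chat model in GGUF format
# (repo_id and filename are example values, not pre-installed assets)
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Model saved to: {model_path}")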

Python Usage

# Requires the llama-cpp-python bindings (pip3 install llama-cpp-python)
from llama_cpp import Llama

# Load model (download GGUF format from Hugging Face)
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,  # Context window
    n_threads=4,  # CPU threads
    n_gpu_layers=35,  # Layers to offload to GPU
    verbose=False
)

# Generate text
prompt = "The future of AI is"
response = llm(
    prompt,
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
    echo=False
)

print(response['choices'][0]['text'])

Using Ollama

Ollama provides an easier interface for running LLMs. The Field Kit comes with qwen3:1.7B and ministral-3:3B pre-installed:

# List available models (includes pre-installed models)
ollama list

# Run inference with pre-installed model
ollama run qwen3:1.7b "The future of AI is"

# Or use ministral-3:3B
ollama run ministral-3:3b "Explain quantum computing"

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:1.7b",
  "prompt": "The future of AI is",
  "stream": false
}'

Python Client:

import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'qwen3:1.7b',
    'prompt': 'The future of AI is',
    'stream': False
})

print(response.json()['response'])
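
The same endpoint can also stream tokens as they are generated: with "stream": true, Ollama returns one JSON object per line until a final object with "done": true. A small sketch:

import json
import requests

# Stream tokens from Ollama and print them as they arrive
with requests.post('http://localhost:11434/api/generate', json={
    'model': 'qwen3:1.7b',
    'prompt': 'The future of AI is',
    'stream': True
}, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get('response', ''), end='', flush=True)
        if chunk.get('done'):
            break
print()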

Chat Interface

from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=35,
    verbose=False
)

def chat(prompt, system_prompt="You are a helpful assistant."):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
    
    response = llm.create_chat_completion(
        messages=messages,
        max_tokens=200,
        temperature=0.7
    )
    
    return response['choices'][0]['message']['content']

# Use the chat function
while True:
    user_input = input("You: ")
    if user_input.lower() in ['quit', 'exit']:
        break
    
    response = chat(user_input)
    print(f"Assistant: {response}")

Vision-Language Models

LLaVA Setup

# Install LLaVA from the official repository so the llava.* modules below are available
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip3 install -e .

Using LLaVA

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
import torch

# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    load_4bit=True  # Use 4-bit quantization
)

# Prepare image and prompt
image_file = "/path/to/image.jpg"
prompt = "What do you see in this image?"

# Run inference
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0.2,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

response = eval_model(args)
print(response)

Building a Chat Application

Simple Chat Interface

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

class ChatBot:
    def __init__(self, model_name="microsoft/phi-2"):
        print(f"Loading model: {model_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.conversation_history = []
    
    def chat(self, user_input, max_tokens=150, temperature=0.7):
        # Add user message to history
        self.conversation_history.append(f"User: {user_input}")
        
        # Build context from history
        context = "\n".join(self.conversation_history[-5:])  # Last 5 messages
        prompt = f"{context}\nAssistant:"
        
        # Tokenize
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        # Generate
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        
        # Decode response
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract only the new response
        response = response.split("Assistant:")[-1].strip()
        
        # Add to history
        self.conversation_history.append(f"Assistant: {response}")
        
        return response
    
    def reset(self):
        self.conversation_history = []

# Usage
bot = ChatBot()

while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ['quit', 'exit']:
        break
    if user_input.lower() == 'reset':
        bot.reset()
        print("Conversation reset.")
        continue
    
    response = bot.chat(user_input)
    print(f"Assistant: {response}")

Integration with Voice Assistant

Combine LLMs with voice capabilities:

import whisper
import pyttsx3
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import sounddevice as sd
import numpy as np
import queue

class VoiceLLMAssistant:
    def __init__(self, llm_model="microsoft/phi-2"):
        # Initialize Whisper for speech-to-text
        print("Loading Whisper...")
        self.whisper_model = whisper.load_model("base")
        
        # Initialize TTS
        self.tts_engine = pyttsx3.init()
        self.tts_engine.setProperty('rate', 150)
        
        # Initialize LLM
        print(f"Loading LLM: {llm_model}")
        self.tokenizer = AutoTokenizer.from_pretrained(llm_model)
        self.llm = AutoModelForCausalLM.from_pretrained(
            llm_model,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        
        self.audio_queue = queue.Queue()
    
    def listen(self, duration=5):
        """Listen and transcribe"""
        audio_data = []
        sample_rate = 16000
        
        with sd.InputStream(
            callback=lambda indata, *args: self.audio_queue.put(indata.copy()),
            channels=1,
            samplerate=sample_rate
        ):
            sd.sleep(int(duration * 1000))
        
        while not self.audio_queue.empty():
            audio_data.append(self.audio_queue.get())
        
        if not audio_data:
            return None
        
        # Flatten to 1-D float32, which is what Whisper expects
        audio_array = np.concatenate(audio_data).flatten().astype(np.float32)
        result = self.whisper_model.transcribe(audio_array, fp16=False)
        return result['text']
    
    def generate_response(self, prompt):
        """Generate LLM response"""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.llm.device)
        
        with torch.no_grad():
            outputs = self.llm.generate(
                **inputs,
                max_new_tokens=100,
                temperature=0.7,
                do_sample=True
            )
        
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response.split(prompt)[-1].strip()
    
    def speak(self, text):
        """Text-to-speech"""
        self.tts_engine.say(text)
        self.tts_engine.runAndWait()
    
    def run(self):
        """Main loop"""
        print("Voice LLM Assistant ready. Say 'quit' to exit.")
        
        while True:
            # Listen
            user_input = self.listen(duration=3)
            
            if user_input:
                print(f"You: {user_input}")
                
                if "quit" in user_input.lower():
                    self.speak("Goodbye!")
                    break
                
                # Generate response
                response = self.generate_response(user_input)
                print(f"Assistant: {response}")
                
                # Speak response
                self.speak(response)

# Usage
assistant = VoiceLLMAssistant()
assistant.run()

Performance Optimization

Model Optimization Tips

  1. Use Quantization: 4-bit or 8-bit quantization reduces memory and improves speed
  2. Batch Processing: Process multiple prompts together when possible (see the sketch after this list)
  3. Context Window: Limit context size to what you need
  4. GPU Offloading: Offload as many layers to GPU as possible
  5. Model Caching: Keep models in memory between requests
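
A minimal sketch of batched generation with transformers, since padding is the part that is easy to get wrong with causal LMs (the prompts here are placeholders):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "microsoft/phi-2"
# Left padding so generation continues from the end of each prompt
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

prompts = ["The future of AI is", "Edge computing matters because"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))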

Memory Management

import torch
import gc

def clear_cache():
    """Clear GPU cache"""
    torch.cuda.empty_cache()
    gc.collect()

# Use after model operations
response = model.generate(...)
clear_cache()

Model Serving

Simple API Server

from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = Flask(__name__)

# Load model once at startup
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

@app.route('/generate', methods=['POST'])
def generate():
    data = request.json
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 100)
    temperature = data.get('temperature', 0.7)
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return jsonify({'response': response})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
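
With the server running, it can be queried from any machine on the network; a usage example matching the app.run call above:

import requests

# Call the /generate endpoint (replace localhost with the Jetson's IP from another machine)
response = requests.post('http://localhost:5000/generate', json={
    'prompt': 'The future of AI is',
    'max_tokens': 50,
    'temperature': 0.7
})
print(response.json()['response'])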

Next Steps