A Short Tale of Modern Software Development

How I Decided To Briefly Test Coding LLMs

The idea started simply enough: can I just use a large language model to determine the style of writing in a chunk of text - Markdown, Novel (all text in quotes) or Roleplaying (actions in between asterisks; everything else is speech)?
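
To make the target concrete, here’s a rough sketch of the three categories as a Rust enum - purely illustrative, not code from any of my projects:

// A rough sketch of the three writing styles I want to distinguish.
enum WritingStyle {
    Markdown,    // structured text: headings, lists, emphasis markers
    Novel,       // prose where all spoken dialogue sits inside quotes
    Roleplaying, // actions wrapped *in asterisks*; everything else is speech
}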

I’ve hand-written some algorithms to parse these in the past, but it’s always kind of a nuisance. The thought occurred to me that I could build a custom text classifier neural network and train that, which sounded fun. Maybe I could even port that to GGML so that I could use it in my other apps that use llama.cpp?

Then it occurred to me that I could use a lot of my existing code - once I add LoRA support - to do the text classification if I just fine-tune a smaller LLM, like maybe a 1B Llama model. But to do the fine-tuning I would need a dataset. It seemed like a prime opportunity to write an example for one of my libraries to have it act like a synthetic dataset generator.

Slowly the idea for this blog post started forming. Why not test some different models out with continue.dev from within VS Code, now that I’d finally bought some credits on openrouter? That makes the 70B+ tier of LLMs more usable, since my workstation only gets a token or two a second with those.

In the end, I decided to give it a go and document the process. Hopefully it can show what using tools like continue.dev can do besides the obvious code completion work.

I’ll start by asking each model how to add an example to a Rust project. From there, I follow up with a specific list of demands to make an example that uses clap for CLI argument parsing and that derives its text prediction code from an integration test of my Rust project - highlighting a natural workflow in continue.dev. Then, if errors are encountered, I try to address them in the Chat window of continue.dev and see what happens. Overall, each of the models creates a simple example of how to load an LLM with my wrapper library woolyrust and then generate a variable number of replies while saving them off to a JSONL file. The task is not extremely complicated.

With the coding models, I include their responses verbatim under their own headings, leaving their markdown intact. The verbosity of these models makes this blog post come out nightmarishly long, so I’ve provided a TOC linking to the start of each model’s tests.

Models Tested:

  1. Llama 3.1 Nemotron 70B
  2. Qwen 2.5 Coder 32B
  3. Claude 3.5 Sonnet

And the TLDR: Overall Conclusion

[Editor’s note: No images for now until I make some. I’m trying to actually do my own this time instead of AI renders. It’s gonna be a little random and weird, but … fuck it, it’s my blog, right? Just need a bit more time for the eye candy.]

First Things First

While assessing different models for coding projects, I had wandered into a leaderboard that brought to mind some different models to try besides the ones I’ve grown comfortable with: Qwen 2.5 Coder 32B and Codestral 22B. With that in mind, I configured continue.dev to use openrouter and set it up for the Llama 3.1 Nemotron 70B model.

The typical workflow for having AI assist on bigger ideas, as opposed to single-task questions, is to start small and build up. To highlight the ability of these models to work with existing code, I decided to use my woolyrust project and opened it up in VS Code. Woolyrust is a Rust language binding for woolycore, my library that provides a high-level C interface to llama.cpp, making it easier to write language bindings.


My Opening Nemotron 70B Question

With my woolyrust project open, I open the continue.dev tab to bring up the ‘chat’ interface and ask it a simple question.

Is it easy to add examples to Rust library projects?

Nemotron 70B’s Response

Adding examples to Rust library projects is indeed relatively easy and encouraged, as it helps demonstrate the usage of your library to users. Here’s a step-by-step guide on how to add examples to a Rust library project:

1. Create an examples Directory

Inside your library’s root directory (where your Cargo.toml file is located), create a new directory named examples if it doesn’t already exist.

2. Add Your Example Files

Within the examples directory, create one or more Rust files (e.g., my_example.rs, advanced_usage.rs, etc.). Each file will contain an example of how to use your library. These files should be regular Rust programs that use your library as an external crate would.

Example File Structure (my_example.rs):

// my_example.rs
extern crate my_library; // Use the exact name from your Cargo.toml

use my_library::{SomeModule, AnotherItemYouWantToDemonstrate};

fn main() {
    // Example usage of your library's functionality
    let result = SomeModule::some_function();
    println!("Example result: {}", result);

    // More examples...
}

3. (Optional) Specify Example Dependencies

If your examples require additional dependencies not needed by your library itself (e.g., for testing or logging in the context of the example), you can specify these in the [dev-dependencies] section of your Cargo.toml. This is because examples are typically built and run with cargo build --examples or cargo run --example <name>, which considers dev dependencies.

Specifying Dev Dependencies (Cargo.toml excerpt):

[dev-dependencies]
"Just-an-example = "1.2.3"

4. Running Your Examples

To build and run an example, navigate to your project’s root directory in your terminal and use the following Cargo command, replacing <example_name> with the name of your example file without the .rs extension:

cargo run --example my_example

Additional Tips:

  • Documentation Examples: Consider using your examples in documentation. Rustdoc allows embedding code examples that are actually run as tests to ensure they remain valid.
  • Testing Your Examples: While not necessary, you might want to ensure your examples continue to work as your library evolves. You can convert examples into integration tests by moving them into the tests directory and modifying them to assert expected outcomes instead of printing them.
  • Keep Examples Up-to-Date: Regularly review and update your examples to reflect the best practices for using your library, especially when its API changes.

By following these steps, you can easily enhance your Rust library project with meaningful examples that will significantly benefit your users.


With the Basics Established

Now that I got the model thinking of what I want to do - adding a new example to my existing project - I continue the chat with a far more detailed and nuanced demand.

I want to create a new example for this Rust project. It should take three 
command line args, parsed with the `clap` library, for 'count', 
'system prompt' and 'instruction prompt'.

Then, it should follow the text prediction workflow for the library as 
illustrated in the @step_prediction_test function. I would like to perform 
'count' number of text predictions for a prompt built by the 'system prompt' 
and 'instruction prompt' arguments on the command line.

I want to put the resulting predicted text into a vector and then turn it 
into a JSONL file at the end, serializing it and saving it to the file system.

Nemotron-70B’s Response

Below is an example of how you can create a new executable in your Rust project that meets your requirements. This example assumes you have clap for command-line argument parsing and serde & serde_json for JSON serialization. If you haven’t added these dependencies to your Cargo.toml yet, you’ll need to do so:

[dependencies]
clap = "3"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

Create a new file named cli_predictor.rs in the examples directory:

// examples/cli_predictor.rs
use clap::{App, Arg};
use serde_json;
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Import your library's functionality
use woolyrust::{Llama, ManagedGptParams, TokenList};

fn main() {
    // Clap App Setup
    let matches = App::new("CLI Text Predictor")
       .version("1.0")
       .author("Your Name")
       .about("Predicts text based on system and instruction prompts")
       .arg(
            Arg::with_name("count")
               .short("c")
               .long("count")
               .help("Number of predictions to generate")
               .required(true)
               .takes_value(true),
        )
       .arg(
            Arg::with_name("system_prompt")
               .short("s")
               .long("system")
               .help("System prompt for the model")
               .required(true)
               .takes_value(true),
        )
       .arg(
            Arg::with_name("instruction_prompt")
               .short("i")
               .long("instruction")
               .help("Instruction prompt for the model")
               .required(true)
               .takes_value(true),
        )
       .get_matches();

    // Extract command-line arguments
    let count: usize = matches.value_of("count").unwrap().parse().expect("Invalid count");
    let system_prompt = matches.value_of("system_prompt").unwrap();
    let instruction_prompt = matches.value_of("instruction_prompt").unwrap();

    // Setup library model (Similar to step_prediction_test)
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path(); // Assuming this function exists as in your test
    let mut llama = Llama::new();
    let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert_eq!(load_success, true);
    assert_eq!(llama.is_loaded(), true);

    //Prediction parameters (adjust as needed)
    let mut params = ManagedGptParams::defaults();
    params.params.seed = 42;
    params.params.n_threads = -1;
    params.params.n_predict = 100; // Per prediction length
    params.params.temp = 0.1;
    params.params.top_k = 1;
    params.params.top_p = 1.0;
    params.params.min_p = 0.1;
    params.params.penalty_repeat = 1.1;
    params.params.penalty_last_n = 512;
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;

    // Vector to hold JSON serializable prediction results
    #[derive(serde::Serialize)]
    struct PredictionResult {
        prompt: String,
        prediction: String,
    }
    let mut prediction_results = Vec::new();

    for _ in 0..count {
        // Construct prompt
        let prompt = format!("{}{}\n<|assistant|>\n", system_prompt, instruction_prompt);

        // Process prompt and generate prediction (simplified from your test)
        let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
        assert_eq!(prompt_token_count > 0, true);
        let mut predictions: TokenList = vec![];
        while predictions.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut sampler);
            if predictions.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }
            predictions.push(next_token);
        }
        let prediction_str = llama.detokenize_text(&mut predictions, false);
        assert!(!prediction_str.is_empty());

        // Store result for JSON serialization
        prediction_results.push(PredictionResult {
            prompt: prompt.clone(),
            prediction: prediction_str.clone(),
        });

        println!("Prediction {}\nPrompt: {}\nPrediction: {}\n", _+1, prompt, prediction_str);
    }

    // Serialize and save to JSONL file
    let jsonl_file_path = Path::new("predictions.jsonl");
    let mut file = File::create(jsonl_file_path).expect("Failed to create file");
    for result in prediction_results {
        let json = serde_json::to_string(&result).expect("Failed to serialize");
        file.write_all(format!("{}\n", json).as_bytes())
           .expect("Failed to write to file");
    }
    println!("Predictions saved to {}", jsonl_file_path.display());
}

To run this example:

Navigate to your project’s root directory in your terminal and execute:

cargo run --example cli_predictor -- -c 3 -s "System prompt here" -i "Instruction prompt here"

Replace -c 3 with your desired prediction count, and -s and -i with your system and instruction prompts, respectively. The predictions will be saved to a file named predictions.jsonl in the current working directory.


Implementing the First Step

The replies all seemed reasonable to me out of the gate, so I end up trying it out. I create the examples directory and touch the examples/dataset_generator.rs to create the example file. I add the three new dependencies as dev dependencies, unlike the suggested fragment from Nemotron, since they’re only needed for this example. Then I paste in the generated code…
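
For reference, the [dev-dependencies] section ended up looking roughly like this (version numbers here are approximate; use whatever is current rather than treating these as gospel):

[dev-dependencies]
clap = { version = "4", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"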

At this point, I want to highlight what is really cool about continue.dev: you can easily reference files, structs or functions while chatting. Notice the @step_prediction_test reference in my question above? That pulled my integration test function into the AI’s context so that it knew exactly how I wanted to generate text using my library! And looking at the generated code, I can tell it definitely copied what I was doing there.

But sadly, there were problems. The generated code involving clap didn’t compile: Arg::with_name() isn’t found, and the module can’t use clap::App because it doesn’t exist. So I ask Nemotron to fix its mistakes:

There's no `App` type in `clap` and `Arg::with_name()` doesn't exist. 
Rewrite the CLI arg parsing to use the `derive` feature of `clap` (already 
enabled in my cargo.toml) to parse the arguments into a struct.

Nemotron 70B’s Response

Here’s how you can rewrite the CLI argument parsing to use the derive feature of clap. This approach defines a struct for the arguments and derives the Parser trait for automatic parsing.

First, ensure you have the derive feature enabled for clap in your Cargo.toml:

[dependencies]
clap = { version = "3", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

Modified cli_predictor.rs with derived parser:

// examples/cli_predictor.rs
use clap::Parser;
use serde_json;
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Import your library's functionality
use yourlibrary::{Llama, ManagedGptParams, TokenList};

// Define a struct for the CLI arguments
#[derive(Parser, Debug)]
#[clap(author, version, about, long_about = None)]
struct Args {
    /// Number of predictions to generate
    #[clap(short, long, value_parser)]
    count: usize,

    /// System prompt for the model
    #[clap(short = 's', long, value_parser)]
    system_prompt: String,

    /// Instruction prompt for the model
    #[clap(short = 'i', long, value_parser)]
    instruction_prompt: String,
}

fn main() {
    // Parse CLI arguments into the `Args` struct
    let args = Args::parse();

    // Rest of your application remains the same, using fields from `args`
    // Setup library model (Similar to step_prediction_test)
    let mut model_params = yourlibrary::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = yourlibrary::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path(); // Assuming this function exists as in your test
    let mut llama = Llama::new();
    let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert_eq!(load_success, true);
    assert_eq!(llama.is_loaded(), true);

    // Prediction parameters (adjust as needed)
    let mut params = ManagedGptParams::defaults();
    //... (rest of the prediction setup remains the same)

    let mut prediction_results = Vec::new();

    for _ in 0..args.count {
        // Construct prompt
        let prompt = format!("{}{}\n<|assistant|>\n", args.system_prompt, args.instruction_prompt);

        // Process prompt and generate prediction (simplified from your test)
        //... (prediction logic remains the same)

        // Store result for JSON serialization
        prediction_results.push(PredictionResult {
            prompt: prompt.clone(),
            prediction: prediction_str.clone(),
        });

        println!("Prediction {}\nPrompt: {}\nPrediction: {}\n", _+1, prompt, prediction_str);
    }

    // Serialize and save to JSONL file
    let jsonl_file_path = Path::new("predictions.jsonl");
    let mut file = File::create(jsonl_file_path).expect("Failed to create file");
    for result in prediction_results {
        let json = serde_json::to_string(&result).expect("Failed to serialize");
        file.write_all(format!("{}\n", json).as_bytes())
          .expect("Failed to write to file");
    }
    println!("Predictions saved to {}", jsonl_file_path.display());
}

Note:

  • Replace yourlibrary with the actual name of your library when importing (yourlibrary::{Llama, ManagedGptParams, TokenList};).
  • The PredictionResult struct definition is assumed to be the same as before; if not defined in this file, ensure it’s defined similarly with serde::Serialize derive macro for JSON serialization.

Run the example with:

cargo run --example cli_predictor -- --count 3 --system-prompt "System prompt here" --instruction-prompt "Instruction prompt here"

Or use the short flags if preferred:

cargo run --example cli_predictor -- -c 3 -s "System prompt here" -i "Instruction prompt here"

Are We Working Yet?

So I make the suggested swaps to the clap code and the new struct-based parsing compiles fine. After that, I fix a very obvious problem with the main text generation loop not binding a variable for the index … because the println!() at the end of the loop wants to print the index as _+1, which is obviously wrong. I bind the loop index to the variable i and replace the underscore in the println!().
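
In other words, the loop ends up looking like this, matching the final listing further down:

    for i in 0..args.count {
        // ... prompt construction and prediction code unchanged ...

        println!(
            "Prediction {}\nPrompt: {}\nPrediction: {}\n",
            i + 1,
            prompt,
            prediction_str
        );
    }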

The last remaining compile error probably comes from the fact that I only referenced @step_prediction_test in the chat earlier and didn’t include the whole file with @step_prediction_tests.rs, so Nemotron had no way of knowing what the utility function get_test_model_path() was. In this example it’s undefined, but I copy my function over from the integration test and paste it into the bottom of the example file as well as importing exit from std::process.

Now it compiles! But will it run correctly? No, there’s no chance. When reviewing the code, I spot another error and ask Nemotron to fix it for me just to see if it would. I know the fix, but the whole point of this is to test out these models, right? Let’s see what it gives me.

While it looks like the prompt string gets made using `format!`, it doesn't 
actually get set in the `params` that gets sent to the `woolyrust` 
library functions. Can you doublecheck the @step_prediction_test example again and make any 
corrections necessary to make sure `params` will be configured properly 
for text prediciton?

Nemotron 70B’s Response

The provided step_prediction_test function appears to correctly set the prompt in the params using the set_prompt method, not relying on the format! macro for this specific task. However, to ensure clarity and address the mention of format!, which is not directly used for setting the prompt in params, here are the relevant parts of the code with a minor adjustment for better clarity and confirmation that the prompt is properly set:

// Original prompt setup (for reference)
let prompt = "<|user|>\nWrite the start to the next movie collaboration between Quentin Tarantino and Robert Rodriguez.<|end|>\n<|assistant|>\n";
params.set_prompt(prompt);

// Additional prompt setup later in the code (for the second prediction)
let new_prompt_text = "Do you have a suggestion for genre?<|end|>\n<|user|>\nMake it like a Pixar movie script, but with those two authors!<|end|>\n<|assistant|>\n";
let new_prompt_tokens = llama.process_additional_prompt(&mut second_sampler, new_prompt_text);

To explicitly demonstrate using format! for constructing a prompt (though not necessary in this case, as the original code directly assigns the prompt string), you could modify the initial prompt setup like so:

// Demonstrative example using format! (not necessary for the original code's functionality)
let movie_collaboration_idea = "Quentin Tarantino and Robert Rodriguez";
let prompt = format!(
    "<|user|>\nWrite the start to the next movie collaboration between {}.<|end|>\n<|assistant|>\n",
    movie_collaboration_idea
);
params.set_prompt(prompt);

And for the additional prompt:

// Demonstrative example for the additional prompt using format!
let suggested_genre = "Pixar movie script";
let additional_authors = "those two authors";
let new_prompt_text = format!(
    "Do you have a suggestion for genre?<|end|>\n<|user|>\nMake it like a {}, but with {}!<|end|>\n<|assistant|>\n",
    suggested_genre, additional_authors
);
let new_prompt_tokens = llama.process_additional_prompt(&mut second_sampler, new_prompt_text);

Confirmation of params Configuration for Text Prediction:

Based on the provided step_prediction_test function, the params appear to be properly configured for text prediction, with key configurations including:

  1. Prompt Setting: The primary and additional prompts are correctly set using set_prompt and process_additional_prompt, respectively.
  2. Model and Context Parameters: model_params and context_params are customized (e.g., n_gpu_layers, n_ctx).
  3. Prediction Parameters: ManagedGptParams are extensively configured, covering aspects like seeding, threading, prediction length, temperature, top-k and top-p filtering, and more.
  4. Freezing and Defrosting: States are properly frozen and defrosted to facilitate efficient continuation of predictions.

Given this analysis, the params should indeed be correctly configured for the intended text prediction tasks within the step_prediction_test function.


Good Enough?

Okay, so the block where it demonstrates creating the prompt with format! and follows up with a params.set_prompt() call is correct, and seeing it go further to show how you can add additional prompt tokens with another library call is nice… but it still didn’t add the function call into the example code. I figure this is good enough and just add the set_prompt() call in myself.
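
The hand edit amounts to these two lines inside the generation loop (you’ll see them again in the final listing below):

        // build the prompt from the CLI args and hand it to the params
        let prompt = format!(
            "{}{}\n<|assistant|>\n",
            args.system_prompt, args.instruction_prompt
        );
        params.set_prompt(prompt.as_str());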

Referencing Nemotron’s first reply to figure out how to invoke it, I decide to run the example and see what happens. Before actually running the example, I need to make sure the environment variable I use for my integration tests is set so it knows what model to load.

export WOOLY_TEST_MODEL_FILE=~/.cache/lm-studio/models/bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_S.gguf
cargo run --example dataset_generator -- --count 1 --system-prompt "You are a creative writing AI." --instruction-prompt "Generate a sample text block in Markdown format."

It worked!

Here’s the output in the terminal:

Prediction 1
Prompt: You are a creative writing AI.Generate a sample text block in Markdown format.
<|assistant|>

Prediction: ## Sample Text Block

### Introduction
This is the introduction to our sample text block. It sets the stage for what's to come and provides context.

#### Key Points
*   We will cover some key points about [topic].
*   These points will be discussed in detail throughout this document.
*   They provide an overview of the main ideas that we'll explore.

### Main Content
#### Section 1: Topic Overview
Our topic is [topic]. This section provides an overview of the

Predictions saved to predictions.jsonl

The predictions.jsonl file was created successfully, and here’s the serialized result (formatted more nicely for reading):

{"prompt":"You are a creative writing AI.Generate a sample text block in Markdown format.
<|assistant|>
","prediction":"## Sample Text Block

### Introduction
This is the introduction to our sample text block. It sets the stage for what's to come and provides context.

#### Key Points
*   We will cover some key points about [topic].
*   These points will be discussed in detail throughout this document.
*   They provide an overview of the main ideas that we'll explore.

### Main Content
#### Section 1: Topic Overview
Our topic is [topic]. This section provides an overview of the"}

At this point, the only other problem I see offhand is the hardcoded seed parameter, which is better off as u32::MAX to have the library use random seeds. Maybe randomizing some of the sampling parameters within an allowable range would be useful too.

Final Nemotron 70B Example Code

// examples/cli_predictor.rs
use clap::Parser;
use serde_json;
use std::io::Write;
use std::path::Path;
use std::{fs::File, process::exit};

// Import your library's functionality
use woolyrust::{Llama, ManagedGptParams, TokenList};

// Define a struct for the CLI arguments
#[derive(Parser, Debug)]
#[clap(author, version, about, long_about = None)]
struct Args {
    /// Number of predictions to generate
    #[clap(short, long, value_parser)]
    count: usize,

    /// System prompt for the model
    #[clap(short = 's', long, value_parser)]
    system_prompt: String,

    /// Instruction prompt for the model
    #[clap(short = 'i', long, value_parser)]
    instruction_prompt: String,
}

fn main() {
    // Parse CLI arguments into the `Args` struct
    let args = Args::parse();

    // Setup library model (Similar to step_prediction_test)
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path(); // Assuming this function exists as in your test
    let mut llama = Llama::new();
    let load_success =
        llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert_eq!(load_success, true);
    assert_eq!(llama.is_loaded(), true);

    //Prediction parameters (adjust as needed)
    let mut params = ManagedGptParams::defaults();
    params.params.seed = 42;
    params.params.n_threads = -1;
    params.params.n_predict = 100; // Per prediction length
    params.params.temp = 0.1;
    params.params.top_k = 1;
    params.params.top_p = 1.0;
    params.params.min_p = 0.1;
    params.params.penalty_repeat = 1.1;
    params.params.penalty_last_n = 512;
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;

    // Vector to hold JSON serializable prediction results
    #[derive(serde::Serialize)]
    struct PredictionResult {
        prompt: String,
        prediction: String,
    }
    let mut prediction_results = Vec::new();

    for i in 0..args.count {
        // Construct prompt
        let prompt = format!(
            "{}{}\n<|assistant|>\n",
            args.system_prompt, args.instruction_prompt
        );
        params.set_prompt(prompt.as_str());

        // Process prompt and generate prediction (simplified from your test)
        let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
        assert_eq!(prompt_token_count > 0, true);
        let mut predictions: TokenList = vec![];
        while predictions.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut sampler);
            if predictions.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }
            predictions.push(next_token);
        }
        let prediction_str = llama.detokenize_text(&mut predictions, false);
        assert!(!prediction_str.is_empty());

        // Store result for JSON serialization
        prediction_results.push(PredictionResult {
            prompt: prompt.clone(),
            prediction: prediction_str.clone(),
        });

        println!(
            "Prediction {}\nPrompt: {}\nPrediction: {}\n",
            i + 1,
            prompt,
            prediction_str
        );
    }

    // Serialize and save to JSONL file
    let jsonl_file_path = Path::new("predictions.jsonl");
    let mut file = File::create(jsonl_file_path).expect("Failed to create file");
    for result in prediction_results {
        let json = serde_json::to_string(&result).expect("Failed to serialize");
        file.write_all(format!("{}
", json).as_bytes())
            .expect("Failed to write to file");
    }
    println!("Predictions saved to {}", jsonl_file_path.display());
}

// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
    let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
    if let Ok(fp) = model_filepath {
        return fp;
    } else {
        println!(
            "Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
        );
        exit(1);
    }
}

Nemotron 70B Conclusion

Not bad! How much did running all of this cost me in openrouter credits? Less than half a penny! Insane. I’m a big supporter of local AI usage and have only just this week started toying with API access to bigger models, but this price point is amazing. There are privacy concerns and such, for sure. The first thing I did was opt out of letting openrouter do anything with my usage data in their privacy settings, but who knows what the endpoints really do. However, if I didn’t have my 4090 in my workstation, I think it’d be a lot harder to justify purchasing one just to run LLMs.

Next Challenger - Qwen 2.5 Coder 32B

Let’s move that example file off to the side and start this whole thing over to see what it’s like using a model specifically crafted for programming tasks. I’ll be running qwen2.5-coder-32b-instruct using the Q4_K_M quant with all layers offloaded to my 4090, with LM Studio acting as my API provider over the LAN while I access it from my MacBook Air M3. I load it with a context size of 8192 tokens, and LM Studio reports just over 18 GB of VRAM used. My default sampler settings for this model use a temp of 0.3, top_k of 40, repeat penalty of 1.03, top_p of 0.9 and min_p of 0.05. Flash attention is enabled as well.

With my woolyrust project still open in VS Code, I set continue.dev to use my LAN API endpoint with the qwen2.5 coder model and fire off my first question in a new chat window. (Note: it is the exact same one I used to start the process with Nemotron.)

Is it easy to add examples to Rust library projects?

As the tokens roll in, I would ballpark the speed of this 32B model on my 4090 as being roughly equivalent to what I was pulling from openrouter providers with Nemotron 70B, FWIW.

Qwen2.5 Coder 32B’s Response

Adding examples to Rust library projects is straightforward and can be done in a few simple steps. Here’s how you can do it:

Step 1: Create an examples Directory

First, create a directory named examples in the root of your Rust project. This is where you will store your example files.

mkdir examples

Step 2: Add Example Files

Inside the examples directory, you can add as many Rust files as you need. Each file should contain a main function, as it will be compiled as a separate binary.

For example, create a file named simple_example.rs:

// examples/simple_example.rs

use your_crate_name::SomeStruct;

fn main() {
    let example = SomeStruct::new();
    println!("Example: {:?}", example);
}

Step 3: Update Cargo.toml

You don’t need to update Cargo.toml specifically for the examples, but you can specify dependencies or other configurations if needed. However, Rust automatically recognizes files in the examples directory as examples.

Step 4: Run the Examples

To run an example, use the cargo run command with the --example flag followed by the name of the example file (without the .rs extension).

cargo run --example simple_example

Step 5: Build the Examples

If you want to build the examples without running them, you can use the cargo build command with the --examples flag.

cargo build --examples

Additional Tips

  • Documentation: Consider adding documentation comments to your example files to explain what they do. This can be useful for users of your library.
  • Testing Examples: You can also run all examples as tests using the cargo test --examples command. This ensures that your examples remain up-to-date and work correctly with your library.

By following these steps, you can easily add and manage examples in your Rust library projects, making it easier for others to understand and use your library.


A Good Start

Honestly? I like the clarity of Qwen’s answer a touch better. And interestingly enough, you’ll notice that it’s largely the same response as what Nemotron 70B gave. Subjectively, rendered inside continue.dev’s Chat window, I think Qwen’s response is easier to read.

With the context set, I send it the same detailed request I sent Nemotron:

I want to create a new example for this Rust project. It should take three 
command line args, parsed with the `clap` library, for 'count', 
'system prompt' and 'instruction prompt'.

Then, it should follow the text prediction workflow for the library as 
illustrated in the @step_prediction_test function. I would like to perform 
'count' number of text predictions for a prompt built by the 'system prompt' 
and 'instruction prompt' arguments on the command line.

I want to put the resulting predicted text into a vector and then turn it 
into a JSONL file at the end, serializing it and saving it to the file system.

Qwen2.5 Coder 32B’s Response

Certainly! To create a new example for your Rust project that takes command-line arguments using the clap library and performs text predictions, you can follow these steps:

  1. Add clap and serde Dependencies: First, ensure you have the necessary dependencies in your Cargo.toml.

  2. Create the Example File: Create a new file in the examples directory, e.g., text_prediction.rs.

  3. Implement the Command-Line Argument Parsing and Prediction Workflow: Use clap to parse the command-line arguments and then follow the prediction workflow as shown in your step_prediction_test function.

  4. Serialize and Save to JSONL: Use serde to serialize the predictions into a JSONL file.

Here’s how you can do it:

Step 1: Add Dependencies

Add the following dependencies to your Cargo.toml:

[dependencies]
clap = { version = "4.0", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
woolyrust = "0.1"  # Replace with your actual crate name and version

Step 2: Create the Example File

Create a new file named text_prediction.rs in the examples directory:

// examples/text_prediction.rs

use clap::Parser;
use serde::{Serialize, Deserialize};
use std::fs::File;
use std::io::{self, Write};
use woolyrust::{Llama, ManagedGptParams, TokenList};

#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
    #[arg(short, long, default_value_t = 1)]
    count: u32,

    #[arg(short, long, required = true)]
    system_prompt: String,

    #[arg(short, long, required = true)]
    instruction_prompt: String,
}

#[derive(Serialize, Deserialize)]
struct Prediction {
    prompt: String,
    prediction: String,
}

fn main() -> io::Result<()> {
    let cli = Cli::parse();

    // Load up our test model
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path();
    let mut llama = Llama::new();
    let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert_eq!(load_success, true);
    assert_eq!(llama.is_loaded(), true);

    // Set the text generation parameters up
    let mut params = ManagedGptParams::defaults();
    params.params.seed = 42;
    params.params.n_threads = -1;
    params.params.n_predict = 100;
    params.params.temp = 0.1;
    params.params.top_k = 1;
    params.params.top_p = 1.0;
    params.params.min_p = 0.1;
    params.params.penalty_repeat = 1.1;
    params.params.penalty_last_n = 512;
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;
    let antiprompts = vec!["<|end|>"];
    let prompt = format!("<|user|>\n{}\n<|end|>\n<|assistant|>\n{}", cli.system_prompt, cli.instruction_prompt);
    params.set_antiprompts(&antiprompts);
    params.set_prompt(prompt.clone());

    params.params.dry_multiplier = 0.8;
    params.params.dry_base = 1.75;
    params.params.dry_allowed_length = 2;
    params.params.dry_penalty_last_n = -1;
    let seq_breakers = vec!["\n", ":", "\"", "*"];
    params.set_dry_sequence_breakers(&seq_breakers);

    // Get the prompt ingested into the context and pull the sampler
    // used in the process so that repeat penalties and such are
    // accounted for.
    let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
    assert_eq!(prompt_token_count > 0, true);

    // Freeze the state after processing the prompt so that we can generate
    // a second block of text after the first one without having to reprocess
    // the prompt.
    let frozen_prompt = llama.freeze(&mut params, None);

    let mut predictions: Vec<Prediction> = Vec::new();

    for _ in 0..cli.count {
        // Start our prediction loop now that the prompt has been processed
        let mut prediction_tokens: TokenList = vec![];
        while prediction_tokens.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut first_sampler);

            if prediction_tokens.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }

            prediction_tokens.push(next_token);
        }

        // Print out our prediction
        let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
        println!("Prediction (tokens: {})\n{}", prediction_tokens.len(), prediction_str);
        assert!(!prediction_str.is_empty());
        assert!(prediction_tokens.len() > 0);

        // Store the prediction
        predictions.push(Prediction {
            prompt: prompt.clone(),
            prediction: prediction_str,
        });

        // Defrost our frozen state from processing the prompt and generate something new
        let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
        first_sampler = next_sampler;
    }

    // Serialize predictions to JSONL and save to a file
    let file_path = "predictions.jsonl";
    let mut file = File::create(file_path)?;

    for prediction in predictions {
        let json = serde_json::to_string(&prediction)?;
        writeln!(file, "{}", json)?;
    }

    println!("Predictions saved to {}", file_path);

    Ok(())
}

fn get_test_model_path() -> String {
    // Replace with your actual model path
    "path/to/your/model.bin".to_string()
}

Step 3: Run the Example

To run the example, use the cargo run command with the --example flag followed by the name of the example file (without the .rs extension).

cargo run --example text_prediction -- --count 3 --system-prompt "Write a movie collaboration" --instruction-prompt "between Quentin Tarantino and Robert Rodriguez"

Explanation

  • Command-Line Arguments: The clap library is used to parse command-line arguments for count, system_prompt, and instruction_prompt.
  • Prediction Workflow: The prediction workflow follows the steps from your step_prediction_test function.
  • JSONL Serialization: The predictions are serialized into JSONL format and saved to a file named predictions.jsonl.

This setup should help you create a robust example for your Rust project that takes command-line arguments and performs multiple text predictions, saving the results in a structured format.


Quite Impressed!

Whoa! Compared to Nemotron, there are a number of things Qwen gets right out of the gate. Firstly, it wrote the example to have clap use a struct for all the command-line parameters, which I prefer. Secondly, if you look at the source step_prediction_tests.rs file, you can tell that it kept much more of the same parameters. Lastly, you can see that it wrote its own implementation of get_test_model_path() to make the example complete, since it didn’t have access to the one I wrote in its context. I’m going to replace Qwen’s placeholder with my actual code, but I’m happy to see it figured that out.

However, there’s still a compiler error on the line params.set_prompt(prompt.clone()); since the function expects a &str and not a String. Let’s ask Qwen to fix it.

When I paste that code into the source file, there's a compiler error 
with `params.set_prompt(prompt.clone());`... can you fix that?

Qwen2.5 Coder 32B’s Response

Certainly! The error you’re encountering is likely due to the ManagedGptParams struct not having a set_prompt method. Let’s address this by ensuring that the ManagedGptParams struct has the necessary methods and fields to handle the prompt.

Since the woolyrust library might not have a set_prompt method, we need to check the actual API of the woolyrust crate to see how to set the prompt correctly. Let’s assume that the ManagedGptParams struct has a field for the prompt, and we can directly set it.

Here’s the revised code with the necessary adjustments:

<…SNIP …>


Qwen’s First Misstep

So I cut the response short for this article, but it reiterated the sample code, changing that one line to:

    params.prompt = prompt.clone();  // Directly set the prompt field

The problem is that there is no member called ‘prompt’ on that structure. Qwen doesn’t have the struct’s definition in context, only my integration test, so it tries to infer that it’s a String, when in reality the ManagedGptParams struct exists to give a lifetime to the native C strings used for the FFI calls… which is why calling set_prompt() is necessary instead of just setting the string. Qwen hoped to simplify things by guessing, but it guessed wrong. It goes to show how specific you should be with Chat in continue.dev if you have the knowledge.
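
To illustrate why a setter is the right shape for this, here’s a minimal sketch of the general pattern - not woolyrust’s actual source, just the idea that the struct has to own the C string so the pointer handed across the FFI boundary stays valid:

use std::ffi::CString;

// Hypothetical stand-in for a params struct that backs native FFI calls.
struct ParamsSketch {
    // Owning the CString keeps the native string alive for as long as the
    // struct lives, so a raw pointer to it can safely be passed into C.
    prompt_cstr: Option<CString>,
}

impl ParamsSketch {
    fn set_prompt(&mut self, prompt: &str) {
        // Convert the Rust &str into an owned, NUL-terminated C string.
        self.prompt_cstr = Some(CString::new(prompt).expect("prompt contained a NUL byte"));
    }

    fn prompt_ptr(&self) -> *const std::os::raw::c_char {
        // The pointer handed to the native library borrows from the owned CString.
        self.prompt_cstr
            .as_ref()
            .map_or(std::ptr::null(), |c| c.as_ptr())
    }
}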

Let’s try saving the situation:

No, that's not right. @ManagedGptParams does not have a `prompt` member. The `set_prompt()` 
function just needs a `&str` passed to it instead. Can you fix your original 
sample code up now that you know that?

Qwen2.5 Coder 32B’s Response

Certainly! Given that the ManagedGptParams struct does not have a prompt field and instead uses a set_prompt method, we need to adjust the code accordingly. Here’s the revised version of the code:

// examples/text_prediction.rs

use clap::Parser;
use serde::{Serialize, Deserialize};
use std::fs::File;
use std::io::{self, Write};
use woolyrust::{Llama, ManagedGptParams, TokenList};

#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
    #[arg(short, long, default_value_t = 1)]
    count: u32,

    #[arg(short, long, required = true)]
    system_prompt: String,

    #[arg(short, long, required = true)]
    instruction_prompt: String,
}

#[derive(Serialize, Deserialize)]
struct Prediction {
    prompt: String,
    prediction: String,
}

fn main() -> io::Result<()> {
    let cli = Cli::parse();

    // Load up our test model
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path();
    let mut llama = Llama::new();
    let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert_eq!(load_success, true);
    assert_eq!(llama.is_loaded(), true);

    // Set the text generation parameters up
    let mut params = ManagedGptParams::defaults();
    params.params.seed = 42;
    params.params.n_threads = -1;
    params.params.n_predict = 100;
    params.params.temp = 0.1;
    params.params.top_k = 1;
    params.params.top_p = 1.0;
    params.params.min_p = 0.1;
    params.params.penalty_repeat = 1.1;
    params.params.penalty_last_n = 512;
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;
    let antiprompts = vec!["<|end|>"];
    let prompt = format!("<|user|>\n{}\n<|end|>\n<|assistant|>\n{}", cli.system_prompt, cli.instruction_prompt);
    params.antiprompts = antiprompts;

    // Set the prompt using the set_prompt method
    params.set_prompt(prompt.as_str());

    params.params.dry_multiplier = 0.8;
    params.params.dry_base = 1.75;
    params.params.dry_allowed_length = 2;
    params.params.dry_penalty_last_n = -1;
    let seq_breakers = vec!["\n", ":", "\"", "*"];
    params.dry_sequence_breakers = seq_breakers;

    // Get the prompt ingested into the context and pull the sampler
    // used in the process so that repeat penalties and such are
    // accounted for.
    let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
    assert_eq!(prompt_token_count > 0, true);

    // Freeze the state after processing the prompt so that we can generate
    // a second block of text after the first one without having to reprocess
    // the prompt.
    let frozen_prompt = llama.freeze(&mut params, None);

    let mut predictions: Vec<Prediction> = Vec::new();

    for _ in 0..cli.count {
        // Start our prediction loop now that the prompt has been processed
        let mut prediction_tokens: TokenList = vec![];
        while prediction_tokens.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut first_sampler);

            if prediction_tokens.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }

            prediction_tokens.push(next_token);
        }

        // Print out our prediction
        let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
        println!("Prediction (tokens: {})\n{}", prediction_tokens.len(), prediction_str);
        assert!(!prediction_str.is_empty());
        assert!(prediction_tokens.len() > 0);

        // Store the prediction
        predictions.push(Prediction {
            prompt: prompt.clone(),
            prediction: prediction_str,
        });

        // Defrost our frozen state from processing the prompt and generate something new
        let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
        first_sampler = next_sampler;
    }

    // Serialize predictions to JSONL and save to a file
    let file_path = "predictions.jsonl";
    let mut file = File::create(file_path)?;

    for prediction in predictions {
        let json = serde_json::to_string(&prediction)?;
        writeln!(file, "{}", json)?;
    }

    println!("Predictions saved to {}", file_path);

    Ok(())
}

fn get_test_model_path() -> String {
    // Replace with your actual model path
    "path/to/your/model.bin".to_string()
}

Key Changes

  1. Use set_prompt Method: Instead of directly setting the prompt field, we use the set_prompt method to set the prompt.

  2. Pass &str to set_prompt: The set_prompt method takes a &str, so we pass prompt.as_str() to it.

Running the Example

To run the example, use the following command:

cargo run --example text_prediction -- --count 3 --system-prompt "Write a movie collaboration" --instruction-prompt "between Quentin Tarantino and Robert Rodriguez"

This should resolve the compiler error and allow you to run the example successfully. If there are any further issues, please provide the specific error messages, and I can help you troubleshoot further.


Rollback Time

Okay, so it got the error fixed with set_prompt(), but then it dropped the usage of set_antiprompts() and set_dry_sequence_breakers(), which serve the same purpose as set_prompt(), and tried to just set named members that don’t exist instead - even after I tagged the struct in Chat to pull its definition into context. Disappointing. Maybe it would have helped to retag the @step_prediction_test integration test again too.

Rolling back to the first example code, I add my function to get the model path and fix the set_prompt() call myself and then take a closer look at the logic of the generated code.

Impressively enough, it caught on to the usage of ‘freezing’ and ‘defrosting’ (saving/restoring) the state of the LLM after the initial prompt processing as an optimization.
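
Pulled out of the generated example (and lightly trimmed), the pattern looks like this:

    // Ingest the prompt once and grab the sampler used while doing it.
    let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
    assert!(prompt_token_count > 0);

    // Snapshot (freeze) the post-prompt state so it never has to be reprocessed.
    let frozen_prompt = llama.freeze(&mut params, None);

    for _ in 0..cli.count {
        // ... sample and process tokens with `first_sampler` ...

        // Restore (defrost) the snapshot and swap in the fresh sampler for the next pass.
        let (_, next_sampler) = llama.defrost(&mut params, &frozen_prompt);
        first_sampler = next_sampler;
    }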

Time to see if it runs!

cargo run --example dataset_generator -- --count 1 --system-prompt "You are a creative writing AI." --instruction-prompt "Generate a sample text block in Markdown format."

Terminal output:

Prediction (tokens: 100)


# Sample Text Block
## Introduction
This is a sample text block written in Markdown format.

### Paragraph 1

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupid
Predictions saved to predictions.jsonl

The generated predictions.jsonl file had the expected data.


Finishing Touches for Qwen

Okay, so it works, but because of the sampler parameters it pulled from my integration test - where I wanted the generated text to mostly be the same - every generation is identical across iterations of the loop. So I figured I’d ask a little bonus question of Qwen.

But I knew I wanted to tag my updated example, so before I asked this question, I hit the ‘…’ button in continue.dev’s window and had it ’re-index’ my code.

Okay the corrected code now exists in @dataset_generator.rs however the sampling 
parameters produce the same exact response every time making looping worthess.
Can you give me code just for the sampler parameters and change them up so that 
there's a much greater increase in variability in the generted text?

Qwen2.5 Coder 32B’s Response

Certainly! To increase variability in the generated text, you can adjust the sampling parameters to allow for more randomness and diversity. Here are some key parameters you can modify:

  1. Temperature (temp): Lower values make the model more deterministic, while higher values make it more random.
  2. Top-k Sampling (top_k): Controls the number of highest probability vocabulary tokens to keep for top-k filtering.
  3. Top-p Sampling (top_p): Also known as nucleus sampling, where the model considers the smallest set of tokens whose cumulative probability exceeds top_p.
  4. Penalty for Repeat Tokens (penalty_repeat): Adjusts the penalty for repeating tokens to encourage diversity.
  5. Penalty for Last N Tokens (penalty_last_n): Adjusts the penalty for repeating tokens in the last n tokens.

Here’s the updated code with modified sampling parameters to increase variability:

// examples/dataset_generator.rs

use clap::Parser;
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{self, Write};
use std::process::exit;
use woolyrust::{Llama, ManagedGptParams, TokenList};

#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
    #[arg(short, long, default_value_t = 1)]
    count: u32,

    #[arg(short, long, required = true)]
    system_prompt: String,

    #[arg(short, long, required = true)]
    instruction_prompt: String,
}

#[derive(Serialize, Deserialize)]
struct Prediction {
    prompt: String,
    prediction: String,
}

fn main() -> io::Result<()> {
    let cli = Cli::parse();

    // Load up our test model
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path();
    let mut llama = Llama::new();
    let load_success =
        llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert_eq!(load_success, true);
    assert_eq!(llama.is_loaded(), true);

    // Set the text generation parameters up
    let mut params = ManagedGptParams::defaults();
    params.params.seed = 42; // You can change this to a different seed or use a random seed
    params.params.n_threads = -1;
    params.params.n_predict = 100;
    params.params.temp = 0.7; // Increased temperature for more randomness
    params.params.top_k = 50; // Increased top-k for more diversity
    params.params.top_p = 0.9; // Increased top-p for more diversity
    params.params.min_p = 0.05; // Lower min_p to allow more tokens
    params.params.penalty_repeat = 1.0; // Reduced penalty for repeat tokens
    params.params.penalty_last_n = 32; // Reduced penalty last n tokens
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;
    let antiprompts = vec!["<|end|>"];
    let prompt = format!(
        "<|user|>\n{}\n<|end|>\n<|assistant|>\n{}",
        cli.system_prompt, cli.instruction_prompt
    );
    params.set_antiprompts(&antiprompts);
    params.set_prompt(prompt.as_str());

    params.params.dry_multiplier = 0.8;
    params.params.dry_base = 1.75;
    params.params.dry_allowed_length = 2;
    params.params.dry_penalty_last_n = -1;
    let seq_breakers = vec!["\n", ":", "\"", "*"];
    params.set_dry_sequence_breakers(&seq_breakers);

    // Get the prompt ingested into the context and pull the sampler
    // used in the process so that repeat penalties and such are
    // accounted for.
    let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
    assert_eq!(prompt_token_count > 0, true);

    // Freeze the state after processing the prompt so that we can generate
    // a second block of text after the first one without having to reprocess
    // the prompt.
    let frozen_prompt = llama.freeze(&mut params, None);

    let mut predictions: Vec<Prediction> = Vec::new();

    for _ in 0..cli.count {
        // Start our prediction loop now that the prompt has been processed
        let mut prediction_tokens: TokenList = vec![];
        while prediction_tokens.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut first_sampler);

            if prediction_tokens.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }

            prediction_tokens.push(next_token);
        }

        // Print out our prediction
        let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
        println!(
            "Prediction (tokens: {})\n{}",
            prediction_tokens.len(),
            prediction_str
        );
        assert!(!prediction_str.is_empty());
        assert!(prediction_tokens.len() > 0);

        // Store the prediction
        predictions.push(Prediction {
            prompt: prompt.clone(),
            prediction: prediction_str,
        });

        // Defrost our frozen state from processing the prompt and generate something new
        let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
        first_sampler = next_sampler;
    }

    // Serialize predictions to JSONL and save to a file
    let file_path = "predictions.jsonl";
    let mut file = File::create(file_path)?;

    for prediction in predictions {
        let json = serde_json::to_string(&prediction)?;
        writeln!(file, "{}", json)?;
    }

    println!("Predictions saved to {}", file_path);

    Ok(())
}

// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
    let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
    if let Ok(fp) = model_filepath {
        return fp;
    } else {
        println!(
            "Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
        );
        exit(1);
    }
}

Key Changes

  1. Temperature (temp): Set to 0.7 to increase randomness.
  2. Top-k Sampling (top_k): Set to 50 to consider more tokens.
  3. Top-p Sampling (top_p): Set to 0.9 to consider a larger set of tokens.
  4. Minimum Probability (min_p): Set to 0.05 to allow more tokens.
  5. Penalty for Repeat Tokens (penalty_repeat): Set to 1.0 to reduce the penalty.
  6. Penalty for Last N Tokens (penalty_last_n): Set to 32 to reduce the penalty.

These changes should help increase the variability in the generated text. You can further tweak these parameters to achieve the desired level of diversity and randomness.


One Last Qwen Run

It gave me the whole block of code again, which I wish it didn’t do, but at least it clearly marked its changes, so I just copied those into my example code. It also didn’t adjust the seed, so every response will still come out the same.

I’ll keep this short, but I asked this in chat:

You didn't change the `seed` so everything will still be the same.

And Qwen regurgitated the whole example code file again but had the following segment for setting the seed variable:

    // Set a random seed
    let seed = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("Time went backwards")
        .as_secs();
    params.params.seed = seed as u64;

Which is close, but wrong: seed is a u32, not a u64. Also, I know that a much simpler way to introduce random seed behavior is to set seed to u32::MAX, so I don’t use Qwen’s suggestion and just fix the seed myself.
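
Here’s the one-line fix I applied instead; it matches what the final example code below ends up using, and (as I understand the llama.cpp behavior that woolycore passes through) a seed of u32::MAX requests a fresh random seed on each run:

    // u32::MAX requests a randomized seed each run, so there's no need to
    // derive one from the system clock.
    params.params.seed = u32::MAX;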

Once more I run the example and see what comes out:

cargo run --example dataset_generator -- --count 3 --system-prompt "You are a creative writing AI." --instruction-prompt "Generate a sample text block in Markdown format."

Terminal output:

Prediction (tokens: 100)
 This text block should include a heading, a brief description, and a few key points. # Heading
## Brief Description
This sample text block demonstrates how to create a Markdown formatted text with a heading, description, and key points.

### Key Points
*   **Heading**: The `# Heading` line defines the heading level, with `#` indicating the main heading, `##` indicating a subheading, and so on.
*   **Description**: The `## Brief Description` line
Prediction (tokens: 100)


---

### The Mysterious Island

#### Chapter 1: The Discovery

As I stood on the creaky deck of the old sailing ship, the salty sea air whipping my hair into a frenzy, I couldn't help but feel a sense of wonder. We had been at sea for weeks, and finally, the horizon had begun to take on a familiar shape – the island of Tortuga.

Legend had it that the island was cursed, that anyone who set foot on its sandy shores would
Prediction (tokens: 100)


# Introduction
The city of Eldrador is a place of mystery and wonder. It is a city that exists outside the boundaries of time, where the past, present, and future all converge. The city is filled with strange and wondrous sights, sounds, and smells that will leave you breathless and bewildered. As you walk through the city's winding streets, you'll stumble upon ancient ruins, hidden temples, and mysterious artifacts that will challenge your perceptions and spark your imagination.

## The
Predictions saved to predictions.jsonl

The generated predictions.jsonl file had all the predictions as expected (formatted for reading):

{
"prompt":"<|user|>
You are a creative writing AI.
<|end|>
<|assistant|>
Generate a sample text block in Markdown format.",
"prediction":" This text block should include a heading, a brief description, and a few key points. # Heading
## Brief Description
This sample text block demonstrates how to create a Markdown formatted text with a heading, description, and key points.

### Key Points
*   **Heading**: The `# Heading` line defines the heading level, with `#` indicating the main heading, `##` indicating a subheading, and so on.
*   **Description**: The `## Brief Description` line"
}
{
"prompt":"<|user|>
You are a creative writing AI.
<|end|>
<|assistant|>
Generate a sample text block in Markdown format.",
"prediction":" 

---

### The Mysterious Island

#### Chapter 1: The Discovery

As I stood on the creaky deck of the old sailing ship, the salty sea air whipping my hair into a frenzy, I couldn't help but feel a sense of wonder. We had been at sea for weeks, and finally, the horizon had begun to take on a familiar shape – the island of Tortuga.

Legend had it that the island was cursed, that anyone who set foot on its sandy shores would"
}
{
"prompt":"<|user|>
You are a creative writing AI.
<|end|>
<|assistant|>
Generate a sample text block in Markdown format.",
"prediction":" 

# Introduction
The city of Eldrador is a place of mystery and wonder. It is a city that exists outside the boundaries of time, where the past, present, and future all converge. The city is filled with strange and wondrous sights, sounds, and smells that will leave you breathless and bewildered. As you walk through the city's winding streets, you'll stumble upon ancient ruins, hidden temples, and mysterious artifacts that will challenge your perceptions and spark your imagination.

## The"
}

Final Qwen 2.5 Coder 32B Example Code

// examples/text_prediction.rs

use clap::Parser;
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{self, Write};
use std::process::exit;
use woolyrust::{Llama, ManagedGptParams, TokenList};

#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
    #[arg(short, long, default_value_t = 1)]
    count: u32,

    #[arg(short, long, required = true)]
    system_prompt: String,

    #[arg(short, long, required = true)]
    instruction_prompt: String,
}

#[derive(Serialize, Deserialize)]
struct Prediction {
    prompt: String,
    prediction: String,
}

fn main() -> io::Result<()> {
    let cli = Cli::parse();

    // Load up our test model
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path();
    let mut llama = Llama::new();
    let load_success =
        llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert_eq!(load_success, true);
    assert_eq!(llama.is_loaded(), true);

    // Set the text generation parameters up
    let mut params = ManagedGptParams::defaults();
    params.params.seed = u32::MAX;
    params.params.n_threads = -1;
    params.params.n_predict = 100;
    params.params.temp = 0.7; // Increased temperature for more randomness
    params.params.top_k = 50; // Increased top-k for more diversity
    params.params.top_p = 0.9; // Increased top-p for more diversity
    params.params.min_p = 0.05; // Lower min_p to allow more tokens
    params.params.penalty_repeat = 1.0; // Reduced penalty for repeat tokens
    params.params.penalty_last_n = 32; // Reduced penalty last n tokens
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;
    let antiprompts = vec!["<|end|>"];
    let prompt = format!(
        "<|user|>\n{}\n<|end|>\n<|assistant|>\n{}",
        cli.system_prompt, cli.instruction_prompt
    );
    params.set_antiprompts(&antiprompts);
    params.set_prompt(prompt.as_str());

    params.params.dry_multiplier = 0.8;
    params.params.dry_base = 1.75;
    params.params.dry_allowed_length = 2;
    params.params.dry_penalty_last_n = -1;
    let seq_breakers = vec!["\n", ":", "\"", "*"];
    params.set_dry_sequence_breakers(&seq_breakers);

    // Get the prompt ingested into the context and pull the sampler
    // used in the process so that repeat penalties and such are
    // accounted for.
    let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
    assert_eq!(prompt_token_count > 0, true);

    // Freeze the state after processing the prompt so that we can generate
    // a second block of text after the first one without having to reprocess
    // the prompt.
    let frozen_prompt = llama.freeze(&mut params, None);

    let mut predictions: Vec<Prediction> = Vec::new();

    for _ in 0..cli.count {
        // Start our prediction loop now that the prompt has been processed
        let mut prediction_tokens: TokenList = vec![];
        while prediction_tokens.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut first_sampler);

            if prediction_tokens.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }

            prediction_tokens.push(next_token);
        }

        // Print out our prediction
        let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
        println!(
            "Prediction (tokens: {})\n{}",
            prediction_tokens.len(),
            prediction_str
        );
        assert!(!prediction_str.is_empty());
        assert!(prediction_tokens.len() > 0);

        // Store the prediction
        predictions.push(Prediction {
            prompt: prompt.clone(),
            prediction: prediction_str,
        });

        // Defrost our frozen state from processing the prompt and generate something new
        let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
        first_sampler = next_sampler;
    }

    // Serialize predictions to JSONL and save to a file
    let file_path = "predictions.jsonl";
    let mut file = File::create(file_path)?;

    for prediction in predictions {
        let json = serde_json::to_string(&prediction)?;
        writeln!(file, "{}", json)?;
    }

    println!("Predictions saved to {}", file_path);

    Ok(())
}

// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
    let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
    if let Ok(fp) = model_filepath {
        return fp;
    } else {
        println!(
            "Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
        );
        exit(1);
    }
}

Qwen 2.5 Coder 32B Conclusion

The model was very close to getting everything right in one shot, but trying to correct the one silly error via Chat would have sent me down a spiral of progressively worse problems. It’s a shame that it rewrote code that wasn’t a problem and made it worse … and, in general, it doesn’t seem to want to be concise in chat by default. I’m almost positive I could add a system prompt to adjust that, however.

All that said, the example it generated was actually better than Nemotron 70B, picking up on the whole prompt caching trick by freezing processed prompt tokens. I suspect that I might have to be more specific with Qwen 2.5 Coder 32B than I would have to with Nemotron 70B, but such is the case with any smaller LLM versus a significantly larger one.


The Crowdpleaser - Claude 3.5 Sonnet

Everyone seems to rave about this model, so let’s take it for a spin with the exact same workflow as the other models. Once everything is configured correctly in continue.dev and VS Code, I start off with a simple question:

Is it easy to add examples to Rust library projects?

Claude 3.5 Sonnet’s Response

Yes, it’s quite straightforward to add examples to Rust library projects! You can add example code in two main ways:

  1. In the examples/ directory of your project: Create standalone example files that demonstrate how to use your library. These can be run directly with cargo run --example example_name.

  2. As documentation examples: Write examples directly in your code’s documentation using doc comments (/// or //!). These serve both as documentation and as testable examples through cargo test.

Here’s a quick example of how you would set up both:

/// A simple function that adds two numbers
///
/// # Example
///
/// ```
/// use your_crate_name::add;
/// assert_eq!(add(2, 2), 4);
/// ```
pub fn add(a: i32, b: i32) -> i32 {
    a + b
}

And in a separate example file:

use your_crate_name::add;

fn main() {
    let result = add(10, 5);
    println!("10 + 5 = {}", result);
}

You can run this example using:

cargo run --example basic_math

Both approaches are valuable:

  • Doc examples are great for showing quick, focused usage snippets
  • Standalone examples are better for more complex, real-world usage scenarios

Would you like me to show more specific examples or explain any particular aspect in more detail?


An Interesting Response To a Simple Question

I don’t know how I feel about this one. On one hand, it’s clear and concise, but on the other it doesn’t explicitly say where to put that basic_math.rs file (it belongs in the project’s examples/ directory), though that should be implied well enough, I suppose. Let’s follow up with the first set of demands in the same way as we did with the other models:

I want to create a new example for this Rust project. It should take three 
command line args, parsed with the `clap` library, for 'count', 
'system prompt' and 'instruction prompt'.

Then, it should follow the text prediction workflow for the library as 
illustrated in the @step_prediction_test function. I would like to perform 
'count' number of text predictions for a prompt built by the 'system prompt' 
and 'instruction prompt' arguments on the command line.

I want to put the resulting predicted text into a vector and then turn it 
into a JSONL file at the end, serializing it and saving it to the file system.

Claude 3.5 Sonnet’s Response

I’ll help you create an example that combines clap, the text prediction workflow, and JSONL output. Here’s how we can do that:

use clap::Parser;
use serde::Serialize;
use std::fs::File;
use std::io::Write;
use woolyrust::{Llama, ManagedGptParams, TokenList};

#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// Number of predictions to generate
    #[arg(short, long)]
    count: usize,

    /// System prompt to use
    #[arg(short, long)]
    system_prompt: String,

    /// Instruction prompt to use
    #[arg(short, long)]
    instruction_prompt: String,
}

#[derive(Serialize)]
struct Prediction {
    system_prompt: String,
    instruction_prompt: String,
    generated_text: String,
    token_count: usize,
}

fn main() -> std::io::Result<()> {
    let args = Args::parse();
    
    // Initialize model
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = "path/to/your/model.gguf"; // Update with actual path
    let mut llama = Llama::new();
    let load_success = llama.load_model(model_filepath, model_params, context_params, true);
    assert!(load_success);

    // Set up generation parameters
    let mut params = ManagedGptParams::defaults();
    params.params.seed = 42;
    params.params.n_threads = -1;
    params.params.n_predict = 100;
    params.params.temp = 0.1;
    params.params.top_k = 1;
    params.params.top_p = 1.0;
    params.params.min_p = 0.1;
    params.params.penalty_repeat = 1.1;
    params.params.penalty_last_n = 512;
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;

    let antiprompts = vec!["<|end|>"];
    params.set_antiprompts(&antiprompts);

    // Store predictions
    let mut predictions: Vec<Prediction> = Vec::new();

    // Generate multiple predictions
    for _ in 0..args.count {
        // Construct prompt
        let prompt = format!(
            "<|system|>\n{}<|end|>\n<|user|>\n{}<|end|>\n<|assistant|>\n",
            args.system_prompt, args.instruction_prompt
        );
        params.set_prompt(&prompt);

        // Process prompt and get sampler
        let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
        assert!(prompt_token_count > 0);

        // Generate prediction
        let mut tokens: TokenList = vec![];
        while tokens.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut sampler);

            if tokens.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }

            tokens.push(next_token);
        }

        // Get predicted text
        let prediction_str = llama.detokenize_text(&mut tokens, false);
        
        // Store prediction
        predictions.push(Prediction {
            system_prompt: args.system_prompt.clone(),
            instruction_prompt: args.instruction_prompt.clone(),
            generated_text: prediction_str,
            token_count: tokens.len(),
        });
    }

    // Write to JSONL file
    let mut file = File::create("predictions.jsonl")?;
    for prediction in predictions {
        let json_line = serde_json::to_string(&prediction)?;
        writeln!(file, "{}", json_line)?;
    }

    Ok(())
}

To use this example, you’ll need to add these dependencies to your Cargo.toml:

[dependencies]
clap = { version = "4.4", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

You can run this example with:

cargo run --example batch_predictions -- --count 5 --system-prompt "You are a creative writer." --instruction-prompt "Write a short story about a robot."

This example:

  1. Uses clap to parse command line arguments
  2. Sets up the Llama model with the same parameters as your test
  3. Generates the requested number of predictions
  4. Stores each prediction with its metadata in a struct
  5. Writes the predictions to a JSONL file where each line is a valid JSON object

The output will be saved in predictions.jsonl, with each line containing a JSON object with the system prompt, instruction prompt, generated text, and token count.

Note: You’ll need to update the model_filepath variable to point to your actual model file location.


One and Done!

As a compliment to Nemotron, I’d say that Claude generates code in much the same way. It doesn’t get stuck on my model file being pulled from an external function, and it sets the prompt appropriately.

Pasting the code into the file produces no compiler errors. On a more detailed review, though, I notice it doesn’t pick up on the prompt caching strategy that Qwen 2.5 Coder 32B used, which is a shame.

Pulling in my get_test_model_path() function from the integration test, I update the code to call that for the model file path instead of the placeholder string, and then run the example.
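
The edit itself is tiny. Roughly, and matching the final code further down, it just swaps the hard-coded path for the helper:

    // Before (Claude's placeholder):
    // let model_filepath = "path/to/your/model.gguf"; // Update with actual path

    // After: pull the path from the WOOLY_TEST_MODEL_FILE environment variable
    // through the helper copied over from the integration test.
    let model_filepath = get_test_model_path();
    let load_success =
        llama.load_model(model_filepath.as_str(), model_params, context_params, true);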

This version of the example produces no terminal output, but the predictions.jsonl file is created appropriately and looks fine (formatted for reading):

{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"# Sample Text Block

## Introduction
This is a sample text block written in Markdown format. It includes headings, paragraphs, and links.

### Key Features of Markdown

*   **Headings**: Use `# Heading` to create headings of different levels (e.g., `# Heading 1`, `## Heading 2`, etc.)
*   **Paragraphs**: Simply start a new line to create a paragraph.
*   **Links**: Use `[text](url)` to create links.
",
"token_count":100}

Claude 3.5 Sonnet successfully one-shotted the task, and the extra text around the code was clear and concise. Very well done! Let’s see if it can properly adjust the sampler settings so that looping over a number of iterations doesn’t produce the same output every time.

I update continue.dev’s index by clicking the ‘…’ button and choosing ‘re-index’ before asking the next follow-up question:

Okay, I have the example code placed in @dataset_generator.rs and it's working
well for a `count` of 1. However, the sampler settings encourage consistent 
text generation. Can you supply just the code needed to update the sampler 
parameters so I can get different text each loop iteration?

Claude 3.5 Sonnet’s Response

Here’s how to update the generation parameters inside the loop to ensure more variety between generations:

// ... existing code ...

    // Generate multiple predictions
    for i in 0..args.count {
        // Update sampling parameters for variety
        params.params.seed = 42 + i as i32; // Different seed each iteration
        params.params.temp = 0.7;           // Higher temperature for more variety
        params.params.top_k = 40;           // Allow more token choices
        params.params.top_p = 0.9;          // Sample from more of the distribution
        params.params.min_p = 0.05;         // Lower minimum probability threshold
        params.params.penalty_repeat = 1.2;  // Slightly stronger repetition penalty

        // Construct prompt
        let prompt = format!(
            "<|system|>\n{}<|end|>\n<|user|>\n{}<|end|>\n<|assistant|>\n",
            args.system_prompt, args.instruction_prompt
        );
        params.set_prompt(&prompt);

        // ... existing code ...
    }

// ... existing code ...

These adjustments will make each generation more creative and different from the others, while still maintaining reasonable coherence. The key changes are:

  1. Using a different seed for each iteration
  2. Higher temperature (0.7 vs 0.1)
  3. Higher top_k (40 vs 1)
  4. Lower top_p (0.9 vs 1.0)
  5. Lower min_p (0.05 vs 0.1)

I Mean … Yeah …

The fix is a little sloppy: the only thing actually changing is the seed, and even then it will generate duplicates when the example is invoked multiple times from the command line. But it is basically correct. It misses the seed being a u32, but I don’t believe it had access to that information in its context window.
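
For the record, here is the seed line as I actually applied it, matching the final code below, along with a commented-out (and untested here) variation on Qwen’s earlier clock-based idea, for anyone who wants separate command-line runs to differ as well:

        // Different seed each loop iteration; cast to u32 to match the field type.
        params.params.seed = 42 + i as u32;

        // Untested alternative: base the seed on the clock, similar to Qwen's
        // earlier suggestion, so that separate invocations also vary.
        // let base = std::time::SystemTime::now()
        //     .duration_since(std::time::UNIX_EPOCH)
        //     .expect("Time went backwards")
        //     .as_secs() as u32;
        // params.params.seed = base.wrapping_add(i as u32);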

After making the modifications, I run the example with a count of three and this is the resulting predictions.jsonl file (formatted for reading):

{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"## Sample Text Block

### Heading 1
This is the first heading.

#### Subheading 2
Another subheading that's part of this main title.


#### *Italicized* Text 
This piece of content will be displayed as italicized.

#### **Bold** Text 
And another bold one, because there should be multiple levels of emphasis here too!

#### _Underlined_ Text _
To create an underlined link or highlighter you'll use underscores instead. This won't work for",
"token_count":100
}
{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"# Sample Text Block
## Introduction

Welcome to our new restaurant, where the food is delicious and the atmosphere is cozy. Our menu features a variety of dishes from around the world.

### Menu Items

* Appetizers:
\t+ Spinach and Artichoke Dip ($8)
\t+ Bruschetta with Fresh Tomatoes ($7)

### Beverages

* Coffee: $2-$5 per cup
* Tea: $3-$6 per pot

- - -

Would you like me",
"token_count":100
}
{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"# Sample Text Block

## Introduction
Welcome to our tutorial on creating a compelling narrative. In this section, we will explore the fundamentals of storytelling and provide you with practical tips to improve your writing skills.

### Key Elements of Storytelling

*   **Character Development**: Well-rounded characters are essential for engaging readers. Give your characters unique personalities, motivations, and backstories.
*   **Plot Structure**: A clear plot structure helps maintain reader interest. Use techniques like conflict resolution, rising action,",
"token_count":100
}

Final Claude 3.5 Sonnet Example Code

use clap::Parser;
use serde::Serialize;
use std::io::Write;
use std::{fs::File, process::exit};
use woolyrust::{Llama, ManagedGptParams, TokenList};

#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// Number of predictions to generate
    #[arg(short, long)]
    count: usize,

    /// System prompt to use
    #[arg(short, long)]
    system_prompt: String,

    /// Instruction prompt to use
    #[arg(short, long)]
    instruction_prompt: String,
}

#[derive(Serialize)]
struct Prediction {
    system_prompt: String,
    instruction_prompt: String,
    generated_text: String,
    token_count: usize,
}

fn main() -> std::io::Result<()> {
    let args = Args::parse();

    // Initialize model
    let mut model_params = woolyrust::get_default_model_params();
    model_params.n_gpu_layers = 100;
    let mut context_params = woolyrust::get_default_context_params();
    context_params.n_ctx = 1024 * 2;

    let model_filepath = get_test_model_path();
    let mut llama = Llama::new();
    let load_success =
        llama.load_model(model_filepath.as_str(), model_params, context_params, true);
    assert!(load_success);

    // Set up generation parameters
    let mut params = ManagedGptParams::defaults();
    params.params.seed = 42;
    params.params.n_threads = -1;
    params.params.n_predict = 100;
    params.params.temp = 0.1;
    params.params.top_k = 1;
    params.params.top_p = 1.0;
    params.params.min_p = 0.1;
    params.params.penalty_repeat = 1.1;
    params.params.penalty_last_n = 512;
    params.params.ignore_eos = false;
    params.params.flash_attn = true;
    params.params.n_batch = 128;
    params.params.prompt_cache_all = false;

    let antiprompts = vec!["<|end|>"];
    params.set_antiprompts(&antiprompts);

    // Store predictions
    let mut predictions: Vec<Prediction> = Vec::new();

    // Generate multiple predictions
    for i in 0..args.count {
        // Update sampling parameters for variety
        params.params.seed = 42 + i as u32; // Different seed each iteration
        params.params.temp = 0.7; // Higher temperature for more variety
        params.params.top_k = 40; // Allow more token choices
        params.params.top_p = 0.9; // Sample from more of the distribution
        params.params.min_p = 0.05; // Lower minimum probability threshold
        params.params.penalty_repeat = 1.2; // Slightly stronger repetition penalty

        // Construct prompt
        let prompt = format!(
            "<|system|>\n{}<|end|>\n<|user|>\n{}<|end|>\n<|assistant|>\n",
            args.system_prompt, args.instruction_prompt
        );
        params.set_prompt(&prompt);

        // Process prompt and get sampler
        let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
        assert!(prompt_token_count > 0);

        // Generate prediction
        let mut tokens: TokenList = vec![];
        while tokens.len() < params.params.n_predict as usize {
            let next_token = llama.sample_next_token(&mut sampler);

            if tokens.len() < params.params.n_predict as usize {
                let processed = llama.process_next_token(next_token);
                assert!(processed);
            }

            tokens.push(next_token);
        }

        // Get predicted text
        let prediction_str = llama.detokenize_text(&mut tokens, false);

        // Store prediction
        predictions.push(Prediction {
            system_prompt: args.system_prompt.clone(),
            instruction_prompt: args.instruction_prompt.clone(),
            generated_text: prediction_str,
            token_count: tokens.len(),
        });
    }

    // Write to JSONL file
    let mut file = File::create("predictions.jsonl")?;
    for prediction in predictions {
        let json_line = serde_json::to_string(&prediction)?;
        writeln!(file, "{}", json_line)?;
    }

    Ok(())
}

// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
    let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
    if let Ok(fp) = model_filepath {
        return fp;
    } else {
        println!(
            "Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
        );
        exit(1);
    }
}

Claude 3.5 Sonnet Conclusion

Well, what’s to say? It was efficient and clear, and it nailed everything in one shot. Very impressive. The latency through openrouter was much more noticeable with this model, though I wouldn’t say it was inconveniently slow. The price, however, ate into my credits much faster: just these three calls cost me about $0.06, which admittedly isn’t much … but I think that going crazy with it might see my credit balance getting absolutely annihilated.

Overall Conclusion

I will say that I’m a little sad that Qwen 2.5 Coder 32B (quantized!) was the only model that picked up on the ability to cache the processed prompt tokens (the pattern is sketched below). I figured for sure Claude would have figured that out; it is literally the whole point of the integration test function I put into context with continue.dev’s tagging for the models to base the new example on.
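
For anyone skimming, the caching trick boils down to the following pattern, distilled from the Qwen example above (not self-contained; llama, params, and the loop count all come from that code):

    // Ingest the prompt once and snapshot the context state afterwards.
    let (_prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
    let frozen_prompt = llama.freeze(&mut params, None);

    for _ in 0..count {
        // ... sample and process tokens with `sampler`, as in the example above ...

        // Restore the post-prompt state instead of re-processing the prompt text.
        let (_, next_sampler) = llama.defrost(&mut params, &frozen_prompt);
        sampler = next_sampler;
    }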

Otherwise, I’d say Qwen 2.5 Coder 32B had a really good showing. It’s a capable ‘smaller’ model that fits into consumer GPUs more readily, and, like other ‘smaller’ models, it needs a little more hand-holding to achieve maximum results. Claude and Nemotron both felt a little easier to work with because of that.

Maybe it’s also no surprise that Claude 3.5 Sonnet was simply the best of the models tested. It’s expensive and slower, but it got everything right. If I ever get around to writing another one of these blog articles with a more advanced test than this basic Rust example based off an existing integration test, I’ll make sure to pit it against an OpenAI model. In my limited experience, ‘o1-preview’ gets things right that nothing else does…

Llama 3.1 Nemotron 70B did okay. For the price on openrouter, I’d have no problem keeping it in my rotation, and I’m definitely going to test it out further. I also don’t have much experience with other 70B-class LLMs like Llama 3.3 70B, so I don’t know how it compares.

I think an easy conclusion to draw is to configure multiple models in continue.dev so that you can switch to the high-performance, high-cost models only when needed.

This took a few hours to put together, even with how sloppy and unedited my writing is, and I still regret not being able to test different quants of Qwen 2.5 Coder 32B to see if anything changes significantly. There are also other models that deserve a mention in this context, such as the smaller ‘Codestral 22B’ and of course OpenAI’s ‘o1-preview’. But the article already felt too long for anyone to actually read, so I stopped at just these three models.

Ahhh … But Which Version Was The Chosen One?

Wondering which version I kept to develop further?

Drumroll, please! … … …

The Qwen 2.5 Coder 32B version. It had the prompt caching already enabled and was the easiest to take forward.

Errata

I didn’t realize until the very end that the way I was copying responses from the AI had somehow escaped all the strings and also replaced the “\n” string with actual newlines, so I had to repair all of that by hand. If you spot something that doesn’t look right because quotes are escaped or there’s a newline where there shouldn’t be … that’s probably my fault. My bad.