A Short Tale of Modern Software Development
How I Decided To Briefly Test Coding LLMs
The idea started simply enough: Can I just use a large language model to determine the style of writing in a chunk of text - Markdown, Novel (all text in quotes), or Roleplaying (actions in between asterisks; everything else is speech)?
I’ve hand-written some algorithms to parse these in the past, but it’s always kind of a nuisance. The thought occurred to me that I could build a custom text classifier neural network and train that, which sounded fun. Maybe I could even port that to GGML so that I could use it in my other apps that use llama.cpp?
Then it occurred to me that I could use a lot of my existing code - once I add LoRA support - to do the text classification if I just fine-tune a smaller LLM, like maybe a 1B Llama model. But to do the fine-tuning I would need a dataset. It seemed like a prime opportunity to write an example for one of my libraries to have it act like a synthetic dataset generator.
Slowly the idea for this blog post started forming. Why not test some different models out with continue.dev from within VS Code, now that I finally bought some credits on openrouter? This makes the 70B+ tier of LLMs more usable, since my workstation only gets a token or two a second with those.
In the end, I decided to give it a go and document the process. Hopefully it can show what using tools like continue.dev can do besides the obvious code completion work.
I’ll start by asking each model how to add an example to a Rust project. From there, I follow up with a specific list of demands to make an example that uses clap for CLI argument parsing and that derives its text prediction code from an integration test of my Rust project - highlighting a natural workflow in continue.dev. Then, if errors are encountered, I try to address them in the Chat window of continue.dev and see what happens. Overall, each of the models creates a simple example of how to load an LLM with my wrapper library woolyrust and then generate a variable number of replies while saving them off to a JSONL file. The task is not extremely complicated.
With the coding models, I include their responses verbatim under their own headings, leaving their markdown intact. The verbosity of these models makes this blog post come out nightmarishly long, so I’ve provided a TOC linking to the start of each model’s tests.
Models Tested:
And the TLDR: Overall Conclusion
[Editors note: No images for now until I make some. I’m trying to actually do my own this time instead of AI renders. It’s gonna be a little random and weird, but … fuckit it’s my blog, right? Just need a bit more time for the eye candy.]
First Things First
While starting to assess different models for coding projects, I wandered onto a leaderboard that brought to mind some different models to try besides the ones I’ve grown comfortable with: Qwen 2.5 Coder 32B and Codestral 22B. With that in mind, I configured continue.dev to use openrouter and set it up for the Llama 3.1 Nemotron 70B model.
The typical workflow for having AI assist on bigger ideas, as opposed to single-task questions, is to start small and build up. To highlight the ability of these models to work with existing code, I decided to use my woolyrust project and opened it up in VS Code. Woolyrust is a Rust language binding for woolycore, my library that provides a high-level C interface to llama.cpp, making it easier to write language bindings.
My Opening Nemotron 70B Question
With my woolyrust project open, I open the continue.dev tab to bring up the ‘chat’ interface and ask it a simple question.
Is it easy to add examples to Rust library projects?
Nemotron 70B’s Response
Adding examples to Rust library projects is indeed relatively easy and encouraged, as it helps demonstrate the usage of your library to users. Here’s a step-by-step guide on how to add examples to a Rust library project:
1. Create an `examples` Directory
Inside your library’s root directory (where your `Cargo.toml` file is located), create a new directory named `examples` if it doesn’t already exist.
2. Add Your Example Files
Within the `examples` directory, create one or more Rust files (e.g., `my_example.rs`, `advanced_usage.rs`, etc.). Each file will contain an example of how to use your library. These files should be regular Rust programs that use your library as an external crate would.
Example File Structure (`my_example.rs`):
// my_example.rs
extern crate my_library; // Use the exact name from your Cargo.toml
use my_library::{SomeModule, AnotherItemYouWantToDemonstrate};
fn main() {
// Example usage of your library's functionality
let result = SomeModule::some_function();
println!("Example result: {}", result);
// More examples...
}
3. (Optional) Specify Example Dependencies
If your examples require additional dependencies not needed by your library itself (e.g., for testing or logging in the context of the example), you can specify these in the `[dev-dependencies]` section of your `Cargo.toml`. This is because examples are typically built and run with `cargo build --examples` or `cargo run --example <name>`, which considers dev dependencies.
Specifying Dev Dependencies (`Cargo.toml` excerpt):
[dev-dependencies]
"Just-an-example = "1.2.3"
4. Running Your Examples
To build and run an example, navigate to your project’s root directory in your terminal and use the following Cargo command, replacing `<example_name>` with the name of your example file without the `.rs` extension:
cargo run --example my_example
Additional Tips:
- Documentation Examples: Consider using your examples in documentation. Rustdoc allows embedding code examples that are actually run as tests to ensure they remain valid.
- Testing Your Examples: While not necessary, you might want to ensure your examples continue to work as your library evolves. You can convert examples into integration tests by moving them into the `tests` directory and modifying them to assert expected outcomes instead of printing them.
- Keep Examples Up-to-Date: Regularly review and update your examples to reflect the best practices for using your library, especially when its API changes.
By following these steps, you can easily enhance your Rust library project with meaningful examples that will significantly benefit your users.
With the Basics Established
Now that I’ve got the model thinking about what I want to do - adding a new example to my existing project - I continue the chat with a far more detailed and nuanced demand.
I want to create a new example for this Rust project. It should take three
command line args, parsed with the `clap` library, for 'count',
'system prompt' and 'instruction prompt'.
Then, it should follow the text prediction workflow for the library as
illustrated in the @step_prediction_test function. I would like to perform
'count' number of text predictions for a prompt built by the 'system prompt'
and 'instruction prompt' arguments on the command line.
I want to put the resulting predicted text into a vector and then turn it
into a JSONL file at the end, serializing it and saving it to the file system.
Nemotron-70B’s Response
Below is an example of how you can create a new executable in your Rust project that meets your requirements. This example assumes you have `clap` for command-line argument parsing and `serde` & `serde_json` for JSON serialization. If you haven’t added these dependencies to your `Cargo.toml` yet, you’ll need to do so:
[dependencies]
clap = "3"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
Create a new file named `cli_predictor.rs` in the `examples` directory:
// examples/cli_predictor.rs
use clap::{App, Arg};
use serde_json;
use std::fs::File;
use std::io::Write;
use std::path::Path;
// Import your library's functionality
use woolyrust::{Llama, ManagedGptParams, TokenList};
fn main() {
// Clap App Setup
let matches = App::new("CLI Text Predictor")
.version("1.0")
.author("Your Name")
.about("Predicts text based on system and instruction prompts")
.arg(
Arg::with_name("count")
.short("c")
.long("count")
.help("Number of predictions to generate")
.required(true)
.takes_value(true),
)
.arg(
Arg::with_name("system_prompt")
.short("s")
.long("system")
.help("System prompt for the model")
.required(true)
.takes_value(true),
)
.arg(
Arg::with_name("instruction_prompt")
.short("i")
.long("instruction")
.help("Instruction prompt for the model")
.required(true)
.takes_value(true),
)
.get_matches();
// Extract command-line arguments
let count: usize = matches.value_of("count").unwrap().parse().expect("Invalid count");
let system_prompt = matches.value_of("system_prompt").unwrap();
let instruction_prompt = matches.value_of("instruction_prompt").unwrap();
// Setup library model (Similar to step_prediction_test)
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path(); // Assuming this function exists as in your test
let mut llama = Llama::new();
let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert_eq!(load_success, true);
assert_eq!(llama.is_loaded(), true);
//Prediction parameters (adjust as needed)
let mut params = ManagedGptParams::defaults();
params.params.seed = 42;
params.params.n_threads = -1;
params.params.n_predict = 100; // Per prediction length
params.params.temp = 0.1;
params.params.top_k = 1;
params.params.top_p = 1.0;
params.params.min_p = 0.1;
params.params.penalty_repeat = 1.1;
params.params.penalty_last_n = 512;
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
// Vector to hold JSON serializable prediction results
#[derive(serde::Serialize)]
struct PredictionResult {
prompt: String,
prediction: String,
}
let mut prediction_results = Vec::new();
for _ in 0..count {
// Construct prompt
let prompt = format!("{}{}\n<|assistant|>\n", system_prompt, instruction_prompt);
// Process prompt and generate prediction (simplified from your test)
let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
assert_eq!(prompt_token_count > 0, true);
let mut predictions: TokenList = vec![];
while predictions.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut sampler);
if predictions.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
predictions.push(next_token);
}
let prediction_str = llama.detokenize_text(&mut predictions, false);
assert!(!prediction_str.is_empty());
// Store result for JSON serialization
prediction_results.push(PredictionResult {
prompt: prompt.clone(),
prediction: prediction_str.clone(),
});
println!("Prediction {}\nPrompt: {}\nPrediction: {}\n", _+1, prompt, prediction_str);
}
// Serialize and save to JSONL file
let jsonl_file_path = Path::new("predictions.jsonl");
let mut file = File::create(jsonl_file_path).expect("Failed to create file");
for result in prediction_results {
let json = serde_json::to_string(&result).expect("Failed to serialize");
file.write_all(format!("{}\n", json).as_bytes())
.expect("Failed to write to file");
}
println!("Predictions saved to {}", jsonl_file_path.display());
}
To run this example:
Navigate to your project’s root directory in your terminal and execute:
cargo run --example cli_predictor -- -c 3 -s "System prompt here" -i "Instruction prompt here"
Replace `-c 3` with your desired prediction count, and `-s` and `-i` with your system and instruction prompts, respectively. The predictions will be saved to a file named `predictions.jsonl` in the current working directory.
Implementing the First Step
The replies all seemed reasonable to me out of the gate, so I end up trying it out. I create the `examples` directory and touch `examples/dataset_generator.rs` to create the example file. I add the three new dependencies as dev dependencies, unlike the suggested fragment from Nemotron, since they’re only needed for this example. Then I paste in the generated code…
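For reference, the dev-dependency block ends up looking something like the sketch below - the crate versions are just the ones suggested in the chats, so treat them as placeholders rather than what’s actually pinned in my Cargo.toml:
[dev-dependencies]
clap = { version = "4", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"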
At this point, I want to highlight what is really cool about continue.dev: you can easily reference files, structs, or functions while chatting. Notice the `@step_prediction_test` reference in my question above? That pulled my integration test function into the AI’s context so that it knew exactly how I wanted to generate text using my library! And looking at the generated code, I can tell it definitely copied what I was doing there.
But sadly, there were problems. The generated code involving `clap` didn’t compile: `Arg::with_name()` isn’t found, and the module can’t use `clap::App` because it doesn’t exist. So I ask Nemotron to fix its mistakes:
There's no `App` type in `clap` and `Arg::with_name()` doesn't exist.
Rewrite the CLI arg parsing to use the `derive` feature of `clap` (already
enabled in my cargo.toml) to parse the arguments into a struct.
Nemotron 70B’s Response
Here’s how you can rewrite the CLI argument parsing to use the `derive` feature of `clap`. This approach defines a struct for the arguments and derives the `Parser` trait for automatic parsing.
First, ensure you have the `derive` feature enabled for `clap` in your `Cargo.toml`:
[dependencies]
clap = { version = "3", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
Modified `cli_predictor.rs` with derived parser:
// examples/cli_predictor.rs
use clap::Parser;
use serde_json;
use std::fs::File;
use std::io::Write;
use std::path::Path;
// Import your library's functionality
use yourlibrary::{Llama, ManagedGptParams, TokenList};
// Define a struct for the CLI arguments
#[derive(Parser, Debug)]
#[clap(author, version, about, long_about = None)]
struct Args {
/// Number of predictions to generate
#[clap(short, long, value_parser)]
count: usize,
/// System prompt for the model
#[clap(short = 's', long, value_parser)]
system_prompt: String,
/// Instruction prompt for the model
#[clap(short = 'i', long, value_parser)]
instruction_prompt: String,
}
fn main() {
// Parse CLI arguments into the `Args` struct
let args = Args::parse();
// Rest of your application remains the same, using fields from `args`
// Setup library model (Similar to step_prediction_test)
let mut model_params = yourlibrary::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = yourlibrary::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path(); // Assuming this function exists as in your test
let mut llama = Llama::new();
let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert_eq!(load_success, true);
assert_eq!(llama.is_loaded(), true);
// Prediction parameters (adjust as needed)
let mut params = ManagedGptParams::defaults();
//... (rest of the prediction setup remains the same)
let mut prediction_results = Vec::new();
for _ in 0..args.count {
// Construct prompt
let prompt = format!("{}{}\n<|assistant|>\n", args.system_prompt, args.instruction_prompt);
// Process prompt and generate prediction (simplified from your test)
//... (prediction logic remains the same)
// Store result for JSON serialization
prediction_results.push(PredictionResult {
prompt: prompt.clone(),
prediction: prediction_str.clone(),
});
println!("Prediction {}\nPrompt: {}\nPrediction: {}\n", _+1, prompt, prediction_str);
}
// Serialize and save to JSONL file
let jsonl_file_path = Path::new("predictions.jsonl");
let mut file = File::create(jsonl_file_path).expect("Failed to create file");
for result in prediction_results {
let json = serde_json::to_string(&result).expect("Failed to serialize");
file.write_all(format!("{}\n", json).as_bytes())
.expect("Failed to write to file");
}
println!("Predictions saved to {}", jsonl_file_path.display());
}
Note:
- Replace `yourlibrary` with the actual name of your library when importing (`yourlibrary::{Llama, ManagedGptParams, TokenList};`).
- The `PredictionResult` struct definition is assumed to be the same as before; if not defined in this file, ensure it’s defined similarly with the `serde::Serialize` derive macro for JSON serialization.
Run the example with:
cargo run --example cli_predictor -- --count 3 --system-prompt "System prompt here" --instruction-prompt "Instruction prompt here"
Or use the short flags if preferred:
cargo run --example cli_predictor -- -c 3 -s "System prompt here" -i "Instruction prompt here"
Are We Working Yet?
So I make the suggested swaps to the `clap` code and the new struct-based parsing compiles fine. After that, I fix a very obvious problem with the main text generation loop not having a variable named for the index … because the `println!()` at the end of the loop wants to print the index as `_+1`, which is obviously wrong. I bind the index in the `for` loop to the variable `i` and replace the underscore in the `println!()`.
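For clarity, here’s roughly what that small fix looks like in the loop - this is just an excerpt of the example, not a standalone program:
for i in 0..args.count {
    // ... prompt construction and token generation are unchanged ...

    println!(
        "Prediction {}\nPrompt: {}\nPrediction: {}\n",
        i + 1,
        prompt,
        prediction_str
    );
}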
The last remaining compile error probably comes from the fact that I only referenced `@step_prediction_test` in the chat earlier and didn’t include the whole file with `@step_prediction_tests.rs`, so Nemotron had no way of knowing what the utility function `get_test_model_path()` was. In this example it’s undefined, but I copy my function over from the integration test and paste it into the bottom of the example file, as well as importing `exit` from `std::process`.
Now it compiles! But will it run correctly? No, there’s no chance. When reviewing the code, I spot another error and ask Nemotron to fix it for me just to see if it would. I know the fix, but the whole point of this is to test out these models, right? Let’s see what it gives me.
While it looks like the prompt string gets made using `format!`, it doesn't
actually get set in the `params` that gets sent to the `woolyrust`
library functions. Can you doublecheck the @step_prediction_test example again and make any
corrections necessary to make sure `params` will be configured properly
for text prediciton?
Nemotron 70B’s Response
The provided `step_prediction_test` function appears to correctly set the prompt in the `params` using the `set_prompt` method, not relying on the `format!` macro for this specific task. However, to ensure clarity and address the mention of `format!`, which is not directly used for setting the prompt in `params`, here are the relevant parts of the code with a minor adjustment for better clarity and confirmation that the prompt is properly set:
// Original prompt setup (for reference)
let prompt = "<|user|>\nWrite the start to the next movie collaboration between Quentin Tarantino and Robert Rodriguez.<|end|>\n<|assistant|>\n";
params.set_prompt(prompt);
// Additional prompt setup later in the code (for the second prediction)
let new_prompt_text = "Do you have a suggestion for genre?<|end|>\n<|user|>\nMake it like a Pixar movie script, but with those two authors!<|end|>\n<|assistant|>\n";
let new_prompt_tokens = llama.process_additional_prompt(&mut second_sampler, new_prompt_text);
To explicitly demonstrate using `format!` for constructing a prompt (though not necessary in this case, as the original code directly assigns the prompt string), you could modify the initial prompt setup like so:
// Demonstrative example using format! (not necessary for the original code's functionality)
let movie_collaboration_idea = "Quentin Tarantino and Robert Rodriguez";
let prompt = format!(
"<|user|>\nWrite the start to the next movie collaboration between {}.<|end|>\n<|assistant|>\n",
movie_collaboration_idea
);
params.set_prompt(prompt);
And for the additional prompt:
// Demonstrative example for the additional prompt using format!
let suggested_genre = "Pixar movie script";
let additional_authors = "those two authors";
let new_prompt_text = format!(
"Do you have a suggestion for genre?<|end|>\n<|user|>\nMake it like a {}, but with {}!<|end|>\n<|assistant|>\n",
suggested_genre, additional_authors
);
let new_prompt_tokens = llama.process_additional_prompt(&mut second_sampler, new_prompt_text);
Confirmation of `params` Configuration for Text Prediction:
Based on the provided `step_prediction_test` function, the `params` appear to be properly configured for text prediction, with key configurations including:
- Prompt Setting: The primary and additional prompts are correctly set using `set_prompt` and `process_additional_prompt`, respectively.
- Model and Context Parameters: `model_params` and `context_params` are customized (e.g., `n_gpu_layers`, `n_ctx`).
- Prediction Parameters: `ManagedGptParams` are extensively configured, covering aspects like seeding, threading, prediction length, temperature, top-k and top-p filtering, and more.
- Freezing and Defrosting: States are properly frozen and defrosted to facilitate efficient continuation of predictions.
Given this analysis, the `params` should indeed be correctly configured for the intended text prediction tasks within the `step_prediction_test` function.
Good Enough?
Okay, so the block where it demonstrates creating the prompt with `format!` and follows up with a `params.set_prompt()` call is correct, and seeing it reach further to show how you can add further prompt tokens in with an additional library call is nice… but it still didn’t add the function call into the example code. I figure this is good enough and just add the `set_prompt()` call in myself.
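Concretely, the fix I add by hand mirrors the integration test - build the prompt with `format!` and then pass it to `set_prompt()` as a `&str`:
let prompt = format!(
    "{}{}\n<|assistant|>\n",
    args.system_prompt, args.instruction_prompt
);
params.set_prompt(prompt.as_str());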
Referencing Nemotron’s first reply to figure out how, I decide to run the example and see what happens. Before doing the actual invocation of the example, I need to make sure the environment variable I use for my integration tests is set so it knows what model to load.
export WOOLY_TEST_MODEL_FILE=~/.cache/lm-studio/models/bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_S.gguf
cargo run --example dataset_generator -- --count 1 --system-prompt "You are a creative writing AI." --instruction-prompt "Generate a sample text block in Markdown format."
It worked!
Here’s the output in the terminal:
Prediction 1
Prompt: You are a creative writing AI.Generate a sample text block in Markdown format.
<|assistant|>
Prediction: ## Sample Text Block
### Introduction
This is the introduction to our sample text block. It sets the stage for what's to come and provides context.
#### Key Points
* We will cover some key points about [topic].
* These points will be discussed in detail throughout this document.
* They provide an overview of the main ideas that we'll explore.
### Main Content
#### Section 1: Topic Overview
Our topic is [topic]. This section provides an overview of the
Predictions saved to predictions.jsonl
The `predictions.jsonl` file was created successfully and here’s the serialized result (formatted nicer for reading):
{"prompt":"You are a creative writing AI.Generate a sample text block in Markdown format.
<|assistant|>
","prediction":"## Sample Text Block
### Introduction
This is the introduction to our sample text block. It sets the stage for what's to come and provides context.
#### Key Points
* We will cover some key points about [topic].
* These points will be discussed in detail throughout this document.
* They provide an overview of the main ideas that we'll explore.
### Main Content
#### Section 1: Topic Overview
Our topic is [topic]. This section provides an overview of the"}
At this point, the only other problem I see offhand is the hardcoded `seed` parameter, which is better off as `u32::MAX` to have the library use random seeds. Maybe some randomization of some of the sampling parameters within an allowable range would be useful too.
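A quick sketch of what that cleanup could look like - note that the parameter jitter assumes adding the `rand` crate, which is not currently a dependency of the example:
use rand::Rng;

// u32::MAX tells woolyrust to pick a random seed instead of a fixed one.
params.params.seed = u32::MAX;

// Hypothetical extra: jitter a couple of sampler settings within an allowable range.
let mut rng = rand::thread_rng();
params.params.temp = rng.gen_range(0.7..=1.1);
params.params.top_p = rng.gen_range(0.85..=1.0);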
Final Nemotron 70B Example Code
// examples/cli_predictor.rs
use clap::Parser;
use serde_json;
use std::io::Write;
use std::path::Path;
use std::{fs::File, process::exit};
// Import your library's functionality
use woolyrust::{Llama, ManagedGptParams, TokenList};
// Define a struct for the CLI arguments
#[derive(Parser, Debug)]
#[clap(author, version, about, long_about = None)]
struct Args {
/// Number of predictions to generate
#[clap(short, long, value_parser)]
count: usize,
/// System prompt for the model
#[clap(short = 's', long, value_parser)]
system_prompt: String,
/// Instruction prompt for the model
#[clap(short = 'i', long, value_parser)]
instruction_prompt: String,
}
fn main() {
// Parse CLI arguments into the `Args` struct
let args = Args::parse();
// Setup library model (Similar to step_prediction_test)
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path(); // Assuming this function exists as in your test
let mut llama = Llama::new();
let load_success =
llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert_eq!(load_success, true);
assert_eq!(llama.is_loaded(), true);
//Prediction parameters (adjust as needed)
let mut params = ManagedGptParams::defaults();
params.params.seed = 42;
params.params.n_threads = -1;
params.params.n_predict = 100; // Per prediction length
params.params.temp = 0.1;
params.params.top_k = 1;
params.params.top_p = 1.0;
params.params.min_p = 0.1;
params.params.penalty_repeat = 1.1;
params.params.penalty_last_n = 512;
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
// Vector to hold JSON serializable prediction results
#[derive(serde::Serialize)]
struct PredictionResult {
prompt: String,
prediction: String,
}
let mut prediction_results = Vec::new();
for i in 0..args.count {
// Construct prompt
let prompt = format!(
"{}{}\n<|assistant|>\n",
args.system_prompt, args.instruction_prompt
);
params.set_prompt(prompt.as_str());
// Process prompt and generate prediction (simplified from your test)
let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
assert_eq!(prompt_token_count > 0, true);
let mut predictions: TokenList = vec![];
while predictions.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut sampler);
if predictions.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
predictions.push(next_token);
}
let prediction_str = llama.detokenize_text(&mut predictions, false);
assert!(!prediction_str.is_empty());
// Store result for JSON serialization
prediction_results.push(PredictionResult {
prompt: prompt.clone(),
prediction: prediction_str.clone(),
});
println!(
"Prediction {}\nPrompt: {}\nPrediction: {}\n",
i + 1,
prompt,
prediction_str
);
}
// Serialize and save to JSONL file
let jsonl_file_path = Path::new("predictions.jsonl");
let mut file = File::create(jsonl_file_path).expect("Failed to create file");
for result in prediction_results {
let json = serde_json::to_string(&result).expect("Failed to serialize");
file.write_all(format!("{}\n", json).as_bytes())
.expect("Failed to write to file");
}
println!("Predictions saved to {}", jsonl_file_path.display());
}
// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
if let Ok(fp) = model_filepath {
return fp;
} else {
println!(
"Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
);
exit(1);
}
}
Nemotron 70B Conclusion
Not bad! How much did running all of this cost me in openrouter credits? Less than half a penny! Insane. I’m a big supporter of local AI usage and have only just this week started toying with API access of bigger models, but this price point is amazing. There are privacy concerns and such, for sure. The first thing I did was opt out of letting openrouter do anything with my usage data in their privacy settings, but who knows what the endpoints really do. However, if I didn’t have my 4090 on my workstation, I think it’d be a lot harder to justify purchasing one just to run LLMs.
Next Challenger - Qwen 2.5 Coder 32B
Let’s move that example file off to the side and start this whole thing over to see what it’s like using a model specifically crafted for programming tasks. I’ll be running `qwen2.5-coder-32b-instruct` using the `Q4_K_M` quant with all layers offloaded to my 4090, with LM Studio acting as the API provider on my LAN while I access it from my MacBook Air M3. I load it with a context size of 8192 tokens and LM Studio reports just over 18 GB of VRAM used. My default sampler settings for this model use a temp of 0.3, top_k of 40, repeat penalty of 1.03, top_p of 0.9 and min_p of 0.05. Flash attention is enabled as well.
Still having my woolyrust project open in VS Code, I set continue.dev to use my LAN API endpoint with the qwen2.5 coder model and fire off my first question in a new chat window. (Note: it is the exact same one I used to start the process with Nemotron.)
Is it easy to add examples to Rust library projects?
As the tokens roll in, I would ballpark the speed of this 32B model on my 4090 as being roughly equivalent to what I was pulling from openrouter providers with Nemotron 70B, FWIW.
Qwen2.5 Coder 32B’s Response
Adding examples to Rust library projects is straightforward and can be done in a few simple steps. Here’s how you can do it:
Step 1: Create an `examples` Directory
First, create a directory named `examples` in the root of your Rust project. This is where you will store your example files.
mkdir examples
Step 2: Add Example Files
Inside the `examples` directory, you can add as many Rust files as you need. Each file should contain a `main` function, as it will be compiled as a separate binary.
For example, create a file named `simple_example.rs`:
// examples/simple_example.rs
use your_crate_name::SomeStruct;
fn main() {
let example = SomeStruct::new();
println!("Example: {:?}", example);
}
Step 3: Update `Cargo.toml`
You don’t need to update `Cargo.toml` specifically for the examples, but you can specify dependencies or other configurations if needed. However, Rust automatically recognizes files in the `examples` directory as examples.
Step 4: Run the Examples
To run an example, use the `cargo run` command with the `--example` flag followed by the name of the example file (without the `.rs` extension).
cargo run --example simple_example
Step 5: Build the Examples
If you want to build the examples without running them, you can use the `cargo build` command with the `--examples` flag.
cargo build --examples
Additional Tips
- Documentation: Consider adding documentation comments to your example files to explain what they do. This can be useful for users of your library.
- Testing Examples: You can also run all examples as tests using the `cargo test --examples` command. This ensures that your examples remain up-to-date and work correctly with your library.
By following these steps, you can easily add and manage examples in your Rust library projects, making it easier for others to understand and use your library.
A Good Start
Honestly? I like the clarity of Qwen’s answer a touch better. And interestingly enough, you’ll notice that it’s largely the same response as what Nemotron 70B gave. Subjectively, rendered inside continue.dev’s Chat window, I think Qwen’s response is easier to read.
With the context set, I send it the same detailed request I sent Nemotron:
I want to create a new example for this Rust project. It should take three
command line args, parsed with the `clap` library, for 'count',
'system prompt' and 'instruction prompt'.
Then, it should follow the text prediction workflow for the library as
illustrated in the @step_prediction_test function. I would like to perform
'count' number of text predictions for a prompt built by the 'system prompt'
and 'instruction prompt' arguments on the command line.
I want to put the resulting predicted text into a vector and then turn it
into a JSONL file at the end, serializing it and saving it to the file system.
Qwen2.5 Coder 32B’s Response
Certainly! To create a new example for your Rust project that takes command-line arguments using the `clap` library and performs text predictions, you can follow these steps:
1. Add `clap` and `serde` Dependencies: First, ensure you have the necessary dependencies in your `Cargo.toml`.
2. Create the Example File: Create a new file in the `examples` directory, e.g., `text_prediction.rs`.
3. Implement the Command-Line Argument Parsing and Prediction Workflow: Use `clap` to parse the command-line arguments and then follow the prediction workflow as shown in your `step_prediction_test` function.
4. Serialize and Save to JSONL: Use `serde` to serialize the predictions into a JSONL file.
Here’s how you can do it:
Step 1: Add Dependencies
Add the following dependencies to your `Cargo.toml`:
[dependencies]
clap = { version = "4.0", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
woolyrust = "0.1" # Replace with your actual crate name and version
Step 2: Create the Example File
Create a new file named `text_prediction.rs` in the `examples` directory:
// examples/text_prediction.rs
use clap::Parser;
use serde::{Serialize, Deserialize};
use std::fs::File;
use std::io::{self, Write};
use woolyrust::{Llama, ManagedGptParams, TokenList};
#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
#[arg(short, long, default_value_t = 1)]
count: u32,
#[arg(short, long, required = true)]
system_prompt: String,
#[arg(short, long, required = true)]
instruction_prompt: String,
}
#[derive(Serialize, Deserialize)]
struct Prediction {
prompt: String,
prediction: String,
}
fn main() -> io::Result<()> {
let cli = Cli::parse();
// Load up our test model
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path();
let mut llama = Llama::new();
let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert_eq!(load_success, true);
assert_eq!(llama.is_loaded(), true);
// Set the text generation parameters up
let mut params = ManagedGptParams::defaults();
params.params.seed = 42;
params.params.n_threads = -1;
params.params.n_predict = 100;
params.params.temp = 0.1;
params.params.top_k = 1;
params.params.top_p = 1.0;
params.params.min_p = 0.1;
params.params.penalty_repeat = 1.1;
params.params.penalty_last_n = 512;
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
let antiprompts = vec!["<|end|>"];
let prompt = format!("<|user|>\n{}\n<|end|>\n<|assistant|>\n{}", cli.system_prompt, cli.instruction_prompt);
params.set_antiprompts(&antiprompts);
params.set_prompt(prompt.clone());
params.params.dry_multiplier = 0.8;
params.params.dry_base = 1.75;
params.params.dry_allowed_length = 2;
params.params.dry_penalty_last_n = -1;
let seq_breakers = vec!["\n", ":", "\"", "*"];
params.set_dry_sequence_breakers(&seq_breakers);
// Get the prompt ingested into the context and pull the sampler
// used in the process so that repeat penalties and such are
// accounted for.
let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
assert_eq!(prompt_token_count > 0, true);
// Freeze the state after processing the prompt so that we can generate
// a second block of text after the first one without having to reprocess
// the prompt.
let frozen_prompt = llama.freeze(&mut params, None);
let mut predictions: Vec<Prediction> = Vec::new();
for _ in 0..cli.count {
// Start our prediction loop now that the prompt has been processed
let mut prediction_tokens: TokenList = vec![];
while prediction_tokens.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut first_sampler);
if prediction_tokens.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
prediction_tokens.push(next_token);
}
// Print out our prediction
let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
println!("Prediction (tokens: {})\n{}", prediction_tokens.len(), prediction_str);
assert!(!prediction_str.is_empty());
assert!(prediction_tokens.len() > 0);
// Store the prediction
predictions.push(Prediction {
prompt: prompt.clone(),
prediction: prediction_str,
});
// Defrost our frozen state from processing the prompt and generate something new
let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
first_sampler = next_sampler;
}
// Serialize predictions to JSONL and save to a file
let file_path = "predictions.jsonl";
let mut file = File::create(file_path)?;
for prediction in predictions {
let json = serde_json::to_string(&prediction)?;
writeln!(file, "{}", json)?;
}
println!("Predictions saved to {}", file_path);
Ok(())
}
fn get_test_model_path() -> String {
// Replace with your actual model path
"path/to/your/model.bin".to_string()
}
Step 3: Run the Example
To run the example, use the `cargo run` command with the `--example` flag followed by the name of the example file (without the `.rs` extension).
cargo run --example text_prediction -- --count 3 --system-prompt "Write a movie collaboration" --instruction-prompt "between Quentin Tarantino and Robert Rodriguez"
Explanation
- Command-Line Arguments: The `clap` library is used to parse command-line arguments for `count`, `system_prompt`, and `instruction_prompt`.
- Prediction Workflow: The prediction workflow follows the steps from your `step_prediction_test` function.
- JSONL Serialization: The predictions are serialized into JSONL format and saved to a file named `predictions.jsonl`.
This setup should help you create a robust example for your Rust project that takes command-line arguments and performs multiple text predictions, saving the results in a structured format.
Quite Impressed!
Whoa! Compared to Nemotron, there are a number of things Qwen gets right out of the gate.
Firstly, it wrote the example to have `clap` use a struct for all the command-line parameters, which I prefer. Secondly, if you look at the source step_prediction_tests.rs file, you can tell that it kept much more of the same parameters. Lastly, you can see that it wrote its own implementation of `get_test_model_path()` to make the example complete, since it didn’t have access to the one I wrote in its context.
I’m going to replace Qwen’s placeholder with my actual code, but I’m happy to see it figured that out.
However, there’s still a compiler error on this line, `params.set_prompt(prompt.clone());`, since the function expects a `&str` and not a `String`. Let’s ask Qwen to fix it.
When I paste that code into the source file, there's a compiler error
with `params.set_prompt(prompt.clone());`... can you fix that?
Qwen2.5 Coder 32B’s Response
Certainly! The error you’re encountering is likely due to the `ManagedGptParams` struct not having a `set_prompt` method. Let’s address this by ensuring that the `ManagedGptParams` struct has the necessary methods and fields to handle the prompt.
Since the `woolyrust` library might not have a `set_prompt` method, we need to check the actual API of the `woolyrust` crate to see how to set the prompt correctly. Let’s assume that the `ManagedGptParams` struct has a field for the prompt, and we can directly set it.
Here’s the revised code with the necessary adjustments:
<…SNIP …>
Qwen’s First Misstep
So I cut the response short for this article, but it reiterated the sample code and changed that one line to:
params.prompt = prompt.clone(); // Directly set the prompt field
The problem is that there is no member called ‘prompt’ on that structure. Qwen doesn’t have the struct’s definition in context, only my integration test, so it’s trying to infer that it’s a `String`, when in reality the `ManagedGptParams` struct exists to give a lifetime to the native C strings used for the FFI calls… which is why calling `set_prompt()` is necessary instead of just setting the string. Qwen hoped to simplify it by guessing, but it is wrong. It goes to show how specific you should be with Chat in continue.dev if you have the knowledge.
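To make the difference concrete, the commented-out line below is Qwen’s guess and the call underneath it is what the wrapper actually expects:
// There is no public `prompt` field to assign to:
// params.prompt = prompt.clone();

// Instead, hand the wrapper a &str and let it manage the C string internally:
params.set_prompt(prompt.as_str());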
Let’s try saving the situation:
No, that's not right. @ManagedGptParams does not have a `prompt` member. The `set_prompt()`
function just needs a `&str` passed to it instead. Can you fix your original
sample code up now that you know that?
Qwen2.5 Coder 32B’s Response
Certainly! Given that the `ManagedGptParams` struct does not have a `prompt` field and instead uses a `set_prompt` method, we need to adjust the code accordingly. Here’s the revised version of the code:
// examples/text_prediction.rs
use clap::Parser;
use serde::{Serialize, Deserialize};
use std::fs::File;
use std::io::{self, Write};
use woolyrust::{Llama, ManagedGptParams, TokenList};
#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
#[arg(short, long, default_value_t = 1)]
count: u32,
#[arg(short, long, required = true)]
system_prompt: String,
#[arg(short, long, required = true)]
instruction_prompt: String,
}
#[derive(Serialize, Deserialize)]
struct Prediction {
prompt: String,
prediction: String,
}
fn main() -> io::Result<()> {
let cli = Cli::parse();
// Load up our test model
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path();
let mut llama = Llama::new();
let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert_eq!(load_success, true);
assert_eq!(llama.is_loaded(), true);
// Set the text generation parameters up
let mut params = ManagedGptParams::defaults();
params.params.seed = 42;
params.params.n_threads = -1;
params.params.n_predict = 100;
params.params.temp = 0.1;
params.params.top_k = 1;
params.params.top_p = 1.0;
params.params.min_p = 0.1;
params.params.penalty_repeat = 1.1;
params.params.penalty_last_n = 512;
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
let antiprompts = vec!["<|end|>"];
let prompt = format!("<|user|>\n{}\n<|end|>\n<|assistant|>\n{}", cli.system_prompt, cli.instruction_prompt);
params.antiprompts = antiprompts;
// Set the prompt using the set_prompt method
params.set_prompt(prompt.as_str());
params.params.dry_multiplier = 0.8;
params.params.dry_base = 1.75;
params.params.dry_allowed_length = 2;
params.params.dry_penalty_last_n = -1;
let seq_breakers = vec!["\n", ":", "\"", "*"];
params.dry_sequence_breakers = seq_breakers;
// Get the prompt ingested into the context and pull the sampler
// used in the process so that repeat penalties and such are
// accounted for.
let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
assert_eq!(prompt_token_count > 0, true);
// Freeze the state after processing the prompt so that we can generate
// a second block of text after the first one without having to reprocess
// the prompt.
let frozen_prompt = llama.freeze(&mut params, None);
let mut predictions: Vec<Prediction> = Vec::new();
for _ in 0..cli.count {
// Start our prediction loop now that the prompt has been processed
let mut prediction_tokens: TokenList = vec![];
while prediction_tokens.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut first_sampler);
if prediction_tokens.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
prediction_tokens.push(next_token);
}
// Print out our prediction
let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
println!("Prediction (tokens: {})\n{}", prediction_tokens.len(), prediction_str);
assert!(!prediction_str.is_empty());
assert!(prediction_tokens.len() > 0);
// Store the prediction
predictions.push(Prediction {
prompt: prompt.clone(),
prediction: prediction_str,
});
// Defrost our frozen state from processing the prompt and generate something new
let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
first_sampler = next_sampler;
}
// Serialize predictions to JSONL and save to a file
let file_path = "predictions.jsonl";
let mut file = File::create(file_path)?;
for prediction in predictions {
let json = serde_json::to_string(&prediction)?;
writeln!(file, "{}", json)?;
}
println!("Predictions saved to {}", file_path);
Ok(())
}
fn get_test_model_path() -> String {
// Replace with your actual model path
"path/to/your/model.bin".to_string()
}
Key Changes
- Use `set_prompt` Method: Instead of directly setting the `prompt` field, we use the `set_prompt` method to set the prompt.
- Pass `&str` to `set_prompt`: The `set_prompt` method takes a `&str`, so we pass `prompt.as_str()` to it.
Running the Example
To run the example, use the following command:
cargo run --example text_prediction -- --count 3 --system-prompt "Write a movie collaboration" --instruction-prompt "between Quentin Tarantino and Robert Rodriguez"
This should resolve the compiler error and allow you to run the example successfully. If there are any further issues, please provide the specific error messages, and I can help you troubleshoot further.
Rollback Time
Okay, so it got the error fixed with `set_prompt()`, but then it dropped the usage of `set_antiprompts()` and `set_dry_sequence_breakers()` - which serve the same purpose as `set_prompt()` - and tries to just set named members that don’t exist instead, even after tagging the struct in Chat to pull in its definition. Disappointing. Maybe it would have helped to retag the `@step_prediction_test` integration test again too.
Rolling back to the first example code, I add my function to get the model path and fix the `set_prompt()` call myself, then take a closer look at the logic of the generated code.
Impressively enough it caught on to the usage of ‘freezing’ and ‘defrosting’ (saving/restoring) the state of the LLM after the initial prompt processing as an optimization.
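Pulled out of the listing above, that pattern looks like this (an excerpt, with the token generation elided):
// Freeze the context once after the prompt has been processed...
let frozen_prompt = llama.freeze(&mut params, None);

for _ in 0..cli.count {
    // ... sample and process tokens with the current sampler ...

    // ...then defrost the saved state so the next iteration starts fresh
    // without having to reprocess the prompt.
    let (_, next_sampler) = llama.defrost(&mut params, &frozen_prompt);
    first_sampler = next_sampler;
}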
Time to see if it runs!
cargo run --example dataset_generator -- --count 1 --system-prompt "You are a creative writing AI." --instruction-prompt "Generate a sample text block in Markdown format."
Terminal output:
Prediction (tokens: 100)
# Sample Text Block
## Introduction
This is a sample text block written in Markdown format.
### Paragraph 1
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupid
Predictions saved to predictions.jsonl
The generated `predictions.jsonl` file had the expected data.
Finishing Touches for Qwen
Okay, so it works, but because of the sampler parameters it pulled from my integration test - where I wanted the generated text to mostly be the same - every generation is identical across iterations of the loop. So I figured I’d ask a little bonus question of Qwen.
But I knew I wanted to tag my updated example, so before I asked this question, I hit the ‘…’ button in continue.dev’s window and had it ’re-index’ my code.
Okay the corrected code now exists in @dataset_generator.rs however the sampling
parameters produce the same exact response every time making looping worthess.
Can you give me code just for the sampler parameters and change them up so that
there's a much greater increase in variability in the generted text?
Qwen2.5 Coder 32B’s Response
Certainly! To increase variability in the generated text, you can adjust the sampling parameters to allow for more randomness and diversity. Here are some key parameters you can modify:
- Temperature (`temp`): Lower values make the model more deterministic, while higher values make it more random.
- Top-k Sampling (`top_k`): Controls the number of highest probability vocabulary tokens to keep for top-k filtering.
- Top-p Sampling (`top_p`): Also known as nucleus sampling, where the model considers the smallest set of tokens whose cumulative probability exceeds `top_p`.
- Penalty for Repeat Tokens (`penalty_repeat`): Adjusts the penalty for repeating tokens to encourage diversity.
- Penalty for Last N Tokens (`penalty_last_n`): Adjusts the penalty for repeating tokens in the last `n` tokens.
Here’s the updated code with modified sampling parameters to increase variability:
// examples/dataset_generator.rs
use clap::Parser;
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{self, Write};
use std::process::exit;
use woolyrust::{Llama, ManagedGptParams, TokenList};
#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
#[arg(short, long, default_value_t = 1)]
count: u32,
#[arg(short, long, required = true)]
system_prompt: String,
#[arg(short, long, required = true)]
instruction_prompt: String,
}
#[derive(Serialize, Deserialize)]
struct Prediction {
prompt: String,
prediction: String,
}
fn main() -> io::Result<()> {
let cli = Cli::parse();
// Load up our test model
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path();
let mut llama = Llama::new();
let load_success =
llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert_eq!(load_success, true);
assert_eq!(llama.is_loaded(), true);
// Set the text generation parameters up
let mut params = ManagedGptParams::defaults();
params.params.seed = 42; // You can change this to a different seed or use a random seed
params.params.n_threads = -1;
params.params.n_predict = 100;
params.params.temp = 0.7; // Increased temperature for more randomness
params.params.top_k = 50; // Increased top-k for more diversity
params.params.top_p = 0.9; // Increased top-p for more diversity
params.params.min_p = 0.05; // Lower min_p to allow more tokens
params.params.penalty_repeat = 1.0; // Reduced penalty for repeat tokens
params.params.penalty_last_n = 32; // Reduced penalty last n tokens
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
let antiprompts = vec!["<|end|>"];
let prompt = format!(
"<|user|>\n{}\n<|end|>\n<|assistant|>\n{}",
cli.system_prompt, cli.instruction_prompt
);
params.set_antiprompts(&antiprompts);
params.set_prompt(prompt.as_str());
params.params.dry_multiplier = 0.8;
params.params.dry_base = 1.75;
params.params.dry_allowed_length = 2;
params.params.dry_penalty_last_n = -1;
let seq_breakers = vec!["\n", ":", "\"", "*"];
params.set_dry_sequence_breakers(&seq_breakers);
// Get the prompt ingested into the context and pull the sampler
// used in the process so that repeat penalties and such are
// accounted for.
let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
assert_eq!(prompt_token_count > 0, true);
// Freeze the state after processing the prompt so that we can generate
// a second block of text after the first one without having to reprocess
// the prompt.
let frozen_prompt = llama.freeze(&mut params, None);
let mut predictions: Vec<Prediction> = Vec::new();
for _ in 0..cli.count {
// Start our prediction loop now that the prompt has been processed
let mut prediction_tokens: TokenList = vec![];
while prediction_tokens.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut first_sampler);
if prediction_tokens.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
prediction_tokens.push(next_token);
}
// Print out our prediction
let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
println!(
"Prediction (tokens: {})\n{}",
prediction_tokens.len(),
prediction_str
);
assert!(!prediction_str.is_empty());
assert!(prediction_tokens.len() > 0);
// Store the prediction
predictions.push(Prediction {
prompt: prompt.clone(),
prediction: prediction_str,
});
// Defrost our frozen state from processing the prompt and generate something new
let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
first_sampler = next_sampler;
}
// Serialize predictions to JSONL and save to a file
let file_path = "predictions.jsonl";
let mut file = File::create(file_path)?;
for prediction in predictions {
let json = serde_json::to_string(&prediction)?;
writeln!(file, "{}", json)?;
}
println!("Predictions saved to {}", file_path);
Ok(())
}
// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
if let Ok(fp) = model_filepath {
return fp;
} else {
println!(
"Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
);
exit(1);
}
}
Key Changes
- Temperature (`temp`): Set to `0.7` to increase randomness.
- Top-k Sampling (`top_k`): Set to `50` to consider more tokens.
- Top-p Sampling (`top_p`): Set to `0.9` to consider a larger set of tokens.
- Minimum Probability (`min_p`): Set to `0.05` to allow more tokens.
- Penalty for Repeat Tokens (`penalty_repeat`): Set to `1.0` to reduce the penalty.
- Penalty for Last N Tokens (`penalty_last_n`): Set to `32` to reduce the penalty.
These changes should help increase the variability in the generated text. You can further tweak these parameters to achieve the desired level of diversity and randomness.
One Last Qwen Run
It gave me the whole block of code again, which I wish it hadn’t done, but at least it clearly marked its changes, so I just copied those into my example code. It also didn’t adjust the seed, so every response will still stay the same.
I’ll keep this short, but I asked this in chat:
You didn't change the `seed` so everything will still be the same.
And Qwen regurgitated the whole example code file again, but had the following segment for setting the `seed` variable:
// Set a random seed
let seed = SystemTime::now()
.duration_since(UNIX_EPOCH)
.expect("Time went backwards")
.as_secs();
params.params.seed = seed as u64;
Which is close, but wrong: `seed` is a `u32` type. Also, I know that a much simpler way to introduce random seed behavior is to set `seed` to `u32::MAX`, so I don’t use Qwen’s suggestion and just fix the seed myself.
Once more I run the example and see what comes out:
cargo run --example dataset_generator -- --count 3 --system-prompt "You are a creative writing AI." --instruction-prompt "Generate a sample text block in Markdown format."
Terminal output:
Prediction (tokens: 100)
This text block should include a heading, a brief description, and a few key points. # Heading
## Brief Description
This sample text block demonstrates how to create a Markdown formatted text with a heading, description, and key points.
### Key Points
* **Heading**: The `# Heading` line defines the heading level, with `#` indicating the main heading, `##` indicating a subheading, and so on.
* **Description**: The `## Brief Description` line
Prediction (tokens: 100)
---
### The Mysterious Island
#### Chapter 1: The Discovery
As I stood on the creaky deck of the old sailing ship, the salty sea air whipping my hair into a frenzy, I couldn't help but feel a sense of wonder. We had been at sea for weeks, and finally, the horizon had begun to take on a familiar shape – the island of Tortuga.
Legend had it that the island was cursed, that anyone who set foot on its sandy shores would
Prediction (tokens: 100)
# Introduction
The city of Eldrador is a place of mystery and wonder. It is a city that exists outside the boundaries of time, where the past, present, and future all converge. The city is filled with strange and wondrous sights, sounds, and smells that will leave you breathless and bewildered. As you walk through the city's winding streets, you'll stumble upon ancient ruins, hidden temples, and mysterious artifacts that will challenge your perceptions and spark your imagination.
## The
Predictions saved to predictions.jsonl
The generated `predictions.jsonl` file had all the predictions as expected (formatted for reading):
{
"prompt":"<|user|>
You are a creative writing AI.
<|end|>
<|assistant|>
Generate a sample text block in Markdown format.",
"prediction":" This text block should include a heading, a brief description, and a few key points. # Heading
## Brief Description
This sample text block demonstrates how to create a Markdown formatted text with a heading, description, and key points.
### Key Points
* **Heading**: The `# Heading` line defines the heading level, with `#` indicating the main heading, `##` indicating a subheading, and so on.
* **Description**: The `## Brief Description` line"
}
{
"prompt":"<|user|>
You are a creative writing AI.
<|end|>
<|assistant|>
Generate a sample text block in Markdown format.",
"prediction":"
---
### The Mysterious Island
#### Chapter 1: The Discovery
As I stood on the creaky deck of the old sailing ship, the salty sea air whipping my hair into a frenzy, I couldn't help but feel a sense of wonder. We had been at sea for weeks, and finally, the horizon had begun to take on a familiar shape – the island of Tortuga.
Legend had it that the island was cursed, that anyone who set foot on its sandy shores would"
}
{
"prompt":"<|user|>
You are a creative writing AI.
<|end|>
<|assistant|>
Generate a sample text block in Markdown format.",
"prediction":"
# Introduction
The city of Eldrador is a place of mystery and wonder. It is a city that exists outside the boundaries of time, where the past, present, and future all converge. The city is filled with strange and wondrous sights, sounds, and smells that will leave you breathless and bewildered. As you walk through the city's winding streets, you'll stumble upon ancient ruins, hidden temples, and mysterious artifacts that will challenge your perceptions and spark your imagination.
## The"
}
Final Qwen 2.5 Coder 32B Example Code
// examples/text_prediction.rs
use clap::Parser;
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{self, Write};
use std::process::exit;
use woolyrust::{Llama, ManagedGptParams, TokenList};
#[derive(Parser)]
#[command(author, version, about = "Text Prediction Example", long_about = None)]
struct Cli {
#[arg(short, long, default_value_t = 1)]
count: u32,
#[arg(short, long, required = true)]
system_prompt: String,
#[arg(short, long, required = true)]
instruction_prompt: String,
}
#[derive(Serialize, Deserialize)]
struct Prediction {
prompt: String,
prediction: String,
}
fn main() -> io::Result<()> {
let cli = Cli::parse();
// Load up our test model
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path();
let mut llama = Llama::new();
let load_success =
llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert_eq!(load_success, true);
assert_eq!(llama.is_loaded(), true);
// Set the text generation parameters up
let mut params = ManagedGptParams::defaults();
params.params.seed = u32::MAX;
params.params.n_threads = -1;
params.params.n_predict = 100;
params.params.temp = 0.7; // Increased temperature for more randomness
params.params.top_k = 50; // Increased top-k for more diversity
params.params.top_p = 0.9; // Increased top-p for more diversity
params.params.min_p = 0.05; // Lower min_p to allow more tokens
params.params.penalty_repeat = 1.0; // Reduced penalty for repeat tokens
params.params.penalty_last_n = 32; // Reduced penalty last n tokens
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
let antiprompts = vec!["<|end|>"];
let prompt = format!(
"<|user|>\n{}\n<|end|>\n<|assistant|>\n{}",
cli.system_prompt, cli.instruction_prompt
);
params.set_antiprompts(&antiprompts);
params.set_prompt(prompt.as_str());
params.params.dry_multiplier = 0.8;
params.params.dry_base = 1.75;
params.params.dry_allowed_length = 2;
params.params.dry_penalty_last_n = -1;
let seq_breakers = vec!["\n", ":", "\"", "*"];
params.set_dry_sequence_breakers(&seq_breakers);
// Get the prompt ingested into the context and pull the sampler
// used in the process so that repeat penalties and such are
// accounted for.
let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
assert_eq!(prompt_token_count > 0, true);
// Freeze the state after processing the prompt so that we can generate
// a second block of text after the first one without having to reprocess
// the prompt.
let frozen_prompt = llama.freeze(&mut params, None);
let mut predictions: Vec<Prediction> = Vec::new();
for _ in 0..cli.count {
// Start our prediction loop now that the prompt has been processed
let mut prediction_tokens: TokenList = vec![];
while prediction_tokens.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut first_sampler);
if prediction_tokens.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
prediction_tokens.push(next_token);
}
// Print out our prediction
let prediction_str = llama.detokenize_text(&mut prediction_tokens, false);
println!(
"Prediction (tokens: {})\n{}",
prediction_tokens.len(),
prediction_str
);
assert!(!prediction_str.is_empty());
assert!(prediction_tokens.len() > 0);
// Store the prediction
predictions.push(Prediction {
prompt: prompt.clone(),
prediction: prediction_str,
});
// Defrost our frozen state from processing the prompt and generate something new
let (_, mut next_sampler) = llama.defrost(&mut params, &frozen_prompt);
first_sampler = next_sampler;
}
// Serialize predictions to JSONL and save to a file
let file_path = "predictions.jsonl";
let mut file = File::create(file_path)?;
for prediction in predictions {
let json = serde_json::to_string(&prediction)?;
writeln!(file, "{}", json)?;
}
println!("Predictions saved to {}", file_path);
Ok(())
}
// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
if let Ok(fp) = model_filepath {
return fp;
} else {
println!(
"Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
);
exit(1);
}
}
Qwen 2.5 Coder 32B Conclusion
The model was very close to getting everything right in one shot, but trying to correct the one silly error via Chat would have sent me down a spiral of progressively worse problems. It’s a shame that it rewrote code that wasn’t a problem and made it worse … and in general, it doesn’t seem to want to be concise in chat by default. I’m almost positive I could add a system prompt to adjust that, however.
All that said, the example it generated was actually better than Nemotron 70B’s, picking up on the whole prompt-caching trick by freezing the processed prompt tokens. I suspect that I might have to be more specific with Qwen 2.5 Coder 32B than I would have to be with Nemotron 70B, but such is the case with any smaller LLM versus a significantly larger one.
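For anyone skimming, the prompt-caching trick in question boils down to the freeze/defrost pattern from the final Qwen example above: ingest the prompt once, snapshot the state, and restore that snapshot before each new generation instead of reprocessing the prompt. Condensed from that listing:
// Ingest the prompt once and keep the sampler it produced.
let (prompt_token_count, mut first_sampler) = llama.process_prompt(&mut params);
assert!(prompt_token_count > 0);

// Snapshot the post-prompt state so later generations can reuse it.
let frozen_prompt = llama.freeze(&mut params, None);

// ... inside the generation loop: sample a reply with `first_sampler`,
// then restore the frozen prompt state and swap in the fresh sampler
// instead of reprocessing the prompt.
let (_, next_sampler) = llama.defrost(&mut params, &frozen_prompt);
first_sampler = next_sampler;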
The Crowdpleaser - Claude 3.5 Sonnet
Everyone seems to rave about this model, so let’s take it for a spin using the exact same workflow as with the other models, starting with a simple question once everything is configured correctly in continue.dev and VS Code:
Is it easy to add examples to Rust library projects?
Claude 3.5 Sonnet’s Response
Yes, it’s quite straightforward to add examples to Rust library projects! You can add example code in two main ways:
- In the `examples/` directory of your project: Create standalone example files that demonstrate how to use your library. These can be run directly with `cargo run --example example_name`.
- As documentation examples: Write examples directly in your code’s documentation using doc comments (`///` or `//!`). These serve both as documentation and as testable examples through `cargo test`.
Here’s a quick example of how you would set up both:
/// A simple function that adds two numbers
///
/// # Example
///
/// ```
/// use your_crate_name::add;
/// assert_eq!(add(2, 2), 4);
/// ```
pub fn add(a: i32, b: i32) -> i32 {
a + b
}
And in a separate example file:
use your_crate_name::add;
fn main() {
let result = add(10, 5);
println!("10 + 5 = {}", result);
}
You can run this example using:
cargo run --example basic_math
Both approaches are valuable:
- Doc examples are great for showing quick, focused usage snippets
- Standalone examples are better for more complex, real-world usage scenarios
Would you like me to show more specific examples or explain any particular aspect in more detail?
An Interesting Response To a Simple Question
I don’t know how I feel about this one. On one hand, it’s clear and concise, but on the other it doesn’t explicitly say where to put that `basic_math.rs` file, though it should be implied well enough, I suppose. Let’s follow up with the first set of demands in the same way as we did with the other models:
I want to create a new example for this Rust project. It should take three
command line args, parsed with the `clap` library, for 'count',
'system prompt' and 'instruction prompt'.
Then, it should follow the text prediction workflow for the library as
illustrated in the @step_prediction_test function. I would like to perform
'count' number of text predictions for a prompt built by the 'system prompt'
and 'instruction prompt' arguments on the command line.
I want to put the resulting predicted text into a vector and then turn it
into a JSONL file at the end, serializing it and saving it to the file system.
Claude 3.5 Sonnet’s Response
I’ll help you create an example that combines clap, the text prediction workflow, and JSONL output. Here’s how we can do that:
use clap::Parser;
use serde::Serialize;
use std::fs::File;
use std::io::Write;
use woolyrust::{Llama, ManagedGptParams, TokenList};
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
/// Number of predictions to generate
#[arg(short, long)]
count: usize,
/// System prompt to use
#[arg(short, long)]
system_prompt: String,
/// Instruction prompt to use
#[arg(short, long)]
instruction_prompt: String,
}
#[derive(Serialize)]
struct Prediction {
system_prompt: String,
instruction_prompt: String,
generated_text: String,
token_count: usize,
}
fn main() -> std::io::Result<()> {
let args = Args::parse();
// Initialize model
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = "path/to/your/model.gguf"; // Update with actual path
let mut llama = Llama::new();
let load_success = llama.load_model(model_filepath, model_params, context_params, true);
assert!(load_success);
// Set up generation parameters
let mut params = ManagedGptParams::defaults();
params.params.seed = 42;
params.params.n_threads = -1;
params.params.n_predict = 100;
params.params.temp = 0.1;
params.params.top_k = 1;
params.params.top_p = 1.0;
params.params.min_p = 0.1;
params.params.penalty_repeat = 1.1;
params.params.penalty_last_n = 512;
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
let antiprompts = vec!["<|end|>"];
params.set_antiprompts(&antiprompts);
// Store predictions
let mut predictions: Vec<Prediction> = Vec::new();
// Generate multiple predictions
for _ in 0..args.count {
// Construct prompt
let prompt = format!(
"<|system|>\n{}<|end|>\n<|user|>\n{}<|end|>\n<|assistant|>\n",
args.system_prompt, args.instruction_prompt
);
params.set_prompt(&prompt);
// Process prompt and get sampler
let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
assert!(prompt_token_count > 0);
// Generate prediction
let mut tokens: TokenList = vec![];
while tokens.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut sampler);
if tokens.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
tokens.push(next_token);
}
// Get predicted text
let prediction_str = llama.detokenize_text(&mut tokens, false);
// Store prediction
predictions.push(Prediction {
system_prompt: args.system_prompt.clone(),
instruction_prompt: args.instruction_prompt.clone(),
generated_text: prediction_str,
token_count: tokens.len(),
});
}
// Write to JSONL file
let mut file = File::create("predictions.jsonl")?;
for prediction in predictions {
let json_line = serde_json::to_string(&prediction)?;
writeln!(file, "{}", json_line)?;
}
Ok(())
}
To use this example, you’ll need to add these dependencies to your `Cargo.toml`:
[dependencies]
clap = { version = "4.4", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
You can run this example with:
cargo run --example batch_predictions -- --count 5 --system-prompt "You are a creative writer." --instruction-prompt "Write a short story about a robot."
This example:
- Uses clap to parse command line arguments
- Sets up the Llama model with the same parameters as your test
- Generates the requested number of predictions
- Stores each prediction with its metadata in a struct
- Writes the predictions to a JSONL file where each line is a valid JSON object
The output will be saved in `predictions.jsonl`, with each line containing a JSON object with the system prompt, instruction prompt, generated text, and token count.
Note: You’ll need to update the `model_filepath` variable to point to your actual model file location.
One and Done!
As a compliment to Nemotron, I’d say that current Claude generates code in much the same way. It doesn’t get stuck on my model file being pulled from an external function, and it sets the prompt appropriately.
Pasting the code into the file presents me with no compiler errors. On a more detailed review, I notice it doesn’t pick up on the prompt caching strategy that Qwen 2.5 Coder 32B did, which is a shame.
Pulling in my `get_test_model_path()` function from the integration test, I update the code to call that to get my model file path instead of the placeholder string and then run the example.
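The edit itself is tiny; it just swaps Claude’s placeholder for the helper that already exists in the integration test (and in the final code listing below):
// Before: Claude's placeholder path.
// let model_filepath = "path/to/your/model.gguf";

// After: reuse the integration test helper that reads WOOLY_TEST_MODEL_FILE.
let model_filepath = get_test_model_path();
let load_success = llama.load_model(model_filepath.as_str(), model_params, context_params, true);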
This version of the example produces no terminal output, but the `predictions.jsonl` file is created appropriately and looks fine (formatted for reading):
{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"# Sample Text Block
## Introduction
This is a sample text block written in Markdown format. It includes headings, paragraphs, and links.
### Key Features of Markdown
* **Headings**: Use `# Heading` to create headings of different levels (e.g., `# Heading 1`, `## Heading 2`, etc.)
* **Paragraphs**: Simply start a new line to create a paragraph.
* **Links**: Use `[text](url)` to create links.
",
"token_count":100}
Claude 3.5 Sonnet successfully one-shotted the task. The extra text was clear and concise. Very well done! Let’s see if it can properly adjust the sampler settings so that looping this over a number of iterations doesn’t produce the same output.
I update continue.dev by clicking the ‘…’ button and choosing ‘re-index’ before asking the next follow-up question:
Okay, I have the example code placed in @dataset_generator.rs and it's working
well for a `count` of 1. However, the sampler settings encourage consistent
text generation. Can you supply just the code needed to update the sampler
parameters so I can get different text each loop iteration?
Claude 3.5 Sonnet’s Response
Here’s how to update the generation parameters inside the loop to ensure more variety between generations:
// ... existing code ...
// Generate multiple predictions
for i in 0..args.count {
// Update sampling parameters for variety
params.params.seed = 42 + i as i32; // Different seed each iteration
params.params.temp = 0.7; // Higher temperature for more variety
params.params.top_k = 40; // Allow more token choices
params.params.top_p = 0.9; // Sample from more of the distribution
params.params.min_p = 0.05; // Lower minimum probability threshold
params.params.penalty_repeat = 1.2; // Slightly stronger repetition penalty
// Construct prompt
let prompt = format!(
"<|system|>\n{}<|end|>\n<|user|>\n{}<|end|>\n<|assistant|>\n",
args.system_prompt, args.instruction_prompt
);
params.set_prompt(&prompt);
// ... existing code ...
}
// ... existing code ...
These adjustments will make each generation more creative and different from the others, while still maintaining reasonable coherence. The key changes are:
- Using a different seed for each iteration
- Higher temperature (0.7 vs 0.1)
- Higher top_k (40 vs 1)
- Lower top_p (0.9 vs 1.0)
- Lower min_p (0.05 vs 0.1)
I Mean … Yeah …
The fix is a little sloppy, as the only thing that changes between iterations is the seed, and even then it will generate dupes when invoked multiple times on the command line. But it is basically correct. It misses the type of `seed` being `u32`, but I don’t believe it had access to that information in its context window.
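If the cross-invocation dupes mattered, one sketch of a fix (combining Qwen’s earlier `SystemTime` idea with the `u32` seed field noted above; not something I actually applied here) would be to derive a base seed from the clock and offset it by the loop index:
// (add `use std::time::{SystemTime, UNIX_EPOCH};` to the top of the file)

// Sketch only: seed each iteration from the current time plus the loop index,
// cast down to the u32 that the seed field expects.
let base_seed = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .expect("Time went backwards")
    .as_secs() as u32;
params.params.seed = base_seed.wrapping_add(i as u32);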
After making the modifications, I run the example with a `count` of three and this is the resulting `predictions.jsonl` file (formatted for reading):
{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"## Sample Text Block
### Heading 1
This is the first heading.
#### Subheading 2
Another subheading that's part of this main title.
#### *Italicized* Text
This piece of content will be displayed as italicized.
#### **Bold** Text
And another bold one, because there should be multiple levels of emphasis here too!
#### _Underlined_ Text _
To create an underlined link or highlighter you'll use underscores instead. This won't work for",
"token_count":100
}
{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"# Sample Text Block
## Introduction
Welcome to our new restaurant, where the food is delicious and the atmosphere is cozy. Our menu features a variety of dishes from around the world.
### Menu Items
* Appetizers:
\t+ Spinach and Artichoke Dip ($8)
\t+ Bruschetta with Fresh Tomatoes ($7)
### Beverages
* Coffee: $2-$5 per cup
* Tea: $3-$6 per pot
- - -
Would you like me",
"token_count":100
}
{
"system_prompt":"You are a creative writing AI.",
"instruction_prompt":"Generate a sample text block in Markdown format.",
"generated_text":"# Sample Text Block
## Introduction
Welcome to our tutorial on creating a compelling narrative. In this section, we will explore the fundamentals of storytelling and provide you with practical tips to improve your writing skills.
### Key Elements of Storytelling
* **Character Development**: Well-rounded characters are essential for engaging readers. Give your characters unique personalities, motivations, and backstories.
* **Plot Structure**: A clear plot structure helps maintain reader interest. Use techniques like conflict resolution, rising action,",
"token_count":100
}
Final Claude 3.5 Sonnet Example Code
use clap::Parser;
use serde::Serialize;
use std::io::Write;
use std::{fs::File, process::exit};
use woolyrust::{Llama, ManagedGptParams, TokenList};
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
/// Number of predictions to generate
#[arg(short, long)]
count: usize,
/// System prompt to use
#[arg(short, long)]
system_prompt: String,
/// Instruction prompt to use
#[arg(short, long)]
instruction_prompt: String,
}
#[derive(Serialize)]
struct Prediction {
system_prompt: String,
instruction_prompt: String,
generated_text: String,
token_count: usize,
}
fn main() -> std::io::Result<()> {
let args = Args::parse();
// Initialize model
let mut model_params = woolyrust::get_default_model_params();
model_params.n_gpu_layers = 100;
let mut context_params = woolyrust::get_default_context_params();
context_params.n_ctx = 1024 * 2;
let model_filepath = get_test_model_path();
let mut llama = Llama::new();
let load_success =
llama.load_model(model_filepath.as_str(), model_params, context_params, true);
assert!(load_success);
// Set up generation parameters
let mut params = ManagedGptParams::defaults();
params.params.seed = 42;
params.params.n_threads = -1;
params.params.n_predict = 100;
params.params.temp = 0.1;
params.params.top_k = 1;
params.params.top_p = 1.0;
params.params.min_p = 0.1;
params.params.penalty_repeat = 1.1;
params.params.penalty_last_n = 512;
params.params.ignore_eos = false;
params.params.flash_attn = true;
params.params.n_batch = 128;
params.params.prompt_cache_all = false;
let antiprompts = vec!["<|end|>"];
params.set_antiprompts(&antiprompts);
// Store predictions
let mut predictions: Vec<Prediction> = Vec::new();
// Generate multiple predictions
for i in 0..args.count {
// Update sampling parameters for variety
params.params.seed = 42 + i as u32; // Different seed each iteration
params.params.temp = 0.7; // Higher temperature for more variety
params.params.top_k = 40; // Allow more token choices
params.params.top_p = 0.9; // Sample from more of the distribution
params.params.min_p = 0.05; // Lower minimum probability threshold
params.params.penalty_repeat = 1.2; // Slightly stronger repetition penalty
// Construct prompt
let prompt = format!(
"<|system|>\n{}<|end|>\n<|user|>\n{}<|end|>\n<|assistant|>\n",
args.system_prompt, args.instruction_prompt
);
params.set_prompt(&prompt);
// Process prompt and get sampler
let (prompt_token_count, mut sampler) = llama.process_prompt(&mut params);
assert!(prompt_token_count > 0);
// Generate prediction
let mut tokens: TokenList = vec![];
while tokens.len() < params.params.n_predict as usize {
let next_token = llama.sample_next_token(&mut sampler);
if tokens.len() < params.params.n_predict as usize {
let processed = llama.process_next_token(next_token);
assert!(processed);
}
tokens.push(next_token);
}
// Get predicted text
let prediction_str = llama.detokenize_text(&mut tokens, false);
// Store prediction
predictions.push(Prediction {
system_prompt: args.system_prompt.clone(),
instruction_prompt: args.instruction_prompt.clone(),
generated_text: prediction_str,
token_count: tokens.len(),
});
}
// Write to JSONL file
let mut file = File::create("predictions.jsonl")?;
for prediction in predictions {
let json_line = serde_json::to_string(&prediction)?;
writeln!(file, "{}", json_line)?;
}
Ok(())
}
// the relative path to the model to load for the tests
pub fn get_test_model_path() -> String {
let model_filepath = std::env::var("WOOLY_TEST_MODEL_FILE");
if let Ok(fp) = model_filepath {
return fp;
} else {
println!(
"Set WOOLY_TEST_MODEL_FILE environment variable to the gguf file to use for testing"
);
exit(1);
}
}
Claude 3.5 Sonnet Conclusion
Well, what’s to say? It was efficient, clear and nailed everything in one shot. Very impressive. The latency through openrouter was much more noticeable with this model, though I wouldn’t say it was inconveniently slow. The price of this model ate into my credits much faster. Just these three calls cost me about $0.06, which admittedly isn’t much … but I think that going crazy with it might see my credit balance getting absolutely annihilated.
Overall Conclusion
I will say that I’m a little sad that Qwen 2.5 Coder 32B (quantized!) was the only model that picked up on the ability to cache the processed prompt tokens. I figured for sure Claude would have figured that out; it is literally the whole point of the integration test function I put into context via continue.dev’s tagging for the models to base the new example on.
Otherwise, I’d say Qwen 2.5 Coder 32B had a really good showing. It’s a capable ‘smaller’ model that fits into consumer GPUs more readily, and just like other ‘smaller’ models, it needs a little more hand-holding to achieve maximum results. Claude and Nemotron both felt a little easier to work with because of that.
Maybe it’s also no surprise that Claude 3.5 Sonnet was simply the best model tested. It’s expensive and slower, but it got everything right. If I ever get around to writing another one of these blog articles with a more advanced test than this basic Rust example based off of an existing integration test, I’ll make sure to pit it against an OpenAI model. In my limited experience, ‘o1-preview’ gets things right that nothing else does…
Llama 3.1 Nemotron 70B did okay. I think for the price on openrouter I would have no problems keeping it in my rotation and I’m definitely going to test it out further. I also don’t have much experience with other 70B class LLMs like Llama 3.3 70B, so I don’t know how it compares.
I think an easy conclusion to draw is to configure multiple models in continue.dev so that you can switch to the high-performance, high-cost models only when needed.
This took a few hours to put together, even with how sloppy and unedited my writing is, but I regret not being able to test different quants of Qwen 2.5 Coder 32B to see if anything changes significantly. There are also other models that deserve a mention in this context, such as the smaller ‘Codestral-22b’ and of course OpenAI’s ‘o1-preview’. This article already felt too long for anyone to actually read, so I stopped at just these three models.
Ahhh … But Which Version Was The Chosen One?
Wondering which version I kept to develop further?
Drumroll, please! … … …
The Qwen 2.5 Coder 32B version. It had the prompt caching already enabled and was the easiest to take forward.
Errata
I didn’t realize until the very end that the way I was copying responses from the AI had somehow escaped all the strings and also replaced the “\n” string with actual newlines, so I had to repair all of that by hand. If you spot something that doesn’t look right because quotes are escaped or there’s a newline where there shouldn’t be … that’s probably my fault. My bad.