LLamaSharp
Reserved to be used by the compiler for tracking metadata.
This class should not be used by developers in source code.
This definition is provided by the IsExternalInit NuGet package (https://www.nuget.org/packages/IsExternalInit).
Please see https://github.com/manuelroemer/IsExternalInit for more information.
The parameters for initializing a LLama context from a model.
Model context size (n_ctx)
maximum batch size that can be submitted at once (must be >=32 to use BLAS) (n_batch)
Physical batch size
max number of sequences (i.e. distinct states for recurrent models)
If true, extract embeddings (together with logits).
RoPE base frequency (null to fetch from the model)
RoPE frequency scaling factor (null to fetch from the model)
The encoding to use for models
Number of threads (null = autodetect) (n_threads)
Number of threads to use for batch processing (null = autodetect) (n_threads)
YaRN extrapolation mix factor (null = from model)
YaRN magnitude scaling factor (null = from model)
YaRN low correction dim (null = from model)
YaRN high correction dim (null = from model)
YaRN original context length (null = from model)
YaRN scaling method to use.
Override the type of the K cache
Override the type of the V cache
Whether to disable offloading the KQV cache to the GPU
Whether to use flash attention
defragment the KV cache if holes/size > defrag_threshold; set to a value < 0 to disable (default)
defragment the KV cache if holes/size > defrag_threshold; set to a value < 0 to disable (default)
How to pool (sum) embedding results by sequence id (ignored if no pooling layer)
Attention type to use for embeddings
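As a rough illustration of how these context parameters are typically supplied, here is a minimal sketch using LLamaSharp's `ModelParams` class (which implements the context parameter interface). The property names shown (`ContextSize`, `BatchSize`, `Embeddings`, `Threads`) reflect common LLamaSharp versions and the file path is a placeholder; verify both against the version in use.

```csharp
using LLama.Common;

// Illustrative only: ModelParams carries the context settings described above.
var parameters = new ModelParams("path/to/model.gguf")   // placeholder path
{
    ContextSize = 4096,   // n_ctx (null = use the model's trained value)
    BatchSize = 512,      // n_batch, logical maximum batch size
    Embeddings = false,   // set true to extract embeddings together with logits
    Threads = 8,          // null = autodetect
};
```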
Transform history to plain text and vice versa.
Convert a ChatHistory instance to plain text.
The ChatHistory instance
Converts plain text to a ChatHistory instance.
The role for the author.
The chat history as plain text.
The updated history.
Copy the transform.
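To make the transform contract above concrete, a custom history transform might look like the following sketch. The member names (`HistoryToText`, `TextToHistory`, and `Clone` for the copy operation) are assumed from the interface description here; the formatting logic itself is purely an example.

```csharp
using System.Linq;
using LLama.Abstractions;
using LLama.Common;

// Example only: renders history as "Role: message" lines and parses text back
// into a single-message history.
public class SimpleHistoryTransform : IHistoryTransform
{
    public string HistoryToText(ChatHistory history)
        => string.Join("\n", history.Messages.Select(m => $"{m.AuthorRole}: {m.Content}")) + "\n";

    public ChatHistory TextToHistory(AuthorRole role, string text)
    {
        var history = new ChatHistory();
        history.AddMessage(role, text.Trim());
        return history;
    }

    // "Copy the transform" member, assumed to be named Clone.
    public IHistoryTransform Clone() => new SimpleHistoryTransform();
}
```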
The parameters used for inference.
number of tokens to keep from initial prompt
how many new tokens to predict (n_predict); set to -1 to generate indefinitely until generation completes.
Sequences where the model will stop generating further tokens.
Set a custom sampling pipeline to use.
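For context, a hedged sketch of how these inference parameters are typically populated. Property names such as `TokensKeep`, `MaxTokens`, `AntiPrompts` and `SamplingPipeline` reflect common LLamaSharp versions and may differ slightly in yours.

```csharp
using LLama.Common;
using LLama.Sampling;

var inferenceParams = new InferenceParams
{
    TokensKeep = 32,                    // tokens kept from the initial prompt
    MaxTokens = 256,                    // -1 would generate until completion
    AntiPrompts = new[] { "User:" },    // stop sequences
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.6f,             // custom sampling pipeline
    },
};
```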
A high level interface for LLama models.
The loaded context for this executor.
Identify if it's a multi-modal model and there is an image to process.
Multi-Modal Projections / Clip Model weights
List of images: List of images in byte array format.
Asynchronously infers a response from the model.
Your prompt
Any additional parameters
A cancellation token.
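A small usage sketch of the `InferAsync` call described above. The `executor` and `inferenceParams` variables are assumed to have been constructed already (see the executor and inference-parameter examples elsewhere in this document).

```csharp
using System;
using System.Threading;

// `executor` is any ILLamaExecutor (e.g. an InteractiveExecutor or StatelessExecutor).
// `inferenceParams` is an optional InferenceParams instance (may be null).
await foreach (var piece in executor.InferAsync("Question: What is a tree?\nAnswer:",
                                                inferenceParams,
                                                CancellationToken.None))
{
    Console.Write(piece);
}
```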
Convenience interface for implementing both types of parameters.
Mostly exists for backwards compatibility reasons, when these two were not split.
The parameters for initializing a LLama model.
main_gpu interpretation depends on split_mode:
- None: the GPU that is used for the entire model.
- Row: the GPU that is used for small tensors and intermediate results.
- Layer: ignored.
How to split the model across multiple GPUs
Number of layers to run in VRAM / GPU memory (n_gpu_layers)
Use mmap for faster loads (use_mmap)
Use mlock to keep model in memory (use_mlock)
Model path (model)
how split tensors should be distributed across GPUs
Load vocab only (no weights)
Validate model tensor data before loading
Override specific metadata items in the model
A fixed size array to set the tensor splits across multiple GPUs
The size of this array
Get or set the proportion of work to do on the given device.
"[ 3, 2 ]" will assign 60% of the data to GPU 0 and 40% to GPU 1.
Create a new tensor splits collection, copying the given values
Create a new tensor splits collection with all values initialised to the default
Set all values to zero
A JSON converter for
An override for a single key/value pair in model metadata
Get the key being overridden by this override
Create a new override for an int key
Create a new override for a float key
Create a new override for a boolean key
Create a new override for a string key
A JSON converter for
Descriptor of a native library.
Metadata of this library.
Prepare the native library file and return its local path.
If it's a relative path, LLamaSharp will search for it in the search directories you set.
The system information of the current machine.
The log callback.
The relative paths of the library. You could return multiple paths to try them one by one. If no file is available, please return an empty array.
Takes a stream of tokens and transforms them.
Takes a stream of tokens and transforms them, returning a new stream of tokens asynchronously.
Copy the transform.
An interface for text transformations.
These can be used to compose a pipeline of text transformations, such as:
- Tokenization
- Lowercasing
- Punctuation removal
- Trimming
- etc.
Takes a string and transforms it.
Copy the transform.
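A tiny sketch of a text transform in the spirit of the pipeline described above. The member names (`Transform` and `Clone` for the copy operation) are assumed from the interface description here.

```csharp
using LLama.Abstractions;

// Example transform: lowercases input text before it reaches the model.
public class LowercaseTransform : ITextTransform
{
    public string Transform(string text) => text.ToLowerInvariant();

    public ITextTransform Clone() => new LowercaseTransform();
}
```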
Extension methods to the interface.
Gets an instance for the specified .
The executor.
The to use to transform an input list messages into a prompt.
The to use to transform the output into text.
An instance for the provided .
is null.
Format the chat messages into a string prompt.
Convert the chat options to inference parameters.
A default transform that appends "Assistant: " to the end.
AntipromptProcessor keeps track of past tokens looking for any set Anti-Prompts
Initializes a new instance of the class.
The antiprompts.
Add an antiprompt to the collection
Overwrite all current antiprompts with a new set
Add some text and check if the buffer now ends with any antiprompt
true if the text buffer ends with any antiprompt
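A short usage sketch of the antiprompt processor described above; the constructor and `Add` semantics follow the documentation here and should be treated as illustrative.

```csharp
using LLama;

var antiprompts = new AntipromptProcessor(new[] { "User:" });

// Feed decoded text fragments as they are generated; stop when an antiprompt appears.
var stop = antiprompts.Add("Assistant: Hello!\nUser:");
// stop == true here, because the accumulated text now ends with "User:"
```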
A batched executor that can infer multiple separate "conversations" simultaneously.
Set to 1 using interlocked exchange while inference is running
Epoch is incremented twice every time Infer is called. Conversations can use this to keep track of
whether they're waiting for inference, or can be sampled.
The this executor is using
The this executor is using
Get the number of tokens in the batch, waiting for Infer() to be called
Number of batches in the queue, waiting for Infer() to be called
Check if this executor has been disposed.
Create a new batched executor
The model to use
Parameters to create a new context
Start a new Conversation
Load a conversation that was previously saved to a file. Once loaded the conversation will
need to be prompted.
Load a conversation that was previously saved into memory. Once loaded the conversation will need to be prompted.
Run inference for all conversations in the batch which have pending tokens.
If the result is `NoKvSlot` then there is not enough memory for inference, try disposing some conversation
threads and running inference again.
Get a reference to a batch that tokens can be added to.
Get a reference to a batch that embeddings can be added to.
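To make the workflow concrete, here is a hedged sketch of a single conversation round-trip with the batched executor. It assumes the `Create`/`Prompt`/`Infer` members documented here, the `Sample` extension method documented later in this section, and a `DefaultSamplingPipeline`; exact signatures vary between LLamaSharp versions, and the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Batched;
using LLama.Common;
using LLama.Sampling;

var parameters = new ModelParams("path/to/model.gguf");   // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(weights, parameters);

// Start a conversation and queue up a prompt.
using var conversation = executor.Create();
conversation.Prompt(executor.Context.Tokenize("The quick brown fox"));

var sampler = new DefaultSamplingPipeline();
var decoder = new StreamingTokenDecoder(executor.Context);

for (var i = 0; i < 16; i++)
{
    // Run inference for every conversation with pending tokens.
    await executor.Infer();

    // Sample the next token for this conversation and feed it back in.
    var token = conversation.Sample(sampler);
    decoder.Add(token);
    conversation.Prompt(token);
}

Console.WriteLine(decoder.Read());
```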
A single conversation thread that can be prompted (adding tokens from the user) or inferred (extracting a token from the LLM)
Indicates if this conversation has been "forked" and may share logits with another conversation.
Stores the indices to sample from. Contains valid items.
The executor which this conversation belongs to
Unique ID for this conversation
Total number of tokens in this conversation, cannot exceed the context length.
Indicates if this conversation has been disposed, nothing can be done with a disposed conversation
Indicates if this conversation is waiting for inference to be run on the executor. "Prompt" and "Sample" cannot be called when this is true.
Indicates that this conversation should be sampled.
Finalizer for Conversation
End this conversation, freeing all resources used by it
Create a copy of the current conversation
The copy shares internal state, so consumes very little extra memory.
Get the index in the context at which each token can be sampled from; the return value of this function can be used to retrieve logits or to sample a token.
How far from the end of the previous prompt should logits be sampled. Any value other than 0 requires
allLogits to have been set during prompting.
For example if 5 tokens were supplied in the last prompt call:
- The logits of the first token can be accessed with 4
- The logits of the second token can be accessed with 3
- The logits of the third token can be accessed with 2
- The logits of the fourth token can be accessed with 1
- The logits of the fifth token can be accessed with 0
Thrown if this conversation was not prompted before the previous call to infer
Thrown if Infer() must be called on the executor
Get the logits from this conversation, ready for sampling
How far from the end of the previous prompt should logits be sampled. Any value other than 0 requires allLogits to have been set during prompting
Thrown if this conversation was not prompted before the previous call to infer
Thrown if Infer() must be called on the executor
Add tokens to this conversation
If true, generate logits for all tokens. If false, only generate logits for the last token.
Add tokens to this conversation
If true, generate logits for all tokens. If false, only generate logits for the last token.
Add a single token to this conversation
Prompt this conversation with an image embedding
Prompt this conversation with embeddings
The raw values of the embeddings. This span must divide equally by the embedding size of this model.
Directly modify the KV cache of this conversation
Thrown if this method is called while == true
Provides direct access to the KV cache of a .
See for how to use this.
Removes all tokens that have positions in [start, end)
Start position (inclusive)
End position (exclusive)
Removes all tokens starting from the given position
Start position (inclusive)
Number of tokens
Adds relative position "delta" to all tokens that have positions in [p0, p1).
If the KV cache is RoPEd, the KV data is updated
accordingly
Start position (inclusive)
End position (exclusive)
Amount to add on to each token position
Integer division of the positions by factor of `d > 1`.
If the KV cache is RoPEd, the KV data is updated accordingly.
Start position (inclusive). If less than zero, it is clamped to zero.
End position (exclusive). If less than zero, it is treated as "infinity".
Amount to divide each position by.
A function which can temporarily access the KV cache of a to modify it directly
The current end token of this conversation
An which allows direct access to modify the KV cache
The new end token position
Save the complete state of this conversation to a file. If the file already exists it will be overwritten.
Save the complete state of this conversation in system memory.
Load state from a file
This should only ever be called by the BatchedExecutor, on a newly created conversation object!
Load state from a previously saved state.
This should only ever be called by the BatchedExecutor, on a newly created conversation object!
In memory saved state of a
Indicates if this state has been disposed
Get the size in bytes of this state object
Internal constructor to prevent anyone outside of LLamaSharp from extending this class
Extension method for
Sample a token from this conversation using the given sampler chain
to sample from
Offset from the end of the conversation to the logits to sample, see for more details
Sample a token from this conversation using the given sampling pipeline
to sample from
Offset from the end of the conversation to the logits to sample, see for more details
Rewind a back to an earlier state by removing tokens from the end
The conversation to rewind
The number of tokens to rewind
Thrown if `tokens` parameter is larger than TokenCount
Shift all tokens over to the left, removing "count" tokens from the start and shifting everything over.
Leaves "keep" tokens at the start completely untouched. This can be used to free up space when the context
gets full, keeping the prompt at the start intact.
The conversation to rewind
How much to shift tokens over by
The number of tokens at the start which should not be shifted
Base class for exceptions thrown from
This exception is thrown when "Prompt()" is called on a which has
already been prompted and before "Infer()" has been called on the associated
.
This exception is thrown when "Sample()" is called on a which has
already been prompted and before "Infer()" has been called on the associated
.
This exception is thrown when "Sample()" is called on a which was not
first prompted.
.
This exception is thrown when is called when = true
This exception is thrown when "Save()" is called on a which has
already been prompted and before "Infer()" has been called.
.
Save the state of a particular sequence to specified path. Also save some extra data which will be returned when loading.
Data saved with this method must be saved with
Load the state from the specified path into a particular sequence. Also reading header data. Must only be used with
data previously saved with
The main chat session class.
The filename for the serialized model state (KV cache, etc).
The filename for the serialized executor state.
The filename for the serialized chat history.
The filename for the serialized input transform pipeline.
The filename for the serialized output transform.
The filename for the serialized history transform.
The executor for this session.
The chat history for this session.
The history transform used in this session.
The input transform pipeline used in this session.
The output transform used in this session.
Create a new chat session and preprocess history.
The executor for this session
History for this session
History Transform for this session
A new chat session.
Create a new chat session.
The executor for this session
Create a new chat session with a custom history.
Use a custom history transform.
Add a text transform to the input transform pipeline.
Use a custom output transform.
Save a session from a directory.
Get the session state.
SessionState object representing session state in-memory
Load a session from a session state.
If true loads transforms saved in the session state.
Load a session from a directory.
If true loads transforms saved in the session state.
Add a message to the chat history.
Add a system message to the chat history.
Add an assistant message to the chat history.
Add a user message to the chat history.
Remove the last message from the chat history.
Compute KV cache for the message and add it to the chat history.
Compute KV cache for the system message and add it to the chat history.
Compute KV cache for the user message and add it to the chat history.
Compute KV cache for the assistant message and add it to the chat history.
Replace a user message with a new message and remove all messages after the new message.
This is useful when the user wants to edit a message and regenerate the response.
Chat with the model.
Chat with the model.
Chat with the model.
Chat with the model.
Regenerate the last assistant message.
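A hedged end-to-end sketch of a chat session built on an interactive executor, using the classes documented in this section; constructor overloads and property names may vary by version, and the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf") { ContextSize = 4096 };   // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a helpful assistant.");

var session = new ChatSession(executor, history);

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    AntiPrompts = new[] { "User:" },
};

await foreach (var text in session.ChatAsync(
                   new ChatHistory.Message(AuthorRole.User, "Hello, who are you?"),
                   inferenceParams))
{
    Console.Write(text);
}
```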
The state of a chat session in-memory.
Saved executor state for the session in JSON format.
Saved context state (KV cache) for the session.
The input transform pipeline used in this session.
The output transform used in this session.
The history transform used in this session.
The chat history messages for this session.
Create a new session state.
Save the session state to folder.
Load the session state from folder.
Thrown when the session state is incorrect
Role of the message author, e.g. user/assistant/system
Role is unknown
Message comes from a "system" prompt, not written by a user or language model
Message comes from the user
Message was generated by the language model
The chat history class
Chat message representation
Role of the message author, e.g. user/assistant/system
Message content
Create a new instance
Role of message author
Message content
List of messages in the chat
Create a new instance of the chat content class
Create a new instance of the chat history from array of messages
Add a message to the chat history
Role of the message author
Message content
Serialize the chat history to JSON
Deserialize a chat history from JSON
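A brief sketch of building and serializing a chat history with the members documented above:

```csharp
using LLama.Common;

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise assistant.");
history.AddMessage(AuthorRole.User, "Summarise what a KV cache is.");

// Round-trip through JSON.
string json = history.ToJson();
var restored = ChatHistory.FromJson(json);
```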
A queue with fixed storage size.
Currently it's only a naive implementation and needs to be further optimized in the future.
Number of items in this queue
Maximum number of items allowed in this queue
Create a new queue
the maximum number of items to store in this queue
Fill the queue with the data. Please ensure that data.Count <= size
Enqueue an element.
The parameters used for inference.
number of tokens to keep from initial prompt when applying context shifting
how many new tokens to predict (n_predict); set to -1 to generate indefinitely until generation completes.
Sequences where the model will stop generating further tokens.
Type of "mirostat" sampling to use.
https://github.com/basusourya/mirostat
Disable Mirostat sampling
Original mirostat algorithm
Mirostat 2.0 algorithm
The parameters for initializing a LLama model.
`Encoding` cannot be directly JSON serialized; instead the encoding name is stored as a string, which can be serialized.
The model path.
Base class for LLamaSharp runtime errors (i.e. errors produced by llama.cpp, converted into exceptions)
Create a new RuntimeError
Loading model weights failed
The model path which failed to load
`llama_decode` returned a non-zero status code
The return status code
`llama_decode` returned a non-zero status code
`llama_get_logits_ith` returned null, indicating that the index was invalid
The incorrect index passed to the `llama_get_logits_ith` call
Extension methods to the IContextParams interface
Convert the given `IModelParams` into a `LLamaContextParams`
Extension methods to the IModelParams interface
Convert the given `IModelParams` into a `LLamaModelParams`
Find the index of `item` in `list`
list to search
item to search for
Check if the given set of tokens ends with any of the given strings
Tokens to check
Strings to search for
Model to use to convert tokens into bytes
Encoding to use to convert bytes into characters
Check if the given set of tokens ends with any of the given strings
Tokens to check
Strings to search for
Model to use to convert tokens into bytes
Encoding to use to convert bytes into characters
Extensions to the KeyValuePair struct
Deconstruct a KeyValuePair into its constituent parts.
The KeyValuePair to deconstruct
First element, the Key
Second element, the Value
Type of the Key
Type of the Value
Run a process for a certain amount of time and then terminate it
return code, standard output, standard error, flag indicating if process exited or was terminated
Extensions to span which apply in-place normalization
In-place multiply every element by 32760 and divide every element in the span by the max absolute value in the span
The same array
In-place multiply every element by 32760 and divide every element in the span by the max absolute value in the span
The same span
In-place divide every element in the array by the sum of absolute values in the array
Also known as "Manhattan normalization".
The same array
In-place divide every element in the span by the sum of absolute values in the span
Also known as "Manhattan normalization".
The same span
In-place divide every element by the euclidean length of the vector
Also known as "L2 normalization".
The same array
In-place divide every element by the euclidean length of the vector
Also known as "L2 normalization".
The same span
Creates a new array containing an L2 normalization of the input vector.
The new normalized array
In-place apply p-normalization. https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm
- For p = 1, this is taxicab normalization
- For p = 2, this is euclidean normalization
- As p => infinity, this approaches infinity norm or maximum norm
The same array
In-place apply p-normalization. https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm
- For p = 1, this is taxicab normalization
- For p = 2, this is euclidean normalization
- As p => infinity, this approaches infinity norm or maximum norm
The same span
A llama_context, which holds all the context required to interact with a model
Total number of tokens in the context
Dimension of embedding vectors
The context params set for this context
The native handle, which is used to be passed to the native APIs
Be careful how you use this!
The encoding set for this model to deal with text input.
Get or set the number of threads to use for generation
Get or set the number of threads to use for batch processing
Get the maximum batch size for this context
Get the special tokens for the model associated with this context
Create a new LLamaContext for the given LLamaWeights
Tokenize a string.
Whether to add a bos to the text.
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
Detokenize the tokens to text.
Save the state to specified path.
Save the state of a particular sequence to specified path.
Get the state data as an opaque handle, which can be loaded later using
Use if you intend to save this state to disk.
Get the state data as an opaque handle, which can be loaded later using
Use if you intend to save this state to disk.
Load the state from specified path.
Load the state from specified path into a particular sequence
Load the state from memory.
Load the state from memory into a particular sequence
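A small sketch of the state save/load members described above, assuming a `LLamaContext` named `context` already exists (see the weights/context examples elsewhere in this document) and that the file name is a placeholder.

```csharp
// Persist the full context state (KV cache etc.) to disk...
context.SaveState("state.bin");

// ...and restore it later into a context created from the same model.
context.LoadState("state.bin");

// Alternatively, keep the state in memory as an opaque handle.
using var state = context.GetState();
context.LoadState(state);
```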
A tuple, containing the decode result, the number of tokens that have not been decoded yet and the total number of tokens that have been decoded.
The state of this context, which can be reloaded later
Get the size in bytes of this state object
Write all the bytes of this state to the given stream
Write all the bytes of this state to the given stream
Load a state from a stream
Load a state from a stream
The state of a single sequence, which can be reloaded later
Get the size in bytes of this state object
Copy bytes to a destination pointer.
Destination to write to
Length of the destination buffer
Offset from start of src to start copying from
Number of bytes written to destination
Generate high dimensional embedding vectors from text
Dimension of embedding vectors
LLama Context
Create a new embedder, using the given LLamaWeights
Get high dimensional embedding vectors for the given text. Depending on the pooling type used when constructing this embedder, this may return an embedding vector per token, or one single embedding vector for the entire string.
Embedding vectors are not normalized; consider using one of the normalization extensions.
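A hedged sketch of generating embeddings with the embedder described above. The return type of `GetEmbeddings` differs between LLamaSharp versions (a single vector vs. a list of vectors); the version assumed here returns a list, and the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/embedding-model.gguf")   // placeholder path
{
    Embeddings = true, // the context must be created in embeddings mode
};
using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);

// Depending on the pooling type this is one vector per token,
// or a single vector for the whole text.
var embeddings = await embedder.GetEmbeddings("The quick brown fox");
Console.WriteLine($"Vectors: {embeddings.Count}, dimensions: {embeddings[0].Length}");
```

The resulting vectors can then be normalized with the span/array normalization extensions documented earlier.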
The base class for stateful LLama executors.
The logger used by this executor.
The tokens that were already processed by the model.
The tokens that were consumed by the model during the current inference.
The path of the session file.
A container for the tokens to be processed and those already processed.
A container for the tokens of input.
The last tokens generated by the model.
The context used by the executor.
This API is currently not verified.
This API has not yet been verified.
After running out of the context, take some tokens from the original prompt and recompute the logits in batches.
Try to reuse the matching prefix from the session file.
Decide whether to continue the loop.
Preprocess the inputs before the inference.
Do some post processing after the inference.
The core inference logic.
Save the current state to a file.
Get the current state data.
Load the state from data.
Load the state from a file.
Execute the inference.
The prompt. If null, generation will continue where it left off previously.
Asynchronously runs a prompt through the model to compute KV cache without generating any new tokens.
It can reduce the latency of the first response if the first input from the user is not immediate.
Prompt to process
State arguments that are used in single inference
Number of tokens remaining to be used (n_remain)
The LLama executor for instruct mode.
The descriptor of the state of the instruct executor.
Whether the executor is running for the first time (running the prompt).
Instruction prefix tokens.
Instruction suffix tokens.
The LLama executor for interactive mode.
Define whether to continue the loop to generate responses.
Return whether to break the generation.
The descriptor of the state of the interactive executor.
Whether the executor is running for the first time (running the prompt).
The quantizer to quantize the model.
Quantize the model.
The model file to be quantized.
The path to save the quantized model.
The type of quantization.
Number of threads to use during quantization. By default this is the number of physical cores.
Whether the quantization is successful.
Quantize the model.
The model file to be quantized.
The path to save the quantized model.
The type of quantization.
Number of threads to use during quantization. By default this is the number of physical cores.
Whether the quantization is successful.
Parse a string into a LLamaFtype. This is a "relaxed" parsing, which allows any string which is contained within
the enum name to be used.
For example "Q5_K_M" will convert to "LLAMA_FTYPE_MOSTLY_Q5_K_M"
This executor infers the input as a one-time job. Previous inputs won't impact the response to the current input.
The context used by the executor when running the inference.
If true, applies the default template to the prompt, as defined in the rules for llama_chat_apply_template.
The system message to use with the prompt. Only used when is true.
Create a new stateless executor which will use the given model
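A short usage sketch for the stateless executor (each call is an independent one-shot job, as described above); the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf");   // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);

var executor = new StatelessExecutor(weights, parameters);

await foreach (var piece in executor.InferAsync(
                   "Q: What is the capital of France?\nA:",
                   new InferenceParams { MaxTokens = 32, AntiPrompts = new[] { "Q:" } }))
{
    Console.Write(piece);
}
```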
Converts a sequence of messages into text according to a model template
Custom template. May be null if a model was supplied to the constructor.
Keep a cache of roles converted into bytes. Roles are very frequently re-used, so this saves converting them many times.
Array of messages. The property indicates how many messages there are
Backing field for
Temporary array of messages in the format llama.cpp needs, used when applying the template
Indicates how many bytes are in array
Result bytes of last call to
Indicates if this template has been modified and needs regenerating
The encoding algorithm to use
Number of messages added to this template
Get the message at the given index
Thrown if index is less than zero or greater than or equal to
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
Construct a new template, using the default model template
Construct a new template, using the default model template
Construct a new template, using a custom template.
Only support a pre-defined list of templates. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Add a new message to the end of this template
This template, for chaining calls.
Add a new message to the end of this template
This template, for chaining calls.
Remove a message at the given index
This template, for chaining calls.
Remove all messages from the template and resets internal state to accept/generate new messages
Apply the template to the messages and return a span containing the results
A span over the buffer that holds the applied template
A message that has been added to a template
The "role" string for this message
The text content of this message
Deconstruct this message into role and content
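A hedged sketch of applying a model's chat template with the members documented above (`Add`, the assistant-start flag, and `Apply`). The property name for the assistant flag (`AddAssistant`) and the span-returning `Apply()` overload are assumptions to verify against your version; `weights` is a previously loaded `LLamaWeights` instance.

```csharp
using System.Text;
using LLama;

// Build a prompt from role/content pairs using the model's built-in template.
var template = new LLamaTemplate(weights)
{
    AddAssistant = true, // end with the token(s) that start an assistant message
};

template.Add("system", "You are a helpful assistant.");
template.Add("user", "Write a haiku about autumn.");

// Apply the template and decode the resulting bytes.
var prompt = Encoding.UTF8.GetString(template.Apply());
```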
A class that contains all the transforms provided internally by LLama.
The default history transform.
Uses plain text with the following format:
[Author]: [Message]
Drop the name at the beginning and the end of the text.
A text input transform that only trims the text.
A no-op text input transform.
A text output transform that removes the keywords from the response.
Keywords that you want to remove from the response.
This property is used for JSON serialization.
Maximum length of the keywords.
This property is used for JSON serialization.
If set to true, when getting a matched keyword, all the related tokens will be removed.
Otherwise only the part of keyword will be removed.
This property is used for JSON serialization.
JSON constructor.
Keywords that you want to remove from the response.
The extra length when searching for the keyword. For example, if your only keyword is "highlight",
maybe the token you get is "\r\nhighligt". In this condition, if redundancyLength=0, the token cannot be successfully matched because the length of "\r\nhighligt" (10)
has already exceeded the maximum length of the keywords (8). On the contrary, setting redundancyLength >= 2 leads to a successful match.
The larger the redundancyLength is, the lower the processing speed. But in practice it won't introduce much performance impact when redundancyLength <= 5
If set to true, when getting a matched keyword, all the related tokens will be removed. Otherwise only the part of keyword will be removed.
A set of model weights, loaded into memory.
The native handle, which is used in the native APIs
Be careful how you use this!
Total number of tokens in the context
Get the size of this model in bytes
Get the number of parameters in this model
Dimension of embedding vectors
Get the special tokens of this model
All metadata keys in this model
Load weights into memory
Load weights into memory
Parameters to use to load the model
A cancellation token that can interrupt model loading
Receives progress updates as the model loads (0 to 1)
Thrown if weights failed to load for any reason. e.g. Invalid file format or loading cancelled.
Thrown if the cancellation token is cancelled.
Create a llama_context using this model
Convert a string of text into tokens
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
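A sketch of the two loading paths described above, including async loading with progress reporting (0 to 1); signature details may vary by version and the model path is a placeholder.

```csharp
using System;
using System.Threading;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf") { GpuLayerCount = 20 };   // placeholder path

// Synchronous load.
using var weights = LLamaWeights.LoadFromFile(parameters);

// Asynchronous load with cancellation and progress reporting.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var progress = new Progress<float>(p => Console.WriteLine($"Loading: {p:P0}"));
using var weights2 = await LLamaWeights.LoadFromFileAsync(parameters, cts.Token, progress);
```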
A set of llava model weights (mmproj), loaded into memory.
The native handle, which is used in the native APIs
Be careful how you use this!
Load weights into memory
path to the "mmproj" model file
Load weights into memory
path to the "mmproj" model file
Create the Image Embeddings from the bytes of an image.
Image bytes. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create the Image Embeddings.
Image in binary format (it supports jpeg format only)
Number of threads to use
return the SafeHandle of these embeddings
Create the Image Embeddings from the bytes of an image.
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create the Image Embeddings from the bytes of an image.
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Eval the image embeddings
Return codes from llama_decode
An unspecified error
Ok.
Could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
Return codes from llama_encode
An unspecified error
Ok.
Possible GGML quantisation types
Full 32 bit float
16 bit float
4 bit float
4 bit float
5 bit float
5 bit float
8 bit float
8 bit float
"type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weight.
Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
"type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights.
Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
"type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights.
Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
"type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
"type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights.
Scales are quantized with 8 bits. This ends up using 6.5625 bpw
"type-0" 8-bit quantization. Only used for quantizing intermediate results.
The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
Integer, 8 bit
Integer, 16 bit
Integer, 32 bit
The value of this entry is the count of the number of possible quant types.
llama_split_mode
Single GPU
Split layers and KV across GPUs
split layers and KV across GPUs, use tensor parallelism if supported
Disposes all contained disposables when this class is disposed
llama_attention_type
A batch allows submitting multiple tokens to multiple sequences simultaneously
Keep a list of where logits can be sampled from
Get the number of logit positions that will be generated from this batch
The number of tokens in this batch
Maximum number of tokens that can be added to this batch (automatically grows if exceeded)
Maximum number of sequences a token can be assigned to (automatically grows if exceeded)
Create a new batch for submitting inputs to llama.cpp
Add a single token to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The token to add
The position to add it at
The set of sequences to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a single token to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The token to add
The position to add it at
The set of sequences to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a single token to the batch at a certain position for a single sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The token to add
The position to add it at
The sequence to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a range of tokens to a single sequence, start at the given position.
The tokens to add
The starting position to add tokens at
The sequence to add this token to
Whether the final token should generate logits
The index that the final token was added at. Use this for GetLogitsIth
Set TokenCount to zero for this batch
Get the positions where logits can be sampled from
An embeddings batch allows submitting embeddings to multiple sequences simultaneously
Keep a list of where logits can be sampled from
Get the number of logit positions that will be generated from this batch
Size of an individual embedding
The number of items in this batch
Maximum number of items that can be added to this batch (automatically grows if exceeded)
Maximum number of sequences an item can be assigned to (automatically grows if exceeded)
Create a new batch for submitting inputs to llama.cpp
Add a single embedding to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The embedding to add
The position to add it at
The set of sequences to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a single embedding to the batch for a single sequence
The index that the token was added at. Use this for GetLogitsIth
Called by embeddings batch to write embeddings into a destination span
Type of user data parameter passed in
Destination to write data to. Entire destination must be filled!
User data parameter passed in
Add a single embedding to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
Type of userdata passed to write delegate
Userdata passed to write delegate
Delegate called once to write data into a span
Position to write this embedding to
All sequences to assign this embedding to
Whether logits should be generated for this embedding
The index that the token was added at. Use this for GetLogitsIth
Add a single embedding to the batch at a position for one sequence
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
Type of userdata passed to write delegate
Userdata passed to write delegate
Delegate called once to write data into a span
Position to write this embedding to
Sequence to assign this embedding to
Whether logits should be generated for this embedding
The index that the token was added at. Use this for GetLogitsIth
Set EmbeddingsCount to zero for this batch
Get the positions where logits can be sampled from
llama_chat_message
Pointer to the null terminated bytes that make up the role string
Pointer to the null terminated bytes that make up the content string
Called by llama.cpp with a progress value between 0 and 1
If the provided progress_callback returns true, model loading continues.
If it returns false, model loading is immediately aborted.
llama_progress_callback
A C# representation of the llama.cpp `llama_context_params` struct
changing the default values of parameters marked as [EXPERIMENTAL] may cause crashes or incorrect results in certain configurations
https://github.com/ggerganov/llama.cpp/pull/7544
text context, 0 = from model
logical maximum batch size that can be submitted to llama_decode
physical maximum batch size
max number of sequences (i.e. distinct states for recurrent models)
number of threads to use for generation
number of threads to use for batch processing
RoPE scaling type, from `enum llama_rope_scaling_type`
whether to pool (sum) embedding results by sequence id
Attention type to use for embeddings
RoPE base frequency, 0 = from model
RoPE frequency scaling factor, 0 = from model
YaRN extrapolation mix factor, negative = from model
YaRN magnitude scaling factor
YaRN low correction dim
YaRN high correction dim
YaRN original context size
defragment the KV cache if holes/size > defrag_threshold, Set to < 0 to disable (default)
ggml_backend_sched_eval_callback
User data passed into cb_eval
data type for K cache. EXPERIMENTAL
data type for V cache. EXPERIMENTAL
Deprecated!
if true, extract embeddings (together with logits)
whether to offload the KQV ops (including the KV cache) to GPU
whether to use flash attention. EXPERIMENTAL
whether to measure performance timings
ggml_abort_callback
User data passed into abort_callback
Get the default LLamaContextParams
Supported model file types
C# representation of llama_ftype
All f32
Benchmark@7B: 26GB
Mostly f16
Benchmark@7B: 13GB
Mostly 8 bit
Benchmark@7B: 6.7GB, +0.0004ppl
Mostly 4 bit
Benchmark@7B: 3.50GB, +0.2499 ppl
Mostly 4 bit
Benchmark@7B: 3.90GB, +0.1846 ppl
Mostly 5 bit
Benchmark@7B: 4.30GB @ 7B tokens, +0.0796 ppl
Mostly 5 bit
Benchmark@7B: 4.70GB, +0.0415 ppl
K-Quant 2 bit
Benchmark@7B: 2.67GB @ 7B parameters, +0.8698 ppl
K-Quant 3 bit (Small)
Benchmark@7B: 2.75GB, +0.5505 ppl
K-Quant 3 bit (Medium)
Benchmark@7B: 3.06GB, +0.2437 ppl
K-Quant 3 bit (Large)
Benchmark@7B: 3.35GB, +0.1803 ppl
K-Quant 4 bit (Small)
Benchmark@7B: 3.56GB, +0.1149 ppl
K-Quant 4 bit (Medium)
Benchmark@7B: 3.80GB, +0.0535 ppl
K-Quant 5 bit (Small)
Benchmark@7B: 4.33GB, +0.0353 ppl
K-Quant 5 bit (Medium)
Benchmark@7B: 4.45GB, +0.0142 ppl
K-Quant 6 bit
Benchmark@7B: 5.15GB, +0.0044 ppl
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
File type was not specified
A safe handle for a LLamaKvCacheView
Number of KV cache cells. This will be the same as the context size.
Get the total number of tokens in the KV cache.
For example, if there are two populated
cells, the first with 1 sequence id in it and the second with 2 sequence
ids then you'll have 3 tokens.
Maximum number of sequences visible for a cell. There may be more sequences than this
in reality, this is simply the maximum number this view can see.
Number of populated cache cells
Maximum contiguous empty slots in the cache.
Index to the start of the MaxContiguous slot range. Can be negative when cache is full.
Initialize a LLamaKvCacheViewSafeHandle which will call `llama_kv_cache_view_free` when disposed
Allocate a new KV cache view which can be used to inspect the KV cache
The maximum number of sequences visible in this view per cell
Read the current KV cache state into this view.
Get the raw KV cache view
Get the cell at the given index
The index of the cell [0, CellCount)
Data about the cell at the given index
Thrown if index is out of range (0 <= index < CellCount)
Get all of the sequences assigned to the cell at the given index. The returned span always contains the maximum number of
sequences per cell for this view, even if the cell actually has more sequences than that; allocate a new view with a larger maxSequences parameter
if necessary. Invalid sequences will be negative values.
The index of the cell [0, CellCount)
A span containing the sequences assigned to this cell
Thrown if index is out of range (0 <= index < CellCount)
Create an empty KV cache view. (use only for debugging purposes)
Free a KV cache view. (use only for debugging purposes)
Update the KV cache view structure with the current state of the KV cache. (use only for debugging purposes)
Information associated with an individual cell in the KV cache view (llama_kv_cache_view_cell)
The position for this cell. Takes KV cache shifts into account.
May be negative if the cell is not populated.
An updateable view of the KV cache (llama_kv_cache_view)
Number of KV cache cells. This will be the same as the context size.
Maximum number of sequences that can exist in a cell. It's not an error
if there are more sequences in a cell than this value, however they will
not be visible in the view cells_sequences.
Number of tokens in the cache. For example, if there are two populated
cells, the first with 1 sequence id in it and the second with 2 sequence
ids then you'll have 3 tokens.
Number of populated cache cells.
Maximum contiguous empty slots in the cache.
Index to the start of the max_contiguous slot range. Can be negative
when cache is full.
Information for an individual cell.
The sequences for each cell. There will be n_seq_max items per cell.
Severity level of a log message. This enum should always be aligned with
the one defined on llama.cpp side at
https://github.com/ggerganov/llama.cpp/blob/0eb4e12beebabae46d37b78742f4c5d4dbe52dc1/ggml/include/ggml.h#L559
Logs are never written.
Logs that are used for interactive investigation during development.
Logs that track the general flow of the application.
Logs that highlight an abnormal or unexpected event in the application flow, but do not otherwise cause the application execution to stop.
Logs that highlight when the current flow of execution is stopped due to a failure.
Continue log level is equivalent to None in the way it is used in llama.cpp.
Keeps track of the previous log level to be able to handle the Continue log level.
Override a key/value pair in the llama model metadata (llama_model_kv_override)
Key to override
Type of value
Add 4 bytes of padding, to align the next fields to 8 bytes
Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_INT
Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_FLOAT
Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_BOOL
Value, **must** only be used if Tag == String
Specifies what type of value is being overridden by LLamaModelKvOverride
llama_model_kv_override_type
Overriding an int value
Overriding a float value
Overriding a bool value
Overriding a string value
A C# representation of the llama.cpp `llama_model_params` struct
NULL-terminated list of devices to use for offloading (if NULL, all available devices are used)
todo: add support for llama_model_params.devices
number of layers to store in VRAM
how to split the model across multiple GPUs
the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE
how to split layers across multiple GPUs (size: )
called with a progress value between 0 and 1, pass NULL to disable. If the provided progress_callback
returns true, model loading continues. If it returns false, model loading is immediately aborted.
context pointer passed to the progress callback
override key-value pairs of the model meta data
only load the vocabulary, no weights
use mmap if possible
force system to keep model in RAM
validate model tensor data
Create a LLamaModelParams with default values
Quantizer parameters used in the native API
llama_model_quantize_params
number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency()
quantize to this llama_ftype
output tensor type
token embeddings tensor type
allow quantizing non-f32/f16 tensors
quantize output.weight
only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
quantize all tensors to the default type
quantize to the same number of shards
pointer to importance matrix data
pointer to vector containing overrides
Create a LLamaModelQuantizeParams with default values
Input data for llama_decode
A llama_batch object can contain input about one or many sequences
The provided arrays (i.e. token, embd, pos, etc.) must have size of n_tokens
The number of items pointed at by pos, seq_id and logits.
Either `n_tokens` of `llama_token`, or `NULL`, depending on how this batch was created
Either `n_tokens * embd * sizeof(float)` or `NULL`, depending on how this batch was created
the positions of the respective token in the sequence
(if set to NULL, the token position will be tracked automatically by llama_decode)
https://github.com/ggerganov/llama.cpp/blob/master/llama.h#L139 ???
the sequence to which the respective token belongs
(if set to NULL, the sequence ID will be assumed to be 0)
if zero, the logits for the respective token will not be output
(if set to NULL, only the logits for last token will be returned)
llama_pooling_type
No specific pooling type. Use the model default if this is specified in the context params.
Do not pool embeddings (per-token embeddings)
Take the mean of every token embedding
Return the embedding for the special "CLS" token
Used by reranking models to attach the classification head to the graph
Indicates position in a sequence
The raw value
Create a new LLamaPos
Convert a LLamaPos into an integer (extract the raw value)
Convert an integer into a LLamaPos
Increment this position
Increment this position
llama_rope_type
ID for a sequence in a batch
LLamaSeqId with value 0
The raw value
Create a new LLamaSeqId
Convert a LLamaSeqId into an integer (extract the raw value)
Convert an integer into a LLamaSeqId
LLama performance information
llama_perf_context_data
Timestamp when reset was last called
Loading milliseconds
total milliseconds spent prompt processing
Total milliseconds in eval/decode calls
number of tokens in eval calls for the prompt (with batch size > 1)
number of eval calls
Timestamp when reset was last called
Time spent loading
total milliseconds spent prompt processing
Total milliseconds in eval/decode calls
number of tokens in eval calls for the prompt (with batch size > 1)
number of eval calls
LLama performance information
llama_perf_sampler_data
A single token
Token Value used when token is inherently null
The raw value
Create a new LLamaToken
Convert a LLamaToken into an integer (extract the raw value)
Convert an integer into a LLamaToken
Get attributes for this token
Get attributes for this token
Get score for this token
Check if this is a control token
Check if this is a control token
Check if this token should end generation
Check if this token should end generation
Token attributes
C# equivalent of llama_token_attr
A single token along with probability of this token being selected
token id
log-odds of the token
probability of the token
Create a new LLamaTokenData
Contains an array of LLamaTokenData, potentially sorted.
The LLamaTokenData
Indicates if `data` is sorted by logits in descending order. If this is false the token data is in _no particular order_.
Create a new LLamaTokenDataArray
Create a new LLamaTokenDataArray, copying the data from the given logits
Create a new LLamaTokenDataArray, copying the data from the given logits into temporary memory.
The memory must not be modified while this is in use.
Temporary memory which will be used to work on these logits. Must be at least as large as logits array
Overwrite the logit values for all given tokens
tuples of token and logit value to overwrite
Sorts candidate tokens by their logits in descending order and calculate probabilities based on logits.
Contains a pointer to an array of LLamaTokenData which is pinned in memory.
C# equivalent of llama_token_data_array
A pointer to an array of LlamaTokenData
Memory must be pinned in place for all the time this LLamaTokenDataArrayNative is in use (i.e. `fixed` or `.Pin()`)
Number of LLamaTokenData in the array
The index in the array (i.e. not the token id)
A pointer to an array of LlamaTokenData
Indicates if the items in the array are sorted, so the most likely token is first
The index of the selected token (i.e. not the token id)
Number of LLamaTokenData in the array. Set this to shrink the array
Create a new LLamaTokenDataArrayNative around the data in the LLamaTokenDataArray
Data source
Created native array
A memory handle, pinning the data in place until disposed
C# equivalent of llama_vocab struct. This struct is an opaque type, with no fields in the API and is only used for typed pointers.
Get attributes for a specific token
Check if the token is supposed to end generation (end-of-generation, e.g. EOS, EOT, etc.)
Identify if Token Id is a control token or a render-able token
beginning-of-sentence
end-of-sentence
end-of-turn
sentence separator
next-line
padding
llama_vocab_pre_type
llama_vocab_type
For models without vocab
LLaMA tokenizer based on byte-level BPE with byte fallback
GPT-2 tokenizer based on byte-level BPE
BERT tokenizer based on WordPiece
T5 tokenizer based on Unigram
RWKV tokenizer based on greedy tokenization
LLaVa Image embeddings
llava_image_embed
Set configurations for all the native libraries, including LLama and LLava
Set configurations for all the native libraries, including LLama and LLava
Configuration for LLama native library
Configuration for LLava native library
Check if the native library has already been loaded. Configuration cannot be modified if this is true.
Set the log callback that will be used for all llama.cpp log messages
Set the log callback that will be used for all llama.cpp log messages
Try to load the native library with the current configurations, but do not actually set it as the loaded library.
You can still modify the configuration after calling this, but only before any call to the native API.
The loaded library. If loading failed, this will be null.
However, if you are using .NET Standard 2.0, this will never be null.
Whether the load was successful.
A class to set the same configuration for multiple libraries at the same time.
Do an action for all the configs in this container.
Set the log callback that will be used for all llama.cpp log messages
Set the log callback that will be used for all llama.cpp log messages
Try to load the native library with the current configurations, but do not actually set it as the loaded library.
You can still modify the configuration after calling this, but only before any call to the native API.
Whether the load was successful.
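A hedged sketch of configuring native library loading before any other LLamaSharp call is made. Method names such as `WithCuda`, `WithAutoFallback` and `WithLogCallback` reflect common versions of this configuration API and should be verified against yours.

```csharp
using System;
using LLama.Native;

// Must run before the first native call; afterwards the configuration is locked.
NativeLibraryConfig.All
    .WithCuda()          // prefer a CUDA build if one is available
    .WithAutoFallback()  // fall back to other builds if loading fails
    .WithLogCallback((level, message) => Console.Write($"[{level}] {message}"));
```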
The name of the native library
The native library compiled from llama.cpp.
The native library compiled from the LLaVA example of llama.cpp.
A native library specified with a local file path.
Information of a native library file.
Which kind of library it is.
Whether it's compiled with cublas.
Whether it's compiled with vulkan.
Which AvxLevel it's compiled with.
Information of a native library file.
Which kind of library it is.
Whether it's compiled with cublas.
Whether it's compiled with vulkan.
Which AvxLevel it's compiled with.
Which kind of library it is.
Whether it's compiled with cublas.
Whether it's compiled with vulkan.
Which AvxLevel it's compiled with.
Avx support configuration
No AVX
Advanced Vector Extensions (supported by most processors after 2011)
AVX2 (supported by most processors after 2013)
AVX512 (supported by some processors after 2016, not widely supported)
Try to load libllama/llava_shared, using CPU feature detection to try and load a more specialised DLL if possible
The library handle to unload later, or IntPtr.Zero if no library was loaded
Operating system information.
Operating system information.
Get the system information of the current machine.
When you are using .NET Standard 2.0, dynamic native library loading is not supported.
An instance of this class will be returned in that case.
A LoRA adapter which can be applied to a context for a specific model
The model which this LoRA adapter was loaded with.
The full path of the file this adapter was loaded from
Native pointer of the loaded adapter, will be automatically freed when the model is unloaded
Indicates if this adapter has been unloaded
Unload this adapter
Direct translation of the llama.cpp API
A method that does nothing. This is a native method, calling it will force the llama native dependencies to be loaded.
Call once at the end of the program - currently only used for MPI
Get the maximum number of devices supported by llama.cpp
Check if memory mapping is supported
Check if memory locking is supported
Check if GPU offload is supported
Check if RPC offload is supported
Initialize the llama + ggml backend. Call once at the start of the program.
This is private because LLamaSharp automatically calls it, and it's only valid to call it once!
Load session file
Save session file
Set whether to use causal attention or not. If set to true, the model will only attend to the past tokens
Set whether the model is in embeddings mode or not.
If true, embeddings will be returned but logits will not
Set abort callback
Get the n_seq_max for this context
Get all output token embeddings.
When pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model, the embeddings for which
llama_batch.logits[i] != 0 are stored contiguously in the order they have appeared in the batch.
shape: [n_outputs*n_embd]
Otherwise, returns an empty span.
Apply chat template. Inspired by hf apply_chat_template() on python.
A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.
Pointer to a list of multiple llama_chat_message
Number of llama_chat_message in this chat
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)
The size of the allocated buffer
The total number of bytes of the formatted prompt. If it is larger than the size of the buffer, you may need to re-alloc the buffer and then re-apply the template.
Get list of built-in chat templates
Print out timing information for this context
Print system information
Convert a single token into text
buffer to write string into
User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')
If true, special tokens are rendered in the output
The length written, or if the buffer is too small a negative that indicates the length required
Convert text into tokens
The tokens pointer must be large enough to hold the resulting tokens.
add_special: allow adding BOS and EOS tokens if the model is configured to do so.
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
Convert the provided tokens into text (inverse of llama_tokenize()).
The char pointer must be large enough to hold the resulting text.
remove_special: allow removing BOS and EOS tokens if the model is configured to do so.
unparse_special If true, special tokens are rendered in the output.
Returns the number of chars/bytes on success, no more than textLengthMax. Returns a negative number on failure - the number of chars/bytes that would have been returned.
Register a callback to receive llama log messages
Returns the number of tokens in the KV cache (slow, use only for debug)
If a KV cell has multiple sequences assigned to it, it will be counted multiple times
Returns the number of used KV cells (i.e. have at least one sequence assigned to them)
Clear the KV cache. Both cell info is erased and KV data is zeroed
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
Returns false if a partial sequence cannot be removed. Removing a whole sequence never fails
Copy all tokens that belong to the specified sequence to another sequence
Note that this does not allocate extra KV cache memory - it simply assigns the tokens to the new sequence
Removes all tokens that do not belong to the specified sequence
Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1)
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Integer division of the positions by factor of `d > 1`
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
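A sketch of a typical "context shift" built from the sequence operations above: drop the oldest tokens of a sequence, then slide the remaining positions back. The wrapper names mirror the native llama_kv_cache_seq_* calls but are assumptions here.
// Remove positions [keep, keep + nDiscard) from sequence 0.
KvCacheSequenceRemove(seq: 0, p0: keep, p1: keep + nDiscard);
// Shift everything after the removed window back by nDiscard positions. With a RoPEd
// cache this is applied lazily on the next decode, or explicitly via the KV cache update call.
KvCacheSequenceAdd(seq: 0, p0: keep + nDiscard, p1: -1, delta: -nDiscard);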
Returns the largest position present in the KV cache for the specified sequence
Allocates a batch of tokens on the heap
Each token can be assigned up to n_seq_max sequence ids
The batch has to be freed with llama_batch_free()
If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float)
Otherwise, llama_batch.token will be allocated to store n_tokens llama_token
The rest of the llama_batch members are allocated with size n_tokens
All members are left uninitialized
Each token can be assigned up to n_seq_max sequence ids
Frees a batch of tokens allocated with llama_batch_init()
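A sketch of the allocate/use/free lifecycle described above; the extern declarations for llama_batch_init and llama_batch_free are assumed to be exposed by the native API layer.
var batch = llama_batch_init(n_tokens: 512, embd: 0, n_seq_max: 1);
try
{
    // Fill batch.token, batch.pos, batch.seq_id and batch.logits here;
    // llama_batch_init leaves all members uninitialized.
}
finally
{
    llama_batch_free(batch);
}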
Apply a loaded control vector to a llama_context, or if data is NULL, clear
the currently loaded vector.
n_embd should be the size of a single layer's control, and data should point
to an n_embd x n_layers buffer starting from layer 1.
il_start and il_end are the layer range the vector should apply to (both inclusive)
See llama_control_vector_load in common to load a control vector.
Build a split GGUF final path for this chunk.
llama_split_path(split_path, sizeof(split_path), "/models/ggml-model-q4_0", 2, 4) => split_path = "/models/ggml-model-q4_0-00002-of-00004.gguf"
Returns the split_path length.
Extract the path prefix from the split_path if and only if the split_no and split_count match.
llama_split_prefix(split_prefix, 64, "/models/ggml-model-q4_0-00002-of-00004.gguf", 2, 4) => split_prefix = "/models/ggml-model-q4_0"
Returns the split_prefix length.
Sanity check for clip <-> llava embed size match
LLama Context
Llava Model
True if validate successfully
Build an image embed from image file bytes
SafeHandle to the Clip Model
Number of threads
Binary image in jpeg format
Bytes length of the image
SafeHandle to the Embeddings
Build an image embed from a path to an image filename
SafeHandle to the Clip Model
Number of threads
Image filename (jpeg) to generate embeddings
SafeHandle to the embeddings
Free an embedding made with llava_image_embed_make_*
Embeddings to release
Write the image represented by embed into the llama context with batch size n_batch, starting at context
pos n_past. On completion, n_past points to the next position in the context after the image embed.
Llama Context
Embedding handle
True on success
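A sketch of the llava flow described above. The wrapper names are assumptions; the underlying native calls are llava_image_embed_make_with_bytes, llava_eval_image_embed and llava_image_embed_free.
var nPast = 0;
var embed = MakeImageEmbedFromBytes(clipModel, threads: 4, imageBytes, imageBytes.Length);
try
{
    // On success nPast has been advanced past the image embed.
    var ok = EvalImageEmbed(context, embed, nBatch: 512, ref nPast);
}
finally
{
    FreeImageEmbed(embed);
}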
Get the loaded native library. If you are using netstandard2.0, it will always return null.
Returns 0 on success
Returns 0 on success
Configure llama.cpp logging
Callback from llama.cpp with log messages
Register a callback to receive llama log messages
A GC handle for the current log callback to ensure the callback is not collected
Register a callback to receive llama log messages
Register a callback to receive llama log messages
RoPE scaling type.
C# equivalent of llama_rope_scaling_type
No particular scaling type has been specified
Do not apply any RoPE scaling
Positional linear interpolation, as described by kaikendev: https://kaiokendev.github.io/til#extending-context-to-8k
YaRN scaling: https://arxiv.org/pdf/2309.00071.pdf
LongRope scaling
A safe wrapper around a llama_context
Total number of tokens in the context
Dimension of embedding vectors
Get the maximum batch size for this context
Get the physical maximum batch size for this context
Get or set the number of threads used for generation of a single token.
Get or set the number of threads used for prompt and batch processing (multiple tokens).
Get the pooling type for this context
Get the model which this context is using
Get the vocabulary for the model this context is using
Create a new llama_state for the given model
Create a new llama_context with the given model. **This should never be called directly! Always use SafeLLamaContextHandle.Create**!
Frees all allocated memory in the given llama_context
Set a callback which can abort computation
If this returns true computation is cancelled
Positive return values do not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error
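A sketch of handling the return values listed above; Decode stands in for the llama_decode binding and is an assumed wrapper name.
var result = Decode(batch);
if (result == 1)
{
    // Not fatal: no KV slot was found. Retry with a smaller batch,
    // or create the context with a larger context size.
}
else if (result < 0)
{
    throw new InvalidOperationException($"llama_decode failed with code {result}");
}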
Processes a batch of tokens with the encoder part of the encoder-decoder model. Stores the encoder output
internally for later use by the decoder cross-attention layers.
0 = success
< 0 = error
Set the number of threads used for decoding
n_threads is the number of threads used for generation (single token)
n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)
Get the number of threads used for generation of a single token.
Get the number of threads used for prompt and batch processing (multiple tokens).
Token logits obtained from the last call to llama_decode
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
Get the size of the context window for the model for this context
Get the batch size for this context
Get the ubatch size for this context
Returns the **actual** size in bytes of the state (logits, embedding and kv_cache).
Only use when saving the state, not when restoring it, otherwise the size may be too small.
Copies the state to the specified destination address.
Destination needs to have allocated enough memory.
the number of bytes copied
Set the state reading from the specified address
the number of bytes read
Get the exact size needed to copy the KV cache of a single sequence
Copy the KV cache of a single sequence into the specified buffer
Copy the sequence data (originally copied with `llama_state_seq_get_data`) into the specified sequence
- Positive: Ok
- Zero: Failed to load
Defragment the KV cache. This will be applied:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
Check if the context supports KV cache shifting
Wait until all computations are finished. This is automatically done when using any of the functions to obtain computation results,
so it is not necessary to call this explicitly in most cases.
Get the pooling type for this context
Get the embeddings for a sequence id.
Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE
when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[1] with the rank of the sequence
otherwise: float[n_embd] (1-dimensional)
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
Get the embeddings for the ith sequence.
Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
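A sketch of reading a pooled sequence embedding as described above; the P/Invoke signature for llama_get_embeddings_seq is assumed, and nEmbd is the context's embedding dimension.
unsafe
{
    float* embd = llama_get_embeddings_seq(ctx, seqId);   // null when pooling_type is NONE
    if (embd != null)
    {
        // With RANK pooling only embd[0] is meaningful; otherwise read n_embd floats.
        var vector = new ReadOnlySpan<float>(embd, nEmbd).ToArray();
    }
}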
Add a LoRA adapter to this context
Remove a LoRA adapter from this context
Indicates whether the LoRA was present in this context and was removed
Remove all LoRA adapters from this context
Token logits obtained from the last call to llama_decode.
The logits for the last token are stored in the last row.
Only tokens with `logits = true` requested are present.
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
The number of tokens whose logits should be retrieved, in [numTokens x n_vocab] format.
The tokens' order matches their order in the LLamaBatch (first tokens first, and so on).
This is helpful when requesting logits for many tokens in a sequence, or when decoding multiple sequences in one go.
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
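A sketch of mutating the logits of the most recent token before sampling, using the layout described above (the GetLogits wrapper name and shape are assumptions):
var logits = context.GetLogits(numTokens: 1);        // span over the last row
logits[bannedTokenId] = float.NegativeInfinity;      // make one token unselectable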
Get the embeddings for the ith sequence.
Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
Get the embeddings for a specific sequence.
Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
Convert the given text into tokens
The text to tokenize
Whether the "BOS" token should be added
Encoding to use for the text
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
Convert a single llama token into bytes
Token to decode
A span to attempt to write into. If this is too small nothing will be written
The size of this token. **nothing will be written** if this is larger than `dest`
This object exists to ensure there is only ever 1 inference running at a time. This is a workaround for thread safety issues in llama.cpp itself.
Most notably CUDA, which seems to use some global singleton resources and will crash if multiple inferences are run (even against different models).
For more information see these issues:
- https://github.com/SciSharp/LLamaSharp/issues/596
- https://github.com/ggerganov/llama.cpp/issues/3960
If these are ever resolved this lock can probably be removed.
Wait until all computations are finished. This is automatically done when using any of the functions to obtain computation results,
so it is not necessary to call this explicitly in most cases.
Processes a batch of tokens with the encoder part of the encoder-decoder model. Stores the encoder output
internally for later use by the decoder cross-attention layers.
0 = success
< 0 = error (the KV cache state is restored to the state before this call)
Positive return values do not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error (the KV cache state is restored to the state before this call)
Decode a set of tokens in batch-size chunks.
A tuple, containing the decode result and the number of tokens that have not been decoded yet.
Positive return values do not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error
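A sketch of using the tuple described above; the managed Decode overload shown here is an assumption about the wrapper's shape.
var (result, notDecoded) = context.Decode(tokens, sequenceId, batch, ref nPast);
if (notDecoded > 0)
{
    // The batch-size chunking stopped early; inspect `result` (e.g. no KV slot) before retrying.
}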
Get the size of the state, when saved as bytes
Get the size of the KV cache for a single sequence ID, when saved as bytes
Get the raw state of this context, encoded as bytes. Data is written into the `dest` pointer.
Destination to write to
Number of bytes available to write to in dest (check required size with `GetStateSize()`)
The number of bytes written to dest
Thrown if dest is too small
Get the raw state of a single sequence from this context, encoded as bytes. Data is written into the `dest` pointer.
Destination to write to
Number of bytes available to write to in dest (check required size with `GetStateSize()`)
The sequence to get state data for
The number of bytes written to dest
Set the raw state of this context
The pointer to read the state from
Number of bytes that can be safely read from the pointer
Number of bytes read from the src pointer
Set the raw state of a single sequence
The pointer to read the state from
Sequence ID to set
Number of bytes that can be safely read from the pointer
Number of bytes read from the src pointer
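A sketch of a save/restore round trip using the state calls described above (method names are assumed to match the managed wrappers):
var size = context.GetStateSize();
var buffer = new byte[(int)size];
unsafe
{
    fixed (byte* ptr = buffer)
    {
        var written = context.GetState(ptr, size);   // throws if the buffer is too small
        // ... later, restore into a context created from the same model:
        var read = context.SetState(ptr, written);
    }
}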
Get performance information
Reset all performance information for this context
Check if the context supports KV cache shifting
Apply KV cache updates (such as K-shifts, defragmentation, etc.)
Defragment the KV cache. This will be applied:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Get a new KV cache view that can be used to debug the KV cache
Count the number of used cells in the KV cache (i.e. have at least one sequence assigned to them)
Returns the number of tokens in the KV cache (slow, use only for debug)
If a KV cell has multiple sequences assigned to it, it will be counted multiple times
Clear the KV cache - both cell info is erased and KV data is zeroed
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
Copy all tokens that belong to the specified sequence to another sequence. Note that
this does not allocate extra KV cache memory - it simply assigns the tokens to the
new sequence
Removes all tokens that do not belong to the specified sequence
Adds relative position "delta" to all tokens that belong to the specified sequence
and have positions in [p0, p1). If the KV cache is RoPEd, the KV data is updated
accordingly
Integer division of the positions by factor of `d > 1`.
If the KV cache is RoPEd, the KV data is updated accordingly.
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
Returns the largest position present in the KV cache for the specified sequence
Base class for all llama handles to native resources
A reference to a set of llama model weights
Get the rope (positional embedding) type for this model
The number of tokens in the context that this model was trained for
Get the rope frequency this model was trained with
Dimension of embedding vectors
Get the size of this model in bytes
Get the number of parameters in this model
Get the number of layers in this model
Get the number of heads in this model
Returns true if the model contains an encoder that requires llama_encode() call
Returns true if the model contains a decoder that requires llama_decode() call
Returns true if the model is recurrent (like Mamba, RWKV, etc.)
Get a description of this model
Get the number of metadata key/value pairs
Get the vocabulary of this model
Load a model from the given file path into memory
Load the model from a file
If the file is split into multiple parts, the file name must follow this pattern: {name}-%05d-of-%05d.gguf
If the split file name does not follow this pattern, use llama_model_load_from_splits
The loaded model, or null on failure.
Load the model from multiple splits (support custom naming scheme)
The paths must be in the correct order
Apply a LoRA adapter to a loaded model
path_base_model is the path to a higher quality model to use as a base for
the layers modified by the adapter. Can be NULL to use the current loaded model.
The model needs to be reloaded before applying a new adapter, otherwise the adapter
will be applied on top of the previous one
Returns 0 on success
Frees all allocated memory associated with a model
Get the number of metadata key/value pairs
Get metadata key name by index
Model to fetch from
Index of key to fetch
buffer to write result into
The length of the string on success (even if the buffer is too small). -1 if the key does not exist.
Get metadata value as a string by index
Model to fetch from
Index of val to fetch
Buffer to write result into
The length of the string on success (even if the buffer is too small). -1 if the key does not exist.
Get metadata value as a string by key name
The length of the string on success, or -1 on failure
Get the number of tokens in the model vocabulary
Get the size of the context window for the model
Get the dimension of embedding vectors from this model
Get the number of layers in this model
Get the number of heads in this model
Get a string describing the model type
The length of the string on success (even if the buffer is too small), or -1 on failure
Get the size of the model in bytes
The size of the model
Get the number of parameters in this model
The functions return the length of the string on success, or -1 on failure
Get the model's RoPE frequency scaling factor
For encoder-decoder models, this function returns the id of the token that must be provided
to the decoder to start generating the output sequence. For other models, it returns -1.
Returns true if the model contains an encoder that requires llama_encode() call
Returns true if the model contains a decoder that requires llama_decode() call
Returns true if the model is recurrent (like Mamba, RWKV, etc.)
Load a LoRA adapter from file. The adapter will be associated with this model but will not be applied
Convert a single llama token into bytes
Token to decode
A span to attempt to write into. If this is too small nothing will be written
User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')
If true, special characters will be converted to text. If false they will be invisible.
The size of this token. **nothing will be written** if this is larger than `dest`
Convert a sequence of tokens into characters.
The section of the span which has valid data in it.
If there was insufficient space in the output span this will be
filled with as many characters as possible, starting from the _last_ token.
Convert a string of text into tokens
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
Create a new context for this model
Get the metadata value for the given key
The key to fetch
The value, null if there is no such key
Get the metadata key for the given index
The index to get
The key, null if there is no such key or if the buffer was too small
Get the metadata value for the given index
The index to get
The value, null if there is no such value or if the buffer was too small
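A sketch of enumerating model metadata with the index accessors described above (property and method names are assumptions about the managed wrapper):
for (var i = 0; i < model.MetadataCount; i++)
{
    var key = model.MetadataKeyByIndex(i);       // null if there is no such key
    var value = model.MetadataValueByIndex(i);   // null if there is no such value
    Console.WriteLine($"{key} = {value}");
}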
Get the default chat template. Returns nullptr if not available
If name is NULL, returns the default chat template
Get tokens for a model
Total number of tokens in this vocabulary
Get the type of this vocabulary
Get the Beginning of Sentence token for this model
Get the End of Sentence token for this model
Get the newline token for this model
Get the padding token for this model
Get the sentence separator token for this model
Codellama beginning of infill prefix
Codellama beginning of infill middle
Codellama beginning of infill suffix
Codellama pad
Codellama rep
Codellama rep
end-of-turn token
For encoder-decoder models, this function returns the id of the token that must be provided
to the decoder to start generating the output sequence.
Check if the current model requires a BOS token added
Check if the current model requires an EOS token added
A chain of sampler stages that can be used to select tokens from logits.
Wraps a handle returned from `llama_sampler_chain_init`. Other samplers are owned by this chain and are never directly exposed.
Get the number of samplers in this chain
Apply this sampler to a set of candidates
Sample and accept a token from the idx-th output of the last evaluation. Shorthand for:
var logits = ctx.GetLogitsIth(idx);
var token_data_array = LLamaTokenDataArray.Create(logits);
using var _ = LLamaTokenDataArrayNative.Create(token_data_array, out var native_token_data);
sampler_chain.Apply(native_token_data);
var token = native_token_data.Data.Span[native_token_data.Selected];
sampler_chain.Accept(token);
return token;
Reset the state of this sampler
Accept a token and update the internal state of this sampler
Get the name of the sampler at the given index
Get the seed of the sampler at the given index if applicable. Returns LLAMA_DEFAULT_SEED otherwise.
Create a new sampler chain
Clone a sampler stage from another chain and add it to this chain
The chain to clone a stage from
The index of the stage to clone
Remove a sampler stage from this chain
Add a custom sampler stage
Add a sampler which picks the most likely token.
Add a sampler which picks from the probability distribution of all tokens
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
The number of tokens considered in the estimation of `s_hat`. This is an arbitrary value that is used to calculate `s_hat`, which in turn helps to calculate the value of `k`. In the paper, they use `m = 100`, but you can experiment with different values to see how it affects the performance of the algorithm.
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
Apply temperature to the logits.
If temperature is less than zero the maximum logit is left unchanged and the rest are set to -infinity
Dynamic temperature implementation (a.k.a. entropy) described in the paper https://arxiv.org/abs/2309.02772.
XTC sampler as described in https://github.com/oobabooga/text-generation-webui/pull/6335
This sampler is meant to be used for fill-in-the-middle infilling, after top_k + top_p sampling
1. if the sum of the EOG probs times the number of candidates is higher than the sum of the other probs -> pick EOG
2. combine probs of tokens that have the same prefix
example:
- before:
"abc": 0.5
"abcd": 0.2
"abcde": 0.1
"dummy": 0.1
- after:
"abc": 0.8
"dummy": 0.1
3. discard non-EOG tokens with low prob
4. if no tokens are left -> pick EOT
Create a sampler which makes tokens impossible unless they match the grammar
Root rule of the grammar
Create a sampler using lazy grammar sampling: https://github.com/ggerganov/llama.cpp/pull/9639
Grammar in GBNF form
Root rule of the grammar
A list of tokens that will trigger the grammar sampler.
A list of words that will trigger the grammar sampler.
Create a sampler that applies various repetition penalties.
Avoid using on the full vocabulary as searching for repeated tokens can become slow. For example, apply top-k or top-p sampling first.
How many tokens of history to consider when calculating penalties
Repetition penalty
Frequency penalty
Presence penalty
DRY sampler, designed by p-e-w, as described in: https://github.com/oobabooga/text-generation-webui/pull/5677.
Porting Koboldcpp implementation authored by pi6am: https://github.com/LostRuins/koboldcpp/pull/982
The model this sampler will be used with
penalty multiplier, 0.0 = disabled
exponential base
repeated sequences longer than this are penalized
how many tokens to scan for repetitions (0 = entire context)
Create a sampler that applies a bias directly to the logits
llama_sampler_chain_params
whether to measure performance timings
Get the default LLamaSamplerChainParams
A bias to apply directly to a logit
The token to apply the bias to
The bias to add
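A sketch of building a small chain from the stages described above: temperature, top-k, then a final distribution sampler. The Add* method names and the params factory are assumptions about the managed wrapper.
using var chain = SafeLLamaSamplerChainHandle.Create(LLamaSamplerChainParams.Default());
chain.AddTemperature(0.8f);
chain.AddTopK(40);
chain.AddDistributionSampler(seed: 1234);
// Sampling then follows the shorthand shown earlier: apply the chain to the
// latest token's logits and accept the selected token.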
llama_sampler_i
Get the name of this sampler
Update internal sampler state after a token has been chosen
Apply this sampler to a set of logits
Reset the internal state of this sampler
Create a clone of this sampler
Free all resources held by this sampler
llama_sampler
Holds the function pointers which make up the actual sampler
Any additional context this sampler needs; it may be anything. We will use it
to hold a GCHandle.
This GCHandle roots this object, preventing it from being freed.
A reference to the user code which implements the custom sampler
Get a pointer to a `llama_sampler` (LLamaSamplerNative) struct, suitable for passing to `llama_sampler_chain_add`
A custom sampler stage for modifying logits or selecting a token
The human readable name of this stage
Apply this stage to a set of logits.
This can modify logits or select a token (or both).
If the logits are modified, the Sorted flag must be set to false.
If the logits are no longer sorted after the custom sampler has run, it is critically important to
set Sorted = false. If unsure, always set it to false; this is a safe default.
Update the internal state of the sampler when a token is chosen
Reset the internal state of this sampler
Create a clone of this sampler
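A sketch of a custom stage that bans a single token, following the contract described above. The interface and member names (ICustomSampler, LLamaTokenDataArrayNative, Data, Sorted, ID, Logit) are assumed to match the types referenced elsewhere in this documentation.
class BanTokenSampler : ICustomSampler
{
    private readonly int _banned;
    public BanTokenSampler(int banned) => _banned = banned;

    public string Name => "ban-token";

    public void Apply(ref LLamaTokenDataArrayNative tokenData)
    {
        var candidates = tokenData.Data;
        for (var i = 0; i < candidates.Length; i++)
        {
            if ((int)candidates[i].ID == _banned)
                candidates[i].Logit = float.NegativeInfinity;
        }

        // Logits were modified, so the data can no longer be assumed sorted.
        tokenData.Sorted = false;
    }

    public void Accept(LLamaToken token) { }   // update internal state when a token is chosen
    public void Reset() { }                    // nothing to reset in this example
    public ICustomSampler Clone() => new BanTokenSampler(_banned);
    public void Dispose() { }
}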
A Reference to a llava Image Embed handle
Get the model used to create this image embedding
Get the number of dimensions in an embedding
Get the number of "patches" in an image embedding
Create an image embed from an image file
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create an image embed from an image file
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create an image embed from the bytes of an image.
Image bytes. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create an image embed from the bytes of an image.
Image bytes. Supported formats:
- JPG
- PNG
- BMP
- TGA
Copy the embeddings data to the destination span
A reference to a set of llava model weights.
Get the number of dimensions in an embedding
Get the number of "patches" in an image embedding
Load a model from the given file path into memory
MMP File (Multi-Modal Projections)
Verbosity level
SafeHandle of the Clip Model
Create the Image Embeddings.
LLama Context
Image filename (it supports jpeg format only)
return the SafeHandle of these embeddings
Create the Image Embeddings.
Image in binary format (it supports jpeg format only)
Number of threads to use
return the SafeHandle of these embeddings
Create the Image Embeddings.
LLama Context
Image in binary format (it supports jpeg format only)
return the SafeHandle of these embeddings
Create the Image Embeddings.
Image in binary format (it supports jpeg format only)
Number of threads to use
return the SafeHandle of these embeddings
Evaluates the image embeddings.
Llama Context
The current embeddings to evaluate
True on success
Load MULTI MODAL PROJECTIONS model / Clip Model
Model path/file
Verbosity level
SafeLlavaModelHandle
Frees MULTI MODAL PROJECTIONS model / Clip Model
Internal Pointer to the model
Create a new sampler wrapping a llama.cpp sampler chain
Create a sampling chain. This will be called once, the base class will automatically dispose the chain.
An implementation of ISamplingPipeline which mimics the default llama.cpp sampling
Bias values to add to certain logits
Repetition penalty, as described in https://arxiv.org/abs/1909.05858
Frequency penalty as described by OpenAI: https://platform.openai.com/docs/api-reference/chat/create
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text
so far, decreasing the model's likelihood to repeat the same line verbatim.
Presence penalty as described by OpenAI: https://platform.openai.com/docs/api-reference/chat/create
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the
text so far, increasing the model's likelihood to talk about new topics.
How many tokens should be considered for penalties
Whether the newline token should be protected from being modified by penalty
Whether the EOS token should be suppressed. Setting this to 'true' prevents EOS from being sampled
Temperature to apply (higher temperature is more "creative")
Number of tokens to keep in TopK sampling
P value for locally typical sampling
P value for TopP sampling
P value for MinP sampling
Grammar to apply to constrain possible tokens
The minimum number of tokens to keep for samplers which remove tokens
Seed to use for random sampling
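A sketch of configuring the pipeline with the knobs listed above (property names are assumed to match the current release):
var pipeline = new DefaultSamplingPipeline
{
    Temperature = 0.7f,      // higher is more "creative"
    TopK = 40,
    TopP = 0.9f,
    MinP = 0.05f,
    RepeatPenalty = 1.1f,
    FrequencyPenalty = 0.1f,
    PresencePenalty = 0.1f,
    Seed = 42,
};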
A grammar in GBNF form
A grammar in GBNF form
A sampling pipeline which always selects the most likely token
Grammar to apply to constrain possible tokens
Convert a span of logits into a single sampled token. This interface can be implemented to completely customise the sampling process.
Sample a single token from the given context at the given position
The context being sampled from
Position to sample logits from
Reset all internal state of the sampling pipeline
Update the pipeline, with knowledge that a particular token was just accepted
Extension methods for
Sample a single token from the given context at the given position
The context being sampled from
Position to sample logits from
Decodes a stream of tokens into a stream of characters
The number of decoded characters waiting to be read
If true, special characters will be converted to text. If false they will be invisible.
Create a new decoder
Text encoding to use
Model weights
Create a new decoder
Context to retrieve encoding and model weights from
Create a new decoder
Text encoding to use
Context to retrieve model weights from
Create a new decoder
Text encoding to use
Models weights to use
Add a single token to the decoder
Add a single token to the decoder
Add all tokens in the given enumerable
Add all tokens in the given span
Read all decoded characters and clear the buffer
Read all decoded characters as a string and clear the buffer
Set the decoder back to its initial state
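A sketch of streaming tokens through the decoder described above (constructor and method names assumed):
var decoder = new StreamingTokenDecoder(context);
foreach (var token in generatedTokens)
{
    decoder.Add(token);
    Console.Write(decoder.Read());   // read whatever has decoded so far and clear the buffer
}
decoder.Reset();                      // back to the initial state before the next generation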
A prompt formatter that will use llama.cpp's template formatter
If your model is not supported, you will need to define your own formatter according to the chat prompt specification for your model
A prompt formatter that will use llama.cpp's template formatter
If your model is not supported, you will need to define your own formatter according to the chat prompt specification for your model
Apply the template to the messages and return the resulting prompt as a string
The formatted template string as defined by the model