LLamaSharp
Reserved to be used by the compiler for tracking metadata.
This class should not be used by developers in source code.
This definition is provided by the IsExternalInit NuGet package (https://www.nuget.org/packages/IsExternalInit).
Please see https://github.com/manuelroemer/IsExternalInit for more information.
The parameters for initializing a LLama context from a model.
Model context size (n_ctx)
maximum batch size that can be submitted at once (must be >=32 to use BLAS) (n_batch)
Physical batch size
max number of sequences (i.e. distinct states for recurrent models)
If true, extract embeddings (together with logits).
RoPE base frequency (null to fetch from the model)
RoPE frequency scaling factor (null to fetch from the model)
The encoding to use for models
Number of threads (null = autodetect) (n_threads)
Number of threads to use for batch processing (null = autodetect) (n_threads)
YaRN extrapolation mix factor (null = from model)
YaRN magnitude scaling factor (null = from model)
YaRN low correction dim (null = from model)
YaRN high correction dim (null = from model)
YaRN original context length (null = from model)
YaRN scaling method to use.
Override the type of the K cache
Override the type of the V cache
Whether to disable offloading the KQV cache to the GPU
Whether to use flash attention
defragment the KV cache if holes/size > defrag_threshold; set to a value < 0 to disable (default)
defragment the KV cache if holes/size > defrag_threshold; set to a value < 0 to disable (default)
How to pool (sum) embedding results by sequence id (ignored if no pooling layer)
Attention type to use for embeddings
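As a rough illustration of how these context parameters are typically supplied, here is a minimal sketch using LLamaSharp's `ModelParams` class (which implements the context parameter interface). The property names shown (`ContextSize`, `BatchSize`, `Embeddings`, `Threads`) reflect common LLamaSharp versions and the file path is a placeholder; verify both against the version in use.

```csharp
using LLama.Common;

// Illustrative only: ModelParams carries the context settings described above.
var parameters = new ModelParams("path/to/model.gguf")   // placeholder path
{
    ContextSize = 4096,   // n_ctx (null = use the model's trained value)
    BatchSize = 512,      // n_batch, logical maximum batch size
    Embeddings = false,   // set true to extract embeddings together with logits
    Threads = 8,          // null = autodetect
};
```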
Transform history to plain text and vice versa.
Convert a ChatHistory instance to plain text.
The ChatHistory instance
Converts plain text to a ChatHistory instance.
The role for the author.
The chat history as plain text.
The updated history.
Copy the transform.
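To make the transform contract above concrete, a custom history transform might look like the following sketch. The member names (`HistoryToText`, `TextToHistory`, and `Clone` for the copy operation) are assumed from the interface description here; the formatting logic itself is purely an example.

```csharp
using System.Linq;
using LLama.Abstractions;
using LLama.Common;

// Example only: renders history as "Role: message" lines and parses text back
// into a single-message history.
public class SimpleHistoryTransform : IHistoryTransform
{
    public string HistoryToText(ChatHistory history)
        => string.Join("\n", history.Messages.Select(m => $"{m.AuthorRole}: {m.Content}")) + "\n";

    public ChatHistory TextToHistory(AuthorRole role, string text)
    {
        var history = new ChatHistory();
        history.AddMessage(role, text.Trim());
        return history;
    }

    // "Copy the transform" member, assumed to be named Clone.
    public IHistoryTransform Clone() => new SimpleHistoryTransform();
}
```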
The parameters used for inference.
number of tokens to keep from initial prompt
how many new tokens to predict (n_predict); set to -1 to generate indefinitely until generation completes.
Sequences where the model will stop generating further tokens.
Set a custom sampling pipeline to use.
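For context, a hedged sketch of how these inference parameters are typically populated. Property names such as `TokensKeep`, `MaxTokens`, `AntiPrompts` and `SamplingPipeline` reflect common LLamaSharp versions and may differ slightly in yours.

```csharp
using LLama.Common;
using LLama.Sampling;

var inferenceParams = new InferenceParams
{
    TokensKeep = 32,                    // tokens kept from the initial prompt
    MaxTokens = 256,                    // -1 would generate until completion
    AntiPrompts = new[] { "User:" },    // stop sequences
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.6f,             // custom sampling pipeline
    },
};
```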
A high level interface for LLama models.
The loaded context for this executor.
Identify if it's a multi-modal model and there is an image to process.
Multi-Modal Projections / Clip Model weights
List of images: List of images in byte array format.
Asynchronously infers a response from the model.
Your prompt
Any additional parameters
A cancellation token.
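A small usage sketch of the `InferAsync` call described above. The `executor` and `inferenceParams` variables are assumed to have been constructed already (see the executor and inference-parameter examples elsewhere in this document).

```csharp
using System;
using System.Threading;

// `executor` is any ILLamaExecutor (e.g. an InteractiveExecutor or StatelessExecutor).
// `inferenceParams` is an optional InferenceParams instance (may be null).
await foreach (var piece in executor.InferAsync("Question: What is a tree?\nAnswer:",
                                                inferenceParams,
                                                CancellationToken.None))
{
    Console.Write(piece);
}
```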
Convenience interface for implementing both types of parameters.
Mostly exists for backwards compatibility reasons, when these two were not split.
The parameters for initializing a LLama model.
main_gpu interpretation depends on split_mode:
- None: the GPU that is used for the entire model.
- Row: the GPU that is used for small tensors and intermediate results.
- Layer: ignored.
How to split the model across multiple GPUs
Number of layers to run in VRAM / GPU memory (n_gpu_layers)
Use mmap for faster loads (use_mmap)
Use mlock to keep model in memory (use_mlock)
Model path (model)
how split tensors should be distributed across GPUs
Load vocab only (no weights)
Validate model tensor data before loading
Override specific metadata items in the model
A fixed size array to set the tensor splits across multiple GPUs
The size of this array
Get or set the proportion of work to do on the given device.
"[ 3, 2 ]" will assign 60% of the data to GPU 0 and 40% to GPU 1.
Create a new tensor splits collection, copying the given values
Create a new tensor splits collection with all values initialised to the default
Set all values to zero
A JSON converter for
An override for a single key/value pair in model metadata
Get the key being overridden by this override
Create a new override for an int key
Create a new override for a float key
Create a new override for a boolean key
Create a new override for a string key
A JSON converter for
Descriptor of a native library.
Metadata of this library.
Prepare the native library file and return its local path.
If it's a relative path, LLamaSharp will search for it in the search directories you set.
The system information of the current machine.
The log callback.
The relative paths of the library. You could return multiple paths to try them one by one. If no file is available, please return an empty array.
Takes a stream of tokens and transforms them.
Takes a stream of tokens and transforms them, returning a new stream of tokens asynchronously.
Copy the transform.
An interface for text transformations.
These can be used to compose a pipeline of text transformations, such as:
- Tokenization
- Lowercasing
- Punctuation removal
- Trimming
- etc.
Takes a string and transforms it.
Copy the transform.
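A tiny sketch of a text transform in the spirit of the pipeline described above. The member names (`Transform` and `Clone` for the copy operation) are assumed from the interface description here.

```csharp
using LLama.Abstractions;

// Example transform: lowercases input text before it reaches the model.
public class LowercaseTransform : ITextTransform
{
    public string Transform(string text) => text.ToLowerInvariant();

    public ITextTransform Clone() => new LowercaseTransform();
}
```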
Extension methods to the interface.
Gets an instance for the specified .
The executor.
The to use to transform an input list messages into a prompt.
The to use to transform the output into text.
An instance for the provided .
is null.
Format the chat messages into a string prompt.
Convert the chat options to inference parameters.
A default transform that appends "Assistant: " to the end.
AntipromptProcessor keeps track of past tokens looking for any set Anti-Prompts
Initializes a new instance of the class.
The antiprompts.
Add an antiprompt to the collection
Overwrite all current antiprompts with a new set
Add some text and check if the buffer now ends with any antiprompt
true if the text buffer ends with any antiprompt
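A short usage sketch of the antiprompt processor described above; the constructor and `Add` semantics follow the documentation here and should be treated as illustrative.

```csharp
using LLama;

var antiprompts = new AntipromptProcessor(new[] { "User:" });

// Feed decoded text fragments as they are generated; stop when an antiprompt appears.
var stop = antiprompts.Add("Assistant: Hello!\nUser:");
// stop == true here, because the accumulated text now ends with "User:"
```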
A batched executor that can infer multiple separate "conversations" simultaneously.
Set to 1 using interlocked exchange while inference is running
Epoch is incremented twice every time Infer is called. Conversations can use this to keep track of
whether they're waiting for inference, or can be sampled.
The this executor is using
The this executor is using
Get the number of tokens in the batch, waiting for Infer() to be called
Number of batches in the queue, waiting for Infer() to be called
Check if this executor has been disposed.
Create a new batched executor
The model to use
Parameters to create a new context
Start a new Conversation
Load a conversation that was previously saved to a file. Once loaded the conversation will
need to be prompted.
Load a conversation that was previously saved into memory. Once loaded the conversation will need to be prompted.
Run inference for all conversations in the batch which have pending tokens.
If the result is `NoKvSlot` then there is not enough memory for inference, try disposing some conversation
threads and running inference again.
Get a reference to a batch that tokens can be added to.
Get a reference to a batch that embeddings can be added to.
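To make the workflow concrete, here is a hedged sketch of a single conversation round-trip with the batched executor. It assumes the `Create`/`Prompt`/`Infer` members documented here, the `Sample` extension method documented later in this section, and a `DefaultSamplingPipeline`; exact signatures vary between LLamaSharp versions, and the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Batched;
using LLama.Common;
using LLama.Sampling;

var parameters = new ModelParams("path/to/model.gguf");   // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(weights, parameters);

// Start a conversation and queue up a prompt.
using var conversation = executor.Create();
conversation.Prompt(executor.Context.Tokenize("The quick brown fox"));

var sampler = new DefaultSamplingPipeline();
var decoder = new StreamingTokenDecoder(executor.Context);

for (var i = 0; i < 16; i++)
{
    // Run inference for every conversation with pending tokens.
    await executor.Infer();

    // Sample the next token for this conversation and feed it back in.
    var token = conversation.Sample(sampler);
    decoder.Add(token);
    conversation.Prompt(token);
}

Console.WriteLine(decoder.Read());
```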
A single conversation thread that can be prompted (adding tokens from the user) or inferred (extracting a token from the LLM)
Indicates if this conversation has been "forked" and may share logits with another conversation.
Stores the indices to sample from. Contains valid items.
The executor which this conversation belongs to
Unique ID for this conversation
Total number of tokens in this conversation, cannot exceed the context length.
Indicates if this conversation has been disposed, nothing can be done with a disposed conversation
Indicates if this conversation is waiting for inference to be run on the executor. "Prompt" and "Sample" cannot be called when this is true.
Indicates that this conversation should be sampled.
Finalizer for Conversation
End this conversation, freeing all resources used by it
Create a copy of the current conversation
The copy shares internal state, so consumes very little extra memory.
Get the index in the context at which each token can be sampled from; the return value of this function can be used to retrieve logits or to sample a token.
How far from the end of the previous prompt should logits be sampled. Any value other than 0 requires
allLogits to have been set during prompting.
For example if 5 tokens were supplied in the last prompt call:
- The logits of the first token can be accessed with 4
- The logits of the second token can be accessed with 3
- The logits of the third token can be accessed with 2
- The logits of the fourth token can be accessed with 1
- The logits of the fifth token can be accessed with 0
Thrown if this conversation was not prompted before the previous call to infer
Thrown if Infer() must be called on the executor
Get the logits from this conversation, ready for sampling
How far from the end of the previous prompt should logits be sampled. Any value other than 0 requires allLogits to have been set during prompting
Thrown if this conversation was not prompted before the previous call to infer
Thrown if Infer() must be called on the executor
Add tokens to this conversation
If true, generate logits for all tokens. If false, only generate logits for the last token.
Add tokens to this conversation
If true, generate logits for all tokens. If false, only generate logits for the last token.
Add a single token to this conversation
Prompt this conversation with an image embedding
Prompt this conversation with embeddings
The raw values of the embeddings. This span must divide equally by the embedding size of this model.
Directly modify the KV cache of this conversation
Thrown if this method is called while == true
Provides direct access to the KV cache of a .
See for how to use this.
Removes all tokens that have positions in [start, end)
Start position (inclusive)
End position (exclusive)
Removes all tokens starting from the given position
Start position (inclusive)
Number of tokens
Adds relative position "delta" to all tokens that have positions in [p0, p1).
If the KV cache is RoPEd, the KV data is updated
accordingly
Start position (inclusive)
End position (exclusive)
Amount to add on to each token position
Integer division of the positions by factor of `d > 1`.
If the KV cache is RoPEd, the KV data is updated accordingly.
Start position (inclusive). If less than zero, it is clamped to zero.
End position (exclusive). If less than zero, it is treated as "infinity".
Amount to divide each position by.
A function which can temporarily access the KV cache of a to modify it directly
The current end token of this conversation
An which allows direct access to modify the KV cache
The new end token position
Save the complete state of this conversation to a file. If the file already exists it will be overwritten.
Save the complete state of this conversation in system memory.
Load state from a file
This should only ever be called by the BatchedExecutor, on a newly created conversation object!
Load state from a previously saved state.
This should only ever be called by the BatchedExecutor, on a newly created conversation object!
In memory saved state of a
Indicates if this state has been disposed
Get the size in bytes of this state object
Internal constructor to prevent anyone outside of LLamaSharp from extending this class
Extension method for
Sample a token from this conversation using the given sampler chain
to sample from
Offset from the end of the conversation to the logits to sample, see for more details
Sample a token from this conversation using the given sampling pipeline
to sample from
Offset from the end of the conversation to the logits to sample, see for more details
Rewind a back to an earlier state by removing tokens from the end
The conversation to rewind
The number of tokens to rewind
Thrown if `tokens` parameter is larger than TokenCount
Shift all tokens over to the left, removing "count" tokens from the start and shifting everything over.
Leaves "keep" tokens at the start completely untouched. This can be used to free up space when the context
gets full, keeping the prompt at the start intact.
The conversation to rewind
How much to shift tokens over by
The number of tokens at the start which should not be shifted
Base class for exceptions thrown from
This exception is thrown when "Prompt()" is called on a which has
already been prompted and before "Infer()" has been called on the associated
.
This exception is thrown when "Sample()" is called on a which has
already been prompted and before "Infer()" has been called on the associated
.
This exception is thrown when "Sample()" is called on a which was not
first prompted.
.
This exception is thrown when is called when = true
This exception is thrown when "Save()" is called on a which has
already been prompted and before "Infer()" has been called.
.
Save the state of a particular sequence to specified path. Also save some extra data which will be returned when loading.
Data saved with this method must be saved with
Load the state from the specified path into a particular sequence. Also reading header data. Must only be used with
data previously saved with
The main chat session class.
The filename for the serialized model state (KV cache, etc).
The filename for the serialized executor state.
The filename for the serialized chat history.
The filename for the serialized input transform pipeline.
The filename for the serialized output transform.
The filename for the serialized history transform.
The executor for this session.
The chat history for this session.
The history transform used in this session.
The input transform pipeline used in this session.
The output transform used in this session.
Create a new chat session and preprocess history.
The executor for this session
History for this session
History Transform for this session
A new chat session.
Create a new chat session.
The executor for this session
Create a new chat session with a custom history.
Use a custom history transform.
Add a text transform to the input transform pipeline.
Use a custom output transform.
Save a session from a directory.
Get the session state.
SessionState object representing session state in-memory
Load a session from a session state.
If true loads transforms saved in the session state.
Load a session from a directory.
If true loads transforms saved in the session state.
Add a message to the chat history.
Add a system message to the chat history.
Add an assistant message to the chat history.
Add a user message to the chat history.
Remove the last message from the chat history.
Compute KV cache for the message and add it to the chat history.
Compute KV cache for the system message and add it to the chat history.
Compute KV cache for the user message and add it to the chat history.
Compute KV cache for the assistant message and add it to the chat history.
Replace a user message with a new message and remove all messages after the new message.
This is useful when the user wants to edit a message and regenerate the response.
Chat with the model.
Chat with the model.
Chat with the model.
Chat with the model.
Regenerate the last assistant message.
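A hedged end-to-end sketch of a chat session built on an interactive executor, using the classes documented in this section; constructor overloads and property names may vary by version, and the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf") { ContextSize = 4096 };   // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a helpful assistant.");

var session = new ChatSession(executor, history);

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    AntiPrompts = new[] { "User:" },
};

await foreach (var text in session.ChatAsync(
                   new ChatHistory.Message(AuthorRole.User, "Hello, who are you?"),
                   inferenceParams))
{
    Console.Write(text);
}
```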
The state of a chat session in-memory.
Saved executor state for the session in JSON format.
Saved context state (KV cache) for the session.
The input transform pipeline used in this session.
The output transform used in this session.
The history transform used in this session.
The chat history messages for this session.
Create a new session state.
Save the session state to folder.
Load the session state from folder.
Thrown when the session state is incorrect
Role of the message author, e.g. user/assistant/system
Role is unknown
Message comes from a "system" prompt, not written by a user or language model
Message comes from the user
Message was generated by the language model
The chat history class
Chat message representation
Role of the message author, e.g. user/assistant/system
Message content
Create a new instance
Role of message author
Message content
List of messages in the chat
Create a new instance of the chat content class
Create a new instance of the chat history from array of messages
Add a message to the chat history
Role of the message author
Message content
Serialize the chat history to JSON
Deserialize a chat history from JSON
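A brief sketch of building and serializing a chat history with the members documented above:

```csharp
using LLama.Common;

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise assistant.");
history.AddMessage(AuthorRole.User, "Summarise what a KV cache is.");

// Round-trip through JSON.
string json = history.ToJson();
var restored = ChatHistory.FromJson(json);
```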
A queue with fixed storage size.
Currently it's only a naive implementation and needs to be further optimized in the future.
Number of items in this queue
Maximum number of items allowed in this queue
Create a new queue
the maximum number of items to store in this queue
Fill the queue with the data. Please ensure that data.Count <= size
Enqueue an element.
The parameters used for inference.
number of tokens to keep from initial prompt when applying context shifting
how many new tokens to predict (n_predict); set to -1 to generate indefinitely until generation completes.
Sequences where the model will stop generating further tokens.
Type of "mirostat" sampling to use.
https://github.com/basusourya/mirostat
Disable Mirostat sampling
Original mirostat algorithm
Mirostat 2.0 algorithm
The parameters for initializing a LLama model.
`Encoding` cannot be directly JSON serialized; instead the encoding name is stored as a string, which can be serialized.
The model path.
Base class for LLamaSharp runtime errors (i.e. errors produced by llama.cpp, converted into exceptions)
Create a new RuntimeError
Loading model weights failed
The model path which failed to load
`llama_decode` returned a non-zero status code
The return status code
`llama_decode` returned a non-zero status code
`llama_get_logits_ith` returned null, indicating that the index was invalid
The incorrect index passed to the `llama_get_logits_ith` call
Extension methods to the IContextParams interface
Convert the given `IModelParams` into a `LLamaContextParams`
Extension methods to the IModelParams interface
Convert the given `IModelParams` into a `LLamaModelParams`
Find the index of `item` in `list`
list to search
item to search for
Check if the given set of tokens ends with any of the given strings
Tokens to check
Strings to search for
Model to use to convert tokens into bytes
Encoding to use to convert bytes into characters
Check if the given set of tokens ends with any of the given strings
Tokens to check
Strings to search for
Model to use to convert tokens into bytes
Encoding to use to convert bytes into characters
Extensions to the KeyValuePair struct
Deconstruct a KeyValuePair into its constituent parts.
The KeyValuePair to deconstruct
First element, the Key
Second element, the Value
Type of the Key
Type of the Value
Run a process for a certain amount of time and then terminate it
return code, standard output, standard error, flag indicating if process exited or was terminated
Extensions to span which apply in-place normalization
In-place multiply every element by 32760 and divide every element in the span by the max absolute value in the span
The same array
In-place multiply every element by 32760 and divide every element in the span by the max absolute value in the span
The same span
In-place divide every element in the array by the sum of absolute values in the array
Also known as "Manhattan normalization".
The same array
In-place divide every element in the span by the sum of absolute values in the span
Also known as "Manhattan normalization".
The same span
In-place divide every element by the euclidean length of the vector
Also known as "L2 normalization".
The same array
In-place divide every element by the euclidean length of the vector
Also known as "L2 normalization".
The same span
Creates a new array containing an L2 normalization of the input vector.
The new normalized array
In-place apply p-normalization. https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm
- For p = 1, this is taxicab normalization
- For p = 2, this is euclidean normalization
- As p => infinity, this approaches infinity norm or maximum norm
The same array
In-place apply p-normalization. https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm
- For p = 1, this is taxicab normalization
- For p = 2, this is euclidean normalization
- As p => infinity, this approaches infinity norm or maximum norm
The same span
A llama_context, which holds all the context required to interact with a model
Total number of tokens in the context
Dimension of embedding vectors
The context params set for this context
The native handle, which is used to be passed to the native APIs
Be careful how you use this!
The encoding set for this model to deal with text input.
Get or set the number of threads to use for generation
Get or set the number of threads to use for batch processing
Get the maximum batch size for this context
Get the special tokens for the model associated with this context
Create a new LLamaContext for the given LLamaWeights
Tokenize a string.
Whether to add a bos to the text.
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
Detokenize the tokens to text.
Save the state to specified path.
Save the state of a particular sequence to specified path.
Get the state data as an opaque handle, which can be loaded later using
Use if you intend to save this state to disk.
Get the state data as an opaque handle, which can be loaded later using
Use if you intend to save this state to disk.
Load the state from specified path.
Load the state from specified path into a particular sequence
Load the state from memory.
Load the state from memory into a particular sequence
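A small sketch of the state save/load members described above, assuming a `LLamaContext` named `context` already exists (see the weights/context examples elsewhere in this document) and that the file name is a placeholder.

```csharp
// Persist the full context state (KV cache etc.) to disk...
context.SaveState("state.bin");

// ...and restore it later into a context created from the same model.
context.LoadState("state.bin");

// Alternatively, keep the state in memory as an opaque handle.
using var state = context.GetState();
context.LoadState(state);
```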
A tuple, containing the decode result, the number of tokens that have not been decoded yet and the total number of tokens that have been decoded.
The state of this context, which can be reloaded later
Get the size in bytes of this state object
Write all the bytes of this state to the given stream
Write all the bytes of this state to the given stream
Load a state from a stream
Load a state from a stream
The state of a single sequence, which can be reloaded later
Get the size in bytes of this state object
Copy bytes to a destination pointer.
Destination to write to
Length of the destination buffer
Offset from start of src to start copying from
Number of bytes written to destination
Generate high dimensional embedding vectors from text
Dimension of embedding vectors
LLama Context
Create a new embedder, using the given LLamaWeights
Get high dimensional embedding vectors for the given text. Depending on the pooling type used when constructing this embedder, this may return an embedding vector per token, or one single embedding vector for the entire string.
Embedding vectors are not normalized; consider using one of the normalization extensions.
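A hedged sketch of generating embeddings with the embedder described above. The return type of `GetEmbeddings` differs between LLamaSharp versions (a single vector vs. a list of vectors); the version assumed here returns a list, and the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/embedding-model.gguf")   // placeholder path
{
    Embeddings = true, // the context must be created in embeddings mode
};
using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);

// Depending on the pooling type this is one vector per token,
// or a single vector for the whole text.
var embeddings = await embedder.GetEmbeddings("The quick brown fox");
Console.WriteLine($"Vectors: {embeddings.Count}, dimensions: {embeddings[0].Length}");
```

The resulting vectors can then be normalized with the span/array normalization extensions documented earlier.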
The base class for stateful LLama executors.
The logger used by this executor.
The tokens that were already processed by the model.
The tokens that were consumed by the model during the current inference.
The path of the session file.
A container for the tokens to be processed and those already processed.
A container for the tokens of input.
The last tokens generated by the model.
The context used by the executor.
This API is currently not verified.
This API has not yet been verified.
After running out of the context, take some tokens from the original prompt and recompute the logits in batches.
Try to reuse the matching prefix from the session file.
Decide whether to continue the loop.
Preprocess the inputs before the inference.
Do some post processing after the inference.
The core inference logic.
Save the current state to a file.
Get the current state data.
Load the state from data.
Load the state from a file.
Execute the inference.
The prompt. If null, generation will continue where it left off previously.
Asynchronously runs a prompt through the model to compute KV cache without generating any new tokens.
It can reduce the latency of the first response if the first input from the user is not immediate.
Prompt to process
State arguments that are used in single inference
Number of tokens remaining to be used (n_remain)
The LLama executor for instruct mode.
The descriptor of the state of the instruct executor.
Whether the executor is running for the first time (running the prompt).
Instruction prefix tokens.
Instruction suffix tokens.
The LLama executor for interactive mode.
Define whether to continue the loop to generate responses.
Return whether to break the generation.
The descriptor of the state of the interactive executor.
Whether the executor is running for the first time (running the prompt).
The quantizer to quantize the model.
Quantize the model.
The model file to be quantized.
The path to save the quantized model.
The type of quantization.
Number of threads to use during quantization. By default this is the number of physical cores.
Whether the quantization is successful.
Quantize the model.
The model file to be quantized.
The path to save the quantized model.
The type of quantization.
Number of threads to use during quantization. By default this is the number of physical cores.
Whether the quantization is successful.
Parse a string into a LLamaFtype. This is a "relaxed" parsing, which allows any string which is contained within
the enum name to be used.
For example "Q5_K_M" will convert to "LLAMA_FTYPE_MOSTLY_Q5_K_M"
This executor infers the input as a one-time job. Previous inputs won't impact the response to the current input.
The context used by the executor when running the inference.
If true, applies the default template to the prompt, as defined in the rules for llama_chat_apply_template.
The system message to use with the prompt. Only used when is true.
Create a new stateless executor which will use the given model
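A short usage sketch for the stateless executor (each call is an independent one-shot job, as described above); the model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf");   // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);

var executor = new StatelessExecutor(weights, parameters);

await foreach (var piece in executor.InferAsync(
                   "Q: What is the capital of France?\nA:",
                   new InferenceParams { MaxTokens = 32, AntiPrompts = new[] { "Q:" } }))
{
    Console.Write(piece);
}
```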
Converts a sequence of messages into text according to a model template
Custom template. May be null if a model was supplied to the constructor.
Keep a cache of roles converted into bytes. Roles are very frequently re-used, so this saves converting them many times.
Array of messages. The property indicates how many messages there are
Backing field for
Temporary array of messages in the format llama.cpp needs, used when applying the template
Indicates how many bytes are in array
Result bytes of last call to
Indicates if this template has been modified and needs regenerating
The encoding algorithm to use
Number of messages added to this template
Get the message at the given index
Thrown if index is less than zero or greater than or equal to
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
Construct a new template, using the default model template
Construct a new template, using the default model template
Construct a new template, using a custom template.
Only support a pre-defined list of templates. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Add a new message to the end of this template
This template, for chaining calls.
Add a new message to the end of this template
This template, for chaining calls.
Remove a message at the given index
This template, for chaining calls.
Remove all messages from the template and resets internal state to accept/generate new messages
Apply the template to the messages and return a span containing the results
A span over the buffer that holds the applied template
A message that has been added to a template
The "role" string for this message
The text content of this message
Deconstruct this message into role and content
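A hedged sketch of applying a model's chat template with the members documented above (`Add`, the assistant-start flag, and `Apply`). The property name for the assistant flag (`AddAssistant`) and the span-returning `Apply()` overload are assumptions to verify against your version; `weights` is a previously loaded `LLamaWeights` instance.

```csharp
using System.Text;
using LLama;

// Build a prompt from role/content pairs using the model's built-in template.
var template = new LLamaTemplate(weights)
{
    AddAssistant = true, // end with the token(s) that start an assistant message
};

template.Add("system", "You are a helpful assistant.");
template.Add("user", "Write a haiku about autumn.");

// Apply the template and decode the resulting bytes.
var prompt = Encoding.UTF8.GetString(template.Apply());
```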
A class that contains all the transforms provided internally by LLama.
The default history transform.
Uses plain text with the following format:
[Author]: [Message]
Drop the name at the beginning and the end of the text.
A text input transform that only trims the text.
A no-op text input transform.
A text output transform that removes the keywords from the response.
Keywords that you want to remove from the response.
This property is used for JSON serialization.
Maximum length of the keywords.
This property is used for JSON serialization.
If set to true, when getting a matched keyword, all the related tokens will be removed.
Otherwise only the part of keyword will be removed.
This property is used for JSON serialization.
JSON constructor.
Keywords that you want to remove from the response.
The extra length when searching for the keyword. For example, if your only keyword is "highlight",
maybe the token you get is "\r\nhighligt". In this condition, if redundancyLength=0, the token cannot be successfully matched because the length of "\r\nhighligt" (10)
has already exceeded the maximum length of the keywords (8). On the contrary, setting redundancyLength >= 2 leads to a successful match.
The larger the redundancyLength is, the lower the processing speed. But in practice it won't introduce much performance impact when redundancyLength <= 5
If set to true, when getting a matched keyword, all the related tokens will be removed. Otherwise only the part of keyword will be removed.
A set of model weights, loaded into memory.
The native handle, which is used in the native APIs
Be careful how you use this!
Total number of tokens in the context
Get the size of this model in bytes
Get the number of parameters in this model
Dimension of embedding vectors
Get the special tokens of this model
All metadata keys in this model
Load weights into memory
Load weights into memory
Parameters to use to load the model
A cancellation token that can interrupt model loading
Receives progress updates as the model loads (0 to 1)
Thrown if weights failed to load for any reason. e.g. Invalid file format or loading cancelled.
Thrown if the cancellation token is cancelled.
Create a llama_context using this model
Convert a string of text into tokens
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
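A sketch of the two loading paths described above, including async loading with progress reporting (0 to 1); signature details may vary by version and the model path is a placeholder.

```csharp
using System;
using System.Threading;
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf") { GpuLayerCount = 20 };   // placeholder path

// Synchronous load.
using var weights = LLamaWeights.LoadFromFile(parameters);

// Asynchronous load with cancellation and progress reporting.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var progress = new Progress<float>(p => Console.WriteLine($"Loading: {p:P0}"));
using var weights2 = await LLamaWeights.LoadFromFileAsync(parameters, cts.Token, progress);
```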
A set of llava model weights (mmproj), loaded into memory.
The native handle, which is used in the native APIs
Be careful how you use this!
Load weights into memory
path to the "mmproj" model file
Load weights into memory
path to the "mmproj" model file
Create the Image Embeddings from the bytes of an image.
Image bytes. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create the Image Embeddings.
Image in binary format (it supports jpeg format only)
Number of threads to use
return the SafeHandle of these embeddings
Create the Image Embeddings from the bytes of an image.
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create the Image Embeddings from the bytes of an image.
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Eval the image embeddings
Return codes from llama_decode
An unspecified error
Ok.
Could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
Return codes from llama_encode
An unspecified error
Ok.
Possible GGML quantisation types
Full 32 bit float
16 bit float
4 bit float
4 bit float
5 bit float
5 bit float
8 bit float
8 bit float
"type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weight.
Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
"type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights.
Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
"type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights.
Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
"type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
"type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights.
Scales are quantized with 8 bits. This ends up using 6.5625 bpw
"type-0" 8-bit quantization. Only used for quantizing intermediate results.
The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
Integer, 8 bit
Integer, 16 bit
Integer, 32 bit
The value of this entry is the count of the number of possible quant types.
llama_split_mode
Single GPU
Split layers and KV across GPUs
split layers and KV across GPUs, use tensor parallelism if supported
Disposes all contained disposables when this class is disposed
llama_attention_type
A batch allows submitting multiple tokens to multiple sequences simultaneously
Keep a list of where logits can be sampled from
Get the number of logit positions that will be generated from this batch
The number of tokens in this batch
Maximum number of tokens that can be added to this batch (automatically grows if exceeded)
Maximum number of sequences a token can be assigned to (automatically grows if exceeded)
Create a new batch for submitting inputs to llama.cpp
Add a single token to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The token to add
The position to add it at
The set of sequences to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a single token to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The token to add
The position to add it at
The set of sequences to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a single token to the batch at a certain position for a single sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The token to add
The position to add it at
The sequence to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a range of tokens to a single sequence, start at the given position.
The tokens to add
The starting position to add tokens at
The sequence to add this token to
Whether the final token should generate logits
The index that the final token was added at. Use this for GetLogitsIth
Set TokenCount to zero for this batch
Get the positions where logits can be sampled from
An embeddings batch allows submitting embeddings to multiple sequences simultaneously
Keep a list of where logits can be sampled from
Get the number of logit positions that will be generated from this batch
Size of an individual embedding
The number of items in this batch
Maximum number of items that can be added to this batch (automatically grows if exceeded)
Maximum number of sequences an item can be assigned to (automatically grows if exceeded)
Create a new batch for submitting inputs to llama.cpp
Add a single embedding to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
The embedding to add
The position to add it at
The set of sequences to add this token to
The index that the token was added at. Use this for GetLogitsIth
Add a single embedding to the batch for a single sequence
The index that the token was added at. Use this for GetLogitsIth
Called by embeddings batch to write embeddings into a destination span
Type of user data parameter passed in
Destination to write data to. Entire destination must be filled!
User data parameter passed in
Add a single embedding to the batch at the same position in several sequences
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
Type of userdata passed to write delegate
Userdata passed to write delegate
Delegate called once to write data into a span
Position to write this embedding to
All sequences to assign this embedding to
Whether logits should be generated for this embedding
The index that the token was added at. Use this for GetLogitsIth
Add a single embedding to the batch at a position for one sequence
https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2
Type of userdata passed to write delegate
Userdata passed to write delegate
Delegate called once to write data into a span
Position to write this embedding to
Sequence to assign this embedding to
Whether logits should be generated for this embedding
The index that the token was added at. Use this for GetLogitsIth
Set EmbeddingsCount to zero for this batch
Get the positions where logits can be sampled from
llama_chat_message
Pointer to the null terminated bytes that make up the role string
Pointer to the null terminated bytes that make up the content string
Called by llama.cpp with a progress value between 0 and 1
If the provided progress_callback returns true, model loading continues.
If it returns false, model loading is immediately aborted.
llama_progress_callback
A C# representation of the llama.cpp `llama_context_params` struct
changing the default values of parameters marked as [EXPERIMENTAL] may cause crashes or incorrect results in certain configurations
https://github.com/ggerganov/llama.cpp/pull/7544
text context, 0 = from model
logical maximum batch size that can be submitted to llama_decode
physical maximum batch size
max number of sequences (i.e. distinct states for recurrent models)
number of threads to use for generation
number of threads to use for batch processing
RoPE scaling type, from `enum llama_rope_scaling_type`
whether to pool (sum) embedding results by sequence id
Attention type to use for embeddings
RoPE base frequency, 0 = from model
RoPE frequency scaling factor, 0 = from model
YaRN extrapolation mix factor, negative = from model
YaRN magnitude scaling factor
YaRN low correction dim
YaRN high correction dim
YaRN original context size
defragment the KV cache if holes/size > defrag_threshold, Set to < 0 to disable (default)
ggml_backend_sched_eval_callback
User data passed into cb_eval
data type for K cache. EXPERIMENTAL
data type for V cache. EXPERIMENTAL
Deprecated!
if true, extract embeddings (together with logits)
whether to offload the KQV ops (including the KV cache) to GPU
whether to use flash attention. EXPERIMENTAL
whether to measure performance timings
ggml_abort_callback
User data passed into abort_callback
Get the default LLamaContextParams
Supported model file types
C# representation of llama_ftype
All f32
Benchmark@7B: 26GB
Mostly f16
Benchmark@7B: 13GB
Mostly 8 bit
Benchmark@7B: 6.7GB, +0.0004ppl
Mostly 4 bit
Benchmark@7B: 3.50GB, +0.2499 ppl
Mostly 4 bit
Benchmark@7B: 3.90GB, +0.1846 ppl
Mostly 5 bit
Benchmark@7B: 4.30GB @ 7B tokens, +0.0796 ppl
Mostly 5 bit
Benchmark@7B: 4.70GB, +0.0415 ppl
K-Quant 2 bit
Benchmark@7B: 2.67GB @ 7B parameters, +0.8698 ppl
K-Quant 3 bit (Small)
Benchmark@7B: 2.75GB, +0.5505 ppl
K-Quant 3 bit (Medium)
Benchmark@7B: 3.06GB, +0.2437 ppl
K-Quant 3 bit (Large)
Benchmark@7B: 3.35GB, +0.1803 ppl
K-Quant 4 bit (Small)
Benchmark@7B: 3.56GB, +0.1149 ppl
K-Quant 4 bit (Medium)
Benchmark@7B: 3.80GB, +0.0535 ppl
K-Quant 5 bit (Small)
Benchmark@7B: 4.33GB, +0.0353 ppl
K-Quant 5 bit (Medium)
Benchmark@7B: 4.45GB, +0.0142 ppl
K-Quant 6 bit
Benchmark@7B: 5.15GB, +0.0044 ppl
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
except 1d tensors
File type was not specified
A safe handle for a LLamaKvCacheView
Number of KV cache cells. This will be the same as the context size.
Get the total number of tokens in the KV cache.
For example, if there are two populated
cells, the first with 1 sequence id in it and the second with 2 sequence
ids then you'll have 3 tokens.
Maximum number of sequences visible for a cell. There may be more sequences than this
in reality, this is simply the maximum number this view can see.
Number of populated cache cells
Maximum contiguous empty slots in the cache.
Index to the start of the MaxContiguous slot range. Can be negative when cache is full.
Initialize a LLamaKvCacheViewSafeHandle which will call `llama_kv_cache_view_free` when disposed
Allocate a new KV cache view which can be used to inspect the KV cache
The maximum number of sequences visible in this view per cell
Read the current KV cache state into this view.
Get the raw KV cache view
Get the cell at the given index
The index of the cell [0, CellCount)
Data about the cell at the given index
Thrown if index is out of range (0 <= index < CellCount)
Get all of the sequences assigned to the cell at the given index. The returned span always contains the maximum number of
sequences per cell for this view, even if the cell actually has more sequences than that; allocate a new view with a larger maxSequences parameter
if necessary. Invalid sequences will be negative values.
The index of the cell [0, CellCount)
A span containing the sequences assigned to this cell
Thrown if index is out of range (0 <= index < CellCount)
Create an empty KV cache view. (use only for debugging purposes)
Free a KV cache view. (use only for debugging purposes)
Update the KV cache view structure with the current state of the KV cache. (use only for debugging purposes)
Information associated with an individual cell in the KV cache view (llama_kv_cache_view_cell)
The position for this cell. Takes KV cache shifts into account.
May be negative if the cell is not populated.
An updateable view of the KV cache (llama_kv_cache_view)
Number of KV cache cells. This will be the same as the context size.
Maximum number of sequences that can exist in a cell. It's not an error
if there are more sequences in a cell than this value, however they will
not be visible in the view cells_sequences.
Number of tokens in the cache. For example, if there are two populated
cells, the first with 1 sequence id in it and the second with 2 sequence
ids then you'll have 3 tokens.
Number of populated cache cells.
Maximum contiguous empty slots in the cache.
Index to the start of the max_contiguous slot range. Can be negative
when cache is full.
Information for an individual cell.
The sequences for each cell. There will be n_seq_max items per cell.
Severity level of a log message. This enum should always be aligned with
the one defined on llama.cpp side at
https://github.com/ggerganov/llama.cpp/blob/0eb4e12beebabae46d37b78742f4c5d4dbe52dc1/ggml/include/ggml.h#L559
Logs are never written.
Logs that are used for interactive investigation during development.
Logs that track the general flow of the application.
Logs that highlight an abnormal or unexpected event in the application flow, but do not otherwise cause the application execution to stop.
Logs that highlight when the current flow of execution is stopped due to a failure.
Continue log level is equivalent to None in the way it is used in llama.cpp.
Keeps track of the previous log level to be able to handle the Continue log level.
Override a key/value pair in the llama model metadata (llama_model_kv_override)
Key to override
Type of value
Add 4 bytes of padding, to align the next fields to 8 bytes
Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_INT
Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_FLOAT
Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_BOOL
Value, **must** only be used if Tag == String
Specifies what type of value is being overridden by LLamaModelKvOverride
llama_model_kv_override_type
Overriding an int value
Overriding a float value
Overriding a bool value
Overriding a string value
A C# representation of the llama.cpp `llama_model_params` struct
NULL-terminated list of devices to use for offloading (if NULL, all available devices are used)
todo: add support for llama_model_params.devices
number of layers to store in VRAM
how to split the model across multiple GPUs
the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE
how to split layers across multiple GPUs (size: )
called with a progress value between 0 and 1, pass NULL to disable. If the provided progress_callback
returns true, model loading continues. If it returns false, model loading is immediately aborted.
context pointer passed to the progress callback
override key-value pairs of the model meta data
only load the vocabulary, no weights
use mmap if possible
force system to keep model in RAM
validate model tensor data
Create a LLamaModelParams with default values
Quantizer parameters used in the native API
llama_model_quantize_params
number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency()
quantize to this llama_ftype
output tensor type
token embeddings tensor type
allow quantizing non-f32/f16 tensors
quantize output.weight
only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
quantize all tensors to the default type
quantize to the same number of shards
pointer to importance matrix data
pointer to vector containing overrides
Create a LLamaModelQuantizeParams with default values
Input data for llama_decode
A llama_batch object can contain input about one or many sequences
The provided arrays (i.e. token, embd, pos, etc.) must have size of n_tokens
The number of items pointed at by pos, seq_id and logits.
Either `n_tokens` of `llama_token`, or `NULL`, depending on how this batch was created
Either `n_tokens * embd * sizeof(float)` or `NULL`, depending on how this batch was created
the positions of the respective token in the sequence
(if set to NULL, the token position will be tracked automatically by llama_decode)
https://github.com/ggerganov/llama.cpp/blob/master/llama.h#L139 ???
the sequence to which the respective token belongs
(if set to NULL, the sequence ID will be assumed to be 0)
if zero, the logits for the respective token will not be output
(if set to NULL, only the logits for last token will be returned)
llama_pooling_type
No specific pooling type. Use the model default if this is specified in the context params.
Do not pool embeddings (per-token embeddings)
Take the mean of every token embedding
Return the embedding for the special "CLS" token
Used by reranking models to attach the classification head to the graph
Indicates position in a sequence
The raw value
Create a new LLamaPos
Convert a LLamaPos into an integer (extract the raw value)
Convert an integer into a LLamaPos
Increment this position
Increment this position
llama_rope_type
ID for a sequence in a batch
LLamaSeqId with value 0
The raw value
Create a new LLamaSeqId
Convert a LLamaSeqId into an integer (extract the raw value)
Convert an integer into a LLamaSeqId
LLama performance information
llama_perf_context_data
Timestamp when reset was last called
Loading milliseconds
total milliseconds spent prompt processing
Total milliseconds in eval/decode calls
number of tokens in eval calls for the prompt (with batch size > 1)
number of eval calls
Timestamp when reset was last called
Time spent loading
total milliseconds spent prompt processing
Total milliseconds in eval/decode calls
number of tokens in eval calls for the prompt (with batch size > 1)
number of eval calls
LLama performance information
llama_perf_sampler_data
A single token
Token Value used when token is inherently null
The raw value
Create a new LLamaToken
Convert a LLamaToken into an integer (extract the raw value)
Convert an integer into a LLamaToken
Get attributes for this token
Get attributes for this token
Get score for this token
Check if this is a control token
Check if this is a control token
Check if this token should end generation
Check if this token should end generation
Token attributes
C# equivalent of llama_token_attr
A single token along with probability of this token being selected
token id
log-odds of the token
probability of the token
Create a new LLamaTokenData
Contains an array of LLamaTokenData, potentially sorted.
The LLamaTokenData
Indicates if `data` is sorted by logits in descending order. If this is false the token data is in _no particular order_.
Create a new LLamaTokenDataArray
Create a new LLamaTokenDataArray, copying the data from the given logits
Create a new LLamaTokenDataArray, copying the data from the given logits into temporary memory.
The memory must not be modified while this is in use.
Temporary memory which will be used to work on these logits. Must be at least as large as logits array
Overwrite the logit values for all given tokens
tuples of token and logit value to overwrite
Sorts candidate tokens by their logits in descending order and calculate probabilities based on logits.
Contains a pointer to an array of LLamaTokenData which is pinned in memory.
C# equivalent of llama_token_data_array
A pointer to an array of LlamaTokenData
Memory must be pinned in place for all the time this LLamaTokenDataArrayNative is in use (i.e. `fixed` or `.Pin()`)
Number of LLamaTokenData in the array
The index in the array (i.e. not the token id)
A pointer to an array of LlamaTokenData
Indicates if the items in the array are sorted, so the most likely token is first
The index of the selected token (i.e. not the token id)
Number of LLamaTokenData in the array. Set this to shrink the array
Create a new LLamaTokenDataArrayNative around the data in the LLamaTokenDataArray
Data source
Created native array
A memory handle, pinning the data in place until disposed
C# equivalent of llama_vocab struct. This struct is an opaque type, with no fields in the API and is only used for typed pointers.
Get attributes for a specific token
Check if the token is supposed to end generation (end-of-generation, e.g. EOS, EOT, etc.)
Identify if Token Id is a control token or a render-able token
beginning-of-sentence
end-of-sentence
end-of-turn
sentence separator
next-line
padding
llama_vocab_pre_type
llama_vocab_type
For models without vocab
LLaMA tokenizer based on byte-level BPE with byte fallback
GPT-2 tokenizer based on byte-level BPE
BERT tokenizer based on WordPiece
T5 tokenizer based on Unigram
RWKV tokenizer based on greedy tokenization
LLaVa Image embeddings
llava_image_embed
Set configurations for all the native libraries, including LLama and LLava
Set configurations for all the native libraries, including LLama and LLava
Configuration for LLama native library
Configuration for LLava native library
Check if the native library has already been loaded. Configuration cannot be modified if this is true.
Set the log callback that will be used for all llama.cpp log messages
Set the log callback that will be used for all llama.cpp log messages
Try to load the native library with the current configurations, but do not actually set it as the loaded library.
You can still modify the configuration after calling this, but only before any call to the native API.
The loaded library. If loading failed, this will be null.
However, if you are using .NET Standard 2.0, this will never be null.
Whether the load was successful.
A class to set the same configuration for multiple libraries at the same time.
Do an action for all the configs in this container.
Set the log callback that will be used for all llama.cpp log messages
Set the log callback that will be used for all llama.cpp log messages
Try to load the native library with the current configurations, but do not actually set it as the loaded library.
You can still modify the configuration after calling this, but only before any call to the native API.
Whether the load was successful.
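A hedged sketch of configuring native library loading before any other LLamaSharp call is made. Method names such as `WithCuda`, `WithAutoFallback` and `WithLogCallback` reflect common versions of this configuration API and should be verified against yours.

```csharp
using System;
using LLama.Native;

// Must run before the first native call; afterwards the configuration is locked.
NativeLibraryConfig.All
    .WithCuda()          // prefer a CUDA build if one is available
    .WithAutoFallback()  // fall back to other builds if loading fails
    .WithLogCallback((level, message) => Console.Write($"[{level}] {message}"));
```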
The name of the native library
The native library compiled from llama.cpp.
The native library compiled from the LLaVA example of llama.cpp.
A native library specified with a local file path.
Information of a native library file.
Which kind of library it is.
Whether it's compiled with cublas.
Whether it's compiled with vulkan.
Which AvxLevel it's compiled with.
Information of a native library file.
Which kind of library it is.
Whether it's compiled with cublas.
Whether it's compiled with vulkan.
Which AvxLevel it's compiled with.
Which kind of library it is.
Whether it's compiled with cublas.
Whether it's compiled with vulkan.
Which AvxLevel it's compiled with.
Avx support configuration
No AVX
Advanced Vector Extensions (supported by most processors after 2011)
AVX2 (supported by most processors after 2013)
AVX512 (supported by some processors after 2016, not widely supported)
Try to load libllama/llava_shared, using CPU feature detection to try and load a more specialised DLL if possible
The library handle to unload later, or IntPtr.Zero if no library was loaded
Operating system information.
Operating system information.
Get the system information of the current machine.
When you are using .NET Standard 2.0, dynamic native library loading is not supported.
An instance of this class will be returned in that case.
A LoRA adapter which can be applied to a context for a specific model
The model which this LoRA adapter was loaded with.
The full path of the file this adapter was loaded from
Native pointer of the loaded adapter, will be automatically freed when the model is unloaded
Indicates if this adapter has been unloaded
Unload this adapter
Direct translation of the llama.cpp API
A method that does nothing. This is a native method, calling it will force the llama native dependencies to be loaded.
Call once at the end of the program - currently only used for MPI
Get the maximum number of devices supported by llama.cpp
Check if memory mapping is supported
Check if memory locking is supported
Check if GPU offload is supported
Check if RPC offload is supported
Initialize the llama + ggml backend. Call once at the start of the program.
This is private because LLamaSharp automatically calls it, and it's only valid to call it once!
Load session file
Save session file
Set whether to use causal attention or not. If set to true, the model will only attend to the past tokens
Set whether the model is in embeddings mode or not.
If true, embeddings will be returned but logits will not
Set abort callback
Get the n_seq_max for this context
Get all output token embeddings.
When pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model, the embeddings for which
llama_batch.logits[i] != 0 are stored contiguously in the order they have appeared in the batch.
shape: [n_outputs*n_embd]
Otherwise, returns an empty span.
Apply chat template. Inspired by hf apply_chat_template() on python.
A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.
Pointer to a list of multiple llama_chat_message
Number of llama_chat_message in this chat
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)
The size of the allocated buffer
The total number of bytes of the formatted prompt. If it is larger than the size of the buffer, you may need to re-alloc the buffer and then re-apply the template.
Get list of built-in chat templates
Print out timing information for this context
Print system information
Convert a single token into text
buffer to write string into
User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')
If true, special tokens are rendered in the output
The length written, or if the buffer is too small a negative that indicates the length required
Convert text into tokens
The tokens pointer must be large enough to hold the resulting tokens.
add_special: allow adding BOS and EOS tokens if the model is configured to do so.
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
Convert the provided tokens into text (inverse of llama_tokenize()).
The char pointer must be large enough to hold the resulting text.
remove_special: allow removing BOS and EOS tokens if the model is configured to do so.
unparse_special If true, special tokens are rendered in the output.
Returns the number of chars/bytes on success, no more than textLengthMax. Returns a negative number on failure - the number of chars/bytes that would have been returned.
Register a callback to receive llama log messages
Returns the number of tokens in the KV cache (slow, use only for debug)
If a KV cell has multiple sequences assigned to it, it will be counted multiple times
Returns the number of used KV cells (i.e. have at least one sequence assigned to them)
Clear the KV cache. Both cell info is erased and KV data is zeroed
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
Returns false if a partial sequence cannot be removed. Removing a whole sequence never fails
Copy all tokens that belong to the specified sequence to another sequence
Note that this does not allocate extra KV cache memory - it simply assigns the tokens to the new sequence
Removes all tokens that do not belong to the specified sequence
Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1)
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Integer division of the positions by factor of `d > 1`
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
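A sketch of a typical "context shift" built from the sequence operations above: drop the oldest tokens of a sequence, then slide the remaining positions back. The wrapper names mirror the native llama_kv_cache_seq_* calls but are assumptions here.
// Remove positions [keep, keep + nDiscard) from sequence 0.
KvCacheSequenceRemove(seq: 0, p0: keep, p1: keep + nDiscard);
// Shift everything after the removed window back by nDiscard positions. With a RoPEd
// cache this is applied lazily on the next decode, or explicitly via the KV cache update call.
KvCacheSequenceAdd(seq: 0, p0: keep + nDiscard, p1: -1, delta: -nDiscard);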
Returns the largest position present in the KV cache for the specified sequence
Allocates a batch of tokens on the heap
Each token can be assigned up to n_seq_max sequence ids
The batch has to be freed with llama_batch_free()
If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float)
Otherwise, llama_batch.token will be allocated to store n_tokens llama_token
The rest of the llama_batch members are allocated with size n_tokens
All members are left uninitialized
Each token can be assigned up to n_seq_max sequence ids
Frees a batch of tokens allocated with llama_batch_init()
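A sketch of the allocate/use/free lifecycle described above; the extern declarations for llama_batch_init and llama_batch_free are assumed to be exposed by the native API layer.
var batch = llama_batch_init(n_tokens: 512, embd: 0, n_seq_max: 1);
try
{
    // Fill batch.token, batch.pos, batch.seq_id and batch.logits here;
    // llama_batch_init leaves all members uninitialized.
}
finally
{
    llama_batch_free(batch);
}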
Apply a loaded control vector to a llama_context, or if data is NULL, clear
the currently loaded vector.
n_embd should be the size of a single layer's control, and data should point
to an n_embd x n_layers buffer starting from layer 1.
il_start and il_end are the layer range the vector should apply to (both inclusive)
See llama_control_vector_load in common to load a control vector.
Build a split GGUF final path for this chunk.
llama_split_path(split_path, sizeof(split_path), "/models/ggml-model-q4_0", 2, 4) => split_path = "/models/ggml-model-q4_0-00002-of-00004.gguf"
Returns the split_path length.
Extract the path prefix from the split_path if and only if the split_no and split_count match.
llama_split_prefix(split_prefix, 64, "/models/ggml-model-q4_0-00002-of-00004.gguf", 2, 4) => split_prefix = "/models/ggml-model-q4_0"
Returns the split_prefix length.
Sanity check for clip <-> llava embed size match
LLama Context
Llava Model
True if validate successfully
Build an image embed from image file bytes
SafeHandle to the Clip Model
Number of threads
Binary image in jpeg format
Bytes length of the image
SafeHandle to the Embeddings
Build an image embed from a path to an image filename
SafeHandle to the Clip Model
Number of threads
Image filename (jpeg) to generate embeddings
SafeHandle to the embeddings
Free an embedding made with llava_image_embed_make_*
Embeddings to release
Write the image represented by embed into the llama context with batch size n_batch, starting at context
pos n_past. On completion, n_past points to the next position in the context after the image embed.
Llama Context
Embedding handle
True on success
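A sketch of the llava flow described above. The wrapper names are assumptions; the underlying native calls are llava_image_embed_make_with_bytes, llava_eval_image_embed and llava_image_embed_free.
var nPast = 0;
var embed = MakeImageEmbedFromBytes(clipModel, threads: 4, imageBytes, imageBytes.Length);
try
{
    // On success nPast has been advanced past the image embed.
    var ok = EvalImageEmbed(context, embed, nBatch: 512, ref nPast);
}
finally
{
    FreeImageEmbed(embed);
}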
Get the loaded native library. If you are using netstandard2.0, it will always return null.
Returns 0 on success
Returns 0 on success
Configure llama.cpp logging
Callback from llama.cpp with log messages
Register a callback to receive llama log messages
A GC handle for the current log callback to ensure the callback is not collected
Register a callback to receive llama log messages
Register a callback to receive llama log messages
RoPE scaling type.
C# equivalent of llama_rope_scaling_type
No particular scaling type has been specified
Do not apply any RoPE scaling
Positional linear interpolation, as described by kaikendev: https://kaiokendev.github.io/til#extending-context-to-8k
YaRN scaling: https://arxiv.org/pdf/2309.00071.pdf
LongRope scaling
A safe wrapper around a llama_context
Total number of tokens in the context
Dimension of embedding vectors
Get the maximum batch size for this context
Get the physical maximum batch size for this context
Get or set the number of threads used for generation of a single token.
Get or set the number of threads used for prompt and batch processing (multiple tokens).
Get the pooling type for this context
Get the model which this context is using
Get the vocabulary for the model this context is using
Create a new llama_state for the given model
Create a new llama_context with the given model. **This should never be called directly! Always use SafeLLamaContextHandle.Create**!
Frees all allocated memory in the given llama_context
Set a callback which can abort computation
If this returns true computation is cancelled
Positive return values do not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error
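A sketch of handling the return values listed above; Decode stands in for the llama_decode binding and is an assumed wrapper name.
var result = Decode(batch);
if (result == 1)
{
    // Not fatal: no KV slot was found. Retry with a smaller batch,
    // or create the context with a larger context size.
}
else if (result < 0)
{
    throw new InvalidOperationException($"llama_decode failed with code {result}");
}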
Processes a batch of tokens with the encoder part of the encoder-decoder model. Stores the encoder output
internally for later use by the decoder cross-attention layers.
0 = success
< 0 = error
Set the number of threads used for decoding
n_threads is the number of threads used for generation (single token)
n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)
Get the number of threads used for generation of a single token.
Get the number of threads used for prompt and batch processing (multiple tokens).
Token logits obtained from the last call to llama_decode
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
Get the size of the context window for the model for this context
Get the batch size for this context
Get the ubatch size for this context
Returns the **actual** size in bytes of the state (logits, embedding and kv_cache).
Only use when saving the state, not when restoring it, otherwise the size may be too small.
Copies the state to the specified destination address.
Destination needs to have allocated enough memory.
the number of bytes copied
Set the state reading from the specified address
the number of bytes read
Get the exact size needed to copy the KV cache of a single sequence
Copy the KV cache of a single sequence into the specified buffer
Copy the sequence data (originally copied with `llama_state_seq_get_data`) into the specified sequence
- Positive: Ok
- Zero: Failed to load
Defragment the KV cache. This will be applied:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
Check if the context supports KV cache shifting
Wait until all computations are finished. This is automatically done when using any of the functions to obtain computation results,
so it is not necessary to call this explicitly in most cases.
Get the pooling type for this context
Get the embeddings for a sequence id.
Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE
when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[1] with the rank of the sequence
otherwise: float[n_embd] (1-dimensional)
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
Get the embeddings for the ith sequence.
Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
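A sketch of reading a pooled sequence embedding as described above; the P/Invoke signature for llama_get_embeddings_seq is assumed, and nEmbd is the context's embedding dimension.
unsafe
{
    float* embd = llama_get_embeddings_seq(ctx, seqId);   // null when pooling_type is NONE
    if (embd != null)
    {
        // With RANK pooling only embd[0] is meaningful; otherwise read n_embd floats.
        var vector = new ReadOnlySpan<float>(embd, nEmbd).ToArray();
    }
}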
Add a LoRA adapter to this context
Remove a LoRA adapter from this context
Indicates whether the LoRA was present in this context and was removed
Remove all LoRA adapters from this context
Token logits obtained from the last call to llama_decode.
The logits for the last token are stored in the last row.
Only tokens with `logits = true` requested are present.
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
The number of tokens whose logits should be retrieved, in [numTokens x n_vocab] format.
The tokens' order matches their order in the LLamaBatch (first tokens first, and so on).
This is helpful when requesting logits for many tokens in a sequence, or when decoding multiple sequences in one go.
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
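A sketch of mutating the logits of the most recent token before sampling, using the layout described above (the GetLogits wrapper name and shape are assumptions):
var logits = context.GetLogits(numTokens: 1);        // span over the last row
logits[bannedTokenId] = float.NegativeInfinity;      // make one token unselectable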
Get the embeddings for the ith sequence.
Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
Get the embeddings for a specific sequence.
Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
A pointer to the first float in an embedding, length = ctx.EmbeddingSize
Convert the given text into tokens
The text to tokenize
Whether the "BOS" token should be added
Encoding to use for the text
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
Convert a single llama token into bytes
Token to decode
A span to attempt to write into. If this is too small nothing will be written
The size of this token. **nothing will be written** if this is larger than `dest`
This object exists to ensure there is only ever 1 inference running at a time. This is a workaround for thread safety issues in llama.cpp itself.
Most notably CUDA, which seems to use some global singleton resources and will crash if multiple inferences are run (even against different models).
For more information see these issues:
- https://github.com/SciSharp/LLamaSharp/issues/596
- https://github.com/ggerganov/llama.cpp/issues/3960
If these are ever resolved this lock can probably be removed.
Wait until all computations are finished. This is automatically done when using any of the functions to obtain computation results,
so it is not necessary to call this explicitly in most cases.
Processes a batch of tokens with the encoder part of the encoder-decoder model. Stores the encoder output
internally for later use by the decoder cross-attention layers.
0 = success
< 0 = error (the KV cache state is restored to the state before this call)
Positive return values do not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error (the KV cache state is restored to the state before this call)
Decode a set of tokens in batch-size chunks.
A tuple, containing the decode result and the number of tokens that have not been decoded yet.
Positive return values do not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error
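A sketch of using the tuple described above; the managed Decode overload shown here is an assumption about the wrapper's shape.
var (result, notDecoded) = context.Decode(tokens, sequenceId, batch, ref nPast);
if (notDecoded > 0)
{
    // The batch-size chunking stopped early; inspect `result` (e.g. no KV slot) before retrying.
}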
Get the size of the state, when saved as bytes
Get the size of the KV cache for a single sequence ID, when saved as bytes
Get the raw state of this context, encoded as bytes. Data is written into the `dest` pointer.
Destination to write to
Number of bytes available to write to in dest (check required size with `GetStateSize()`)
The number of bytes written to dest
Thrown if dest is too small
Get the raw state of a single sequence from this context, encoded as bytes. Data is written into the `dest` pointer.
Destination to write to
Number of bytes available to write to in dest (check required size with `GetStateSize()`)
The sequence to get state data for
The number of bytes written to dest
Set the raw state of this context
The pointer to read the state from
Number of bytes that can be safely read from the pointer
Number of bytes read from the src pointer
Set the raw state of a single sequence
The pointer to read the state from
Sequence ID to set
Number of bytes that can be safely read from the pointer
Number of bytes read from the src pointer
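A sketch of a save/restore round trip using the state calls described above (method names are assumed to match the managed wrappers):
var size = context.GetStateSize();
var buffer = new byte[(int)size];
unsafe
{
    fixed (byte* ptr = buffer)
    {
        var written = context.GetState(ptr, size);   // throws if the buffer is too small
        // ... later, restore into a context created from the same model:
        var read = context.SetState(ptr, written);
    }
}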
Get performance information
Reset all performance information for this context
Check if the context supports KV cache shifting
Apply KV cache updates (such as K-shifts, defragmentation, etc.)
Defragment the KV cache. This will be applied:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Get a new KV cache view that can be used to debug the KV cache
Count the number of used cells in the KV cache (i.e. have at least one sequence assigned to them)
Returns the number of tokens in the KV cache (slow, use only for debug)
If a KV cell has multiple sequences assigned to it, it will be counted multiple times
Clear the KV cache - both cell info is erased and KV data is zeroed
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
Copy all tokens that belong to the specified sequence to another sequence. Note that
this does not allocate extra KV cache memory - it simply assigns the tokens to the
new sequence
Removes all tokens that do not belong to the specified sequence
Adds relative position "delta" to all tokens that belong to the specified sequence
and have positions in [p0, p1). If the KV cache is RoPEd, the KV data is updated
accordingly
Integer division of the positions by factor of `d > 1`.
If the KV cache is RoPEd, the KV data is updated accordingly.
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
Returns the largest position present in the KV cache for the specified sequence
Base class for all llama handles to native resources
A reference to a set of llama model weights
Get the rope (positional embedding) type for this model
The number of tokens in the context that this model was trained for
Get the rope frequency this model was trained with
Dimension of embedding vectors
Get the size of this model in bytes
Get the number of parameters in this model
Get the number of layers in this model
Get the number of heads in this model
Returns true if the model contains an encoder that requires llama_encode() call
Returns true if the model contains a decoder that requires llama_decode() call
Returns true if the model is recurrent (like Mamba, RWKV, etc.)
Get a description of this model
Get the number of metadata key/value pairs
Get the vocabulary of this model
Load a model from the given file path into memory
Load the model from a file
If the file is split into multiple parts, the file name must follow this pattern: {name}-%05d-of-%05d.gguf
If the split file name does not follow this pattern, use llama_model_load_from_splits
The loaded model, or null on failure.
Load the model from multiple splits (support custom naming scheme)
The paths must be in the correct order
Apply a LoRA adapter to a loaded model
path_base_model is the path to a higher quality model to use as a base for
the layers modified by the adapter. Can be NULL to use the current loaded model.
The model needs to be reloaded before applying a new adapter, otherwise the adapter
will be applied on top of the previous one
Returns 0 on success
Frees all allocated memory associated with a model
Get the number of metadata key/value pairs
Get metadata key name by index
Model to fetch from
Index of key to fetch
buffer to write result into
The length of the string on success (even if the buffer is too small). -1 if the key does not exist.
Get metadata value as a string by index
Model to fetch from
Index of val to fetch
Buffer to write result into
The length of the string on success (even if the buffer is too small). -1 if the key does not exist.
Get metadata value as a string by key name
The length of the string on success, or -1 on failure
Get the number of tokens in the model vocabulary
Get the size of the context window for the model
Get the dimension of embedding vectors from this model
Get the number of layers in this model
Get the number of heads in this model
Get a string describing the model type
The length of the string on success (even if the buffer is too small), or -1 on failure
Get the size of the model in bytes
The size of the model
Get the number of parameters in this model
The functions return the length of the string on success, or -1 on failure
Get the model's RoPE frequency scaling factor
For encoder-decoder models, this function returns the id of the token that must be provided
to the decoder to start generating the output sequence. For other models, it returns -1.
Returns true if the model contains an encoder that requires llama_encode() call
Returns true if the model contains a decoder that requires llama_decode() call
Returns true if the model is recurrent (like Mamba, RWKV, etc.)
Load a LoRA adapter from file. The adapter will be associated with this model but will not be applied
Convert a single llama token into bytes
Token to decode
A span to attempt to write into. If this is too small nothing will be written
User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')
If true, special characters will be converted to text. If false they will be invisible.
The size of this token. **nothing will be written** if this is larger than `dest`
Convert a sequence of tokens into characters.
The section of the span which has valid data in it.
If there was insufficient space in the output span this will be
filled with as many characters as possible, starting from the _last_ token.
Convert a string of text into tokens
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
Create a new context for this model
Get the metadata value for the given key
The key to fetch
The value, null if there is no such key
Get the metadata key for the given index
The index to get
The key, null if there is no such key or if the buffer was too small
Get the metadata value for the given index
The index to get
The value, null if there is no such value or if the buffer was too small
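A sketch of enumerating model metadata with the index accessors described above (property and method names are assumptions about the managed wrapper):
for (var i = 0; i < model.MetadataCount; i++)
{
    var key = model.MetadataKeyByIndex(i);       // null if there is no such key
    var value = model.MetadataValueByIndex(i);   // null if there is no such value
    Console.WriteLine($"{key} = {value}");
}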
Get the default chat template. Returns nullptr if not available
If name is NULL, returns the default chat template
Get tokens for a model
Total number of tokens in this vocabulary
Get the type of this vocabulary
Get the Beginning of Sentence token for this model
Get the End of Sentence token for this model
Get the newline token for this model
Get the padding token for this model
Get the sentence separator token for this model
Codellama beginning of infill prefix
Codellama beginning of infill middle
Codellama beginning of infill suffix
Codellama pad
Codellama rep
Codellama rep
end-of-turn token
For encoder-decoder models, this function returns the id of the token that must be provided
to the decoder to start generating the output sequence.
Check if the current model requires a BOS token added
Check if the current model requires an EOS token added
A chain of sampler stages that can be used to select tokens from logits.
Wraps a handle returned from `llama_sampler_chain_init`. Other samplers are owned by this chain and are never directly exposed.
Get the number of samplers in this chain
Apply this sampler to a set of candidates
Sample and accept a token from the idx-th output of the last evaluation. Shorthand for:
var logits = ctx.GetLogitsIth(idx);
var token_data_array = LLamaTokenDataArray.Create(logits);
using var _ = LLamaTokenDataArrayNative.Create(token_data_array, out var native_token_data);
sampler_chain.Apply(native_token_data);
var token = native_token_data.Data.Span[native_token_data.Selected];
sampler_chain.Accept(token);
return token;
Reset the state of this sampler
Accept a token and update the internal state of this sampler
Get the name of the sampler at the given index
Get the seed of the sampler at the given index if applicable. Returns LLAMA_DEFAULT_SEED otherwise.
Create a new sampler chain
Clone a sampler stage from another chain and add it to this chain
The chain to clone a stage from
The index of the stage to clone
Remove a sampler stage from this chain
Add a custom sampler stage
Add a sampler which picks the most likely token.
Add a sampler which picks from the probability distribution of all tokens
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
The number of tokens considered in the estimation of `s_hat`. This is an arbitrary value that is used to calculate `s_hat`, which in turn helps to calculate the value of `k`. In the paper, they use `m = 100`, but you can experiment with different values to see how it affects the performance of the algorithm.
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
Apply temperature to the logits.
If temperature is less than zero the maximum logit is left unchanged and the rest are set to -infinity
Dynamic temperature implementation (a.k.a. entropy) described in the paper https://arxiv.org/abs/2309.02772.
XTC sampler as described in https://github.com/oobabooga/text-generation-webui/pull/6335
This sampler is meant to be used for fill-in-the-middle infilling, after top_k + top_p sampling
1. if the sum of the EOG probs times the number of candidates is higher than the sum of the other probs -> pick EOG
2. combine probs of tokens that have the same prefix
example:
- before:
"abc": 0.5
"abcd": 0.2
"abcde": 0.1
"dummy": 0.1
- after:
"abc": 0.8
"dummy": 0.1
3. discard non-EOG tokens with low prob
4. if no tokens are left -> pick EOT
Create a sampler which makes tokens impossible unless they match the grammar
Root rule of the grammar
Create a sampler using lazy grammar sampling: https://github.com/ggerganov/llama.cpp/pull/9639
Grammar in GBNF form
Root rule of the grammar
A list of tokens that will trigger the grammar sampler.
A list of words that will trigger the grammar sampler.
Create a sampler that applies various repetition penalties.
Avoid using on the full vocabulary as searching for repeated tokens can become slow. For example, apply top-k or top-p sampling first.
How many tokens of history to consider when calculating penalties
Repetition penalty
Frequency penalty
Presence penalty
DRY sampler, designed by p-e-w, as described in: https://github.com/oobabooga/text-generation-webui/pull/5677.
Porting Koboldcpp implementation authored by pi6am: https://github.com/LostRuins/koboldcpp/pull/982
The model this sampler will be used with
penalty multiplier, 0.0 = disabled
exponential base
repeated sequences longer than this are penalized
how many tokens to scan for repetitions (0 = entire context)
Create a sampler that applies a bias directly to the logits
llama_sampler_chain_params
whether to measure performance timings
Get the default LLamaSamplerChainParams
A bias to apply directly to a logit
The token to apply the bias to
The bias to add
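A sketch of building a small chain from the stages described above: temperature, top-k, then a final distribution sampler. The Add* method names and the params factory are assumptions about the managed wrapper.
using var chain = SafeLLamaSamplerChainHandle.Create(LLamaSamplerChainParams.Default());
chain.AddTemperature(0.8f);
chain.AddTopK(40);
chain.AddDistributionSampler(seed: 1234);
// Sampling then follows the shorthand shown earlier: apply the chain to the
// latest token's logits and accept the selected token.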
llama_sampler_i
Get the name of this sampler
Update internal sampler state after a token has been chosen
Apply this sampler to a set of logits
Reset the internal state of this sampler
Create a clone of this sampler
Free all resources held by this sampler
llama_sampler
Holds the function pointers which make up the actual sampler
Any additional context this sampler needs; it may be anything. We will use it
to hold a GCHandle.
This GCHandle roots this object, preventing it from being freed.
A reference to the user code which implements the custom sampler
Get a pointer to a `llama_sampler` (LLamaSamplerNative) struct, suitable for passing to `llama_sampler_chain_add`
A custom sampler stage for modifying logits or selecting a token
The human readable name of this stage
Apply this stage to a set of logits.
This can modify logits or select a token (or both).
If the logits are modified, the Sorted flag must be set to false.
If the logits are no longer sorted after the custom sampler has run, it is critically important to
set Sorted = false. If unsure, always set it to false; this is a safe default.
Update the internal state of the sampler when a token is chosen
Reset the internal state of this sampler
Create a clone of this sampler
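A sketch of a custom stage that bans a single token, following the contract described above. The interface and member names (ICustomSampler, LLamaTokenDataArrayNative, Data, Sorted, ID, Logit) are assumed to match the types referenced elsewhere in this documentation.
class BanTokenSampler : ICustomSampler
{
    private readonly int _banned;
    public BanTokenSampler(int banned) => _banned = banned;

    public string Name => "ban-token";

    public void Apply(ref LLamaTokenDataArrayNative tokenData)
    {
        var candidates = tokenData.Data;
        for (var i = 0; i < candidates.Length; i++)
        {
            if ((int)candidates[i].ID == _banned)
                candidates[i].Logit = float.NegativeInfinity;
        }

        // Logits were modified, so the data can no longer be assumed sorted.
        tokenData.Sorted = false;
    }

    public void Accept(LLamaToken token) { }   // update internal state when a token is chosen
    public void Reset() { }                    // nothing to reset in this example
    public ICustomSampler Clone() => new BanTokenSampler(_banned);
    public void Dispose() { }
}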
A Reference to a llava Image Embed handle
Get the model used to create this image embedding
Get the number of dimensions in an embedding
Get the number of "patches" in an image embedding
Create an image embed from an image file
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create an image embed from an image file
Path to the image file. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create an image embed from the bytes of an image.
Image bytes. Supported formats:
- JPG
- PNG
- BMP
- TGA
Create an image embed from the bytes of an image.
Image bytes. Supported formats:
- JPG
- PNG
- BMP
- TGA
Copy the embeddings data to the destination span
A reference to a set of llava model weights.
Get the number of dimensions in an embedding
Get the number of "patches" in an image embedding
Load a model from the given file path into memory
MMP File (Multi-Modal Projections)
Verbosity level
SafeHandle of the Clip Model
Create the Image Embeddings.
LLama Context
Image filename (it supports jpeg format only)
return the SafeHandle of these embeddings
Create the Image Embeddings.
Image in binary format (it supports jpeg format only)
Number of threads to use
return the SafeHandle of these embeddings
Create the Image Embeddings.
LLama Context
Image in binary format (it supports jpeg format only)
return the SafeHandle of these embeddings
Create the Image Embeddings.
Image in binary format (it supports jpeg format only)
Number of threads to use
return the SafeHandle of these embeddings
Evaluates the image embeddings.
Llama Context
The current embeddings to evaluate
True on success
Load MULTI MODAL PROJECTIONS model / Clip Model
Model path/file
Verbosity level
SafeLlavaModelHandle
Frees MULTI MODAL PROJECTIONS model / Clip Model
Internal Pointer to the model
Create a new sampler wrapping a llama.cpp sampler chain
Create a sampling chain. This will be called once, the base class will automatically dispose the chain.
An implementation of ISamplingPipeline which mimics the default llama.cpp sampling
Bias values to add to certain logits
Repetition penalty, as described in https://arxiv.org/abs/1909.05858
Frequency penalty as described by OpenAI: https://platform.openai.com/docs/api-reference/chat/create
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text
so far, decreasing the model's likelihood to repeat the same line verbatim.
Presence penalty as described by OpenAI: https://platform.openai.com/docs/api-reference/chat/create
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the
text so far, increasing the model's likelihood to talk about new topics.
How many tokens should be considered for penalties
Whether the newline token should be protected from being modified by penalty
Whether the EOS token should be suppressed. Setting this to 'true' prevents EOS from being sampled
Temperature to apply (higher temperature is more "creative")
Number of tokens to keep in TopK sampling
P value for locally typical sampling
P value for TopP sampling
P value for MinP sampling
Grammar to apply to constrain possible tokens
The minimum number of tokens to keep for samplers which remove tokens
Seed to use for random sampling
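A sketch of configuring the pipeline with the knobs listed above (property names are assumed to match the current release):
var pipeline = new DefaultSamplingPipeline
{
    Temperature = 0.7f,      // higher is more "creative"
    TopK = 40,
    TopP = 0.9f,
    MinP = 0.05f,
    RepeatPenalty = 1.1f,
    FrequencyPenalty = 0.1f,
    PresencePenalty = 0.1f,
    Seed = 42,
};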
A grammar in GBNF form
A grammar in GBNF form
A sampling pipeline which always selects the most likely token
Grammar to apply to constrain possible tokens
Convert a span of logits into a single sampled token. This interface can be implemented to completely customise the sampling process.
Sample a single token from the given context at the given position
The context being sampled from
Position to sample logits from
Reset all internal state of the sampling pipeline
Update the pipeline, with knowledge that a particular token was just accepted
Extension methods for
Sample a single token from the given context at the given position
The context being sampled from
Position to sample logits from
Decodes a stream of tokens into a stream of characters
The number of decoded characters waiting to be read
If true, special characters will be converted to text. If false they will be invisible.
Create a new decoder
Text encoding to use
Model weights
Create a new decoder
Context to retrieve encoding and model weights from
Create a new decoder
Text encoding to use
Context to retrieve model weights from
Create a new decoder
Text encoding to use
Models weights to use
Add a single token to the decoder
Add a single token to the decoder
Add all tokens in the given enumerable
Add all tokens in the given span
Read all decoded characters and clear the buffer
Read all decoded characters as a string and clear the buffer
Set the decoder back to its initial state
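A sketch of streaming tokens through the decoder described above (constructor and method names assumed):
var decoder = new StreamingTokenDecoder(context);
foreach (var token in generatedTokens)
{
    decoder.Add(token);
    Console.Write(decoder.Read());   // read whatever has decoded so far and clear the buffer
}
decoder.Reset();                      // back to the initial state before the next generation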
A prompt formatter that will use llama.cpp's template formatter
If your model is not supported, you will need to define your own formatter according to the chat prompt specification for your model
A prompt formatter that will use llama.cpp's template formatter
If your model is not supported, you will need to define your own formatter according to the chat prompt specification for your model
Apply the template to the messages and return the resulting prompt as a string
The formatted template string as defined by the model