LLamaSharp Reserved to be used by the compiler for tracking metadata. This class should not be used by developers in source code. This definition is provided by the IsExternalInit NuGet package (https://www.nuget.org/packages/IsExternalInit). Please see https://github.com/manuelroemer/IsExternalInit for more information. The parameters for initializing a LLama context from a model. Model context size (n_ctx) maximum batch size that can be submitted at once (must be >=32 to use BLAS) (n_batch) Physical batch size max number of sequences (i.e. distinct states for recurrent models) If true, extract embeddings (together with logits). RoPE base frequency (null to fetch from the model) RoPE frequency scaling factor (null to fetch from the model) The encoding to use for models Number of threads (null = autodetect) (n_threads) Number of threads to use for batch processing (null = autodetect) (n_threads_batch) YaRN extrapolation mix factor (null = from model) YaRN magnitude scaling factor (null = from model) YaRN low correction dim (null = from model) YaRN high correction dim (null = from model) YaRN original context length (null = from model) YaRN scaling method to use. Override the type of the K cache Override the type of the V cache Whether to disable offloading the KQV cache to the GPU Whether to use flash attention defragment the KV cache if holes/size > defrag_threshold, set to < 0 to disable (default) defragment the KV cache if holes/size > defrag_threshold, set to < 0 to disable (default) How to pool (sum) embedding results by sequence id (ignored if no pooling layer) Attention type to use for embeddings Transform history to plain text and vice versa. Convert a ChatHistory instance to plain text. The ChatHistory instance Converts plain text to a ChatHistory instance. The role for the author. The chat history as plain text. The updated history. Copy the transform. The parameters used for inference. number of tokens to keep from initial prompt how many new tokens to predict (n_predict), set to -1 to generate responses indefinitely until completion. Sequences where the model will stop generating further tokens. Set a custom sampling pipeline to use. A high level interface for LLama models. The loaded context for this executor. Identify if it's a multi-modal model and there is an image to process. Multi-Modal Projections / Clip Model weights List of images: List of images in byte array format. Asynchronously infers a response from the model. Your prompt Any additional parameters A cancellation token. Convenience interface for implementing both types of parameters. Mostly exists for backwards compatibility reasons, when these two were not split. The parameters for initializing a LLama model. main_gpu interpretation depends on split_mode: None The GPU that is used for the entire model. Row The GPU that is used for small tensors and intermediate results. Layer Ignored. How to split the model across multiple GPUs Number of layers to run in VRAM / GPU memory (n_gpu_layers) Use mmap for faster loads (use_mmap) Use mlock to keep model in memory (use_mlock) Model path (model) how split tensors should be distributed across GPUs Load vocab only (no weights) Validate model tensor data before loading Override specific metadata items in the model A fixed size array to set the tensor splits across multiple GPUs The size of this array Get or set the proportion of work to do on the given device. "[ 3, 2 ]" will assign 60% of the data to GPU 0 and 40% to GPU 1.
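The model and context parameters above are typically set on a single parameter object before any weights are loaded. The following is a minimal sketch of how that might look; the property names (ContextSize, BatchSize, GpuLayerCount, Threads, Embeddings, TensorSplits) and the model path are assumptions based on the descriptions above and should be checked against the LLamaSharp version in use.

```csharp
using LLama.Abstractions;
using LLama.Common;

// A minimal sketch of configuring the parameters described above.
// Property names and the model path are assumptions - verify against your version.
var parameters = new ModelParams("models/llama-7b.Q4_K_M.gguf")
{
    ContextSize = 4096,   // n_ctx, null = take the value from the model
    BatchSize = 512,      // n_batch, must be >= 32 to use BLAS
    GpuLayerCount = 32,   // n_gpu_layers, layers kept in VRAM / GPU memory
    Threads = 8,          // n_threads, null = autodetect
    Embeddings = false,   // set true to extract embeddings together with logits
};

// "[ 3, 2 ]" assigns 60% of the work to GPU 0 and 40% to GPU 1.
parameters.TensorSplits = new TensorSplitsCollection(new float[] { 3, 2 });
```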
Create a new tensor splits collection, copying the given values Create a new tensor splits collection with all values initialised to the default Set all values to zero A JSON converter for An override for a single key/value pair in model metadata Get the key being overridden by this override Create a new override for an int key Create a new override for a float key Create a new override for a boolean key Create a new override for a string key A JSON converter for Descriptor of a native library. Metadata of this library. Prepare the native library file and return its local path. If it's a relative path, LLamaSharp will search the path in the search directories you set. The system information of the current machine. The log callback. The relative paths of the library. You could return multiple paths to try them one by one. If no file is available, please return an empty array. Takes a stream of tokens and transforms them. Takes a stream of tokens and transforms them, returning a new stream of tokens asynchronously. Copy the transform. An interface for text transformations. These can be used to compose a pipeline of text transformations, such as: - Tokenization - Lowercasing - Punctuation removal - Trimming - etc. Takes a string and transforms it. Copy the transform. Extension methods to the interface. Gets an instance for the specified . The executor. The to use to transform an input list of messages into a prompt. The to use to transform the output into text. An instance for the provided . is null. Format the chat messages into a string prompt. Convert the chat options to inference parameters. A default transform that appends "Assistant: " to the end. AntipromptProcessor keeps track of past tokens looking for any set Anti-Prompts Initializes a new instance of the class. The antiprompts. Add an antiprompt to the collection Overwrite all current antiprompts with a new set Add some text and check if the buffer now ends with any antiprompt true if the text buffer ends with any antiprompt A batched executor that can infer multiple separate "conversations" simultaneously. Set to 1 using interlocked exchange while inference is running Epoch is incremented twice every time Infer is called. Conversations can use this to keep track of whether they're waiting for inference, or can be sampled. The this executor is using The this executor is using Get the number of tokens in the batch, waiting for to be called Number of batches in the queue, waiting for to be called Check if this executor has been disposed. Create a new batched executor The model to use Parameters to create a new context Start a new Load a conversation that was previously saved to a file. Once loaded the conversation will need to be prompted. Load a conversation that was previously saved into memory. Once loaded the conversation will need to be prompted. Run inference for all conversations in the batch which have pending tokens. If the result is `NoKvSlot` then there is not enough memory for inference, try disposing some conversation threads and running inference again. Get a reference to a batch that tokens can be added to. Get a reference to a batch that embeddings can be added to. A single conversation thread that can be prompted (adding tokens from the user) or inferred (extracting a token from the LLM) Indicates if this conversation has been "forked" and may share logits with another conversation. Stores the indices to sample from. Contains valid items.
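As a rough illustration of how the antiprompt buffer described above can be used around a streaming executor, here is a hedged sketch. It assumes AntipromptProcessor exposes a constructor taking the antiprompt strings and an Add(string) method that returns true when the accumulated text ends with any antiprompt, which matches the description but should be checked against the actual API.

```csharp
using System;
using System.Threading.Tasks;
using LLama;
using LLama.Abstractions;
using LLama.Common;

// Sketch: stream text from an executor and stop once an antiprompt is seen.
// Assumes AntipromptProcessor(IEnumerable<string>) and a bool Add(string) method.
static async Task StreamUntilAntipromptAsync(ILLamaExecutor executor, string prompt)
{
    var antiprompts = new AntipromptProcessor(new[] { "User:" });
    var inferParams = new InferenceParams { MaxTokens = 256 };

    await foreach (var chunk in executor.InferAsync(prompt, inferParams))
    {
        Console.Write(chunk);

        // Add() returns true when the buffered text now ends with any antiprompt.
        if (antiprompts.Add(chunk))
            break;
    }
}
```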
The executor which this conversation belongs to Unique ID for this conversation Total number of tokens in this conversation, cannot exceed the context length. Indicates if this conversation has been disposed, nothing can be done with a disposed conversation Indicates if this conversation is waiting for inference to be run on the executor. "Prompt" and "Sample" cannot be called when this is true. Indicates that this conversation should be sampled. Finalizer for Conversation End this conversation, freeing all resources used by it Create a copy of the current conversation The copy shares internal state, so consumes very little extra memory. Get the index in the context at which each token can be sampled from; the return value of this function can be used to retrieve logits () or to sample a token (. How far from the end of the previous prompt should logits be sampled. Any value other than 0 requires allLogits to have been set during prompting.
For example if 5 tokens were supplied in the last prompt call: The logits of the first token can be accessed with 4 The logits of the second token can be accessed with 3 The logits of the third token can be accessed with 2 The logits of the fourth token can be accessed with 1 The logits of the fifth token can be accessed with 0 Thrown if this conversation was not prompted before the previous call to infer Thrown if Infer() must be called on the executor
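To make the prompt, infer and sample cycle described above concrete, here is a hedged sketch of a single-conversation loop with the batched executor. The sampler type (DefaultSamplingPipeline), the Sample(...) extension, the tokenizer call and the model path are assumptions that may differ between LLamaSharp versions.

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;
using LLama.Native;
using LLama.Sampling;

// Sketch of the prompt -> Infer -> sample cycle described above.
var parameters = new ModelParams("model.gguf"); // hypothetical path
using var model = await LLamaWeights.LoadFromFileAsync(parameters);
using var executor = new BatchedExecutor(model, parameters);
using var conversation = executor.Create();

// Queue the prompt tokens; nothing runs until Infer() is called.
conversation.Prompt(executor.Context.Tokenize("Once upon a time"));

var sampler = new DefaultSamplingPipeline();
for (var i = 0; i < 32; i++)
{
    // Run pending work for every conversation in the batch.
    var result = await executor.Infer();
    if (result == DecodeResult.NoKvSlot)
        break; // not enough KV cache memory - dispose some conversations and retry

    // Sample at offset 0 (the end of the prompt) and feed the token back in.
    var token = conversation.Sample(sampler);
    conversation.Prompt(token);
}
```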
Get the logits from this conversation, ready for sampling How far from the end of the previous prompt should logits be sampled. Any value other than 0 requires allLogits to have been set during prompting Thrown if this conversation was not prompted before the previous call to infer Thrown if Infer() must be called on the executor Add tokens to this conversation If true, generate logits for all tokens. If false, only generate logits for the last token. Add tokens to this conversation If true, generate logits for all tokens. If false, only generate logits for the last token. Add a single token to this conversation Prompt this conversation with an image embedding Prompt this conversation with embeddings The raw values of the embeddings. This span must divide equally by the embedding size of this model. Directly modify the KV cache of this conversation Thrown if this method is called while == true Provides direct access to the KV cache of a . See for how to use this. Removes all tokens that have positions in [start, end) Start position (inclusive) End position (exclusive) Removes all tokens starting from the given position Start position (inclusive) Number of tokens Adds relative position "delta" to all tokens that have positions in [p0, p1). If the KV cache is RoPEd, the KV data is updated accordingly Start position (inclusive) End position (exclusive) Amount to add on to each token position Integer division of the positions by factor of `d > 1`. If the KV cache is RoPEd, the KV data is updated accordingly. Start position (inclusive). If less than zero, it is clamped to zero. End position (exclusive). If less than zero, it is treated as "infinity". Amount to divide each position by. A function which can temporarily access the KV cache of a to modify it directly The current end token of this conversation An which allows direct access to modify the KV cache The new end token position Save the complete state of this conversation to a file. if the file already exists it will be overwritten. Save the complete state of this conversation in system memory. Load state from a file This should only ever be called by the BatchedExecutor, on a newly created conversation object! Load state from a previously saved state. This should only ever be called by the BatchedExecutor, on a newly created conversation object! In memory saved state of a Indicates if this state has been disposed Get the size in bytes of this state object Internal constructor prevent anyone outside of LLamaSharp extending this class Extension method for Sample a token from this conversation using the given sampler chain to sample from Offset from the end of the conversation to the logits to sample, see for more details Sample a token from this conversation using the given sampling pipeline to sample from Offset from the end of the conversation to the logits to sample, see for more details Rewind a back to an earlier state by removing tokens from the end The conversation to rewind The number of tokens to rewind Thrown if `tokens` parameter is larger than TokenCount Shift all tokens over to the left, removing "count" tokens from the start and shifting everything over. Leaves "keep" tokens at the start completely untouched. This can be used to free up space when the context gets full, keeping the prompt at the start intact. 
The conversation to rewind How much to shift tokens over by The number of tokens at the start which should not be shifted Base class for exceptions thrown from This exception is thrown when "Prompt()" is called on a which has already been prompted and before "Infer()" has been called on the associated . This exception is thrown when "Sample()" is called on a which has already been prompted and before "Infer()" has been called on the associated . This exception is thrown when "Sample()" is called on a which was not first prompted. . This exception is thrown when is called when = true This exception is thrown when "Save()" is called on a which has already been prompted and before "Infer()" has been called. . Save the state of a particular sequence to specified path. Also save some extra data which will be returned when loading. Data saved with this method must be loaded with Load the state from the specified path into a particular sequence. Also reads header data. Must only be used with data previously saved with The main chat session class. The filename for the serialized model state (KV cache, etc). The filename for the serialized executor state. The filename for the serialized chat history. The filename for the serialized input transform pipeline. The filename for the serialized output transform. The filename for the serialized history transform. The executor for this session. The chat history for this session. The history transform used in this session. The input transform pipeline used in this session. The output transform used in this session. Create a new chat session and preprocess history. The executor for this session History for this session History Transform for this session A new chat session. Create a new chat session. The executor for this session Create a new chat session with a custom history. Use a custom history transform. Add a text transform to the input transform pipeline. Use a custom output transform. Save a session to a directory. Get the session state. SessionState object representing session state in-memory Load a session from a session state. If true loads transforms saved in the session state. Load a session from a directory. If true loads transforms saved in the session state. Add a message to the chat history. Add a system message to the chat history. Add an assistant message to the chat history. Add a user message to the chat history. Remove the last message from the chat history. Compute KV cache for the message and add it to the chat history. Compute KV cache for the system message and add it to the chat history. Compute KV cache for the user message and add it to the chat history. Compute KV cache for the assistant message and add it to the chat history. Replace a user message with a new message and remove all messages after the new message. This is useful when the user wants to edit a message and regenerate the response. Chat with the model. Chat with the model. Chat with the model. Chat with the model. Regenerate the last assistant message. The state of a chat session in-memory. Saved executor state for the session in JSON format. Saved context state (KV cache) for the session. The input transform pipeline used in this session. The output transform used in this session. The history transform used in this session. The chat history messages for this session. Create a new session state. Save the session state to folder. Load the session state from folder. Throws when session state is incorrect Role of the message author, e.g.
user/assistant/system Role is unknown Message comes from a "system" prompt, not written by a user or language model Message comes from the user Message was generated by the language model The chat history class Chat message representation Role of the message author, e.g. user/assistant/system Message content Create a new instance Role of message author Message content List of messages in the chat Create a new instance of the chat content class Create a new instance of the chat history from array of messages Add a message to the chat history Role of the message author Message content Serialize the chat history to JSON Deserialize a chat history from JSON A queue with fixed storage size. Currently it's only a naive implementation and needs to be further optimized in the future. Number of items in this queue Maximum number of items allowed in this queue Create a new queue the maximum number of items to store in this queue Fill the queue with the data. Please ensure that data.Count <= size Enqueue an element. The parameters used for inference. number of tokens to keep from initial prompt when applying context shifting how many new tokens to predict (n_predict), set to -1 to generate responses indefinitely until completion. Sequences where the model will stop generating further tokens. Type of "mirostat" sampling to use. https://github.com/basusourya/mirostat Disable Mirostat sampling Original mirostat algorithm Mirostat 2.0 algorithm The parameters for initializing a LLama model. `Encoding` cannot be directly JSON serialized, instead store the name as a string which can The model path. Base class for LLamaSharp runtime errors (i.e. errors produced by llama.cpp, converted into exceptions) Create a new RuntimeError Loading model weights failed The model path which failed to load `llama_decode` returned a non-zero status code The return status code `llama_decode` returned a non-zero status code `llama_get_logits_ith` returned null, indicating that the index was invalid The incorrect index passed to the `llama_get_logits_ith` call Extension methods to the IContextParams interface Convert the given `IContextParams` into a `LLamaContextParams` Extension methods to the IModelParams interface Convert the given `IModelParams` into a `LLamaModelParams` Find the index of `item` in `list` list to search item to search for Check if the given set of tokens ends with any of the given strings Tokens to check Strings to search for Model to use to convert tokens into bytes Encoding to use to convert bytes into characters Check if the given set of tokens ends with any of the given strings Tokens to check Strings to search for Model to use to convert tokens into bytes Encoding to use to convert bytes into characters Extensions to the KeyValuePair struct Deconstruct a KeyValuePair into its constituent parts. The KeyValuePair to deconstruct First element, the Key Second element, the Value Type of the Key Type of the Value Run a process for a certain amount of time and then terminate it return code, standard output, standard error, flag indicating if process exited or was terminated Extensions to span which apply in-place normalization In-place multiply every element by 32760 and divide every element in the span by the max absolute value in the span The same array In-place multiply every element by 32760 and divide every element in the span by the max absolute value in the span The same span In-place divide every element in the array by the sum of absolute values in the array Also known as "Manhattan normalization".
The same array In-place divide every element in the span by the sum of absolute values in the span Also known as "Manhattan normalization". The same span In-place divide every element by the euclidean length of the vector Also known as "L2 normalization". The same array In-place divide every element by the euclidean length of the vector Also known as "L2 normalization". The same span Creates a new array containing an L2 normalization of the input vector. The same span In-place apply p-normalization. https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm For p = 1, this is taxicab normalization For p = 2, this is euclidean normalization As p => infinity, this approaches infinity norm or maximum norm The same array In-place apply p-normalization. https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm For p = 1, this is taxicab normalization For p = 2, this is euclidean normalization As p => infinity, this approaches infinity norm or maximum norm The same span A llama_context, which holds all the context required to interact with a model Total number of tokens in the context Dimension of embedding vectors The context params set for this context The native handle, which is passed to the native APIs Be careful how you use this! The encoding set for this model to deal with text input. Get or set the number of threads to use for generation Get or set the number of threads to use for batch processing Get the maximum batch size for this context Get the special tokens for the model associated with this context Create a new LLamaContext for the given LLamaWeights Tokenize a string. Whether to add a BOS token to the text. Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Detokenize the tokens to text. Save the state to specified path. Save the state of a particular sequence to specified path. Get the state data as an opaque handle, which can be loaded later using Use if you intend to save this state to disk. Get the state data as an opaque handle, which can be loaded later using Use if you intend to save this state to disk. Load the state from specified path. Load the state from specified path into a particular sequence Load the state from memory. Load the state from memory into a particular sequence A tuple, containing the decode result, the number of tokens that have not been decoded yet and the total number of tokens that have been decoded. The state of this context, which can be reloaded later Get the size in bytes of this state object Write all the bytes of this state to the given stream Write all the bytes of this state to the given stream Load a state from a stream Load a state from a stream The state of a single sequence, which can be reloaded later Get the size in bytes of this state object Copy bytes to a destination pointer. Destination to write to Length of the destination buffer Offset from start of src to start copying from Number of bytes written to destination Generate high dimensional embedding vectors from text Dimension of embedding vectors LLama Context Create a new embedder, using the given LLamaWeights Get high dimensional embedding vectors for the given text. Depending on the pooling type used when constructing this, this may return an embedding vector per token, or one single embedding vector for the entire string. Embedding vectors are not normalized, consider using one of the extensions in . The base class for stateful LLama executors. The logger used by this executor.
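The embedder and the normalization extensions described above are usually used together. The sketch below assumes the LLamaEmbedder constructor takes weights plus parameters with Embeddings = true, that GetEmbeddings returns a list of vectors, and that an in-place EuclideanNormalization() extension exists as listed; treat the exact names and shapes as assumptions.

```csharp
using System;
using LLama;
using LLama.Common;
using LLama.Extensions;

// Sketch: create embeddings and L2-normalize them with the extensions above.
var parameters = new ModelParams("embedding-model.gguf") { Embeddings = true }; // hypothetical path

using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);

// Depending on the pooling type this may be one vector per token,
// or a single vector for the whole string.
var embeddings = await embedder.GetEmbeddings("The quick brown fox");
var vector = embeddings[0];

// Embeddings are not normalized by default; apply euclidean (L2) normalization in place.
vector.EuclideanNormalization();
Console.WriteLine($"Dimensions: {vector.Length}");
```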
The tokens that were already processed by the model. The tokens that were consumed by the model during the current inference. The path of the session file. A container of the tokens to be processed and those already processed. A container for the tokens of input. The last tokens generated by the model. The context used by the executor. This API is currently not verified. This API has not been verified yet. After running out of the context, take some tokens from the original prompt and recompute the logits in batches. Try to reuse the matching prefix from the session file. Decide whether to continue the loop. Preprocess the inputs before the inference. Do some post processing after the inference. The core inference logic. Save the current state to a file. Get the current state data. Load the state from data. Load the state from a file. Execute the inference. The prompt. If null, generation will continue where it left off previously. Asynchronously runs a prompt through the model to compute KV cache without generating any new tokens. It can reduce the latency of the first response if the first input from the user is not immediate. Prompt to process State arguments that are used in single inference Count of tokens remaining to be used. (n_remain) The LLama executor for instruct mode. The descriptor of the state of the instruct executor. Whether the executor is running for the first time (running the prompt). Instruction prefix tokens. Instruction suffix tokens. The LLama executor for interactive mode. Define whether to continue the loop to generate responses. Return whether to break the generation. The descriptor of the state of the interactive executor. Whether the executor is running for the first time (running the prompt). The quantizer to quantize the model. Quantize the model. The model file to be quantized. The path to save the quantized model. The type of quantization. Threads to be used during the quantization. By default it's the number of physical cores. Whether the quantization is successful. Quantize the model. The model file to be quantized. The path to save the quantized model. The type of quantization. Threads to be used during the quantization. By default it's the number of physical cores. Whether the quantization is successful. Parse a string into a LLamaFtype. This is a "relaxed" parsing, which allows any string which is contained within the enum name to be used. For example "Q5_K_M" will convert to "LLAMA_FTYPE_MOSTLY_Q5_K_M" This executor infers the input as a one-time job. Previous inputs won't impact the response to the current input. The context used by the executor when running the inference. If true, applies the default template to the prompt as defined in the rules for llama_chat_apply_template template. The system message to use with the prompt. Only used when is true. Create a new stateless executor which will use the given model Converts a sequence of messages into text according to a model template Custom template. May be null if a model was supplied to the constructor. Keep a cache of roles converted into bytes. Roles are very frequently re-used, so this saves converting them many times. Array of messages.
The property indicates how many messages there are Backing field for Temporary array of messages in the format llama.cpp needs, used when applying the template Indicates how many bytes are in array Result bytes of last call to Indicates if this template has been modified and needs regenerating The encoding algorithm to use Number of messages added to this template Get the message at the given index Thrown if index is less than zero or greater than or equal to Whether to end the prompt with the token(s) that indicate the start of an assistant message. Construct a new template, using the default model template Construct a new template, using the default model template Construct a new template, using a custom template. Only a pre-defined list of templates is supported. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template Add a new message to the end of this template This template, for chaining calls. Add a new message to the end of this template This template, for chaining calls. Remove a message at the given index This template, for chaining calls. Remove all messages from the template and reset internal state to accept/generate new messages Apply the template to the messages and return a span containing the results A span over the buffer that holds the applied template A message that has been added to a template The "role" string for this message The text content of this message Deconstruct this message into role and content A class that contains all the transforms provided internally by LLama. The default history transform. Uses plain text with the following format: [Author]: [Message] Drop the name at the beginning and the end of the text. A text input transform that only trims the text. A no-op text input transform. A text output transform that removes the keywords from the response. Keywords that you want to remove from the response. This property is used for JSON serialization. Maximum length of the keywords. This property is used for JSON serialization. If set to true, when getting a matched keyword, all the related tokens will be removed. Otherwise only the keyword part will be removed. This property is used for JSON serialization. JSON constructor. Keywords that you want to remove from the response. The extra length when searching for the keyword. For example, if your only keyword is "highlight", maybe the token you get is "\r\nhighligt". In this condition, if redundancyLength=0, the token cannot be successfully matched because the length of "\r\nhighligt" (10) has already exceeded the maximum length of the keywords (8). On the contrary, setting redundancyLength >= 2 leads to a successful match. The larger the redundancyLength is, the lower the processing speed. But in practice, it won't introduce much performance impact when redundancyLength <= 5 If set to true, when getting a matched keyword, all the related tokens will be removed. Otherwise only the keyword part will be removed. A set of model weights, loaded into memory. The native handle, which is used in the native APIs Be careful how you use this!
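Since the template type described above accumulates role/content pairs and then renders them through the model's chat template, a short hedged sketch may help. It assumes an LLamaTemplate constructed from previously loaded weights, chainable Add(role, content) calls, an AddAssistant property and an Apply() returning the formatted bytes, as described; confirm the exact shapes in the version you use.

```csharp
using System.Text;
using LLama;

// Sketch: render a prompt with the model's chat template.
// 'weights' is a previously loaded LLamaWeights instance; member names are assumptions.
var template = new LLamaTemplate(weights)
{
    AddAssistant = true, // end with the token(s) that start an assistant message
};

template.Add("system", "You are a helpful assistant.")
        .Add("user", "Write a haiku about spring.");

// Apply() returns the formatted prompt as raw bytes in the template's encoding.
var prompt = Encoding.UTF8.GetString(template.Apply());
```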
Total number of tokens in the context Get the size of this model in bytes Get the number of parameters in this model Dimension of embedding vectors Get the special tokens of this model All metadata keys in this model Load weights into memory Load weights into memory Parameters to use to load the model A cancellation token that can interrupt model loading Receives progress updates as the model loads (0 to 1) Thrown if weights failed to load for any reason. e.g. Invalid file format or loading cancelled. Thrown if the cancellation token is cancelled. Create a llama_context using this model Convert a string of text into tokens Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. A set of llava model weights (mmproj), loaded into memory. The native handle, which is used in the native APIs Be careful how you use this! Load weights into memory path to the "mmproj" model file Load weights into memory path to the "mmproj" model file Create the Image Embeddings from the bytes of an image. Image bytes. Supported formats: JPG PNG BMP TGA Create the Image Embeddings. Image in binary format (it supports jpeg format only) Number of threads to use return the SafeHandle of these embeddings Create the Image Embeddings from the bytes of an image. Path to the image file. Supported formats: JPG PNG BMP TGA Create the Image Embeddings from the bytes of an image. Path to the image file. Supported formats: JPG PNG BMP TGA Eval the image embeddings Return codes from llama_decode An unspecified error Ok. Could not find a KV slot for the batch (try reducing the size of the batch or increase the context) Return codes from llama_encode An unspecified error Ok. Possible GGML quantisation types Full 32 bit float 16 bit float 4 bit float 4 bit float 5 bit float 5 bit float 8 bit float 8 bit float "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weight. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw) "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This end up using 3.4375 bpw. "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type. Integer, 8 bit Integer, 16 bit Integer, 32 bit The value of this entry is the count of the number of possible quant types. 
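The loading and tokenization members above can be combined along the following lines. This is a sketch that assumes LLamaWeights.LoadFromFileAsync accepts a progress reporter, that CreateContext takes the same parameter object, and that Tokenize exposes addBos/special arguments; the model path is hypothetical and all names should be confirmed against the release in use.

```csharp
using System;
using LLama;
using LLama.Common;

// Sketch: load weights with progress reporting, create a context and tokenize.
var parameters = new ModelParams("model.gguf"); // hypothetical path

using var weights = await LLamaWeights.LoadFromFileAsync(
    parameters,
    progressReporter: new Progress<float>(p => Console.WriteLine($"Loading: {p:P0}")));

using var context = weights.CreateContext(parameters);

// 'special' allows tokenizing special/control tokens instead of treating them as plain text.
var tokens = context.Tokenize("<s>Hello world", addBos: false, special: true);
Console.WriteLine($"{tokens.Length} tokens");
```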
llama_split_mode Single GPU Split layers and KV across GPUs Split layers and KV across GPUs, use tensor parallelism if supported Disposes all contained disposables when this class is disposed llama_attention_type A batch allows submitting multiple tokens to multiple sequences simultaneously Keep a list of where logits can be sampled from Get the number of logit positions that will be generated from this batch The number of tokens in this batch Maximum number of tokens that can be added to this batch (automatically grows if exceeded) Maximum number of sequences a token can be assigned to (automatically grows if exceeded) Create a new batch for submitting inputs to llama.cpp Add a single token to the batch at the same position in several sequences https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2 The token to add The position to add it at The set of sequences to add this token to The index that the token was added at. Use this for GetLogitsIth Add a single token to the batch at the same position in several sequences https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2 The token to add The position to add it at The set of sequences to add this token to The index that the token was added at. Use this for GetLogitsIth Add a single token to the batch at a certain position for a single sequence https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2 The token to add The position to add it at The sequence to add this token to The index that the token was added at. Use this for GetLogitsIth Add a range of tokens to a single sequence, starting at the given position. The tokens to add The starting position to add tokens at The sequence to add these tokens to Whether the final token should generate logits The index that the final token was added at. Use this for GetLogitsIth Set TokenCount to zero for this batch Get the positions where logits can be sampled from An embeddings batch allows submitting embeddings to multiple sequences simultaneously Keep a list of where logits can be sampled from Get the number of logit positions that will be generated from this batch Size of an individual embedding The number of items in this batch Maximum number of items that can be added to this batch (automatically grows if exceeded) Maximum number of sequences an item can be assigned to (automatically grows if exceeded) Create a new batch for submitting inputs to llama.cpp Add a single embedding to the batch at the same position in several sequences https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2 The embedding to add The position to add it at The set of sequences to add this embedding to The index that the token was added at. Use this for GetLogitsIth Add a single embedding to the batch for a single sequence The index that the token was added at. Use this for GetLogitsIth Called by embeddings batch to write embeddings into a destination span Type of user data parameter passed in Destination to write data to. Entire destination must be filled!
User data parameter passed in Add a single embedding to the batch at the same position in several sequences https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2 Type of userdata passed to write delegate Userdata passed to write delegate Delegate called once to write data into a span Position to write this embedding to All sequences to assign this embedding to Whether logits should be generated for this embedding The index that the token was added at. Use this for GetLogitsIth Add a single embedding to the batch at a position for one sequence https://github.com/ggerganov/llama.cpp/blob/ad939626577cd25b462e8026cc543efb71528472/common/common.cpp#L829C2-L829C2 Type of userdata passed to write delegate Userdata passed to write delegate Delegate called once to write data into a span Position to write this embedding to Sequence to assign this embedding to Whether logits should be generated for this embedding The index that the token was added at. Use this for GetLogitsIth Set EmbeddingsCount to zero for this batch Get the positions where logits can be sampled from llama_chat_message Pointer to the null terminated bytes that make up the role string Pointer to the null terminated bytes that make up the content string Called by llama.cpp with a progress value between 0 and 1 If the provided progress_callback returns true, model loading continues. If it returns false, model loading is immediately aborted. llama_progress_callback A C# representation of the llama.cpp `llama_context_params` struct changing the default values of parameters marked as [EXPERIMENTAL] may cause crashes or incorrect results in certain configurations https://github.com/ggerganov/llama.cpp/pull/7544 text context, 0 = from model logical maximum batch size that can be submitted to llama_decode physical maximum batch size max number of sequences (i.e. distinct states for recurrent models) number of threads to use for generation number of threads to use for batch processing RoPE scaling type, from `enum llama_rope_scaling_type` whether to pool (sum) embedding results by sequence id Attention type to use for embeddings RoPE base frequency, 0 = from model RoPE frequency scaling factor, 0 = from model YaRN extrapolation mix factor, negative = from model YaRN magnitude scaling factor YaRN low correction dim YaRN high correction dim YaRN original context size defragment the KV cache if holes/size > defrag_threshold, Set to < 0 to disable (default) ggml_backend_sched_eval_callback User data passed into cb_eval data type for K cache. EXPERIMENTAL data type for V cache. EXPERIMENTAL Deprecated! if true, extract embeddings (together with logits) whether to offload the KQV ops (including the KV cache) to GPU whether to use flash attention. 
EXPERIMENTAL whether to measure performance timings ggml_abort_callback User data passed into abort_callback Get the default LLamaContextParams Supported model file types C# representation of llama_ftype All f32 Benchmark@7B: 26GB Mostly f16 Benchmark@7B: 13GB Mostly 8 bit Benchmark@7B: 6.7GB, +0.0004ppl Mostly 4 bit Benchmark@7B: 3.50GB, +0.2499 ppl Mostly 4 bit Benchmark@7B: 3.90GB, +0.1846 ppl Mostly 5 bit Benchmark@7B: 4.30GB @ 7B tokens, +0.0796 ppl Mostly 5 bit Benchmark@7B: 4.70GB, +0.0415 ppl K-Quant 2 bit Benchmark@7B: 2.67GB @ 7N parameters, +0.8698 ppl K-Quant 3 bit (Small) Benchmark@7B: 2.75GB, +0.5505 ppl K-Quant 3 bit (Medium) Benchmark@7B: 3.06GB, +0.2437 ppl K-Quant 3 bit (Large) Benchmark@7B: 3.35GB, +0.1803 ppl K-Quant 4 bit (Small) Benchmark@7B: 3.56GB, +0.1149 ppl K-Quant 4 bit (Medium) Benchmark@7B: 3.80GB, +0.0535 ppl K-Quant 5 bit (Small) Benchmark@7B: 4.33GB, +0.0353 ppl K-Quant 5 bit (Medium) Benchmark@7B: 4.45GB, +0.0142 ppl K-Quant 6 bit Benchmark@7B: 5.15GB, +0.0044 ppl except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors except 1d tensors File type was not specified A safe handle for a LLamaKvCacheView Number of KV cache cells. This will be the same as the context size. Get the total number of tokens in the KV cache. For example, if there are two populated cells, the first with 1 sequence id in it and the second with 2 sequence ids then you'll have 3 tokens. Maximum number of sequences visible for a cell. There may be more sequences than this in reality, this is simply the maximum number this view can see. Number of populated cache cells Maximum contiguous empty slots in the cache. Index to the start of the MaxContiguous slot range. Can be negative when cache is full. Initialize a LLamaKvCacheViewSafeHandle which will call `llama_kv_cache_view_free` when disposed Allocate a new KV cache view which can be used to inspect the KV cache The maximum number of sequences visible in this view per cell Read the current KV cache state into this view. Get the raw KV cache view Get the cell at the given index The index of the cell [0, CellCount) Data about the cell at the given index Thrown if index is out of range (0 <= index < CellCount) Get all of the sequences assigned to the cell at the given index. This will contain entries sequences even if the cell actually has more than that many sequences, allocate a new view with a larger maxSequences parameter if necessary. Invalid sequences will be negative values. The index of the cell [0, CellCount) A span containing the sequences assigned to this cell Thrown if index is out of range (0 <= index < CellCount) Create an empty KV cache view. (use only for debugging purposes) Free a KV cache view. (use only for debugging purposes) Update the KV cache view structure with the current state of the KV cache. (use only for debugging purposes) Information associated with an individual cell in the KV cache view (llama_kv_cache_view_cell) The position for this cell. Takes KV cache shifts into account. May be negative if the cell is not populated. An updateable view of the KV cache (llama_kv_cache_view) Number of KV cache cells. This will be the same as the context size. Maximum number of sequences that can exist in a cell. 
It's not an error if there are more sequences in a cell than this value, however they will not be visible in the view cells_sequences. Number of tokens in the cache. For example, if there are two populated cells, the first with 1 sequence id in it and the second with 2 sequence ids then you'll have 3 tokens. Number of populated cache cells. Maximum contiguous empty slots in the cache. Index to the start of the max_contiguous slot range. Can be negative when cache is full. Information for an individual cell. The sequences for each cell. There will be n_seq_max items per cell. Severity level of a log message. This enum should always be aligned with the one defined on llama.cpp side at https://github.com/ggerganov/llama.cpp/blob/0eb4e12beebabae46d37b78742f4c5d4dbe52dc1/ggml/include/ggml.h#L559 Logs are never written. Logs that are used for interactive investigation during development. Logs that track the general flow of the application. Logs that highlight an abnormal or unexpected event in the application flow, but do not otherwise cause the application execution to stop. Logs that highlight when the current flow of execution is stopped due to a failure. Continue log level is equivalent to None in the way it is used in llama.cpp. Keeps track of the previous log level to be able to handle the log level . Override a key/value pair in the llama model metadata (llama_model_kv_override) Key to override Type of value Add 4 bytes of padding, to align the next fields to 8 bytes Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_INT Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_FLOAT Value, **must** only be used if Tag == LLAMA_KV_OVERRIDE_BOOL Value, **must** only be used if Tag == String Specifies what type of value is being overridden by LLamaModelKvOverride llama_model_kv_override_type Overriding an int value Overriding a float value Overriding a bool value Overriding a string value A C# representation of the llama.cpp `llama_model_params` struct NULL-terminated list of devices to use for offloading (if NULL, all available devices are used) todo: add support for llama_model_params.devices // number of layers to store in VRAM how to split the model across multiple GPUs the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE how to split layers across multiple GPUs (size: ) called with a progress value between 0 and 1, pass NULL to disable. If the provided progress_callback returns true, model loading continues. If it returns false, model loading is immediately aborted. context pointer passed to the progress callback override key-value pairs of the model meta data only load the vocabulary, no weights use mmap if possible force system to keep model in RAM validate model tensor data Create a LLamaModelParams with default values Quantizer parameters used in the native API llama_model_quantize_params number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency() quantize to this llama_ftype output tensor type token embeddings tensor type allow quantizing non-f32/f16 tensors quantize output.weight only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored quantize all tensors to the default type quantize to the same number of shards pointer to importance matrix data pointer to vector containing overrides Create a LLamaModelQuantizeParams with default values Input data for llama_decode A llama_batch object can contain input about one or many sequences The provided arrays (i.e. token, embd, pos, etc.) 
must have size of n_tokens The number of items pointed at by pos, seq_id and logits. Either `n_tokens` of `llama_token`, or `NULL`, depending on how this batch was created Either `n_tokens * embd * sizeof(float)` or `NULL`, depending on how this batch was created the positions of the respective token in the sequence (if set to NULL, the token position will be tracked automatically by llama_decode) https://github.com/ggerganov/llama.cpp/blob/master/llama.h#L139 ??? the sequence to which the respective token belongs (if set to NULL, the sequence ID will be assumed to be 0) if zero, the logits for the respective token will not be output (if set to NULL, only the logits for last token will be returned) llama_pooling_type No specific pooling type. Use the model default if this is specific in Do not pool embeddings (per-token embeddings) Take the mean of every token embedding Return the embedding for the special "CLS" token Used by reranking models to attach the classification head to the graph Indicates position in a sequence The raw value Create a new LLamaPos Convert a LLamaPos into an integer (extract the raw value) Convert an integer into a LLamaPos Increment this position Increment this position llama_rope_type ID for a sequence in a batch LLamaSeqId with value 0 The raw value Create a new LLamaSeqId Convert a LLamaSeqId into an integer (extract the raw value) Convert an integer into a LLamaSeqId LLama performance information llama_perf_context_data Timestamp when reset was last called Loading milliseconds total milliseconds spent prompt processing Total milliseconds in eval/decode calls number of tokens in eval calls for the prompt (with batch size > 1) number of eval calls Timestamp when reset was last called Time spent loading total milliseconds spent prompt processing Total milliseconds in eval/decode calls number of tokens in eval calls for the prompt (with batch size > 1) number of eval calls LLama performance information llama_perf_sampler_data A single token Token Value used when token is inherently null The raw value Create a new LLamaToken Convert a LLamaToken into an integer (extract the raw value) Convert an integer into a LLamaToken Get attributes for this token Get attributes for this token Get score for this token Check if this is a control token Check if this is a control token Check if this token should end generation Check if this token should end generation Token attributes C# equivalent of llama_token_attr A single token along with probability of this token being selected token id log-odds of the token probability of the token Create a new LLamaTokenData Contains an array of LLamaTokenData, potentially sorted. The LLamaTokenData Indicates if `data` is sorted by logits in descending order. If this is false the token data is in _no particular order_. Create a new LLamaTokenDataArray Create a new LLamaTokenDataArray, copying the data from the given logits Create a new LLamaTokenDataArray, copying the data from the given logits into temporary memory. The memory must not be modified while this is in use. Temporary memory which will be used to work on these logits. Must be at least as large as logits array Overwrite the logit values for all given tokens tuples of token and logit value to overwrite Sorts candidate tokens by their logits in descending order and calculate probabilities based on logits. Contains a pointer to an array of LLamaTokenData which is pinned in memory. 
C# equivalent of llama_token_data_array A pointer to an array of LlamaTokenData Memory must be pinned in place for all the time this LLamaTokenDataArrayNative is in use (i.e. `fixed` or `.Pin()`) Number of LLamaTokenData in the array The index in the array (i.e. not the token id) A pointer to an array of LlamaTokenData Indicates if the items in the array are sorted, so the most likely token is first The index of the selected token (i.e. not the token id) Number of LLamaTokenData in the array. Set this to shrink the array Create a new LLamaTokenDataArrayNative around the data in the LLamaTokenDataArray Data source Created native array A memory handle, pinning the data in place until disposed C# equivalent of llama_vocab struct. This struct is an opaque type, with no fields in the API and is only used for typed pointers. Get attributes for a specific token Check if the token is supposed to end generation (end-of-generation, e.g. EOS, EOT, etc.) Identify if Token Id is a control token or a render-able token beginning-of-sentence end-of-sentence end-of-turn sentence separator next-line padding llama_vocab_pre_type llama_vocab_type For models without vocab LLaMA tokenizer based on byte-level BPE with byte fallback GPT-2 tokenizer based on byte-level BPE BERT tokenizer based on WordPiece T5 tokenizer based on Unigram RWKV tokenizer based on greedy tokenization LLaVa Image embeddings llava_image_embed Set configurations for all the native libraries, including LLama and LLava Set configurations for all the native libraries, including LLama and LLava Configuration for LLama native library Configuration for LLava native library Check if the native library has already been loaded. Configuration cannot be modified if this is true. Set the log callback that will be used for all llama.cpp log messages Set the log callback that will be used for all llama.cpp log messages Try to load the native library with the current configurations, but do not actually set it to . You can still modify the configuration after calling this, but only before any call from . The loaded library. If loading failed, this will be null. However, if you are using .NET Standard 2.0, this will never return null. Whether loading was successful. A class to apply the same configuration to multiple libraries at the same time. Do an action for all the configs in this container. Set the log callback that will be used for all llama.cpp log messages Set the log callback that will be used for all llama.cpp log messages Try to load the native library with the current configurations, but do not actually set it to . You can still modify the configuration after calling this, but only before any call from . Whether loading was successful. The name of the native library The native library compiled from llama.cpp. The native library compiled from the LLaVA example of llama.cpp. A native library specified with a local file path. Information of a native library file. Which kind of library it is. Whether it's compiled with cublas. Whether it's compiled with vulkan. Which AvxLevel it's compiled with. Information of a native library file. Which kind of library it is. Whether it's compiled with cublas. Whether it's compiled with vulkan. Which AvxLevel it's compiled with. Which kind of library it is. Whether it's compiled with cublas. Whether it's compiled with vulkan. Which AvxLevel it's compiled with.
Avx support configuration No AVX Advanced Vector Extensions (supported by most processors after 2011) AVX2 (supported by most processors after 2013) AVX512 (supported by some processors after 2016, not widely supported) Try to load libllama/llava_shared, using CPU feature detection to try and load a more specialised DLL if possible The library handle to unload later, or IntPtr.Zero if no library was loaded Operating system information. Operating system information. Get the system information of the current machine. When you are using .NET standard2.0, dynamic native library loading is not supported. This class will be returned in . A LoRA adapter which can be applied to a context for a specific model The model which this LoRA adapter was loaded with. The full path of the file this adapter was loaded from Native pointer of the loaded adapter, will be automatically freed when the model is unloaded Indicates if this adapter has been unloaded Unload this adapter Direct translation of the llama.cpp API A method that does nothing. This is a native method, calling it will force the llama native dependencies to be loaded. Call once at the end of the program - currently only used for MPI Get the maximum number of devices supported by llama.cpp Check if memory mapping is supported Check if memory locking is supported Check if GPU offload is supported Check if RPC offload is supported Initialize the llama + ggml backend. Call once at the start of the program. This is private because LLamaSharp automatically calls it, and it's only valid to call it once! Load session file Save session file Set whether to use causal attention or not. If set to true, the model will only attend to the past tokens Set whether the model is in embeddings mode or not. If true, embeddings will be returned but logits will not Set abort callback Get the n_seq_max for this context Get all output token embeddings. When pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model, the embeddings for which llama_batch.logits[i] != 0 are stored contiguously in the order they have appeared in the batch. shape: [n_outputs*n_embd] Otherwise, returns an empty span. Apply chat template. Inspired by hf apply_chat_template() on python. A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead. Pointer to a list of multiple llama_chat_message Number of llama_chat_message in this chat Whether to end the prompt with the token(s) that indicate the start of an assistant message. A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages) The size of the allocated buffer The total number of bytes of the formatted prompt. If is it larger than the size of buffer, you may need to re-alloc it and then re-apply the template. Get list of built-in chat templates Print out timing information for this context Print system information Convert a single token into text buffer to write string into User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix') If true, special tokens are rendered in the output The length written, or if the buffer is too small a negative that indicates the length required Convert text into tokens The tokens pointer must be large enough to hold the resulting tokens. add_special Allow to add BOS and EOS tokens if model is configured to do so. 
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space. Returns the number of tokens on success, no more than n_max_tokens. Returns a negative number on failure - the number of tokens that would have been returned Convert the provided tokens into text (inverse of llama_tokenize()). The char pointer must be large enough to hold the resulting text. remove_special Allow to remove BOS and EOS tokens if model is configured to do so. unparse_special If true, special tokens are rendered in the output. Returns the number of chars/bytes on success, no more than textLengthMax. Returns a negative number on failure - the number of chars/bytes that would have been returned. Register a callback to receive llama log messages Returns the number of tokens in the KV cache (slow, use only for debug) If a KV cell has multiple sequences assigned to it, it will be counted multiple times Returns the number of used KV cells (i.e. have at least one sequence assigned to them) Clear the KV cache. Both cell info is erased and KV data is zeroed Removes all tokens that belong to the specified sequence and have positions in [p0, p1) Returns false if a partial sequence cannot be removed. Removing a whole sequence never fails Copy all tokens that belong to the specified sequence to another sequence Note that this does not allocate extra KV cache memory - it simply assigns the tokens to the new sequence Removes all tokens that do not belong to the specified sequence Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1) If the KV cache is RoPEd, the KV data is updated accordingly: - lazily on next llama_decode() - explicitly with llama_kv_cache_update() Integer division of the positions by factor of `d > 1` If the KV cache is RoPEd, the KV data is updated accordingly: - lazily on next llama_decode() - explicitly with llama_kv_cache_update()
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
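Because several of the KV cache operations above share this (p0, p1) convention, here is a minimal sketch of how the negative-value shorthand expands. The helper below is purely illustrative and is not part of the LLamaSharp API:

public static class KvRangeExample
{
    // Expands the documented shorthand: p0 < 0 means "start from position 0",
    // p1 < 0 means "up to the end of the sequence".
    public static (int Start, int End) Normalize(int p0, int p1, int endOfSequence = int.MaxValue)
    {
        var start = p0 < 0 ? 0 : p0;             // p0 < 0 : [0, p1]
        var end   = p1 < 0 ? endOfSequence : p1; // p1 < 0 : [p0, inf)
        return (start, end);
    }
}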
Returns the largest position present in the KV cache for the specified sequence Allocates a batch of tokens on the heap Each token can be assigned up to n_seq_max sequence ids The batch has to be freed with llama_batch_free() If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float) Otherwise, llama_batch.token will be allocated to store n_tokens llama_token The rest of the llama_batch members are allocated with size n_tokens All members are left uninitialized Each token can be assigned up to n_seq_max sequence ids Frees a batch of tokens allocated with llama_batch_init() Apply a loaded control vector to a llama_context, or if data is NULL, clear the currently loaded vector. n_embd should be the size of a single layer's control, and data should point to an n_embd x n_layers buffer starting from layer 1. il_start and il_end are the layer range the vector should apply to (both inclusive) See llama_control_vector_load in common to load a control vector. Build a split GGUF final path for this chunk. llama_split_path(split_path, sizeof(split_path), "/models/ggml-model-q4_0", 2, 4) => split_path = "/models/ggml-model-q4_0-00002-of-00004.gguf" Returns the split_path length. Extract the path prefix from the split_path if and only if the split_no and split_count match. llama_split_prefix(split_prefix, 64, "/models/ggml-model-q4_0-00002-of-00004.gguf", 2, 4) => split_prefix = "/models/ggml-model-q4_0" Returns the split_prefix length. Sanity check for clip <-> llava embed size match LLama Context Llava Model True if validated successfully Build an image embed from image file bytes SafeHandle to the Clip Model Number of threads Binary image in jpeg format Bytes length of the image SafeHandle to the Embeddings Build an image embed from a path to an image filename SafeHandle to the Clip Model Number of threads Image filename (jpeg) to generate embeddings SafeHandle to the embeddings Free an embedding made with llava_image_embed_make_* Embeddings to release Write the image represented by embed into the llama context with batch size n_batch, starting at context pos n_past. On completion, n_past points to the next position in the context after the image embed. Llama Context Embedding handle True on success Get the loaded native library. If you are using netstandard2.0, it will always return null. Returns 0 on success Returns 0 on success Configure llama.cpp logging Callback from llama.cpp with log messages Register a callback to receive llama log messages A GC handle for the current log callback to ensure the callback is not collected Register a callback to receive llama log messages Register a callback to receive llama log messages RoPE scaling type. C# equivalent of llama_rope_scaling_type No particular scaling type has been specified Do not apply any RoPE scaling Positional linear interpolation, as described by kaiokendev: https://kaiokendev.github.io/til#extending-context-to-8k YaRN scaling: https://arxiv.org/pdf/2309.00071.pdf LongRope scaling A safe wrapper around a llama_context Total number of tokens in the context Dimension of embedding vectors Get the maximum batch size for this context Get the physical maximum batch size for this context Get or set the number of threads used for generation of a single token. Get or set the number of threads used for prompt and batch processing (multiple tokens). 
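The split-file naming convention used by llama_split_path above can be mirrored in managed code. This is a plain C# sketch of the {name}-%05d-of-%05d.gguf pattern, not a call into the native helpers:

public static class SplitPathExample
{
    // BuildSplitPath("/models/ggml-model-q4_0", 2, 4)
    //   => "/models/ggml-model-q4_0-00002-of-00004.gguf"
    public static string BuildSplitPath(string prefix, int splitNo, int splitCount)
        => $"{prefix}-{splitNo:D5}-of-{splitCount:D5}.gguf";
}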
Get the pooling type for this context Get the model which this context is using Get the vocabulary for the model this context is using Create a new llama_state for the given model Create a new llama_context with the given model. **This should never be called directly! Always use SafeLLamaContextHandle.Create**! Frees all allocated memory in the given llama_context Set a callback which can abort computation If this returns true computation is cancelled Positive return values do not mean a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error
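The return-code convention above (0 = success, 1 = no KV slot, negative = error) can be handled with a simple switch. The enum and method names below are illustrative stand-ins, not LLamaSharp types:

public enum DecodeStatus { Ok, NoKvSlot, Error }

public static class DecodeResultExample
{
    public static DecodeStatus Interpret(int returnCode) => returnCode switch
    {
        0   => DecodeStatus.Ok,       // success
        1   => DecodeStatus.NoKvSlot, // try a smaller batch or a larger context, then retry
        < 0 => DecodeStatus.Error,    // fatal error
        _   => DecodeStatus.Error     // other positive values are warnings; treated conservatively here
    };
}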
Processes a batch of tokens with the encoder part of the encoder-decoder model. Stores the encoder output internally for later use by the decoder cross-attention layers. 0 = success
< 0 = error
Set the number of threads used for decoding n_threads is the number of threads used for generation (single token) n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens) Get the number of threads used for generation of a single token. Get the number of threads used for prompt and batch processing (multiple tokens). Token logits obtained from the last call to llama_decode The logits for the last token are stored in the last row Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab Get the size of the context window for the model for this context Get the batch size for this context Get the ubatch size for this context Returns the **actual** size in bytes of the state (logits, embedding and kv_cache). Only use when saving the state, not when restoring it, otherwise the size may be too small. Copies the state to the specified destination address. Destination needs to have allocated enough memory. the number of bytes copied Set the state reading from the specified address the number of bytes read Get the exact size needed to copy the KV cache of a single sequence Copy the KV cache of a single sequence into the specified buffer Copy the sequence data (originally copied with `llama_state_seq_get_data`) into the specified sequence - Positive: Ok - Zero: Failed to load Defragment the KV cache. This will be applied: - lazily on next llama_decode() - explicitly with llama_kv_cache_update() Apply the KV cache updates (such as K-shifts, defragmentation, etc.) Check if the context supports KV cache shifting Wait until all computations are finished. This is automatically done when using any of the functions to obtain computation results and is not necessary to call it explicitly in most cases. Get the pooling type for this context Get the embeddings for a sequence id. Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[1] with the rank of the sequence otherwise: float[n_embd] (1-dimensional) A pointer to the first float in an embedding, length = ctx.EmbeddingSize Get the embeddings for the ith sequence. Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd A pointer to the first float in an embedding, length = ctx.EmbeddingSize Add a LoRA adapter to this context Remove a LoRA adapter from this context Indicates if the LoRA was in this context and was removed Remove all LoRA adapters from this context Token logits obtained from the last call to llama_decode. The logits for the last token are stored in the last row. Only tokens with `logits = true` requested are present.
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
The number of tokens whose logits should be retrieved, in [numTokens X n_vocab] format.
Tokens are ordered as they appear in the LLamaBatch (first tokens first, and so on).
This is helpful when requesting logits for many tokens in a sequence, or when decoding multiple sequences in one go.
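Since the logits buffer described above is laid out row-major as [n_tokens x n_vocab], the ith token's row can be sliced out directly (equivalent to llama_get_logits(ctx) + i*n_vocab). A small illustrative helper:

using System;

public static class LogitsLayoutExample
{
    // Returns the slice of `logits` holding the ith token's n_vocab values.
    public static ReadOnlySpan<float> GetLogitsRow(ReadOnlySpan<float> logits, int i, int nVocab)
        => logits.Slice(i * nVocab, nVocab);
}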
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab Get the embeddings for the ith sequence. Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd A pointer to the first float in an embedding, length = ctx.EmbeddingSize Get the embeddings for a specific sequence. Equivalent to: llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd A pointer to the first float in an embedding, length = ctx.EmbeddingSize Convert the given text into tokens The text to tokenize Whether the "BOS" token should be added Encoding to use for the text Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Convert a single llama token into bytes Token to decode A span to attempt to write into. If this is too small nothing will be written The size of this token. **nothing will be written** if this is larger than `dest` This object exists to ensure there is only ever 1 inference running at a time. This is a workaround for thread safety issues in llama.cpp itself. Most notably CUDA, which seems to use some global singleton resources and will crash if multiple inferences are run (even against different models). For more information see these issues: - https://github.com/SciSharp/LLamaSharp/issues/596 - https://github.com/ggerganov/llama.cpp/issues/3960 If these are ever resolved this lock can probably be removed. Wait until all computations are finished. This is automatically done when using any of the functions to obtain computation results and is not necessary to call it explicitly in most cases. Processes a batch of tokens with the encoder part of the encoder-decoder model. Stores the encoder output internally for later use by the decoder cross-attention layers. 0 = success
< 0 = error (the KV cache state is restored to the state before this call)
Positive return values do not mean a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error (the KV cache state is restored to the state before this call)
Decode a set of tokens in batch-size chunks. A tuple containing the decode result and the number of tokens that have not been decoded yet. Positive return values do not mean a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error
Get the size of the state, when saved as bytes Get the size of the KV cache for a single sequence ID, when saved as bytes Get the raw state of this context, encoded as bytes. Data is written into the `dest` pointer. Destination to write to Number of bytes available to write to in dest (check required size with `GetStateSize()`) The number of bytes written to dest Thrown if dest is too small Get the raw state of a single sequence from this context, encoded as bytes. Data is written into the `dest` pointer. Destination to write to Number of bytes available to write to in dest (check required size with `GetStateSize()`) The sequence to get state data for The number of bytes written to dest Set the raw state of this context The pointer to read the state from Number of bytes that can be safely read from the pointer Number of bytes read from the src pointer Set the raw state of a single sequence The pointer to read the state from Sequence ID to set Number of bytes that can be safely read from the pointer Number of bytes read from the src pointer Get performance information Reset all performance information for this context Check if the context supports KV cache shifting Apply KV cache updates (such as K-shifts, defragmentation, etc.) Defragment the KV cache. This will be applied: - lazily on next llama_decode() - explicitly with llama_kv_cache_update() Get a new KV cache view that can be used to debug the KV cache Count the number of used cells in the KV cache (i.e. have at least one sequence assigned to them) Returns the number of tokens in the KV cache (slow, use only for debug) If a KV cell has multiple sequences assigned to it, it will be counted multiple times Clear the KV cache - both cell info is erased and KV data is zeroed Removes all tokens that belong to the specified sequence and have positions in [p0, p1) Copy all tokens that belong to the specified sequence to another sequence. Note that this does not allocate extra KV cache memory - it simply assigns the tokens to the new sequence Removes all tokens that do not belong to the specified sequence Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1). If the KV cache is RoPEd, the KV data is updated accordingly Integer division of the positions by factor of `d > 1`. If the KV cache is RoPEd, the KV data is updated accordingly.
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
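A sketch of the save/restore flow implied by the state functions above: query the state size, copy the raw bytes into a buffer you own, and later feed the same bytes back. IContextState is a hypothetical stand-in for the real context handle, used only to keep the sketch self-contained; consult the actual LLamaSharp signatures before relying on it:

using System;

public unsafe interface IContextState
{
    nuint GetStateSize();                   // "Get the size of the state, when saved as bytes"
    nuint GetState(byte* dest, nuint size); // writes the raw state into dest, returns bytes written
    nuint SetState(byte* src, nuint size);  // reads the raw state from src, returns bytes read
}

public static class StateRoundTripExample
{
    public static unsafe byte[] Save(IContextState ctx)
    {
        var buffer = new byte[checked((int)ctx.GetStateSize())];
        nuint written;
        fixed (byte* dest = buffer)
            written = ctx.GetState(dest, (nuint)buffer.Length);
        Array.Resize(ref buffer, checked((int)written));
        return buffer;
    }

    public static unsafe void Restore(IContextState ctx, byte[] state)
    {
        fixed (byte* src = state)
            ctx.SetState(src, (nuint)state.Length);
    }
}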
Returns the largest position present in the KV cache for the specified sequence Base class for all llama handles to native resources A reference to a set of llama model weights Get the rope (positional embedding) type for this model The number of tokens in the context that this model was trained for Get the rope frequency this model was trained with Dimension of embedding vectors Get the size of this model in bytes Get the number of parameters in this model Get the number of layers in this model Get the number of heads in this model Returns true if the model contains an encoder that requires llama_encode() call Returns true if the model contains a decoder that requires llama_decode() call Returns true if the model is recurrent (like Mamba, RWKV, etc.) Get a description of this model Get the number of metadata key/value pairs Get the vocabulary of this model Load a model from the given file path into memory Load the model from a file If the file is split into multiple parts, the file name must follow this pattern: {name}-%05d-of-%05d.gguf If the split file name does not follow this pattern, use llama_model_load_from_splits The loaded model, or null on failure. Load the model from multiple splits (support custom naming scheme) The paths must be in the correct order Apply a LoRA adapter to a loaded model path_base_model is the path to a higher quality model to use as a base for the layers modified by the adapter. Can be NULL to use the current loaded model. The model needs to be reloaded before applying a new adapter, otherwise the adapter will be applied on top of the previous one Returns 0 on success Frees all allocated memory associated with a model Get the number of metadata key/value pairs Get metadata key name by index Model to fetch from Index of key to fetch buffer to write result into The length of the string on success (even if the buffer is too small). -1 if the key does not exist. Get metadata value as a string by index Model to fetch from Index of val to fetch Buffer to write result into The length of the string on success (even if the buffer is too small). -1 if the key does not exist. Get metadata value as a string by key name The length of the string on success, or -1 on failure Get the number of tokens in the model vocabulary Get the size of the context window for the model Get the dimension of embedding vectors from this model Get the number of layers in this model Get the number of heads in this model Get a string describing the model type The length of the string on success (even if the buffer is too small), or -1 on failure Get the size of the model in bytes The size of the model Get the number of parameters in this model The functions return the length of the string on success, or -1 on failure Get the model's RoPE frequency scaling factor For encoder-decoder models, this function returns the id of the token that must be provided to the decoder to start generating the output sequence. For other models, it returns -1. Returns true if the model contains an encoder that requires llama_encode() call Returns true if the model contains a decoder that requires llama_decode() call Returns true if the model is recurrent (like Mamba, RWKV, etc.) Load a LoRA adapter from file. The adapter will be associated with this model but will not be applied Convert a single llama token into bytes Token to decode A span to attempt to write into. 
If this is too small nothing will be written User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix') If true, special characters will be converted to text. If false they will be invisible. The size of this token. **nothing will be written** if this is larger than `dest` Convert a sequence of tokens into characters. The section of the span which has valid data in it. If there was insufficient space in the output span this will be filled with as many characters as possible, starting from the _last_ token. Convert a string of text into tokens Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Create a new context for this model Get the metadata value for the given key The key to fetch The value, null if there is no such key Get the metadata key for the given index The index to get The key, null if there is no such key or if the buffer was too small Get the metadata value for the given index The index to get The value, null if there is no such value or if the buffer was too small Get the default chat template. Returns nullptr if not available If name is NULL, returns the default chat template Get tokens for a model Total number of tokens in this vocabulary Get the type of this vocabulary Get the Beginning of Sentence token for this model Get the End of Sentence token for this model Get the newline token for this model Get the padding token for this model Get the sentence separator token for this model Codellama beginning of infill prefix Codellama beginning of infill middle Codellama beginning of infill suffix Codellama pad Codellama rep Codellama rep end-of-turn token For encoder-decoder models, this function returns the id of the token that must be provided to the decoder to start generating the output sequence. Check if the current model requires a BOS token added Check if the current model requires an EOS token added A chain of sampler stages that can be used to select tokens from logits. Wraps a handle returned from `llama_sampler_chain_init`. Other samplers are owned by this chain and are never directly exposed. Get the number of samplers in this chain Apply this sampler to a set of candidates Sample and accept a token from the idx-th output of the last evaluation. Shorthand for: var logits = ctx.GetLogitsIth(idx); var token_data_array = LLamaTokenDataArray.Create(logits); using LLamaTokenDataArrayNative.Create(token_data_array, out var native_token_data); sampler_chain.Apply(native_token_data); var token = native_token_data.Data.Span[native_token_data.Selected]; sampler_chain.Accept(token); return token; Reset the state of this sampler Accept a token and update the internal state of this sampler Get the name of the sampler at the given index Get the seed of the sampler at the given index if applicable. Returns LLAMA_DEFAULT_SEED otherwise Create a new sampler chain Clone a sampler stage from another chain and add it to this chain The chain to clone a stage from The index of the stage to clone Remove a sampler stage from this chain Add a custom sampler stage Add a sampler which picks the most likely token. Add a sampler which picks from the probability distribution of all tokens Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words. The target cross-entropy (or surprise) value you want to achieve for the generated text. 
A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text. The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates. The number of tokens considered in the estimation of `s_hat`. This is an arbitrary value that is used to calculate `s_hat`, which in turn helps to calculate the value of `k`. In the paper, they use `m = 100`, but you can experiment with different values to see how it affects the performance of the algorithm. Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words. The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text. The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates. Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751 Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751 Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841 Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666. Apply temperature to the logits. If temperature is less than zero the maximum logit is left unchanged and the rest are set to -infinity Dynamic temperature implementation (a.k.a. entropy) described in the paper https://arxiv.org/abs/2309.02772. XTC sampler as described in https://github.com/oobabooga/text-generation-webui/pull/6335 This sampler is meant to be used for fill-in-the-middle infilling, after top_k + top_p sampling
1. if the sum of the EOG probs times the number of candidates is higher than the sum of the other probs -> pick EOG
2. combine probs of tokens that have the same prefix

example:

- before:
"abc": 0.5
"abcd": 0.2
"abcde": 0.1
"dummy": 0.1

- after:
"abc": 0.8
"dummy": 0.1

3. discard non-EOG tokens with low prob
4. if no tokens are left -> pick EOT
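The "sample and accept" shorthand spelled out earlier for the sampler chain can be written as a small method. The member names below follow the documentation (GetLogitsIth, LLamaTokenDataArray, LLamaTokenDataArrayNative, Apply, Accept), while the chain handle type and namespace are assumptions, so check them against your LLamaSharp version:

using LLama.Native;

public static class SamplerChainExample
{
    // Sample a token from the idx-th output of the last evaluation, then tell the
    // chain the token was accepted so stateful samplers can update themselves.
    public static LLamaToken SampleAndAccept(SafeLLamaSamplerChainHandle chain, SafeLLamaContextHandle ctx, int idx)
    {
        var logits = ctx.GetLogitsIth(idx);
        var candidates = LLamaTokenDataArray.Create(logits);
        using var _ = LLamaTokenDataArrayNative.Create(candidates, out var native);
        chain.Apply(native);
        var token = native.Data.Span[native.Selected];
        chain.Accept(token);
        return token;
    }
}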
Create a sampler which makes tokens impossible unless they match the grammar Root rule of the grammar Create a sampler using lazy grammar sampling: https://github.com/ggerganov/llama.cpp/pull/9639 Grammar in GBNF form Root rule of the grammar A list of tokens that will trigger the grammar sampler. A list of words that will trigger the grammar sampler. Create a sampler that applies various repetition penalties. Avoid using on the full vocabulary as searching for repeated tokens can become slow. For example, apply top-k or top-p sampling first. How many tokens of history to consider when calculating penalties Repetition penalty Frequency penalty Presence penalty DRY sampler, designed by p-e-w, as described in: https://github.com/oobabooga/text-generation-webui/pull/5677. Porting Koboldcpp implementation authored by pi6am: https://github.com/LostRuins/koboldcpp/pull/982 The model this sampler will be used with penalty multiplier, 0.0 = disabled exponential base repeated sequences longer than this are penalized how many tokens to scan for repetitions (0 = entire context) Create a sampler that applies a bias directly to the logits llama_sampler_chain_params whether to measure performance timings Get the default LLamaSamplerChainParams A bias to apply directly to a logit The token to apply the bias to The bias to add llama_sampler_i Get the name of this sampler Update internal sampler state after a token has been chosen Apply this sampler to a set of logits Reset the internal state of this sampler Create a clone of this sampler Free all resources held by this sampler llama_sampler Holds the function pointers which make up the actual sampler Any additional context this sampler needs, may be anything. We will use it to hold a GCHandle. This GCHandle roots this object, preventing it from being freed. A reference to the user code which implements the custom sampler Get a pointer to a `llama_sampler` (LLamaSamplerNative) struct, suitable for passing to `llama_sampler_chain_add` A custom sampler stage for modifying logits or selecting a token The human readable name of this stage Apply this stage to a set of logits. This can modify logits or select a token (or both). If logits are modified the Sorted flag must be set to false. If the logits are no longer sorted after the custom sampler has run it is critically important to set Sorted=false. If unsure, always set it to false, this is a safe default. Update the internal state of the sampler when a token is chosen Reset the internal state of this sampler Create a clone of this sampler A Reference to a llava Image Embed handle Get the model used to create this image embedding Get the number of dimensions in an embedding Get the number of "patches" in an image embedding Create an image embed from an image file Path to the image file. Supported formats: JPG PNG BMP TGA Create an image embed from an image file Path to the image file. Supported formats: JPG PNG BMP TGA Create an image embed from the bytes of an image. Image bytes. Supported formats: JPG PNG BMP TGA Create an image embed from the bytes of an image. Image bytes. Supported formats: JPG PNG BMP TGA Copy the embeddings data to the destination span A reference to a set of llava model weights. Get the number of dimensions in an embedding Get the number of "patches" in an image embedding Load a model from the given file path into memory MMP File (Multi-Modal Projections) Verbosity level SafeHandle of the Clip Model Create the Image Embeddings. 
LLama Context Image filename (it supports jpeg format only) return the SafeHandle of these embeddings Create the Image Embeddings. Image in binary format (it supports jpeg format only) Number of threads to use return the SafeHandle of these embeddings Create the Image Embeddings. LLama Context Image in binary format (it supports jpeg format only) return the SafeHandle of these embeddings Create the Image Embeddings. Image in binary format (it supports jpeg format only) Number of threads to use return the SafeHandle of these embeddings Evaluates the image embeddings. Llama Context The current embeddings to evaluate True on success Load MULTI MODAL PROJECTIONS model / Clip Model Model path/file Verbosity level SafeLlavaModelHandle Frees MULTI MODAL PROJECTIONS model / Clip Model Internal Pointer to the model Create a new sampler wrapping a llama.cpp sampler chain Create a sampling chain. This will be called once, the base class will automatically dispose the chain. An implementation of ISamplingPipeline which mimics the default llama.cpp sampling Bias values to add to certain logits Repetition penalty, as described in https://arxiv.org/abs/1909.05858 Frequency penalty as described by OpenAI: https://platform.openai.com/docs/api-reference/chat/create
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
Presence penalty as described by OpenAI: https://platform.openai.com/docs/api-reference/chat/create
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
How many tokens should be considered for penalties Whether the newline token should be protected from being modified by penalty Whether the EOS token should be suppressed. Setting this to 'true' prevents EOS from being sampled Temperature to apply (higher temperature is more "creative") Number of tokens to keep in TopK sampling P value for locally typical sampling P value for TopP sampling P value for MinP sampling Grammar to apply to constrain possible tokens The minimum number of tokens to keep for samplers which remove tokens Seed to use for random sampling A grammar in GBNF form A grammar in GBNF form A sampling pipeline which always selects the most likely token Grammar to apply to constrain possible tokens Convert a span of logits into a single sampled token. This interface can be implemented to completely customise the sampling process. Sample a single token from the given context at the given position The context being sampled from Position to sample logits from Reset all internal state of the sampling pipeline Update the pipeline, with knowledge that a particular token was just accepted Extension methods for Sample a single token from the given context at the given position The context being sampled from Position to sample logits from Decodes a stream of tokens into a stream of characters The number of decoded characters waiting to be read If true, special characters will be converted to text. If false they will be invisible. Create a new decoder Text encoding to use Model weights Create a new decoder Context to retrieve encoding and model weights from Create a new decoder Text encoding to use Context to retrieve model weights from Create a new decoder Text encoding to use Model weights to use Add a single token to the decoder Add a single token to the decoder Add all tokens in the given enumerable Add all tokens in the given span Read all decoded characters and clear the buffer Read all decoded characters as a string and clear the buffer Set the decoder back to its initial state A prompt formatter that will use llama.cpp's template formatter If your model is not supported, you will need to define your own formatter according to the chat prompt specification for your model A prompt formatter that will use llama.cpp's template formatter If your model is not supported, you will need to define your own formatter according to the chat prompt specification for your model Apply the template to the messages and return the resulting prompt as a string The formatted template string as defined by the model
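As a closing illustration of the sampling settings documented above, here is a hedged sketch of configuring the default llama.cpp-style pipeline. The class and property names (DefaultSamplingPipeline and its settings) reflect this documentation but may differ slightly between LLamaSharp versions, so treat the exact identifiers as assumptions:

using LLama.Sampling;

var pipeline = new DefaultSamplingPipeline
{
    Temperature      = 0.7f,  // higher temperature is more "creative"
    TopK             = 40,    // number of tokens to keep in top-k sampling
    TopP             = 0.9f,  // nucleus (top-p) threshold
    MinP             = 0.05f, // min-p threshold
    RepeatPenalty    = 1.1f,  // repetition penalty (https://arxiv.org/abs/1909.05858)
    FrequencyPenalty = 0.0f,  // OpenAI-style frequency penalty, between -2.0 and 2.0
    PresencePenalty  = 0.0f,  // OpenAI-style presence penalty, between -2.0 and 2.0
};

In typical usage the pipeline is then passed to the executor via its inference parameters so it is used in place of the built-in defaults.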