snakefusion API

class snakefusion.Embeddings

finalfusion embeddings.

analogy(word1, word2, word3, limit=10, mask=(True, True, True))

Perform an analogy query.

This returns words for the analogy query word1 is to word2 as word3 is to ?.
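
For example, a minimal sketch of an analogy query; the word2vec file path and the query words below are illustrative, not part of the API:

    import snakefusion

    embeds = snakefusion.Embeddings.read_word2vec("vectors.bin")

    # Query: "berlin" is to "germany" as "amsterdam" is to ?
    for result in embeds.analogy("berlin", "germany", "amsterdam", limit=5):
        print(result)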

embedding(word, /, default=None)

Get the embedding for the given word.

If the word is not known, its representation is approximated using subword units.

If no representation can be calculated, the return value depends on default:
  • None if default is None

  • an array filled with default if default is a scalar

  • that array if default is a 1-d array

  • an array filled with the values of default if it is an iterator over floats.
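
A sketch of lookups with and without a default, reusing the embeds object from the analogy example above; the words are illustrative:

    # None is returned when no representation can be calculated.
    vec = embeds.embedding("qwertyuiop")
    if vec is None:
        print("no embedding")

    # A vector filled with zeros is returned instead of None.
    vec = embeds.embedding("qwertyuiop", default=0.0)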

embedding_batch(words, /, out=None)

Get the embeddings for a batch of words. The embeddings are returned along with an array that indicates for each word whether an embedding could be found.

If a matrix is provided through the out argument, embeddings are written to that matrix. Rows corresponding to words for which no embedding could be found are not overwritten.
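
A sketch of a batch lookup, assuming the embedding matrix and the per-word found indicator are returned as a tuple in that order; the words are illustrative:

    words = ["berlin", "amsterdam", "qwertyuiop"]
    matrix, found = embeds.embedding_batch(words)
    for word, ok in zip(words, found):
        print(word, "found" if ok else "missing")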

embedding_similarity(embeddings, /, skip=None, limit=10)

Perform a similarity query based on a query embedding. skip specifies the set of words that should never be returned.
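
A sketch of a similarity query driven by a query embedding rather than a word, passing skip as a set per the description above; the query word is illustrative:

    query = embeds.embedding("coffee")
    if query is not None:
        # Keep the query word itself out of the results.
        for result in embeds.embedding_similarity(query, skip={"coffee"}, limit=5):
            print(result)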

embedding_with_norm(word)

Look up the embedding and norm of a word. The embedding and norm are returned as a tuple.
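
A sketch of retrieving an embedding together with its norm, continuing with embeds; the word is illustrative:

    embedding, norm = embeds.embedding_with_norm("berlin")
    print(norm)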

static from_bytes(data)

Deserialize embeddings from bytes.

metadata

Embeddings metadata.

quantize(n_subquantizers, /, quantizer='pq', n_subquantizer_bits=8, n_iterations=100, n_attempts=1, normalize=True)

Quantize the embeddings with the given hyperparameters:

  • The number of subquantizers

  • The quantizer (pq, opq, or gaussian_opq).

  • The number of bits per subquantizer.

  • The number of optimization iterations.

  • The number of quantization attempts per iteration.

  • Whether embeddings should be l2-normalized before quantization.

Returns the quantized embeddings.
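
A minimal quantization sketch using the documented defaults for the remaining hyperparameters; the choice of 16 subquantizers is illustrative and would typically be picked to divide the embedding dimensionality:

    # Product quantization with 16 subquantizers, 8 bits each (default).
    quantized = embeds.quantize(16, quantizer="pq")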

static read_fasttext(path, /, lossy=False)

Read embeddings in the fastText format.

Lossy decoding of the words can be enabled through the lossy parameter.

static read_floret_text(path)

Read embeddings in the floret text format.

static read_text(path, /, lossy=False)

Read embeddings in text format. This format uses one line per embedding. Each line starts with the word in UTF-8, followed by its vector components encoded in ASCII. The word and its components are separated by spaces.

Lossy decoding of the words can be enabled through the lossy parameter.

static read_text_dims(path, /, lossy=False)

Read embeddings in text format with dimensions. In this format, the first line states the shape of the embedding matrix. The number of rows (words) and columns (embedding dimensionality) are separated by a space character. The remainder of the file uses one line per embedding. Each line starts with the word in UTF-8, followed by its vector components encoded in ASCII. The word and its components are separated by spaces.

Lossy decoding of the words can be enabled through the lossy parameter.
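
A sketch of this format and of reading it; the file name and contents are illustrative:

    # vectors-dims.txt (2 words, 3 dimensions):
    #
    #   2 3
    #   apple 0.1 0.2 0.3
    #   pear 0.4 0.5 0.6
    text_embeds = snakefusion.Embeddings.read_text_dims("vectors-dims.txt")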

static read_word2vec(path, /, lossy=False)

Read embeddings in the word2vec binary format.

Lossy decoding of the words can be enabled through the lossy parameter.

storage

Get the model’s storage.

to_bytes()

Serialize the embeddings to bytes.
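
A sketch of a serialization round trip through bytes, continuing with the embeds object loaded earlier:

    data = embeds.to_bytes()
    restored = snakefusion.Embeddings.from_bytes(data)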

vocab

Get the model’s vocabulary.

word_similarity(word, /, limit=10)

Perform a similarity query.
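
A sketch of a word-based similarity query, continuing with embeds; the query word is illustrative:

    for result in embeds.word_similarity("coffee", limit=5):
        print(result)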

write(path)

Write the embeddings to a finalfusion file.
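
A one-line sketch, writing to a hypothetical path:

    embeds.write("model.fifu")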

class snakefusion.Storage

Embedding matrix storage.

matrix_copy()

Copy the entire embeddings matrix.

shape

Get the shape of the embedding matrix.
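
A sketch of inspecting the storage of the embeddings loaded earlier, assuming shape is a (rows, dimensions) pair:

    storage = embeds.storage
    rows, dims = storage.shape
    print(rows, dims)

    # A copy of the full matrix; modifying it leaves the embeddings unchanged.
    matrix = storage.matrix_copy()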

class snakefusion.Vocab

Embeddings vocabulary.

get(word, /, default=None)

Get the index or subword indices of a word.

If the word is known, return the index of the word in the embedding matrix. If the word is unknown, return its subword indices.

The provided default parameter is returned if the word could not be looked up.
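
A sketch of vocabulary lookups, continuing with embeds; the words and the default value are illustrative:

    vocab = embeds.vocab

    # Index for a known word, or subword indices for an unknown word.
    print(vocab.get("berlin"))

    # -1 instead of None when the word cannot be looked up at all.
    print(vocab.get("qwertyuiop", default=-1))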

ngram_indices(word)

Return the ngrams of a word and their indices.

subword_indices(word)

Return the subword indices of a word.
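
A sketch of subword lookups on a vocabulary with subword units, e.g. embeddings loaded with read_fasttext from a hypothetical model file, assuming ngram_indices returns (ngram, index) pairs:

    ft_embeds = snakefusion.Embeddings.read_fasttext("model.bin")
    vocab = ft_embeds.vocab

    # Ngrams of the word paired with their indices.
    for ngram, index in vocab.ngram_indices("pineapple"):
        print(ngram, index)

    # Only the indices.
    print(vocab.subword_indices("pineapple"))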