snakefusion API

class snakefusion.Embeddings

finalfusion embeddings.

analogy(word1, word2, word3, limit=10, mask=(True, True, True))

Perform an analogy query.

This returns words for the analogy query word1 is to word2 as word3 is to ?.
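
For example, a minimal sketch of an analogy query; the word2vec file path and the query words below are illustrative, not part of the API:

    import snakefusion

    embeds = snakefusion.Embeddings.read_word2vec("vectors.bin")

    # Query: "berlin" is to "germany" as "amsterdam" is to ?
    for result in embeds.analogy("berlin", "germany", "amsterdam", limit=5):
        print(result)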

embedding(word, /, default=None)

Get the embedding for the given word.

If the word is not known, its representation is approximated using subword units.

If no representation can be calculated, the return value depends on default:
  • None if default is None

  • an array filled with default if default is a scalar

  • that array if default is a 1-d array

  • an array filled with the values of default if it is an iterator over floats.
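
A sketch of lookups with and without a default, reusing the embeds object from the analogy example above; the words are illustrative:

    # None is returned when no representation can be calculated.
    vec = embeds.embedding("qwertyuiop")
    if vec is None:
        print("no embedding")

    # A vector filled with zeros is returned instead of None.
    vec = embeds.embedding("qwertyuiop", default=0.0)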

embedding_batch(words, /, out=None)

Get the embeddings for a batch of words. The embeddings are returned along with an array that indicates for each word whether an embedding could be found.

If a matrix is provided through the out argument, embeddings are written to that matrix. Rows corresponding to words for which no embedding could be found are not overwritten.
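
A sketch of a batch lookup, assuming the embedding matrix and the per-word found indicator are returned as a tuple in that order; the words are illustrative:

    words = ["berlin", "amsterdam", "qwertyuiop"]
    matrix, found = embeds.embedding_batch(words)
    for word, ok in zip(words, found):
        print(word, "found" if ok else "missing")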

embedding_similarity(embeddings, /, skip=None, limit=10)

Perform a similarity query based on a query embedding. skip specifies the set of words that should never be returned.
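
A sketch of a similarity query driven by a query embedding rather than a word, passing skip as a set per the description above; the query word is illustrative:

    query = embeds.embedding("coffee")
    if query is not None:
        # Keep the query word itself out of the results.
        for result in embeds.embedding_similarity(query, skip={"coffee"}, limit=5):
            print(result)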

embedding_with_norm(word)

Look up the embedding and norm of a word. The embedding and norm are returned as a tuple.
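
A sketch of retrieving an embedding together with its norm, continuing with embeds; the word is illustrative:

    embedding, norm = embeds.embedding_with_norm("berlin")
    print(norm)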

static from_bytes(data)

Deserialize embeddings from bytes.

metadata

Embeddings metadata.

quantize(n_subquantizers, /, quantizer='pq', n_subquantizer_bits=8, n_iterations=100, n_attempts=1, normalize=True)

Quantize the embeddings with the given hyperparameters:

  • The number of subquantizers

  • The quantizer (pq, opq, or gaussian_opq).

  • The number of bits per subquantizer.

  • The number of optimization iterations.

  • The number of quantization attempts per iteration.

  • Whether embeddings should be l2-normalized before quantization.

Returns the quantized embeddings.
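
A minimal quantization sketch using the documented defaults for the remaining hyperparameters; the choice of 16 subquantizers is illustrative and would typically be picked to divide the embedding dimensionality:

    # Product quantization with 16 subquantizers, 8 bits each (default).
    quantized = embeds.quantize(16, quantizer="pq")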

static read_fasttext(path, /, lossy=False)

Read embeddings in the fastText format.

Lossy decoding of the words can be enabled through the lossy parameter.

static read_floret_text(path)

Read embeddings in the floret text format.

static read_text(path, /, lossy=False)

Read embeddings in text format. This format uses one line per embedding. Each line starts with the word in UTF-8, followed by its vector components encoded in ASCII. The word and its components are separated by spaces.

Lossy decoding of the words can be enabled through the lossy parameter.

static read_text_dims(path, /, lossy=False)

Read embeddings in text format with dimensions. In this format, the first line states the shape of the embedding matrix. The number of rows (words) and columns (embedding dimensionality) are separated by a space character. The remainder of the file uses one line per embedding. Each line starts with the word in UTF-8, followed by its vector components encoded in ASCII. The word and its components are separated by spaces.

Lossy decoding of the words can be enabled through the lossy parameter.
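
A sketch of this format and of reading it; the file name and contents are illustrative:

    # vectors-dims.txt (2 words, 3 dimensions):
    #
    #   2 3
    #   apple 0.1 0.2 0.3
    #   pear 0.4 0.5 0.6
    text_embeds = snakefusion.Embeddings.read_text_dims("vectors-dims.txt")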

static read_word2vec(path, /, lossy=False)

Read embeddings in the word2vec binary format.

Lossy decoding of the words can be enabled through the lossy parameter.

storage

Get the model’s storage.

to_bytes()

Serialize the embeddings to bytes.
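
A sketch of a serialization round trip through bytes, continuing with the embeds object loaded earlier:

    data = embeds.to_bytes()
    restored = snakefusion.Embeddings.from_bytes(data)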

vocab

Get the model’s vocabulary.

word_similarity(word, /, limit=10)

Perform a similarity query.
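
A sketch of a word-based similarity query, continuing with embeds; the query word is illustrative:

    for result in embeds.word_similarity("coffee", limit=5):
        print(result)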

write(path)

Write the embeddings to a finalfusion file.
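
A one-line sketch, writing to a hypothetical path:

    embeds.write("model.fifu")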

class snakefusion.Storage

Embedding matrix storage.

matrix_copy()

Copy the entire embeddings matrix.

shape

Get the shape of the embedding matrix.
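
A sketch of inspecting the storage of the embeddings loaded earlier, assuming shape is a (rows, dimensions) pair:

    storage = embeds.storage
    rows, dims = storage.shape
    print(rows, dims)

    # A copy of the full matrix; modifying it leaves the embeddings unchanged.
    matrix = storage.matrix_copy()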

class snakefusion.Vocab

Embeddings vocabulary.

get(word, /, default=None)

Get the index or subword indices of a word.

If the word is known, return the index of the word in the embedding matrix. If the word is unknown, return its subword indices.

The provided default parameter is returned if the word could not be looked up.
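
A sketch of vocabulary lookups, continuing with embeds; the words and the default value are illustrative:

    vocab = embeds.vocab

    # Index for a known word, or subword indices for an unknown word.
    print(vocab.get("berlin"))

    # -1 instead of None when the word cannot be looked up at all.
    print(vocab.get("qwertyuiop", default=-1))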

ngram_indices(word)

Return the ngrams of a word and their indices.

subword_indices(word)

Return the subword indices of a word.
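
A sketch of subword lookups on a vocabulary with subword units, e.g. embeddings loaded with read_fasttext from a hypothetical model file, assuming ngram_indices returns (ngram, index) pairs:

    ft_embeds = snakefusion.Embeddings.read_fasttext("model.bin")
    vocab = ft_embeds.vocab

    # Ngrams of the word paired with their indices.
    for ngram, index in vocab.ngram_indices("pineapple"):
        print(ngram, index)

    # Only the indices.
    print(vocab.subword_indices("pineapple"))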