calbert package

Submodules

calbert.CalBERT module

class calbert.CalBERT.CalBERT(model_path: str, num_pooling_layers: int = 1, pooling_method: str = 'mean', device: str = 'cpu')

Bases: Module

add_tokens_to_tokenizer(tokens: List[str]) → int

Add new tokens to the CalBERT Tokenizer.

Parameters

tokens – List of tokens to add to the Tokenizer.

Returns

New vocabulary size of the Tokenizer

batch_embed(encodings: Dict[str, Tensor]) → Tensor

Returns the embedding representation of a batch of encodings.

Parameters

encodings – Dictionary containing the input ids, attention mask and token type ids.

Returns

Embedding representation of the batch of sentences.

batch_encode(sentences: List[str]) → Dict[str, Tensor]

Encode a list of sentences using the CalBERT Tokenizer.

Parameters

sentences – List of sentences to encode.

Returns

Dictionary containing the input ids, attention mask and token type ids.

batch_sentence_embedding(sentences: List[str], pooling: bool = False) → Tensor

Returns the sentence embedding of a batch of sentences.

Parameters
  • sentences – List of sentences to embed.

  • pooling – Whether to pool the embedding.

Returns

Sentence embeddings of the batch of sentences.

distance(sentence1: str, sentence2: str, metric='cosine', pooling: bool = True) → Tuple[Tensor, Tensor]

Returns the distance between two sentences.

Parameters
  • sentence1 – First sentence.

  • sentence2 – Second sentence.

  • metric – Metric to use for distance. Can be cosine, euclidean or manhattan.

  • pooling – Whether to pool the embedding. If True, the embedding is pooled before calculating the distance.

embed(encoding: Dict[str, Tensor]) → Tensor

Returns the embedding representation of an encoding.

Parameters

encoding – Dictionary containing the input ids, attention mask and token type ids.

Returns

Embedding representation of the sentence.

embedding_distance(embedding1: Tensor, embedding2: Tensor, metric: str = 'cosine') → Tuple[Tensor, Tensor]

Returns the distance between two embeddings defined by the metric.

Parameters
  • embedding1 – First embedding.

  • embedding2 – Second embedding.

  • metric – Metric to use for distance. Can be ‘cosine’, ‘euclidean’ or ‘manhattan’.

Returns

Distance between the embeddings and the distance matrix.
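As a rough illustration of what the three metric names conventionally mean, here is a minimal pure-Python sketch. It is not the library's implementation: the function names are hypothetical, the inputs are plain lists rather than Tensors, and it assumes cosine distance is defined as 1 minus cosine similarity, which the documentation above does not confirm.

```python
import math
from typing import Callable, Dict, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed convention: 1 - cosine similarity. Inputs must be non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical dispatch table keyed by the metric names listed above.
METRICS: Dict[str, Callable[[Sequence[float], Sequence[float]], float]] = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
}


def vector_distance(u: Sequence[float], v: Sequence[float], metric: str = "cosine") -> float:
    """Look up the metric by name and apply it to two equal-length vectors."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    return METRICS[metric](u, v)
```

Dispatching on a string mirrors the metric parameter documented above; an unknown name raises ValueError in this sketch, which may differ from how CalBERT handles it.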

static embedding_similarity(embedding1: Tensor, embedding2: Tensor) → Tuple[Tensor, Tensor]

Returns the similarity between two embeddings.

Parameters
  • embedding1 – First embedding.

  • embedding2 – Second embedding.

Returns

Similarity between the embeddings and the similarity matrix.

encode(sentence: str) → Dict[str, Tensor]

Encode a sentence using the CalBERT Tokenizer.

Parameters

sentence – Sentence to encode.

Returns

Dictionary containing the input ids, attention mask and token type ids.

forward(sentences: List[str], pooling: bool = False) → Tensor

Returns the sentence embedding of a batch of sentences.

Parameters
  • sentences – List of sentences to embed.

  • pooling – Whether to pool the embedding.

static load(path: Union[Path, str], transformer_path: Optional[str] = None) → CalBERT

Loads the CalBERT Siamese Network model.

Parameters
  • path – The path to the CalBERT model. If this is a directory, ensure that it contains the calbert.py file and the config.json to load the Transformer. If this is a file, it should be the calbert.pt file.

  • transformer_path – The path to the Transformer model. If None, the model is loaded from the path using the config.json.

Returns

The loaded CalBERT Siamese Network model.

pooling(weights: Tensor) → Tensor

Returns the pooled representation of a batch of weights.

Parameters

weights – Batch of weights to pool.

Returns

Pooled representation of the batch of weights.
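The constructor's default pooling_method is 'mean'. The sketch below shows what mean pooling usually means, assuming the reduction runs over the token axis of each sentence. That axis choice, the function names, and the use of nested lists instead of Tensors are all assumptions made for illustration, not a description of CalBERT's internals.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a sentence's token vectors into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sequence")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    # Average each coordinate across all tokens (assumed pooling axis).
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def batch_mean_pool(batch: List[List[List[float]]]) -> List[List[float]]:
    """Apply mean_pool to every sentence in a batch."""
    return [mean_pool(sentence) for sentence in batch]
```

Under this assumption, sentences of different lengths all reduce to vectors of the same dimension, which is what makes pooled embeddings directly comparable.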

save(path: Union[Path, str], save_pretrained: bool = True, save_tokenizer: bool = True) → None

Saves the CalBERT Siamese Network model.

Parameters
  • path – The directory path in which to save the model.

  • save_pretrained – Whether to save the Transformer separately.

  • save_tokenizer – Whether to save the Tokenizer for the Transformer separately. Applicable only if save_pretrained is True.

Returns

None

save_pretrained(path: Union[Path, str], save_tokenizer: bool = True) → None

Invokes the base Transformer save_pretrained method to save the model and Tokenizer.

Parameters
  • path – The directory path in which to save the Transformer and Tokenizer

  • save_tokenizer – Whether to save the Tokenizer.

Returns

None

sentence_embedding(sentence: str, pooling: bool = False) → Tensor

Returns the sentence embedding of a sentence.

Parameters
  • sentence – Sentence to embed.

  • pooling – Whether to pool the embedding.

Returns

Sentence embedding.

similarity(sentence1: str, sentence2: str, pooling: bool = True) → Tuple[Tensor, Tensor]

Returns the similarity between two sentences.

Parameters
  • sentence1 – First sentence.

  • sentence2 – Second sentence.

  • pooling – Whether to pool the embedding. If True, the embedding is pooled before calculating the similarity.
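A short usage sketch built only from the signatures documented on this page. The helper name compare_sentences is hypothetical, and model stands for an already constructed CalBERT instance. The tuple order (scalar score first, matrix second) is an assumption inferred from the Returns text of embedding_similarity and embedding_distance.

```python
from typing import Any, Tuple


def compare_sentences(model: Any, sentence1: str, sentence2: str) -> Tuple[Any, Any]:
    """Return (similarity, euclidean distance) for two sentences.

    `model` is expected to expose the similarity() and distance()
    methods with the signatures documented above.
    """
    # Assumed order: (score, matrix). The matrices are ignored here.
    similarity, _similarity_matrix = model.similarity(
        sentence1, sentence2, pooling=True
    )
    distance, _distance_matrix = model.distance(
        sentence1, sentence2, metric="euclidean", pooling=True
    )
    return similarity, distance
```

Passing pooling=True explicitly matches the documented default and keeps the intent visible at the call site.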

train_new_tokenizer(sentences: List[str]) → int

Train a new tokenizer on a list of sentences.

Parameters

sentences – List of sentences to train the tokenizer on.

Returns

New vocabulary size of the tokenizer.

training: bool

calbert.CalBERTDataset module

class calbert.CalBERTDataset.CalBERTDataset(base_language_sentences: List[str], target_language_sentences: List[str], labels: Optional[float] = None, negative_sampling: bool = False, negative_sampling_size: float = 0.5, negative_sampling_count: int = 1, negative_sampling_type: str = 'target', min_count: int = 10, shuffle: bool = True)

Bases: Dataset

compute_vocabulary(min_count: Optional[int] = None) → List[str]

Compute the vocabulary of the dataset by finding tokens appearing at least min_count times.

Parameters

min_count – Minimum frequency of a token in the dataset to be included in the vocabulary

Returns

List of tokens in the dataset appearing at least min_count times.
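The frequency filter described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: the function name is hypothetical, tokens are obtained by whitespace splitting (the dataset's actual tokenization is not documented here), and the result is sorted only to make it deterministic.

```python
from collections import Counter
from typing import List


def frequent_tokens(sentences: List[str], min_count: int = 10) -> List[str]:
    """Return tokens that occur at least `min_count` times across all sentences."""
    # Assumption: whitespace tokenization; the real dataset may differ.
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return sorted(token for token, count in counts.items() if count >= min_count)
```

The default of 10 mirrors the min_count default in the CalBERTDataset constructor signature.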

get_batch(start: int, end: int) → Tuple[List[str], List[str], Tensor]

Returns a batch of examples from the dataset between the given start and end indices.

Parameters
  • start – Start index of the batch in the dataset

  • end – End index of the batch in the dataset

Returns

A tuple of base language sentences, target language sentences, and labels between the given start and end indices
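One way to walk a dataset with get_batch is sketched below. The helper iter_batches is hypothetical, dataset_size must be supplied by the caller, and the sketch assumes end is exclusive (Python slice semantics), which the description above does not state.

```python
from typing import Any, Iterator, List, Tuple


def iter_batches(
    dataset: Any, dataset_size: int, batch_size: int
) -> Iterator[Tuple[List[str], List[str], Any]]:
    """Yield consecutive (base, target, labels) batches from `dataset`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, dataset_size, batch_size):
        # Assumption: `end` is exclusive; clamp it so the last batch may be short.
        end = min(start + batch_size, dataset_size)
        yield dataset.get_batch(start, end)
```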

get_tokens() → List[str]

Returns the vocabulary of the dataset computed by compute_vocabulary.

Returns

List of tokens in vocabulary
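The dataset vocabulary and CalBERT.add_tokens_to_tokenizer appear designed to work together, although this page does not say so explicitly. The sketch below chains the two documented calls; the helper name extend_tokenizer is hypothetical, and the pairing itself is an assumption.

```python
from typing import Any, Optional


def extend_tokenizer(model: Any, dataset: Any, min_count: Optional[int] = None) -> int:
    """Add the dataset's frequent tokens to the model's tokenizer.

    Returns the new vocabulary size reported by add_tokens_to_tokenizer.
    """
    # compute_vocabulary is documented to return the token list directly.
    tokens = dataset.compute_vocabulary(min_count=min_count)
    return model.add_tokens_to_tokenizer(tokens)
```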

static load(path: Union[str, Path]) → CalBERTDataset

Load a CalBERTDataset object from the given path.

Parameters

path – Path to load the dataset object from

Returns

CalBERTDataset object

sample_negative_examples(sampling: str = 'target') → None

Sample negative examples from the dataset for each positive example.

Parameters

sampling – Which side to draw negative examples from: the base language, the target language, or both

Returns

None
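To make the idea of negative sampling concrete, here is a minimal sketch that assumes 'target' sampling means pairing each base sentence with a target sentence from a different position and labelling the pair 0.0. The function name, the label value, and the seed parameter are illustrative assumptions; the library's actual strategy may differ.

```python
import random
from typing import List, Tuple


def sample_target_negatives(
    base: List[str], target: List[str], count: int = 1, seed: int = 0
) -> List[Tuple[str, str, float]]:
    """Build mismatched (base, target, 0.0) pairs from aligned sentence lists."""
    if len(base) != len(target):
        raise ValueError("base and target must be aligned and equally long")
    if len(target) < 2:
        raise ValueError("need at least two pairs to form a negative example")
    rng = random.Random(seed)  # seeded for reproducibility
    negatives = []
    for i, base_sentence in enumerate(base):
        # Any target index except the aligned one is a valid negative.
        candidates = [j for j in range(len(target)) if j != i]
        for j in rng.sample(candidates, min(count, len(candidates))):
            negatives.append((base_sentence, target[j], 0.0))
    return negatives
```

The count parameter plays the role that negative_sampling_count appears to play in the constructor, though that correspondence is an assumption.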

save(path: Union[str, Path]) → None

Save the dataset object to the given path.

Parameters

path – Path to save the dataset object to

Returns

None

calbert.SiamesePreTrainer module