calbert package

Submodules

calbert.CalBERT module

class calbert.CalBERT.CalBERT(model_path: str, num_pooling_layers: int = 1, pooling_method: str = 'mean', device: str = 'cpu')

Bases: Module

add_tokens_to_tokenizer(tokens: List[str]) → int

Add new tokens to the CalBERT Tokenizer.

Parameters: tokens – List of tokens to add to the Tokenizer.
Returns: New vocabulary size of the Tokenizer

batch_embed(encodings: Dict[str, Tensor]) → Tensor

Returns the embedding representation of a batch of encodings.

Parameters: encodings – Dictionary containing the input ids, attention mask and token type ids.
Returns: Embedding representation of the batch of sentences.

batch_encode(sentences: List[str]) → Dict[str, Tensor]

Encode a list of sentences using the CalBERT Tokenizer.

Parameters: sentences – List of sentences to encode.
Returns: Dictionary containing the input ids, attention mask and token type ids.

batch_sentence_embedding(sentences: List[str], pooling: bool = False) → Tensor

Returns the sentence embedding of a batch of sentences.

Parameters

sentences – List of sentences to embed.
pooling – Whether to pool the embedding.

Returns

Sentence embeddings of the batch of sentences.

distance(sentence1: str, sentence2: str, metric='cosine', pooling: bool = True) → Tuple[Tensor, Tensor]

Returns the distance between two sentences.

Parameters

sentence1 – First sentence.
sentence2 – Second sentence.
metric – Metric to use for distance. Can be ‘cosine’, ‘euclidean’ or ‘manhattan’.
pooling – Whether to pool the embedding. If True, the embedding is pooled before calculating the distance.

embed(encoding: Dict[str, Tensor]) → Tensor

Returns the embedding representation of an encoding.

Parameters: encoding – Dictionary containing the input ids, attention mask and token type ids.
Returns: Embedding representation of the sentence.

embedding_distance(embedding1: Tensor, embedding2: Tensor, metric: str = 'cosine') → Tuple[Tensor, Tensor]

Returns the distance between two embeddings defined by the metric.

Parameters

embedding1 – First embedding.
embedding2 – Second embedding.
metric – Metric to use for distance. Can be ‘cosine’, ‘euclidean’ or ‘manhattan’.

Returns

Distance between the embeddings and the distance matrix.
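As a reference for the three metric names, the sketch below computes each distance on plain Python lists. It is not CalBERT's implementation: the helper names are hypothetical, and defining cosine distance as 1 minus cosine similarity is a common convention assumed here, not something this page states.

```python
import math
from typing import Callable, Dict, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical lookup table keyed by the metric names listed above.
METRICS: Dict[str, Callable[[Sequence[float], Sequence[float]], float]] = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
}


def vector_distance(a: Sequence[float], b: Sequence[float], metric: str = "cosine") -> float:
    """Dispatch on the metric name and reject unknown names."""
    if metric not in METRICS:
        raise ValueError(f"Unknown metric: {metric!r}")
    return METRICS[metric](a, b)
```

Smaller values mean closer embeddings under all three metrics, although their scales differ, so thresholds are not interchangeable between them.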

static embedding_similarity(embedding1: Tensor, embedding2: Tensor) → Tuple[Tensor, Tensor]

Returns the similarity between two embeddings.

Parameters

embedding1 – First embedding.
embedding2 – Second embedding.

Returns

Similarity between the embeddings and the similarity matrix.

encode(sentence: str) → Dict[str, Tensor]

Encode a sentence using the CalBERT Tokenizer.

Parameters: sentence – Sentence to encode.
Returns: Dictionary containing the input ids, attention mask and token type ids.

forward(sentences: List[str], pooling: bool = False) → Tensor

Returns the sentence embedding of a batch of sentences.

Parameters

sentences – List of sentences to embed.
pooling – Whether to pool the embedding.

Returns

Sentence embeddings of the batch of sentences.

static load(path: Union[Path, str], transformer_path: Optional[str] = None) → CalBERT

Loads the CalBERT Siamese Network model.

Parameters

path – The path to the CalBERT model. If this is a directory, ensure that it contains the calbert.pt file and the config.json to load the Transformer. If this is a file, it should be the calbert.pt file.
transformer_path – The path to the Transformer model. If None, the model is loaded from the path using the config.json.

Returns

The loaded CalBERT Siamese Network model.

pooling(weights: Tensor) → Tensor

Returns the pooled representation of a batch of weights.

Parameters: weights – Batch of weights to pool.
Returns: Pooled representation of the batch of weights.
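To illustrate what the default 'mean' pooling method means, the sketch below averages a sequence of token vectors into one fixed-size vector. It uses plain lists and a hypothetical helper name; how CalBERT applies pooling_method and num_pooling_layers internally is not described on this page.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average token vectors element-wise into a single vector."""
    if not token_vectors:
        raise ValueError("Cannot pool an empty sequence of token vectors")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    # One output value per embedding dimension: the mean across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]
```

Pooling of this kind gives sentences of different lengths a representation of the same size, which is what makes them directly comparable.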

save(path: Union[Path, str], save_pretrained: bool = True, save_tokenizer: bool = True) → None

Saves the CalBERT Siamese Network model.

Parameters

path – The directory path in which to save the model.
save_pretrained – Whether to save the Transformer separately.
save_tokenizer – Whether to save the Tokenizer for the Transformer separately. Applicable only if save_pretrained is True.

Returns

None

save_pretrained(path: Union[Path, str], save_tokenizer: bool = True) → None

Invokes the base Transformer save_pretrained method to save the model and Tokenizer.

Parameters

path – The directory path in which to save the Transformer and Tokenizer.
save_tokenizer – Whether to save the Tokenizer.

Returns

None

sentence_embedding(sentence: str, pooling: bool = False) → Tensor

Returns the sentence embedding of a sentence.

Parameters

sentence – Sentence to embed.
pooling – Whether to pool the embedding.

Returns

Sentence embedding.

similarity(sentence1: str, sentence2: str, pooling: bool = True) → Tuple[Tensor, Tensor]

Returns the similarity between two sentences.

Parameters

sentence1 – First sentence.
sentence2 – Second sentence.
pooling – Whether to pool the embedding. If True, the embedding is pooled before calculating the similarity.
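
A possible use of similarity is ranking candidate sentences against a query. The sketch below is duck-typed: it accepts any object with a similarity(sentence1, sentence2) method. It assumes the first element of the returned tuple is a scalar similarity score (by analogy with embedding_similarity) that float() can convert; that is an assumption, since no return description is given here. The helper name is hypothetical.

```python
from typing import Any, List, Tuple


def rank_by_similarity(model: Any, query: str, candidates: List[str]) -> List[Tuple[str, float]]:
    """Score each candidate against the query and sort from most to least similar."""
    scored = []
    for candidate in candidates:
        # Assumption: the tuple is (similarity score, similarity matrix).
        score, _matrix = model.similarity(query, candidate)
        scored.append((candidate, float(score)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a loaded model this might be called as rank_by_similarity(model, "kal milte hain", candidate_sentences). Each call embeds the query again, so for large candidate lists the batch methods are likely the better fit.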

train_new_tokenizer(sentences: List[str]) → int

Train a new tokenizer on a list of sentences.

Parameters: sentences – List of sentences to train the tokenizer on.
Returns: New vocabulary size of the tokenizer.

training: bool

calbert.CalBERTDataset module

class calbert.CalBERTDataset.CalBERTDataset(base_language_sentences: List[str], target_language_sentences: List[str], labels: Optional[float] = None, negative_sampling: bool = False, negative_sampling_size: float = 0.5, negative_sampling_count: int = 1, negative_sampling_type: str = 'target', min_count: int = 10, shuffle: bool = True)

Bases: Dataset

compute_vocabulary(min_count: Optional[int] = None) → List[str]

Compute the vocabulary of the dataset by finding tokens appearing at least min_count times.

Parameters: min_count – Minimum frequency of a token in the dataset to be included in the vocabulary.
Returns: List of tokens in the dataset appearing at least min_count times.
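
The frequency filter described above can be sketched with collections.Counter. This is an illustration only: whitespace tokenization, the sorted output order, and the helper name are assumptions, and the page does not say how CalBERTDataset splits sentences into tokens.

```python
from collections import Counter
from typing import Iterable, List


def vocabulary_with_min_count(sentences: Iterable[str], min_count: int = 10) -> List[str]:
    """Return tokens that occur at least min_count times across all sentences."""
    # Assumption: tokens are whitespace-separated words.
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return sorted(token for token, n in counts.items() if n >= min_count)
```

A higher min_count keeps the vocabulary small by dropping rare tokens such as typos and one-off names.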

get_batch(start: int, end: int) → Tuple[List[str], List[str], Tensor]

Returns a batch of examples from the dataset between the given start and end indices.

Parameters

start – Start index of the batch in the dataset
end – End index of the batch in the dataset

Returns

A tuple of base language sentences, target language sentences, and labels between the given start and end indices
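
Since get_batch takes explicit indices, a training loop needs to generate the (start, end) pairs itself. The hypothetical helper below does that, assuming end is exclusive as in Python slicing; the page does not state whether get_batch treats end as inclusive or exclusive.

```python
from typing import Iterator, Tuple


def batch_ranges(num_examples: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield consecutive (start, end) index pairs that cover num_examples."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_examples, batch_size):
        # The final batch is shorter when num_examples is not a multiple of batch_size.
        yield start, min(start + batch_size, num_examples)
```

Each pair could then be passed on as dataset.get_batch(start, end), with num_examples taken from len(dataset) if the dataset defines its length, which is assumed here.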

get_tokens() → List[str]

Returns the vocabulary of the dataset computed by compute_vocabulary.

Returns: List of tokens in vocabulary
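
One plausible use of the dataset vocabulary is to extend the model's tokenizer with it through CalBERT.add_tokens_to_tokenizer. The sketch below wires the two documented methods together; the helper name is hypothetical, and treating this as the intended workflow is an assumption rather than something this page confirms.

```python
from typing import Any


def extend_tokenizer_from_dataset(model: Any, dataset: Any, min_count: int = 10) -> int:
    """Add the dataset's frequent tokens to the model's tokenizer.

    Returns the new vocabulary size reported by add_tokens_to_tokenizer.
    """
    # compute_vocabulary is documented to return tokens seen at least min_count times.
    tokens = dataset.compute_vocabulary(min_count=min_count)
    return model.add_tokens_to_tokenizer(tokens)
```

The function is duck-typed, so it works with any objects that expose these two methods, which also makes it easy to test without loading a Transformer.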

static load(path: Union[str, Path]) → CalBERTDataset

Load a CalBERTDataset object from the given path.

Parameters: path – Path to load the dataset object from.
Returns: CalBERTDataset object.

sample_negative_examples(sampling: str = 'target') → None

Sample negative examples from the dataset for each positive example.

Parameters: sampling – Which language to draw negative examples from: the base language, the target language, or both.
Returns: None
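
The general idea of negative sampling on the target side can be sketched as follows: each base sentence is paired with randomly chosen target sentences other than its own translation. This is a generic illustration, not the method's actual logic. The helper name, the 0.0 label for negatives, and the seeded random generator are all assumptions.

```python
import random
from typing import List, Tuple


def sample_negative_pairs(
    base: List[str], target: List[str], count: int = 1, seed: int = 0
) -> List[Tuple[str, str, float]]:
    """Pair each base sentence with `count` mismatched target sentences."""
    if len(base) != len(target):
        raise ValueError("base and target must be aligned and of equal length")
    if len(target) < 2:
        raise ValueError("Need at least two sentence pairs to sample negatives")
    rng = random.Random(seed)  # Seeded so the sketch is reproducible.
    negatives = []
    for i, sentence in enumerate(base):
        # Exclude index i, which holds the positive (aligned) target sentence.
        other_indices = [j for j in range(len(target)) if j != i]
        for j in rng.sample(other_indices, k=min(count, len(other_indices))):
            negatives.append((sentence, target[j], 0.0))  # Assumed negative label.
    return negatives
```

The sketch excludes by index only, so a target sentence that appears twice in the corpus can still be drawn as a "negative" for its own translation.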

save(path: Union[str, Path]) → None

Save the dataset object to the given path.

Parameters: path – Path to save the dataset object
Returns: None

calbert.SiamesePreTrainer module