calbert package
Submodules
calbert.CalBERT module
- class calbert.CalBERT.CalBERT(model_path: str, num_pooling_layers: int = 1, pooling_method: str = 'mean', device: str = 'cpu')
Bases: Module
- add_tokens_to_tokenizer(tokens: List[str]) → int
Add new tokens to the CalBERT Tokenizer.
- Parameters
tokens – List of tokens to add to the Tokenizer.
- Returns
New vocabulary size of the Tokenizer
- batch_embed(encodings: Dict[str, Tensor]) → Tensor
Returns the embedding representation of a batch of encodings.
- Parameters
encodings – Dictionary containing the input ids, attention mask and token type ids.
- Returns
Embedding representation of the batch of sentences.
- batch_encode(sentences: List[str]) → Dict[str, Tensor]
Encode a list of sentences using the CalBERT Tokenizer.
- Parameters
sentences – List of sentences to encode.
- Returns
Dictionary containing the input ids, attention mask and token type ids.
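As an illustration of what such a dictionary usually looks like for a padded batch, here is a minimal sketch in plain Python. It is not the CalBERT implementation: `pad_batch` and `pad_id` are hypothetical names, the token ids are made up, and nested lists stand in for tensors. It assumes the common BERT-style convention in which the attention mask is 1 for real tokens and 0 for padding, and token type ids are all 0 for single-sentence input.

```python
from typing import Dict, List


def pad_batch(token_ids: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad already-tokenized sentences to a common length (hypothetical helper)."""
    max_len = max(len(ids) for ids in token_ids)
    input_ids, attention_mask, token_type_ids = [], [], []
    for ids in token_ids:
        padding = max_len - len(ids)
        # Real tokens first, then padding up to the longest sentence.
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * padding)
        # Single-sentence input: every position belongs to segment 0.
        token_type_ids.append([0] * max_len)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


# Made-up token ids for two sentences of different lengths.
batch = pad_batch([[101, 7, 102], [101, 102]])
```

All three entries share the same shape, which is what lets a model process the batch in one pass.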
- batch_sentence_embedding(sentences: List[str], pooling: bool = False) → Tensor
Returns the sentence embedding of a batch of sentences.
- Parameters
sentences – List of sentences to embed.
pooling – Whether to pool the embedding.
- Returns
Sentence embeddings of the batch of sentences.
- distance(sentence1: str, sentence2: str, metric='cosine', pooling: bool = True) → Tuple[Tensor, Tensor]
Returns the distance between two sentences.
- Parameters
sentence1 – First sentence.
sentence2 – Second sentence.
metric – Metric to use for distance. Can be cosine, euclidean or manhattan.
pooling – Whether to pool the embedding. If True, the embedding is pooled before calculating the distance.
- embed(encoding: Dict[str, Tensor]) → Tensor
Returns the embedding representation of an encoding.
- Parameters
encoding – Dictionary containing the input ids, attention mask and token type ids.
- Returns
Embedding representation of the sentence.
- embedding_distance(embedding1: Tensor, embedding2: Tensor, metric: str = 'cosine') → Tuple[Tensor, Tensor]
Returns the distance between two embeddings defined by the metric.
- Parameters
embedding1 – First embedding.
embedding2 – Second embedding.
metric – Metric to use for distance. Can be ‘cosine’, ‘euclidean’ or ‘manhattan’.
- Returns
Distance between the embeddings and the distance matrix.
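The three metrics named above can be sketched on plain Python lists as follows. This is an illustrative sketch rather than the library's code: `embedding_distance_sketch` is a hypothetical name, it works on flat vectors instead of tensors, it returns only a scalar (not the distance matrix), and it assumes that cosine distance means one minus cosine similarity.

```python
import math
from typing import Sequence


def embedding_distance_sketch(a: Sequence[float], b: Sequence[float], metric: str = "cosine") -> float:
    """Distance between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    if metric == "cosine":
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            raise ValueError("cosine distance is undefined for a zero vector")
        # Assumption: cosine distance = 1 - cosine similarity.
        return 1.0 - dot / (norm_a * norm_b)
    if metric == "euclidean":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    raise ValueError(f"unknown metric: {metric}")


cosine = embedding_distance_sketch([1.0, 0.0], [0.0, 1.0], "cosine")
euclidean = embedding_distance_sketch([0.0, 0.0], [3.0, 4.0], "euclidean")
manhattan = embedding_distance_sketch([0.0, 0.0], [3.0, 4.0], "manhattan")
```

Cosine distance ignores vector length and compares direction only, whereas the Euclidean and Manhattan distances both grow with the magnitude of the difference.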
- static embedding_similarity(embedding1: Tensor, embedding2: Tensor) → Tuple[Tensor, Tensor]
Returns the similarity between two embeddings.
- Parameters
embedding1 – First embedding.
embedding2 – Second embedding.
- Returns
Similarity between the embeddings and the similarity matrix.
- encode(sentence: str) → Dict[str, Tensor]
Encode a sentence using the CalBERT Tokenizer.
- Parameters
sentence – Sentence to encode.
- Returns
Dictionary containing the input ids, attention mask and token type ids.
- forward(sentences: List[str], pooling: bool = False) → Tensor
Returns the sentence embedding of a batch of sentences.
- Parameters
sentences – List of sentences to embed.
pooling – Whether to pool the embedding.
- static load(path: Union[Path, str], transformer_path: Optional[str] = None) → CalBERT
Loads the CalBERT Siamese Network model.
- Parameters
path – The path to the CalBERT model. If this is a directory, ensure that it contains the calbert.py file and the config.json to load the Transformer. If this is a file, it should be the calbert.pt file.
transformer_path – The path to the Transformer model. If None, the model is loaded from the path using the config.json.
- Returns
The loaded CalBERT Siamese Network model.
- pooling(weights: Tensor) → Tensor
Returns the pooled representation of a batch of weights.
- Parameters
weights – Batch of weights to pool.
- Returns
Pooled representation of the batch of weights.
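The constructor's default `pooling_method` is `'mean'`. A minimal sketch of mean pooling on plain Python lists is shown below. It is not the library's implementation: `mean_pool` is a hypothetical name, and the sketch assumes that pooling averages the per-token vectors of one sentence into a single vector, which the reference above does not state explicitly.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length token vectors into one vector (hypothetical helper)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    # Average each dimension independently across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Two made-up 2-dimensional token vectors.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Pooling in this way yields a fixed-size vector regardless of sentence length, which is what makes two sentences directly comparable.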
- save(path: Union[Path, str], save_pretrained: bool = True, save_tokenizer: bool = True) → None
Saves the CalBERT Siamese Network model.
- Parameters
path – The directory path in which to save the model.
save_pretrained – Whether to save the Transformer separately.
save_tokenizer – Whether to save the Tokenizer for the Transformer separately. Applicable only if save_pretrained is True.
- Returns
None
- save_pretrained(path: Union[Path, str], save_tokenizer: bool = True) → None
Invokes the base Transformer save_pretrained method to save the model and Tokenizer.
- Parameters
path – The directory path in which to save the Transformer and Tokenizer.
save_tokenizer – Whether to save the Tokenizer.
- Returns
None
- sentence_embedding(sentence: str, pooling: bool = False) → Tensor
Returns the sentence embedding of a sentence.
- Parameters
sentence – Sentence to embed.
pooling – Whether to pool the embedding.
- Returns
Sentence embedding.
- similarity(sentence1: str, sentence2: str, pooling: bool = True) → Tuple[Tensor, Tensor]
Returns the similarity between two sentences.
- Parameters
sentence1 – First sentence.
sentence2 – Second sentence.
pooling – Whether to pool the embedding. If True, the embedding is pooled before calculating the similarity.
- train_new_tokenizer(sentences: List[str]) → int
Train a new tokenizer on a list of sentences.
- Parameters
sentences – List of sentences to train the tokenizer on.
- Returns
New vocabulary size of the tokenizer.
- training: bool
calbert.CalBERTDataset module
- class calbert.CalBERTDataset.CalBERTDataset(base_language_sentences: List[str], target_language_sentences: List[str], labels: Optional[float] = None, negative_sampling: bool = False, negative_sampling_size: float = 0.5, negative_sampling_count: int = 1, negative_sampling_type: str = 'target', min_count: int = 10, shuffle: bool = True)
Bases: Dataset
- compute_vocabulary(min_count: Optional[int] = None) → List[str]
Compute the vocabulary of the dataset by finding the tokens that appear at least min_count times.
- Parameters
min_count – Minimum frequency of a token in the dataset for it to be included in the vocabulary.
- Returns
List of tokens in the dataset appearing at least min_count times.
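The frequency filter described above can be sketched with `collections.Counter`. This is only an illustration: `compute_vocabulary_sketch` is a hypothetical name, and splitting on whitespace and returning the tokens in sorted order are assumptions, since the reference does not say how the dataset tokenizes sentences or orders the result.

```python
from collections import Counter
from typing import List


def compute_vocabulary_sketch(sentences: List[str], min_count: int = 10) -> List[str]:
    """Return tokens that occur at least min_count times (hypothetical helper)."""
    # Assumption: tokens are whitespace-separated words.
    counts = Counter(token for sentence in sentences for token in sentence.split())
    # Keep only tokens that meet the frequency threshold; sort for a stable result.
    return sorted(token for token, count in counts.items() if count >= min_count)


# "a" occurs 3 times, "b" twice, "c" once.
vocabulary = compute_vocabulary_sketch(["a b a", "a c b"], min_count=2)
```

A higher `min_count` keeps rare or misspelled words out of the vocabulary, at the cost of dropping genuinely infrequent terms.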
- get_batch(start: int, end: int) → Tuple[List[str], List[str], Tensor]
Returns a batch of examples from the dataset between the given start and end indices.
- Parameters
start – Start index of the batch in the dataset.
end – End index of the batch in the dataset.
- Returns
A tuple of base language sentences, target language sentences, and labels between the given start and end indices.
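A batch of this shape can be sketched by slicing three parallel lists. The sketch is hypothetical: `get_batch_sketch` is not part of the library, a plain list stands in for the label tensor, and it assumes the usual Python half-open range (`start` included, `end` excluded), which the reference does not confirm.

```python
from typing import List, Tuple


def get_batch_sketch(
    base: List[str],
    target: List[str],
    labels: List[float],
    start: int,
    end: int,
) -> Tuple[List[str], List[str], List[float]]:
    """Slice three parallel lists over the same index range (hypothetical helper)."""
    # Assumption: start is inclusive and end is exclusive, as in Python slicing.
    return base[start:end], target[start:end], labels[start:end]


# Made-up parallel data: three sentence pairs and their labels.
base_sentences = ["a", "b", "c"]
target_sentences = ["x", "y", "z"]
pair_labels = [1.0, 0.0, 1.0]
batch = get_batch_sketch(base_sentences, target_sentences, pair_labels, 0, 2)
```

Slicing all three lists with the same indices keeps each sentence pair aligned with its label.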
- get_tokens() → List[str]
Returns the vocabulary of the dataset computed by compute_vocabulary.
- Returns
List of tokens in the vocabulary.
- static load(path: Union[str, Path]) → CalBERTDataset
Load a CalBERTDataset object from the given path.
- Parameters
path – Path from which to load the dataset object.
- Returns
CalBERTDataset object.