cyzil
Cyzil provides tools that enable quick and in-depth analysis of sequence-to-sequence models, especially machine translation models. It contains a Cython module that provides fast computation of standard metrics such as the BLEU score.
"""
Cyzil provides tools that enable quick and in-depth analysis of
sequence-sequence models, especially machine translation models. It contains
a Cython module that provides fast computation of standard metrics such as
BLEU score.
"""
from .bleu import bleu_sentence, bleu_corpus, bleu_points
from .edit_distance import edit_distance_sentence, edit_distance_corpus, edit_distance_points
__all__ = [
"bleu_sentence",
"bleu_corpus",
"bleu_points",
"edit_distance_sentence",
"edit_distance_corpus",
"edit_distance_points"
]
Functions
def bleu_corpus(...)
Computes the corpus-level BLEU score.
Parameters
reference_corpus, candidate_corpus : list
- A corpus is a list of tokenized sentences, each stored as a list of string tokens. The reference corpus is assumed to contain the correct sequences, and the candidate corpus the sequences generated by some model; each reference-candidate pair is assumed to be stored at the same index.
max_ngram : int
- The maximum n-gram order used to compute the score.
Returns
corpus_score : list
- A list of 3 decimal values: the first is the precision, the second is the brevity penalty, and the last is the BLEU score, which is the product of the two.
Example
>>> from cyzil import bleu_corpus
>>> reference_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'cat']]
>>> candidate_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'dog']]
>>> bleu_corpus(reference_corpus, candidate_corpus, 4)
[0.8806841373443604, 1.0, 0.8806841373443604]
def bleu_points(...)
Computes the BLEU score for each reference-candidate pair in a corpus.
Parameters
reference_corpus, candidate_corpus : list
- A corpus is a list of tokenized sentences, each stored as a list of string tokens. The reference corpus is assumed to contain the correct sequences, and the candidate corpus the sequences generated by some model; each reference-candidate pair is assumed to be stored at the same index.
max_ngram : int
- The maximum n-gram order used to compute the score.
Returns
points : list of BLEU scores, shape [number of pairs in corpus, 3]
- A 2-dimensional list that contains the precision, brevity penalty, and BLEU score for each reference-candidate pair; each row corresponds to one pair.
Example
>>> from cyzil import bleu_points
>>> reference_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'cat']]
>>> candidate_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'dog']]
>>> bleu_points(reference_corpus, candidate_corpus, 4)
[[1.0, 1.0, 1.0], [0.809106707572937, 1.0, 0.809106707572937]]
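A common follow-up to per-pair scoring is locating the weakest translations for manual inspection. A minimal plain-Python sketch, operating on an illustrative points list in the documented [precision, brevity penalty, BLEU] row format (the third row here is invented for the example):

```python
# Illustrative per-pair results in the documented row format:
# [precision, brevity_penalty, bleu]. The third row is made up.
points = [
    [1.0, 1.0, 1.0],
    [0.809106707572937, 1.0, 0.809106707572937],
    [0.5, 0.8, 0.4],
]

# Index of the pair with the lowest BLEU score (last column), so the
# corresponding reference-candidate pair can be inspected by hand.
worst_index = min(range(len(points)), key=lambda i: points[i][2])
print(worst_index)  # -> 2
```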
def bleu_sentence(...)
Computes the sentence-level BLEU score.
Parameters
reference, candidate : list
- A tokenized sentence stored as a list of string tokens. The reference is assumed to be a correct sequence, and the candidate a sequence generated by some model.
max_ngram : int
- The maximum n-gram order used to compute the score.
Returns
list
- A list of 3 decimal values: the first is the precision, the second is the brevity penalty, and the last is the BLEU score, which is the product of the two.
Example
>>> from cyzil import bleu_sentence
>>> bleu_sentence(['this', 'is', 'a', 'test', 'sentence'],
...               ['this', 'is', 'a', 'test', 'sentence'])
[1.0, 1.0, 1.0]
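To make the three returned values concrete, here is a plain-Python sketch of sentence-level BLEU: the geometric mean of clipped n-gram precisions, combined with a brevity penalty. This is an unsmoothed reference implementation for illustration only, not the library's Cython code, and `bleu_sentence_sketch` is a hypothetical name.

```python
from collections import Counter
from math import exp, log

def bleu_sentence_sketch(reference, candidate, max_ngram):
    """Simplified sketch of sentence-level BLEU.

    Returns [precision, brevity_penalty, bleu], mirroring the documented
    output format. Unsmoothed; short sentences with no n-gram match
    collapse to zero.
    """
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    # Geometric mean of clipped n-gram precisions for orders 1..max_ngram.
    log_precision_sum = 0.0
    for n in range(1, max_ngram + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or matched == 0:
            return [0.0, 1.0, 0.0]  # degenerate case; real libraries smooth this
        log_precision_sum += log(matched / total) / max_ngram
    precision = exp(log_precision_sum)

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = exp(1 - len(reference) / len(candidate))
    return [precision, bp, precision * bp]

print(bleu_sentence_sketch(['this', 'is', 'a', 'test', 'sentence'],
                           ['this', 'is', 'a', 'test', 'sentence'], 4))
# -> [1.0, 1.0, 1.0]
```

The degenerate zero-match branch is why production BLEU implementations apply smoothing for short sentences.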
def edit_distance_corpus(...)
Computes the corpus-level edit distance (Levenshtein distance).
Parameters
reference_corpus, candidate_corpus : list
- A corpus is a list of tokenized sentences, each stored as a list of string tokens. The reference corpus is assumed to contain the correct sequences, and the candidate corpus the sequences generated by some model; each reference-candidate pair is assumed to be stored at the same index.
Returns
corpus_score : list
- A list of 2 decimal values: the first is the mean edit distance between the reference corpus and the candidate corpus, and the second is the mean normalized edit distance, i.e. each pair's edit distance divided by the length of its reference, averaged over the corpus.
Notes
This method computes token-level edit distance, which is different from character-level edit distance. For example, the token-level edit distance between ['I', 'have', 'a', 'pen'] and ['I', 'have', 'a', 'dog'] is 1 because only one edit happens, between 'pen' and 'dog'. For character-level edit distance, please refer to https://pypi.org/project/python-Levenshtein/.
Example
>>> from cyzil import edit_distance_corpus
>>> reference_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'cat']]
>>> candidate_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'dog']]
>>> edit_distance_corpus(reference_corpus, candidate_corpus)
[0.5, 0.0714285746216774]
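The two corpus-level values can be reproduced from per-pair results. A minimal sketch in plain Python; the pair data below is illustrative, derived by hand from the example corpus above (the first pair is identical, the second differs by one token), and the result matches the corpus example up to floating-point precision:

```python
# Per-pair (edit_distance, reference_length) values for the example corpus.
pairs = [(0, 5), (1, 7)]

# First value: mean edit distance over all pairs.
mean_distance = sum(d for d, _ in pairs) / len(pairs)

# Second value: each pair's distance divided by its reference length,
# averaged over the corpus.
mean_normalized = sum(d / ref_len for d, ref_len in pairs) / len(pairs)

print([mean_distance, mean_normalized])  # -> [0.5, 0.07142857142857142]
```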
def edit_distance_points(...)
Computes the edit distance (Levenshtein distance) for each reference-candidate pair in a corpus.
Parameters
reference_corpus, candidate_corpus : list
- A corpus is a list of tokenized sentences, each stored as a list of string tokens. The reference corpus is assumed to contain the correct sequences, and the candidate corpus the sequences generated by some model; each reference-candidate pair is assumed to be stored at the same index.
Returns
points : list of edit distances, shape [number of pairs in corpus, 2]
- A 2-dimensional list that contains the edit distance and normalized edit distance (i.e. edit distance divided by the length of the reference) for each reference-candidate pair; each row corresponds to one pair.
Notes
This method computes token-level edit distance, which is different from character-level edit distance. For example, the token-level edit distance between ['I', 'have', 'a', 'pen'] and ['I', 'have', 'a', 'dog'] is 1 because only one edit happens, between 'pen' and 'dog'. For character-level edit distance, please refer to https://pypi.org/project/python-Levenshtein/.
Example
>>> from cyzil import edit_distance_points
>>> reference_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'cat']]
>>> candidate_corpus = [['this', 'is', 'a', 'test', 'sentence'],
...                     ['I', 'see', 'an', 'apple', 'and', 'a', 'dog']]
>>> edit_distance_points(reference_corpus, candidate_corpus)
[[0.0, 0.0], [1.0, 0.1428571492433548]]
def edit_distance_sentence(...)
Computes the sentence-level edit distance (Levenshtein distance).
Parameters
sen1, sen2 : list
- A tokenized sentence stored as a list of string tokens.
Returns
distance : int
- The edit distance between the reference sentence and the candidate sentence.
Notes
This method computes token-level edit distance, which is different from character-level edit distance. For example, the token-level edit distance between ['I', 'have', 'a', 'pen'] and ['I', 'have', 'a', 'dog'] is 1 because only one edit happens, between 'pen' and 'dog'. For character-level edit distance, please refer to https://pypi.org/project/python-Levenshtein/.
Example
>>> from cyzil import edit_distance_sentence
>>> edit_distance_sentence(['this', 'is', 'a', 'test', 'sentence'],
...                        ['this', 'is', 'a', 'test', 'sentence'])
0
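For reference, token-level Levenshtein distance can be written in a few lines of plain Python using a rolling one-dimensional table. This is a sketch of the quantity edit_distance_sentence computes, not the library's faster Cython implementation, and `edit_distance_sketch` is a hypothetical name.

```python
def edit_distance_sketch(sen1, sen2):
    """Token-level Levenshtein distance between two token lists."""
    m, n = len(sen1), len(sen2)
    # dp[j] holds the distance between the current prefix of sen1 and sen2[:j].
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cost = 0 if sen1[i - 1] == sen2[j - 1] else 1
            prev_diag, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                          dp[j - 1] + 1,     # insertion
                                          prev_diag + cost)  # substitution
    return dp[n]

print(edit_distance_sketch(['I', 'have', 'a', 'pen'],
                           ['I', 'have', 'a', 'dog']))  # -> 1
```

The rolling table keeps memory linear in the length of the second sentence, which is the usual trade-off when only the final distance (not the alignment) is needed.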