Master thesis task CBT
Basics
- Children’s book test [^@taskCBT]
- Will use label-studio for any annotations if needed later on
- Sources for UA text:
- казки - Search - Anna’s Archive
- For now I'll use the text of Казки добрих сусідів. Золотоголова рибка: Вірменські народні казки - Anna’s Archive to write the code; later I'll find a well-OCR'd version of some book from the 1980s or something else likely to be out of copyright
- DE: library.lol/fiction/44e759bb147893cd46bf3549894f3706
- For inflecting other words into the correct morphological form, pymorphy2 seems nice! (User guide: Руководство пользователя — Морфологический анализатор pymorphy2)
TODOs
- deduplicate options etc. by lemma (синку-син-??)
- gender of the noun giving hints!
- the bits below
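A minimal sketch of the lemma-based deduplication from the TODO above, so that forms like синку and син don't both end up among the options. The lemmatizer is passed in as a function; `toy_lemmatize` and its table are made-up stand-ins for a real lemmatizer (e.g. spacy's `token.lemma_`), and `dedupe_by_lemma` is a hypothetical helper name.

```python
from __future__ import annotations

from typing import Callable, Iterable


def dedupe_by_lemma(options: Iterable[str], lemmatize: Callable[[str], str]) -> list[str]:
    """Keep the first option seen for each lemma, preserving input order."""
    seen: set[str] = set()
    unique: list[str] = []
    for option in options:
        lemma = lemmatize(option.lower())
        if lemma not in seen:
            seen.add(lemma)
            unique.append(option)
    return unique


# Toy lemma table standing in for a real lemmatizer (hypothetical data).
TOY_LEMMAS = {"синку": "син", "сина": "син", "царю": "цар"}


def toy_lemmatize(word: str) -> str:
    """Look the word up in the toy table; unknown words are their own lemma."""
    return TOY_LEMMAS.get(word, word)


options = ["син", "синку", "цар", "хлопець", "царю"]
print(dedupe_by_lemma(options, toy_lemmatize))
```

Keeping the first form seen is an arbitrary choice; for CBT options it may be better to keep the form that matches the gap's morphology.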
Issues/Problems/TODOs
Multiple possible answers
— Синку, як ти мене знайшов? — запитав батько. — Коли вже так, віднеси обід до джерела, я туди прийду і поїмо
QUESTION: — Ні, батьку, — сказав ______ .
OPTIONS: {'хлопець', 'хлопчик', 'син', 'цар'}
Complex structures
Будь ______ , пообідайте з нами!', options={'ласка', 'ножа', 'жаль', 'візир', 'дозволь'}, answer='ласка')
Unknown/unknowable answer
context = 'Ein Mann und eine Frau hatten einen goldenen Ring. Das war ein
Glücksring, und wer ihn besaß, hatte immer genug zu leben. Sie
wußten es aber nicht und verkauften den Ring für wenig Geld. Kaum
war der Ring aus dem Hause, da wurden sie immer ärmer und wußten
schließlich nicht mehr, woher sie genug zum Essen nehmen sollten.
Sie hatten auch einen Hund und eine Katze, die mußten mit ihnen
Hunger leiden. Da ratschlagten die Tiere miteinander, wie sie den
Leuten wieder zu ihrem alten Glück verhelfen könnten.'
In label-studio I'll use "Label all tasks", which shows me the next CBT task after I submit.
Keybindings are nice for classifying text.
When importing the tasks, I should try text highlighting or something similar to make them easier to read.
Code notes
Multiple hard options
Sometimes pymorphy2 returns multiple parses for the same word:
[
Parse(
word='корів',
tag=OpencorporaTag('NOUN,inan plur,gent'),
normal_form='кір',
score=1.0,
methods_stack=((DictionaryAnalyzer(), 'корів', 498, 11),)
),
Parse(
word='корів',
tag=OpencorporaTag('NOUN,anim plur,gent'),
normal_form='корова',
score=1.0,
methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 8),)
),
Parse(
word='корів',
tag=OpencorporaTag('NOUN,anim plur,accs'),
normal_form='корова',
score=1.0,
methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 10),)
)
]
I can find the right one:
- by morphology gotten from spacy?
- by base form gotten from spacy? (token.lemma_)
- there’s no usable score for the UA language in pymorphy (every parse above has score=1.0), but they have an article on how to do this in general: Руководство пользователя — Морфологический анализатор pymorphy2
- I think my use of spacy might be relatively novel there, haha
- If I have to get an intermediate representation:
- pymorphy’s OpenCorporaTags is part of OpenCorpora: открытый корпус русского языка ("open corpus of the Russian language"), which is a Russian corpus project
- OpenCorpora/russian-tagsets: Russian morphological tagset converters library, a library for converting between tagsets
- And Universal Dependencies seems to be a neutral one?..
- spacy uses FEATS, which is basically Universal Dependencies: https://spacy.io/api/morphology#morphanalysis links to CoNLL-U Format,
- Universal features, so cool
- Spacy’s FEATS <-> dict etc. code: spaCy/spacy/morphology.pyx at master · explosion/spaCy
- lang-uk/ukrainian-heteronyms-dictionary: Dictionary of heteronyms in the Ukrainian language, i.e. a dictionary of cases like птАшки/пташкИ that differ only in stress
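The disambiguation idea from the list above (use spacy's lemma and morphology to pick one of the pymorphy2 parses) could be sketched roughly like this. Everything here is a stand-in: `Candidate` mimics only the two `Parse` fields used, `UD_TO_OPENCORPORA` is a tiny hand-written mapping covering just this example (not a real tagset converter), and `pick_candidate` is a hypothetical helper. In real code the lemma would come from `token.lemma_` and the FEATS string from `str(token.morph)`.

```python
from dataclasses import dataclass

# Partial UD -> OpenCorpora grammeme mapping; only what this example needs
# (an assumption for illustration, not a complete converter).
UD_TO_OPENCORPORA = {
    ("Case", "Nom"): "nomn",
    ("Case", "Gen"): "gent",
    ("Case", "Acc"): "accs",
    ("Number", "Plur"): "plur",
    ("Animacy", "Anim"): "anim",
    ("Animacy", "Inan"): "inan",
}


@dataclass(frozen=True)
class Candidate:
    """Stand-in for a pymorphy2 Parse: lemma plus a set of grammemes."""
    normal_form: str
    grammemes: frozenset


def parse_feats(feats: str) -> dict:
    """Turn a UD FEATS string such as 'Case=Gen|Number=Plur' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def pick_candidate(candidates, lemma: str, feats: str) -> Candidate:
    """Prefer candidates with a matching lemma, then the most shared grammemes."""
    wanted = {
        UD_TO_OPENCORPORA[item]
        for item in parse_feats(feats).items()
        if item in UD_TO_OPENCORPORA
    }

    def score(candidate: Candidate):
        return (candidate.normal_form == lemma, len(wanted & candidate.grammemes))

    return max(candidates, key=score)


# The three parses of 'корів' shown earlier, reduced to lemma + grammemes.
candidates = [
    Candidate("кір", frozenset({"NOUN", "inan", "plur", "gent"})),
    Candidate("корова", frozenset({"NOUN", "anim", "plur", "gent"})),
    Candidate("корова", frozenset({"NOUN", "anim", "plur", "accs"})),
]
best = pick_candidate(candidates, "корова", "Animacy=Anim|Case=Acc|Gender=Fem|Number=Plur")
print(best)
```

Ranking by lemma first and by shared grammemes second is a design choice of this sketch; ties fall back to the first candidate in the list.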
2023-11-29 11:46
- ChatGPT suggested using spacy for this, which led to googling for other options:
- pyinflect · PyPI
- links to bjascob/LemmInflect: A python module for English lemmatization and inflection., allegedly a better one
- English-only
- I so should mention how many of the inflection libraries don’t follow the Bender rule, so I have to guess whether they are English-only or not. pyinflect doesn’t follow it
- tooleks/shevchenko-js: JavaScript library for declension of Ukrainian anthroponyms, but only for proper names
- Nothing exists, ChatGPT lied multiple times
- Ukrainian UD
- UD_Ukrainian-IU TODO
- Jena Oct 2003 orga for UA NLP: About / UkrNLP-Corpora TODO ADD
- pymorphy2 code is quite interesting and I have much to learn from it! Decided then.
- Writing the package
- TIL that what spacy calls AUX is a VERB for pymorphy. The example is “був”. Interesting
- Finished
- Uploaded to pchr8/pymorphy-spacy-disambiguation: A package that picks the correct pymorphy2 morphology analysis based on morphology data from spacy, added to UA-CBT poetry dependencies!
- I think I’ll be able to use this also for the фемінітиви (feminitives) task
- Adding the newly created disambiguator to CBT-UA
- I save replacements as strings, without context - I need to save the spacy token to make use of it there too
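One way to keep the context with each replacement, as a minimal sketch: store the fields the disambiguator needs next to the surface string. `Replacement` and its field names are hypothetical, and the values below are illustrative, not real spacy output; in practice they would be filled from `token.text`, `token.lemma_`, `str(token.morph)` and `token.sent.text`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replacement:
    """A candidate gap filler together with the context it came from."""
    text: str      # surface form as it appeared in the story
    lemma: str     # base form, e.g. spacy's token.lemma_
    feats: str     # UD FEATS string, e.g. str(token.morph)
    sentence: str  # sentence the token was taken from


# Illustrative values only (not produced by a real pipeline run).
r = Replacement(
    text="синку",
    lemma="син",
    feats="Case=Voc|Number=Sing",
    sentence="Синку, як ти мене знайшов?",
)
print(r.lemma, r.feats)
```

Storing plain strings instead of the spacy token itself keeps the objects hashable and easy to serialize, at the cost of losing access to the rest of the parsed document.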
2-3-4 and multiple plurals
- I can’t get pymorphy2 to inflect from singular to plural at all, which reminds me of the fact that it’s not just singular/plural
- Dual (grammatical number) - Wikipedia
make_agree_with_number
pymorphy2.analyzer — Морфологический анализатор pymorphy2 uses self.inflect(self.tag.numeral_agreement_grammemes(num))
(Pdb++) t.tag.numeral_agreement_grammemes(1)
{'sing', 'nomn'}
(Pdb++) t.tag.numeral_agreement_grammemes(2)
{'sing', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(3)
{'sing', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(4)
{'sing', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(5)
{'plur', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(6)
{'plur', 'gent'}
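The grammemes above follow the Russian pattern (2 стола, 5 столов). Ukrainian, as far as I know, uses the nominative plural after 2-4 (два столи) and the genitive plural from 5 on (п'ять столів), so a Ukrainian-specific rule might look like the sketch below. `ua_numeral_agreement` is a hypothetical helper, not part of pymorphy2, and it only covers a numeral phrase in the nominative.

```python
def ua_numeral_agreement(num: int) -> set:
    """Grammemes a noun needs after the cardinal `num` (nominative phrase only)."""
    last_two = num % 100
    last = num % 10
    if last == 1 and last_two != 11:
        # 1, 21, 31, ...: один стіл
        return {"sing", "nomn"}
    if 2 <= last <= 4 and not 12 <= last_two <= 14:
        # 2-4, 22-24, ...: два столи
        return {"plur", "nomn"}
    # 0, 5-20, 25-30, ...: п'ять столів
    return {"plur", "gent"}


for n in (1, 2, 5, 11, 21, 22):
    print(n, sorted(ua_numeral_agreement(n)))
```

The result could then be passed to `inflect(...)` on a parse, assuming the parse can be inflected to those grammemes at all.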
- Узгодження власнекількісних простих числівників з іменниками — урок. Українська мова, 6 клас НУШ. (a 6th-grade Ukrainian lesson on agreement of simple cardinal numerals with nouns)
- I suspect that for Ukrainian, pymorphy2 doesn’t tag singular forms as singular, and therefore can’t make a word agree with numbers that need the singular. Inflecting to plural, and parsing words that are plural to begin with, works fine!
(Pdb++) self.morph.parse("стіл")[1].inflect({'plur'}).tag.number
'plur'
(Pdb++) self.morph.parse("стіл")[1].tag.number
(Pdb++)
Yes, the singular forms in the lexeme have no number grammeme at all:
(Pdb++) pp self.morph.parse("столи")[1].lexeme
[Parse(word='стіл', tag=OpencorporaTag('NOUN,inan masc,nomn'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стіл', 2710, 0),)),
Parse(word='стола', tag=OpencorporaTag('NOUN,inan masc,gent'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стола', 2710, 1),)),
Parse(word='столу', tag=OpencorporaTag('NOUN,inan masc,gent'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столу', 2710, 2),)),
Parse(word='столові', tag=OpencorporaTag('NOUN,inan masc,datv'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столові', 2710, 3),)),
Parse(word='столу', tag=OpencorporaTag('NOUN,inan masc,datv'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столу', 2710, 4),)),
Parse(word='стіл', tag=OpencorporaTag('NOUN,inan masc,accs'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стіл', 2710, 5),)),
Parse(word='стола', tag=OpencorporaTag('NOUN,inan masc,accs'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стола', 2710, 6),)),
Parse(word='столом', tag=OpencorporaTag('NOUN,inan masc,ablt'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столом', 2710, 7),)),
Parse(word='столі', tag=OpencorporaTag('NOUN,inan masc,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столі', 2710, 8),)),
Parse(word='столові', tag=OpencorporaTag('NOUN,inan masc,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столові', 2710, 9),)),
Parse(word='столу', tag=OpencorporaTag('NOUN,inan masc,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столу', 2710, 10),)),
Parse(word='столе', tag=OpencorporaTag('NOUN,inan masc,voct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столе', 2710, 11),)),
Parse(word='столи', tag=OpencorporaTag('NOUN,inan plur,nomn'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столи', 2710, 12),)),
Parse(word='столів', tag=OpencorporaTag('NOUN,inan plur,gent'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столів', 2710, 13),)),
Parse(word='столам', tag=OpencorporaTag('NOUN,inan plur,datv'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столам', 2710, 14),)),
Parse(word='столи', tag=OpencorporaTag('NOUN,inan plur,accs'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столи', 2710, 15),)),
Parse(word='столами', tag=OpencorporaTag('NOUN,inan plur,ablt'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столами', 2710, 16),)),
Parse(word='столах', tag=OpencorporaTag('NOUN,inan plur,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столах', 2710, 17),)),
Parse(word='столи', tag=OpencorporaTag('NOUN,inan plur,voct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столи', 2710, 18),))]
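In the lexeme above only the plural forms carry a number grammeme (plur); the singular ones have none. A possible workaround, as a minimal sketch: treat a noun tag without plur as singular. `infer_number` is a hypothetical helper that works on the tag strings as printed above, not a pymorphy2 API, and it ignores special cases such as nouns that only exist in the plural.

```python
from typing import Optional


def infer_number(tag: str) -> Optional[str]:
    """Guess the number of a noun from an OpenCorpora-style tag string."""
    # Tags look like 'NOUN,inan masc,nomn': commas and a space separate grammemes.
    grammemes = tag.replace(" ", ",").split(",")
    if grammemes[0] != "NOUN":
        return None  # only nouns are handled in this sketch
    return "plur" if "plur" in grammemes else "sing"


for tag in ("NOUN,inan masc,nomn", "NOUN,inan plur,gent"):
    print(tag, "->", infer_number(tag))
```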
- Maybe it’s an artifact in conversion?
- In either case I think I see the problem
ChatGPT... may help.