Diensttagebuch

Day 1809 (14 Dec 2023)

Asking ChatGPT to make its own prompts is a superpower

Used it 3 times already and it’s awesome.

— If I’ll want your help with this in the future, which prompt can I use to describe the task I need and the output type to get a graph in the format and complexity level of the one you just generated? How can I concisely describe it to you so that no clarifications will be needed and you can just give me the answer?

— “Create an abstract graph structure for a story involving multiple characters with interconnected goals, challenges, outcomes, and a moral lesson. The graph should use nodes and relationships similar to the format of the ‘Adventurer and Guide’ mountain climbing story you previously created, with entities, goals, challenges, interactions, outcomes, and a moral lesson. The structure should reflect underlying themes rather than the literal narrative, similar to the complexity and abstraction level of the previous example.”

After more clarifications:

“Generate an abstract graph structure for a narrative involving multiple animate characters. The graph should include nodes for entities, goals, challenges, interactions, outcomes, and moral lessons. Each node should abstractly represent the core elements of the story, focusing on thematic and moral aspects rather than the literal narrative. The format should be similar to a semantic web ontology, emphasizing relationships and abstract concepts. Please provide the graph in a Python dictionary format, with complexity and depth akin to an advanced semantic network.”

Context: 231024-1704 Master thesis task CBT

Day 1808 (13 Dec 2023)

Ukrainska Pravda dataset

This post describes the Ukrainska Pravda dataset I created as part of my Master’s Thesis. The contents of this blog post will be edited (esp. for brevity) and become part of the thesis (230928-1745 Masterarbeit draft).

Ukrainska Pravda articles classification

A novel dataset created in the context of this Master’s Thesis is the Ukrainska Pravda multilingual dataset. The package written for this, UPCrawler, is released at (https://github.com/pchr8/up_crawler) under the MIT license.

The dataset is released on the HF Hub at https://huggingface.co/datasets/shamotskyi/ukr_pravda_2y / doi https://doi.org/10.57967/hf/1476 under the CC BY-NC 4.0 license.

Ukrainska Pravda

Ukrainska Pravda (lit. “Ukrainian Truth”; https://www.pravda.com.ua/) is a Ukrainian online newspaper for a general readership writing, mostly, about political and social topics.

In 2017, it was in the eighth most cited source of the Ukrainian Wikipedia¹ and in 2020 it was the most visited online news website in Ukraine²(TODO - better source). The Institute of Mass Information listed Ukrainska Pravda listed it among the six online editions with the highest level of compliance with professional journalistic standards in 2021.³

Website structure

UP (Ukrainska Pravda) publishes articles predominantly in Ukrainian, with some being translated to Russian and English. Each article can belong to zero or more “topics” (tags) that are mostly preserved across translations.

Each article has an article ID that is constant across translations.

Crawling

The crawler interface

The CLI interface expects a date range (using natural language, e.g. “last year”) and a target folder, where the pages are saved.
2023-12-11-025753_1919x407_scrot 1.png

Getting URLs of articles to crawl with Sitemaps

Initially, the package UPCrawler used the daily archive pages (e.g. https://www.pravda.com.ua/archives/date_27092023/) to get the URLs of articles published on a specific day, then for each article URL accessed the expected locations of the Russian and English translations to check if a translation exists. Later, I rewrote the code to use a much better solution: parsing the XML sitemaps (e.g. https://www.pravda.com.ua/sitemap/sitemap-2023-09.xml.gz) using the advertools Python package.

Sitemaps⁴ is a XML-based protocol used to inform search engines about the URLs available for web crawling, as well as provide additional information about it such as when was the page last updated, how often does the content change, etc.

The following regex (see https://regex101.com/r/dYlIiF/4 for an interactive analysis) is used to parse each URL to get the language of the article, the article ID, the section (news, podcasts, ..) etc.:

URI_REGEX_STR_EXT = r"(?P<uri>(?P<domain>.*\.com\.ua\/)(?P<lang>(eng)|(rus))?\/?(?P<kind>.*?)\/(?P<art_id>.*(?P<date_part>....\/..\/..?)\/(?P<id>.*)\/))"

Crawling the individual articles

Crawling the articles is done using the beautifulsoup4 library. I considered the alternative option of using the newspaper3k package which was able to detect the article, title and metadata from UP surprisingly well, but it incorrectly detected some fields (which would have required manual fixes anyway), so I decided to keep my from scratch implementation.

For transparency and in the spirit of ethical crawling⁵, there were timeouts between requests, and the unique useragent contained a short explanation of my project as well as my email. At no point was I ever contacted or the crawler blocked.

The most challenging part were the tags. The URL of each tag contained a unique identifier that was consistent between translations.

Processing steps

The article text inside <article> was taken from each page. The content of the tags <p> and <li> were used to extract the plaintext while avoiding advertisements, infoboxes etc.

Paragraphs matching some standard article endings like “follow us on Twitter” weren’t added to the plaintext, but not all such endings were filtered out.

The tags required special care because they presented two problems:

There were pages with lists of tags in Ukrainian and Russian⁶ but not English
Some tags had translations to other languages, some didn’t.

Since this was supposed to be a multilingual dataset I wanted to have a list of tags for each article independent on the translations. The solution at the end was to crawl Ukrainian and Russian tags pages to save the short unique ID and both translations, and add English translations to the short IDs when they were seen in the English translations of articles.

An example tag and three translations:

{"ukr":["флот","/tags/flot/"],"rus":["флот","/rus/tags/flot/"],"eng":["naval fleet","/eng/tags/flot/"]}

The UPravda multilingual dataset

Dataset description

The UPravda multilingual dataset contains in total XX individual translations of YY articles. X articles have a Ukrainian version, Y a Russian and Z an Engish one.

The dataset has X individual tags, of which the most common ones are shown in the table below: TODO

The dataset contains articles published from the 01.01.2022 to X, since UP drastically increased the amount of articles translated to English after the start of the full-scale invasion on the 24.02.2022 ⁷ , (see picture below; TODO better x axis angle on plot).

Mitigations of issues found in multilingual datasets

A recent (2022) manual audit of available crawled multilingual datasets found surprisingly low amounts of in-language data and systematic issues in many of them. ⁸

Some issues raised in the paper in the context of this dataset:

Using standard unambiguous ISO 639-3 language codes (ukr, rus, eng). ISO 639-3 was chosen instead of the more common ISO 639-1 (uk, ru, en) because of the possibly ambiguous ‘uk’ that can be associated with Great Britain as well. Interestingly, the more familiar ‘UA’ is a valid ISO code for the country, but not the language.
The language identification was performed from the URL of the page (in turn labeled by UP), not through automated language identification processes (especially relevant in light of the ukr/rus disambiguation issues discussed in section XXX)
The texts themselves were written by proficient language users, as opposed to automated translations.
The dataset is digital-first(TODO word for this): no errors were introduced by OCR, incorrect layout parsing(TODO cite FinDE) or similar.
I manually checked random articles from the dataset to make sure the different translations are indeed text, in the correct languages, and actually refer to the same article.

Licensing questions

According to Ukrainian law, newpaper-like articles aren’t subject to copyright. According to UP’s rules on the matter⁹, reprinting (..) in other online-newspapers is free but requires a link to the UP article not later than the second paragraph. Using the materials for commercial reasons is forbidden.

I believe releasing this dataset under the CC BY-NC 4.0 license (that allows sharing and adaptation only with attribution and for non-commercial use), with clear attribution to UP in the name and the description of the dataset, fulfills the applicable obligations both in letter and in spirit.

The dataset is released at https://huggingface.co/datasets/shamotskyi/ukr_pravda_2y

Similar datasets

Appendix A: regexes for skipping paragraphs in UPravda dataset

Some UP articles have short paragraphs in the style of “follow us on Twitter” at the end. They have little to do with the actual article, so they were removed from the article text in the dataset.

All paragraphs containing text matching any of the lines/regexes below were filtered out:

"Follow (us|Ukrainska Pravda) on Twitter",
"Support UP",
"become our patron",
"(читайте|слухайте|слушайте) (також|также)",  # "read/listen also to", in Russian and Ukrainian

Ways to make a downstream task out of this

Tags
- News title|text -> tag
- cons: the tags UP uses seem chaotic and inconsistent?…
Title
- Match title to news text
- Match title to rephrased/summarized news text

ChatGPT prompts for rephrasing the news

It suggested (https://chat.openai.com/share/2f6cf1f3-caf5-4e55-9c1b-3dbd6b73ba29) to me this prompt:

Будь ласка, перефразуйте цей текст, змінюючи порядок інформації та структуру повідомлення, уникаючи збігів слів та фразових конструкцій з оригіналом. Фокусуйтеся лише на ключових фактах, уникаючи зайвих деталей:

An improved version that seems to work ~better(https://chat.openai.com/share/14f12f87-50a8-438c-9d01-a0b076c3be12) :

Будь ласка, перефразуйте цей текст, змінюючи порядок інформації та структуру повідомлення, максимально уникаючи збігів слів та фразових конструкцій з оригіналом. Довжина статті має бути приблизно такою ж, як довжина оригіналу.

GPT3.5 works just as well if not better than GPT4 (and is much faster): https://chat.openai.com/share/78927782-25fa-4047-b2a4-fd01ee9a7a54

Can I also use this to generate tasks for the UA-CBT (231024-1704 Master thesis task CBT) task?

Here GPT4 is much better than GPT3. Can’t share either link because “disabled by moderation”(???).

Interestingly, GPT3.5 used definitely Russian chiches that I document in 231214-1251 Masterarbeit benchmark task for Russian-Ukrainian interference.

Eval downstream task decision

231010-1003 Masterarbeit Tagebuch#2023-12-15

Solution: article text -> title, out of X options
- give ~10 options with
  - ~3 random from the dataset
  - ~7 from similar articles from the dataset, e.g. all of the same topic ‘war’

<_(@inbook) “Analysis of references across wikipedia languages” (2017) / Włodzimierz Lewoniewski, Krzysztof Węcel, Witold Abramowicz: z / / 10.1007/978-3-319-67642-5_47 _> ↩︎
Рейтинг топсайтів України | Інститут масової інформації, linked on Українська правда — Вікіпедія ↩︎
Compliance with professional standards in online media. The fourth wave of monitoring in 2021 | Institute of Mass Information ↩︎
<_(@Schonfeld2009) “Sitemaps: Above and beyond the crawl of duty” (2009) / Uri Schonfeld, Narayanan Shivakumar: z / https://dl.acm.org/doi/10.1145/1526709.1526842 / 10.1145/1526709.1526842 _> ↩︎
Ethics in Web Scraping. We all scrape web data. Well, those of… | by James Densmore | Towards Data Science ↩︎
https://www.pravda.com.ua/tags/; https://www.pravda.com.ua/rus/tags/ ↩︎
https://www.pravda.com.ua/eng/archives/year_2022/ ↩︎
<_(@10.1162/tacl_a_00447) “Quality at a glance: An audit of web-crawled multilingual datasets” (2022) / Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi: z / https://doi.org/10.1162/tacl_a_00447 / 10.1162/tacl_a_00447 _> ↩︎
Правила використання матеріалів сайтів Інтернет-холдингу ‘‘Українська правда’’ (Оновлено) | Українська правда ↩︎

Day 1806 (11 Dec 2023)

Backing up a wordpress installation with wp-cli

Wordpress-Backups mit der WordPress-CLI – emsgold:

# check if upgrading is needed
wp core check-update 
wp plugin status 
wp theme status 

# DB
wp db export

# the entire website files
touch backupname.tar.gz
tar --exclude=backupname.tar.gz -vczf backupname.tar.gz .

poetry running scripts after building python package

Was looking for a way to do this but it’s part of the batteries included: Pluralsight Tech Blog | Python CLI Utilities with Poetry and Typer

If you define run points in the pyproject.toml

[tool.poetry.scripts]
up_get_uris = "up_crawler.get_uris:main"
up_crawl_uris = "up_crawler.bs_oop:main"
up_run = "up_crawler.__main__:main"
up_convert = "up_crawler.up_reader:main"

Then once you install the package you built with poetry build elsewhere, these commands will be registered as cli commands, and then you’ll be able to just run up_run --help and it’ll work!

Pytest logging output through CLI

I come back to the topic every once in awhile, but this time How To Use Pytest Logging And Print To Console And File (A Comprehensive Guide) | Pytest With Eric gave me the only solution I’ll ever need:

poetry run pytest --log-cli-level=INFO

which works as-is without any additional packages etc.

Day 1804 (09 Dec 2023)

Tenacity: a retrying library for python

jd/tenacity: Retrying library for Python¹:

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    before_sleep_log,
)

# ... 

@retry(
	stop=stop_after_attempt(10),  # Maximum number of retries
	wait=wait_exponential(multiplier=1, min=1, max=60),  # Exponential backoff
	before_sleep=before_sleep_log(logger, logging.INFO),
)
@staticmethod
def do_basic_uri_ops_when_crawling(
	# ...
	pass

This is much better than the various retrying mechanisms in requests (e.g. needing session adapters: Handling Retries in Python Requests – Majornetwork), and likely better than most reinvented wheels (231206-1722 Overengineered solution to retrying and exceptions in python).

Ty Python Requests: Retry Failed Requests [2023] - ZenRows ↩︎

Day 1802 (07 Dec 2023)

Notes to self and lessons learned, OOP and programming in general

I decided that I should go back to the digital garden roots of this, and use this note as a small journey of conceptual/high-level things that I believe would make me a better programmer.

And that I’ll re-read this every time I think of something to add here.

The master thesis has given me ample occasions to find out about these things, and will give me ample occasions to use them before it’s over. Just like with dashes (231205-1311 Notes from paper review#Hyphens vs dashes (em-dash, en-dash)), practiced enough it will stick.

OOP (2023-12-07)

(the post that started this page)

After refactoring my third program to use OOP this quarter, this be the wisdom:

If I'm starting a one-time simple project that looks like it doesn't need OOP - think hard, because often it does.

(Unless threads/parallelism, then it means think harder).

Crawling and converting and synchronicity (2023-12-08)

Context: UPCrawler & GBIF downloader

TL;DR: downloading bits and writing to disk each is sometimes better than to keep them in a dataframe-like-ish structure that gets written to disk in bulk. And the presence of a file on disk can be signal enough about its state, making separate data structures tracking that unneeded.

Background:

When downloading something big and of many parts, my first instinct is/was to put it into pretty dataclasses-like structures, (maybe serializable through JSONWizard), collect it and write it down.

If I think I need some intermediate results, I’d do checkpoints or something similar, usually in an ugly function of the dataframe class to do file handling etc.

Often one can download the individual bit and write it to disk, maybe inside a folder. Then a check of whether it has been downloaded would be literally a check if the file exists, making them self-documenting in a small way.

(And generally - previously I had this when writing certain converters and the second worst thing I have written in my life - I’d have dataclasses with kinds of data and separate boolean fields with has_X_data and stuff. I could have just used whether the data fields are None to signify if they are there or not instead of …that.)

Synchronicity and threads

Doing it like that makes they can happily be parallelized or whatever, downloaded separately.

In the UPCrawler, I was blocked by the need to add to each article a language-independent tag, that was an URI and one to two translations. I wanted to get the entire chunk, gather all translations of tags from them, label the chunks correctly, then serialize.

This is idiotic if I can just download the articles with the info I have to disk and then run a separate script to gather all tags from them and do this. (Or I can gather the tags in parallel while this is happening but don’t let the need to complete it block my download)

Shortcuts (2023-12-08)

Context: UPCrawler; a pattern I’ve been noticing.

Sitemaps instead of crawling archives

First I crawled and prased pages like Архив 26 сентября 2023 года | Украинская правда to get the URI of the articles published on that day, did permutations of the URI to get the other languages if any, and got the list of URIs of articles to crawl.

Yesterday I realized that UPravda has sitemaps: https://www.pravda.com.ua/sitemap/sitemap-2023-04.xml.gz, and that I can use something like advtools to nicely parse them, and advtools gave me back the data as a pandas DataFrame — leading me to the insight that I can analyze parse regex etc. the uris using pandas. Including things like groupby article ID to give me immediately the 1..3 translations of that article. Instead of me needing to track it inside a (guess what) datastructure based on dataclasses.

This inspired me to look for better solutions of another problem plaguing me - tags, with their UK and RU translations.

Lesson in all that

Make an effort — really, an effort - to look at the forest, and for each problem think if there’s an easier way to do that than the one I started implementing without thinking. Including whether there are already structures in place I know about but from other contexts.

I learned to look for solutions inside python stdlib, remembering about this at the right moments should be easy as well.

Я ускладнюю все, до чого торкаюсь (2023-12-08)

A lot of my code is more complex than needed, and to heavy for its own good/purpose. Connected to the above: think (draw? architect?) of a good design before I start writing the code. A sound structure from the beginning will remove many of the corner cases that end up in ugly code to maintain.

Use a real IDE as soon as needed (2024-01-19)

In the context of 240118-1516 RU interference masterarbeit task embeddings mapping, especially given that the models take a while to load.

A Jupyter notebook would have allowed me to experiment much better with the loaded models than a pdbpp interpreter/command line.
Pycharm would have allowed me to debug inside gensim and transmat, and therefore understand them, much better and earlier.

You can add notes to exceptions

8. Errors and Exceptions — Python 3.12.0 documentation:

try:
...     raise TypeError('bad type')
... except Exception as e:
...     e.add_note('Add some information')
...     e.add_note('Add some more information')
...     raise

A function deep down can raise the exception, then a function higher up can catch it and add more details (uri returns 404 -> when downloading image $image we got a 404).

This solves so many conceptual problems I’ve been having!

requests and urllib333 exceptions adventures

If I’m looking at this, the tenacity library post for retrying may be relevant as well (todo link).

So, TIL:

urllib3’s ConnectionError is now called ProtocolError, but aliased for backward compatibility
requests has also a ConnectionError! A totally different one I think

from requests.exceptions import ConnectionError, HTTPError
from urllib3.exceptions import ProtocolError, NameResolutionError, MaxRetryError

urllib3
- ProtocolError
  - RemoteDisconnected
- NameResolutionError - usually temporary
- MaxRetryError
requests:
- ConnectionError - catchall for all of the above?
- HTTPError - 404 and friends

requests exceptions docs: Developer Interface — Requests 2.31.0 documentation urllib3 exceptions: Exceptions and Warnings - urllib3 2.1.0 documentation

TODO: Interaction between MaxRetryError <-> other urllib3 errors - who raises whom, and what does requests do with it (I think raising a ConnectionError but am not sure)

Python filtering logging logs and warnings

zc
zc/it

Filtering logging messages in Python

class LoggingFilter(logging.Filter):
    def filter(self, record):
        if "Connection pool is full" in record.getMessage():
            return False

logger_cpool = logging.getLogger("urllib3.connectionpool")
logger_cpool.addFilter(LoggingFilter())

All filters are applied until one returns False, then the record is silenced, otherwise it gets logged normally.
Any magic can be done, incl. regex etc.!
LogRecord attributes¹ had a logrecord.message but I didn’t have it in my case (todo), but I found getMessage().

Getting the name of your logger to apply the filter to

How to list all existing loggers using python.logging module - Stack Overflow:

import logging

loggers = [logging.getLogger(name) for name in logging.root.manager.loggerDict]

Warnings

Temporarily supressing warnings

warnings — Warning control — Python 3.12.0 documentation:

import warnings

def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()

Dealing with warnings through logging

logging — Logging facility for Python — Python 3.12.0 documentation

# capture is True for enabling, False for disabling
logging.captureWarnings(capture)

# all warnings will become logs from logger 'py.warnings' with severity WARN

logging — Logging facility for Python — Python 3.10.13 documentation ↩︎