In the middle of the desert you can say anything you want
For the paper I’m writing, I’ll actually try to do a real garden thing. With leaves etc that get updated with new info, not chronologically like my current DTB notes.
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
Closer to the end there's a discussion of LM metrics vs. performance on downstream tasks:
Perplexity is the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words in the test set.
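To make that concrete, a toy sketch of my own (not from the article): perplexity as the inverse probability of the test set, normalized by its length, computed in log space for numeric stability.

import math

def perplexity(token_probs: list[float]) -> float:
    # PPL = (p_1 * p_2 * ... * p_N) ** (-1 / N), done in log space
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

perplexity([0.25, 0.1, 0.5])  # ~4.31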
Perplexity limitations and ways to get around them / smoothing:
As a result, the bigram probability values of those unseen bigrams would be equal to zero making the overall probability of the sentence equal to zero and in turn perplexity to infinity. This is a limitation which can be solved using smoothing techniques.
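A toy sketch of my own (not the article's) of what add-one (Laplace) smoothing looks like for bigrams: every count gets +1 and the denominator gets +V, so unseen bigrams end up with a small nonzero probability instead of zero.

from collections import Counter

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

def bigram_prob(w1: str, w2: str) -> float:
    # add-one smoothing: +1 to every bigram count, +V to the denominator
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

bigram_prob("the", "cat")  # seen: 2/7
bigram_prob("the", "dog")  # unseen, but now 1/7 instead of 0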
If surprisal lets us quantify how unlikely a single outcome of a possible event is, entropy does the same thing for the event as a whole. It’s the expected value of the surprisal across every possible outcome — the sum of the surprisal of every outcome multiplied by the probability it happens
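Again a toy sketch of my own: entropy as expected surprisal, in bits.

import math

def entropy(probs: list[float]) -> float:
    # each outcome's surprisal -log2(p), weighted by its probability p
    return sum(p * -math.log2(p) for p in probs if p > 0)

entropy([0.5, 0.5])    # 1.0 bit: a fair coin
entropy([0.99, 0.01])  # ~0.08 bits: almost never surprised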
First, as we saw in the calculation section, a model’s worst-case perplexity is fixed by the language’s vocabulary size. This means you can greatly lower your model’s perplexity just by, for example, switching from a word-level model (which might easily have a vocabulary size of 50,000+ words) to a character-level model (with a vocabulary size of around 26), regardless of whether the character-level model is really more accurate.
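A toy illustration of that (mine, not the article's): a model that knows nothing and guesses uniformly over the vocabulary assigns 1/|V| to every token, and its perplexity comes out to exactly |V|.

import math

def uniform_perplexity(vocab_size: int, n_tokens: int = 100) -> float:
    # every token gets probability 1 / vocab_size
    log_prob = n_tokens * math.log(1 / vocab_size)
    return math.exp(-log_prob / n_tokens)

uniform_perplexity(50_000)  # 50000.0: word-level "worst case"
uniform_perplexity(26)      # 26.0: character-level "worst case"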
The problem is that news publications cycle through viral buzzwords quickly — just think about how often the Harlem Shake was mentioned in 2013 compared to now.
https://ruder.io/tag/natural-language-processing/index.html — multilingual / non-English NLP seems to be an interest of his; might be interesting in the “why” context
Best post ever: The #BenderRule: On Naming the Languages We Study and Why It Matters
Bits:
https://ml-cheatsheet.readthedocs.io/en/latest/calculus.html#chain-rule
The basics, with math but not too much of it: https://cs231n.github.io/neural-networks-1/
Activation functions:
God, I need to read documentation, all of it, including the not-important-sounding first sentences.
Previously: 220810-1201 Huggingface utils ExplicitEnum python bits showing me how to do str Enums
.. you can look up a member using both its value and its name.
enum — Support for enumerations — Python 3.11.0 documentation:
from enum import Enum

class MyEnum(str, Enum):
    IG2 = "val1"
    IG3 = "val2"

# lookup by value and lookup by name give the same member
MyEnum("val1") == MyEnum["IG2"]  # True
Pipelines: in the predictions, p['word'] is not the exact string from the input text! It’s reconstructed from the subtokens and may have extra spaces etc. For the exact string, the offsets should be used (sketch below).
EDIT - I did another good deed today: Fix error/typo in docstring of TokenClassificationPipeline by pchr8 · Pull Request #19798 · huggingface/transformers
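A sketch of what using the offsets looks like (no model pinned, so the pipeline grabs its default NER model; the entity dicts carry start/end character offsets into the original string):

from transformers import pipeline

ner = pipeline("token-classification", aggregation_strategy="simple")
text = "My name is Wolfgang and I live in Berlin"
for e in ner(text):
    # e["word"] is rebuilt from subtokens and may not match the input exactly;
    # e["start"] / e["end"] always index into the original string
    print(e["word"], "vs", text[e["start"]:e["end"]])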
PyTorch has torchdata, roughly similar to what I used to know and love in Keras: Tutorial — TorchData main documentation
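A minimal sketch of the DataPipes style as I understand it from the tutorial (IterableWrapper and the functional chaining come from torchdata.datapipes; treat it as a sketch, not gospel):

from torchdata.datapipes.iter import IterableWrapper

# transformations chain functionally, tf.data / Keras-pipeline style
pipe = (
    IterableWrapper(range(10))
    .map(lambda x: x * 2)
    .filter(lambda x: x > 5)
    .shuffle()
    .batch(2)
)
for batch in pipe:
    print(batch)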
Neat snippet I just wrote that will get rid of a lot of duplicated code:
import logging
from typing import Type

logger = logging.getLogger(__name__)

def exception_or_error(
    message: str,
    fail_loudly: bool = False,
    exception_type: Type[Exception] = ValueError,
) -> None:
    """Log an error or raise an exception.

    Lets a single flag (e.g. "are we in production?") decide
    whether errors are fatal or just logged."""
    if fail_loudly:
        # raise whatever exception type the caller picked
        raise exception_type(message)
    else:
        logger.error(message)
Usage:
are_we_in_production = True

# will log or raise a ValueError depending on the flag above
exception_or_error("File not found", fail_loudly=are_we_in_production)

# if raising, raises a KeyError instead of the default ValueError
exception_or_error(
    "Row not in db",
    fail_loudly=are_we_in_production,
    exception_type=KeyError,
)
services:
  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=my-super-secret-tunnel-token
    restart: unless-stopped
    network_mode: "host"
Then whatever service you’re exposing can run in its own network with the bridge driver:
networks:
  nextcloud:
    driver: bridge

# ....

services:
  nextcloud:
    networks:
      - nextcloud
    ports:
      - "1234:80"
And then in the Cloudflare Zero Trust UI, add a tunnel pointing to localhost:1234.
The neat thing is that the tunnel type HTTP refers only to the connection to the host running cloudflared; through Cloudflare’s servers the thing is accessible as both HTTP and HTTPS. No need to manually do any certificate stuff!
Wanted to run frp’s client frpc with Docker to forward the SSH port.
The main issue was binding to a port already open on the host, and one not controlled by Docker.
My first attempt led to this: “Error starting userland proxy: listen tcp4 0.0.0.0:22: bind: address already in use”
After looking around the Internet, found a solution.
The docker-compose.yml:
services:
  frpc:
    image: chenhw2/frp
    restart: unless-stopped
    environment:
      - ARGS=frpc
    volumes:
      - ./conf/frpc.ini:/frp/frpc.ini
    network_mode: "host"
    # with network_mode "host" the ports mapping below is ignored anyway
    ports:
      - "22:22"
The key being the “network_mode” part.
Neither frp server nor client configs needed anything special.
Strangely, I didn’t even need to set any capabilities like I did for dns:
services:
  dns:
    restart: always
    image: strm/dnsmasq
    volumes:
      - ./conf/dnsmasq.conf:/etc/dnsmasq.conf
    ports:
      - "53:53/udp"
    cap_add:
      - NET_ADMIN
This goes into “things you’re allowed to do” (Previously: List of good things - serhii.net) territory, but: