In the middle of the desert you can say anything you want
For the paper I’m writing, I’ll actually try to do a real garden thing. With leaves etc that get updated with new info, not chronologically like my current DTB notes.
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
Closer to the end there's a discussion of LM metrics vs. performance on downstream tasks:
Perplexity is the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words in the test set.
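To make that concrete, a toy sketch of my own (not from the article): perplexity as the inverse probability of the test set, normalized by its length, computed in log space for numeric stability.

import math

def perplexity(token_probs: list[float]) -> float:
    # PPL = (p_1 * p_2 * ... * p_N) ** (-1 / N), done in log space
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

perplexity([0.25, 0.1, 0.5])  # ~4.31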
Perplexity limitations and ways to get around them / smoothing:
As a result, the bigram probability values of those unseen bigrams would be equal to zero making the overall probability of the sentence equal to zero and in turn perplexity to infinity. This is a limitation which can be solved using smoothing techniques.
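A toy sketch of my own (not the article's) of what add-one (Laplace) smoothing looks like for bigrams: every count gets +1 and the denominator gets +V, so unseen bigrams end up with a small nonzero probability instead of zero.

from collections import Counter

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

def bigram_prob(w1: str, w2: str) -> float:
    # add-one smoothing: +1 to every bigram count, +V to the denominator
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

bigram_prob("the", "cat")  # seen: 2/7
bigram_prob("the", "dog")  # unseen, but now 1/7 instead of 0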
If surprisal lets us quantify how unlikely a single outcome of a possible event is, entropy does the same thing for the event as a whole. It’s the expected value of the surprisal across every possible outcome — the sum of the surprisal of every outcome multiplied by the probability it happens
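Again a toy sketch of my own: entropy as expected surprisal, in bits.

import math

def entropy(probs: list[float]) -> float:
    # each outcome's surprisal -log2(p), weighted by its probability p
    return sum(p * -math.log2(p) for p in probs if p > 0)

entropy([0.5, 0.5])    # 1.0 bit: a fair coin
entropy([0.99, 0.01])  # ~0.08 bits: almost never surprised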
First, as we saw in the calculation section, a model’s worst-case perplexity is fixed by the language’s vocabulary size. This means you can greatly lower your model’s perplexity just by, for example, switching from a word-level model (which might easily have a vocabulary size of 50,000+ words) to a character-level model (with a vocabulary size of around 26), regardless of whether the character-level model is really more accurate.
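A toy illustration of that (mine, not the article's): a model that knows nothing and guesses uniformly over the vocabulary assigns 1/|V| to every token, and its perplexity comes out to exactly |V|.

import math

def uniform_perplexity(vocab_size: int, n_tokens: int = 100) -> float:
    # every token gets probability 1 / vocab_size
    log_prob = n_tokens * math.log(1 / vocab_size)
    return math.exp(-log_prob / n_tokens)

uniform_perplexity(50_000)  # 50000.0: word-level "worst case"
uniform_perplexity(26)      # 26.0: character-level "worst case"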
The problem is that news publications cycle through viral buzzwords quickly — just think about how often the Harlem Shake was mentioned in 2013 compared to now.
https://ruder.io/tag/natural-language-processing/index.html — multilingual / non-English NLP seems to be an interest of his; might be interesting in the “why” context
Best post ever: The #BenderRule: On Naming the Languages We Study and Why It Matters
Bits:
https://ml-cheatsheet.readthedocs.io/en/latest/calculus.html#chain-rule
The basics, with math but not too much of it: https://cs231n.github.io/neural-networks-1/
Activation functions:
God, I need to read documentation, all of it, including the not-important-sounding first sentences.
Previously: 220810-1201 Huggingface utils ExplicitEnum python bits showing me how to do str Enums
.. you can look up a member using both its value and its name.
enum — Support for enumerations — Python 3.11.0 documentation:
from enum import Enum

class MyEnum(str, Enum):
    IG2 = "val1"
    IG3 = "val2"

# lookup by value and lookup by name give the same member
MyEnum("val1") == MyEnum["IG2"]  # True
Pipelines: in the predictions, p['word'] is not the exact string from the input text! It’s reconstructed from the subtokens and may have extra spaces etc. For the exact string, the offsets should be used (sketch below).
EDIT - I did another good deed today: Fix error/typo in docstring of TokenClassificationPipeline by pchr8 · Pull Request #19798 · huggingface/transformers
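A sketch of what using the offsets looks like (no model pinned, so the pipeline grabs its default NER model; the entity dicts carry start/end character offsets into the original string):

from transformers import pipeline

ner = pipeline("token-classification", aggregation_strategy="simple")
text = "My name is Wolfgang and I live in Berlin"
for e in ner(text):
    # e["word"] is rebuilt from subtokens and may not match the input exactly;
    # e["start"] / e["end"] always index into the original string
    print(e["word"], "vs", text[e["start"]:e["end"]])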
PyTorch has torchdata, roughly similar to what I used to know and love in Keras: Tutorial — TorchData main documentation
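A minimal sketch of the DataPipes style as I understand it from the tutorial (IterableWrapper and the functional chaining come from torchdata.datapipes; treat it as a sketch, not gospel):

from torchdata.datapipes.iter import IterableWrapper

# transformations chain functionally, tf.data / Keras-pipeline style
pipe = (
    IterableWrapper(range(10))
    .map(lambda x: x * 2)
    .filter(lambda x: x > 5)
    .shuffle()
    .batch(2)
)
for batch in pipe:
    print(batch)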
Neat snippet I just wrote that will get rid of a lot of duplicated code:
import logging
from typing import Type

logger = logging.getLogger(__name__)

def exception_or_error(
    message: str,
    fail_loudly: bool = False,
    exception_type: Type[Exception] = ValueError,
) -> None:
    """Log an error or raise an exception.

    Lets a single flag (e.g. "are we in production?") decide
    whether errors are fatal or just logged."""
    if fail_loudly:
        # raise whatever exception type the caller picked
        raise exception_type(message)
    else:
        logger.error(message)
Usage:
are_we_in_production = True

# will log or raise a ValueError depending on the flag above
exception_or_error("File not found", fail_loudly=are_we_in_production)

# if raising, raises a KeyError instead of the default ValueError
exception_or_error(
    "Row not in db",
    fail_loudly=are_we_in_production,
    exception_type=KeyError,
)
services:
  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=my-super-secret-tunnel-token
    restart: unless-stopped
    network_mode: "host"
Then whatever service you’re exposing can run in its own network with the bridge driver:
networks:
  nextcloud:
    driver: bridge

# ....

services:
  nextcloud:
    networks:
      - nextcloud
    ports:
      - "1234:80"
And then in the Cloudflare Zero Trust UI, add a tunnel pointing to localhost:1234.
The neat thing is that the tunnel type HTTP refers only to the connection to the host running cloudflared; through Cloudflare’s servers the thing is accessible as both HTTP and HTTPS. No need to manually do any certificate stuff!
Wanted to run frp’s client frpc with Docker to forward the SSH port.
The main issue was binding to a port already open on the host, and one not controlled by Docker.
My first attempt led to this: “Error starting userland proxy: listen tcp4 0.0.0.0:22: bind: address already in use”
After looking around the Internet, found a solution.
The docker-compose.yml:
services:
  frpc:
    image: chenhw2/frp
    restart: unless-stopped
    environment:
      - ARGS=frpc
    volumes:
      - ./conf/frpc.ini:/frp/frpc.ini
    network_mode: "host"
    # with network_mode "host" the ports mapping below is ignored anyway
    ports:
      - "22:22"
The key being the “network_mode” part.
Neither frp server nor client configs needed anything special.
Strangely, I didn’t even need to set any capabilities like I did for dns:
services:
  dns:
    restart: always
    image: strm/dnsmasq
    volumes:
      - ./conf/dnsmasq.conf:/etc/dnsmasq.conf
    ports:
      - "53:53/udp"
    cap_add:
      - NET_ADMIN
This goes into “things you’re allowed to do” (Previously: List of good things - serhii.net) territory, but: