In the middle of the desert you can say anything you want
Really nice google colab showing more advanced datasets bits in addition to what’s on the label:
Custom Named Entity Recognition with BERT.ipynb - Colaboratory
Pasting this example from there:
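import numpy as np
import torch
from torch.utils.data import Dataset

# labels_to_ids (label string -> integer id) is a global defined elsewhere in the notebook
class dataset(Dataset):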
	def __init__(self, dataframe, tokenizer, max_len):
		self.len = len(dataframe)
		self.data = dataframe
		self.tokenizer = tokenizer
		self.max_len = max_len
	  
	def __getitem__(self, index):
		# step 1: get the sentence and word labels
		sentence = self.data.sentence[index].strip().split()
		word_labels = self.data.word_labels[index].split(",")
		  
		# step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
		# BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
		encoding = self.tokenizer(
			sentence,
			is_split_into_words=True,  # named is_pretokenized in older transformers releases
			return_offsets_mapping=True,
			padding='max_length',
			truncation=True,
			max_length=self.max_len,
		)
		# step 3: create token labels only for first word pieces of each tokenized word
		labels = [labels_to_ids[label] for label in word_labels]
		# code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
		# create an empty array of -100 of length max_length
		encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100
		# set only labels whose first offset position is 0 and the second is not 0
		i = 0
		for idx, mapping in enumerate(encoding["offset_mapping"]):
			if mapping[0] == 0 and mapping[1] != 0:
				# overwrite label
				encoded_labels[idx] = labels[i]
				i += 1
		
		# step 4: turn everything into PyTorch tensors
		item = {key: torch.as_tensor(val) for key, val in encoding.items()}
		item['labels'] = torch.as_tensor(encoded_labels)
		return item
	  
	def __len__(self):
		return self.len
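To see what step 3 does without loading a tokenizer, here is a minimal sketch of the same first-word-piece rule on plain Python lists. The function name, the toy offsets and the label ids are all made up for illustration; the offsets are hand-written, not real tokenizer output. It relies on the fact that with pre-split input the offsets restart at every word, so a start of 0 with a non-zero end marks the first piece of a word, while (0, 0) marks special tokens and padding.

```python
IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross-entropy loss


def labels_from_offsets(offset_mapping, word_label_ids):
    """Give the first sub-token of each word its label, everything else IGNORE_INDEX."""
    encoded = [IGNORE_INDEX] * len(offset_mapping)
    next_word = 0
    for idx, (start, end) in enumerate(offset_mapping):
        # start == 0 and end != 0: first piece of a word
        # (0, 0): special token or padding, stays ignored
        if start == 0 and end != 0:
            encoded[idx] = word_label_ids[next_word]
            next_word += 1
    return encoded


# Hand-written offsets for the pieces: [CLS] wash ##ington dc [SEP] [PAD]
toy_offsets = [(0, 0), (0, 4), (4, 10), (0, 2), (0, 0), (0, 0)]
toy_labels = [1, 2]  # hypothetical label ids for the words "Washington" and "DC"

print(labels_from_offsets(toy_offsets, toy_labels))
# -> [-100, 1, -100, 2, -100, -100]
```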
For aligning tokens, there’s Code To Align Annotations With Huggingface Tokenizers. It has a repo: LightTag/sequence-labeling-with-transformers: Examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models
Also the official tutorial (Token classification) has a function to do something similar:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
git rebase -i SHA_of_commit_to_delete^ drops you into the usual screen; there you can change pick to drop in the first line (or any others) to just delete that commit.
Generally, On undoing, fixing, or removing commits in git seems like The README for that.
git branch -d some-branch deletes a local branch
git push origin --delete some-branch deletes a remote branch
(as usual, remembering that branches are pointers to commits)
Changing the timeout delay for wrong logins on linux has a lot of details, in my case the TL;DR was:
/etc/pam.d/login: change the number, in microseconds;
/etc/pam.d/common-auth: add nodelay to this line:
auth    [success=1 default=ignore]  pam_unix.so nullok_secure nodelay
The second one also works for everything inheriting that, which is a lot.
debugging - I have a hardware detection problem, what logs do I need to look into? - Ask Ubuntu:
Then, causing the problem to happen, and listing the system’s logs in reverse order of modification time:
ls -lrt /var/log,
tail -n 25 on recently modified log files (for reasonable values of 25), and
dmesg.
Read, wonder, think, guess, test, repeat as needed
Causing the problem and then looking at the recently modified logs is common sense but brilliant.
And saving ls -lrt as “list by modification time”.
-t is “sort by modification time” and is easy to remember.
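The same "newest last" listing can be sketched in Python for use inside a script. This is only an approximation of ls -lrt under my own assumptions: the function name and the count parameter are made up, it looks at regular files only, and the demo uses a throwaway directory with invented file names and timestamps instead of the real /var/log.

```python
import os
import tempfile
from pathlib import Path


def recently_modified(directory, count=25):
    """Return up to `count` regular files from `directory`, oldest first, newest last."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime)  # ascending mtime, like ls -rt
    return files[-count:]


# Demo on a temporary directory with made-up names and modification times.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("syslog", 300), ("kern.log", 100), ("auth.log", 200)]:
        path = Path(tmp) / name
        path.write_text("")
        os.utime(path, (mtime, mtime))  # set access and modification time explicitly
    print([p.name for p in recently_modified(tmp, count=2)])
    # -> ['auth.log', 'syslog']
```

On a real system the call would be something like recently_modified("/var/log", 5), possibly with elevated permissions.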
When debugging an issue I had with my monitor, I found a mention of inxi, which seems to colorfully output basic system (incl. hardware) info.
The post asked for inxi -SMCGx; inxi help told me inxi -F is the fullest possible output.
Neat!
So, noisetorch says it’s potentially compromised: Release POTENTIAL COMPROMISE · noisetorch/NoiseTorch.
An improvement for the previous more dramatic formulation: Community code review? · noisetorch/NoiseTorch@b4bb8e6
This project is dead, i’ve failed you.
Thoughts and prayers (honestly! I loved it), with a heavy heart I keep looking.
Option 1: werman/noise-suppression-for-voice: Noise suppression plugin based on Xiph’s RNNoise
Reading how to install it made me very sad, kept looking.
Saw EasyEffects mentioned, but it runs on Pipewire.
TIL Pipewire is a Pulseaudio replacement.
Installed via this guide: How to install PipeWire on Ubuntu Linux - Linux Tutorials - Learn Linux Configuration
Installed and ran EasyEffects using flatpak:
flatpak install easyeffects
flatpak run com.github.wwmm.easyeffects
EasyEffects’ GUI looks awesome!
Had to choose another input source in pavucontrol; then, once the input is piped through it, the effect “Noise Reduction” works! Removes both keyboard and random background white noise.
You can even save the config as preset and make it run automagically on startup!
TIL about git bisect.
git help bisect for help.
TL;DR: uses binary search to find a commit that introduced a change. You run it, it gives you a commit, you tell it if it’s good or bad, and it keeps narrowing down the options.
git bisect start ->  git bisect good ->  git bisect bad ->  git bisect reset
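The narrowing-down can be sketched as a plain binary search. This is a simplification, not what git does internally: it assumes a linear history ordered oldest to newest, a known-good first commit, a known-bad last commit, and that every commit after the culprit is also bad. The function name and the commit names are made up.

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first commit for which is_bad(commit) is True.

    Assumes commits[0] is good, commits[-1] is bad, and badness never
    disappears again once it has been introduced.
    Returns the culprit and the number of commits that had to be tested.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):
            bad = mid   # culprit is at mid or earlier
        else:
            good = mid  # culprit is after mid
    return commits[bad], steps


history = [f"c{i}" for i in range(16)]  # c0 (oldest) ... c15 (newest)
culprit, steps = first_bad_commit(history, lambda c: int(c[1:]) >= 11)
print(culprit, steps)
# -> c11 4
```

Sixteen commits need only four checks here, which is why bisecting beats walking the history one commit at a time.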
HF Datasets’ README links this nice Google Colab that explains the basics: HuggingFace datasets library - Overview - Colaboratory
I use # TODOs for “Do later”.
If they exist, PyCharm asks me every time before committing if I really want to.
I guess the idea is to use them to mark things to do before committing, so much smaller scale and here-and-now?
sanitize-filename · PyPI does what it says on the box.
It’s more complex than the replace--/ that I had in mind: sanitize_filename/sanitize_filename.py · master · jplusplus / sanitize-filename · GitLab
And intuition tells me using external semi-unknown libraries like this might be a security risk.
TODO - what is the best practice for user-provided values that might become filenames?.. Something not smelling of either injection vulns or dependency vulns?
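One stdlib-only direction, as a rough sketch and not a vetted best practice: use an allow-list of characters instead of trying to enumerate the dangerous ones. The function name, the fallback value and the length limit are my own assumptions, and the sketch does not handle Windows reserved names such as CON or NUL.

```python
import re
import unicodedata


def safe_filename(user_value, fallback="unnamed", max_length=100):
    """Reduce an arbitrary string to a conservative ASCII file name."""
    # Decompose accented characters and drop whatever is not ASCII.
    text = unicodedata.normalize("NFKD", user_value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace every run of characters outside the allow-list with one underscore.
    # Path separators are not on the list, so no directory parts survive.
    text = re.sub(r"[^A-Za-z0-9._-]+", "_", text)
    # Trim separators at both ends, so "..", hidden files and trailing dots cannot result.
    text = text[:max_length].strip("._-")
    return text or fallback


print(safe_filename("../../etc/passwd"))       # -> etc_passwd
print(safe_filename("my report (final).txt"))  # -> my_report_final_.txt
print(safe_filename("..."))                    # -> unnamed
```

An even more defensive option is to never use the user's value as a file name at all: generate the name yourself (for example with uuid.uuid4()) and keep the user-provided string only as metadata.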