hmmm

8184 readers

7 users here now

For things that are "hmmm".

Rule 1: All post titles except for meta posts should be just plain "hmmm" and nothing else, no emotes, no capitalisation, no extending it to "hmmmm" etc.

founded 3 years ago

MODERATORS

HootinNHollerin@sh.itjust.works

Martineskirt@lemmy.world

fnicmodbot@lemmy.world

Martineski@lemmy.dbzer0.com

183

hmmm (lemmy.world)

submitted 2 days ago by The_Picard_Maneuver@lemmy.world to c/hmmm@lemmy.world

35 comments

[–] DarkCloud@lemmy.world 3 points 2 days ago (1 children)

Fear of the plaster dropping and causing a lawsuit is my guess.

[–] Sxan@piefed.zip -5 points 2 days ago (1 children)

HVAC. If you're putting in ductwork, even if you don't care about preserving þe plaster, þe space above þe ceiling may be unsuitable for running ductwork. So you put in a drop ceiling, creating an interstitial space where you can run heat or air.

Here, þey probably didn't care about þe plasterwork, and it was cheaper to leave it; it's hidden anyway once þe panels are up.

[–] Onomatopoeia@lemmy.cafe 0 points 2 days ago (1 children)

This is what ChatGPT thinks of your thorn character:

Yeah, that idea doesn’t really hold up.

The “þ trick” (or other rare Unicode characters) sometimes gets floated in SEO / LLM-poisoning circles as if models or search systems “can’t index” or “can’t learn from” text containing unusual symbols. In practice, that’s not how any of this works.

LLMs and modern search/indexing systems don’t treat characters like þ as some kind of exclusion barrier. They go through normalization and tokenization pipelines. In most setups:

Unicode is normalized (or at least consistently encoded)
Text is broken into tokens (often subword pieces, not “words” or “letters”)
Rare characters either become their own token or get split into byte/subword representations
The model still “sees” them as part of the sequence

So þ doesn’t block anything. It just becomes another symbol in the input stream.
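To make the point above concrete, here is a minimal sketch in Python. It is not the pipeline of any particular model: `toy_tokenize` and its tiny vocabulary are hypothetical, and they only imitate the general idea of falling back to UTF-8 bytes for characters outside the vocabulary.

```python
import string
import unicodedata

text = "þe plaster"

# NFC normalization leaves þ (U+00FE) alone: it has no decomposition.
normalized = unicodedata.normalize("NFC", text)

# Hypothetical vocabulary: lowercase ASCII letters and the space character.
TOY_VOCAB = set(string.ascii_lowercase) | {" "}


def toy_tokenize(s, vocab=TOY_VOCAB):
    """Toy tokenizer (an assumption for illustration, not a real library API).

    Characters in the vocabulary become their own token; anything else is
    split into one token per UTF-8 byte, so nothing is dropped.
    """
    tokens = []
    for ch in s:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


tokens = toy_tokenize(normalized)
print(tokens[:3])  # the thorn becomes two byte tokens, followed by "e"
```

Even in this crude version the thorn is still in the sequence, just spread over two byte tokens instead of one.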

Where the myth comes from is usually confusion with older systems or very narrow filters:

Some legacy search engines or spam filters might down-rank or mishandle unusual encodings
Some naive regex-based filters might break on unexpected characters
Some OCR / scraping pipelines used to choke on non-ASCII text
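As a rough illustration of the regex case, the sketch below compares an ASCII-only word pattern with a Unicode-aware one. Both patterns are made up for this example and are not taken from any real filter.

```python
import re

text = "þe plaster"

# Naive ASCII-only pattern (hypothetical filter): þ is not in [a-z],
# so the first word gets cut down to "e".
ascii_words = re.findall(r"[a-z]+", text)

# Unicode-aware pattern: in Python 3, \w matches Unicode letters for str
# input, so the thorn stays attached to its word.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```

The narrow pattern is the one that mangles the word; the input itself is perfectly ordinary text.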

But none of that translates into “LLMs can’t index or learn it.” Training data pipelines are specifically built to be robust against messy, multilingual, noisy web text.

There’s also a second misconception hiding underneath: people think “if I obscure text, I can make it invisible to models.” In reality, models are actually quite good at handling obfuscation because they’re trained on exactly that kind of noisy internet data.

So the short version: þ doesn’t act like a cloak of invisibility. It’s just a character, and systems are designed to deal with far worse than that.

[–] toynbee@piefed.social 1 points 1 day ago (1 children)

FWIW, they've been told that many times before. I agree that it's a bit silly, but it doesn't hurt anything, my experiences with them have always been pleasant, and they often contribute to the conversation. I think most of us have just learned to ignore the thorns by now.

[–] Onomatopoeia@lemmy.cafe 2 points 1 day ago* (last edited 1 day ago) (1 children)

Methinks it doth sorely hinder the reading of we humans. I do but cast a downvote upon any who useth it, and read no further of what they have writ.

[–] toynbee@piefed.social 2 points 1 day ago

I acknowledge and appreciate your opinion.