(bash?) matching Array of Arrays against a simple k:v json at ~200k lines? (lemmy.world)

submitted 2 months ago* (last edited 2 months ago) by j4k3@lemmy.world to c/linux@lemmy.world


I have a file that contains a lot of odd slang and dialects that were written as they sound, and I want to standardize them to the ASCII character set. I want a readable script that I will understand at a glance a year later despite not touching a computer in the interim.

Maybe I am going about this the wrong way, but I want to initialize individual arrays for each character [a-z]. Then step through each character of the input word or string, passing these to a Case that matches them to the respective [a-z] array while passing unmatched characters unchanged. In the end I need to retain correlation with the original file line.

In my first attempt, I got to the point of matching each character to the name of its array using the case, but only as a string of text; for example, the "a" array is named "aaa". Now I'm trying to relearn how to use that placeholder as an array again, like a pointer. I can make it a variable with printf -v, but calling that variable as a pointer to the array eludes me. I don't know how to double-expand a variable inside an array, like "${$var[@]}". I'll figure that out. This is just where I am at in terms of abstract reference of ideas. Solve it or don't; I do not care about that aspect, since solving my method is not related to what I am asking here.

What I am asking is what ways are used to solve this type of problem in general, with the constraint of readability? Egrep, sed, awk? Do it all within the json to maintain the relationship to the original key/value? Associative arrays have never really clicked for me in bash. Maybe that is the better solution? It is just a hobby thing, not work, school, or whatnot. I'm asking hackers that find this kind of problem casual fun social smalltalk.

top 11 comments


[–] lime@feddit.nu 8 points 2 months ago (1 children)

this sounds like a python kind of problem. i'm all for abusing bash features but there comes a point when there's just too many brackets.

[–] j4k3@lemmy.world -1 points 2 months ago

Yeah, I probably should use Python, it just takes me longer and I am much less likely to keep hacking around with it later. It feels like a foreign language relative to bash. The problem came from grepping using for loops, so I'm in that mindset. Point taken though.

[–] PumaStoleMyBluff@lemmy.world 5 points 2 months ago

"readable script that I will understand at a glance a year later"

Well that rules out bash

[–] INeedMana@piefed.zip 4 points 2 months ago (1 children)

I think this post could use some small example with what you have vs what you want to get

Depending on how I read your description, it's either doable with bash and grep, or probably doable but a lot of hassle compared to using something other than bash

[–] j4k3@lemmy.world 2 points 2 months ago (2 children)

jake
Jake
j4ke
Jak3
j@k3
JAK€
jπ⸦kE
𝚥ᎪᏦ⋲
ꓙᏎ🅺Ꮛ
𞋕ꮜ𝈲𝈁
᜴ᚣᜩᗕ
It is not this, but same problem scope. Resolve all to "jake" for further processing. Also specifically looking for that ck.

[–] INeedMana@piefed.zip 2 points 2 months ago* (last edited 2 months ago) (1 children)

If you want ᜴ᚣᜩᗕ to get resolved to jake then bash will be a pain to use. I would use python

For each ASCII letter create a list of non-ASCII characters that look similar. Then, for each word you want to match construct a regex

dictionary = { ...'j': ['j', 'J', '𝚥', '𞋕', ...

regex = ''
for letter in 'jake':
    regex += f'[{"".join(dictionary[letter])}]'

So after the whole loop the regex would look a bit like this (I'm cutting out a lot of characters to save on copy-pasting): [jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]

>>> import re  
>>> r='[jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]'  
>>> re.fullmatch(r, 'jake')  
<re.Match object; span=(0, 4), match='jake'>  
>>> re.fullmatch(r, 'joke')  
>>>

In general, the group of problems you are touching on here is https://en.wikipedia.org/wiki/String_metric but I'm not sure there is an algorithm that does this kind of "visual" matching
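
The loop above can be fleshed out into something runnable. This is only a sketch: the `LOOKALIKES` table and the `build_pattern` name are made up for illustration, and the table holds just the handful of characters from the example regex, so a real one would need far more entries.

```python
import re

# Hypothetical lookalike table: ASCII letter -> characters treated as that letter.
# Only the few characters from the example regex are included.
LOOKALIKES = {
    "j": ["j", "J", "𝚥", "𞋕"],
    "a": ["a", "A", "4", "@"],
    "k": ["k", "K", "Ꮶ", "🅺"],
    "e": ["e", "E", "3", "€"],
}


def build_pattern(word):
    """Build a regex with one character class per letter of `word`."""
    parts = []
    for letter in word:
        # Letters missing from the table only match themselves.
        variants = LOOKALIKES.get(letter, [letter])
        # re.escape keeps characters such as ']' or '^' literal inside the class.
        parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
    return re.compile("".join(parts))


jake = build_pattern("jake")
for candidate in ["jake", "j4ke", "JAK€", "joke"]:
    # Prints each candidate with True if it matches, False otherwise.
    print(candidate, bool(jake.fullmatch(candidate)))
```

Keeping the table as plain data means a new lookalike is a one-line change, which should help with the "readable a year later" requirement.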

[–] j4k3@lemmy.world 2 points 2 months ago

Thanks. This was helpful.

[–] lime@feddit.nu 2 points 2 months ago

tr could definitely do some work here. maybe something like, echo each word and its translated counterpart, sort on first column, and then echo $col1 >> $col0 for each line? it's a start at least
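
If this ends up in Python instead, the closest analogue to tr is probably `str.translate`. A minimal sketch, assuming a small hand-written mapping (the `TABLE` contents and the `normalize` name are illustrative only). Characters outside the table pass through unchanged, and `enumerate` keeps each result tied to its original line number.

```python
# Hypothetical mapping from lookalike characters to plain ASCII letters.
# Digits are mapped too, which would be wrong for text containing real numbers.
TABLE = str.maketrans({
    "4": "a", "@": "a",
    "3": "e", "€": "e",
    "𝚥": "j",
    "Ꮶ": "k",
})


def normalize(text):
    """Replace every mapped character, then lowercase; leave the rest as-is."""
    # Translate first: lower() can alter some lookalikes before the table sees them.
    return text.translate(TABLE).lower()


lines = ["jake", "Jak3", "j@k3", "JAK€", "joke"]
for lineno, original in enumerate(lines, start=1):
    # One row per input line: line number, original text, normalized text.
    print(lineno, original, normalize(original))
```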

[–] brickfrog@lemmy.dbzer0.com 3 points 2 months ago

Similar to the other comment, not sure if you've ruled out writing a Python script? For what you're describing Python would be able to easily tackle your requirements and still be readable since it's just a script you can launch whenever you need. Python is also pretty easy to pick up so if you're familiar with scripting then it could be a fun learning experience (if you don't already know it).

Other scripting languages could work too, just feel like Bash will be less readable if you're writing a massive script like that.

[–] mhzawadi@lemmy.horwood.cloud 2 points 2 months ago

If you have raw JSON, then take a look at jq. It's amazing for reading JSON text in bash
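
For comparison, if the work moves to Python, the standard `json` module covers similar ground for a flat key/value file. A sketch, assuming the file is a single JSON object mapping string keys to string values; the sample data and the `normalize` stand-in are made up.

```python
import json

# Made-up sample standing in for the real ~200k-line file.
raw = '{"id-001": "Jak3", "id-002": "j@k3", "id-003": "joke"}'


def normalize(text):
    """Stand-in normalizer: lowercase and swap two common lookalikes."""
    return text.lower().replace("3", "e").replace("@", "a")


data = json.loads(raw)  # for a file on disk, json.load(file_object) does the same

# Keep the original value next to the normalized one so each key stays traceable.
result = {
    key: {"original": value, "normalized": normalize(value)}
    for key, value in data.items()
}

# ensure_ascii=False keeps non-ASCII originals readable in the output.
print(json.dumps(result, ensure_ascii=False, indent=2))
```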

[–] cymor@midwest.social 2 points 2 months ago

If it's a regular enough format, you could just use DuckDB. Are you using it as a challenge?