I need to scan very large JSONL files efficiently and am considering a parallel grep-style approach over line-delimited text.

Would love to hear how you would design it.

[–] Jayjader@jlai.lu 1 points 15 hours ago* (last edited 15 hours ago)
  1. chunk_size := file_size / cpu_cores. Compile regex.

  2. spawn cpu_cores workers:
    2.a. worker #n starts at n * chunk_size bytes. If n > 0, skip bytes until a newline is encountered (checking whether the byte just before the chunk start is already a newline avoids dropping a line that begins exactly on the boundary).
    2.b. worker feeds bytes from its chunk into the regex. When a match is found, write it to the output (stdout or a file, whichever performs better). When a newline is encountered, reset the regex state machine.
    2.c. after having read chunk_size bytes, continue until a newline is encountered, so the whole file is covered by the parallel search (see the sketch below).
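
A minimal Python sketch of that worker loop, assuming each worker is handed the file path, its chunk's start offset, the chunk size, and a bytes regex (all hypothetical parameter names). It reads whole lines with `readline()` instead of driving the regex byte by byte, which is the practical equivalent of resetting the matcher at each newline:

```python
import re

def scan_chunk(path, start, chunk_size, pattern):
    """Scan one chunk of a line-delimited file, yielding (offset, line) for matches.

    Sketch of steps 1-2 above; `pattern` is a bytes regex (e.g. rb"ERROR")
    because the file is read in binary mode.
    """
    regex = re.compile(pattern)
    end = start + chunk_size
    with open(path, "rb") as f:
        # 2.a: every worker except the first skips the partial line at the start
        # of its chunk; the previous worker reads past its own boundary (2.c) and
        # covers it. Checking the byte just before `start` avoids skipping a line
        # that happens to begin exactly on the chunk boundary.
        if start > 0:
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()
        # 2.b / 2.c: read whole lines and test each against the regex; the loop
        # only stops once a line *starts* at or past the chunk end, so the line
        # straddling the boundary is still covered.
        while True:
            offset = f.tell()
            if offset >= end:
                break
            line = f.readline()
            if not line:  # end of file
                break
            if regex.search(line):
                yield offset, line
```

The line-at-a-time loop trades the streaming automaton of 2.b for simplicity; if individual lines can be huge, an mmap or a fixed-size read buffer would avoid `readline()`'s per-line allocations.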

Optionally, keep track of the byte offset of each match and attach it to the output, to make it easier to de-duplicate matches later and/or navigate to them in the file.

To avoid interleaved writes between workers, have each worker output to a separate file, and only combine these output files once all of the workers have finished.
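
One possible way to wire the workers together with per-worker output files, using Python's multiprocessing and reusing the `scan_chunk` sketch above; `parallel_scan`, `scan_chunk_to_file` and the `.matches.N` file names are made up for illustration, and the byte offset is prepended to each match as suggested above:

```python
import os
from multiprocessing import Pool

def scan_chunk_to_file(args):
    """Worker entry point: scan one chunk and write matches to its own file."""
    path, start, chunk_size, pattern, out_path = args
    with open(out_path, "wb") as out:
        for offset, line in scan_chunk(path, start, chunk_size, pattern):
            # Prefix each matching line with its byte offset in the input file.
            out.write(b"%d:%s" % (offset, line))
    return out_path

def parallel_scan(path, pattern, workers=None):
    workers = workers or os.cpu_count()
    # Round up so workers * chunk_size always covers the whole file.
    chunk_size = os.path.getsize(path) // workers + 1
    jobs = [
        (path, n * chunk_size, chunk_size, pattern, f"{path}.matches.{n}")
        for n in range(workers)
    ]
    with Pool(workers) as pool:
        part_files = pool.map(scan_chunk_to_file, jobs)
    # Combine the per-worker outputs only once all workers are finished.
    with open(path + ".matches", "wb") as combined:
        for part in part_files:
            with open(part, "rb") as f:
                combined.write(f.read())
            os.remove(part)
    return path + ".matches"
```

For example, `parallel_scan("big.jsonl", rb"\"level\":\s*\"error\"")` would produce `big.jsonl.matches` containing offset-prefixed matching lines.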

As others have said, it's going to be hard to get more of a speedup than this, and you will ultimately be limited by your storage's read throughput if the whole file cannot fit in memory.