32
Bloom filters: real-world applications
(llimllib.github.io)
Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!
Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.
Hope you enjoy the instance!
Rules
Follow the wormhole through a path of communities !webdev@programming.dev
I haven’t used them in Spark directly but here’s how they are used for computing sparse joins in a similar data processing framework:
Let’s say you want to join some data “tables” A and B. When B has many more unique keys than are present in A, computing “A inner join B” would require lots of shuffling if B, including those extra keys.
Knowing this, you can add a step before the join to compute a bloom filter of the keys in A, then apply the filter to B. Now the join from A to B-filtered only considers relevant keys from B, hopefully now with much less total computation than the original join.