this post was submitted on 02 May 2025

Hi my dears, I have an issue at work where we have to work with millions (~150 million) of product data points. We are using SQL Server because it was available in-house for development. However, once various tables grow beyond 10 million rows the server becomes quite slow, and waiting/buffer time goes above 7000 ms/sec. That is tearing down our whole setup of microservices, which continuously read from, write to and delete from those tables. All the Stack Overflow answers lead to: it's complex, read a 2000-page book.

The thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed any further because the processing takes time (we thought that would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare batches against the SQL table. The other culprit is that our server runs on an HDD, which at 150 MB/s read and write is probably at its limit.

The question is: is there a wizard move to bypass any of my restrictions, or is a change in the setup and algorithm inevitable?

Edit: I know that my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic is a lifetime of know-how in itself. Thanks for all the feedback.

PoisonedPrisonPanda@discuss.tchncs.de 1 points 5 months ago (1 children)

First of all, many thanks for the bullets. Good to have some guidance on where to start.

2nd level cache shared between services

I have read about this in relation to how FB does it. In general, this means fetching data from the DB and keeping it in memory to work with, right? So we assume that the cached data is outdated to some extent?

faster storage/cpu/ram faster storage/cpu/ram faster storage/cpu/ram

I was able to convince management to put money into a new server (SSD, thank god). So thank you for the emphasis. We are also migrating from SQL Server to PostgreSQL and refactoring the whole approach and design in general.

generate indexes

How would indexes help me when I want to ensure that no duplicate row is added? Is this some sort of internal SQL constraint, or how does it differ from comparing a certain list of rows against an existing table (let's say on column id)?

I have read about this in relation to how FB does it. In general, this means fetching data from the DB and keeping it in memory to work with, right? So we assume that the cached data is outdated to some extent?

Correct, introducing caching can result in returning outdated data for a while, which is usually not a huge deal. Those caches can get tricky, but they should take pressure off your DB if your scenario is read-heavy, which is often the case. Research existing caching solutions before running ahead and implementing something from scratch, especially if you need a cache distributed between multiple instances of your service. In the Java world that would be something like Infinispan, but your ecosystem might offer better integration with other solutions.
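To make the "outdated for a while" part concrete, here is a minimal sketch of a read-through cache with a time-to-live, in Python. It is only an in-process illustration, not a distributed cache and not how Infinispan works; the `TTLCache` class, the `loader` callback (standing in for a database query) and the injectable `clock` are all made up for this example.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process read-through cache; entries expire after ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],  # hypothetical function that queries the database
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (time the value was loaded, value)
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            # Served from memory: may be stale by up to ttl_seconds.
            return hit[1]
        # Cache miss or expired entry: go to the database and remember the result.
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

The TTL is the knob for the tradeoff: a longer TTL means fewer database reads but data that can be out of date for longer. A cache shared between several service instances also needs invalidation and eviction, which is why looking at existing solutions first is the safer route.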

I was able to convince management to put money into a new server (SSD, thank god). So thank you for the emphasis. We are also migrating from SQL Server to PostgreSQL and refactoring the whole approach and design in general.

Having management on board is great, and the new hardware should help a lot. Migrating to another RDBMS sounds scary, but it is probably worth it if your organisation has more expertise with it.

generate indexes

On their own they won't help you with your duplicates; they will speed up your reads but could slow down writes. Building a good index is not trivial, but nothing is when it comes to performance-tuning a database; it's tradeoff after tradeoff. Usually the best way to handle identical rows of data is to not write them in the first place, but I don't know your system or its history; maybe there is or was a good reason for its current state.
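One way to "not write them" is to let the database itself refuse duplicates through a unique constraint on the identifying column. Such a constraint is backed by an index, so the duplicate check becomes an index lookup instead of a scan over the whole table. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere; the `products` table, its columns and the `insert_batch` helper are assumptions made for illustration. In PostgreSQL the comparable statement is `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE products (
        product_id TEXT NOT NULL,   -- hypothetical column that identifies a duplicate
        payload    TEXT,
        UNIQUE (product_id)         -- enforced through an index maintained by the database
    )
    """
)


def insert_batch(conn: sqlite3.Connection, rows: list) -> int:
    """Insert rows, skipping any whose product_id already exists.

    Returns the number of rows that were actually inserted.
    """
    before = conn.total_changes
    # OR IGNORE tells SQLite to skip rows that would violate the UNIQUE constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO products (product_id, payload) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before
```

A service would call `insert_batch(conn, batch)` and only hand the newly inserted rows on for further processing. The price is the one mentioned above: every write also has to maintain the index, so this trades some write speed for not having to compare each batch against the whole table.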