what the fuck data are they trying to use to train LLMs? DNS? basically everything else these days is encrypted and unreadable to them (for the time being), and DNS is easily masked as well (DoH, DoT, etc.). maybe i just answered my own question: they want to train a model that can surreptitiously spy on encrypted traffic.
edit: here's some info on the "traditional" method of monitoring encrypted traffic (TLS interception), but this is typically done inside organizations and requires installing a special trusted root certificate on every customer device
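
rough sketch of why the certificate part matters, using python's standard `ssl` module. a normal TLS client only accepts certificates that chain to a CA it already trusts, so an interception box in the middle gets rejected unless its root CA has been added on the device. `make_client_context` and `extra_ca_file` are names i made up for illustration, not anything from the linked info:

```python
import ssl
from typing import Optional


def make_client_context(extra_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a TLS client context with the default verification settings."""
    # Defaults: verify the server's certificate chain against the system
    # trust store and check that the certificate matches the hostname.
    ctx = ssl.create_default_context()
    if extra_ca_file is not None:
        # Hypothetical path to an inspection proxy's root CA (PEM file).
        # Without this step, certificates re-signed by the proxy would
        # fail verification and the handshake would be aborted.
        ctx.load_verify_locations(cafile=extra_ca_file)
    return ctx


ctx = make_client_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # verification is mandatory
print(ctx.check_hostname)                    # hostname matching is enabled
```

that extra `load_verify_locations` step (or the OS-level equivalent) is the part somebody has to do on every single device, which is why this approach mostly shows up on managed corporate machines.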