It’s pretty easy to see the problem here: The Internet is brimming with misinformation, and most large language models are trained on a massive body of text obtained from the Internet.

Ideally, having substantially higher volumes of accurate information might overwhelm the lies. But is that really the case? A new study by researchers at New York University examines how much medical misinformation can be included in a large language model (LLM) training set before the model spits out inaccurate answers. While the study doesn’t identify a lower bound, it does show that by the time misinformation accounts for 0.001 percent of the training data, the resulting LLM is compromised.

  • gravitas_deficiency@sh.itjust.works
    9 days ago

    It’s ok; we’ll just point more LLMs trained in data curation at the data. WCGW?

    I swear to god, I feel like all of these LLM circlejerking shills have systematically forgotten one of the foundational points of computer science: garbage in, garbage out.