Filter Heuristics – a collection of dirty, naughty, obscene and otherwise bad holes

Datasets for training large models have grown too big to inspect. Under the assumption that “scale averages out noise”, scaling up by scraping everything freely available on the internet and then filtering out unwanted content becomes the obvious way to “move fast”. However, the ways these datasets are “cleaned” rest mostly on heuristic-based cleaning approaches.
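To make concrete what such a heuristic can look like, here is a minimal sketch of a blocklist-style document filter of the kind commonly described in dataset papers (drop short documents, drop documents containing a listed word, drop documents whose lines lack terminal punctuation). The blocklist, thresholds, and function names are illustrative assumptions, not the actual rules of any particular dataset.

```python
# Illustrative sketch only: the blocklist, thresholds, and rules below are
# hypothetical stand-ins, not any dataset's documented pipeline.

BLOCKLIST = {"dirty", "naughty", "obscene"}  # placeholder "bad words" list
MIN_WORDS = 50                               # arbitrary minimum document length
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def keep_document(text: str) -> bool:
    """Return True if the document survives all heuristic filters."""
    words = text.lower().split()
    # Heuristic 1: drop documents that are too short.
    if len(words) < MIN_WORDS:
        return False
    # Heuristic 2: drop any document containing a blocklisted word.
    if any(word.strip(".,!?\"'") in BLOCKLIST for word in words):
        return False
    # Heuristic 3: drop documents where no line ends in terminal punctuation.
    if not any(line.rstrip().endswith(TERMINAL_PUNCTUATION) for line in text.splitlines()):
        return False
    return True


corpus = ["A scraped page of text ...", "Another scraped page ..."]
cleaned = [doc for doc in corpus if keep_document(doc)]
```

Each rule is cheap to compute over billions of documents, which is precisely why it is used; what it excludes, and for whom, is rarely examined.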

Dirty data refers to data that is somehow faulty and needs to be removed during preprocessing. In the 1980s, non-white women’s body size data was categorized as dirty data. Now, in the age of GPT, what is considered dirty data, how is it removed from massive training material, and by whom? Dirty data has an ontology similar to that of holes: people tend to refer to holes as tangible, countable objects, when in fact they are the absence of something in another object.

On the other hand, a heuristic, especially in the context of computer science, is designed to solve a problem more quickly when classic methods are too slow to find an exact or approximate solution. It is “not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, short-term goal or approximation”; it can be considered a shortcut, “good enough for solving the problem at hand.”

The publication looks into 17 open (or reproduced, and thus open) extraction-based, human-not-in-the-loop datasets and gathers their heuristic-based methods for filtering out dirty data. It asks whether a narrative of “cleaning” can emerge from technical papers, reflecting on these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and on for whom these estimations are good enough, as they will soon be part of our technological infrastructures.