Machine learning systems are profoundly influenced by the methods of data collection and labelling used in their creation. Yet there has been little research into how training data is constructed and used. Since our first Data Genesis workshop in 2017, the AI Now Institute has been developing new approaches to study and understand the role of training data in the machine learning field.
Key research questions include: What type of information is used as training data? Who generates and collects it and for what purpose? What segments of society does it reflect? Who and what does it exclude? And how does that affect the functioning of AI systems themselves?
The Data Genesis program’s goal is to answer and demystify these questions through three core components:
- Archiving and analyzing the origin and construction of key datasets that serve as foundations for today’s AI systems;
- Producing visualizations, maps, and other designs to help crystallize and contextualize what this data is and what it means to communities, practitioners, companies, and policymakers; and
- Convening experts from across disciplines to help build a field around this topic.
Source: “Announcing AI Now’s data genesis program”, AI Now Institute