Abstract: Sociotechnical systems abound in examples of the ways they constitute sources of harm for historically marginalized groups. In this context, the field of machine learning has seen a rapid proliferation of new methods, model architectures, and optimization techniques. Yet data -- which remains the backbone of machine learning research and development -- has received comparatively little research attention. My research hypothesis is that focusing exclusively on the content of training datasets (the data from which algorithms learn associations) captures only part of the problem. Instead, we should identify the historical and conceptual conditions which unveil the modes of dataset construction. I propose here an analysis of datasets from the perspective of three techniques of interpretation: genealogy, problematization, and hermeneutics. First, genealogy investigates how datasets have been created and the contextual and contingent conditions of their creation. This includes questions on the role of data provenance, the conceptualization and operationalization of the categories which structure these datasets (e.g. the labels which are applied to images), methods for annotation, the consent regimes of the data authors and data subjects, and stakeholders and other related institutional logics. Second, the technique of problematization builds on the genealogical question by asking: what are the central discourses, questions, concepts, and values which constitute themselves as the solution to problems in the construction of a given dataset? Third, building on the previous two lines of inquiry, the hermeneutical approach investigates the explicit and implicit motivations of all present and absent stakeholders (including data scientists and dataset curators) and the background assumptions operative in dataset construction.
Bio: Razvan Amironesei is a Visiting Researcher in the Ethical AI team. At Google's Center for Responsible AI, his research and publications focus on developing a pluralistic data ethics framework by using responsible interpretive methods to analyze the construction of benchmark datasets. He is also researching the relationship between computer science pedagogy and humanistic social science, specific issues related to data annotation, the constitution of offensiveness in ML datasets, and the topic of algorithmic conservation. Previously, Razvan conducted and published research on the sociotechnical impacts of benchmark datasets at the Center for Applied Data Ethics at the University of San Francisco, and on the political and ethical formation of algorithms at the Institute for Practical Ethics at UC San Diego. Razvan has taught classes in English and French in Applied Ethics for Engineers, Bioethics, Political Theory, and Religion and Politics in the US. His educational background is international and situated at the intersection of the social sciences and the humanities. He completed postdoctoral studies at the Center on Global Justice at UC San Diego, a PhD in philosophy at Laval University in Canada, an MA in the history of science and technology in France, and a bachelor's degree in the history of philosophy in Romania.