Dancing with data: wrangling musical datasets for unbiased insights


To analyse symbolic corpora of early music, we depend on existing datasets such as the Josquin Research Project and the CantusCorpus, or on encodings in community-created resources like IMSLP and CPDL. The choice of compositions in these datasets depends heavily on the scope of each project and on the preferences of its contributors. As a consequence, the datasets exhibit a selection bias that makes it hard to use them to answer a variety of musicological questions. To mitigate this problem, we propose a method that uses RISM and DIAMM, among other resources, to compile a dataset that is more representative of the actual repertoire. We presented this work at the International Medieval and Renaissance Music Conference 2024 in Granada, Spain, as part of the CORSICA panel "Making corpus creation in early music rewarding and effective: finding the optimum between standardisation and scholarly autonomy". For more details, please visit the conference schedule.

Conference Presentation

Recorded Presentation