Optimization of data collection in facial recognition models through subsampling strategies

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12008/55115

Title:	Optimization of data collection in facial recognition models through subsampling strategies
Author:	Tayler, Silvana
Advisor:	Preciozzi, Javier; Fiori, Marcelo
Type:	Master's thesis
Publication date:	2026
Abstract:	Facial recognition systems have achieved remarkable performance in recent years; however, their accuracy remains highly dependent on the quality, diversity, and volume of training data. The widespread use of large-scale datasets, often collected without consent, raises significant ethical and legal concerns, while the storage and computational demands associated with such data present ongoing challenges. This thesis explores subsampling techniques to evaluate whether strategies can be identified that guide data collection, independently of the training process, with the goal of reducing data needs and improving computational efficiency. ArcFace, a state-of-the-art facial recognition model, was selected as the baseline architecture due to its strong feature discrimination and generalization capabilities. Using the MS1M-ArcFace dataset for training and the LFW, AgeDB-30, and CFP-FP benchmarks for evaluation, 53 experiments were conducted. Multiple sampling approaches were compared at both the image and identity levels, including uniform random selection, stratified sampling, k-means clustering, and greedy Maximin selection. The experiments were designed to evaluate the effect of sample representativeness, intra- and inter-class variability, and the proportion of identities in the training set. Results indicate that k-means clustering applied to ArcFace embeddings at the image level achieved the highest overall performance across all benchmark datasets, demonstrating its effectiveness in reducing redundancy while preserving intra-class and inter-class diversity. Alternatively, random sampling at the identity level yielded competitive performance compared to more complex strategies, particularly when high intra-class variability is desired. This finding suggests that identity-level random sampling is a valid and cost-effective approach for training data selection, significantly reducing the costs of data collection, storage, and processing. Additionally, k-means clustering may serve as a more suitable alternative in scenarios with a limited number of identities and where greater intra-class variability is not required. These insights are especially relevant in ethically constrained environments, where biometric data collection is restricted by consent. In all cases, clustering can further guide the final image selection process once consent is obtained, enhancing both the efficiency and representativeness of the dataset.
Publisher:	Udelar.FI.
Citation:	Tayler, S. Optimization of data collection in facial recognition models through subsampling strategies [online] Master's thesis. Montevideo : Udelar. FI, 2026.
ISSN:	1688-2806
Degree obtained:	Magíster en Ingeniería Matemática (Master in Mathematical Engineering)
Faculty or service granting the degree:	Universidad de la República (Uruguay). Facultad de Ingeniería
License:	Creative Commons Attribution - NonCommercial - NoDerivatives license (CC BY-NC-ND 4.0)
Appears in collections:	Tesis de Posgrado - Facultad de Ingeniería
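
Illustrative sketch (not taken from the thesis): the abstract names identity-level uniform random sampling and greedy Maximin selection among the compared strategies. The Python below is one minimal way such strategies could look, assuming each image has an identity label and an embedding vector. The function names `sample_identities` and `greedy_maximin`, their parameters, and the use of Euclidean distance are all assumptions made for illustration; the thesis may define and implement these strategies differently.

```python
import math
import random


def sample_identities(labels, fraction, seed=0):
    """Identity-level random sampling (hypothetical helper).

    Draws a uniform random subset of identities and returns the indices
    of every image that belongs to a kept identity.
    """
    identities = sorted(set(labels))
    n_keep = max(1, round(fraction * len(identities)))
    kept = set(random.Random(seed).sample(identities, n_keep))
    return [i for i, label in enumerate(labels) if label in kept]


def greedy_maximin(points, k, start=0):
    """Greedy Maximin (farthest-point) selection (hypothetical helper).

    Starting from index `start`, repeatedly picks the point whose distance
    to its nearest already-selected point is largest. Euclidean distance
    is an assumption here.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point;
    # selected points are marked with -1.0 so they are never picked again.
    min_dist = [math.dist(p, points[start]) for p in points]
    min_dist[start] = -1.0
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
        min_dist[nxt] = -1.0
    return selected


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real ArcFace embeddings are high-dimensional.
    toy_points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(greedy_maximin(toy_points, k=3))

    toy_labels = ["a", "a", "b", "c", "c", "c"]
    print(sample_identities(toy_labels, fraction=0.5))
```

In this sketch the Maximin routine favors points that are spread far apart, while the identity-level sampler keeps all images of each chosen identity and so leaves intra-class variability untouched.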

Files in this item:

File	Description	Size	Format
Tay26.pdf	Master's thesis	7.08 MB	Adobe PDF

This item is licensed under a Creative Commons license.