Optimization of data collection in facial recognition models through subsampling strategies

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12008/55115

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Preciozzi, Javier	-
dc.contributor.advisor	Fiori, Marcelo	-
dc.contributor.author	Tayler, Silvana	-
dc.date.accessioned	2026-05-20T17:27:30Z	-
dc.date.available	2026-05-20T17:27:30Z	-
dc.date.issued	2026	-
dc.identifier.citation	Tayler, S. Optimization of data collection in facial recognition models through subsampling strategies [online]. Master's thesis. Montevideo: Udelar. FI, 2026.	es
dc.identifier.issn	1688-2806	-
dc.identifier.uri	https://hdl.handle.net/20.500.12008/55115	-
dc.description.abstract	Facial recognition systems have achieved remarkable performance in recent years; however, their accuracy remains highly dependent on the quality, diversity, and volume of training data. The widespread use of large-scale datasets, often collected without consent, raises significant ethical and legal concerns, while the storage and computational demands associated with such data present ongoing challenges. This thesis explores subsampling techniques to evaluate whether strategies can be identified that guide data collection, independently of the training process, with the goal of reducing data needs and improving computational efficiency. ArcFace, a state-of-the-art facial recognition model, was selected as the baseline architecture due to its strong feature discrimination and generalization capabilities. Using the MS1M-ArcFace dataset for training and the LFW, AgeDB-30, and CFP-FP benchmarks for evaluation, 53 experiments were conducted. Multiple sampling approaches were compared at both the image and identity levels, including uniform random selection, stratified sampling, k-means clustering, and greedy Maximin selection. The experiments were designed to evaluate the effect of sample representativeness, intra- and inter-class variability, and the proportion of identities in the training set. Results indicate that k-means clustering applied to ArcFace embeddings at the image level achieved the highest overall performance across all benchmark datasets, demonstrating its effectiveness in reducing redundancy while preserving intra-class and inter-class diversity. Alternatively, random sampling at the identity level yielded competitive performance compared to more complex strategies, particularly when high intra-class variability is desired. 
This finding suggests that identity-level random sampling is a valid and cost-effective approach for training data selection, significantly reducing the costs of data collection, storage, and processing. Additionally, k-means clustering may serve as a more suitable alternative in scenarios with a limited number of identities and where greater intra-class variability is not required. These insights are especially relevant in ethically constrained environments, where biometric data collection is restricted by consent. In all cases, clustering can further guide the final image selection process once consent is obtained, enhancing both the efficiency and representativeness of the dataset.	es
dc.format.extent	57 p.	es
dc.format.mimetype	application/pdf	es
dc.language.iso	en	es
dc.publisher	Udelar.FI.	es
dc.rights	Works deposited in the Repository are governed by the Ordinance on Intellectual Property Rights of the Universidad de la República (Res. No. 91 of the C.D.C. of 8/III/1994 – D.O. 7/IV/1994) and by the Ordinance of the Open Repository of the Universidad de la República (Res. No. 16 of the C.D.C. of 07/10/2014)	es
dc.title	Optimization of data collection in facial recognition models through subsampling strategies	es
dc.type	Master's thesis	es
dc.contributor.filiacion	Tayler Silvana, Universidad de la República (Uruguay). Facultad de Ingeniería.	-
thesis.degree.grantor	Universidad de la República (Uruguay). Facultad de Ingeniería	es
thesis.degree.name	Master's in Mathematical Engineering (Magíster en Ingeniería Matemática)	es
dc.rights.licence	Creative Commons Attribution - NonCommercial - NoDerivatives License (CC BY-NC-ND 4.0)	es
Appears in collections:	Postgraduate Theses - Facultad de Ingeniería
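
Illustrative sketch (not part of the deposited record): the abstract names identity-level random sampling and greedy Maximin selection among the compared strategies. The record does not describe how the thesis implements them, so the Python below is only a minimal sketch under stated assumptions: embeddings are rows of a NumPy array, distance is Euclidean, and the function names sample_identities and greedy_maximin are hypothetical.

```python
import numpy as np


def sample_identities(labels, fraction, seed=0):
    """Identity-level random sampling (assumed form): keep every image
    belonging to a uniformly random subset of identities.
    Returns the indices of the kept images."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    identities = np.unique(labels)
    n_keep = max(1, int(round(fraction * len(identities))))
    kept = rng.choice(identities, size=n_keep, replace=False)
    return np.flatnonzero(np.isin(labels, kept))


def greedy_maximin(embeddings, k, start=0):
    """Greedy Maximin (farthest-point) selection: repeatedly add the
    point whose distance to the already chosen set is largest.
    Euclidean distance and the fixed starting index are assumptions."""
    x = np.asarray(embeddings, dtype=float)
    if not 0 < k <= len(x):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = np.linalg.norm(x - x[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[nxt], axis=1))
    return chosen


# Toy data: five 2-D "embeddings" on a line, four identities with two images each.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [9.0, 0.0], [5.0, 0.0]])
toy_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])

selected = greedy_maximin(points, k=3)            # [0, 2, 4]
kept_images = sample_identities(toy_labels, 0.5)  # 4 images from 2 identities
```

On the toy data, Maximin first picks the farthest point from the start (index 2), then the point midway between the two chosen ones (index 4), which shows how the method spreads the selection over the embedding space rather than following the data density.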

Files in this item:

File	Description	Size	Format
Tay26.pdf	Master's thesis	7.08 MB	Adobe PDF

This item is licensed under a Creative Commons license.