Evaluating Zero-Shot Classifiers for Font Type Classification for a Data Anonymization Pipeline – UROP Symposium

Kaiwen Mo

Research Mentor(s): Stefan Larson
Department or Program: Computer Science
Authors: Kaiwen Mo, Stefan Larson
Session: Session 2: 1:00pm-1:50pm
Poster: 33

Abstract

Safeguarding sensitive data in documents (e.g., Social Security numbers, phone numbers, home addresses, and birthdates) is a paramount concern. Data anonymization, a technique that removes or modifies personally identifiable information, is critical for protecting individual privacy and maintaining compliance with data protection regulations. In a data anonymization pipeline, we seek to replace sensitive entities within a document with fake but realistic data. One important element of this pipeline is identifying the font type and style of the original text so that the replacement data can be rendered as closely as possible to the original font style and format. However, existing tools cannot perform this identification effectively, even though it is fundamental to high-fidelity document anonymization. To address this gap, our research introduces a tool built on two zero-shot learning models, CLIP and ALIGN, and uses it to evaluate how accurately these models determine font type attributes. We additionally explore how word density (and potentially text location, color, size, and capitalization) affects the models' ability to correctly recognize font types. This work represents an important step toward robust and reliable document anonymization, with implications for privacy preservation across sectors including the legal, healthcare, and financial industries.
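
For readers unfamiliar with zero-shot classification, the scoring step used by CLIP-style models can be sketched as follows. This is a minimal illustration, not the tool described in the abstract: the font labels, the three-dimensional toy embeddings, and the function names `cosine` and `zero_shot_scores` are all assumptions made for the example. In practice the embeddings would come from a model's image and text encoders, with one text prompt per candidate font type.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(image_emb, prompt_embs, temperature=0.01):
    """Softmax over temperature-scaled image-to-prompt similarities.

    image_emb   -- embedding of the cropped text image (hypothetical input)
    prompt_embs -- dict mapping each candidate label to its prompt embedding
    temperature -- assumed scaling constant; real models learn their own
    """
    sims = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp((s - top) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy stand-ins for encoder outputs; the label set is hypothetical.
prompt_embs = {
    "serif": [0.9, 0.1, 0.0],
    "sans-serif": [0.1, 0.9, 0.0],
    "monospace": [0.0, 0.1, 0.9],
}
image_emb = [0.2, 0.8, 0.1]

scores = zero_shot_scores(image_emb, prompt_embs)
predicted = max(scores, key=scores.get)  # label with the highest probability
```

Because the candidate labels are supplied only as text prompts at inference time, no font-specific training data is needed, which is what makes the approach zero-shot.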
