Street Address Validation for Sensitive Data Detection in Documents – UROP Symposium

Alfredo Cruz

Research Mentor(s): Stefan Larson
Department or Program: Computer Science
Authors: Alfredo Cruz, Stefan Larson
Session: Session 2: 1:00pm-1:50pm
Poster: 9

Abstract

Training capable machine learning models requires large datasets. However, privacy issues arise when these datasets contain sensitive information such as phone numbers, email addresses, Social Security numbers, and physical addresses. In particular, traditional street address detectors, which rely on regular expressions for pattern matching, often yield false positives. In our work, we implement a new pipeline that combines Pyap, a more flexible address detector, with the Smarty address validation API, which verifies that a detected candidate is an actual address. This approach reduces false positives while still identifying genuine addresses in datasets. Compared with regex-only detectors, the combined method mitigates the inherent limitations of pattern-based detection and reduces the risk of exposing sensitive information when training machine learning models.
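
The detect-then-validate idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular expression below is a hypothetical stand-in for a permissive detector such as Pyap, and `validate_stub` with its `KNOWN_ADDRESSES` set is a hypothetical stand-in for a call to an address validation service such as Smarty. All function and variable names are assumptions made for illustration.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for a permissive detector such as Pyap:
# a house number, one to four words, then a common street suffix.
CANDIDATE_PATTERN = re.compile(
    r"\b\d{1,5}(?:\s+[A-Za-z]+){1,4}\s+"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b"
)

# Hypothetical stand-in for an external validation service such as Smarty.
# A real pipeline would send each candidate to the service instead.
KNOWN_ADDRESSES = {"123 main street"}


def detect_candidates(text: str) -> List[str]:
    """Stage 1: return every span that looks like a street address."""
    return [match.group(0) for match in CANDIDATE_PATTERN.finditer(text)]


def validate_stub(candidate: str) -> bool:
    """Stage 2 (stub): accept only candidates found in the known set."""
    return candidate.lower() in KNOWN_ADDRESSES


def find_validated_addresses(
    text: str,
    detect: Callable[[str], List[str]] = detect_candidates,
    validate: Callable[[str], bool] = validate_stub,
) -> List[str]:
    """Keep only the detected candidates that the validator confirms."""
    return [candidate for candidate in detect(text) if validate(candidate)]


sample = (
    "Ship to 123 Main Street by Friday. "
    "We sold 12 tickets for Sesame Street last week."
)
candidates = detect_candidates(sample)
validated = find_validated_addresses(sample)
print(candidates)
print(validated)
```

In this sketch the pattern-based stage flags both the shipping address and the phrase about ticket sales, and only the validation stage discards the latter. Passing the detector and validator as parameters keeps the two stages independent, so either stand-in could be swapped for a real component.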
