Detection and Validation of Sensitive Text Entities Within Documents – UROP Symposium

Reena Sharif

Research Mentor(s): Stefan Larson
Department or Program: Computer Science
Authors: Reena Sharif, Stefan Larson
Session: Session 2: 1:00pm-1:50pm
Poster: 47

Abstract

Text processing and machine learning models can be used to detect sensitive data, including phone numbers, Social Security numbers, and birth dates, in a wide variety of documents. However, public datasets such as RVL-CDIP contain thousands of sensitive entities that are readily available online, posing a major privacy risk. Regular expressions (regexes), a pattern-based approach, can identify potentially sensitive entities by the patterns they follow within documents. Yet regexes can produce more false positives than is acceptable, which leads to distrust among human users. We hypothesize that additional validator functions (e.g., checking whether the area code of a phone number is valid) can separate truly valid entities from the full set of regex matches, reducing the percentage of false positives. In this project, we designed validators that target the regexes for specific entities and evaluated them against hundreds of test cases. Our proposed validators are intended to improve user trust when handling personal data online.
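
The regex-plus-validator idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the pattern, the function names (`PHONE_RE`, `is_plausible_nanp`, `find_phone_numbers`), and the choice of checks are assumptions made for illustration. The validator applies only simplified structural rules of the North American Numbering Plan (area codes and exchange codes do not begin with 0 or 1, and codes of the form N11 are reserved for services), rather than a lookup against a list of assigned area codes.

```python
import re
from typing import List

# Hypothetical candidate pattern: 3-3-4 digits with optional separators
# and optional parentheses around the first group.
PHONE_RE = re.compile(
    r"(?<!\d)\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})(?!\d)"
)


def is_plausible_nanp(area: str, exchange: str) -> bool:
    """Simplified structural check; not a lookup of assigned area codes."""
    # Area codes and exchange codes never start with 0 or 1.
    if area[0] in "01" or exchange[0] in "01":
        return False
    # Codes such as 411 or 911 are service codes, not area codes.
    if area[1:] == "11":
        return False
    return True


def find_phone_numbers(text: str) -> List[str]:
    """Return regex matches that also pass the validator."""
    return [
        m.group(0)
        for m in PHONE_RE.finditer(text)
        if is_plausible_nanp(m.group(1), m.group(2))
    ]


if __name__ == "__main__":
    sample = "Call 734-555-0123 or see invoice 012-345-6789."
    print(find_phone_numbers(sample))  # ['734-555-0123']
```

Both strings in the sample match the regex, but only the first passes the validator, so the second is discarded as a likely false positive. A production validator would presumably need stricter rules per entity type.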
