Addressing Data Annotation Errors and Leakage in the RVL-CDIP Document Classification Corpus – UROP Symposium

Yixin Yuan

Pronouns: She/her

Research Mentor(s): Stefan Larson
Research Mentor School/College/Department: DryvIQ / NonUM
Authors: Yixin Yuan, Cyrus Desai, Sam Desai, Azfar Mohamed, Stefan Larson
Session: Session 5: 2:40 pm – 3:30 pm
Poster: 25

Abstract

The RVL-CDIP corpus is a popular dataset for benchmarking image-based document classification machine learning models. However, prior work has estimated that RVL-CDIP contains large numbers of label errors, as well as duplicate documents shared between its train and test splits. This is problematic because modern machine learning models are capable of overfitting to label noise, and duplicates across splits can inflate model performance scores. In this project, we seek to thoroughly analyze and quantify the presence of label errors and duplicates in RVL-CDIP. For label errors, we exhaustively catalog the full RVL-CDIP corpus (400,000 documents) and filter out erroneously and ambiguously labeled documents, finding that label error rates range between 1% and 25%, depending on the document category. For duplicate data, we develop and benchmark several duplicate detection algorithms incorporating both textual (e.g., string and text-embedding similarity) and image (e.g., image-embedding similarity and local feature matching) modalities. Finally, we release several updated versions of the RVL-CDIP dataset with label errors fixed and duplicate documents removed, and benchmark machine learning classifiers on these updated versions. Our work highlights the importance of data quality in the document understanding field, and the techniques we implement can be applied beyond the RVL-CDIP dataset.
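As a minimal illustration of the string-similarity modality mentioned above, near-duplicate documents can be flagged by comparing character shingles of their OCR'd text with Jaccard similarity. This sketch is a hypothetical stand-in, not the project's actual benchmarked algorithms, and the `threshold` and shingle size `k` are assumed values:

```python
def shingles(text, k=5):
    """Lowercase, whitespace-normalized character k-grams of a document's text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_duplicates(docs, threshold=0.8, k=5):
    """Return index pairs (i, j) whose shingle similarity meets the threshold.

    O(n^2) pairwise comparison -- fine for a sketch, but a 400,000-document
    corpus would need an approximate method such as MinHash/LSH.
    """
    sets = [shingles(t, k) for t in docs]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs

docs = [
    "Invoice number 12345 total due $100",
    "Invoice number 12345 total due $100.",   # near-duplicate of docs[0]
    "Quarterly earnings report for 2004",
]
print(find_duplicates(docs))  # only the first two documents are flagged
```

Character shingles are more forgiving of OCR noise than exact string matching, which is why a similarity threshold rather than equality is used here.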

Engineering, Physical Sciences
