Addressing Data Annotation Errors and Leakage in the RVL-CDIP Document Classification Corpus – UROP Symposium

Cyrus Desai

Pronouns: he/him

Research Mentor(s): Stefan Larson
Research Mentor School/College/Department: DryvIQ / NonUM
Program:
Authors: Cyrus Desai, Sam Desai, Azfar Mohamed, Yixin Yuan, Stefan Larson
Session: Session 5: 2:40 pm – 3:30 pm
Poster: 25

Abstract

The RVL-CDIP dataset is commonly used to train and test machine learning models for document classification. Our previous research found a significant amount of label noise in the dataset, estimated at between 1.6% and 16.9% depending on document category, as well as substantial overlap between the test and training data (Larson et al., 2023). These issues can yield document classification models that misclassify documents and report inflated accuracy scores. We propose RVL-CDIP++, a version of RVL-CDIP with these errors amended. We manually labeled batches of documents in RVL-CDIP to flag label errors. We then used these results to train unsupervised outlier detection models, which were able to identify incorrectly labeled documents throughout the dataset. Additionally, we computed cosine similarity scores between image embeddings of documents in the test and training sets to identify and eliminate cases of test-train overlap. Our main contribution is RVL-CDIP++, a set of “cleaned” versions of RVL-CDIP in which each version minimizes a distinct issue, such as labeling errors or test-train overlap. We anticipate that state-of-the-art models will achieve higher accuracy scores due to the removal of label errors. While benchmarking document classifiers on RVL-CDIP may have carried risk, we anticipate that RVL-CDIP++ will be an improved benchmark for classifier models. By providing a more robust dataset that addresses the aforementioned errors, RVL-CDIP++ creates an opportunity to improve past and future document classification models.
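
The test-train overlap check described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the embedding model or the similarity cutoff, so the toy three-dimensional vectors, the `find_overlap` helper, and the 0.99 threshold are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def find_overlap(test_embeddings, train_embeddings, threshold=0.99):
    """Flag test documents whose closest training document is a near-duplicate.

    Both arguments map a document ID to its embedding vector.
    The threshold is a hypothetical cutoff, not a value from the study.
    Returns a list of (test_id, train_id, score) tuples.
    """
    flagged = []
    for test_id, test_vec in test_embeddings.items():
        best_id, best_score = None, -1.0
        for train_id, train_vec in train_embeddings.items():
            score = cosine_similarity(test_vec, train_vec)
            if score > best_score:
                best_id, best_score = train_id, score
        if best_id is not None and best_score >= threshold:
            flagged.append((test_id, best_id, best_score))
    return flagged


# Toy embeddings standing in for real image embeddings (assumed data).
train = {"train_0": [1.0, 0.0, 0.0], "train_1": [0.0, 1.0, 0.0]}
test = {"test_0": [2.0, 0.0, 0.0], "test_1": [0.0, 0.0, 1.0]}

overlap = find_overlap(test, train)
for test_id, train_id, score in overlap:
    print(f"{test_id} overlaps with {train_id} (cosine similarity {score:.3f})")
```

In this toy example only `test_0` is flagged, because its embedding points in the same direction as that of `train_0`. The nested loop compares every test document against every training document, so at the scale of RVL-CDIP a vectorized or approximate nearest-neighbor search would likely be needed in its place.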

Engineering, Physical Sciences
