Addressing Data Annotation Errors and Leakage in the RVL-CDIP Document Classification Corpus

Sam Desai

Pronouns: he/him

Research Mentor(s): Stefan Larson
Research Mentor School/College/Department: DryvIQ / NonUM
Authors: Cyrus Desai, Sam Desai, Azfar Mohamed, Yixin Yuan, Stefan Larson
Session: Session 5: 2:40 pm – 3:30 pm
Poster: 25

Abstract

Document classification is the task of assigning documents to categories such as resumes and invoices. Classifiers for this task typically use machine learning models, which must be trained and evaluated on large datasets. The RVL-CDIP dataset is an industry-standard collection of documents often used for this purpose. However, prior research shows that the dataset contains a significant number of labeling errors as well as overlap between its train and test splits (Larson et al., 2023). As a result, document classifiers evaluated on this noisy dataset may not be as accurate as reported. In this work, we manually labeled documents and applied automated methods (e.g., outlier detection) to find and remedy errors, and we report error rates for each document category, which range from 1.46% to 23.5%. We also used several techniques (e.g., image hashing, pre-trained CNNs, local feature matching) to detect near-duplicate documents across the train and test splits. Using this analysis, we created a cleaned dataset, RVL-CDIP++, with significantly fewer errors and duplicates, and evaluated it on several popular document classification models (e.g., DiT, LayoutLM, BERT). This will help the field of document classification by providing a more accurate benchmark for comparing models.
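To illustrate one of the duplicate-detection techniques mentioned above, here is a minimal sketch of average-hash (aHash) image hashing, one common form of perceptual hashing. It assumes document scans have already been downscaled to an 8x8 grayscale grid (real pipelines would use an image library for that step); the function names and toy data below are illustrative, not taken from the paper.

```python
def average_hash(pixels):
    """Return a 64-bit hash of an 8x8 grayscale grid:
    bit i is 1 if pixel i is at or above the mean intensity."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Two toy 8x8 "scans" of the same document, differing only by a
# small brightness shift, plus one unrelated page.
doc_a = [[200 if (r + c) % 2 else 30 for c in range(8)] for r in range(8)]
doc_b = [[p + 5 for p in row] for row in doc_a]
unrelated = [[(r * 31 + c * 7) % 256 for c in range(8)] for r in range(8)]

d_near = hamming(average_hash(doc_a), average_hash(doc_b))
d_far = hamming(average_hash(doc_a), average_hash(unrelated))
```

Near-duplicates produce a much smaller Hamming distance than unrelated pages, so pairs of train/test documents whose distance falls below a chosen threshold can be flagged for review. The hash is robust to uniform brightness changes because each bit is relative to the image's own mean intensity.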

Engineering, Physical Sciences

lsa logoum logo