RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing)

ishika chatterjee · Published in Analytics Vidhya · Feb 2, 2020 · 6 min read


Description:

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

Here are the 16 classes in the dataset:

letter, memo, email, file folder, form, handwritten, invoice, advertisement, budget, news article, presentation, scientific publication, questionnaire, resume, scientific report, specification
This dataset is a subset of the IIT-CDIP Test Collection 1.0 which is publicly available here. The file structure of this dataset is the same as in the IIT collection, so it is possible to refer to that dataset for OCR and additional metadata. The IIT-CDIP dataset is itself a subset of the Legacy Tobacco Document Library.

credits: https://www.cs.cmu.edu/~aharley/rvl-cdip/

Use of deep learning:

A region-based Deep Convolutional Neural Network framework is presented for document structure learning. The contribution of this work involves efficient training of region based classifiers and effective ensembling for document image classification. A primary level of ‘inter-domain’ transfer learning is used by exporting weights from a pre-trained VGG16 architecture on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region based influence modelling, a secondary level of ‘intra-domain’ transfer learning is used for rapid training of deep learning models for image segments. Finally, a stacked generalization based ensembling is utilized for combining the predictions of the base deep neural network models. The proposed method achieves state-of-the-art accuracy of 92.21% on the popular RVL-CDIP document image dataset, exceeding the benchmarks set by the existing algorithms.

Source of the data:

The data is downloaded from the site: https://www.cs.cmu.edu/~aharley/rvl-cdip/. I used Google Colab for this project; the data was downloaded directly onto the Colab platform with a command-line downloader (curl/wget), then extracted and distributed into the 16 class folders.
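As a rough sketch, the download-and-extract step looks something like the following. The archive URL below is a hypothetical placeholder (the real link is provided on the dataset page), and the file names are assumptions for illustration.

```python
import tarfile
import urllib.request

# Hypothetical placeholder URL -- substitute the actual archive link from
# https://www.cs.cmu.edu/~aharley/rvl-cdip/ (registration may be required).
ARCHIVE_URL = "https://example.com/rvl-cdip.tar.gz"
ARCHIVE_PATH = "rvl-cdip.tar.gz"

urllib.request.urlretrieve(ARCHIVE_URL, ARCHIVE_PATH)  # large download (tens of GB)
with tarfile.open(ARCHIVE_PATH, "r:gz") as tar:
    tar.extractall("data/")  # images plus the train/val/test label files
```

The label files shipped with the archive map each image path to a class index, and that mapping is what drives sorting the images into the 16 class folders.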

Data Overview:

The dataset consists of scanned grayscale images of documents from lawsuits against American tobacco companies, segregated into 16 categories or classes. It is already split into training, validation, and test sets containing 320,000, 40,000, and 40,000 images respectively. Uncompressed, the dataset size is ~100 GB.

Mapping into real world ml problem:

It is a multi-class classification problem: given a raw document image as input, we need to predict which of the 16 classes it belongs to.

Performance metric:

I have used accuracy as the performance metric because the images are distributed evenly across the 16 classes (25,000 per class), so accuracy is not skewed by class imbalance.

Problem statement:

Take a raw image as input and predict which class it belongs to as output.

First cut approach to the problem:

The first-cut approach to the problem:

1. Create 16 folders and distribute the training and cross-validation images into their respective class folders.
2. Use an image data generator over the whole (holistic), header, footer, left, and right part of every image in the train, validation, and test sets, train a model for each region, and save the models individually.
3. Perform feature extraction with all five models and stack the extracted features.
4. Build a neural network that takes the stacked features as input and train it to predict the labels.
5. Finally, build the pipeline: take a raw image as input, crop it into holistic, header, footer, left, and right regions (a cropping sketch is shown below), use the saved models to extract features, and test on raw unseen images to check whether the pipeline predicts the class correctly.
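Below is a minimal sketch of how a single page could be cropped into the five regions used by the region-based models. The crop proportions are assumptions for illustration and may differ from the exact crops used in the experiments.

```python
from PIL import Image

def five_regions(path, target=(224, 224)):
    """Crop one document scan into the five regions used by the region models."""
    img = Image.open(path).convert("L")           # grayscale document scan
    w, h = img.size
    regions = {
        "holistic": img,                          # whole page
        "header":   img.crop((0, 0, w, h // 3)),  # top third (assumed proportion)
        "footer":   img.crop((0, 2 * h // 3, w, h)),
        "left":     img.crop((0, 0, w // 2, h)),
        "right":    img.crop((w // 2, 0, w, h)),
    }
    # Each region is resized to the network's input size before training.
    return {name: r.resize(target) for name, r in regions.items()}
```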

Exploratory Data Analysis on the rvl-dataset:

The first task is to understand what is actually provided to us in the data.

From a bar plot of the class counts, it can be seen that the data is evenly distributed across the 16 classes.

Looking at the width of every image, it can be seen that the widths vary considerably over the observed range.
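A rough EDA sketch along these lines, assuming the images have already been organized into data/train/&lt;class&gt;/ folders (paths and per-class sampling are illustrative):

```python
import os
from PIL import Image
import matplotlib.pyplot as plt

train_dir = "data/train"
classes = sorted(os.listdir(train_dir))
counts = [len(os.listdir(os.path.join(train_dir, c))) for c in classes]

# Class distribution: should be roughly equal across the 16 classes.
plt.figure()
plt.bar(classes, counts)
plt.xticks(rotation=90)
plt.title("Images per class")

# Image-width distribution, sampled from a subset of each class.
widths = []
for c in classes:
    for f in os.listdir(os.path.join(train_dir, c))[:200]:
        with Image.open(os.path.join(train_dir, c, f)) as im:
            widths.append(im.size[0])

plt.figure()
plt.hist(widths, bins=50)
plt.title("Distribution of image widths")
plt.show()
```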

Explanation of the models in my case study:

DCNNs are currently one of the most popular models for deep learning, and I have used VGG16 for this purpose. VGG16 performs better on document classification than other DCNN models, which is why it was selected as the base classifier for this task. The Adam optimizer is used for training, along with a learning-rate decay tuned based on accuracy on the validation set. The initial weights are transferred from a model trained on the ImageNet object-recognition dataset; for assessing different architectures, model weights were initialized both with ImageNet-1K weights and with random weights. A small batch size of 32 was used to accommodate the large training images; it could have been higher, but the results on the test set held up well. First, the fully connected layers of VGG16 are removed, the last two layers are made trainable, and three more layers are added on top.
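A minimal sketch of this base model, assuming a 224x224 input; the hidden-layer width and dropout rate are illustrative assumptions rather than the exact values used in the experiments:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model

# Convolutional base with ImageNet weights, fully connected top removed.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze everything except the last two layers of the base.
for layer in base.layers[:-2]:
    layer.trainable = False

# New layers on top of the frozen base.
x = Flatten()(base.output)
x = Dense(1024, activation="relu")(x)     # assumed width
x = Dropout(0.5)(x)                       # dropout to control overfitting
out = Dense(16, activation="softmax")(x)  # 16 document classes

model = Model(inputs=base.input, outputs=out)
```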

In the last layer I used softmax, and the loss function is categorical cross-entropy, because this is a multi-class classification problem. Dropout is used in the model to control overfitting. Adam performed better than RMSprop, Adagrad, and Adadelta, so I used Adam. I then used TensorBoard to understand the performance of the models, and ModelCheckpoint to save the model with the best weights. After saving the weights, I evaluated the model on the data generator.
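Putting the training setup together, a hedged sketch (continuing from the `model` built in the previous sketch) looks like this; paths, epochs, and the learning-rate-decay parameters are assumptions:

```python
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

gen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = gen.flow_from_directory("data/train", target_size=(224, 224),
                                    batch_size=32, class_mode="categorical")
val_gen = gen.flow_from_directory("data/val", target_size=(224, 224),
                                  batch_size=32, class_mode="categorical")

callbacks = [
    TensorBoard(log_dir="logs/holistic"),
    # Keep only the weights with the best validation accuracy.
    ModelCheckpoint("holistic_best.h5", monitor="val_accuracy", save_best_only=True),
    # Validation-accuracy-based learning-rate decay ("val_acc" in older Keras).
    ReduceLROnPlateau(monitor="val_accuracy", factor=0.5, patience=2),
]
model.fit(train_gen, validation_data=val_gen, epochs=10, callbacks=callbacks)
```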

After that, the model is serialized to JSON. While solving the other portions (header, footer, left, and right), I reused the same architecture from the saved checkpoint. After training the remaining regions in the same way, the features are extracted using the predict generator and saved. In this way, the train, validation, and test features of all five parts (holistic, header, footer, left, and right) are extracted and saved.
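A sketch of the feature-extraction step for one region, with illustrative file names; older Keras versions expose this as predict_generator, newer ones simply call predict on the generator:

```python
import numpy as np
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

region_model = load_model("holistic_best.h5")   # saved by ModelCheckpoint above
# Use the penultimate layer's output as the feature representation.
feature_extractor = Model(inputs=region_model.input,
                          outputs=region_model.layers[-2].output)

gen = ImageDataGenerator(rescale=1.0 / 255)
# shuffle=False keeps features aligned with generator.classes for later stacking.
train_gen = gen.flow_from_directory("data/train", target_size=(224, 224),
                                    batch_size=32, class_mode="categorical",
                                    shuffle=False)

features = feature_extractor.predict(train_gen)   # predict_generator in older Keras
np.save("holistic_train_features.npy", features)
np.save("train_labels.npy", train_gen.classes)
```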

We then stack all the extracted features and save the stacked features as checkpoints.

After that, a neural network is designed and trained on the extracted training features, and then evaluated on the extracted test features. The accuracy is found to be 91.85%.
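A sketch of the stacking and meta-classifier step, with illustrative file names, layer sizes, and epoch counts:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical

# Concatenate the saved per-region features into one stacked feature vector.
regions = ["holistic", "header", "footer", "left", "right"]
X_train = np.hstack([np.load(f"{r}_train_features.npy") for r in regions])
X_test = np.hstack([np.load(f"{r}_test_features.npy") for r in regions])
y_train = to_categorical(np.load("train_labels.npy"), num_classes=16)
y_test = to_categorical(np.load("test_labels.npy"), num_classes=16)

# Small neural network as the meta-classifier over the stacked features.
inp = Input(shape=(X_train.shape[1],))
x = Dense(512, activation="relu")(inp)    # assumed width
x = Dropout(0.5)(x)
out = Dense(16, activation="softmax")(x)

meta = Model(inp, out)
meta.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
meta.fit(X_train, y_train, epochs=20, batch_size=256, validation_split=0.1)
print(meta.evaluate(X_test, y_test))      # reported test accuracy was 91.85%
```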

Future work:

In the future, I can enhance the workflow using distributed computing, where the entire pipeline can be run in one go instead of saving intermediate models, which is not possible in Colab or on an entry-level desktop/laptop; this would improve efficiency. The models can be trained across multiple GPUs using data parallelism. I can also try training the models for more epochs, which may improve the results.

Reference:

- https://www.cs.cmu.edu/~aharley/rvl-cdip/

- https://paperswithcode.com/sota/document-image-classification-on-rvl-cdip

- https://github.com/jpcrum/Final-Project-Group5

Github:

Linkedin:

https://www.linkedin.com/in/ishika-chatterjee-9a9966120/
