An Enhanced Document Referential, Classification And Retention System using Multinomial Naive Bayes Algorithm

Abstract: Securing organizational data is an important information management issue that continues to pose significant challenges for organizations, especially in developing countries. Several organizations have tried and failed to develop a reliable solution that centrally creates and manages a document referential, places documents in their respective categories and seamlessly applies retention to documents at end-of-life. Users have repeatedly complained about missing documents, premature destruction of important documents, top-secret documents being found in the wrong hands and documents being placed in the wrong categories because organizational policies are not well defined. We developed an enhanced online document referential, classification and retention system to address these challenges. We used the object-oriented analysis and design methodology (OOADM) and implemented the system with Microsoft ASP.Net and Python. The system provides a platform for easy creation of a document referential, placement of documents in their respective categories and automatic application of retention to documents at the end of their lifespan. The overall accuracy of the classification model is 88%, which indicates that most predictions made by the model are correct. The specificity of the individual classes ranges between ninety-eight (98) and one hundred (100) percent, and their precision, recall and f1-scores are above 0.5, which is taken as adequate for prediction. This work could benefit small, medium and large organizations.


I. INTRODUCTION
Placing documents in categories is a rare practice among organizations in developing countries. We tend to churn out large documents on a daily basis with little or no attention to what happens to each document throughout its life span. This has exposed many organizations, in both the private and public sectors, to the danger of confidential documents reaching the wrong hands. In some organizations, documents are destroyed too early or kept longer than required, thereby exposing the organization to litigation by concerned parties or fines by regulatory authorities. One way of addressing this problem professionally is to categorize organizational documents properly, develop policies that govern each category and enforce these policies consistently.
Categorization is the process of placing documents into classes. A category is chosen by considering the relation between the subject of the category and the document belonging to it [1]. A great deal of information is generated and stored in various hardcopy and electronic forms. These documents need to be classified properly so that the correct retention policy can be applied to each one. Document retention is the holding of records or documents for a period for further use [2].
Before a document can be placed in a class, a document type referential (or schedule) must first be created. [3] observed that a good policy comprises a schedule that has the retention periods for all document types and a framework that is used in administering it. Document referential and retention concerns a cycle of organizational activities which include:
i. The creation of a document type referential that clearly states the various categories of documents and how to handle documents in a particular category;
ii. The acquisition of documents from one or more sources and placement of such documents in the correct category;
iii. Their ultimate disposition through archiving or deletion.
In many organizations, documents outlive those who created them. The person who created a document may be transferred to another location, dismissed or may even resign. Another staff member who picks up that document needs a proper understanding of the type of document being handled and, with the help of a referential, must apply the correct company policy to it.
As the documents and database records generated by organizations increase enormously, the complexity of handling them effectively also increases. There is no single tool that can standardize document types, place documents into their appropriate categories and apply organizational retention policies to them, especially for documents in storage media and records in multiple databases on varying DBMS platforms. Some work has been done to overcome these problems, but much is still required. We have developed an enhanced system for document referential, classification and retention. The system targets organizations of different sizes, especially big organizations that regularly churn out large electronic documents and database records.

International Journal of Engineering Research & Technology (IJERT)
ISSN: 2278-0181 http://www.ijert.org

II. RELATED WORKS
[4] developed an automated system that detects articles relevant to disease outbreaks using machine learning classifiers. The experiment recorded daily average areas under the ROC curve of 0.841 for the Naive Bayes classifier and 0.836 for the SVM classifier (equivalent to 95% confidence interval). The experiment did not explore other classification algorithms and was not tested on large datasets.
[5] proposed a two-phased feature selection method and a Naive Bayes classifier for Indonesian news classification. The method showed 86% accuracy and lowered the complexity of Maximal Marginal Relevance for Feature Selection (MMR-FS). [6] reviewed supervised machine learning classification techniques, based on a number of machine learning application-oriented papers. Naive Bayes ranked high for needing less training data, using little storage space, being robust to missing values and training quickly.
[7] carried out a study that compared classification algorithms. The work concluded that the performance of an algorithm depends on the domain and that no one algorithm fits all classification domains, although it recommended the Random Tree and Logistic Model Tree algorithms. The study suffered from limited coverage. [8] designed a framework based on generating mock examples for self-labeled classification. This framework improved the classification capabilities of self-labeled techniques.
[9] proposed a framework for records retention in relational database systems. This framework applied retention to views in the database and has no automatic classification. The use of database views has deficiencies: not all views are updateable, and integrity preservation, cost and the handling of existing databases remain concerns. It also considered only records in databases and does not support multiple DBMS platforms. [10] extended RDBS to automatically enforce privacy policies. The work monitored privacy obligations enterprise-wide using an elaborate central obligation monitoring system, which provides a systematic way of scheduling events throughout all corporate data repositories such that executing these events ensures compliance with all privacy obligations. It did not cover retention on other records.
The goal of this paper is to develop an automated document referential, classification and retention system that standardizes document categories, places documents into appropriate categories and applies retention to these documents using the organizational policies defined in the referential. The system automatically applies retention to records in multiple databases on different DBMS platforms (SQL Server and Oracle) based on the classification and retention schedule. It also places documents into the standardized document types using user-defined options and the multinomial naive Bayes algorithm.

III. METHODOLOGY
The methodology adopted by the researchers is the Object-Oriented Analysis and Design Methodology (OOADM). OOADM involves identifying the critical objects of the document referential, classification and retention system, breaking them down into smaller sub-systems and recursively applying software processing to the identified objects.

A. Architecture of the Proposed System
The architectural design in Fig. 1 describes the components integrated to bring about the workings of the document referential and retention application. It captures the major functional building blocks needed to understand the process of building the document retention system. These components are explained below:
Generate Contents: This refers to the collection of different inputs made into the system by staff and non-staff of the organization. It involves records stored in databases and conventional files, such as Word and Excel documents, stored in directories.
Document Referential or Retention Schedule: This is an elaborate set of stored document types which the classification and retention modules reference. It contains government and organizational policies on what should happen to a document during its life cycle. For example, the final treatment of a document at end-of-life could be to destroy the document or to archive it for historical or legal purposes.

Multiple Data Source: The data source describes how the document retention module gets its data. It is a connection setup that enables the retention module to access the records in databases and conventional files on disk.
Files in Directory: This refers to documents in various file formats that are stored in known paths on disk.
Records in Database: This refers to relational records stored in various DBMS platforms such as SQL Server and Oracle. The retention system captures the connection strings to these records and applies retention to datasets at end-of-life.
Document Classification: This module places each document into a particular document type or class using user-defined options or the naive Bayes algorithm. The document classification service performs the calculation that places a document into an existing category using the multinomial naive Bayes model.
Retention Information and Logs: This contains basic information about the retention system and a log of the activities going on within the system.
View Retention Reports: Reports can be generated using the logs and other information kept in the retention system database itself.

Fig. 2 is the use case diagram, which shows the list of actions or event steps that define the interaction between the actors and the enhanced document referential, classification and retention system.
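
For reference, the multinomial naive Bayes model used by the Document Classification component is commonly written in the textbook form below. This is the standard formulation and not necessarily the notation of the original system. The symbols are assumptions introduced here: C is the set of document types, P(c) is the prior of class c, t_1 ... t_{n_d} are the tokens of document d, T_{ct} is the number of occurrences of term t in the training documents of class c, V is the vocabulary, and add-one (Laplace) smoothing is assumed.

```latex
\hat{c} \;=\; \arg\max_{c \in C} \left[ \log P(c) + \sum_{k=1}^{n_d} \log P(t_k \mid c) \right],
\qquad
P(t \mid c) \;=\; \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + |V|}
```

The logarithms turn the product of many small probabilities into a sum, which avoids numerical underflow on long documents.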

B. Use Case Diagrams of the Proposed System
The user generates documents while performing day-to-day activities and assigns preferences (if required). The classifier places these documents into document types based on the user preferences and the retention schedule created by the administrator or an Information Management Officer (IMO). The retention algorithm is triggered daily at a scheduled time to identify documents that have reached end-of-life (or met certain criteria) and decides whether to delete, exclude, archive or recommend them for migration into another format or a new platform. The use case trigger for the proposed system is the user's activity of generating documents and assigning preferences to them where necessary.

C. Data Flow Diagram of the Proposed System
We used data flow diagrams to show how data flows within the enhanced document referential and retention system. Fig. 3 represents the proposed system, which has one external entity, the Organization (the user of the system); the data flowing in and out of the system are the document details. Fig. 4 shows the Level-1 DFD, which models the details of the proposed system and shows how it is divided into processes. Each process handles one or more of the data flows, either to or from the Organization.

D. Algorithm of the Proposed System
The steps used in the automated document referential, classification and retention system include the following:
5. Establish a connection to the document retention system using a multiple data source interface that allows conventional files, database records and different file formats in directories.
6. Pass the contents through a Naive Bayes classifier using the steps below [11].
b. If 'yes', exclude the record, delete the record permanently, archive it for historical purposes or migrate it into a new format or technology, depending on the document class and user preferences.
c. If 'no', skip the document and continue.
8. Log the retention activities.
9. Allow the user (administrator) to generate reports based on the retention activities.
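
Items b, c and 8 above describe a per-document decision followed by logging. A minimal Python sketch of that decision is given below. It assumes that 'yes' in item b means the document has reached end-of-life, since the condition itself is not stated in the list. The names `RetentionRule`, `end_of_life`, `decide` and `run_retention`, the treatment labels and the log line format are all hypothetical and are not taken from the implemented system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical labels for the actions named in item b.
TREATMENTS = {"exclude", "delete", "archive", "migrate"}


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical referential entry: how long a class is kept and what happens after."""
    doc_class: str
    retention_years: int
    final_treatment: str  # one of TREATMENTS


def end_of_life(created: date, years: int) -> date:
    """Return the date on which a document created on `created` expires."""
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return created.replace(year=created.year + years, day=28)


def decide(created: date, rule: RetentionRule, today: date,
           user_preference: Optional[str] = None) -> str:
    """Return the action for one document: a treatment, or 'skip' if not yet due."""
    if today < end_of_life(created, rule.retention_years):
        return "skip"  # item c: not due, continue with the next document
    action = user_preference or rule.final_treatment  # item b
    if action not in TREATMENTS:
        raise ValueError(f"unknown treatment: {action}")
    return action


def run_retention(docs: List[Tuple[str, date, RetentionRule]], today: date) -> List[str]:
    """Apply `decide` to every document and return the log lines (item 8)."""
    log = []
    for name, created, rule in docs:
        action = decide(created, rule, today)
        log.append(f"{today.isoformat()} {name} {rule.doc_class} {action}")
    return log
```

Keeping the decision in a pure function that receives `today` as an argument makes the daily run easy to test with fixed dates.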

E. Program Flowchart
The program flowchart in Fig. 5 shows the program structure, logic flow and operations performed by the proposed system.

The system was implemented with Microsoft ASP.Net on the .Net framework, which uses the Common Language Runtime architecture to manage the execution of code. Python was used to develop the classification service. The classification service is a Windows service that performs the automatic classification aspect of the system using a multinomial naive Bayes classifier. JavaScript, jQuery, CSS, HTML, JSON and Bootstrap technologies were used for scripting, styling, rendering of documents and remote communication between the application and the databases.
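
To make the role of the classification service concrete, the following is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is an illustration under stated assumptions, not the code of the implemented service: the tokenizer, the add-one (Laplace) smoothing, the class name `MultinomialNB`, the toy training texts and the class codes `DT-FIN` and `DT-HR` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens only (a deliberately simple choice)."""
    return re.findall(r"[a-z]+", text.lower())


class MultinomialNB:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples: List[Tuple[str, str]]) -> "MultinomialNB":
        """Train on (text, class_label) pairs."""
        self.class_docs = Counter()              # number of documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocab = set()
        for text, label in samples:
            tokens = tokenize(text)
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = sum(self.class_docs.values())
        return self

    def log_scores(self, text: str) -> Dict[str, float]:
        """Return log P(c) plus the sum of log P(t|c) for every class c."""
        # Terms never seen in training carry no class information and are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text: str) -> str:
        """Return the class with the highest log score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data; the class codes are placeholders, not real referential entries.
training = [
    ("invoice payment amount due", "DT-FIN"),
    ("payment receipt invoice total", "DT-FIN"),
    ("employee leave request approval", "DT-HR"),
    ("staff recruitment interview employee", "DT-HR"),
]
model = MultinomialNB().fit(training)
predicted = model.predict("invoice for payment")  # "DT-FIN" on this toy data
```

In a service of this kind, the predicted class would only be a suggestion until the user confirms it.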

A. System Testing
The accuracy of the program was tested with varying data. The 20NewsGroup dataset [12], with duplicates removed, was used as the training and test dataset after preprocessing. Fig. 6 shows that when the document "C:\...\Nora Roberts -Loving Haley.doc" was selected from the Open-File dialog box, the status was "Unclassified. Predicted Class is DT-0016". On clicking the Classify button and responding 'Yes' to the ensuing dialog box, the document was classified, as shown in Fig. 8. Fig. 9 indicates that the status of the same document changed to "Classified. DT Code: DT-0016. Value: L" when it was selected again in the system. This document will be deleted from the system in 10 years because the conservation value for this type of document is 10 years. Fig. 10 shows the implemented document referential. Users can search for a particular document type using the search button or the departmental dropdown list.
The system has a configuration page for records stored in different databases on different DBMS platforms. It captures the connection strings and other information that enable the retention service to know how to handle such records. Many databases from two different DBMS platforms, Oracle and SQL Server, were configured and activated for retention using the database records retention configuration module. The module, shown in Fig. 11, was used to configure a banking application that uses the 'NKPO.mdf' database. The system pulled all the existing tables from the database and, for each table selected, listed all the date fields (column names) and identified the primary key. When the Save button was clicked, the setup request was saved in the retention database and listed in a grid. Fig. 12 shows that when the Delete/Exempt Record link was clicked, the system displayed all the activated database records retention requests. Clicking the Select link beside a request displayed the details of the request, its referential details and all the records that had reached end-of-life and were pending deletion or exemption. When the link "Click here to view or exempt records" was clicked, the Exempt Records page popped up. Fig. 13 shows two records that were exempted out of the 142914 records that had reached end-of-life. When retention was applied, the other 142912 records were purged from the database, leaving only the two exempted records.
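
The exempt-then-purge behaviour described above can be sketched with Python's built-in `sqlite3` module. SQLite stands in for SQL Server and Oracle here purely so that the example is self-contained. The table and column names (`transactions`, `id`, `posted_on`, `retention_exemptions`), the helper `purge_expired` and the sample rows are hypothetical and do not come from the configured banking database.

```python
import sqlite3
from datetime import date


def purge_expired(conn: sqlite3.Connection, table: str, pk: str, date_col: str,
                  cutoff: date) -> int:
    """Delete rows dated before `cutoff` unless their key is listed as exempt.

    `table`, `pk` and `date_col` are assumed to come from a saved, trusted
    retention configuration; the cutoff and table name are bound as parameters.
    """
    sql = (
        f"DELETE FROM {table} "
        f"WHERE {date_col} < ? "
        f"AND {pk} NOT IN ("
        f"SELECT record_id FROM retention_exemptions WHERE table_name = ?)"
    )
    cur = conn.execute(sql, (cutoff.isoformat(), table))
    conn.commit()
    return cur.rowcount  # number of rows purged


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, posted_on TEXT)")
conn.execute(
    "CREATE TABLE retention_exemptions (table_name TEXT, record_id INTEGER, reason TEXT)"
)
# Five expired rows and one recent row; dates are ISO strings so they sort correctly.
rows = [(i, "2005-03-01") for i in range(1, 6)] + [(6, "2024-01-15")]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
# Exempt two of the expired rows, each with a recorded reason.
conn.executemany(
    "INSERT INTO retention_exemptions VALUES (?, ?, ?)",
    [("transactions", 2, "pending litigation"), ("transactions", 4, "regulatory audit")],
)

deleted = purge_expired(conn, "transactions", "id", "posted_on", date(2015, 1, 1))
remaining = [r[0] for r in conn.execute("SELECT id FROM transactions ORDER BY id")]
```

In this toy run three expired rows are purged, while the two exempted rows and the recent row remain, which mirrors the two surviving records in the test above.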
The tests show that the system was able to configure conventional documents and database records in different databases on varying DBMS platforms, and to apply retention to these records at end-of-life. Users were also able to exempt some records, give reasons for the exemptions and indicate new expiration dates. When searches were performed on the referential, users were able to locate the company policies and descriptions that pertain to the various document categories the company created. This indicates that the system runs properly and meets its objectives.

Table 1 and Table 2 summarize the outcomes observed for the test documents classified using the proposed document referential and retention system. The 20NewsGroup dataset version used excluded cross-posts and included only the "From" and "Subject" headers. It has 18828 documents (newsgroup posts) on twenty (20) topics [12]. The documents were split into two subsets, the training set and the test set. Table 1 is the confusion matrix of the enhanced system, which shows the performance of the classification model on the test dataset. It sets the actual classes against the predicted classes, making it easy to see at a glance how many documents the model predicted correctly or incorrectly. Table 2 summarizes the outcomes in the confusion matrix and evaluates the model using the metrics precision, recall, f1-score, specificity and accuracy. The true positive (TP) values show that 1643 of the 1876 documents were correctly predicted. This gives an overall accuracy of eighty-eight (88) percent, that is, the percentage of documents placed in their exact document classes out of the total number of documents, and indicates that most predictions from the model are correct.
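
As a worked check of the accuracy figure, the short Python sketch below computes overall accuracy as the sum of the diagonal of a confusion matrix divided by the total number of documents, and then applies the same ratio to the two totals reported in the text. The function name `accuracy` and the small 2x2 matrix are illustrative assumptions; the values of Table 1 are not reproduced here.

```python
from typing import List


def accuracy(confusion: List[List[int]]) -> float:
    """Overall accuracy: correctly classified documents (the diagonal) over all documents."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Illustrative 2-class matrix: rows are actual classes, columns are predicted classes.
toy = [[2, 1],
       [0, 3]]
toy_accuracy = accuracy(toy)  # 5 correct out of 6

# Totals reported in the text: 1643 correct predictions out of 1876 documents.
reported = 1643 / 1876
print(f"{reported:.4f}")  # 0.8758, which rounds to 88 percent
```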

B. Discussion of Results
The false positive (FP) values show that 239 documents were classified as positive when they were not, whereas 217 documents were falsely classified as negative (FN column). The true negative (TN) value of each class was obtained by summing all the cells of the confusion matrix outside that class's row and column. The TN values are almost equal to the total number of documents tested, which shows that the model correctly predicted the negative classes. The specificity of the individual classes ranges between ninety-eight (98) and one hundred (100) percent, while their precision, recall and f1-scores are above the threshold of 0.5, which indicates that the model's predictions are reliable.
The recall values show the proportion of actual positives that the model correctly identified as positive. The precision values, which are close to one (1), show the proportion of positive predictions that were actually correct. These results show that most of the data points the model labels as relevant actually are relevant, so its predictions are reliable. The line graph in Fig. 14 visualizes these observations. It shows at a glance the performance of each class on the evaluation metrics: precision, recall, f1-score and specificity.
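
The per-class quantities discussed above can be derived from a confusion matrix as in the Python sketch below. The function name `per_class_metrics` and the 3-class matrix are hypothetical and for illustration only; they are not the values of Table 1. TN is computed as the total minus TP, FP and FN, which equals the sum of all cells outside the class's row and column.

```python
from typing import Dict, List


def per_class_metrics(confusion: List[List[int]]) -> List[Dict[str, float]]:
    """Derive TP, FP, FN, TN and the usual ratios for each class.

    confusion[i][j] is the number of documents of actual class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(confusion[k]) - tp                       # actually k, predicted another class
        tn = total - tp - fp - fn                         # cells outside row k and column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        results.append({
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity,
        })
    return results


# Hypothetical 3-class matrix for illustration only.
example = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
metrics = per_class_metrics(example)
```

For the first class of this toy matrix the function gives TP = 8, FP = 2, FN = 2 and TN = 18, so precision and recall are both 0.8 and specificity is 0.9.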

V. CONCLUSION
This paper has discussed the document referential, classification and retention system and how organizations can take advantage of it to protect their documents and avoid litigation from concerned parties, fines from regulatory authorities, exposure to information theft and confidential documents reaching the wrong hands. Implementing this system will result in an organized document referential that is easily accessible to members of the organization, and will provide a platform for easy document classification and automatic monitoring and application of retention when a document reaches end-of-life.
The study has shown that retention can be applied to conventional documents and to records in multiple relational databases on two different RDBMS platforms based on the classification and retention schedule. It has also shown that a single system can create and manage a document referential, provide an easy way to place documents into their respective document classes and automatically apply retention to documents at end-of-life.