Document Retrieval System v.1.0

Background

Information retrieval takes various forms. A user's request for information can be a text query entered by typing on a keyboard, by selecting a query suggestion, or through voice recognition. The request can also take the form of an image, and in some cases it can be implicit.

Document retrieval can be seen as matching a user-stated query, ranging from a few words to a multi-sentence description, against a set of free-text records. These records can include unstructured text such as newspaper articles, real estate records, paragraphs in a manual, and financial documents and records held by banks, title insurance and mortgage banking firms, other financial institutions, and healthcare organizations. The goal of the query is to connect the requester with the necessary document or set of documents. Users need to access varied records and apply the information they contain to newer documents or historical data, which makes document search and retrieval a critical step. The user-stated query and the retrieved results may use the same modality (e.g., retrieving text documents in response to keyword-based queries) or different modalities (e.g., image search using text-based or keyword queries). If the query is not clear, the information retrieval system may use user history, physical location, temporal changes in information, or other context when suggesting results. Retrieving information can include ranking existing pieces of content, such as documents or short-text answers, or composing new responses that incorporate retrieved information.
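The matching and ranking step described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a plain TF-IDF weighting with cosine similarity; the function names (tokenize, rank_documents) and the sample records are hypothetical and are not taken from the system described in this article.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return (score, doc_index) pairs sorted from most to least relevant."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    def tfidf(tokens):
        counts = Counter(tokens)
        # Smoothed IDF, so terms found in every document keep a small weight.
        return {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = tfidf(tokenize(query))
    scored = [
        (cosine(query_vec, tfidf(tokens)), index)
        for index, tokens in enumerate(doc_tokens)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical free-text records used only for illustration.
records = [
    "Mortgage rates rose sharply this quarter according to the bank report.",
    "The hospital updated its patient records policy.",
    "Title insurance protects the buyer of a real estate property.",
]
ranking = rank_documents("mortgage rates bank", records)
```

Here the first record shares three terms with the query and is ranked first, while the other two share none and score zero. A production system would add stemming, stop-word handling, and an index rather than rescoring every record per query.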

Challenges/Problems

As we move to a digital paradigm, documentation is increasingly paperless, and the inability to search for and retrieve the right document is not only a challenge but also an impediment to efficiency. Another major challenge is summarizing the large documents and HTML pages retrieved by a particular search.

Also, a metadata-based search only lets users choose from simple keyword-based fields such as document title, document creation date, and document creator name. This is limited in scope, because the search results cannot capture the full context of a user-stated query. In short, if a user-stated query needs information about the context of the content in the documents, or details of the background of that content, metadata-based search and information retrieval is not a viable solution. For example, if a user-stated query involves historical information and needs the in-depth context of, say, a chart or graph seen in a financial proposal, a metadata-based search cannot provide the required information. It is also a huge task to turn an image embedded in a document into an easily retrievable natural-language search query or term using metadata alone, as indexing images meaningfully would take years of effort.
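The limitation can be made concrete with a small sketch. The Document fields, the two search functions, and the sample record below are all hypothetical and chosen only to mirror the metadata fields named above (title, creation date, creator name); they do not describe any real schema.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    creator: str
    created: str  # ISO date string, e.g. "2021-03-15"
    body: str


def metadata_search(docs, keyword):
    """Match the keyword against metadata fields only."""
    kw = keyword.lower()
    return [
        d for d in docs
        if kw in d.title.lower() or kw in d.creator.lower() or kw in d.created
    ]


def content_search(docs, keyword):
    """Match the keyword against the document body."""
    kw = keyword.lower()
    return [d for d in docs if kw in d.body.lower()]


# Hypothetical record: the relevant detail lives in the body, not the metadata.
docs = [
    Document(
        title="Q3 Financial Proposal",
        creator="A. Analyst",
        created="2021-03-15",
        body="The chart on page 4 shows a revenue decline since 2015.",
    )
]

by_metadata = metadata_search(docs, "revenue decline")
by_content = content_search(docs, "revenue decline")
```

The metadata search returns nothing for "revenue decline" because the phrase never appears in the title, creator, or date, while the content search finds the proposal. Even the content search relies on exact substring matching, which is why the semantic approach described in the Solution section is needed.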

Solution

xpresso.ai is an Auto-ML AI Ops platform for developing, deploying, and monitoring software projects in a high-availability environment. It provides development teams with tools and automated processes that encapsulate industry-standard best practices. Users can focus on data collection and preparation, model creation and testing, and inference generation through an intuitive, standardized process.

The journey begins with setting up the required infrastructure using out-of-the-box development environments. Time to market is reduced by development images installed on-premises or as a development VM within the xpresso.ai infrastructure, seamless project setup using Bitbucket, Jenkins and Docker (ensuring deployment without software compatibility issues), and deployment on a Kubernetes cluster (ensuring a highly available, high-performance and scalable environment at the click of a button).

To address the challenges with document retrieval, the first step was to introduce crawlers to extract data from internal documents and external websites such as Bloomberg, BlackRock and so on. xpresso.ai's Auto-ML AI Ops framework, which allows collecting data from sources as diverse as databases, file systems, SFTP sites and S3 buckets, was leveraged, and the details collected were analysed and added as exploratory variables using xpresso.ai libraries. The data attributes obtained were used for categorization and then for uni-variate, bi-variate and Bag of Words analysis of both structured and unstructured data sets through xpresso Exploratory Data Analysis (Data and Statistical Analysis). The xpresso Data Pipeline Management (Rapid Model Training and Experimentation) provided Kubeflow-enabled pipelines, support for declarative pipelines and multiple pipelines, and thus quicker training of models. xpresso.ai can read factors from a varied recommendation text connection and generate an output. This can be supplemented with additional data collected with the aid of xpresso.ai data versioning and connectivity libraries.
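As a rough illustration of the Bag of Words and uni-variate analysis mentioned above, the sketch below uses only the Python standard library. It does not use or imitate the xpresso.ai libraries, whose API is not described here; the function names and sample texts are hypothetical.

```python
import re
import statistics
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def univariate_summary(values):
    """Summarize a single numeric attribute."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


# Hypothetical crawled snippets used only for illustration.
snippets = [
    "Rates rose. Rates fell.",
    "Bond yields rose again this week.",
    "Equity markets closed flat.",
]

# Unstructured data: one bag of words per snippet.
bags = [bag_of_words(s) for s in snippets]

# Structured attribute derived from the text: number of words per snippet.
word_counts = [sum(bag.values()) for bag in bags]
summary = univariate_summary(word_counts)
```

The three snippets contain 4, 6 and 4 words, so the summary describes the distribution of that one attribute. A bi-variate analysis would relate two such attributes, for example word count against publication date.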

The models were built on deep neural networks (DNNs), specifically Convolutional Deep Structured Semantic Models (CDSSM). A combination of convolutional and recurrent neural networks (CNN + RNN) with vector space models (VSMs) formed the AI footprint that served as the basis of the DNN. From the variables obtained, models were created and versioned using the xpresso.ai Auto-ML AI Ops framework. The xpresso.ai data versioning and connectivity libraries, which use Pachyderm, kept data versions under control and stored them in an xpresso Data Model (XDM) enabled data store, allowing easy storage and retrieval of data sets and files in the internal XDM.
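The article does not say how text was encoded before it reached the network. Published descriptions of DSSM and CDSSM use a letter-trigram "word hashing" step, so the sketch below shows that step as an assumption about a typical CDSSM input layer, not as the implementation used here. The function names are hypothetical, and the convolutional and semantic layers of the model are not shown.

```python
from collections import Counter


def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary marks."""
    marked = f"#{word.lower()}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def trigram_vector(text):
    """Build a sparse letter-trigram count vector for a piece of text."""
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts


trigrams = letter_trigrams("good")
vector = trigram_vector("good food")
```

For the word "good" this yields the trigrams #go, goo, ood and od#. Because the trigram vocabulary is far smaller than a word vocabulary, misspelled or unseen words still map to mostly familiar features, which is one reason this encoding suits query and document matching.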

Finally, an effective document retrieval system was built over articles, blogs, and millions of internal and external documents (PDFs, business and financial news, reports, etc.). It queries the corpus and retrieves relevant paragraphs and sections from reference materials, enabling a productivity enhancement of over 90% for clients searching for relevant business content in an existing (and growing) database.
