Komprise: Metadata is the key to smarter AI and data governance


Chris Mellor





Better unstructured data management is the reason Komprise was founded in 2014 by CEO Kumar Goswami along with president and COO Krishna Subramanian and CTO Michael Peercy. At that time, file populations numbering in the millions were appearing in large enterprises and, since then, they have risen to the billions. Objects stored in buckets number in the trillions at the hyperscale public cloud providers. Komprise works part of its data management magic by using and enriching metadata about files and objects.

For example, media files can have added metadata to describe their contents. In the last few years, generative AI’s large language models have come to rely on vector embeddings to perform semantic search, and such vectors are generated from the content of unstructured data. Are vectors a kind of metadata? We explored these topics with Goswami in an interview.

Blocks & Files: I could argue that the tokens and vector embeddings generated from a data item are metadata. What do you think about this idea?


Kumar Goswami: Metadata and vector embeddings are related but complementary. Vector embeddings are a computer-understandable representation of file contents (“the what”), while metadata is valuable information about the file that can go well beyond file contents (“the why”), so you need both. Metadata is usually more concise than vector embeddings, and putting the entire file contents into metadata can be inefficient. Also, there could be data governance issues with running AI on all your data via embeddings.

For example, say you want a chatbot to answer questions based on the most recent product features, but you want it to use only public-facing documents and not confidential internal documents. You should use metadata to exclude internal documents and non-final versions, and run the vector embeddings and AI on just the right files. We are focusing on gathering and globally managing metadata to enrich, inform, and narrow down data, not to capture everything that can be gleaned from it.

We want to empower other tools and processes to consume and process the data as a whole. For example, you can enforce AI data governance and improve AI data quality by using Komprise to cull the files fed to Nvidia NeMo for embedding and running inferencing.
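The chatbot example above, where metadata decides which files are allowed to reach the embedding step, can be sketched in a few lines of Python. This is only an illustration: the `FileRecord` type, the `visibility` and `status` tags, and the file paths are hypothetical and do not represent Komprise's API or data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    path: str
    visibility: str  # hypothetical tag: "public" or "internal"
    status: str      # hypothetical tag: "final" or "draft"


def select_for_embedding(records):
    """Keep only public, final documents; nothing else is passed on to the embedder."""
    return [r for r in records if r.visibility == "public" and r.status == "final"]


# Hypothetical sample data
records = [
    FileRecord("docs/release-notes.pdf", "public", "final"),
    FileRecord("docs/roadmap-internal.docx", "internal", "final"),
    FileRecord("docs/release-notes-draft.pdf", "public", "draft"),
]

selected = select_for_embedding(records)
print([r.path for r in selected])
```

Only the records that survive this filter would then be handed to whatever embedding or inferencing tool is in use, so confidential and non-final files are never vectorized in the first place.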


Blocks & Files: Komprise says new tools can automatically analyze file contents and generate semantic tags at scale. What are semantic tags? Are they metadata generated from a file’s contents? If so, then how do these semantic tags differ from vector embeddings?

Kumar Goswami: Vector embeddings are used to help AI understand the meanings of words in context while metadata provides semantic context for which files are relevant. For example, vector embeddings may help AI understand that the word “award” in the context of a research grant paper means getting a funding award and not winning a trophy. Metadata can be used to cull and curate all the documents related to a specific research topic by a specific researcher in a specific time frame to send to an AI agent that is helping write a grant application. You can argue that both are semantic contexts, but for different purposes, and metadata is broader than what is in the file itself.
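As a rough sketch of the grant-application example above, curating by topic, researcher, and time frame amounts to a multi-field metadata query. The catalog entries, field names, and researcher IDs below are invented for illustration and are not taken from any Komprise schema.

```python
from datetime import date

# Hypothetical metadata catalog: one entry per document
catalog = [
    {"path": "grants/2023-genomics-proposal.docx", "topic": "genomics",
     "researcher": "a.rivera", "modified": date(2023, 5, 2)},
    {"path": "grants/2019-genomics-report.pdf", "topic": "genomics",
     "researcher": "a.rivera", "modified": date(2019, 11, 20)},
    {"path": "grants/2023-optics-proposal.docx", "topic": "optics",
     "researcher": "a.rivera", "modified": date(2023, 6, 14)},
    {"path": "grants/2023-genomics-budget.xlsx", "topic": "genomics",
     "researcher": "j.chen", "modified": date(2023, 7, 1)},
]


def curate(entries, topic, researcher, start, end):
    """Return paths whose metadata matches the topic, researcher, and date range."""
    return [
        e["path"]
        for e in entries
        if e["topic"] == topic
        and e["researcher"] == researcher
        and start <= e["modified"] <= end
    ]


curated = curate(catalog, "genomics", "a.rivera", date(2022, 1, 1), date(2023, 12, 31))
print(curated)
```

Note that none of these criteria depend on the words inside the documents, which is the sense in which metadata is broader than the file contents.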

Blocks & Files: What tools exist that automate finding and analyzing metadata?

Kumar Goswami: You need to not only index metadata across different storage and cloud environments but also act on it at scale. Komprise does both as our analysis extracts both system metadata and extended metadata such as sensitive data information into a global file index. This index retains the knowledge no matter where your data lives, and it does so without changing the original files. Komprise Deep Analytics helps you query and filter data based on this index and Komprise Smart Data Workflows allows you to search and feed the right data to the right AI process and retain its outputs as additional metadata.
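To make the idea of an index that lives apart from the files more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, tag names, and storage locations are assumptions made for this example; the interview does not describe how Komprise actually stores its global file index.

```python
import sqlite3

# In-memory database standing in for a global index; the files themselves are never touched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, location TEXT, size_bytes INTEGER)")
conn.execute("CREATE TABLE tags (path TEXT, key TEXT, value TEXT, PRIMARY KEY (path, key))")

# System metadata (hypothetical values)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("hr/payroll.csv", "nas-01", 20480),
        ("eng/design.pdf", "s3-archive", 1048576),
    ],
)

# Extended metadata, such as a sensitive-data flag (hypothetical tag names)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [
        ("hr/payroll.csv", "contains_pii", "true"),
        ("eng/design.pdf", "contains_pii", "false"),
    ],
)


def find_by_tag(db, key, value):
    """Return (path, location) pairs for files carrying the given tag value."""
    return db.execute(
        "SELECT f.path, f.location FROM files f "
        "JOIN tags t ON t.path = f.path "
        "WHERE t.key = ? AND t.value = ? ORDER BY f.path",
        (key, value),
    ).fetchall()


print(find_by_tag(conn, "contains_pii", "true"))
```

Because tags are keyed by path in a separate table, a file can move between storage systems by updating its `location` row while its extended metadata stays attached.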


That’s the neat thing about metadata and AI: it is not a one-and-done process like traditional ETL. Instead, you need an ongoing workflow solution to find the right data, get it to the right compute, run the compute either locally or in the cloud, and then repeat the process. Our customers have indexed and mobilized over an exabyte of data using Komprise. You can use any AI or vector embedding or processor to enrich metadata further on your data in Komprise workflows. A great example of this is our customer Duquesne University.

Blocks & Files: What AI tools are now available to extract pertinent information hidden in files and turn it into useful metadata that adds structure and context? How is the synthesis carried out?

Kumar Goswami: Anything that looks at file contents and generates outputs can be used via APIs in Komprise to enrich metadata. You can use cloud-based services like Azure AI Speech to inspect audio or Salesforce Einstein to find particular purchase orders in your CRM, and then have Komprise tag the files. That is the beauty of iterative workflows. You can use any process or tool to distill relevant metadata once you have a systematic way to manage the workflow.
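The iterative pattern described here (select files by existing metadata, run some analyzer on them, store the output as new tags, repeat) might look roughly like the following. Everything in it is a stand-in: `fake_speech_analyzer` is a hypothetical placeholder for an external service call, and the in-memory dictionary is a toy substitute for a real metadata index.

```python
from typing import Callable, Dict, List

Index = Dict[str, Dict[str, str]]         # path -> tags
Analyzer = Callable[[str], Dict[str, str]]


def fake_speech_analyzer(path: str) -> Dict[str, str]:
    """Hypothetical stand-in for an external audio-analysis service."""
    return {"language": "en"} if path.endswith(".wav") else {}


def enrich(index: Index, select: Callable[[str, Dict[str, str]], bool],
           analyzer: Analyzer) -> List[str]:
    """Run the analyzer on selected files and merge its output into their tags."""
    touched = []
    for path, tags in index.items():
        if select(path, tags):
            new_tags = analyzer(path)
            if new_tags:
                tags.update(new_tags)
                touched.append(path)
    return touched


# Hypothetical starting index containing only basic metadata
index: Index = {
    "calls/support-call.wav": {"type": "audio"},
    "docs/manual.pdf": {"type": "document"},
}


def needs_language(path: str, tags: Dict[str, str]) -> bool:
    """Select audio files that have not been tagged with a language yet."""
    return tags.get("type") == "audio" and "language" not in tags


first_pass = enrich(index, needs_language, fake_speech_analyzer)
second_pass = enrich(index, needs_language, fake_speech_analyzer)
print(first_pass, second_pass)
```

The second pass finds nothing to do because the first pass already recorded its output as metadata, which is what lets the workflow run repeatedly as new files arrive without reprocessing old ones.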

Blocks & Files: I understand Komprise thinks that automatic metadata from storage systems, while useful for basic operations, is just the start of a strategic metadata management program. The real business value comes from enriching this foundation with metadata that precisely defines data so it can be easily searched and moved as needed to AI tools or other locations. What metadata enriches the automatic metadata from storage systems? How is it generated? How is it stored and indexed?

Kumar Goswami:
