Document Processing

UsageGuard provides advanced document processing capabilities that enable you to easily integrate AI-powered Retrieval Augmented Generation (RAG) into your applications.

Overview

UsageGuard's document processing pipeline provides a comprehensive solution for handling documents from upload to AI processing.

Upload & Processing

  • Secure cloud storage with presigned URLs
  • PDF, Office, TXT, CSV, and RTF support
  • Automatic OCR
  • Multi-language support

Content Processing

  • Whitespace & encoding normalization
  • Semantic chunking
  • Structure preservation
  • Language detection

AI Processing

  • Multilingual embeddings
  • Vector search
  • Content classification
  • Custom models

Security & Compliance

  • End-to-end encryption
  • Token-based access
  • Audit logging
  • Privacy compliance

Document Processing Capabilities

Format Detection

  • File format detection
  • Integrity validation
  • Content verification

Text Extraction

  • Format-specific methods
  • Fallback mechanisms
  • Error handling

Content Optimization

  • Boilerplate removal
  • Header/footer handling
  • Structure preservation

Chunking Strategy

  • Semantic boundaries
  • Context preservation
  • Special content handling
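
To make the chunking strategy concrete, here is a simplified Python sketch. It is illustrative only, not UsageGuard's actual algorithm: it splits text at paragraph boundaries (a crude stand-in for semantic boundaries) and carries the tail of each chunk forward as overlap to preserve context.

def chunk_text(text, max_chars=1000, overlap_paragraphs=1):
    # Simplified illustration of semantic chunking, not UsageGuard's
    # actual implementation: split on paragraph boundaries and keep a
    # small paragraph overlap between consecutive chunks for context.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:]  # carry context forward
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks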

Processing Status

  • queued: Document is waiting to be processed
  • indexing: Text extraction and embedding generation are in progress
  • completed: Processing finished successfully
  • failed: Processing failed (check error details)
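
In client code these states map naturally to a polling loop. A minimal sketch, assuming a hypothetical get_document_status(document_id) helper that wraps the Document APIs and returns one of the four status strings above:

import time

def wait_for_processing(document_id, poll_seconds=5, timeout_seconds=600):
    # Poll until the document reaches a terminal state.
    # get_document_status() is a hypothetical helper, not a documented call.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_document_status(document_id)
        if status == "completed":
            return
        if status == "failed":
            raise RuntimeError(f"Processing failed for {document_id}; check error details")
        time.sleep(poll_seconds)  # "queued" or "indexing": still in progress
    raise TimeoutError(f"Document {document_id} did not finish in {timeout_seconds}s")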

Integration

To integrate document processing into your application, use our Document APIs. These APIs handle all the complexity of document processing while providing a simple interface for your application.
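
As a rough illustration of the upload flow, the sketch below requests a presigned URL, uploads the file to it, and returns the new document's id. The base URL, endpoint paths, and field names are assumptions for illustration; consult the Document APIs reference for the actual routes.

import requests

# Illustrative only: base URL, endpoints, and field names are assumptions.
API_BASE = "https://api.example.com/v1"
API_KEY = "your-api-key"

def upload_document(path):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # 1. Register the document and obtain a presigned upload URL.
    resp = requests.post(f"{API_BASE}/documents", json={"filename": path}, headers=headers)
    resp.raise_for_status()
    doc = resp.json()
    # 2. Upload the file bytes directly to secure storage.
    with open(path, "rb") as f:
        requests.put(doc["uploadUrl"], data=f).raise_for_status()
    # 3. Processing starts automatically; track progress with the document id.
    return doc["documentId"]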

Advanced Document Processing

JSONL Format Support

UsageGuard supports JSONL (JSON Lines) format for advanced document processing, allowing you to index and retrieve metadata along with document chunks for enhanced context enrichment. Each line in a JSONL file represents a separate JSON object containing both the text content and associated metadata.

Schema Definition (UG 3.2.2 or later)

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "text": {
      "type": "string",
      "description": "Main textual content."
    },
    "metadata": {
      "type": "object",
      "description": "Flexible metadata containing various key-value pairs.",
      "additionalProperties": {
        "oneOf": [
          { "type": "string" },
          { "type": "number" },
          { "type": "integer" },
          { "type": "boolean" },
          { "type": "array", "items": { "type": "string" } },
          { "type": "object" }
        ]
      }
    }
  },
  "required": ["text", "metadata"]
}
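
You can check each line of a file against this schema locally before uploading. A minimal sketch using Python with the third-party jsonschema package; the schema here is abridged to its required parts:

import json
from jsonschema import ValidationError, validate

# Abridged version of the schema above: text and metadata are required.
SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "metadata": {"type": "object"},
    },
    "required": ["text", "metadata"],
}

def validate_jsonl(path):
    # Report every line that is not valid JSON or violates the schema.
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                validate(json.loads(line), SCHEMA)
            except (json.JSONDecodeError, ValidationError) as err:
                print(f"line {lineno}: {err}")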

Example

Here is an example of how to structure your JSONL data following the schema. The object is shown pretty-printed here for readability; in an actual JSONL file, each object must occupy a single line.

{
  "text": "Attention Is All You Need We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less training time. 
  The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. 
  The best performing models also connect the encoder and decoder through an attention mechanism. We propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. 
  The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after training for as little as twelve hours on eight P100 GPUs. 
  The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. 
  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. 
  We trained the base model for 100,000 steps or 12 hours on 8 P100 GPUs. For the big model, we trained for 300,000 steps (3.5 days) on 8 P100 GPUs. The Transformer achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the best previously reported models by more than 2.0 BLEU points.",
  "metadata": {
    "type": "research_paper",
    "paper_id": "TRANSFORMER-2017",
    "title": "Attention Is All You Need",
    "authors": ["Vaswani, A.", "Shazeer, N.", "Parmar, N.", "Uszkoreit, J.", "Jones, L.", "Gomez, A. N.", "Kaiser, L.", "Polosukhin, I."],
    "year": 2017,
    "venue": "NeurIPS",
    "citations": 45000,
    "sections": ["title", "abstract", "introduction", "methodology", "results"],
    "key_concepts": ["Transformer", "attention mechanism", "machine translation", "neural networks"],
    "metrics": {
      "training_time": "12 hours",
      "gpu_count": 8,
      "bleu_score": 28.4,
      "improvement": "2.0 BLEU points"
    },
    "datasets": ["WMT 2014"],
    "tasks": ["English-to-German translation"]
  }
}

Here's how you can include multiple objects in one JSONL file, with one object per line:

{"text": "Installing via Docker requires version 20.10 or higher", "metadata": {"type": "header", "chapter": 1, "tags": ["ML", "introduction"], "importance": "high"}}
{"text": "Machine learning is a subset of artificial intelligence...", "metadata": {"type": "content", "chapter": 1, "section": 1.1, "references": ["Mitchell 1997", "Bishop 2006"]}}
{"text": "Key Concepts in ML", "metadata": {"type": "subheader", "chapter": 1, "section": 1.2, "concepts": ["supervised learning", "unsupervised learning"]}}

How UsageGuard Processes JSONL

UsageGuard processes JSONL documents by splitting them into chunks for vectorization and storage. The metadata associated with each document is preserved throughout this process, allowing for rich context and filtering capabilities.

Key aspects of JSONL processing:

  • Each document is split into optimized chunks while maintaining associated metadata
  • Domain objects can be directly exported to JSONL format (see the export sketch below)
  • Text content is optimized for search while preserving metadata relationships
  • The system enables intelligent query interpretation and metadata-based filtering

This approach provides flexibility in how you structure and process your documents, while ensuring that important contextual information remains intact throughout the processing pipeline.
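
For example, exporting domain objects to JSONL can be as simple as mapping each object to the text-plus-metadata shape the schema expects. A minimal sketch with a hypothetical DocSection type (your own domain objects will differ):

import json
from dataclasses import dataclass

@dataclass
class DocSection:
    # Hypothetical domain object used only for illustration.
    text: str
    chapter: int
    section_type: str

def export_jsonl(sections, path):
    # Map each domain object to the {"text": ..., "metadata": ...} schema.
    with open(path, "w", encoding="utf-8") as f:
        for s in sections:
            record = {"text": s.text,
                      "metadata": {"chapter": s.chapter, "type": s.section_type}}
            f.write(json.dumps(record) + "\n")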

Use Cases

1. Technical Documentation

  • Index documentation with version metadata
  • Track document sections and hierarchies
  • Link related content across documents
{
  "text": "Installing via Docker requires version 20.10 or higher",
  "metadata": {
    "version": "2.1.0",
    "category": "installation",
    "platform": "docker",
    "requirements": ["docker-20.10+"],
    "lastUpdated": "2024-03-15"
  }
}
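
With metadata like this indexed alongside the text, retrieval can be narrowed before semantic search runs. The sketch below is illustrative only; the endpoint and filter syntax are assumptions, not the documented API:

import requests

# Illustrative only: the search endpoint and filter syntax are assumptions.
resp = requests.post(
    "https://api.example.com/v1/search",
    headers={"Authorization": "Bearer your-api-key"},
    json={
        "query": "install with docker",
        "filter": {"platform": "docker", "category": "installation"},
    },
)
resp.raise_for_status()
for hit in resp.json().get("results", []):
    print(hit["text"], hit["metadata"].get("version"))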

2. Legal Document Processing

  • Track document sections and clauses
  • Maintain reference information
  • Include jurisdiction and applicability data
{
  "text": "This agreement shall be governed by the laws of the State of California",
  "metadata": {
    "documentType": "contract",
    "section": "governing_law",
    "jurisdiction": "California",
    "effectiveDate": "2024-01-01",
    "clauseId": "GL-001"
  }
}

3. Research Paper Analysis

  • Include citation information
  • Track methodology and results
  • Link to related papers and datasets
{
  "text": "The experiment showed a 15% improvement in accuracy",
  "metadata": {
    "paperID": "RP2024-123",
    "section": "results",
    "methodology": "A/B testing",
    "metrics": ["accuracy", "precision"],
    "dataset": "MNIST",
    "citations": ["Smith et al. 2023"]
  }
}
