Document Processing
UsageGuard provides advanced document processing capabilities that make it easy to integrate AI-powered Retrieval-Augmented Generation (RAG) into your applications.
Overview
UsageGuard's document processing pipeline provides a comprehensive solution for handling documents from upload to AI processing.
| Upload & Processing | Content Processing | AI Processing | Security & Compliance |
|---|---|---|---|
| Secure cloud storage with presigned URLs. PDF, Office, TXT, CSV, RTF. Automatic OCR. Multi-language support. | Whitespace & encoding normalization. Semantic chunking. Structure preservation. Language detection. | Multilingual embeddings. Vector search. Content classification. Custom models. | End-to-end encryption. Token-based access. Audit logging. Privacy compliance. |
Document Processing Capabilities
| Format Detection | Text Extraction | Content Optimization | Chunking Strategy |
|---|---|---|---|
| File format detection. Integrity validation. Content verification. | Format-specific methods. Fallback mechanisms. Error handling. | Boilerplate removal. Header/footer handling. Structure preservation. | Semantic boundaries. Context preservation. Special content handling. |
Processing Status
| Status | Description |
|---|---|
| queued | Document is waiting to be processed. |
| indexing | Text extraction and embedding generation are in progress. |
| completed | Processing finished successfully. |
| failed | Processing failed (check error details). |
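The status values above lend themselves to a simple polling loop. Here is a minimal sketch; get_status stands in for whatever call your application uses to fetch a document's status, and is an illustrative placeholder rather than part of the UsageGuard SDK:

```python
import time

# Terminal states: polling can stop once one of these is reached.
TERMINAL_STATUSES = {"completed", "failed"}


def is_terminal(status: str) -> bool:
    """Return True once processing has finished, successfully or not."""
    return status in TERMINAL_STATUSES


def wait_for_processing(get_status, document_id: str,
                        poll_interval: float = 2.0,
                        timeout: float = 300.0) -> str:
    """Poll get_status(document_id) until a terminal status or timeout.

    get_status is any callable returning "queued", "indexing",
    "completed", or "failed".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(document_id)
        if is_terminal(status):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"document {document_id} still processing after {timeout}s")
```

A fixed poll interval is shown for brevity; in production you would typically add exponential backoff.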
Integration
To integrate document processing into your application, use our Document APIs. These APIs handle all the complexity of document processing while providing a simple interface for your application.
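As an illustration of what a Document API call might look like, here is a sketch of assembling an upload request. The endpoint URL, header names, and body fields below are assumptions for illustration only; consult the Document API reference for the actual contract:

```python
import json


def build_upload_request(api_key: str, file_name: str, content_type: str) -> dict:
    """Assemble a hypothetical upload-initiation request.

    The URL and field names here are illustrative placeholders,
    not the real Document API contract.
    """
    return {
        "url": "https://api.usageguard.example/v1/documents",  # placeholder host
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "fileName": file_name,
            "contentType": content_type,
        }),
    }
```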
Advanced Document Processing
JSONL Format Support
UsageGuard supports JSONL (JSON Lines) format for advanced document processing, allowing you to index and retrieve metadata along with document chunks for enhanced context enrichment. Each line in a JSONL file represents a separate JSON object containing both the text content and associated metadata.
Schema Definition (UG 3.2.2 or later)
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "text": {
      "type": "string",
      "description": "Main textual content."
    },
    "metadata": {
      "type": "object",
      "description": "Flexible metadata containing various key-value pairs.",
      "additionalProperties": {
        "oneOf": [
          { "type": "string" },
          { "type": "number" },
          { "type": "integer" },
          { "type": "boolean" },
          { "type": "array", "items": { "type": "string" } },
          { "type": "object" }
        ]
      }
    }
  },
  "required": ["text", "metadata"]
}
```
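The schema's two hard requirements, a string "text" field and an object "metadata" field, can be checked without a full JSON Schema library. A minimal stdlib-only sketch (jsonl_record_is_valid is an illustrative helper, not part of UsageGuard):

```python
import json


def jsonl_record_is_valid(line: str) -> bool:
    """Check one JSONL line against the schema's required shape:
    a string "text" field and an object "metadata" field."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(record, dict)
        and isinstance(record.get("text"), str)
        and isinstance(record.get("metadata"), dict)
    )
```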
Example
Here are some examples of how to structure your JSONL data following the schema:
```json
{
  "text": "Attention Is All You Need We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less training time.
  The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
  The best performing models also connect the encoder and decoder through an attention mechanism. We propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
  The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after training for as little as twelve hours on eight P100 GPUs.
  The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
  We trained the base model for 100,000 steps or 12 hours on 8 P100 GPUs. For the big model, we trained for 300,000 steps (3.5 days) on 8 P100 GPUs. The Transformer achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the best previously reported models by more than 2.0 BLEU points.",
  "metadata": {
    "type": "research_paper",
    "paper_id": "TRANSFORMER-2017",
    "title": "Attention Is All You Need",
    "authors": ["Vaswani, A.", "Shazeer, N.", "Parmar, N.", "Uszkoreit, J.", "Jones, L.", "Gomez, A. N.", "Kaiser, L.", "Polosukhin, I."],
    "year": 2017,
    "venue": "NeurIPS",
    "citations": 45000,
    "sections": ["title", "abstract", "introduction", "methodology", "results"],
    "key_concepts": ["Transformer", "attention mechanism", "machine translation", "neural networks"],
    "metrics": {
      "training_time": "12 hours",
      "gpu_count": 8,
      "bleu_score": 28.4,
      "improvement": "2.0 BLEU points"
    },
    "datasets": ["WMT 2014"],
    "tasks": ["English-to-German translation"]
  }
}
```
Here's how to include multiple objects in one JSONL file, with one JSON object per line:
```jsonl
{"text": "Introduction to Machine Learning", "metadata": {"type": "header", "chapter": 1, "tags": ["ML", "introduction"], "importance": "high"}}
{"text": "Machine learning is a subset of artificial intelligence...", "metadata": {"type": "content", "chapter": 1, "section": "1.1", "references": ["Mitchell 1997", "Bishop 2006"]}}
{"text": "Key Concepts in ML", "metadata": {"type": "subheader", "chapter": 1, "section": "1.2", "concepts": ["supervised learning", "unsupervised learning"]}}
```
How UsageGuard Processes JSONL
UsageGuard processes JSONL documents by splitting them into chunks for vectorization and storage. The metadata associated with each document is preserved throughout this process, allowing for rich context and filtering capabilities.
Key aspects of JSONL processing:
- Each document is split into optimized chunks while maintaining associated metadata
- Domain objects can be directly exported to JSONL format
- Text content is optimized for search while preserving metadata relationships
- The system enables intelligent query interpretation and metadata-based filtering
This approach provides flexibility in how you structure and process your documents, while ensuring that important contextual information remains intact throughout the processing pipeline.
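The chunk-with-metadata idea can be sketched as follows. The split here is a naive word-count boundary standing in for UsageGuard's semantic chunking; the point is that every chunk carries a full copy of its source record's metadata:

```python
def chunk_record(record: dict, max_words: int = 100) -> list[dict]:
    """Split a record's text into word-bounded chunks, copying the
    record's metadata onto every chunk so filtering keeps working."""
    words = record["text"].split()
    chunks = []
    for start in range(0, len(words), max_words):
        chunks.append({
            "text": " ".join(words[start:start + max_words]),
            "metadata": dict(record["metadata"]),  # each chunk keeps the full metadata
        })
    return chunks
```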
When using JSONL format, ensure that each line is a single valid JSON object that follows the schema; newlines inside the "text" field must be escaped as \n rather than written literally. Invalid lines will be rejected during processing.
Use Cases
1. Technical Documentation
- Index documentation with version metadata
- Track document sections and hierarchies
- Link related content across documents
```json
{
  "text": "Installing via Docker requires version 20.10 or higher",
  "metadata": {
    "version": "2.1.0",
    "category": "installation",
    "platform": "docker",
    "requirements": ["docker-20.10+"],
    "lastUpdated": "2024-03-15"
  }
}
```
2. Legal Document Processing
- Track document sections and clauses
- Maintain reference information
- Include jurisdiction and applicability data
```json
{
  "text": "This agreement shall be governed by the laws of the State of California",
  "metadata": {
    "documentType": "contract",
    "section": "governing_law",
    "jurisdiction": "California",
    "effectiveDate": "2024-01-01",
    "clauseId": "GL-001"
  }
}
```
3. Research Paper Analysis
- Include citation information
- Track methodology and results
- Link to related papers and datasets
```json
{
  "text": "The experiment showed a 15% improvement in accuracy",
  "metadata": {
    "paperID": "RP2024-123",
    "section": "results",
    "methodology": "A/B testing",
    "metrics": ["accuracy", "precision"],
    "dataset": "MNIST",
    "citations": ["Smith et al. 2023"]
  }
}
```