Complete Guide to Document Information API: Extract File Metadata & Properties
Ever wondered how document viewers instantly know how many pages a file has, or how they enable text selection so smoothly? The secret lies in extracting document information before rendering. This comprehensive guide shows you exactly how to retrieve detailed file metadata, page properties, and text coordinates using GroupDocs.Viewer Cloud API.
Whether you’re building a document management system, creating a custom viewer, or need to analyze file properties at scale, this tutorial covers everything you need to know about document information extraction.
What You’ll Master in This Tutorial
By the end of this guide, you’ll confidently know how to:
- Extract essential document metadata (format, page count, dimensions) in minutes
- Retrieve text coordinates for implementing search and selection features
- Access embedded file information for complete document analysis
- Handle different document types with format-specific properties
- Implement document information retrieval in real applications
- Troubleshoot common issues and optimize performance
Why Document Information Extraction Is a Game-Changer
Before diving into code, let’s understand why retrieving document information matters for your applications and business workflows.
Essential for Smart Document Processing
Document information extraction isn’t just a nice-to-have feature—it’s the foundation of intelligent document processing. Here’s why it’s crucial:
Planning Efficient Operations: Knowing a document’s page count and dimensions upfront lets you allocate resources properly. Instead of blindly processing a 500-page document, you can plan memory usage, set appropriate timeouts, and optimize rendering strategies.
Enabling Advanced Features: Text coordinates make sophisticated features possible. Want to implement text search? You need character positions. Planning text selection or annotation tools? Coordinate data is essential.
Handling Complex Documents: Modern documents often contain attachments, embedded files, or special formatting. Document information extraction reveals these hidden elements, ensuring nothing gets missed in your processing pipeline.
Improving User Experience: Users expect immediate feedback. Showing “Document: 15 pages, PDF format” instantly builds confidence, while “Processing…” leaves them wondering if anything’s happening.
Real-World Business Applications
Here are practical scenarios where document information extraction proves invaluable:
- Legal Document Review: Quickly assess document size and complexity before assigning review time
- Archive Management: Catalog thousands of files efficiently by extracting metadata in batch
- Content Management Systems: Auto-tag documents based on format, page count, and embedded content
- Document Conversion Services: Pre-validate files and estimate processing time based on document properties
- Compliance Workflows: Identify documents with attachments that may require special handling
Prerequisites: What You Need to Get Started
Before we dive into the technical implementation, make sure you have these essentials ready:
Required Accounts and Credentials
- A GroupDocs.Viewer Cloud account (grab your free trial here if you don’t have one)
- Your Client ID and Client Secret (found in your dashboard)
- Basic understanding of REST APIs (don’t worry, we’ll walk through everything)
Technical Requirements
- Familiarity with your programming language of choice (we support C#, Java, Python, PHP, Ruby, Node.js, and Go)
- A sample document for testing (we’ll use a DOCX file, but any supported format works)
- HTTP client or SDK (we’ll show both approaches)
Optional but Helpful
- Understanding of JSON structure (responses contain nested data)
- Basic knowledge of document formats (helps interpret format-specific information)
Step 1: Upload Your Document to Cloud Storage
Every document information extraction starts with having your file accessible in the cloud. Here’s how to upload your test document:
# First, get your JSON Web Token for authentication
curl -v "https://api.groupdocs.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
# Store the JWT token for reuse (you'll need this for all subsequent requests)
JWT="YOUR_JWT_TOKEN"
# Upload your document to cloud storage
curl -v "https://api.groupdocs.cloud/v2.0/viewer/storage/file/SampleFiles/sample.docx" \
-X PUT \
-H "Content-Type: multipart/form-data" \
-H "Accept: application/json" \
-H "Authorization: Bearer $JWT" \
--data-binary "@/path/to/your/sample.docx"
Pro Tip: Organize your files in folders (like SampleFiles/
) from the start. This makes file management much easier as your application grows.
Step 2: Extract Basic Document Information
Now for the exciting part—let’s retrieve comprehensive information about your document. This single API call reveals everything you need to know about the file structure:
curl -v "https://api.groupdocs.cloud/v2.0/viewer/info" \
-X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer $JWT" \
-d "{
'FileInfo': {
'FilePath': 'SampleFiles/sample.docx'
},
'ViewFormat': 'HTML'
}"
Decoding the Response: What Each Field Means
The API returns a wealth of information in a structured format. Here’s what you get:
{
"formatExtension": ".docx",
"format": "Microsoft Word Open XML Document",
"pages": [
{
"number": 1,
"width": 595,
"height": 841,
"visible": true,
"lines": []
},
{
"number": 2,
"width": 595,
"height": 841,
"visible": true,
"lines": []
},
{
"number": 3,
"width": 595,
"height": 841,
"visible": true,
"lines": []
}
],
"attachments": [],
"archiveViewInfo": null,
"cadViewInfo": null,
"projectManagementViewInfo": null,
"outlookViewInfo": null,
"pdfViewInfo": null
}
Understanding Key Response Elements:
- Format Information:
formatExtension
andformat
tell you exactly what type of document you’re working with—crucial for handling format-specific features - Page Details: Each page object contains dimensions in pixels and visibility status, perfect for planning your viewer layout
- Attachments Array: Lists any embedded files (emails, archives, etc.)—empty in this example but essential for complex documents
- Format-Specific Views: Additional properties appear for specialized formats (PDFs, CAD files, project management documents)
Step 3: Unlock Text Coordinates for Advanced Features
If you’re planning to implement text search, selection, or annotation features, you’ll need detailed text coordinate information. Here’s how to extract it:
curl -v "https://api.groupdocs.cloud/v2.0/viewer/info" \
-X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer $JWT" \
-d "{
'FileInfo': {
'FilePath': 'SampleFiles/sample.docx'
},
'ViewFormat': 'PNG',
'RenderOptions': {
'ExtractText': true
}
}"
Important Note: Text extraction requires setting the ViewFormat
to either PNG
or JPG
. This isn’t a limitation—it’s because text coordinates are most useful when overlaying interactive elements on image-based views.
Understanding Text Coordinate Hierarchy
With text extraction enabled, you get a detailed hierarchical structure:
{
"formatExtension": ".docx",
"format": "Microsoft Word Open XML Document",
"pages": [
{
"number": 1,
"width": 595,
"height": 841,
"visible": true,
"lines": [
{
"words": [
{
"characters": [
{
"x": 229.607,
"y": 67.8,
"width": 10.674,
"height": 19.8,
"value": "T"
}
],
"x": 229.607,
"y": 67.8,
"width": 39.721,
"height": 19.8,
"value": "This"
}
],
"x": 229.607,
"y": 67.8,
"width": 136.387,
"height": 19.8,
"value": "This is a sample"
}
]
}
]
}
Text Coordinate Structure Explained:
- Lines: Represent horizontal lines of text with complete position and content
- Words: Individual words within each line, with precise boundaries
- Characters: Each character’s exact position and dimensions—perfect for fine-grained text selection
This hierarchical approach gives you flexibility: use line coordinates for simple text selection, word coordinates for search highlighting, or character coordinates for precise cursor positioning.
Step 4: Handling Documents with Attachments
Many business documents contain embedded files—especially emails, archives, and complex reports. Here’s how to identify and work with these attachments:
curl -v "https://api.groupdocs.cloud/v2.0/viewer/info" \
-X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer $JWT" \
-d "{
'FileInfo': {
'FilePath': 'SampleFiles/with_attachments.msg'
},
'ViewFormat': 'HTML'
}"
Working with Attachment Information
For documents containing embedded files, the response includes detailed attachment data:
{
"formatExtension": ".msg",
"format": "Microsoft Outlook Message",
"pages": [
{
"number": 1,
"width": 612,
"height": 792,
"visible": true,
"lines": []
}
],
"attachments": [
{
"name": "attachment-image.png",
"filePath": null
},
{
"name": "attachment-word.doc",
"filePath": null
}
],
"archiveViewInfo": null,
"cadViewInfo": null,
"projectManagementViewInfo": null,
"outlookViewInfo": {
"folders": []
},
"pdfViewInfo": null
}
Practical Applications for Attachment Information:
- Compliance Workflows: Identify emails with attachments that require special handling
- Archive Processing: Enumerate all embedded files before extraction
- Security Scanning: Flag documents with multiple attachments for additional review
- User Interface: Display attachment counts and types to users before opening
Step 5: Real-World Implementation Examples
Let’s put this knowledge into practice with complete, working examples in popular programming languages.
C# Implementation: Complete Document Analysis
using GroupDocs.Viewer.Cloud.Sdk.Api;
using GroupDocs.Viewer.Cloud.Sdk.Client;
using GroupDocs.Viewer.Cloud.Sdk.Model;
using GroupDocs.Viewer.Cloud.Sdk.Model.Requests;
using System;
using System.Collections.Generic;
using System.IO;
namespace GroupDocs.Viewer.Cloud.Tutorial
{
class Program
{
static void Main(string[] args)
{
// Get your client ID and client secret from https://dashboard.groupdocs.cloud/
string MyClientId = "YOUR_CLIENT_ID";
string MyClientSecret = "YOUR_CLIENT_SECRET";
// Create API instance
var configuration = new Configuration(MyClientId, MyClientSecret);
var apiInstance = new InfoApi(configuration);
// Define view options
var viewOptions = new ViewOptions
{
FileInfo = new FileInfo
{
FilePath = "SampleFiles/sample.docx"
},
ViewFormat = ViewOptions.ViewFormatEnum.HTML
};
try
{
// Call the API to get document information
var response = apiInstance.GetInfo(new GetInfoRequest(viewOptions));
Console.WriteLine("Document information retrieved successfully!");
// Display basic document information
Console.WriteLine($"Format: {response.Format} ({response.FormatExtension})");
Console.WriteLine($"Total pages: {response.Pages.Count}");
// Display information about each page
Console.WriteLine("\nPage information:");
foreach (var page in response.Pages)
{
Console.WriteLine($" Page {page.Number}: {page.Width}x{page.Height} pixels, Visible: {page.Visible}");
}
// Display attachment information if any
if (response.Attachments != null && response.Attachments.Count > 0)
{
Console.WriteLine("\nAttachments:");
foreach (var attachment in response.Attachments)
{
Console.WriteLine($" {attachment.Name}");
}
}
else
{
Console.WriteLine("\nNo attachments found.");
}
// Display format-specific information if available
if (response.PdfViewInfo != null)
{
Console.WriteLine("\nPDF-specific information available.");
}
if (response.ArchiveViewInfo != null)
{
Console.WriteLine("\nArchive-specific information available.");
}
if (response.CadViewInfo != null)
{
Console.WriteLine("\nCAD-specific information available.");
}
if (response.ProjectManagementViewInfo != null)
{
Console.WriteLine("\nProject Management-specific information available.");
}
if (response.OutlookViewInfo != null)
{
Console.WriteLine("\nOutlook-specific information available.");
}
}
catch (Exception e)
{
Console.WriteLine("Exception while calling InfoApi: " + e.Message);
}
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();
}
}
}
Python Implementation: Streamlined Document Analysis
# Import modules
import groupdocs_viewer_cloud
# Get your client ID and client secret from https://dashboard.groupdocs.cloud/
client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
# Create API instance
api_instance = groupdocs_viewer_cloud.InfoApi.from_keys(client_id, client_secret)
# Define view options
view_options = groupdocs_viewer_cloud.ViewOptions()
view_options.file_info = groupdocs_viewer_cloud.FileInfo()
view_options.file_info.file_path = "SampleFiles/sample.docx"
view_options.view_format = "HTML"
try:
# Call the API to get document information
request = groupdocs_viewer_cloud.GetInfoRequest(view_options)
response = api_instance.get_info(request)
print("Document information retrieved successfully!")
# Display basic document information
print(f"Format: {response.format} ({response.format_extension})")
print(f"Total pages: {len(response.pages)}")
# Display information about each page
print("\nPage information:")
for page in response.pages:
print(f" Page {page.number}: {page.width}x{page.height} pixels, Visible: {page.visible}")
# Display attachment information if any
if response.attachments and len(response.attachments) > 0:
print("\nAttachments:")
for attachment in response.attachments:
print(f" {attachment.name}")
else:
print("\nNo attachments found.")
# Display format-specific information if available
if response.pdf_view_info:
print("\nPDF-specific information available.")
if response.archive_view_info:
print("\nArchive-specific information available.")
if response.cad_view_info:
print("\nCAD-specific information available.")
if response.project_management_view_info:
print("\nProject Management-specific information available.")
if response.outlook_view_info:
print("\nOutlook-specific information available.")
except groupdocs_viewer_cloud.ApiException as e:
print(f"Exception while calling InfoApi: {e}")
Advanced Text Coordinate Extraction
For applications requiring sophisticated text interaction features, here’s how to extract and work with detailed text coordinate data:
Enhanced C# Example with Text Processing
// Define view options with text extraction enabled
var viewOptions = new ViewOptions
{
FileInfo = new FileInfo
{
FilePath = "SampleFiles/sample.docx"
},
ViewFormat = ViewOptions.ViewFormatEnum.PNG,
RenderOptions = new ImageOptions
{
ExtractText = true
}
};
try
{
// Call the API to get document information with text coordinates
var response = apiInstance.GetInfo(new GetInfoRequest(viewOptions));
Console.WriteLine("Document information with text coordinates retrieved successfully!");
// Process text information for the first page (as an example)
if (response.Pages.Count > 0 && response.Pages[0].Lines != null && response.Pages[0].Lines.Count > 0)
{
Console.WriteLine("\nText content of the first page:");
foreach (var line in response.Pages[0].Lines)
{
Console.WriteLine($"Line: \"{line.Value}\"");
Console.WriteLine($" Position: X={line.X}, Y={line.Y}, Width={line.Width}, Height={line.Height}");
Console.WriteLine(" Words:");
foreach (var word in line.Words)
{
Console.WriteLine($" Word: \"{word.Value}\"");
Console.WriteLine($" Position: X={word.X}, Y={word.Y}, Width={word.Width}, Height={word.Height}");
if (word.Characters != null && word.Characters.Count > 0)
{
Console.WriteLine(" Characters:");
foreach (var character in word.Characters)
{
Console.WriteLine($" Char: '{character.Value}' at X={character.X}, Y={character.Y}");
}
}
}
Console.WriteLine();
}
}
else
{
Console.WriteLine("No text data found in the document.");
}
}
catch (Exception e)
{
Console.WriteLine("Exception while calling InfoApi: " + e.Message);
}
Practical Applications: Building Document-Aware Systems
Let’s explore how to use document information in real-world applications.
Smart Document Preview Generator
Use document information to create intelligent preview interfaces:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document Preview</title>
<style>
body {
font-family: Arial, sans-serif;
max-width: 800px;
margin: 0 auto;
padding: 20px;
}
.document-info {
border: 1px solid #ddd;
padding: 20px;
border-radius: 5px;
margin-bottom: 20px;
}
.document-info h2 {
margin-top: 0;
color: #333;
}
.info-row {
display: flex;
margin-bottom: 10px;
}
.info-label {
width: 150px;
font-weight: bold;
}
.pages-list {
display: flex;
flex-wrap: wrap;
gap: 10px;
margin-top: 20px;
}
.page-thumbnail {
border: 1px solid #ddd;
padding: 5px;
width: 100px;
height: 150px;
display: flex;
align-items: center;
justify-content: center;
background-color: #f9f9f9;
}
.attachments-list {
margin-top: 20px;
}
.attachment-item {
display: flex;
align-items: center;
margin-bottom: 10px;
}
.attachment-icon {
margin-right: 10px;
font-size: 20px;
}
</style>
</head>
<body>
<div class="document-info">
<h2>Document Information</h2>
<div class="info-row">
<div class="info-label">Format:</div>
<div>Microsoft Word Open XML Document (.docx)</div>
</div>
<div class="info-row">
<div class="info-label">Total Pages:</div>
<div>3</div>
</div>
<div class="info-row">
<div class="info-label">Page Dimensions:</div>
<div>595x841 pixels</div>
</div>
</div>
<h3>Page Thumbnails</h3>
<div class="pages-list">
<div class="page-thumbnail">Page 1</div>
<div class="page-thumbnail">Page 2</div>
<div class="page-thumbnail">Page 3</div>
</div>
<h3>Attachments</h3>
<div class="attachments-list">
<div class="attachment-item">
<div class="attachment-icon">📄</div>
<div>No attachments found</div>
</div>
</div>
</body>
</html>
Intelligent Document Viewer Configuration
Use document information to automatically configure your viewer interface:
// Example pseudo-code for a document viewer application
function configureDocumentViewer(documentInfo) {
// Set the total number of pages for pagination
viewer.setTotalPages(documentInfo.pages.length);
// Set the initial zoom based on the page dimensions
if (documentInfo.pages.length > 0) {
const firstPage = documentInfo.pages[0];
const aspectRatio = firstPage.width / firstPage.height;
viewer.setOptimalZoom(aspectRatio);
}
// Configure attachments panel if needed
if (documentInfo.attachments && documentInfo.attachments.length > 0) {
viewer.showAttachmentsPanel();
viewer.setAttachments(documentInfo.attachments);
} else {
viewer.hideAttachmentsPanel();
}
// Configure text selection if text coordinates are available
const hasTextData = documentInfo.pages.some(page =>
page.lines && page.lines.length > 0
);
if (hasTextData) {
viewer.enableTextSelection(documentInfo.pages);
} else {
viewer.disableTextSelection();
}
// Configure special features based on document type
if (documentInfo.pdfViewInfo) {
viewer.enablePdfFeatures();
}
if (documentInfo.cadViewInfo) {
viewer.enableCadFeatures();
}
}
Performance Best Practices
Optimizing Document Information Retrieval
Here are proven strategies to get the best performance from document information extraction:
Batch Processing Strategy: When analyzing multiple documents, group requests by document type. Processing similar formats together improves cache efficiency and reduces processing time.
Selective Text Extraction: Only enable text coordinate extraction when you actually need it. Text processing adds overhead, so use it specifically for search, selection, or annotation features.
Caching Strategies: Document information rarely changes. Cache results with appropriate TTL (time-to-live) values to avoid repeated API calls for the same documents.
Progressive Loading: For large documents, consider retrieving information for visible pages first, then load additional page data as needed.
Memory Management Considerations
Large documents with text coordinates can generate substantial response data. Here’s how to handle it efficiently:
- Stream Processing: Process text data page by page rather than loading entire document coordinates into memory
- Selective Parsing: Extract only the coordinate levels you need (line, word, or character)
- Data Compression: Use gzip compression for API responses to reduce transfer time and bandwidth usage
Common Issues and Troubleshooting
Authentication Problems
Issue: “Invalid JWT token” or “Authentication failed” errors Solution:
- Verify your Client ID and Client Secret are correct in your dashboard
- Ensure you’re generating a fresh JWT token for each session
- Check that your token hasn’t expired (tokens have limited lifetime)
Issue: Token expires during long processing sessions Solution: Implement automatic token refresh logic in your application
File Access Problems
Issue: “File not found” errors even when the file exists Solution:
- Double-check the file path in your request matches exactly what’s in cloud storage
- Verify file upload completed successfully before attempting to retrieve information
- Use forward slashes (/) in file paths, not backslashes ()
Text Extraction Issues
Issue: Text coordinates return empty even though the document has text Common Causes and Solutions:
- Wrong View Format: Text extraction only works with PNG or JPG view formats, not HTML
- Unsupported Document Type: Some document formats don’t support text coordinate extraction
- Scanned Documents: Image-based PDFs or scanned documents may not have extractable text
- Encrypted Documents: Password-protected documents may block text extraction
Performance Issues
Issue: Slow response times for document information retrieval Troubleshooting Steps:
- Large Documents: Break processing into smaller chunks or specific page ranges
- Network Latency: Consider using regional API endpoints closer to your location
- Concurrent Requests: Implement rate limiting to avoid overwhelming the API
- Text Extraction Overhead: Disable text extraction for use cases that don’t require it
Format-Specific Considerations
Microsoft Office Documents: Generally excellent support for both metadata and text extraction PDF Files: Support varies based on PDF type (native vs. scanned) CAD Files: May have limited text extraction capabilities Archive Files: Focus on attachment enumeration rather than text content Email Files: Excellent support for both content and attachment information
Exercise 1: Multi-Format Document Analysis
- Upload different document types (PDF, DOCX, XLSX, PPTX, images)
- Compare the information returned for each format
- Note format-specific properties and differences in text extraction support
Exercise 2: Build a Document Cataloging Tool
- Create a simple application that processes multiple documents
- Extract and store metadata in a database or spreadsheet
- Include file size estimation based on page count and dimensions
Exercise 3: Implement Smart Text Search
- Extract text coordinates from a document with substantial text content
- Build a search function that highlights matching words using coordinate data
- Test with different search terms and edge cases
Exercise 4: Attachment Processing Workflow
- Find or create documents with attachments (emails work great)
- Extract attachment information and display it to users
- Plan how you’d implement attachment download or processing
Comparing GroupDocs.Viewer Cloud with Alternatives
Understanding how GroupDocs.Viewer Cloud compares to other document information extraction solutions helps you make informed decisions:
Advantages of GroupDocs.Viewer Cloud:
- Comprehensive Format Support: Handles 170+ document formats out of the box
- Detailed Text Coordinates: Hierarchical word/character positioning for advanced features
- Cloud-Native Architecture: No infrastructure management required
- SDK Availability: Official SDKs for all major programming languages
- Attachment Handling: Built-in support for embedded files and complex document structures
When to Consider Alternatives:
- Offline Processing Requirements: If you need on-premises document processing
- Simple Metadata Only: For basic file information, simpler solutions might suffice
- High-Volume Processing: Large-scale operations might benefit from dedicated infrastructure
- Specialized Formats: Some niche document types might need specialized tools
Advanced Topics and Next Steps
Now that you’ve mastered document information extraction, here are logical next steps to expand your document processing capabilities:
Advanced Implementation Patterns
- Microservices Architecture: Integrate document information extraction into larger systems
- Batch Processing Workflows: Handle thousands of documents efficiently
- Real-Time Document Analysis: Process documents as they’re uploaded
- Machine Learning Integration: Use extracted text for content classification and analysis
Documentation and Tools
- Complete Product Overview Understand all available features
- Comprehensive API Documentation Detailed reference for every endpoint
- Interactive Live Demo Test features without writing code