Document Info API Tutorial: Extract Metadata & Analyze Files
What You’ll Master in This Tutorial
Ever wondered how to peek inside a document without actually opening it? You’re about to learn exactly that! In this comprehensive tutorial, you’ll discover how to:
- Extract document metadata and structure information programmatically
- Analyze document content, dimensions, and format-specific details
- Build intelligent pre-processing systems that make smart rendering decisions
- Handle everything from simple PDFs to complex CAD drawings and archives
Whether you’re building a document management system or just need to understand what’s inside your files before processing them, this guide will get you there with real Python code examples you can use immediately.
Before We Dive In
Here’s what you’ll need to follow along:
- A GroupDocs Cloud account (free tier available)
- Basic Python knowledge and REST API familiarity
- Sample documents in various formats for testing
Quick setup tip: Grab your Client ID and Secret from the dashboard - you’ll need these for authentication throughout the tutorial.
Why InfoResult Matters for Your Applications
Think of InfoResult as your document’s “digital fingerprint.” Before you invest time and resources rendering a massive CAD file or a multi-page PDF, wouldn’t you want to know:
- How many pages it contains?
- What the actual dimensions are?
- Whether it has text you can search?
- If there are special permissions or restrictions?
That’s exactly what the Document Info API delivers through the InfoResult data structure. It’s like having X-ray vision for your documents - you get comprehensive insights without the overhead of full rendering.
Understanding InfoResult: Your Document Analysis Toolkit
InfoResult isn’t just a simple response object - it’s a comprehensive analysis report that contains several key components:
Core Information:
- FormatExtension/Format: What type of document you’re dealing with
- Pages: Detailed page-by-page information including dimensions and content
- Attachments: Any embedded files or attachments within the document
Format-Specific Insights:
- PDF documents: Security permissions, printing restrictions
- CAD drawings: Available layouts, layers, and their visibility
- Archives: Folder structure and organization
- Project files: Timeline information and resource details
The beauty of this approach? You make informed decisions about how to process each document based on its actual characteristics, not assumptions.
Step-by-Step Tutorial: From Basic to Advanced
Step 1: Getting Basic Document Information (Start Here!)
Let’s start with the fundamentals - extracting basic information about any document:
# Tutorial Code Example: Getting basic document information
import os
from groupdocs_viewer_cloud import Configuration, ViewerApi, InfoOptions, FileInfo
# Configure the API client
configuration = Configuration(client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET")
viewer_api = ViewerApi.from_config(configuration)
# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/sample.pdf"
# Create info options
info_options = InfoOptions()
info_options.file_info = file_info
# Get document information
info_result = viewer_api.get_info(info_options)
# Display basic document information
print(f"Document format: {info_result.format}")
print(f"File extension: {info_result.format_extension}")
print(f"Total pages: {len(info_result.pages)}")
# Display page dimensions for the first page
if info_result.pages:
    first_page = info_result.pages[0]
    print(f"\nFirst page information:")
    print(f" - Page number: {first_page.number}")
    print(f" - Width: {first_page.width} pixels")
    print(f" - Height: {first_page.height} pixels")
    print(f" - Visible: {first_page.visible}")
What’s happening here? You’re asking the API to analyze your document and return its basic characteristics. This information alone can help you decide whether to render the document as images (for precise layouts) or HTML (for searchable text).
Pro tip: Always check the page visibility property - some documents have hidden pages that you might not want to include in your rendered output.
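The visibility check from the tip above can be sketched as a small filter. This is a minimal illustration, assuming each page exposes `number` and `visible` attributes as in the Step 1 example; `visible_page_numbers` is a hypothetical helper name, and the `SimpleNamespace` pages are stand-ins for a real response.

```python
from types import SimpleNamespace


def visible_page_numbers(pages):
    """Return the numbers of pages that are not flagged as hidden."""
    # Pages without a `visible` attribute are treated as visible (an assumption).
    return [page.number for page in (pages or []) if getattr(page, "visible", True)]


# Stand-in pages; a real list would come from info_result.pages.
sample_pages = [
    SimpleNamespace(number=1, visible=True),
    SimpleNamespace(number=2, visible=False),
    SimpleNamespace(number=3, visible=True),
]

print(visible_page_numbers(sample_pages))
```

The resulting list of page numbers can then drive whichever rendering call you use, so hidden pages never reach the output.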
Step 2: Extracting Text Content (The Game-Changer)
Here’s where things get interesting. You can actually extract the text content from documents without fully rendering them:
# Tutorial Code Example: Extracting text content
from groupdocs_viewer_cloud import InfoOptions, FileInfo
# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/text-document.docx"
# Create info options with text extraction
info_options = InfoOptions()
info_options.file_info = file_info
info_options.extract_text = True # Request text extraction
# Get document information with text content
info_result = viewer_api.get_info(info_options)
# Display text content statistics
print(f"Document has {len(info_result.pages)} pages with text content")
# Analyze text content from the first page
if info_result.pages and hasattr(info_result.pages[0], 'lines') and info_result.pages[0].lines:
    first_page = info_result.pages[0]
    lines_count = len(first_page.lines)
    print(f"\nText content from first page:")
    print(f" - Total lines: {lines_count}")
    # Show first few lines as example
    for i, line in enumerate(first_page.lines[:3]):
        print(f" - Line {i+1}: {line.value}")
        if hasattr(line, 'words') and line.words:
            word_count = len(line.words)
            print(f" Contains {word_count} words")
    # Count total words on the page (a missing or empty word list counts as zero)
    total_words = sum(len(getattr(line, 'words', None) or []) for line in first_page.lines)
    print(f"\nTotal words on first page: {total_words}")
else:
    print("\nNo text content extracted or available")
Why this matters: Text extraction capabilities help you determine whether a document is text-heavy (good candidate for HTML rendering with search functionality) or image-based (better suited for PNG/PDF rendering).
Common use case: Building a document search system? Use this feature to index document content without storing massive rendered files.
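As a rough sketch of the indexing idea, the extracted lines can be folded into a word-to-pages lookup. It assumes pages carry `number` and `lines`, and that each line has a `value` string, as in the Step 2 example; `build_word_index` is a hypothetical helper and the sample pages are stand-ins, not real API output.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_word_index(pages):
    """Map each lowercased word to the sorted page numbers it appears on."""
    index = defaultdict(set)
    for page in pages or []:
        for line in getattr(page, "lines", None) or []:
            for raw_word in line.value.split():
                word = raw_word.lower().strip(".,;:!?")
                if word:
                    index[word].add(page.number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Stand-in pages; real ones would come from info_result.pages with extract_text enabled.
sample_pages = [
    SimpleNamespace(number=1, lines=[SimpleNamespace(value="Quarterly report, draft")]),
    SimpleNamespace(number=2, lines=[SimpleNamespace(value="Final report")]),
]

word_index = build_word_index(sample_pages)
print(word_index["report"])
```

A production index would need proper tokenization and persistence; the point here is only that the text can be indexed without storing rendered output.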
Step 3: Discovering Document Attachments
Some documents are like Russian dolls - they contain other documents inside them. Here’s how to find them:
# Tutorial Code Example: Analyzing document attachments
from groupdocs_viewer_cloud import InfoOptions, FileInfo
# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/with-attachments.msg" # Email with attachments
# Create info options
info_options = InfoOptions()
info_options.file_info = file_info
# Get document information
info_result = viewer_api.get_info(info_options)
# Check for attachments
if info_result.attachments and len(info_result.attachments) > 0:
    print(f"Document contains {len(info_result.attachments)} attachments:")
    for i, attachment in enumerate(info_result.attachments):
        print(f" {i+1}. {attachment.name}")
    # Generate example code for handling attachments
    print("\nExample code for handling attachments:")
    print("python")
    print("# Process each attachment")
    print("for attachment in info_result.attachments:")
    print(" # Get the attachment name")
    print(" attachment_name = attachment.name")
    print(" print(f'Processing attachment: {attachment_name}')")
    print(" ")
    print(" # Create file info for the attachment")
    print(" attachment_file_info = FileInfo()")
    print(" attachment_file_info.file_path = f'documents/attachments/{attachment_name}'")
    print(" ")
    print(" # Now you can process the attachment separately")
    print(" # For example, render it to HTML")
    print(" attachment_view_options = ViewOptions()")
    print(" attachment_view_options.file_info = attachment_file_info")
    print(" attachment_view_options.view_format = 'HTML'")
    print(" attachment_result = viewer_api.view(attachment_view_options)")
    print("")
else:
    print("Document does not contain attachments")
Real-world application: Email processing systems often need to handle attachments separately. This approach lets you identify what’s inside before deciding how to process each piece.
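One way to decide how to process each piece is to group attachment names by file extension first. The sketch below assumes only that each attachment has a `name` attribute, as in the example above; `group_attachments_by_extension` and the sample names are hypothetical.

```python
import os
from collections import defaultdict
from types import SimpleNamespace


def group_attachments_by_extension(attachments):
    """Group attachment names by lowercased extension (without the dot)."""
    groups = defaultdict(list)
    for attachment in attachments or []:
        _, extension = os.path.splitext(attachment.name)
        key = extension.lower().lstrip(".") or "unknown"
        groups[key].append(attachment.name)
    return dict(groups)


# Stand-in attachments; real ones would come from info_result.attachments.
sample_attachments = [
    SimpleNamespace(name="report.PDF"),
    SimpleNamespace(name="photo.jpg"),
    SimpleNamespace(name="notes.pdf"),
    SimpleNamespace(name="README"),
]

grouped = group_attachments_by_extension(sample_attachments)
print(grouped)
```

Each group can then be routed to its own handler, for example sending PDFs to one pipeline and images to another.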
Step 4: Analyzing PDF-Specific Security Information
PDFs can have security restrictions that affect how you can process them. Let’s check for those:
# Tutorial Code Example: Analyzing PDF-specific information
from groupdocs_viewer_cloud import InfoOptions, FileInfo
# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/secured.pdf"
# Create info options
info_options = InfoOptions()
info_options.file_info = file_info
# Get document information
info_result = viewer_api.get_info(info_options)
# Check for PDF-specific information
if hasattr(info_result, 'pdf_view_info') and info_result.pdf_view_info:
    print(f"PDF document information:")
    print(f" - Printing allowed: {info_result.pdf_view_info.printing_allowed}")
    # Make a decision based on printing permission
    if info_result.pdf_view_info.printing_allowed:
        print(" This document can be printed")
    else:
        print(" This document has printing restrictions")
    # Print Security Considerations
    print("\nSecurity considerations for this PDF:")
    if not info_result.pdf_view_info.printing_allowed:
        print(" - Disable print button in your viewer application")
        print(" - Add a watermark indicating printing is not allowed")
else:
    print("No PDF-specific information available or not a PDF document")
Why this matters for your app: Respecting document security settings isn’t just good practice - it’s often a legal requirement. Use this information to configure your viewer interface appropriately.
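A minimal sketch of turning the printing permission into viewer settings might look like this. The setting keys (`show_print_button`, `show_no_print_watermark`) and the `viewer_ui_settings` helper are hypothetical names for your own application; only `pdf_view_info.printing_allowed` comes from the example above.

```python
from types import SimpleNamespace


def viewer_ui_settings(info_result):
    """Derive viewer toggles from the PDF printing permission, if present."""
    pdf_info = getattr(info_result, "pdf_view_info", None)
    # Design choice in this sketch: with no PDF info, no restriction is applied.
    printing_allowed = True if pdf_info is None else bool(pdf_info.printing_allowed)
    return {
        "show_print_button": printing_allowed,
        "show_no_print_watermark": not printing_allowed,
    }


# Stand-ins for a restricted PDF and for a non-PDF document.
restricted_pdf = SimpleNamespace(pdf_view_info=SimpleNamespace(printing_allowed=False))
non_pdf = SimpleNamespace(pdf_view_info=None)

print(viewer_ui_settings(restricted_pdf))
print(viewer_ui_settings(non_pdf))
```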
Step 5: Working with CAD Drawings (For Technical Documents)
CAD files are complex beasts with multiple layouts and layers. Here’s how to understand their structure:
# Tutorial Code Example: Analyzing CAD drawing information
from groupdocs_viewer_cloud import InfoOptions, FileInfo
# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/drawing.dwg"
# Create info options
info_options = InfoOptions()
info_options.file_info = file_info
# Get document information
info_result = viewer_api.get_info(info_options)
# Check for CAD-specific information
if hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print(f"CAD drawing information:")
    # Display layouts
    if hasattr(info_result.cad_view_info, 'layouts') and info_result.cad_view_info.layouts:
        print(f" - Available layouts ({len(info_result.cad_view_info.layouts)}):")
        for layout in info_result.cad_view_info.layouts:
            print(f" * {layout.name} - {layout.width}x{layout.height}")
    # Display layers
    if hasattr(info_result.cad_view_info, 'layers') and info_result.cad_view_info.layers:
        print(f"\n - Available layers ({len(info_result.cad_view_info.layers)}):")
        for layer in info_result.cad_view_info.layers:
            visibility = "Visible" if layer.visible else "Hidden"
            print(f" * {layer.name} - {visibility}")
    # Generate example code for rendering specific layout and layers
    print("\nExample code for rendering specific layout and layers:")
    print("python")
    print("from groupdocs_viewer_cloud import ViewOptions, HtmlOptions, CadOptions, FileInfo")
    print("")
    print("# Set up file info")
    print("file_info = FileInfo()")
    print("file_info.file_path = 'documents/drawing.dwg'")
    print("")
    print("# Create CAD options")
    print("cad_options = CadOptions()")
    if info_result.cad_view_info.layouts and len(info_result.cad_view_info.layouts) > 0:
        print(f"cad_options.layout_name = '{info_result.cad_view_info.layouts[0].name}' # Render specific layout")
    if info_result.cad_view_info.layers and len(info_result.cad_view_info.layers) > 0:
        visible_layers = [layer.name for layer in info_result.cad_view_info.layers if layer.visible]
        if visible_layers:
            layers_str = "', '".join(visible_layers[:3]) # Take up to 3 layers for example
            print(f"cad_options.layers = ['{layers_str}'] # Render specific layers")
    print("")
    print("# Set up rendering options")
    print("html_options = HtmlOptions()")
    print("html_options.cad_options = cad_options")
    print("")
    print("# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'HTML'")
    print("view_options.render_options = html_options")
    print("")
    print("# Render the CAD drawing")
    print("result = viewer_api.view(view_options)")
    print("")
else:
    print("No CAD-specific information available or not a CAD drawing")
Professional tip: CAD files often contain sensitive information in hidden layers. Always check layer visibility before rendering to avoid exposing confidential data.
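The layer check can be reduced to a small helper that separates visible from hidden layers before anything is rendered. This is a sketch under the same assumptions as the example above (layers with `name` and `visible`); `split_layers_by_visibility` and the sample layer names are hypothetical.

```python
from types import SimpleNamespace


def split_layers_by_visibility(cad_view_info):
    """Return (visible_names, hidden_names) for the layers of a CAD drawing."""
    layers = getattr(cad_view_info, "layers", None) or []
    visible = [layer.name for layer in layers if layer.visible]
    hidden = [layer.name for layer in layers if not layer.visible]
    return visible, hidden


# Stand-in for info_result.cad_view_info.
sample_cad_info = SimpleNamespace(layers=[
    SimpleNamespace(name="Walls", visible=True),
    SimpleNamespace(name="Pricing", visible=False),
    SimpleNamespace(name="Doors", visible=True),
])

visible_names, hidden_names = split_layers_by_visibility(sample_cad_info)
print("Render:", visible_names)
print("Review before exposing:", hidden_names)
```

Passing only the visible names to the rendering options keeps hidden layers out of the output unless someone explicitly opts in.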
Step 6: Understanding Archive File Structure
Archive files (ZIP, RAR, etc.) have folder structures that you might want to navigate:
# Tutorial Code Example: Analyzing archive information
from groupdocs_viewer_cloud import InfoOptions, FileInfo
# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/archive.zip"
# Create info options
info_options = InfoOptions()
info_options.file_info = file_info
# Get document information
info_result = viewer_api.get_info(info_options)
# Check for archive-specific information
if hasattr(info_result, 'archive_view_info') and info_result.archive_view_info:
    print(f"Archive information:")
    # Display folders
    if hasattr(info_result.archive_view_info, 'folders') and info_result.archive_view_info.folders:
        print(f" - Folders in archive ({len(info_result.archive_view_info.folders)}):")
        for folder in info_result.archive_view_info.folders:
            print(f" * {folder}")
    # Generate example code for rendering a specific folder
    print("\nExample code for rendering a specific folder from the archive:")
    print("python")
    print("from groupdocs_viewer_cloud import ViewOptions, HtmlOptions, ArchiveOptions, FileInfo")
    print("")
    print("# Set up file info")
    print("file_info = FileInfo()")
    print("file_info.file_path = 'documents/archive.zip'")
    print("")
    print("# Create archive options")
    print("html_options = HtmlOptions()")
    print("html_options.archive_options = ArchiveOptions()")
    if info_result.archive_view_info.folders and len(info_result.archive_view_info.folders) > 0:
        print(f"html_options.archive_options.folder = '{info_result.archive_view_info.folders[0]}' # Render specific folder")
    print("html_options.archive_options.items_per_page = 15 # Items per page when rendering")
    print("")
    print("# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'HTML'")
    print("view_options.render_options = html_options")
    print("")
    print("# Render the archive contents")
    print("result = viewer_api.view(view_options)")
    print("")
else:
    print("No archive-specific information available or not an archive file")
Use case spotlight: Building a file browser interface? This information helps you create navigation structures that match the actual archive organization.
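For a file browser, the flat folder list can be turned into a nested structure. The sketch below assumes the folder entries are slash-separated path strings, which may not match what the API actually returns; `build_folder_tree` and the sample paths are hypothetical.

```python
def build_folder_tree(folders):
    """Turn 'a/b/c' style folder paths into nested dictionaries."""
    tree = {}
    for path in folders or []:
        node = tree
        for part in path.strip("/").split("/"):
            if part:
                # Reuse the existing child node or create an empty one.
                node = node.setdefault(part, {})
    return tree


# Stand-in for info_result.archive_view_info.folders.
sample_folders = ["docs", "docs/reports", "images/2024/"]

folder_tree = build_folder_tree(sample_folders)
print(folder_tree)
```

Each dictionary key becomes a node in the navigation tree, and an empty dictionary marks a folder with no subfolders.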
Step 7: Building Your Own Document Analysis Tool
Now let’s put it all together and create a comprehensive document analyzer that makes smart decisions:
# Tutorial Code Example: Comprehensive document analysis tool
from groupdocs_viewer_cloud import InfoOptions, FileInfo
import json
# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/unknown-document.pdf" # This could be any format
# Create info options with text extraction
info_options = InfoOptions()
info_options.file_info = file_info
info_options.extract_text = True # Get text content
# Get document information
info_result = viewer_api.get_info(info_options)
# Create comprehensive analysis report
print("Document Analysis Report")
print("======================\n")
# Basic document information
print(f"Document Type: {info_result.format}")
print(f"Extension: {info_result.format_extension}")
print(f"Total Pages: {len(info_result.pages)}")
# Page dimensions analysis
landscape_pages = 0 # Default, so later checks work even when no dimensions are available
if info_result.pages:
    # Collect page dimensions
    widths = [page.width for page in info_result.pages if hasattr(page, 'width')]
    heights = [page.height for page in info_result.pages if hasattr(page, 'height')]
    if widths and heights:
        # Analyze page sizes
        min_width = min(widths)
        max_width = max(widths)
        min_height = min(heights)
        max_height = max(heights)
        avg_width = sum(widths) / len(widths)
        avg_height = sum(heights) / len(heights)
        # Check if all pages have same dimensions
        uniform_pages = min_width == max_width and min_height == max_height
        print("\nPage Dimensions Analysis:")
        print(f" - Page size is {'uniform' if uniform_pages else 'variable'}")
        if uniform_pages:
            print(f" - All pages are {widths[0]}x{heights[0]} pixels")
        else:
            print(f" - Width range: {min_width}-{max_width} pixels (avg: {avg_width:.1f})")
            print(f" - Height range: {min_height}-{max_height} pixels (avg: {avg_height:.1f})")
        # Determine optimal rendering settings based on page sizes
        print("\nRecommended Rendering Settings:")
        if max_width > 1000 or max_height > 1400:
            print(" - Use high-resolution rendering for detailed content")
        # Page orientation analysis
        portrait_pages = sum(1 for w, h in zip(widths, heights) if h > w)
        landscape_pages = sum(1 for w, h in zip(widths, heights) if w > h)
        square_pages = sum(1 for w, h in zip(widths, heights) if w == h)
        print("\nPage Orientation Analysis:")
        print(f" - Portrait pages: {portrait_pages}")
        print(f" - Landscape pages: {landscape_pages}")
        print(f" - Square pages: {square_pages}")
        if landscape_pages > 0:
            print(" - Note: Document contains landscape pages, consider appropriate viewer layout")
# Text content analysis
has_text = False
text_lines_count = 0
total_words = 0
for page in info_result.pages or []:
    if hasattr(page, 'lines') and page.lines:
        has_text = True
        text_lines_count += len(page.lines)
        # A missing or empty word list counts as zero words
        page_words = sum(len(getattr(line, 'words', None) or []) for line in page.lines)
        total_words += page_words
if has_text:
    print("\nText Content Analysis:")
    print(f" - Total text lines: {text_lines_count}")
    print(f" - Approximate word count: {total_words}")
    print(f" - Average words per page: {total_words / len(info_result.pages):.1f}")
    # Determine if document is text-heavy
    if total_words > 500:
        print(" - Document is text-heavy, consider enabling text search functionality")
else:
    print("\nText Content Analysis:")
    print(" - No text content found or extracted")
    print(" - Document may be image-based or scanned")
    print(" - Consider OCR processing if text search is required")
# Format-specific analysis
print("\nFormat-Specific Information:")
# PDF analysis
if hasattr(info_result, 'pdf_view_info') and info_result.pdf_view_info:
    print("PDF Document Properties:")
    print(f" - Printing allowed: {info_result.pdf_view_info.printing_allowed}")
    if not info_result.pdf_view_info.printing_allowed:
        print(" - Security Note: Document has printing restrictions")
# CAD analysis
if hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print("CAD Drawing Properties:")
    if hasattr(info_result.cad_view_info, 'layouts') and info_result.cad_view_info.layouts:
        print(f" - Contains {len(info_result.cad_view_info.layouts)} layouts")
    if hasattr(info_result.cad_view_info, 'layers') and info_result.cad_view_info.layers:
        visible_layers = sum(1 for layer in info_result.cad_view_info.layers if layer.visible)
        hidden_layers = len(info_result.cad_view_info.layers) - visible_layers
        print(f" - Contains {len(info_result.cad_view_info.layers)} layers " +
              f"({visible_layers} visible, {hidden_layers} hidden)")
# Archive analysis
if hasattr(info_result, 'archive_view_info') and info_result.archive_view_info:
    print("Archive Properties:")
    if hasattr(info_result.archive_view_info, 'folders') and info_result.archive_view_info.folders:
        print(f" - Contains {len(info_result.archive_view_info.folders)} folders")
# Project management analysis
if hasattr(info_result, 'project_management_view_info') and info_result.project_management_view_info:
    print("Project Management Properties:")
    if hasattr(info_result.project_management_view_info, 'start_date'):
        print(f" - Project start date: {info_result.project_management_view_info.start_date}")
    if hasattr(info_result.project_management_view_info, 'end_date'):
        print(f" - Project end date: {info_result.project_management_view_info.end_date}")
# Outlook data analysis
if hasattr(info_result, 'outlook_view_info') and info_result.outlook_view_info:
    print("Outlook Data Properties:")
    if hasattr(info_result.outlook_view_info, 'folders') and info_result.outlook_view_info.folders:
        print(f" - Contains {len(info_result.outlook_view_info.folders)} folders")
# Final recommendations
print("\nRendering Recommendations:")
print(" - Recommended format: ", end="")
if has_text:
    print("HTML (for text searchability)")
elif hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print("PNG (for crisp lines and details)")
else:
    print("PDF (for compatibility and document fidelity)")
# Generate rendering code based on analysis
print("\nSuggested Rendering Code:")
print("python")
print("from groupdocs_viewer_cloud import ViewOptions, FileInfo")
if has_text:
    print("from groupdocs_viewer_cloud import HtmlOptions")
    print("\n# Create HTML options for text searchability")
    print("html_options = HtmlOptions()")
    print("html_options.is_responsive = True")
    if landscape_pages > 0:
        print("# Enable proper layout for landscape pages")
    if hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
        print("# Configure CAD options")
        # The generated snippet needs CadOptions imported before it is used
        print("from groupdocs_viewer_cloud import CadOptions")
        print("html_options.cad_options = CadOptions()")
        if info_result.cad_view_info.layouts and len(info_result.cad_view_info.layouts) > 0:
            print(f"html_options.cad_options.layout_name = '{info_result.cad_view_info.layouts[0].name}'")
    print("\n# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'HTML'")
    print("view_options.render_options = html_options")
elif hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print("from groupdocs_viewer_cloud import ImageOptions, CadOptions")
    print("\n# Create PNG options for CAD drawing")
    print("image_options = ImageOptions()")
    print("image_options.cad_options = CadOptions()")
    if info_result.cad_view_info.layouts and len(info_result.cad_view_info.layouts) > 0:
        print(f"image_options.cad_options.layout_name = '{info_result.cad_view_info.layouts[0].name}'")
    print("image_options.width = 1200 # High resolution for details")
    print("\n# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'PNG'")
    print("view_options.render_options = image_options")
else:
    print("from groupdocs_viewer_cloud import PdfOptions")
    print("\n# Create PDF options for general compatibility")
    print("pdf_options = PdfOptions()")
    print("\n# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'PDF'")
    print("view_options.render_options = pdf_options")
print("\n# Render the document")
print("result = viewer_api.view(view_options)")
print("")
This comprehensive analyzer does the heavy lifting for you - it examines your document and recommends the best rendering approach based on the actual content characteristics.
Hands-On Practice: Try These Challenges
Ready to test your newfound skills? Here are some practical exercises:
Beginner Challenge: Create a script that categorizes documents as “text-heavy,” “image-based,” or “mixed” based on their text content analysis.
Intermediate Challenge: Build a CAD drawing inspector that generates a report showing all available layouts and layers, helping users choose what to render.
Advanced Challenge: Develop a batch document processor that analyzes multiple files and creates optimal rendering configurations for each based on their unique characteristics.
Common Pitfalls and How to Avoid Them
Text Extraction Troubles: If you’re not getting text content, double-check that extract_text is set to True in your InfoOptions. Also remember that image-based documents (like scanned PDFs) won’t have extractable text without OCR.
Format-Specific Info Missing: Always verify the document format before trying to access format-specific properties. A Word document won’t have CAD layers, and a PDF won’t have project timeline information.
Performance Considerations: Text extraction adds processing time, especially for large documents. Only request it when you actually need the text content for your application.
Page Dimension Edge Cases: Some document formats might not report accurate dimensions until after rendering. Use the InfoResult data as guidance, but be prepared to handle edge cases.
Best Practices for Production Applications
Smart Caching Strategy: Document information doesn’t change unless the file changes. Cache InfoResult data to avoid repeated API calls for the same documents.
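A minimal in-memory sketch of this caching idea follows. It assumes you can supply some version stamp per file (for example a last-modified time or content hash); `InfoCache`, the `fetch` callable, and the sample values are all hypothetical and stand in for whatever actually performs the API call.

```python
class InfoCache:
    """Cache analysis results keyed by file path and a version stamp."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable(path) -> analysis result
        self._entries = {}     # path -> (version, result)

    def get(self, path, version):
        cached = self._entries.get(path)
        if cached is not None and cached[0] == version:
            return cached[1]   # same file version: reuse the stored result
        result = self._fetch(path)
        self._entries[path] = (version, result)
        return result


# Stand-in fetch function that records how often it is called.
fetch_calls = []

def fake_fetch(path):
    fetch_calls.append(path)
    return {"path": path, "pages": 3}

cache = InfoCache(fake_fetch)
cache.get("documents/sample.pdf", version="v1")
cache.get("documents/sample.pdf", version="v1")   # served from the cache
cache.get("documents/sample.pdf", version="v2")   # file changed: fetched again
print(len(fetch_calls))
```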
Error Handling: Always wrap your API calls in try/except blocks. Network issues, invalid files, or API limits can cause failures that you’ll want to handle gracefully.
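One possible shape for graceful failure handling is a small retry wrapper with exponential backoff. This is a generic sketch, not the SDK's own mechanism: `call_with_retry` and `make_flaky` are hypothetical helpers, and in real code you would catch the SDK's specific exception type rather than `Exception`.

```python
import time


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on an exception, back off and retry up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # narrow this to the relevant exception type
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


def make_flaky(failures):
    """Return a function that raises `failures` times, then returns 'ok'."""
    state = {"remaining": failures}

    def flaky():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated transient failure")
        return "ok"

    return flaky


# Two simulated failures, then success; sleeping is disabled for the demo.
result = call_with_retry(make_flaky(2), sleep=lambda _seconds: None)
print(result)
```

In practice `func` would be a small lambda or function wrapping the info request, so the retry policy stays separate from the request code.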
Batch Processing: If you’re analyzing many documents, consider implementing batch processing with rate limiting to stay within API quotas.
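A simple form of rate limiting is to enforce a minimum interval between calls. The sketch below is one such approach; `MinIntervalLimiter` is a hypothetical class, the one-second interval is arbitrary, and the demo uses a fake clock so it runs instantly.

```python
import time


class MinIntervalLimiter:
    """Ensure at least `interval` seconds pass between consecutive calls."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Deterministic demo with a fake clock, so nothing actually sleeps.
fake_now = [0.0]
slept = []

def fake_clock():
    return fake_now[0]

def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds

limiter = MinIntervalLimiter(1.0, clock=fake_clock, sleep=fake_sleep)
limiter.wait()   # first call passes immediately
limiter.wait()   # second call waits out the full interval
print(slept)
```

In a batch loop you would call `limiter.wait()` before each request, using the default real clock and an interval chosen from your plan's quota.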
Security Awareness: Respect document security settings revealed by InfoResult. If a PDF restricts printing, honor that in your application interface.
Performance Optimization Tips
Selective Information Requests: Only extract text when you need it, since it is typically the most expensive part of an info request.
Asynchronous Processing: For batch document analysis, use asynchronous processing to handle multiple files concurrently.
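One way to analyze several files concurrently is a thread pool, which suits I/O-bound API calls. This sketch uses the standard library's `ThreadPoolExecutor` rather than `asyncio`; `analyze_many` and `fake_analyze` are hypothetical, with `fake_analyze` standing in for a function that calls the API.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_many(paths, analyze, max_workers=4):
    """Run analyze(path) concurrently; map each path to its result or exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(analyze, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as error:  # keep going so one bad file does not stop the batch
                results[path] = error
    return results


def fake_analyze(path):
    """Stand-in for a real analysis call."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot analyze {path}")
    return {"path": path, "name_length": len(path)}


results = analyze_many(["a.pdf", "b.docx", "c.bad"], fake_analyze)
for path, outcome in results.items():
    print(path, "->", outcome)
```

Keep `max_workers` modest so concurrent requests stay within your API quota.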
Smart Pre-filtering: Use basic file extension checks before calling the API to filter out unsupported formats early.
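A pre-filter of this kind can be as small as an extension check. The set below lists only the extensions used in this tutorial's examples and is not the service's actual list of supported formats; `is_candidate` is a hypothetical helper.

```python
import os

# Illustrative subset only; consult the product documentation for the real list.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".dwg", ".zip", ".msg"}


def is_candidate(path, supported=SUPPORTED_EXTENSIONS):
    """Return True when the file extension is in the supported set."""
    extension = os.path.splitext(path)[1].lower()
    return extension in supported


incoming = ["documents/sample.PDF", "documents/drawing.dwg", "notes.xyz", "README"]
candidates = [path for path in incoming if is_candidate(path)]
print(candidates)
```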
Advanced Integration Patterns
Event-Driven Processing: Set up webhook notifications when documents are uploaded, then use InfoResult to determine the optimal processing pipeline automatically.
Machine Learning Enhancement: Use InfoResult data as features for ML models that predict user preferences or document importance.
Multi-Format Workflows: Create adaptive workflows that handle different document types differently based on their InfoResult characteristics.
Performance Monitoring and Optimization
Key Metrics to Track:
- Average API response time for InfoResult calls
- Cache hit rates for document information
- Text extraction success rates by document type
- Format-specific feature detection accuracy
Optimization Strategies:
- Implement tiered caching (memory → Redis → database)
- Use connection pooling for API clients
- Batch similar document types for processing efficiency
- Monitor and adjust text extraction timeouts based on document size
Security Best Practices
Data Privacy: InfoResult may contain sensitive metadata. Ensure proper access controls and data handling procedures.
Permission Respect: Always honor document security settings revealed by the analysis. Implement UI controls that reflect document restrictions.
Audit Trails: Log document analysis activities for compliance and debugging purposes.
API Key Security: Rotate your GroupDocs Cloud credentials regularly and never expose them in client-side code.
Essential Resources for Your Toolkit
- Product Page - Latest features and pricing information
- API Documentation - Comprehensive technical reference
- Interactive API Reference - Test API calls directly in your browser
- Community Support Forum - Get help from experts and fellow developers
- Free Trial Access - Start building immediately with generous free limits