Document Info API Tutorial: Extract Metadata & Analyze Files

What You’ll Master in This Tutorial

Ever wondered how to peek inside a document without actually opening it? You’re about to learn exactly that! In this comprehensive tutorial, you’ll discover how to:

Extract document metadata and structure information programmatically
Analyze document content, dimensions, and format-specific details
Build intelligent pre-processing systems that make smart rendering decisions
Handle everything from simple PDFs to complex CAD drawings and archives

Whether you’re building a document management system or just need to understand what’s inside your files before processing them, this guide will get you there with real Python code examples you can use immediately.

Before We Dive In

Here’s what you’ll need to follow along:

A GroupDocs Cloud account (free tier available)
Basic Python knowledge and REST API familiarity
Sample documents in various formats for testing

Quick setup tip: Grab your Client ID and Secret from the dashboard - you’ll need these for authentication throughout the tutorial.

Why InfoResult Matters for Your Applications

Think of InfoResult as your document’s “digital fingerprint.” Before you invest time and resources rendering a massive CAD file or a multi-page PDF, wouldn’t you want to know:

How many pages it contains?
What the actual dimensions are?
Whether it has text you can search?
If there are special permissions or restrictions?

That’s exactly what the Document Info API delivers through the InfoResult data structure. It’s like having X-ray vision for your documents - you get comprehensive insights without the overhead of full rendering.

Understanding InfoResult: Your Document Analysis Toolkit

InfoResult isn’t just a simple response object - it’s a comprehensive analysis report that contains several key components:

Core Information:

FormatExtension/Format: What type of document you’re dealing with
Pages: Detailed page-by-page information including dimensions and content
Attachments: Any embedded files or attachments within the document

Format-Specific Insights:

PDF documents: Security permissions, printing restrictions
CAD drawings: Available layouts, layers, and their visibility
Archives: Folder structure and organization
Project files: Timeline information and resource details

The beauty of this approach? You make informed decisions about how to process each document based on its actual characteristics, not assumptions.

Step-by-Step Tutorial: From Basic to Advanced

Step 1: Getting Basic Document Information (Start Here!)

Let’s start with the fundamentals - extracting basic information about any document:

# Tutorial Code Example: Getting basic document information
from groupdocs_viewer_cloud import Configuration, ViewerApi, InfoOptions, FileInfo

# Configure the API client
configuration = Configuration(client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET")
viewer_api = ViewerApi.from_config(configuration)

# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/sample.pdf"

# Create info options
info_options = InfoOptions()
info_options.file_info = file_info

# Get document information
info_result = viewer_api.get_info(info_options)

# Display basic document information
print(f"Document format: {info_result.format}")
print(f"File extension: {info_result.format_extension}")
print(f"Total pages: {len(info_result.pages or [])}")

# Display page dimensions for the first page
if info_result.pages:
    first_page = info_result.pages[0]
    print(f"\nFirst page information:")
    print(f"  - Page number: {first_page.number}")
    print(f"  - Width: {first_page.width} pixels")
    print(f"  - Height: {first_page.height} pixels")
    print(f"  - Visible: {first_page.visible}")

What’s happening here? You’re asking the API to analyze your document and return its basic characteristics. This information alone can help you decide whether to render the document as images (for precise layouts) or HTML (for searchable text).

Pro tip: Always check the page visibility property - some documents have hidden pages that you might not want to include in your rendered output.
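
To act on that tip, a small helper can filter out hidden pages before you decide what to render. This is a minimal sketch: `visible_page_numbers` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins shaped like the page entries printed in Step 1, not real API responses. It assumes a page without a `visible` attribute should be treated as visible.

```python
from types import SimpleNamespace


def visible_page_numbers(pages):
    """Return the numbers of pages whose `visible` flag is truthy."""
    # Assumption: a missing `visible` attribute means the page is visible
    return [page.number for page in (pages or []) if getattr(page, "visible", True)]


# Stand-in objects shaped like the page entries from Step 1 (not real API output)
sample_pages = [
    SimpleNamespace(number=1, visible=True),
    SimpleNamespace(number=2, visible=False),
    SimpleNamespace(number=3, visible=True),
]
print(visible_page_numbers(sample_pages))  # [1, 3]
```

In your own code you would pass `info_result.pages` instead of `sample_pages`.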

Step 2: Extracting Text Content (The Game-Changer)

Here’s where things get interesting. You can actually extract the text content from documents without fully rendering them:

# Tutorial Code Example: Extracting text content
# (reuses the viewer_api client created in Step 1)
from groupdocs_viewer_cloud import InfoOptions, FileInfo

# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/text-document.docx"

# Create info options with text extraction
info_options = InfoOptions()
info_options.file_info = file_info
info_options.extract_text = True  # Request text extraction

# Get document information with text content
info_result = viewer_api.get_info(info_options)

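# Display text content statistics (count only pages that actually returned text lines)
all_pages = info_result.pages or []
pages_with_text = [page for page in all_pages if getattr(page, 'lines', None)]
print(f"Document has {len(pages_with_text)} of {len(all_pages)} pages with text content")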

# Analyze text content from the first page
if info_result.pages and hasattr(info_result.pages[0], 'lines') and info_result.pages[0].lines:
    first_page = info_result.pages[0]
    lines_count = len(first_page.lines)
    print(f"\nText content from first page:")
    print(f"  - Total lines: {lines_count}")
    
    # Show first few lines as example
    for i, line in enumerate(first_page.lines[:3]):
        print(f"  - Line {i+1}: {line.value}")
        if hasattr(line, 'words') and line.words:
            word_count = len(line.words)
            print(f"    Contains {word_count} words")
    
    # Count total words on the page
    total_words = sum(len(getattr(line, 'words', None) or []) for line in first_page.lines)
    print(f"\nTotal words on first page: {total_words}")
else:
    print("\nNo text content extracted or available")

Why this matters: Text extraction capabilities help you determine whether a document is text-heavy (good candidate for HTML rendering with search functionality) or image-based (better suited for PNG/PDF rendering).

Common use case: Building a document search system? Use this feature to index document content without storing massive rendered files.
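
As a rough illustration of that search use case, the extracted lines can feed a simple inverted index that maps each word to the pages it appears on. This is a sketch under stated assumptions: `build_word_index` is a hypothetical helper, the punctuation stripping is deliberately naive, and the sample objects are stand-ins that mimic the `pages`, `lines`, and `value` attributes used in Step 2.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_word_index(pages):
    """Map each lowercase word to the sorted page numbers it appears on."""
    index = defaultdict(set)
    for page in pages or []:
        for line in getattr(page, "lines", None) or []:
            for word in (line.value or "").split():
                # Naive normalization: trim common punctuation and lowercase
                token = word.strip(".,;:!?()\"'").lower()
                if token:
                    index[token].add(page.number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Stand-in data shaped like extracted text (not real API output)
sample_pages = [
    SimpleNamespace(number=1, lines=[SimpleNamespace(value="Quarterly report, draft")]),
    SimpleNamespace(number=2, lines=[SimpleNamespace(value="Report appendix")]),
    SimpleNamespace(number=3, lines=None),  # Page with no extractable text
]
index = build_word_index(sample_pages)
print(index["report"])  # [1, 2]
```

A production search system would more likely hand the extracted text to a dedicated search engine, but the shape of the data is the same.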

Step 3: Discovering Document Attachments

Some documents are like Russian dolls - they contain other documents inside them. Here’s how to find them:

# Tutorial Code Example: Analyzing document attachments
from groupdocs_viewer_cloud import InfoOptions, FileInfo

# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/with-attachments.msg"  # Email with attachments

# Create info options
info_options = InfoOptions()
info_options.file_info = file_info

# Get document information
info_result = viewer_api.get_info(info_options)

# Check for attachments
if info_result.attachments and len(info_result.attachments) > 0:
    print(f"Document contains {len(info_result.attachments)} attachments:")
    for i, attachment in enumerate(info_result.attachments):
        print(f"  {i+1}. {attachment.name}")
    
    # Generate example code for handling attachments
    print("\nExample code for handling attachments:")
    print("# Process each attachment")
    print("for attachment in info_result.attachments:")
    print("    # Get the attachment name")
    print("    attachment_name = attachment.name")
    print("    print(f'Processing attachment: {attachment_name}')")
    print("    ")
    print("    # Create file info for the attachment")
    print("    attachment_file_info = FileInfo()")
    print("    attachment_file_info.file_path = f'documents/attachments/{attachment_name}'")
    print("    ")
    print("    # Now you can process the attachment separately")
    print("    # For example, render it to HTML")
    print("    attachment_view_options = ViewOptions()")
    print("    attachment_view_options.file_info = attachment_file_info")
    print("    attachment_view_options.view_format = 'HTML'")
    print("    attachment_result = viewer_api.view(attachment_view_options)")
    print("")
else:
    print("Document does not contain attachments")

Real-world application: Email processing systems often need to handle attachments separately. This approach lets you identify what’s inside before deciding how to process each piece.
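
One way to decide "how to process each piece" is to group attachment names by file extension and route each group to a suitable handler. The sketch below assumes only that each attachment exposes a `name`, as in the Step 3 example; `group_attachments_by_extension` is a hypothetical helper and the sample objects are stand-ins.

```python
import os
from collections import defaultdict
from types import SimpleNamespace


def group_attachments_by_extension(attachments):
    """Group attachment names by lowercase file extension."""
    groups = defaultdict(list)
    for attachment in attachments or []:
        extension = os.path.splitext(attachment.name)[1].lower()
        # Files without an extension are collected under a placeholder key
        groups[extension or "(none)"].append(attachment.name)
    return dict(groups)


# Stand-in attachments (not real API output)
sample_attachments = [
    SimpleNamespace(name="invoice.PDF"),
    SimpleNamespace(name="photo.jpg"),
    SimpleNamespace(name="terms.pdf"),
    SimpleNamespace(name="README"),
]
groups = group_attachments_by_extension(sample_attachments)
print(groups[".pdf"])  # ['invoice.PDF', 'terms.pdf']
```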

Step 4: Analyzing PDF-Specific Security Information

PDFs can have security restrictions that affect how you can process them. Let’s check for those:

# Tutorial Code Example: Analyzing PDF-specific information
from groupdocs_viewer_cloud import InfoOptions, FileInfo

# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/secured.pdf"

# Create info options
info_options = InfoOptions()
info_options.file_info = file_info

# Get document information
info_result = viewer_api.get_info(info_options)

# Check for PDF-specific information
if hasattr(info_result, 'pdf_view_info') and info_result.pdf_view_info:
    print(f"PDF document information:")
    print(f"  - Printing allowed: {info_result.pdf_view_info.printing_allowed}")
    
    # Make a decision based on printing permission
    if info_result.pdf_view_info.printing_allowed:
        print("  This document can be printed")
    else:
        print("  This document has printing restrictions")
    
    # Print security considerations only when a restriction applies
    if not info_result.pdf_view_info.printing_allowed:
        print("\nSecurity considerations for this PDF:")
        print("  - Disable print button in your viewer application")
        print("  - Add a watermark indicating printing is not allowed")
    
else:
    print("No PDF-specific information available or not a PDF document")

Why this matters for your app: Respecting document security settings isn’t just good practice - it’s often a legal requirement. Use this information to configure your viewer interface appropriately.
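
To make that concrete, the printing permission can be turned into a small settings dictionary that your viewer front end reads. This is only a sketch: the keys `show_print_button` and `show_no_print_notice` are hypothetical names for your own application, and the sample objects stand in for the `pdf_view_info` value used in Step 4.

```python
from types import SimpleNamespace


def viewer_controls(pdf_view_info):
    """Derive viewer UI flags from PDF permissions.

    Documents without PDF-specific info get the unrestricted defaults.
    """
    printing_allowed = True
    if pdf_view_info is not None:
        printing_allowed = bool(getattr(pdf_view_info, "printing_allowed", True))
    return {
        "show_print_button": printing_allowed,         # Hypothetical UI flag
        "show_no_print_notice": not printing_allowed,  # Hypothetical UI flag
    }


restricted = viewer_controls(SimpleNamespace(printing_allowed=False))
unrestricted = viewer_controls(None)  # For example, a non-PDF document
print(restricted["show_print_button"], unrestricted["show_print_button"])  # False True
```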

Step 5: Working with CAD Drawings (For Technical Documents)

CAD files are complex beasts with multiple layouts and layers. Here’s how to understand their structure:

# Tutorial Code Example: Analyzing CAD drawing information
from groupdocs_viewer_cloud import InfoOptions, FileInfo

# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/drawing.dwg"

# Create info options
info_options = InfoOptions()
info_options.file_info = file_info

# Get document information
info_result = viewer_api.get_info(info_options)

# Check for CAD-specific information
if hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print(f"CAD drawing information:")
    
    # Display layouts
    if hasattr(info_result.cad_view_info, 'layouts') and info_result.cad_view_info.layouts:
        print(f"  - Available layouts ({len(info_result.cad_view_info.layouts)}):")
        for layout in info_result.cad_view_info.layouts:
            print(f"    * {layout.name} - {layout.width}x{layout.height}")
    
    # Display layers
    if hasattr(info_result.cad_view_info, 'layers') and info_result.cad_view_info.layers:
        print(f"\n  - Available layers ({len(info_result.cad_view_info.layers)}):")
        for layer in info_result.cad_view_info.layers:
            visibility = "Visible" if layer.visible else "Hidden"
            print(f"    * {layer.name} - {visibility}")
    
    # Generate example code for rendering specific layout and layers
    print("\nExample code for rendering specific layout and layers:")
    print("from groupdocs_viewer_cloud import ViewOptions, HtmlOptions, CadOptions, FileInfo")
    print("")
    print("# Set up file info")
    print("file_info = FileInfo()")
    print("file_info.file_path = 'documents/drawing.dwg'")
    print("")
    print("# Create CAD options")
    print("cad_options = CadOptions()")
    if info_result.cad_view_info.layouts and len(info_result.cad_view_info.layouts) > 0:
        print(f"cad_options.layout_name = '{info_result.cad_view_info.layouts[0].name}'  # Render specific layout")
    if info_result.cad_view_info.layers and len(info_result.cad_view_info.layers) > 0:
        visible_layers = [layer.name for layer in info_result.cad_view_info.layers if layer.visible]
        if visible_layers:
            layers_str = "', '".join(visible_layers[:3])  # Take up to 3 layers for example
            print(f"cad_options.layers = ['{layers_str}']  # Render specific layers")
    print("")
    print("# Set up rendering options")
    print("html_options = HtmlOptions()")
    print("html_options.cad_options = cad_options")
    print("")
    print("# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'HTML'")
    print("view_options.render_options = html_options")
    print("")
    print("# Render the CAD drawing")
    print("result = viewer_api.view(view_options)")
    print("")
else:
    print("No CAD-specific information available or not a CAD drawing")

Professional tip: CAD files often contain sensitive information in hidden layers. Always check layer visibility before rendering to avoid exposing confidential data.

Step 6: Understanding Archive File Structure

Archive files (ZIP, RAR, etc.) have folder structures that you might want to navigate:

# Tutorial Code Example: Analyzing archive information
from groupdocs_viewer_cloud import InfoOptions, FileInfo

# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/archive.zip"

# Create info options
info_options = InfoOptions()
info_options.file_info = file_info

# Get document information
info_result = viewer_api.get_info(info_options)

# Check for archive-specific information
if hasattr(info_result, 'archive_view_info') and info_result.archive_view_info:
    print(f"Archive information:")
    
    # Display folders
    if hasattr(info_result.archive_view_info, 'folders') and info_result.archive_view_info.folders:
        print(f"  - Folders in archive ({len(info_result.archive_view_info.folders)}):")
        for folder in info_result.archive_view_info.folders:
            print(f"    * {folder}")
    
    # Generate example code for rendering a specific folder
    print("\nExample code for rendering a specific folder from the archive:")
    print("from groupdocs_viewer_cloud import ViewOptions, HtmlOptions, ArchiveOptions, FileInfo")
    print("")
    print("# Set up file info")
    print("file_info = FileInfo()")
    print("file_info.file_path = 'documents/archive.zip'")
    print("")
    print("# Create archive options")
    print("html_options = HtmlOptions()")
    print("html_options.archive_options = ArchiveOptions()")
    if info_result.archive_view_info.folders and len(info_result.archive_view_info.folders) > 0:
        print(f"html_options.archive_options.folder = '{info_result.archive_view_info.folders[0]}'  # Render specific folder")
    print("html_options.archive_options.items_per_page = 15  # Items per page when rendering")
    print("")
    print("# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'HTML'")
    print("view_options.render_options = html_options")
    print("")
    print("# Render the archive contents")
    print("result = viewer_api.view(view_options)")
    print("")
else:
    print("No archive-specific information available or not an archive file")

Use case spotlight: Building a file browser interface? This information helps you create navigation structures that match the actual archive organization.
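
For a file browser, the flat folder list can be folded into a nested tree. The sketch below assumes the folder entries are path-like strings separated by "/" (backslashes are normalized first); that format is an assumption, so check what your archives actually return. `build_folder_tree` is a hypothetical helper.

```python
def build_folder_tree(folders):
    """Fold a flat list of folder paths into a nested dict of dicts."""
    tree = {}
    for path in folders or []:
        node = tree
        # Assumption: path segments are separated by "/" (or "\" on some archives)
        for part in path.replace("\\", "/").strip("/").split("/"):
            if part:
                node = node.setdefault(part, {})
    return tree


# Sample folder list (illustrative, not real API output)
sample_folders = ["docs", "docs/reports", "docs/reports/2023", "images"]
tree = build_folder_tree(sample_folders)
print(tree)  # {'docs': {'reports': {'2023': {}}}, 'images': {}}
```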

Step 7: Building Your Own Document Analysis Tool

Now let’s put it all together and create a comprehensive document analyzer that makes smart decisions:

# Tutorial Code Example: Comprehensive document analysis tool
from groupdocs_viewer_cloud import InfoOptions, FileInfo

# Set up file info
file_info = FileInfo()
file_info.file_path = "documents/unknown-document.pdf"  # This could be any format

# Create info options with text extraction
info_options = InfoOptions()
info_options.file_info = file_info
info_options.extract_text = True  # Get text content

# Get document information
info_result = viewer_api.get_info(info_options)

# Create comprehensive analysis report
print("Document Analysis Report")
print("======================\n")

# Basic document information
print(f"Document Type: {info_result.format}")
print(f"Extension: {info_result.format_extension}")
print(f"Total Pages: {len(info_result.pages or [])}")

# Page dimensions analysis
landscape_pages = 0  # Default so later checks work even when no dimensions are reported
if info_result.pages:
    # Collect page dimensions
    widths = [page.width for page in info_result.pages if hasattr(page, 'width')]
    heights = [page.height for page in info_result.pages if hasattr(page, 'height')]
    
    if widths and heights:
        # Analyze page sizes
        min_width = min(widths)
        max_width = max(widths)
        min_height = min(heights)
        max_height = max(heights)
        avg_width = sum(widths) / len(widths)
        avg_height = sum(heights) / len(heights)
        
        # Check if all pages have same dimensions
        uniform_pages = min_width == max_width and min_height == max_height
        
        print("\nPage Dimensions Analysis:")
        print(f"  - Page size is {'uniform' if uniform_pages else 'variable'}")
        if uniform_pages:
            print(f"  - All pages are {widths[0]}x{heights[0]} pixels")
        else:
            print(f"  - Width range: {min_width}-{max_width} pixels (avg: {avg_width:.1f})")
            print(f"  - Height range: {min_height}-{max_height} pixels (avg: {avg_height:.1f})")
        
        # Determine optimal rendering settings based on page sizes
        print("\nRecommended Rendering Settings:")
        if max_width > 1000 or max_height > 1400:
            print("  - Use high-resolution rendering for detailed content")
        else:
            print("  - Standard resolution should be sufficient for these page sizes")
        
        # Page orientation analysis
        portrait_pages = sum(1 for w, h in zip(widths, heights) if h > w)
        landscape_pages = sum(1 for w, h in zip(widths, heights) if w > h)
        square_pages = sum(1 for w, h in zip(widths, heights) if w == h)
        
        print("\nPage Orientation Analysis:")
        print(f"  - Portrait pages: {portrait_pages}")
        print(f"  - Landscape pages: {landscape_pages}")
        print(f"  - Square pages: {square_pages}")
        
        if landscape_pages > 0:
            print("  - Note: Document contains landscape pages, consider appropriate viewer layout")

# Text content analysis
has_text = False
text_lines_count = 0
total_words = 0

for page in info_result.pages or []:
    if hasattr(page, 'lines') and page.lines:
        has_text = True
        text_lines_count += len(page.lines)
        # Treat a missing or empty words list as zero words
        page_words = sum(len(getattr(line, 'words', None) or []) for line in page.lines)
        total_words += page_words

if has_text:
    print("\nText Content Analysis:")
    print(f"  - Total text lines: {text_lines_count}")
    print(f"  - Approximate word count: {total_words}")
    print(f"  - Average words per page: {total_words / len(info_result.pages):.1f}")
    
    # Determine if document is text-heavy
    if total_words > 500:
        print("  - Document is text-heavy, consider enabling text search functionality")
else:
    print("\nText Content Analysis:")
    print("  - No text content found or extracted")
    print("  - Document may be image-based or scanned")
    print("  - Consider OCR processing if text search is required")

# Format-specific analysis
print("\nFormat-Specific Information:")

# PDF analysis
if hasattr(info_result, 'pdf_view_info') and info_result.pdf_view_info:
    print("PDF Document Properties:")
    print(f"  - Printing allowed: {info_result.pdf_view_info.printing_allowed}")
    if not info_result.pdf_view_info.printing_allowed:
        print("  - Security Note: Document has printing restrictions")

# CAD analysis
if hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print("CAD Drawing Properties:")
    if hasattr(info_result.cad_view_info, 'layouts') and info_result.cad_view_info.layouts:
        print(f"  - Contains {len(info_result.cad_view_info.layouts)} layouts")
    if hasattr(info_result.cad_view_info, 'layers') and info_result.cad_view_info.layers:
        visible_layers = sum(1 for layer in info_result.cad_view_info.layers if layer.visible)
        hidden_layers = len(info_result.cad_view_info.layers) - visible_layers
        print(f"  - Contains {len(info_result.cad_view_info.layers)} layers " +
              f"({visible_layers} visible, {hidden_layers} hidden)")

# Archive analysis
if hasattr(info_result, 'archive_view_info') and info_result.archive_view_info:
    print("Archive Properties:")
    if hasattr(info_result.archive_view_info, 'folders') and info_result.archive_view_info.folders:
        print(f"  - Contains {len(info_result.archive_view_info.folders)} folders")

# Project management analysis
if hasattr(info_result, 'project_management_view_info') and info_result.project_management_view_info:
    print("Project Management Properties:")
    if hasattr(info_result.project_management_view_info, 'start_date'):
        print(f"  - Project start date: {info_result.project_management_view_info.start_date}")
    if hasattr(info_result.project_management_view_info, 'end_date'):
        print(f"  - Project end date: {info_result.project_management_view_info.end_date}")

# Outlook data analysis
if hasattr(info_result, 'outlook_view_info') and info_result.outlook_view_info:
    print("Outlook Data Properties:")
    if hasattr(info_result.outlook_view_info, 'folders') and info_result.outlook_view_info.folders:
        print(f"  - Contains {len(info_result.outlook_view_info.folders)} folders")

# Final recommendations
print("\nRendering Recommendations:")
print("  - Recommended format: ", end="")
if has_text:
    print("HTML (for text searchability)")
elif hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print("PNG (for crisp lines and details)")
else:
    print("PDF (for compatibility and document fidelity)")

# Generate rendering code based on analysis
print("\nSuggested Rendering Code:")
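print("from groupdocs_viewer_cloud import ViewOptions, FileInfo")
print("")
print("# Set up file info")
print("file_info = FileInfo()")
print(f"file_info.file_path = '{file_info.file_path}'")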

if has_text:
    print("from groupdocs_viewer_cloud import HtmlOptions")
    print("\n# Create HTML options for text searchability")
    print("html_options = HtmlOptions()")
    print("html_options.is_responsive = True")
    
    if landscape_pages > 0:
        print("# Enable proper layout for landscape pages")
        
    if hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
        print("# Configure CAD options")
        print("from groupdocs_viewer_cloud import CadOptions")
        print("html_options.cad_options = CadOptions()")
        if info_result.cad_view_info.layouts and len(info_result.cad_view_info.layouts) > 0:
            print(f"html_options.cad_options.layout_name = '{info_result.cad_view_info.layouts[0].name}'")
        
    print("\n# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'HTML'")
    print("view_options.render_options = html_options")
    
elif hasattr(info_result, 'cad_view_info') and info_result.cad_view_info:
    print("from groupdocs_viewer_cloud import ImageOptions, CadOptions")
    print("\n# Create PNG options for CAD drawing")
    print("image_options = ImageOptions()")
    print("image_options.cad_options = CadOptions()")
    if info_result.cad_view_info.layouts and len(info_result.cad_view_info.layouts) > 0:
        print(f"image_options.cad_options.layout_name = '{info_result.cad_view_info.layouts[0].name}'")
    print("image_options.width = 1200  # High resolution for details")
    
    print("\n# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'PNG'")
    print("view_options.render_options = image_options")
    
else:
    print("from groupdocs_viewer_cloud import PdfOptions")
    print("\n# Create PDF options for general compatibility")
    print("pdf_options = PdfOptions()")
    
    print("\n# Set up view options")
    print("view_options = ViewOptions()")
    print("view_options.file_info = file_info")
    print("view_options.view_format = 'PDF'")
    print("view_options.render_options = pdf_options")

print("\n# Render the document")
print("result = viewer_api.view(view_options)")
print("")

This comprehensive analyzer does the heavy lifting for you - it examines your document and recommends the best rendering approach based on the actual content characteristics.

Hands-On Practice: Try These Challenges

Ready to test your newfound skills? Here are some practical exercises:

Beginner Challenge: Create a script that categorizes documents as “text-heavy,” “image-based,” or “mixed” based on their text content analysis.

Intermediate Challenge: Build a CAD drawing inspector that generates a report showing all available layouts and layers, helping users choose what to render.

Advanced Challenge: Develop a batch document processor that analyzes multiple files and creates optimal rendering configurations for each based on their unique characteristics.

Common Pitfalls and How to Avoid Them

Text Extraction Troubles: If you’re not getting text content, double-check that extract_text is set to True in your InfoOptions. Also remember that image-based documents (like scanned PDFs) won’t have extractable text without OCR.

Format-Specific Info Missing: Always verify the document format before trying to access format-specific properties. A Word document won’t have CAD layers, and a PDF won’t have project timeline information.

Performance Considerations: Text extraction adds processing time, especially for large documents. Only request it when you actually need the text content for your application.

Page Dimension Edge Cases: Some document formats might not report accurate dimensions until after rendering. Use the InfoResult data as guidance, but be prepared to handle edge cases.

Best Practices for Production Applications

Smart Caching Strategy: Document information doesn’t change unless the file changes. Cache InfoResult data to avoid repeated API calls for the same documents.
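
A minimal in-memory version of this idea keys the cache on the file path plus a version marker, so a changed file is fetched again. Everything here is illustrative: `InfoCache` is a hypothetical class, `version` is whatever change marker your storage gives you (an ETag, a checksum, a last-modified time), and `fake_fetch` stands in for the real API call.

```python
class InfoCache:
    """Cache document info keyed by (file path, version marker)."""

    def __init__(self, fetch):
        self._fetch = fetch   # Callable that performs the real lookup
        self._entries = {}

    def get(self, file_path, version):
        key = (file_path, version)
        if key not in self._entries:
            self._entries[key] = self._fetch(file_path)
        return self._entries[key]


# Stand-in for the API call; records how often it is really invoked
calls = []

def fake_fetch(file_path):
    calls.append(file_path)
    return {"path": file_path, "pages": 3}


cache = InfoCache(fake_fetch)
cache.get("documents/sample.pdf", version="v1")
cache.get("documents/sample.pdf", version="v1")  # Served from the cache
cache.get("documents/sample.pdf", version="v2")  # File changed, fetched again
print(len(calls))  # 2
```

This sketch never evicts entries; a long-running service would want a size limit or expiry on top.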

Error Handling: Always wrap your API calls in try/except blocks. Network issues, invalid files, or API limits can cause failures that you’ll want to handle gracefully.
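
One common way to handle transient failures is to retry with exponential backoff. The sketch below is generic Python rather than anything specific to the SDK: `call_with_retries` is a hypothetical helper, and it catches the broad `Exception` only for illustration. In real code you would narrow that to the exception types your client actually raises.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying on failure with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # Narrow this to your client's exception types
            last_error = error
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error


# Stand-in operation that fails twice before succeeding
state = {"calls": 0}

def flaky_operation():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "info-result"


# The no-op sleep keeps the demo instant; omit it to get real delays
result = call_with_retries(flaky_operation, sleep=lambda seconds: None)
print(result, state["calls"])  # info-result 3
```

In the tutorial's terms, `operation` could be `lambda: viewer_api.get_info(info_options)`.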

Batch Processing: If you’re analyzing many documents, consider implementing batch processing with rate limiting to stay within API quotas.
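
A very simple form of rate limiting enforces a minimum gap between consecutive calls. The class below is a hypothetical sketch, not part of the SDK, and the right interval depends on your plan's quota. The clock and sleep functions are injectable so the demo runs instantly with a fake clock.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # No call has been made yet

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval


# Demo with a fake clock so the example is instant and deterministic
fake_now = [0.0]
slept = []

def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


limiter = MinIntervalLimiter(0.5, clock=lambda: fake_now[0], sleep=fake_sleep)
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    limiter.wait()  # In real code, make the API call right after this returns
print(slept)  # [0.5, 0.5]
```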

Security Awareness: Respect document security settings revealed by InfoResult. If a PDF restricts printing, honor that in your application interface.

Performance Optimization Tips

Selective Information Requests: Only extract text when you need it - it’s the most resource-intensive operation.

Asynchronous Processing: For batch document analysis, use asynchronous processing to handle multiple files concurrently.
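
Because each analysis is a network call, a thread pool is often enough to overlap the waiting. The sketch below uses the standard library's `concurrent.futures`. `analyze_many` is a hypothetical helper, `fake_analyze` stands in for a function that calls the API, and sharing one API client across threads is an assumption you should verify for your SDK version.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_many(paths, analyze, max_workers=4):
    """Run `analyze(path)` for each path concurrently and collect outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(analyze, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = ("ok", future.result())
            except Exception as error:  # Keep one failure from aborting the batch
                results[path] = ("error", str(error))
    return results


# Stand-in for a function that would call the API for one file
def fake_analyze(path):
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return {"pages": len(path)}


results = analyze_many(["a.pdf", "b.docx", "c.bad"], fake_analyze)
print(results["c.bad"])  # ('error', 'unsupported file')
```

Keep `max_workers` modest so concurrent requests stay within your API quota.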

Smart Pre-filtering: Use basic file extension checks before calling the API to filter out unsupported formats early.
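
A pre-filter like this can be a few lines of standard-library Python. The extension set below is illustrative only: it lists the formats used in this tutorial, not the full set the API supports, so replace it with the list from the official documentation. `split_by_support` is a hypothetical helper.

```python
import os

# Illustrative only: the formats used in this tutorial, not the full supported list
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".msg", ".dwg", ".zip"}


def split_by_support(paths, supported=SUPPORTED_EXTENSIONS):
    """Split paths into (accepted, rejected) based on file extension."""
    accepted, rejected = [], []
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        (accepted if extension in supported else rejected).append(path)
    return accepted, rejected


accepted, rejected = split_by_support(
    ["documents/sample.PDF", "documents/notes.xyz", "documents/drawing.dwg"]
)
print(accepted)  # ['documents/sample.PDF', 'documents/drawing.dwg']
print(rejected)  # ['documents/notes.xyz']
```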

Advanced Integration Patterns

Event-Driven Processing: Set up webhook notifications when documents are uploaded, then use InfoResult to determine the optimal processing pipeline automatically.

Machine Learning Enhancement: Use InfoResult data as features for ML models that predict user preferences or document importance.

Multi-Format Workflows: Create adaptive workflows that handle different document types differently based on their InfoResult characteristics.

Performance Monitoring and Optimization

Key Metrics to Track:

Average API response time for InfoResult calls
Cache hit rates for document information
Text extraction success rates by document type
Format-specific feature detection accuracy

Optimization Strategies:

Implement tiered caching (memory → Redis → database)
Use connection pooling for API clients
Batch similar document types for processing efficiency
Monitor and adjust text extraction timeouts based on document size

Security Best Practices

Data Privacy: InfoResult may contain sensitive metadata. Ensure proper access controls and data handling procedures.

Permission Respect: Always honor document security settings revealed by the analysis. Implement UI controls that reflect document restrictions.

Audit Trails: Log document analysis activities for compliance and debugging purposes.

API Key Security: Rotate your GroupDocs Cloud credentials regularly and never expose them in client-side code.

Essential Resources for Your Toolkit

Product Page - Latest features and pricing information
API Documentation - Comprehensive technical reference
Interactive API Reference - Test API calls directly in your browser
Community Support Forum - Get help from experts and fellow developers
Free Trial Access - Start building immediately with generous free limits