Tutorial: Converting to Text Format

In this tutorial, you’ll learn how to convert various document types to plain text format using GroupDocs.Conversion Cloud API. You’ll master text extraction with options for controlling output formatting, encoding, and content selection.

Learning Objectives

By the end of this tutorial, you will be able to:

Convert documents to plain text format (.txt)
Control text encoding and formatting during conversion
Extract text from specific pages or sections of documents
Implement both storage-based and stream-based text conversions
Handle common text extraction challenges

Prerequisites

Before starting this tutorial, you need:

A GroupDocs.Conversion Cloud account
Your Client ID and Client Secret credentials
Basic understanding of REST API concepts
Development environment with your preferred programming language set up
Sample documents to test conversion (we’ll use various formats including PDF, Word, and emails)

Implementation Steps

Step 1: Authentication with GroupDocs.Conversion Cloud API

Before performing any operations, we need to authenticate with the API using your Client ID and Client Secret.

Try it yourself

First, let’s obtain a JWT access token using cURL:

# First get JSON Web Token
curl -v "https://api.groupdocs.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"

Make sure to replace YOUR_CLIENT_ID and YOUR_CLIENT_SECRET with your actual credentials.

Step 2: Basic Document to Text Conversion

Let’s start with a simple conversion from a Word document to plain text format:

Try it yourself

Using cURL:

curl -X POST "https://api.groupdocs.cloud/v2.0/conversion" \
-H "accept: application/json" \
-H "authorization: Bearer YOUR_JWT_TOKEN" \
-H "Content-Type: application/json" \
-d "{  
      'FilePath': 'documents/sample.docx',  
      'Format': 'txt',  
      'OutputPath': 'converted'
    }"

Replace YOUR_JWT_TOKEN with the actual token received in Step 1.

Step 3: Converting to Text with Encoding Options

Character encoding is crucial when working with text files. Here’s how to specify encoding:

// C# SDK Example
using System;
using System.Collections.Generic;
using GroupDocs.Conversion.Cloud.Sdk.Api;
using GroupDocs.Conversion.Cloud.Sdk.Client;
using GroupDocs.Conversion.Cloud.Sdk.Model;
using GroupDocs.Conversion.Cloud.Sdk.Model.Requests;

namespace TextConversionTutorial
{
    class Program
    {
        static void Main(string[] args)
        {
            // Configure API client
            var configuration = new Configuration("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET");
            var apiInstance = new ConvertApi(configuration);

            try
            {
                // Set up conversion to text with encoding options
                var settings = new ConvertSettings
                {
                    StorageName = "MyStorage",
                    FilePath = "documents/sample.docx",
                    Format = "txt",
                    // For password-protected documents
                    LoadOptions = new DocxLoadOptions() 
                    { 
                        Password = "" 
                    },
                    // Text-specific convert options
                    ConvertOptions = new TxtConvertOptions()
                    {
                        // Page selection
                        FromPage = 1,
                        PagesCount = 0,  // All pages
                        
                        // Encoding options (UTF-8)
                        Encoding = "utf-8"
                    },
                    OutputPath = "converted"
                };

                // Execute conversion
                List<StoredConvertedResult> response = apiInstance.ConvertDocument(
                    new ConvertDocumentRequest(settings));
                
                Console.WriteLine("Document converted successfully to text: " + response[0].Url);
            }
            catch (Exception e)
            {
                Console.WriteLine("Error: " + e.Message);
            }
        }
    }
}

Step 4: Converting PDF to Text with Advanced Options

PDF documents often require special handling for effective text extraction:

// Java SDK Example
import com.groupdocs.cloud.conversion.api.*;
import com.groupdocs.cloud.conversion.client.ApiException;
import com.groupdocs.cloud.conversion.model.*;
import com.groupdocs.cloud.conversion.model.requests.*;
import java.util.List;

public class PdfToTextExample {
    public static void main(String[] args) {
        // Configure API client
        String clientId = "YOUR_CLIENT_ID";
        String clientSecret = "YOUR_CLIENT_SECRET";
        Configuration configuration = new Configuration(clientId, clientSecret);
        ConvertApi apiInstance = new ConvertApi(configuration);
        
        try {
            // Prepare convert settings
            ConvertSettings settings = new ConvertSettings();
            settings.setFilePath("documents/document.pdf");
            settings.setFormat("txt");
            
            // Set PDF-specific load options
            PdfLoadOptions loadOptions = new PdfLoadOptions();
            loadOptions.setPassword("");  // If PDF is protected
            loadOptions.setRemoveEmbeddedFiles(true);  // Ignore embedded files
            loadOptions.setHidePdfAnnotations(true);   // Ignore annotations
            
            settings.setLoadOptions(loadOptions);
            
            // Configure TXT-specific convert options
            TxtConvertOptions convertOptions = new TxtConvertOptions();
            convertOptions.setFromPage(1);
            convertOptions.setPagesCount(0);  // All pages
            
            // Set UTF-8 encoding for universal compatibility
            convertOptions.setEncoding("utf-8");
            
            settings.setConvertOptions(convertOptions);
            settings.setOutputPath("converted");
            
            // Execute conversion
            List<StoredConvertedResult> result = apiInstance.convertDocument(
                new ConvertDocumentRequest(settings));
            
            System.out.println("PDF converted successfully to text: " + result.get(0).getUrl());
        } catch (ApiException e) {
            System.err.println("Exception when calling ConvertApi: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Step 5: Converting Email Messages to Text

Email messages often contain rich formatting that needs to be properly extracted:

# Python SDK Example
import groupdocs_conversion_cloud
from groupdocs_conversion_cloud.models.requests import ConvertDocumentRequest

# Configure API client
client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
api_instance = groupdocs_conversion_cloud.ConvertApi.from_keys(client_id, client_secret)

try:
    # Prepare conversion settings
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = "documents/email_message.msg"
    settings.format = "txt"
    
    # Configure email-specific load options
    load_options = groupdocs_conversion_cloud.EmailLoadOptions()
    load_options.display_header = True          # Include email header
    load_options.display_from_email_address = True
    load_options.display_to_email_address = True
    load_options.display_cc_email_address = True
    load_options.display_bcc_email_address = False
    load_options.preserve_embedded_message_format = True
    
    settings.load_options = load_options
    
    # Configure text-specific convert options
    convert_options = groupdocs_conversion_cloud.TxtConvertOptions()
    convert_options.from_page = 1
    convert_options.pages_count = 0  # All content
    convert_options.encoding = "utf-8"
    
    settings.convert_options = convert_options
    settings.output_path = "converted"
    
    # Execute conversion
    request = ConvertDocumentRequest(settings)
    result = api_instance.convert_document(request)
    
    print(f"Email converted successfully to text: {result[0].url}")
except groupdocs_conversion_cloud.ApiException as e:
    print(f"Exception when calling ConvertApi: {e}")

Step 6: Stream-Based Text Conversion

For applications that need to process the extracted text directly:

// Node.js SDK Example
const { ConvertApi, Configuration } = require("groupdocs-conversion-cloud");
const fs = require("fs");

// Configure API client
const clientId = "YOUR_CLIENT_ID";
const clientSecret = "YOUR_CLIENT_SECRET";
const config = new Configuration(clientId, clientSecret);
const apiInstance = new ConvertApi(config);

// Prepare conversion settings
const settings = {
    filePath: "documents/sample.docx",
    format: "txt",
    loadOptions: {
        // DOCX-specific load options if needed
    },
    convertOptions: {
        // TXT-specific convert options
        fromPage: 1,
        pagesCount: 0,  // All pages
        encoding: "utf-8"
    },
    // Set outputPath to null for stream output
    outputPath: null
};

// Execute conversion
apiInstance.convertDocumentDownload({ convertSettings: settings })
    .then((result) => {
        // Process the text directly
        let textContent = '';
        result.on('data', (chunk) => {
            textContent += chunk.toString('utf8');
        });
        
        result.on('end', () => {
            console.log("Document converted successfully to text");
            
            // Display first 100 characters of the extracted text
            const previewText = textContent.substring(0, 100) + 
                (textContent.length > 100 ? '...' : '');
            console.log("Text preview: " + previewText);
            
            // Save to file if needed
            fs.writeFile("extracted-text.txt", textContent, (err) => {
                if (err) {
                    console.error("Error saving text file:", err);
                } else {
                    console.log("Text saved to extracted-text.txt");
                }
            });
        });
    })
    .catch((error) => {
        console.log(`Error: ${error.message}`);
    });

Text-Specific Conversion Options

When converting to text format, you can leverage these specialized options:

Option	Description	Default	Impact
FromPage	First page number to convert	1	Controls starting point
PagesCount	Number of pages to convert	All	Limits content amount
Encoding	Character encoding for the text file	“utf-8”	Affects character support and compatibility

Troubleshooting Common Issues

1. Character Encoding Problems

If text appears garbled or contains unexpected characters:

Ensure you’re using the correct encoding for your content (UTF-8 is recommended for most cases)
For documents with special characters, consider using UTF-16 encoding
For legacy systems, you might need to use specific encodings like Windows-1252 or ISO-8859-1

2. Content Formatting Loss

Text conversion naturally loses formatting:

Remember that TXT format doesn’t support formatting like bold, italics, or colors
For tabular data, tables will be converted to plain text with spacing that might not align perfectly
Consider HTML or PDF format if formatting preservation is critical

3. Special Content Handling

Some document elements require special handling:

Headers and footers may be included in unexpected locations
Page numbers might be interspersed with regular content
For complex documents with tables, charts, or images, the text representation might be confusing

4. Performance Considerations

For large documents:

Use the page selection options to convert only needed sections
Consider batching very large documents into smaller chunks
Monitor memory usage when processing large streams

What You’ve Learned

In this tutorial, you’ve learned:

How to convert various document formats to plain text
Controlling text encoding for proper character representation
Extracting text from specific pages or sections
Implementing both storage-based and stream-based text conversions
Troubleshooting common text conversion challenges

Further Practice

To reinforce your learning, try these exercises:

Create a text extractor that processes multiple document types and normalizes the output format
Implement a command-line utility that extracts text and performs basic processing (e.g., word count, search)
Build a simple indexing service that extracts text for search functionality
Create a comparison tool that shows differences between text extracted from different document versions

Next Tutorial

Ready to explore more specialized conversion options? Continue with our Tutorial: Converting CAD Documents with Load Options to learn techniques for handling specialized CAD file formats.

Additional Resources

Have questions about this tutorial? Feel free to reach out on our forum for support.