Tutorial: Converting to Text Format
In this tutorial, you’ll learn how to convert various document types to plain text using the GroupDocs.Conversion Cloud API, and how to control the output encoding and the range of pages that is extracted.
Learning Objectives
By the end of this tutorial, you will be able to:
- Convert documents to plain text format (.txt)
- Control text encoding and formatting during conversion
- Extract text from specific pages or sections of documents
- Implement both storage-based and stream-based text conversions
- Handle common text extraction challenges
Prerequisites
Before starting this tutorial, you need:
- A GroupDocs.Conversion Cloud account
- Your Client ID and Client Secret credentials
- Basic understanding of REST API concepts
- Development environment with your preferred programming language set up
- Sample documents to test conversion (we’ll use various formats including PDF, Word, and emails)
Implementation Steps
Step 1: Authentication with GroupDocs.Conversion Cloud API
Before performing any operations, we need to authenticate with the API using your Client ID and Client Secret.
Try it yourself
First, let’s obtain a JWT access token using cURL:
# First get JSON Web Token
curl -v "https://api.groupdocs.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
Make sure to replace YOUR_CLIENT_ID and YOUR_CLIENT_SECRET with your actual credentials.
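If you would rather request the token from code than from the shell, the same form-encoded POST can be sketched with Python's standard library. The helper names `build_token_request` and `fetch_token` are hypothetical, and the `access_token` response field is an assumption based on the usual OAuth 2.0 client-credentials response shape, so check it against the response you actually receive.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the cURL call in this step.
TOKEN_URL = "https://api.groupdocs.cloud/connect/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the form-encoded POST that requests a token."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=form,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
    )


def fetch_token(client_id: str, client_secret: str) -> str:
    """Send the request and return the token string.

    Assumption: the JSON response carries the token in an "access_token"
    field, as is conventional for client-credentials responses.
    """
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Keeping request construction separate from sending makes it easy to inspect the body and headers before any network call is made.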
Step 2: Basic Document to Text Conversion
Let’s start with a simple conversion from a Word document to plain text format:
Try it yourself
Using cURL:
curl -X POST "https://api.groupdocs.cloud/v2.0/conversion" \
-H "accept: application/json" \
-H "authorization: Bearer YOUR_JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "FilePath": "documents/sample.docx",
  "Format": "txt",
  "OutputPath": "converted"
}'
Replace YOUR_JWT_TOKEN with the actual token received in Step 1.
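JSON requires double-quoted keys and string values, which is easy to get wrong when quoting a request body by hand in a shell. As a small sketch, you could let a JSON library produce the body instead. The helper name `build_convert_payload` is hypothetical; the three field names are the ones used in the request in this step.

```python
import json


def build_convert_payload(file_path: str, fmt: str, output_path: str) -> str:
    """Serialize the conversion settings from this step as a JSON request body."""
    settings = {
        "FilePath": file_path,
        "Format": fmt,
        "OutputPath": output_path,
    }
    return json.dumps(settings)


body = build_convert_payload("documents/sample.docx", "txt", "converted")
print(body)
```

The resulting string can be passed as the request body to whichever HTTP client you use.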
Step 3: Converting to Text with Encoding Options
Character encoding is crucial when working with text files. Here’s how to specify encoding:
// C# SDK Example
using System;
using System.Collections.Generic;
using GroupDocs.Conversion.Cloud.Sdk.Api;
using GroupDocs.Conversion.Cloud.Sdk.Client;
using GroupDocs.Conversion.Cloud.Sdk.Model;
using GroupDocs.Conversion.Cloud.Sdk.Model.Requests;

namespace TextConversionTutorial
{
    class Program
    {
        static void Main(string[] args)
        {
            // Configure API client
            var configuration = new Configuration("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET");
            var apiInstance = new ConvertApi(configuration);
            try
            {
                // Set up conversion to text with encoding options
                var settings = new ConvertSettings
                {
                    StorageName = "MyStorage",
                    FilePath = "documents/sample.docx",
                    Format = "txt",
                    // For password-protected documents
                    LoadOptions = new DocxLoadOptions()
                    {
                        Password = ""
                    },
                    // Text-specific convert options
                    ConvertOptions = new TxtConvertOptions()
                    {
                        // Page selection
                        FromPage = 1,
                        PagesCount = 0, // All pages
                        // Encoding options (UTF-8)
                        Encoding = "utf-8"
                    },
                    OutputPath = "converted"
                };

                // Execute conversion
                List<StoredConvertedResult> response = apiInstance.ConvertDocument(
                    new ConvertDocumentRequest(settings));
                Console.WriteLine("Document converted successfully to text: " + response[0].Url);
            }
            catch (Exception e)
            {
                Console.WriteLine("Error: " + e.Message);
            }
        }
    }
}
Step 4: Converting PDF to Text with Advanced Options
PDF documents often require special handling for effective text extraction:
// Java SDK Example
import com.groupdocs.cloud.conversion.api.*;
import com.groupdocs.cloud.conversion.client.ApiException;
import com.groupdocs.cloud.conversion.client.Configuration; // needed for the Configuration class used below
import com.groupdocs.cloud.conversion.model.*;
import com.groupdocs.cloud.conversion.model.requests.*;
import java.util.List;

public class PdfToTextExample {
    public static void main(String[] args) {
        // Configure API client
        String clientId = "YOUR_CLIENT_ID";
        String clientSecret = "YOUR_CLIENT_SECRET";
        Configuration configuration = new Configuration(clientId, clientSecret);
        ConvertApi apiInstance = new ConvertApi(configuration);
        try {
            // Prepare convert settings
            ConvertSettings settings = new ConvertSettings();
            settings.setFilePath("documents/document.pdf");
            settings.setFormat("txt");

            // Set PDF-specific load options
            PdfLoadOptions loadOptions = new PdfLoadOptions();
            loadOptions.setPassword(""); // If PDF is protected
            loadOptions.setRemoveEmbeddedFiles(true); // Ignore embedded files
            loadOptions.setHidePdfAnnotations(true); // Ignore annotations
            settings.setLoadOptions(loadOptions);

            // Configure TXT-specific convert options
            TxtConvertOptions convertOptions = new TxtConvertOptions();
            convertOptions.setFromPage(1);
            convertOptions.setPagesCount(0); // All pages
            // Set UTF-8 encoding for universal compatibility
            convertOptions.setEncoding("utf-8");
            settings.setConvertOptions(convertOptions);
            settings.setOutputPath("converted");

            // Execute conversion
            List<StoredConvertedResult> result = apiInstance.convertDocument(
                new ConvertDocumentRequest(settings));
            System.out.println("PDF converted successfully to text: " + result.get(0).getUrl());
        } catch (ApiException e) {
            System.err.println("Exception when calling ConvertApi: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Step 5: Converting Email Messages to Text
Email messages often contain rich formatting that needs to be properly extracted:
# Python SDK Example
import groupdocs_conversion_cloud
from groupdocs_conversion_cloud.models.requests import ConvertDocumentRequest

# Configure API client
client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
api_instance = groupdocs_conversion_cloud.ConvertApi.from_keys(client_id, client_secret)

try:
    # Prepare conversion settings
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = "documents/email_message.msg"
    settings.format = "txt"

    # Configure email-specific load options
    load_options = groupdocs_conversion_cloud.EmailLoadOptions()
    load_options.display_header = True  # Include email header
    load_options.display_from_email_address = True
    load_options.display_to_email_address = True
    load_options.display_cc_email_address = True
    load_options.display_bcc_email_address = False
    load_options.preserve_embedded_message_format = True
    settings.load_options = load_options

    # Configure text-specific convert options
    convert_options = groupdocs_conversion_cloud.TxtConvertOptions()
    convert_options.from_page = 1
    convert_options.pages_count = 0  # All content
    convert_options.encoding = "utf-8"
    settings.convert_options = convert_options
    settings.output_path = "converted"

    # Execute conversion
    request = ConvertDocumentRequest(settings)
    result = api_instance.convert_document(request)
    print(f"Email converted successfully to text: {result[0].url}")
except groupdocs_conversion_cloud.ApiException as e:
    print(f"Exception when calling ConvertApi: {e}")
Step 6: Stream-Based Text Conversion
For applications that need to process the extracted text directly:
// Node.js SDK Example
const { ConvertApi, Configuration } = require("groupdocs-conversion-cloud");
const fs = require("fs");

// Configure API client
const clientId = "YOUR_CLIENT_ID";
const clientSecret = "YOUR_CLIENT_SECRET";
const config = new Configuration(clientId, clientSecret);
const apiInstance = new ConvertApi(config);

// Prepare conversion settings
const settings = {
    filePath: "documents/sample.docx",
    format: "txt",
    loadOptions: {
        // DOCX-specific load options if needed
    },
    convertOptions: {
        // TXT-specific convert options
        fromPage: 1,
        pagesCount: 0, // All pages
        encoding: "utf-8"
    },
    // Set outputPath to null for stream output
    outputPath: null
};

// Execute conversion
apiInstance.convertDocumentDownload({ convertSettings: settings })
    .then((result) => {
        // Collect the raw chunks and decode once at the end, so that a
        // multi-byte UTF-8 character split across two chunks is not corrupted
        const chunks = [];
        result.on('data', (chunk) => {
            chunks.push(Buffer.from(chunk));
        });
        result.on('end', () => {
            const textContent = Buffer.concat(chunks).toString('utf8');
            console.log("Document converted successfully to text");
            // Display first 100 characters of the extracted text
            const previewText = textContent.substring(0, 100) +
                (textContent.length > 100 ? '...' : '');
            console.log("Text preview: " + previewText);
            // Save to file if needed
            fs.writeFile("extracted-text.txt", textContent, (err) => {
                if (err) {
                    console.error("Error saving text file:", err);
                } else {
                    console.log("Text saved to extracted-text.txt");
                }
            });
        });
    })
    .catch((error) => {
        console.error(`Error: ${error.message}`);
    });
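When converted text arrives as a stream of byte chunks, a multi-byte UTF-8 character can straddle two chunks, and decoding each chunk on its own would then fail or produce replacement characters. If you need to process text as it arrives rather than after the whole stream has been read, an incremental decoder handles this. The sketch below uses Python's standard `codecs` module; `decode_chunks` is a hypothetical helper name, and the chunk source is simulated with a list.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded text for each byte chunk without splitting characters."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; this raises if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is two bytes in UTF-8 (0xC3 0xA9); split it across two chunks on purpose.
data = "café".encode("utf-8")
pieces = [data[:4], data[4:]]
print("".join(decode_chunks(pieces)))
```

The decoder buffers the incomplete byte from the first chunk and emits the character only once the second chunk completes it.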
Text-Specific Conversion Options
When converting to text format, you can leverage these specialized options:
| Option | Description | Default | Impact |
|---|---|---|---|
| FromPage | First page number to convert | 1 | Controls starting point |
| PagesCount | Number of pages to convert | All | Limits content amount |
| Encoding | Character encoding for the text file | “utf-8” | Affects character support and compatibility |
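To reason about how FromPage and PagesCount combine, it can help to model the selection locally. The sketch below is only a mental model, not the API's implementation: `selected_pages` is a hypothetical helper, and it assumes that a pages count of 0 means "through the last page", as the "All pages" comments in this tutorial's examples suggest.

```python
def selected_pages(total_pages: int, from_page: int = 1, pages_count: int = 0) -> range:
    """Return the 1-based page numbers a FromPage/PagesCount pair would cover.

    Assumption: pages_count == 0 selects everything from from_page onward.
    """
    if from_page < 1 or from_page > total_pages:
        return range(0)  # nothing to convert
    if pages_count == 0:
        last = total_pages
    else:
        last = min(total_pages, from_page + pages_count - 1)
    return range(from_page, last + 1)


print(list(selected_pages(10, from_page=3, pages_count=4)))
```

A request that runs past the end of the document is clamped to the last page in this model; verify how the service itself treats that case before relying on it.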
Troubleshooting Common Issues
1. Character Encoding Problems
If text appears garbled or contains unexpected characters:
- Ensure you’re using the correct encoding for your content (UTF-8 is recommended for most cases)
- UTF-8 can represent every Unicode character, so switch to UTF-16 only when the application that consumes the text expects it
- For legacy systems, you might need to use specific encodings like Windows-1252 or ISO-8859-1
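When you receive a text file and are not sure which encoding it was written in, a common heuristic is to check for a byte order mark first and then try strict decoders in order. The following is a minimal sketch of that idea using only the standard library; `decode_text` is a hypothetical helper, and the fallback order is an assumption you may want to adapt to your own data.

```python
import codecs
from typing import Tuple


def decode_text(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_name) using a BOM check, then strict fallbacks."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig"), "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16"), "utf-16"
    for name in ("utf-8", "windows-1252"):
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("iso-8859-1"), "iso-8859-1"


text, used = decode_text(b"caf\xe9")
print(used, text)
```

UTF-16 is only attempted when a byte order mark is present, because almost any even-length byte sequence decodes as UTF-16 without error and would yield garbage silently.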
2. Content Formatting Loss
Text conversion naturally loses formatting:
- Remember that TXT format doesn’t support formatting like bold, italics, or colors
- Tables are converted to plain text, and the column spacing might not align perfectly
- Consider HTML or PDF format if formatting preservation is critical
3. Special Content Handling
Some document elements require special handling:
- Headers and footers may be included in unexpected locations
- Page numbers might be interspersed with regular content
- For complex documents with tables, charts, or images, the text representation might be confusing
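If stray page numbers in the extracted text get in your way, a post-processing pass can remove lines that contain nothing but a page marker. This is a heuristic sketch, not part of the API: `strip_page_number_lines` is a hypothetical helper, and the pattern will also drop any legitimate line that consists solely of a short number, so review the results on your own documents.

```python
import re

# A line that holds only a page marker such as "12", "Page 12", or "- 12 -".
_PAGE_LINE = re.compile(r"^\s*(?:page\s+)?-?\s*\d{1,4}\s*-?\s*$", re.IGNORECASE)


def strip_page_number_lines(text: str) -> str:
    """Drop lines that look like standalone page numbers."""
    kept = [line for line in text.splitlines() if not _PAGE_LINE.match(line)]
    return "\n".join(kept)


sample = "Intro paragraph.\n1\nSecond paragraph has 2 items.\nPage 2\n- 3 -\nEnd."
print(strip_page_number_lines(sample))
```

Lines that merely contain a number alongside other words are left untouched.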
4. Performance Considerations
For large documents:
- Use the page selection options to convert only needed sections
- Consider batching very large documents into smaller chunks
- Monitor memory usage when processing large streams
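One way to batch a large document is to compute page ranges up front and issue one conversion per range, using each pair as the FromPage and PagesCount values. The sketch below only computes the ranges; `page_batches` is a hypothetical helper, and it assumes you already know the document's total page count.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (from_page, pages_count) pairs that cover pages 1..total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        yield start, min(batch_size, total_pages - start + 1)


for from_page, pages_count in page_batches(10, 4):
    print(from_page, pages_count)
```

The final batch is shortened so that it never asks for pages beyond the end of the document.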
What You’ve Learned
In this tutorial, you’ve learned:
- How to convert various document formats to plain text
- How to control text encoding for proper character representation
- How to extract text from specific pages or sections
- How to implement both storage-based and stream-based text conversions
- How to troubleshoot common text conversion challenges
Further Practice
To reinforce your learning, try these exercises:
- Create a text extractor that processes multiple document types and normalizes the output format
- Implement a command-line utility that extracts text and performs basic processing (e.g., word count, search)
- Build a simple indexing service that extracts text for search functionality
- Create a comparison tool that shows differences between text extracted from different document versions
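As a possible starting point for the word count and search exercise, here is a small sketch that works on text you have already extracted. The helper names `word_counts` and `find_lines` are hypothetical, and the notion of a "word" (any run of letters, digits, or underscores) is a simplifying assumption.

```python
import re
from collections import Counter
from typing import List, Tuple


def word_counts(text: str) -> Counter:
    """Count case-insensitive word occurrences."""
    return Counter(re.findall(r"\w+", text.lower()))


def find_lines(text: str, term: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs whose text contains term, ignoring case."""
    needle = term.lower()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


sample = "The cat sat.\nThe dog ran.\nA cat slept."
print(word_counts(sample).most_common(2))
print(find_lines(sample, "CAT"))
```

From here you could wrap both functions in a command-line interface that reads the converted .txt file from disk.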
Next Tutorial
Ready to explore more specialized conversion options? Continue with our Tutorial: Converting CAD Documents with Load Options to learn techniques for handling specialized CAD file formats.
Additional Resources
Have questions about this tutorial? Feel free to reach out on our forum for support.