Tutorial: How to Extract Formatted Text
Learning Objectives
In this tutorial, you’ll learn how to:
- Extract text from documents while preserving formatting
- Use different text formatting modes (HTML, Markdown, Plain Text)
- Process formatted text in different programming languages
Prerequisites
Before starting this tutorial, make sure you have:
- A GroupDocs.Parser Cloud account (if you don’t have one, register for a free trial)
- Your Client ID and Client Secret (available from the dashboard)
- A formatted document (e.g., a DOCX file with various formatting) uploaded to your cloud storage
The Practical Scenario
Imagine you’re developing an application that needs to:
- Convert documents to web content while preserving formatting
- Export document text as HTML for rendering in a browser
- Preserve document structure including headings, lists, tables, and text styling
This tutorial will show you how to implement this functionality step by step.
Step 1: Obtain Authorization Token
Before making any API calls, you need to authenticate with the GroupDocs API using your Client ID and Client Secret.
# First get JSON Web Token
curl -v "https://api.groupdocs.cloud/connect/token" \
  -X POST \
  -d "grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -H "Accept: application/json"
The response is a JSON object whose access_token field contains the JWT. You’ll send this token as a Bearer token in the Authorization header of subsequent requests.
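Real credentials can contain characters such as & or = that would corrupt the form body if pasted in unencoded. The following is a minimal Java sketch of building that body safely. The class and method names (TokenRequestBody, build) are illustrative and not part of any GroupDocs SDK, and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the form body for the token request.
class TokenRequestBody {

    // URL-encodes both credentials so reserved characters cannot break the form body.
    static String build(String clientId, String clientSecret) {
        return "grant_type=client_credentials"
                + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
                + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints the body that would be sent to the token endpoint.
        System.out.println(build("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET"));
    }
}
```

The returned string can be sent as the POST body with the Content-Type header set to application/x-www-form-urlencoded, as in the curl command.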
Step 2: Prepare Your API Request
To extract formatted text, you’ll make a POST request to the text endpoint with the following parameters:
curl -v "https://api.groupdocs.cloud/v1.0/parser/text" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -d "{
    \"FormattedTextOptions\": {
      \"Mode\": \"Html\"
    },
    \"FileInfo\": {
      \"FilePath\": \"words/docx/formatted-document.docx\"
    }
  }"
The key parameter here is FormattedTextOptions.Mode, which can be set to one of the following values:
- Html: Extracts text with HTML formatting
- Markdown: Extracts text with Markdown formatting
- PlainText: Extracts text without formatting (default)
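Because Mode accepts only these three values, it can help to model them as an enum so that a typo fails at compile time rather than at the API. The sketch below builds the same JSON body as the curl command. The names FormattedTextRequest, Mode, jsonEscape, and body are illustrative assumptions, and the escaping is deliberately minimal (backslashes and double quotes only).

```java
// Hypothetical helper: builds the JSON body for the text endpoint.
class FormattedTextRequest {

    // The three Mode values described in this tutorial.
    enum Mode { Html, Markdown, PlainText }

    // Minimal JSON string escaping: handles backslashes and double quotes only.
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Produces the request body with the chosen mode and storage path.
    static String body(Mode mode, String filePath) {
        return "{\"FormattedTextOptions\":{\"Mode\":\"" + mode.name() + "\"},"
                + "\"FileInfo\":{\"FilePath\":\"" + jsonEscape(filePath) + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(body(Mode.Markdown, "words/docx/formatted-document.docx"));
    }
}
```

In a real application a JSON library would be a safer choice than string concatenation. The hand-built version is used here only to keep the sketch free of dependencies.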
Step 3: Execute the Request and Process the Response
When you execute the request, the API will return a JSON response containing the formatted text:
{
"text": "
<p>
<b>Bold text
</b>
</p>
<p>
<i>Italic text
</i>
</p>
<ol>
<li>
<i>First element
</i>
</li>
<li>
<i>Second element
</i>
</li>
<li>
<i>Third element
</i>
</li>
</ol>
<h1>Heading 1
</h1>
<p>
<a href=\"http://targetwebsite.domain\">Hyperlink
</a>targetwebsite.domain
</p>
<table border=\"1\">
<tr>
<td>
<p>table
</p>
</td>
<td>
<p>Cell 1
</p>
</td>
<td>
<p>Cell 2
</p>
</td>
</tr>
<tr>
<td>
<p>Cell 3
</p>
</td>
<td>
<p>Cell 4
</p>
</td>
<td>
<p>Cell 5
</p>
</td>
</tr>
</table>
<p>\f
</p>
<p>
<b>Second page bold text
</b>
</p>
<h1>Second page heading
</h1>"
}
Notice that the HTML response preserves:
- Paragraph structure with <p> tags
- Text styling with <b> and <i> tags
- Lists with <ol> and <li> tags
- Headings with <h1> tags
- Links with <a> tags
- Tables with <table>, <tr>, and <td> tags
- Page breaks with the \f character
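Since page breaks arrive as a form feed character, the extracted HTML can be split back into pages after the JSON has been parsed. The sketch below assumes the break always appears as a paragraph containing only the form feed, as in the sample response above. Other documents may produce a different shape, so treat the pattern as a starting point. PageSplitter and splitPages are illustrative names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: splits extracted HTML into per-page fragments.
class PageSplitter {

    // Matches a paragraph holding only a form feed, with optional whitespace around it.
    private static final Pattern PAGE_BREAK = Pattern.compile("<p>\\s*\\f\\s*</p>");

    // Returns the HTML fragments found between page-break paragraphs.
    static List<String> splitPages(String html) {
        return Arrays.asList(PAGE_BREAK.split(html));
    }

    public static void main(String[] args) {
        String html = "<p><b>Bold text</b></p><p>\f\n</p><h1>Second page heading</h1>";
        List<String> pages = splitPages(html);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```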
Try It Yourself
Now it’s your turn to try extracting formatted text:
- Replace YOUR_CLIENT_ID and YOUR_CLIENT_SECRET with your actual credentials
- Update the FilePath parameter to point to a formatted document in your storage
- Try different values for Mode (Html, Markdown, PlainText) and observe how the response changes
- Execute the curl command and analyze the formatted output
Implementation in Different Languages
C# Example
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

namespace GroupDocsParserCloudTutorial
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // Get your ClientID and ClientSecret from https://dashboard.groupdocs.cloud
            string clientId = "YOUR_CLIENT_ID";
            string clientSecret = "YOUR_CLIENT_SECRET";

            // Get JWT token
            string token = await GetAuthToken(clientId, clientSecret);

            // Extract formatted text (HTML mode)
            await ExtractFormattedText(token, "words/docx/formatted-document.docx", "Html");

            // You can also try other modes
            // await ExtractFormattedText(token, "words/docx/formatted-document.docx", "Markdown");
            // await ExtractFormattedText(token, "words/docx/formatted-document.docx", "PlainText");
        }

        static async Task<string> GetAuthToken(string clientId, string clientSecret)
        {
            using (var client = new HttpClient())
            {
                // Prepare request; URL-encode the credentials so characters such as & or = cannot corrupt the form body
                var requestBody = $"grant_type=client_credentials&client_id={Uri.EscapeDataString(clientId)}&client_secret={Uri.EscapeDataString(clientSecret)}";
                var content = new StringContent(requestBody, Encoding.UTF8, "application/x-www-form-urlencoded");

                // Send request and stop early on a non-success status code
                var response = await client.PostAsync("https://api.groupdocs.cloud/connect/token", content);
                response.EnsureSuccessStatusCode();

                // Process response
                var jsonString = await response.Content.ReadAsStringAsync();
                var token = JsonConvert.DeserializeObject<Dictionary<string, string>>(jsonString);
                return token["access_token"];
            }
        }

        static async Task ExtractFormattedText(string token, string filePath, string mode)
        {
            using (var client = new HttpClient())
            {
                // Prepare request
                client.DefaultRequestHeaders.Add("Authorization", $"Bearer {token}");
                var requestBody = new
                {
                    FormattedTextOptions = new
                    {
                        Mode = mode
                    },
                    FileInfo = new
                    {
                        FilePath = filePath
                    }
                };
                var content = new StringContent(JsonConvert.SerializeObject(requestBody), Encoding.UTF8, "application/json");

                // Send request and stop early on a non-success status code
                var response = await client.PostAsync("https://api.groupdocs.cloud/v1.0/parser/text", content);
                response.EnsureSuccessStatusCode();

                // Process response
                var jsonString = await response.Content.ReadAsStringAsync();
                var result = JsonConvert.DeserializeObject<FormattedTextResponse>(jsonString);
                Console.WriteLine($"Extracted text in {mode} format:");
                Console.WriteLine(result.Text);

                // If extracting HTML, you can save it to an HTML file
                if (mode == "Html")
                {
                    System.IO.File.WriteAllText("extracted.html", result.Text);
                    Console.WriteLine("HTML content saved to extracted.html");
                }
            }
        }

        // Maps the "text" field of the JSON response
        class FormattedTextResponse
        {
            public string Text { get; set; }
        }
    }
}
Java Example
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

import org.json.JSONObject;

public class ExtractFormattedTextTutorial {
    private static final String BASE_URL = "https://api.groupdocs.cloud/v1.0/parser";
    private static final String AUTH_URL = "https://api.groupdocs.cloud/connect/token";

    public static void main(String[] args) throws IOException {
        // Get your ClientID and ClientSecret from https://dashboard.groupdocs.cloud
        String clientId = "YOUR_CLIENT_ID";
        String clientSecret = "YOUR_CLIENT_SECRET";

        // Get JWT token
        String token = getAuthToken(clientId, clientSecret);

        // Extract formatted text (HTML mode)
        extractFormattedText(token, "words/docx/formatted-document.docx", "Html");

        // You can also try other modes
        // extractFormattedText(token, "words/docx/formatted-document.docx", "Markdown");
        // extractFormattedText(token, "words/docx/formatted-document.docx", "PlainText");
    }

    private static String getAuthToken(String clientId, String clientSecret) throws IOException {
        URL url = new URL(AUTH_URL);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setDoOutput(true);

        // URL-encode the credentials so characters such as & or = cannot corrupt the form body
        String utf8 = StandardCharsets.UTF_8.name();
        String requestBody = "grant_type=client_credentials"
                + "&client_id=" + URLEncoder.encode(clientId, utf8)
                + "&client_secret=" + URLEncoder.encode(clientSecret, utf8);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(requestBody.getBytes(StandardCharsets.UTF_8));
        }

        // Read the whole response and pull out the access token
        try (Scanner scanner = new Scanner(conn.getInputStream(), utf8)) {
            String jsonResponse = scanner.useDelimiter("\\A").next();
            JSONObject jsonObject = new JSONObject(jsonResponse);
            return jsonObject.getString("access_token");
        }
    }

    private static void extractFormattedText(String token, String filePath, String mode) throws IOException {
        URL url = new URL(BASE_URL + "/text");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setRequestProperty("Authorization", "Bearer " + token);
        conn.setDoOutput(true);

        // Build the body with JSONObject so the mode and path are escaped correctly
        String requestBody = new JSONObject()
                .put("FormattedTextOptions", new JSONObject().put("Mode", mode))
                .put("FileInfo", new JSONObject().put("FilePath", filePath))
                .toString();
        try (OutputStream os = conn.getOutputStream()) {
            os.write(requestBody.getBytes(StandardCharsets.UTF_8));
        }

        try (Scanner scanner = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            String jsonResponse = scanner.useDelimiter("\\A").next();
            JSONObject responseObj = new JSONObject(jsonResponse);
            String extractedText = responseObj.getString("text");
            System.out.println("Extracted text in " + mode + " format:");
            System.out.println(extractedText);

            // If extracting HTML, you can save it to an HTML file (written as UTF-8)
            if (mode.equals("Html")) {
                Files.write(Paths.get("extracted.html"), extractedText.getBytes(StandardCharsets.UTF_8));
                System.out.println("HTML content saved to extracted.html");
            }
        }
    }
}
Learning Checkpoint
Take a moment to test your understanding:
- What are the three available modes for formatted text extraction?
- How does the HTML mode preserve document structure compared to plain text?
- What types of formatting elements are preserved in the HTML output?
Using the Extracted Formatted Text
Here are some practical uses for the formatted text:
HTML Mode
The HTML-formatted output can be directly embedded in a webpage or application. You might need to add some CSS styling to make it look better:
<!DOCTYPE html>
<html>
<head>
  <title>Extracted Document</title>
  <style>
    body {
      font-family: Arial, sans-serif;
      line-height: 1.6;
      margin: 20px;
    }
    table {
      border-collapse: collapse;
      width: 100%;
    }
    td, th {
      border: 1px solid #ddd;
      padding: 8px;
    }
    h1 {
      color: #2c3e50;
    }
    a {
      color: #3498db;
    }
  </style>
</head>
<body>
  <!-- Insert the extracted HTML content here -->
  [EXTRACTED_HTML_CONTENT]
</body>
</html>
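One way to fill the [EXTRACTED_HTML_CONTENT] placeholder from code is a plain string substitution, sketched below. HtmlPage and fill are illustrative names, and the sketch assumes the template has already been loaded into a string. String.replace is used because it treats both arguments literally, so characters such as $ in the extracted HTML need no special handling.

```java
// Hypothetical helper: inserts extracted HTML into a page template.
class HtmlPage {

    // The placeholder used in the template above.
    static final String PLACEHOLDER = "[EXTRACTED_HTML_CONTENT]";

    // Replaces every occurrence of the placeholder with the extracted HTML.
    static String fill(String template, String extractedHtml) {
        return template.replace(PLACEHOLDER, extractedHtml);
    }

    public static void main(String[] args) {
        String template = "<body>\n" + PLACEHOLDER + "\n</body>";
        System.out.println(fill(template, "<p><b>Bold text</b></p>"));
    }
}
```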
Markdown Mode
The Markdown-formatted output can be used in Markdown editors, GitHub README files, or converted to other formats using Markdown processors.
Common Issues and Troubleshooting
- Missing Formatting: Some complex formatting may not be preserved exactly as in the original document. The API focuses on preserving the most common formatting elements.
- Special Characters: Special characters in HTML might need to be escaped when you display the content. Most modern frameworks handle this automatically.
- Rendering Differences: Different browsers or Markdown renderers might display the formatted content slightly differently.
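As an illustration of the special-characters point, the sketch below escapes text before it is placed inside an HTML page. This applies to output that should appear literally, such as PlainText-mode text or Html-mode markup shown as source. Markup that is meant to be rendered should not be escaped. HtmlEscaper and escape are illustrative names, and a templating framework or an established library would normally do this job.

```java
// Hypothetical helper: escapes text for safe inclusion in an HTML page.
class HtmlEscaper {

    // Replaces the five characters that have special meaning in HTML.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Use <b> for bold & <i> for italic"));
    }
}
```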
What You’ve Learned
In this tutorial, you’ve learned:
- How to extract text with preserved formatting from documents
- How to specify different formatting modes (HTML, Markdown, Plain Text)
- How to process and use the formatted text in your applications
Next Steps
Now that you know how to extract formatted text, you can:
- Combine this with Page Range Extraction to extract formatted text from specific pages
- Learn about Extracting Images to complement your formatted text with images
Further Practice
Try creating an application that:
- Extracts formatted text from a document
- Processes the HTML to enhance it (e.g., adding custom CSS classes)
- Renders the formatted content in a web application
- Allows the user to toggle between HTML and Markdown views