Tutorial: How to Extract Text from the Whole Document
Learning Objectives
In this tutorial, you’ll learn how to:
- Set up the GroupDocs.Parser Cloud API for text extraction
- Extract text from an entire document with a simple API call
- Process the extracted text in different programming languages
Prerequisites
Before starting this tutorial, make sure you have:
- A GroupDocs.Parser Cloud account (if you don’t have one, register for a free trial)
- Your Client ID and Client Secret (available from the dashboard)
- At least one document uploaded to your cloud storage
The Practical Scenario
Imagine you need to extract all text from a document to:
- Create a searchable database of document content
- Perform analysis on document text
- Enable full-text search functionality in your application
This tutorial will show you how to implement this functionality step by step.
Step 1: Obtain Authorization Token
Before making any API calls, you need to authenticate with the GroupDocs API using your Client ID and Client Secret.
# First, get a JSON Web Token
curl -v "https://api.groupdocs.cloud/connect/token" \
  -X POST \
  -d "grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -H "Accept: application/json"
The response contains a JSON Web Token (JWT) in its access_token field. You'll send it as a Bearer token in subsequent requests.
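The request body above is form-encoded, so credentials that contain characters such as & or = would need to be percent-encoded before they are sent. The following is a minimal sketch of that step, not part of the GroupDocs SDK; the class name TokenRequestBody and its build method are hypothetical helpers introduced only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the form-encoded body for the token request.
final class TokenRequestBody {

    // Percent-encodes both credentials so reserved characters cannot break the form body.
    static String build(String clientId, String clientSecret) {
        return "grant_type=client_credentials"
                + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
                + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder credentials, as in the curl example.
        System.out.println(build("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET"));
    }
}
```

The resulting string can be written to the connection's output stream in place of a manually concatenated body.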
Step 2: Prepare Your API Request
To extract text from a document, you’ll make a POST request to the text endpoint with the following parameters:
curl -v "https://api.groupdocs.cloud/v1.0/parser/text" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -d "{
    \"FileInfo\": {
      \"FilePath\": \"words/docx/document.docx\"
    }
  }"
Note that you only need to specify the FilePath parameter to extract text from the entire document.
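When the request body is assembled in code rather than typed by hand, a file path that contains a quote or a backslash would produce invalid JSON unless it is escaped. The sketch below shows one way to build the same FileInfo/FilePath body with only the standard library; TextRequestBody, escapeJson, and build are hypothetical names, and a JSON library would normally do this for you.

```java
// Hypothetical helper: builds the JSON body for the text endpoint.
final class TextRequestBody {

    // Escapes quotes, backslashes, and control characters for use inside a JSON string.
    static String escapeJson(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                // Control characters are written as four-digit hexadecimal escapes.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Produces the same structure as the curl example: only FilePath is set.
    static String build(String filePath) {
        return "{\"FileInfo\":{\"FilePath\":\"" + escapeJson(filePath) + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(build("words/docx/document.docx"));
    }
}
```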
Step 3: Execute the Request and Process the Response
When you execute the request, the API will return a JSON response containing the extracted text:
{
"text": "First Page\r\r\f"
}
The text is returned as a single string, with page breaks represented by the \f (form feed) character.
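Because pages are separated by \f, the extracted string can be split into per-page text once the text field has been read from the JSON response. Here is a minimal sketch under two assumptions: every page ends with \f, as in the sample response, and \r marks a line break. PageSplitter and splitPages are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits extracted text into pages on the form feed character.
final class PageSplitter {

    static List<String> splitPages(String text) {
        // A limit of -1 keeps empty pieces, so blank pages in the middle are preserved.
        String[] parts = text.split("\f", -1);
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            // The empty piece after the final form feed is not a page.
            boolean trailingRemainder = i == parts.length - 1 && parts[i].isEmpty();
            if (trailingRemainder) {
                continue;
            }
            // Normalize carriage returns to newlines and drop trailing whitespace.
            String page = parts[i].replace("\r\n", "\n").replace('\r', '\n').stripTrailing();
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> pages = splitPages("First Page\r\r\f");
        System.out.println(pages.size() + " page(s): " + pages);
    }
}
```

Splitting per page is a convenient starting point for the indexing and search scenarios described earlier, since each entry can be stored together with its page number.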
Try It Yourself
Now it’s your turn to try extracting text from your own document:
- Replace YOUR_CLIENT_ID and YOUR_CLIENT_SECRET with your actual credentials
- Update the FilePath parameter to point to a document in your storage
- Execute the curl command and observe the response
Implementation in Different Languages
C# Example
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

namespace GroupDocsParserCloudTutorial
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // Get your ClientID and ClientSecret from https://dashboard.groupdocs.cloud
            string clientId = "YOUR_CLIENT_ID";
            string clientSecret = "YOUR_CLIENT_SECRET";

            // Get JWT token
            string token = await GetAuthToken(clientId, clientSecret);

            // Extract text from the document
            await ExtractText(token, "words/docx/document.docx");
        }

        static async Task<string> GetAuthToken(string clientId, string clientSecret)
        {
            using (var client = new HttpClient())
            {
                // Prepare request
                var requestBody = $"grant_type=client_credentials&client_id={clientId}&client_secret={clientSecret}";
                var content = new StringContent(requestBody, Encoding.UTF8, "application/x-www-form-urlencoded");

                // Send request
                var response = await client.PostAsync("https://api.groupdocs.cloud/connect/token", content);

                // Process response
                var jsonString = await response.Content.ReadAsStringAsync();
                var token = JsonConvert.DeserializeObject<Dictionary<string, string>>(jsonString);
                return token["access_token"];
            }
        }

        static async Task ExtractText(string token, string filePath)
        {
            using (var client = new HttpClient())
            {
                // Prepare request
                client.DefaultRequestHeaders.Add("Authorization", $"Bearer {token}");
                var requestBody = new
                {
                    FileInfo = new
                    {
                        FilePath = filePath
                    }
                };
                var content = new StringContent(JsonConvert.SerializeObject(requestBody), Encoding.UTF8, "application/json");

                // Send request
                var response = await client.PostAsync("https://api.groupdocs.cloud/v1.0/parser/text", content);

                // Process response
                var jsonString = await response.Content.ReadAsStringAsync();
                Console.WriteLine(jsonString);
            }
        }
    }
}
Java Example
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import org.json.JSONObject;

public class ExtractTextTutorial {
    private static final String BASE_URL = "https://api.groupdocs.cloud/v1.0/parser";
    private static final String AUTH_URL = "https://api.groupdocs.cloud/connect/token";

    public static void main(String[] args) throws IOException {
        // Get your ClientID and ClientSecret from https://dashboard.groupdocs.cloud
        String clientId = "YOUR_CLIENT_ID";
        String clientSecret = "YOUR_CLIENT_SECRET";

        // Get JWT token
        String token = getAuthToken(clientId, clientSecret);

        // Extract text from the document
        extractText(token, "words/docx/document.docx");
    }

    private static String getAuthToken(String clientId, String clientSecret) throws IOException {
        URL url = new URL(AUTH_URL);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setDoOutput(true);

        String requestBody = "grant_type=client_credentials&client_id=" + clientId + "&client_secret=" + clientSecret;
        try (OutputStream os = conn.getOutputStream()) {
            os.write(requestBody.getBytes(StandardCharsets.UTF_8));
        }

        try (Scanner scanner = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            String jsonResponse = scanner.useDelimiter("\\A").next();
            JSONObject jsonObject = new JSONObject(jsonResponse);
            return jsonObject.getString("access_token");
        }
    }

    private static void extractText(String token, String filePath) throws IOException {
        URL url = new URL(BASE_URL + "/text");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setRequestProperty("Authorization", "Bearer " + token);
        conn.setDoOutput(true);

        String requestBody = "{\"FileInfo\":{\"FilePath\":\"" + filePath + "\"}}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(requestBody.getBytes(StandardCharsets.UTF_8));
        }

        try (Scanner scanner = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            String jsonResponse = scanner.useDelimiter("\\A").next();
            System.out.println(jsonResponse);
        }
    }
}
Common Issues and Troubleshooting
- Authentication Error: If you receive a 401 Unauthorized error, check that your Client ID and Client Secret are correct and that your token hasn’t expired.
- File Not Found: Ensure the specified file path is correct and the file exists in your storage.
- Empty Text Response: Some documents contain no extractable text, for example pages that consist only of scanned images. Try a document that contains actual text, or a different document format.
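The checks above can be sketched as two small helpers: one that turns an HTTP status code into a troubleshooting hint, and one that detects an effectively empty text response. This is an illustration only. ResponseTriage, hint, and hasExtractableText are hypothetical names, and only the 401 case is taken from this tutorial; how the API reports other failures is not assumed here.

```java
import java.net.HttpURLConnection;

// Hypothetical helper: maps common failure symptoms to the troubleshooting hints above.
final class ResponseTriage {

    // Returns a short hint for a status code, e.g. the value of conn.getResponseCode().
    static String hint(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return "Request succeeded.";
        }
        if (statusCode == HttpURLConnection.HTTP_UNAUTHORIZED) {
            return "Check the Client ID and Client Secret, then request a fresh token.";
        }
        return "Unexpected status " + statusCode
                + ": verify the file path and inspect the response body.";
    }

    // True when the text has content beyond whitespace; \r and \f count as whitespace.
    static boolean hasExtractableText(String text) {
        return text != null && !text.isBlank();
    }

    public static void main(String[] args) {
        System.out.println(hint(401));
        System.out.println(hasExtractableText("First Page\r\r\f"));
        System.out.println(hasExtractableText("\r\r\f"));
    }
}
```

With HttpURLConnection, the status code would be read before calling getInputStream(), because that method throws an IOException for error responses; the error body is available from getErrorStream() instead.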
What You’ve Learned
In this tutorial, you’ve learned:
- How to authenticate with the GroupDocs.Parser Cloud API
- How to extract text from an entire document
- How to implement this functionality in C# and Java
Next Steps
Now that you’ve mastered extracting text from an entire document, you can:
- Follow our tutorial on Extracting Text by Page Number Range to learn how to extract text from specific pages
- Explore how to Extract Formatted Text to preserve document formatting
Further Practice
Try extracting text from various document formats (PDF, DOCX, XLSX, etc.) to understand how the API handles different document structures. Experiment with combining multiple API operations to build a more comprehensive document processing solution.