Tutorial: How to Extract Text by a Page Number Range
Learning Objectives
In this tutorial, you’ll learn how to:
- Extract text from specific pages in a document
- Define page ranges for targeted text extraction
- Process page-specific text in different programming languages
Prerequisites
Before starting this tutorial, make sure you have:
- A GroupDocs.Parser Cloud account (if you don’t have one, register for a free trial)
- Your Client ID and Client Secret (available from the dashboard)
- A multi-page document uploaded to your cloud storage
The Practical Scenario
Imagine you’re building an application that needs to:
- Extract text from specific sections of a long document
- Process only relevant pages instead of the entire document
- Allow users to navigate through document content page by page
This tutorial will show you how to implement this functionality step by step.
Step 1: Obtain Authorization Token
Before making any API calls, you need to authenticate with the GroupDocs API using your Client ID and Client Secret.
# First get JSON Web Token
curl -v "https://api.groupdocs.cloud/connect/token" \
-X POST \
-d "grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET" \
-H "Content-Type: application/x-www-form-urlencoded" \
-H "Accept: application/json"
This will return a JWT token that you’ll use in subsequent requests.
Step 2: Prepare Your API Request
To extract text from specific pages, you’ll make a POST request to the text endpoint with the following parameters:
curl -v "https://api.groupdocs.cloud/v1.0/parser/text" \
-X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
-d "{
\"StartPageNumber\": 0,
\"CountPagesToExtract\": 4,
\"FileInfo\": {
\"FilePath\": \"pdf/pages-document.pdf\"
}
}"
The key parameters here are:
StartPageNumber
: The zero-based index of the first page to extract (0 = first page)CountPagesToExtract
: The number of pages to extract starting from the start page
Step 3: Execute the Request and Process the Response
When you execute the request, the API will return a JSON response containing the extracted text for each page:
{
"pages": [
{
"pageIndex": 0,
"text": "Text inside bookmark 0\r\n\r\n Page 0 heading\r\nP a g e T e x t - P a g e 0\r\n"
},
{
"pageIndex": 1,
"text": "Text inside bookmark 1\r\n\r\n Page 1 heading\r\nP a g e T e x t - P a g e 1\r\n"
},
{
"pageIndex": 2,
"text": "Text inside bookmark 2\r\n\r\n Page 2 heading\r\nP a g e T e x t - P a g e 2\r\n"
},
{
"pageIndex": 3,
"text": "Text inside bookmark 3\r\n\r\n Page 3 heading\r\nP a g e T e x t - P a g e 3\r\n"
}
]
}
Notice that the response includes the pageIndex
for each extracted page, making it easy to identify which text belongs to which page.
Try It Yourself
Now it’s your turn to try extracting text from specific pages:
- Replace
YOUR_CLIENT_ID
andYOUR_CLIENT_SECRET
with your actual credentials - Update the
FilePath
parameter to point to a multi-page document in your storage - Adjust the
StartPageNumber
andCountPagesToExtract
parameters to extract different page ranges - Execute the curl command and observe how the response changes
Implementation in Different Languages
C# Example
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;
namespace GroupDocsParserCloudTutorial
{
class Program
{
static async Task Main(string[] args)
{
// Get your ClientID and ClientSecret from https://dashboard.groupdocs.cloud
string clientId = "YOUR_CLIENT_ID";
string clientSecret = "YOUR_CLIENT_SECRET";
// Get JWT token
string token = await GetAuthToken(clientId, clientSecret);
// Extract text from specific pages
await ExtractTextByPageRange(token, "pdf/pages-document.pdf", 0, 4);
}
static async Task<string> GetAuthToken(string clientId, string clientSecret)
{
using (var client = new HttpClient())
{
// Prepare request
var requestBody = $"grant_type=client_credentials&client_id={clientId}&client_secret={clientSecret}";
var content = new StringContent(requestBody, Encoding.UTF8, "application/x-www-form-urlencoded");
// Send request
var response = await client.PostAsync("https://api.groupdocs.cloud/connect/token", content);
// Process response
var jsonString = await response.Content.ReadAsStringAsync();
var token = JsonConvert.DeserializeObject<Dictionary<string, string>>(jsonString);
return token["access_token"];
}
}
static async Task ExtractTextByPageRange(string token, string filePath, int startPage, int pageCount)
{
using (var client = new HttpClient())
{
// Prepare request
client.DefaultRequestHeaders.Add("Authorization", $"Bearer {token}");
var requestBody = new
{
StartPageNumber = startPage,
CountPagesToExtract = pageCount,
FileInfo = new
{
FilePath = filePath
}
};
var content = new StringContent(JsonConvert.SerializeObject(requestBody), Encoding.UTF8, "application/json");
// Send request
var response = await client.PostAsync("https://api.groupdocs.cloud/v1.0/parser/text", content);
// Process response
var jsonString = await response.Content.ReadAsStringAsync();
Console.WriteLine(jsonString);
// You can process each page individually
var result = JsonConvert.DeserializeObject<PageTextResponse>(jsonString);
foreach (var page in result.Pages)
{
Console.WriteLine($"Page {page.PageIndex}:");
Console.WriteLine(page.Text);
Console.WriteLine("--------------------");
}
}
}
class PageTextResponse
{
public List<PageData> Pages { get; set; }
}
class PageData
{
public int PageIndex { get; set; }
public string Text { get; set; }
}
}
}
Java Example
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import org.json.JSONArray;
import org.json.JSONObject;
public class ExtractTextByPageRangeTutorial {
private static final String BASE_URL = "https://api.groupdocs.cloud/v1.0/parser";
private static final String AUTH_URL = "https://api.groupdocs.cloud/connect/token";
public static void main(String[] args) throws IOException {
// Get your ClientID and ClientSecret from https://dashboard.groupdocs.cloud
String clientId = "YOUR_CLIENT_ID";
String clientSecret = "YOUR_CLIENT_SECRET";
// Get JWT token
String token = getAuthToken(clientId, clientSecret);
// Extract text from specific pages
extractTextByPageRange(token, "pdf/pages-document.pdf", 0, 4);
}
private static String getAuthToken(String clientId, String clientSecret) throws IOException {
URL url = new URL(AUTH_URL);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("POST");
conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
conn.setDoOutput(true);
String requestBody = "grant_type=client_credentials&client_id=" + clientId + "&client_secret=" + clientSecret;
try (OutputStream os = conn.getOutputStream()) {
os.write(requestBody.getBytes(StandardCharsets.UTF_8));
}
try (Scanner scanner = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
String jsonResponse = scanner.useDelimiter("\\A").next();
JSONObject jsonObject = new JSONObject(jsonResponse);
return jsonObject.getString("access_token");
}
}
private static void extractTextByPageRange(String token, String filePath, int startPage, int pageCount) throws IOException {
URL url = new URL(BASE_URL + "/text");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("POST");
conn.setRequestProperty("Content-Type", "application/json");
conn.setRequestProperty("Accept", "application/json");
conn.setRequestProperty("Authorization", "Bearer " + token);
conn.setDoOutput(true);
String requestBody = String.format(
"{\"StartPageNumber\":%d,\"CountPagesToExtract\":%d,\"FileInfo\":{\"FilePath\":\"%s\"}}",
startPage, pageCount, filePath
);
try (OutputStream os = conn.getOutputStream()) {
os.write(requestBody.getBytes(StandardCharsets.UTF_8));
}
try (Scanner scanner = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
String jsonResponse = scanner.useDelimiter("\\A").next();
System.out.println(jsonResponse);
// Process each page individually
JSONObject responseObj = new JSONObject(jsonResponse);
JSONArray pages = responseObj.getJSONArray("pages");
for (int i = 0; i < pages.length(); i++) {
JSONObject page = pages.getJSONObject(i);
int pageIndex = page.getInt("pageIndex");
String text = page.getString("text");
System.out.println("Page " + pageIndex + ":");
System.out.println(text);
System.out.println("--------------------");
}
}
}
}
Learning Checkpoint
Take a moment to test your understanding:
- What is the purpose of the
StartPageNumber
parameter? - If you want to extract pages 5-10 of a document, what values should you use for
StartPageNumber
andCountPagesToExtract
? - How would you modify the request to extract only the last page of a document if you don’t know how many pages it has?
Common Issues and Troubleshooting
- Page Range Out of Bounds: If you specify a page range that’s outside the document’s boundaries, you’ll receive an error. Make sure your
StartPageNumber
andCountPagesToExtract
parameters are valid for your document. - Zero Pages Returned: Ensure that the document actually has content on the specified pages. Some documents might have blank pages that return empty text.
- Different Page Counts: Remember that page numbering starts from 0 in the API, but may start from 1 in the document viewer.
What You’ve Learned
In this tutorial, you’ve learned:
- How to extract text from specific pages in a document
- How to specify page ranges using
StartPageNumber
andCountPagesToExtract
- How to process the page-specific text in your application
Next Steps
Now that you know how to extract text from specific pages, you can:
- Explore Extracting Formatted Text to preserve document formatting
Further Practice
Try creating an application that:
- First determines the total number of pages in a document
- Extracts text page by page using a loop
- Performs analysis on each page (such as word count, keyword detection, etc.)
- Compiles a summary report of the document’s content structure