Paging with the REST API Query Component 2020

Paging with the REST API Query Component 2020


This document builds on earlier articles Connecting to any REST API and Writing API Profiles, and describes how to include paging in an API Profile that you're already working on.

Overview of Paging

Most APIs support some form of response pagination or "paging" for practical reasons. The underlying network protocols perform faster and more reliably when the bulk transfer of data is split into individually manageable chunks. For example if it’s necessary to transfer 1,000,000 records, an API with paging enabled might expect the caller to make 100 individual HTTP requests, and will send back 10,000 records every time.

Please Note

Matillion’s API Query component has the capability to internally manage the paging loop. When it’s working correctly you do not need to develop your own external loop structures such as iterators.

2020 Updates

Matillion’s RSD syntax has changed significantly in 2020 with version 1.41. This document refers to 1.41 (and later) versions of Matillion.

In cases where both API Profile versions are available you can force Matillion to use the latest one by adding the parameter BackwardsCompatibilityMode set to false.

Set Compatibility Mode-False

Set Compatibility Mode-False

Preparation steps for Matillion API Profile implementation

In preparation for implementing paging, first make sure you have followed the first two steps described in "Connecting to any REST API".

  • Find and understand the API documentation, especially around paging.
  • Load the first page.
Then one extra step is required.
  • Load the "second" page, using cURL or Postman.

Loading the Second page

You should already have a working cURL or Postman command which returns the first page of results from your API. If not, revisit "Connecting to any REST API" and do so now.

The equivalent command to fetch the "second" page of data should be very similar to the first, but with some bespoke modification specific to the API in question.

For example, if your first cURL command was:

  curl -o page1-raw.json http://your.api/getData

Then, the returned data will look like as shown below:

{
  "data": [ {"id": 1, "value": "abc"},
            {"id": 2, "value": "defg"}
  ],
  "nextURL": "http://your.api/getData?page=2"
}

The second cURL command would be:

  curl -o page2-raw.json http://your.api/getData?page=2

Before continuing, ensure that:

  • Your second cURL command does successfully return data.
  • The second page of data is different to the first.
  • The second page of data follows the same structure as the first page.

Index of paging techniques

Different APIs implement paging in different ways. This document describes how to implement various widely used techniques.

   

Common Features

All Matillion paging implementations require that you add a new input named rows@next to the list of columns inside the existing api:info block. This virtual column is used to drive paging in the various ways demonstrated in this document.

In all cases, the driver will automatically loop through pages until any one of the following conditions is met:

  • No data records were returned.
  • The paging information held in the rows@next variable is missing.
  • The value of rows@next did not change since the last iteration.

Paging can be tricky to get right because RSD is a declarative language rather than an explicit loop, and so can be difficult to debug. It’s usually impossible to tell just from the API Profile Test screen whether or not paging has been invoked, because the Test screen only shows the first few records.

So the best way to be certain that paging is working is to save the API Profile and then use it in an API Query component, and check that it successfully returns all the data that matches the query.

For APIs which return the entire next URL as a data item it is technically possible to simulate paging entirely locally, by uploading multiple data pages (for example as JSON files), and manually editing the next-page URLs. You will need to use the file-finding technique demonstrated in our "Connecting to any REST API" documentation.

Rate Limiting

Matillion has no in-built support for artificial rate limiting. It will always issue the HTTP request for the next page immediately after the previous page has been processed.


Non-paging baseline

Later examples will all be based on a simple API which returns an array of data in JSON format like this:

{"data":[{"id":2},{"id":3}]}

For clarity, the baseline API Profile, which does not include any paging logic, is shown here:

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <api:info title="Baseline" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
      <attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
   </api:info>
   <api:set attr="DataModel" value="DOCUMENT" />
   <api:set attr="URI" value="http://your.api/getData" />
   <api:set attr="JSONPath" value="/data" />
   <api:script method="GET">
      <api:call op="jsonproviderGet">
         <api:push/>
      </api:call>
  </api:script>
</api:script>



APIs which return the next URL

This is a very common pattern, in which the API response in each page includes the instructions for fetching the "next" page. Sometimes it’s a full URL: Sometimes a partial URL. Sometimes the information is presented as part of the data, or it may be embedded as an HTTP header in the response.

Full next URL

In this variation, the full next URL to call is presented as a single field in the data.

For example the returned data follows this pattern:

{
  "data": [ ... ],
  "nextURL": "http://your.api/getData?pagingPosition=f7zWmaOdf0YRXPVbUqbt"
}

This is one of the simplest methods, and only three changes are required over the baseline API Profile, highlighted below.

  <api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <api:info title="Full URL pager" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
    <attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
    <input name="rows@next" desc="Identifier used for paging" />
    </api:info>
     <api:set attr="DataModel" value="DOCUMENT" />
   <api:set attr="EnablePaging" value="TRUE" />
   <api:set attr="pageurlpath" value="/nextUrl" />
   <api:set attr="URI" value="http://your.api/getData" />
   <api:set attr="JSONPath" value="/data" />
    <api:script method="GET">
     <api:call op="jsonproviderGet">
    <api:push/>
   </api:call>
   </api:script>
 </api:script>

Partial next URL

In this variation, the API does not return the full next URL to call. Instead the returned data includes the information you need to construct the next URL yourself. The "Salesforce REST API" is an example of this: it returns a nextRecordsUrl containing the path to the next page of results if there are more.

For example, the returned data follows this pattern:

{
    "data": [ ... ],
    "pagingToken":"f7zWmaOdf0YRXPVbUqbt"
  }

This time, assuming your first URL was:

  http://your.api/getData

Then, clients are expected to use the pagingToken value to construct the second URL, which would need to be:

  http://your.api//getData?pagingPosition=f7zWmaOdf0YRXPVbUqbt

The pageurlpath property shown in the previous section does not work with this form of paging. Instead you must build the next URI each time manually, for example as shown below, with the changes highlighted again from the baseline.

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <api:info title="Partial URL pager" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
      <attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
    <input name="rows@next" desc="Identifier used for paging" />
   </api:info>
   <api:set attr="DataModel" value="DOCUMENT" />
   <api:set attr="EnablePaging" value="TRUE" />
   <api:set attr="elementmappath#" value="/pagingToken" />
   <api:set attr="elementmapname#" value="rows@next" />
   <api:check attr="_input.rows@next">
   <api:set attr="URI" value="http://your.api/getData?pagingPosition=[_input.rows@next]" />
  <api:else>
     <api:set attr="URI" value="http://your.api/getData" />
  <api:else>
  </api:check/>
  <api:set attr="JSONPath" value="/data" />
  <api:script method="GET">
     <api:call op="jsonproviderGet">
        <api:push/>
     </api:call>
  </api:script>
</api:script>

RFC8288 HTTP header linking

In this variation, the full next URL to call is returned by the API as an HTTP header according to RFC8288, i.e.

  • with the value set to

    Link:<http://your.api/getData... >

     

  • with

    rel="next"

     

Both of those features must be present for this type of paging to work. You can check the headers returned from the first page using a curl -v command.

An example might look like this:

  TheLinkHeader: Link:
  <http://your.api/getData?pagingPosition=f7zWmaOdf0YRXPVbUqbt >
  rel="next"

The technique is almost identical to the case described earlier where you have a full next URL, and might look like this:

  <api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <api:info title="RFC8288 pager" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
    <attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
     <input name="rows@next" desc="Identifier used for paging" />
     </api:info>
     <api:set attr="DataModel" value="DOCUMENT" />
     <api:set attr="EnablePaging" value="TRUE" />
     <api:set attr="pageurlpath" value="header:TheLinkHeader" />
     <api:set attr="URI" value="http://your.api/getData" />
     <api:set attr="JSONPath" value="/data" />
     <api:script method="GET">
       <api:call op="jsonproviderGet">
          <api:push/>
        </api:call>
     </api:script>
  </api:script>

Custom HTTP header field

In this variation, the API uses a header field to return the information about the next URL, but it’s non RFC8288 compliant, and you therefore need to construct the next URL manually.

You can check the headers returned from the first page using a curl -v command.

An example might look like this:

  NextPageToken: f7zWmaOdf0YRXPVbUqbt

Then clients are expected to take the NextPageToken value from the header to construct the second URL, which in this example would need to be:

  http://your.api/getData?pagingPosition=f7zWmaOdf0YRXPVbUqbt

Matillion expects that the last page is signified by the header being present but with no value:

  NextPageToken:

The technique is very similar to the one described in the Partial Next URL section earlier, but slightly more complex as it involves intercepting the custom header. A solution could look like this:

  <api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <api:info title="Custom header pager" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
    <attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
   <input name="rows@next" desc="Identifier used for paging" />
   </api:info>
   <api:set attr="DataModel" value="DOCUMENT" />
   <api:set attr="EnablePaging" value="TRUE" />
   <api:set attr="PushResponseHeader#" value="NextPageToken" />
   <api:check attr="_input.rows@next">
   <api:set attr="URI" value="http://your.api/getData?pagingPosition=[_input.rows@next]" />
   <api:else>
    <api:set attr="URI" value="http://your.api/getData" />
   </api:else>
   </api:check>
   <api:set attr="JSONPath" value="/data" />
   <api:script method="GET">
     <api:call op="jsonproviderGet">
       <api:set attr="rows@next" value="[_out1.header:NextPageToken]" />
           <api:push/>
     </api:call>
   </api:script>
</api:script>



APIs which use page numbering

This category of API requires that the caller requests a specifically-numbered page of results each time. The caller is expected to maintain the counter.

Paging ends when there is no more data to send, and the server returns a success response which is valid but which contains no data.

  • First URL: http://your.api/getData
  • Second URL: http://your.api/getData?page=2
  • Third URL: http://your.api/getData?page=3
  • Nth URL:http://your.api/getData?page=N
  • N+1th URL:http://your.api/getData?page=N+1 (returns a successful HTTP 200 but no data).

The technique involves starting the count at 1, and then incrementing the page number for every call.

  <api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <api:info title="Numbered pager" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
    <attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
    <input name="rows@next" desc="Identifier used for paging" />
   </api:info>
   <api:set attr="DataModel" value="DOCUMENT" />
   <api:check attr="_input.rows@next">
   <api:set attr="URI" value="http://your.api/getData?page=[_input.rows@next]" />
   <api:else>
   <api:set attr="_input.rows@next" value="1" />
   <api:set attr="URI" value="http://your.api/getData" />
    </api:else>
    </api:check>
   <api:set attr="JSONPath" value="/data" />
   <api:set attr="EnablePaging" value="TRUE" />
   <api:script method="GET">
     <api:call op="jsonproviderGet">
   <api:set attr="rows@next" value="[_input.rows@next | add(1)]" />
     <api:push/>
    </api:call>
   </api:script>
</api:script>

Failure when the page is out of range

This variation is almost identical to the previous example, except that the server signifies the end of paging by returning a specific HTTP response code: for example a 400 instead of a 200.

It’s almost identical to the ordinary paging example, except that the jsonproviderGet <api:script> section needs to have error handling added.

  <api:try>
    <api:call op="jsonproviderGet">
    <api:set attr="rows@next" value="[_input.rows@next | add(1)]" />
      <api:push/>
     </api:call>
    <api:catch code="*">
    <api:match pattern="true" value="[_description | contains('400 Bad Request')]" >
    <api:else>
    <api:throw code="[_code]" desc="[_description]" />
   </api:else>
    </api:match>
    </api:catch>
  </api:try>

The full solution might look like below. It will continue paging until it receives a 400 response code, which it treats as the end-of-paging signal.

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <api:info title="Numbered pager with 400" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
  <attr name="id" xs:type="integer" readonly="false"  other:xPath="id" />
  <input name="rows@next" desc="Identifier used for paging" />
 </api:info>
 <api:set attr="DataModel" value="DOCUMENT" />
 <api:check attr="_input.rows@next">
  <api:set attr="URI" value="http://your.api/getData?page=[_input.rows@next]" />
   <api:else>
  <api:set attr="_input.rows@next" value="1" />
  <api:set attr="URI" value="http://your.api/getData" />
  </api:else>
  </api:check>
  <api:set attr="JSONPath" value="/data" />
  <api:set attr="EnablePaging" value="TRUE" />
   <api:script method="GET">
  <api:try>
  <api:call op="jsonproviderGet">
  <api:set attr="rows@next" value="[_input.rows@next | add(1)]" />
   <api:push/>
  </api:call>
  <api:catch code="*">
  <api:match pattern="true" value="[_description | contains('400 Bad Request')]" >

 <api:else>
  <api:throw code="[_code]" desc="[_description]" />
  </api:else>
  </api:match>
  </api:catch>
  </api:try>
 </api:script>
</api:script>



APIs which use record counts

This category of API returns a number of records each time, and expects the caller to subsequently request records starting from an offset beyond the previous range.

There are a number of common variations:

API provides the next offset

These APIs can be handled in an identical way to the Partial Next URL example.

API tells you how many records it returned

In this variation, the API returns an array of records plus a counter which indicates the number of records returned.

For example like this:

  {
    "data":[{"id":1},{"id":2},{"id":3},{"id":4}],
    "recordCount":4
  }

In the above case, the URL of the next page is calculated by adding the recordCount value to the previous offset, and so (if starting at 1) would be:

  /getData?offset=5

The technique used here involves a second input variable (named localCounter), which is captured from a named elementmappath (in this case /recordCount). Inside the fetch loop, check to see if the attribute has been set, which it will be once per page, and if so add it to the offset held in rows@next, starting at 1.

   <api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <api:info title="Baseline" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
         <attr name="id" xs:type="integer" readonly="false"  other:xPath="id" />
         <input name="rows@next" desc="Identifier used for paging" />
         <input name="localCounter" desc="Record counter" />
      </api:info>

     <api:set attr="DataModel" value="DOCUMENT" />
     <api:set attr="EnablePaging" value="TRUE" />

     <api:set attr="elementmappath#" value="/recordCount" />
     <api:set attr="elementmapname#" value="localCounter" />
     <api:check attr="_input.rows@next">
       <api:set attr="URI" value="http://your.api/getData?offset=[_input.rows@next]" />
     <api:else>
        <api:set attr="_input.rows@next" value="1" />
        <api:set attr="URI" value="http://your.api/getData" />
     </api:else>
     </api:check>
     <api:set attr="JSONPath" value="/data" />
     <api:script method="GET">
        <api:call op="jsonproviderGet">
         <api:exists attr="localCounter">
              <api:set attr="rows@next" value="[_input.rows@next | add([localCounter])]" />
           </api:exists>
          <api:push/>
        </api:call>
     </api:script>
  </api:script>

Notes on the above:

  • rows@next stores the absolute offset.
  • localCounter stores the relative offset, and it’s fine if different pages contain different numbers of records.
  • The <api:exists> block acts as a conditional branch, and is necessary because the localCounter only gets assigned once per page whereas the push happens once per output record.

Caller must count the records

In this variation, the API returns just an array of records, and the client must count them in order to work out the offset required for the next page.

For example like this:

  {
    "data":[{"id":1},{"id":2},{"id":3}]
  }

Note the server does not return a recordCount attribute, so the caller must count the number of records returned and then add it to the previous offset. So if starting at 1 the next URL would be:

  /getData?offset=4

The technique used here is to begin the count at 1, and then increment by one for every record processed.

  <api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <api:info title="Record Counter" desc="Generated schema file" xmlns:other="http://apiscript.com/ns?v1">
    <attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
     <input name="rows@next" desc="Identifier used for paging" />
   </api:info>
   <api:set attr="DataModel" value="DOCUMENT" />
   <api:set attr="EnablePaging" value="TRUE" />
   <api:check attr="_input.rows@next">
    <api:set attr="URI" value="http://your.api/getData?offset=[_input.rows@next]" />
   <api:else>
    <api:set attr="_input.rows@next" value="1" />
    <api:set attr="URI" value="http://your.api/getData" />
   </api:else>
   </api:check>
   <api:set attr="JSONPath" value="/data" />
   <api:script method="GET">
    <api:call op="jsonproviderGet">
     <api:set attr="_input.rows@next" value="[_input.rows@next | add(1)]" />
      <api:set attr="_out1.rows@next" value="[_input.rows@next]" />
     <api:push/>
    </api:call>
   </api:script>
  </api:script>

Note the syntax within the jsonproviderGet block which is required because there are actually "two" rows@next variables: one acting as an input and one an output.