Paging with the REST API Query Component - deprecated

Paging with the REST API Query Component - deprecated


Overview


Many API's support "paging" or "pagination" to make it easier to download large volumes of data – as manageable chunks. Downloading the complete dataset involves issuing multiple requests with each request specifying the part of the data to download. Please read your API's documentation to understand how paging (if used) is implemented in that API.

The API Query component supports certain constructs in RSD definition to allow issuing multiple requests to download relevant data in chunks. All downloaded data is then pushed into the specified target-table.

Important Information

This document builds on an earlier document (Managing API Profiles), and describes how to include paging in an API Profile that you have created.

Warning

  • The RSD syntax used by Matillion ETL has changed significantly from version 1.41. For the latest information, please refer to Paging with the REST API Query Component 2020.
  • Due to these changes, the syntax of examples used in this document and RSD files on Matillion ETL may differ slightly from current API standards—for example, <rsb: is now <api:.
  • Other updates include the need for users to setup input parameters as inputs rather than dummy columns, and only returning the date on Date data types as opposed to the datetime as previously.

Implementing Paging in RSD

To support paging, introduce pseudo column rows@next in your list of columns and set the rows@next attribute to any information needed to make the request for the next page of data. When this value is set in the output, the driver will automatically call the RSD script again with the rows@next value in the input after it is finished returning results for this page. You can use the value of this input to modify the request on the next pass to get the next page of data.

The process can be broken down into three steps.

  1. Introduce pseudo column rows@next in your list of columns, within the <rsb:info tag. Typically this should be the last entry in your <rsb:info tag.

    <input name="rows@next" desc="A system column used for paging. Do not change." />
  2. Set the attribute uri to an appropriate value. The value for rows@next value can be accessed via the _input collection (more on this in the example below):

    <rsb:check attr="_input.rows@next">
        <rsb:set attr="uri" value="[_input.rows@next]" />
        <rsb:else>
            <rsb:set attr="uri" value="<first page's URL>" />
        </rsb:else>
    </rsb:check>
  3. Set the value of rows@next to a value that can be used to identify if a subsequent request should be issued. For example, your API may return the next page's URL in the response. You can obtain this value by providing the XPath to the URL:

    <rsb:set attr="elementmappath#" value="/next_page" />
    <rsb:set attr="elementmapname#" value="rows@next" />

    Step 2 will always hit the <rsb:else /> block on first request as rows@next will be "null".

    Please Note

    The above example is just to illustrate how a value can be set for rows@next based on incoming data. Depending on the structure of the data and information available, we may have to take a different approach to updating rows@next. Please contact support if you need help with a specific API.

  4. Enable API paging by adding the following to the <rsb:script> section of your RSD.

    <rsb:set attr="EnablePaging" value="TRUE" />

Example

Now lets discuss this in the context of an example based on REST API from desk.com, for which documentation can be found here. Below is an initial request and Response we will use to build our RSD to support paging.

  • Initial Request

    https://api.desk.com/api/v2/cases?sort_field=updated_at&sort_direction=desc
  • Sample Response

    The header in the JSON response contains links to "first", "last", "next" and "previous" pages followed by a list of 50 cases (default). Using this information we can navigate between pages of information and retrieve data. Below is a snippet of the JSON returned:

    Please Note

    "Previous" is "null" for the first page. Similarly, "next" will be "null" for the last page. The "last" link tells us we have 39 pages of information to download.

    {
      "page": 1,
      "total_entries": 1920,
      "_links": {
        "self": {
          "class": "page",
          "href": "/api/v2/cases?page=1&per_page=50&sort_direction=desc&sort_field=updated_at"
        },
        "first": {
          "class": "page",
          "href": "/api/v2/cases?page=1&per_page=50&sort_direction=desc&sort_field=updated_at"
        },
        "last": {
          "class": "page",
          "href": "/api/v2/cases?page=39&per_page=50&sort_direction=desc&sort_field=updated_at"
        },
        "next": {
          "class": "page",
          "href": "/api/v2/cases?page=2&per_page=50&sort_direction=desc&sort_field=updated_at"
        },
        "previous": null
      },
      "_embedded": {
        "entries": [
          {
            "id": 2162,
            "blurb": "help implement paging with API Query...",
            "changed_at": "2017-05-25T10:50:29Z",
            "created_at": "2017-05-23T12:08:58Z",
            "custom_fields": {

 

Building our RSD

The goal is to build an RSD that will issue the following requests:

  • Initial request

    https://api.desk.com/api/v2/cases?sort_field=updated_at&sort_direction=desc
  • Page 2

    https://api.desk.com/api/v2/cases?page=2&per_page=50&sort_direction=desc&sort_field=updated_at
  • Page 3

    https://api.desk.com/api/v2/cases?page=3&per_page=50&sort_direction=desc&sort_field=updated_at
  • Page 39

    https://api.desk.com/api/v2/cases?page=39&per_page=50&sort_direction=desc&sort_field=updated_at

Below is the RSD required to work with this API. The lines specific to paging are highlighted in yellow.

Please Note

The URL api.desk.com needs to be updated with your DESK end-point if you have an account with Desk.com and would like to use the following RSD.

<rsb:script xmlns:rsb="http://www.rssbus.com/ns/rsbscript/2/">
  <rsb:info title="Desc Cases" desc="List of cases from Desk.com." >
    <!-- Column definitions -->
    <attr name="id" xs:type="integer"  other:xPath="id" />
    <attr name="subject" xs:type="string"  other:xPath="subject" />
    <attr name="create_date" xs:type="datetime"  other:xPath="received_at" />
    <attr name="status" xs:type="string"   other:xPath="status" />

    <attr name="next_page" xs:type="string"   other:xPath="/_links/next/href" />

    <!-- Paging meta column -->
    <input name="Rows@Next" desc="Identifier for the next page of results" />
  </rsb:info>
  <rsb:set attr="RepeatElement" value="/_embedded/entries" />
  <rsb:script method="GET" >
   <rsb:set attr="EnablePaging" value="TRUE" />
    <rsb:check attr="Rows@Next">
      <rsb:set attr="uri" value="https://api.desk.com[_input.Rows@Next]"/>
      <rsb:else>
        <rsb:set attr="uri" value="https://api.desk.com/api/v2/cases/search?status=pending&per_page=50&sort_direction=desc&sort_field=updated_at"/>
      </rsb:else>
    </rsb:check>
    <rsb:call  op="jsonproviderGet">
      <rsb:check attr="next_page" >
        <rsb:set  attr="Rows@Next" value="[next_page]" />
      </rsb:check>
      <rsb:push/>
    </rsb:call>
  </rsb:script>
</rsb:script>

  1. Introduce metadata column rows@next. We also introduce a column next_page to capture the appropriate value from the resulting JSON. This is done as pseudo-columns like rows@next do not support other:xpath attribute and hence cannot map to a value directly. We then assign this value to rows@next in Step 3.

    <attr name="next_page" xs:type="string" other:xPath="/_links/next/href" />
    <!-- Paging meta column →
    <input name="Rows@Next" desc="Identifier for the next page of results" />
  2. On every pass, the API driver evaluates the value in the uri attribute and issues a request.

    <rsb:check attr="rows@next">
      <rsb:set attr="uri" value="https://api.desk.com[_input.rows@next]"/>
    

    <rsb:else>
        <rsb:set attr="uri"  value="https://api.desk.com/api/v2/cases?sort_field=updated_at&sort_direction=desc"/>
      </rsb:else>
    </rsb:check>

    Please Note

    Pay special attention to the use of [_input.rows@next]:

  3. The initial request always evaluates to <rsb:else> as rows@next is "null". We then set appropriate value in Step 3 so subsequent requests do not hit else. Below is the RSD snippet where we set the value for rows@next:

    <rsb:call  op="jsonproviderGet">
      <rsb:check attr="next_page" >
        <rsb:set attr="rows@next" value="[next_page]" />
      </rsb:check>
      <rsb:unset  attr="next_page"/>
      <rsb:push/>    
    </rsb:call>

    Please Note

    • The <rsb:call/> block (lines 1 – 7) is called after the data is downloaded and converted into an in-memory table by evaluating xpaths from <rsb:info/> block.
    • RSD defined in the <rsb:call /> block is called for every row returned. Desk.com API returns 50 rows with each request so lines 2-7 above will be evaluated for each row.
    • Line 3 checks if attribute/column next_page for that row has a value and assigns it to rows@next. The cell values for a row are available as attributes referenced by column name. The last page/request will return "null" into next_page hence Line 2 will evaluate to false and no value is assigned to rows@next.
    • <rsb:push/> on line 8 pushes an item into the output feed of the script. If there are no <rsb:push/> elements in your script, all rows are discarded and no data is loaded.