Context
When syncing data in Nango, two syncing processes are happening:
- Nango syncing with the external system, managed by integration functions
- Your app syncing with Nango, managed by your app’s code using the Nango API
Full refresh syncing (small datasets only)
For small datasets (e.g., a list of Slack users for organizations with fewer than 100 employees), you can instruct Nango to periodically poll for the entire dataset. This method is known as a “full refresh” sync. Once the sync run finishes, Nango computes which records changed during that run and sends a webhook to your backend. This lets you fetch only the incremental changes from Nango to your application.
As datasets grow, full refresh syncs stop scaling: they take longer to run, trigger rate limiting from external APIs, and consume more compute and memory resources. A typical full refresh sync script re-fetches the entire list of Slack users on each execution.
Incremental syncing
The preferred method for syncing larger datasets is to fetch only the incremental changes from the external API. This method is known as an “incremental” sync. Sync functions expose the start timestamp of the last sync execution under nango.lastSyncDate. You can use this timestamp to instruct the external API to send only the changes since that date. This way, you only receive and persist the modified records in the Nango cache.
For example, if you are syncing tens of thousands of contacts from a Salesforce account on an hourly basis, only a small portion of the contacts will be updated or created in any given hour. If you were doing a full refresh sync, you would need to fetch the entire contact list every hour, which is inefficient. With an incremental sync, you can fetch only the modified contacts from the past hour.
Not all APIs, and not all endpoints on most APIs, support this. If the endpoint you need does not let you filter or sort by last modified date, you will need to use a full refresh sync.
A sync script can update the list of Salesforce contacts incrementally on each execution by leveraging nango.lastSyncDate.
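A minimal, self-contained sketch of this pattern follows. It does not use the real Nango SDK or the Salesforce API: `SyncContext`, `fetchContacts`, `syncContacts`, and `makeContext` are hypothetical stand-ins, the cache is a plain in-memory object, and calls that would be asynchronous in a real sync function are kept synchronous to keep the example short. The only name taken from the text above is `lastSyncDate`.

```typescript
// Hypothetical stand-ins for illustration only; not the real Nango SDK types.
interface Contact {
  id: string;
  name: string;
  lastModified: Date;
}

interface SyncContext {
  // Start time of the previous sync execution; undefined on the very first run.
  lastSyncDate: Date | undefined;
  // Persists records to the cache (a plain in-memory object in this sketch).
  batchSave(records: Contact[]): void;
}

// Simulated external API: returns every contact, or only those modified
// at or after `modifiedSince` when a filter is given.
function fetchContacts(source: Contact[], modifiedSince?: Date): Contact[] {
  if (modifiedSince === undefined) {
    return source.slice();
  }
  const since = modifiedSince.getTime();
  return source.filter((c) => c.lastModified.getTime() >= since);
}

// Incremental sync: ask only for changes since the last run. With no
// lastSyncDate (first execution), this falls back to fetching everything.
function syncContacts(ctx: SyncContext, source: Contact[]): number {
  const changed = fetchContacts(source, ctx.lastSyncDate);
  ctx.batchSave(changed);
  return changed.length;
}

// Builds a context backed by an in-memory cache keyed by contact id.
function makeContext(cache: Record<string, Contact>, lastSyncDate?: Date): SyncContext {
  return {
    lastSyncDate: lastSyncDate,
    batchSave(records: Contact[]): void {
      for (const r of records) {
        cache[r.id] = r;
      }
    },
  };
}

const contacts: Contact[] = [
  { id: "1", name: "Ada", lastModified: new Date("2024-01-01T00:00:00Z") },
  { id: "2", name: "Grace", lastModified: new Date("2024-01-01T00:00:00Z") },
  { id: "3", name: "Linus", lastModified: new Date("2024-03-01T10:30:00Z") },
];
const cache: Record<string, Contact> = {};

// First execution: no lastSyncDate, so all three contacts are fetched.
const firstRun = syncContacts(makeContext(cache), contacts);

// Later execution: only the contact modified since 10:00 is fetched.
const secondRun = syncContacts(makeContext(cache, new Date("2024-03-01T10:00:00Z")), contacts);

console.log(firstRun, secondRun, Object.keys(cache).length);
```

The same function covers both cases discussed in this section: when `lastSyncDate` is undefined it behaves like a full refresh, and on later runs it requests only the records modified since the previous execution started.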
Initial sync execution
Even with incremental syncing, the very first sync execution has to be a full refresh sync since there is no previous data. This initial sync fetches all historical data and is more resource-intensive than subsequent executions. One strategy to manage this is to limit the period you are backfilling. For example, if you are syncing a Notion workspace, you can inform users that you will only sync Notion pages modified in the last three months, assuming these are most relevant.
Avoiding memory overuse
Nango integration functions, which manage data syncing between external systems and Nango, run on customer-specific VMs with fixed resources. Consequently, an integration function that uses too much memory can cause sync failures (e.g., a VM crash). The most common cause of excessive memory use is fetching a large number of records before saving them to the Nango cache.
Avoiding syncing unnecessary data
Another strategy for handling large datasets successfully is to filter the data you need as early as possible, either using filters available from the external API or by discarding data in the functions, i.e., not saving it to the Nango cache. This approach uses the external system as a source of truth, allowing you to sync additional data in the future by editing your Nango function and triggering a full refresh to backfill any missing historical data. Because of the flexibility of integration functions, Nango allows you to perform transformations early in the data sync process, optimizing resource use and enabling faster syncing. You can also use customer-specific config to implement customer-specific filters in your sync functions.
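The sketch below illustrates discarding unneeded records inside the function, combined with saving each page as soon as it is fetched so that memory use stays bounded by the page size instead of the dataset size. Everything in it is an assumption made for illustration: `Ticket`, `fetchPage`, and `syncFiltered` are hypothetical names, the paginated API is simulated with an in-memory array, and `save` stands in for whatever call persists records to the Nango cache. The `keep` predicate is where a customer-specific filter could be plugged in.

```typescript
// Hypothetical record shape, for illustration only.
interface Ticket {
  id: number;
  status: "open" | "closed";
}

interface SyncStats {
  saved: number;        // total records persisted
  batches: number;      // number of save calls
  largestBatch: number; // most records held for a single save call
}

// Simulated paginated external API: returns one page, or [] past the end.
function fetchPage(source: Ticket[], offset: number, pageSize: number): Ticket[] {
  return source.slice(offset, offset + pageSize);
}

// Fetch page by page, drop unwanted records immediately, and save each
// filtered page right away instead of accumulating the whole dataset.
function syncFiltered(
  source: Ticket[],
  pageSize: number,
  keep: (t: Ticket) => boolean,
  save: (batch: Ticket[]) => void
): SyncStats {
  const stats: SyncStats = { saved: 0, batches: 0, largestBatch: 0 };
  let offset = 0;
  while (true) {
    const page = fetchPage(source, offset, pageSize);
    if (page.length === 0) {
      break; // no more data
    }
    offset += page.length;

    const wanted = page.filter(keep); // filter as early as possible
    if (wanted.length === 0) {
      continue; // nothing worth saving on this page
    }
    save(wanted); // persist now; this page can then be garbage-collected
    stats.saved += wanted.length;
    stats.batches += 1;
    stats.largestBatch = Math.max(stats.largestBatch, wanted.length);
  }
  return stats;
}

// Ten tickets; even ids are open, odd ids are closed.
const tickets: Ticket[] = [];
for (let id = 1; id <= 10; id++) {
  tickets.push({ id: id, status: id % 2 === 0 ? "open" : "closed" });
}

const savedIds: number[] = [];
const stats = syncFiltered(
  tickets,
  4,
  (t) => t.status === "open",
  (batch) => {
    for (const t of batch) {
      savedIds.push(t.id);
    }
  }
);

console.log(stats, savedIds);
```

Because the filter runs before the save, closed tickets never reach the cache, and because each page is saved immediately, the function never holds more than one page of records at a time.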