How to choose an indexer?
This guide explains how to evaluate indexing solutions for large-scale production use cases.
It focuses on protocols and dApps that handle large volumes of data and/or many different chains, where scalability and resilience require purpose-built architecture and tooling.
🧩 High-level components
For any indexing solution, these are the main areas in which to explore and compare features, costs, scalability and robustness:
Ingestion: how the indexer listens for blockchain data (RPC calls, RPC WebSockets, pre-populated data lakes, etc.) and what its fault-tolerance characteristics are.
Processing: facilities that help you add more information or customize the logic for new transactions, events and blocks.
Backfilling: the mechanism for loading historical blockchain data and executing your processing logic on it.
Enriching: facilities for post-processing such as aggregations, joins, adding USD prices, etc.
Data Delivery: the architecture for delivering data to your backend, e.g. direct database access, a prebuilt GraphQL API, etc.
🌊 Ingestion reliability and scalability
When evaluating an indexing solution, it is important to explore these points about how ingestion is done:
Where does the ingested data come from? Does the solution work with a generic EVM RPC endpoint, or does it rely on pre-populated data? In the latter case, supporting a new chain requires a process that might take days or weeks before you can start indexing the data.
Does the solution work with more than one RPC endpoint, for fault tolerance and high availability? Can you provide both WebSocket and HTTP endpoints, so that you get fast delivery over WebSocket and a robust fallback over HTTP?
Is the architecture concurrent, so that blocks and events are received, ingested and processed in parallel? Such a design allows high availability as well as fast historical syncs.
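The multi-endpoint question above can be made concrete with a minimal sketch. It assumes each endpoint is wrapped as a plain callable; the names `fetch_with_fallback`, `flaky_websocket` and `stable_http` are hypothetical stand-ins, not part of any particular indexer. The idea is only to try the fast endpoint first and fall back to the next one on failure.

```python
from typing import Callable, List, Sequence

# Hypothetical type: an "endpoint" is any callable that returns the latest
# block number or raises on failure. Real code would wrap an HTTP or
# WebSocket JSON-RPC client here.
Endpoint = Callable[[], int]


def fetch_with_fallback(endpoints: Sequence[Endpoint]) -> int:
    """Try each endpoint in order and return the first successful result."""
    errors: List[Exception] = []
    for endpoint in endpoints:
        try:
            return endpoint()
        except Exception as exc:  # sketch only; real code would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(endpoints)} endpoints failed: {errors}")


def flaky_websocket() -> int:
    # Stand-in for a WebSocket endpoint that has dropped its connection.
    raise ConnectionError("websocket dropped")


def stable_http() -> int:
    # Stand-in for an HTTP endpoint that answers normally.
    return 19_000_000


latest = fetch_with_fallback([flaky_websocket, stable_http])
print(latest)
```

A production setup would likely add timeouts, retries with backoff and health scoring per endpoint; this sketch shows only the ordering and fallback logic.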
⚡️ Processing scalability
Your custom logic for transactions and events must be processed as fast as possible to deliver the best UX. For example, user A buying an NFT from collection X shouldn't delay user B buying an NFT from collection Y. Ideally, an indexing solution can scale horizontally and process events in parallel.
Parallel processing is especially important for fast chains, rollups and multi-chain dApps, as it allows all the chains to be stored in a single database.
Writing to databases (especially RDBMSes like Postgres) is expensive. If every event or transaction requires dozens of writes, the underlying database must be able to scale without requiring huge compute costs.
Many real-world deployments of indexers such as Graph run into database write issues at scale. The problem is not obvious during PoC tests or below 1m entities, but it becomes a real bottleneck when entities grow beyond 10m.
Processors make many RPC calls, so to scale properly to millions of events there must be robust caching and invalidation mechanisms in place. This ensures that historical backfills run to fix bugs or improve data do not cost double on RPC nodes.
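As an illustration of the caching and invalidation point, here is a small in-memory sketch. The class `RpcCache`, the helper `fake_rpc` and the key layout are assumptions made for this example; a real deployment would more likely use a shared store and a real JSON-RPC client.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical signature for "something that performs the real RPC call".
RpcFn = Callable[[str, List[Any], int], Any]


class RpcCache:
    """Caches RPC results keyed by (method, params, block number)."""

    def __init__(self, call_rpc: RpcFn) -> None:
        self._call_rpc = call_rpc
        self._store: Dict[str, Tuple[int, Any]] = {}
        self.misses = 0  # number of calls that actually reached the RPC node

    def call(self, method: str, params: List[Any], block_number: int) -> Any:
        key = json.dumps([method, params, block_number])
        if key not in self._store:
            self.misses += 1
            result = self._call_rpc(method, params, block_number)
            self._store[key] = (block_number, result)
        return self._store[key][1]

    def invalidate_from(self, block_number: int) -> int:
        """Drop entries at or above block_number, e.g. after a reorg."""
        stale = [k for k, (blk, _) in self._store.items() if blk >= block_number]
        for k in stale:
            del self._store[k]
        return len(stale)


def fake_rpc(method: str, params: List[Any], block_number: int) -> str:
    # Stand-in for a network call to an RPC node.
    return f"{method}({params[0]})@{block_number}"


cache = RpcCache(fake_rpc)
cache.call("eth_getBalance", ["0xabc"], 100)
cache.call("eth_getBalance", ["0xabc"], 100)  # served from the cache
print(cache.misses)
```

Pinning every cached call to a block number is what makes re-running a backfill cheap: the same historical query maps to the same key, so only the first run reaches the RPC provider.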
🥘 Processing facilities
The environment and language used to write custom processors can significantly increase development speed and reduce the amount of in-house duct-taping. For example, decentralized indexers require a deterministic environment and restricted compilation targets such as WASM (e.g. Graph), which can be detrimental for projects that need to move fast.
Native support for features such as factory tracking, scalable filtering (e.g. for millions of liquidity pool addresses), integrated caching utilities, scheduled time-based jobs, and ERC20 and USD pricing is important to avoid reinventing the wheel in areas that might not be your core business.
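To show what factory tracking and address filtering involve, here is a rough sketch. The factory address, the `PoolCreated` event name and the log shape are hypothetical placeholders for illustration; real indexers decode logs from ABIs and persist the tracked set.

```python
from typing import Dict, Set

# Hypothetical factory address, for illustration only.
FACTORY_ADDRESS = "0xfactory"

# Pools discovered so far through factory events.
tracked_pools: Set[str] = set()


def handle_log(log: Dict[str, str]) -> str:
    """Route one decoded log: track new pools, process tracked ones, skip the rest."""
    address = log["address"].lower()
    if address == FACTORY_ADDRESS and log["event"] == "PoolCreated":
        tracked_pools.add(log["pool"].lower())
        return "tracked"
    if address in tracked_pools:  # average O(1) lookup, even with millions of pools
        return "processed"
    return "skipped"


logs = [
    {"address": "0xPoolA", "event": "Swap"},  # pool not known yet
    {"address": "0xFactory", "event": "PoolCreated", "pool": "0xPoolA"},
    {"address": "0xPoolA", "event": "Swap"},  # now tracked
]
results = [handle_log(log) for log in logs]
print(results)
```

Without built-in support, a team has to write and maintain this bookkeeping itself, including persistence and backfill ordering, which is the duct-taping the section warns about.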
🚒 Backfilling mechanism
One of the challenging aspects of indexing blockchain data is processing historical data. Here are a few important points about historical backfills:
The indexing solution must have proper caching mechanisms when fetching data from RPC nodes, to avoid extra network hops and/or high RPC provider bills. An alternative is an indexer that pre-populates the data so you don't need to call RPCs, but that means adding a new chain requires support from the indexing provider. Even then, processors most often still need to make RPC calls, so caching mechanisms are still required.
If the indexer is built with a parallel design, it can process many blocks and events at the same time (e.g. 20 concurrent executors). If the design is "sequential", as in Graph, events cannot be processed simultaneously by design, because, for example, adding to or subtracting from a running volumeUSD total would otherwise produce wrong results. This can be the difference between 3 months and 25 hours of indexing Polygon data (a real-world instance for Uniswap data).
No matter how many times developers review the code and write tests, there will always be corrupt data or edge-case bugs, sometimes even due to RPC downtime or chain network problems. If the indexer is designed to run partial backfills, you can re-process only certain contracts or block ranges instead of having to truncate the whole database and index from scratch.
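The parallel and partial backfill ideas above can be sketched as follows. This is an illustration under stated assumptions, not how any specific indexer works: `split_range`, `process_chunk` and `backfill` are hypothetical names, and the per-chunk work is a placeholder that pretends every block carries one event worth 1.0 USD. Each worker returns a partial total that is merged at the end, so no worker mutates a shared running total.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_range(start: int, end: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split the inclusive block range [start, end] into inclusive chunks."""
    return [
        (lo, min(lo + chunk_size - 1, end))
        for lo in range(start, end + 1, chunk_size)
    ]


def process_chunk(block_range: Tuple[int, int]) -> float:
    """Placeholder work: pretend each block has one event worth 1.0 USD."""
    lo, hi = block_range
    return float(hi - lo + 1)


def backfill(start: int, end: int, chunk_size: int = 1000, workers: int = 20) -> float:
    """Re-process only [start, end], with up to `workers` chunks in flight."""
    chunks = split_range(start, end, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Merging partial totals avoids concurrent updates to one shared value.
    return sum(partials)


print(split_range(0, 2499, 1000))
print(backfill(0, 2499))
```

Because the block range is an explicit argument, the same function covers a partial backfill: passing only the affected range re-processes that slice without touching the rest of the data.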
🌳 Enrichment and Aggregation approaches
Transactions and events (e.g. swaps) provide you with basic raw data, but most use cases need higher-level aggregations and totals (e.g. total swap volume per pool). It is crucial to have a proper solution for these use cases, ideally a proper data pipeline.
Some solutions take the approach of sequential processing, which means events are processed one by one, one after another. This approach has issues at scale, especially in high-traffic use cases (fast chains or complex aggregations), because it creates too many unnecessary writes to your underlying database.
An ideal enrichment and aggregation solution takes advantage of proven tools and knowledge from Web2 large-scale data processing (Apache Spark, Flink, etc.), which gives you the flexibility to make intentional trade-offs.
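To make the write-amplification argument concrete, here is a small sketch comparing per-event updates with batch aggregation. The sample swaps, pool names and the function `aggregate_volume` are invented for illustration, and the write counts are simple tallies rather than measurements.

```python
from collections import defaultdict
from typing import Dict, List

# Invented sample data: four swaps across two pools.
swaps: List[Dict] = [
    {"pool": "0xPoolA", "volume_usd": 100.0},
    {"pool": "0xPoolB", "volume_usd": 50.0},
    {"pool": "0xPoolA", "volume_usd": 25.0},
    {"pool": "0xPoolA", "volume_usd": 75.0},
]


def aggregate_volume(events: List[Dict]) -> Dict[str, float]:
    """Sum swap volume per pool in memory, before touching the database."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["pool"]] += event["volume_usd"]
    return dict(totals)


totals = aggregate_volume(swaps)

# Sequential processing issues one total update per event;
# batch aggregation issues one upsert per pool per batch.
sequential_writes = len(swaps)
batched_writes = len(totals)
print(totals, sequential_writes, batched_writes)
```

The gap widens with traffic: a busy pool with thousands of swaps per batch still needs a single write for its total when aggregated first.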
🚙 Delivery to your dApp
Indexed data must be readily available for your backend and/or frontend projects to query. There are various approaches:
The highest-level abstraction is usually a GraphQL API, which gives you predefined schemas and GraphQL queries to use. The benefit of such solutions is that less development is needed to get started. The downside is that they are usually opinionated or very limited; for example, access to the underlying SQL database is not easy.
Middle-level solutions provide a generic REST API on top of your indexed data. This reduces the need to maintain a database. The downside is usually network latency on queries and high per-usage pricing, with a limited number of queries/reads per month.
Lowest-level solutions write the data directly to your own database, within your infrastructure. This means you won't be charged for reads/queries since you own the database, and you are able to JOIN the data with your other tables/collections. The downside is that for small projects with fewer than 1m entities, the overhead of running a database on a cloud service might not be desirable.
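The JOIN benefit of the lowest-level approach can be illustrated with a short sketch. It uses an in-memory SQLite database purely as a stand-in for your own database (for example Postgres), and the `users` and `swaps` tables and their columns are hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for a database you own.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Your application's own table.
    CREATE TABLE users (wallet TEXT PRIMARY KEY, email TEXT);
    -- Table written by the indexer (hypothetical schema).
    CREATE TABLE swaps (tx_hash TEXT PRIMARY KEY, wallet TEXT, volume_usd REAL);

    INSERT INTO users VALUES
        ('0xabc', 'alice@example.com'),
        ('0xdef', 'bob@example.com');
    INSERT INTO swaps VALUES
        ('0x01', '0xabc', 100.0),
        ('0x02', '0xabc', 50.0),
        ('0x03', '0xdef', 10.0);
    """
)

# Join indexed on-chain data with off-chain application data in one query.
rows = conn.execute(
    """
    SELECT u.email, SUM(s.volume_usd) AS total_usd
    FROM swaps AS s
    JOIN users AS u ON u.wallet = s.wallet
    GROUP BY u.email
    ORDER BY total_usd DESC
    """
).fetchall()
print(rows)
```

With a GraphQL or REST delivery layer, the same result would typically need two round trips and a merge in application code.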
👷🏻 Infrastructure Maintenance
A properly scalable indexing solution involves various components such as Block/Event Ingestors, RPC Caching Layer, Processing Compute, Database, Logging, Metrics and Monitoring, and Fault-tolerance Mechanisms (redrivers, DLQs, auto-scalers, etc.).
Usually there are a few options:
In-house solutions require at least full-time backend engineers plus full-time DevOps and infra engineers. The bigger and more complex the data, the more you need to scale your infrastructure (e.g. a PoC handling 10k events is vastly different from one handling 5m or 50m).
Open-source solutions such as Graph and Ponder provide you with prebuilt logic and edge-case handling that your engineers don't need to reinvent. You only need to maintain the compute, storage, database and monitoring aspects. Currently, all available open-source options struggle with large amounts of data (100m+ entities or 20+ chains), or when backfilling large historical ranges such as 50m blocks on Polygon.
Cloud-based solutions can reduce maintenance requirements. When evaluating a cloud-based solution, it is very important that its current features and architecture are enough to cover your use case, as you don't want to rely on a third-party roadmap. For example, support for any chain via just an RPC endpoint, or the ability to use any ABI JSON to get decoded events, is a huge plus.
🧘‍♀️ Decentralization
There are various schools of thought and business strategies regarding how decentralized the "reading" aspect of a blockchain must be. The available options are as follows:
Decentralized networks such as Graph and Subsquid give your community peace of mind that if your team goes out of business, another team can pick up the protocol and continue development. The usual downside of these solutions is highly degraded performance (or very high infra costs) due to the overhead of decentralization (including sequential processing, cumbersome token-based billing, etc.).
Centralized solutions, whether open-source or cloud-based, usually provide higher performance and lower overall cost, especially at scale. The downside is that your community does not have a fallback if your team stops supporting the protocol.
Our engineers are happy to discuss these topics in detail for your project.
Happy Indexing 🚀