feat: faster way of figuring out how many new scraper rows were inserted (#4073)
### Description

### Context

Full context / thread: https://discord.com/channels/935678348330434570/1254871503825141923

Copied here:

> So this is kinda interesting: the gas payment indexing in the scraper task is consistently prioritizing hook-based indexing work over block range indexing work. This only manifests in the scraper and for Arbitrum because there are a lot of Arbitrum origin messages on the scraper, and each gas payment indexing tick takes a very long time because of some very suboptimal queries. `store_payments` can take like 10-15 seconds!
>
> You can see batches of `Found log(s) in index range` logs (https://cloudlogging.app.goo.gl/Bo9Q7YwyziSqjsEu7) that correspond with the gas payment contract sync block height (green line) advancing: https://abacusworks.grafana.net/goto/poW4KOQIg?orgId=1
>
> ![image](https://github.com/hyperlane-xyz/hyperlane-monorepo/assets/20362969/65da2e44-5728-4ecc-9c1f-52741805eb5b)
>
> But we're still indexing gas payments because of hook indexing - there are tons of `Found log(s) for tx id` logs: https://cloudlogging.app.goo.gl/QvDaBuf67SjjKviw6
>
> In the scraper's `store_payments`, we bizarrely end up performing two very expensive queries via the calls to `payments_count`, which gets the count of gas payments for the provided domain so we can figure out how many new gas payments were actually inserted: https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/main/rust/agents/scraper/src/db/payment.rs#L27-L64
>
> These queries vary a lot in how expensive they are - sometimes each one takes 1-2 seconds, sometimes up to 10 seconds: https://cloudlogging.app.goo.gl/cDFNH3oxh6WWV3Go8
>
> ![image](https://github.com/hyperlane-xyz/hyperlane-monorepo/assets/20362969/612c7999-21b2-43f4-861f-8b42580e4cb7)
>
> The easiest fix is probably to make `store_payments` less expensive - we probably don't need these calls to `payments_count` to figure out which payments are new, which I'll do now. There's maybe a decent argument for having hook indexing and range indexing not contend with one another, but we should be able to kick that can down the road.
>
> Also annoying: this log https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/main/rust/hyperlane-base/src/contract_sync/mod.rs#L155 never shows up, probably because of some weird instrumentation setup, which makes debugging harder.

### The fix

We want to keep supporting the possibility that we've double-indexed a data type, and to only return the number of new rows inserted.

We also want to keep each indexing task as non-conflicting as possible. We know that only one scraper is running at a time, but a task for each chain and data type executes concurrently, so it's possible for e.g. two gas payment indexing tasks to insert into the same table concurrently - but they'll be for different domains. Concurrent insert statements aren't an issue: the only chance for conflict is the auto-incrementing primary key id, and that is handled properly by the database internals.

An insert result only gives the latest ID resulting from the insertion: no matter how many rows were inserted, it reports the latest ID from that batch, but nothing about how many rows were in the batch. We could get the latest id in the table across all domains (this is cheap) beforehand, do the insert, and then get the latest id again, but this is susceptible to race conditions if concurrent tasks also get an insertion in during that window.

Shoving all of these statements into a transaction doesn't solve the problem either unless there's a lock on the table, and sea-orm doesn't support table-level locks, just select locks. We could also use a CTE to perform the select and the insertion atomically and return the value of the select - but again, sea-orm doesn't seem to provide a nice way of doing this, from what I could tell.

Instead, my suggested fix (sketched below) is:

1. Get the latest id in the table relating to the specific domain (this is cheap, < 100ms).
2. Perform the insertion.
3. Count how many new rows for the specific domain now exist whose id is greater than the id from (1). This is also cheap, < 100ms. If a very long time has passed since the last insertion for the domain, this could take longer, but in practice that doesn't seem to be the case.

If this couple-orders-of-magnitude improvement still poses issues in the future, we can consider other alternatives, probably involving locking or fancier CTE queries.
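A minimal sketch of this pattern with sea-orm, assuming a simplified, hypothetical `gas_payment` entity with just an auto-incrementing `id` and a `domain` column (the real entity and its conflict handling live in `rust/agents/scraper/src/db/payment.rs`):

```rust
use sea_orm::{entity::prelude::*, sea_query::OnConflict, DatabaseConnection, DbErr, QueryOrder};

// Hypothetical, stripped-down stand-in for the scraper's gas payment entity.
#[derive(Clone, Debug, PartialEq, Eq, DeriveEntityModel)]
#[sea_orm(table_name = "gas_payment")]
pub struct Model {
    #[sea_orm(primary_key)]
    pub id: i64,
    pub domain: i32,
}

#[derive(Copy, Clone, Debug, EnumIter, DeriveRelation)]
pub enum Relation {}

impl ActiveModelBehavior for ActiveModel {}

/// Insert `payments` for `domain` and return how many rows were actually new.
pub async fn store_payments(
    db: &DatabaseConnection,
    domain: i32,
    payments: Vec<ActiveModel>,
) -> Result<u64, DbErr> {
    // 1. Latest id for this domain before inserting. The screenshots below
    //    show this as a MAX(id) query; ordering by id desc and taking one
    //    row is equivalent.
    let latest_id_before = Entity::find()
        .filter(Column::Domain.eq(domain))
        .order_by_desc(Column::Id)
        .one(db)
        .await?
        .map(|row| row.id)
        .unwrap_or(0);

    // 2. Insert, letting the database drop rows we've already indexed.
    //    If every row conflicts, sea-orm reports RecordNotInserted; for our
    //    purposes that just means zero new rows, not an error.
    match Entity::insert_many(payments)
        .on_conflict(OnConflict::new().do_nothing().to_owned())
        .exec(db)
        .await
    {
        Ok(_) | Err(DbErr::RecordNotInserted) => {}
        Err(e) => return Err(e),
    }

    // 3. Rows for this domain with an id above the pre-insert max are exactly
    //    the ones this call inserted: ids only grow, and no concurrent task
    //    inserts rows for the same domain.
    Entity::find()
        .filter(Column::Domain.eq(domain))
        .filter(Column::Id.gt(latest_id_before))
        .count(db)
        .await
}
```

The count in step 3 is only race-free because of the invariant stated above: indexing tasks run per chain and data type, so no other task inserts rows for this domain between steps 1 and 3.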
Some example queries and how long they take:

Before (2x of these would occur!):

<img width="477" alt="Screen Shot 2024-06-28 at 12 36 41 PM" src="https://github.com/hyperlane-xyz/hyperlane-monorepo/assets/20362969/5cc8ac35-74fb-4847-9c6f-448deb253a0c">

Now:

1. Max id for a given domain:

<img width="494" alt="Screen Shot 2024-06-28 at 12 37 00 PM" src="https://github.com/hyperlane-xyz/hyperlane-monorepo/assets/20362969/56dbb3ff-b540-458c-809c-56ebfbfd3b0f">

2. The subsequent count, but with an ID filter - note I actually changed this query to look for IDs > 3600000, which includes 55300 rows, and it's still just 102ms!

<img width="632" alt="Screen Shot 2024-06-28 at 12 37 49 PM" src="https://github.com/hyperlane-xyz/hyperlane-monorepo/assets/20362969/11593180-682f-4867-9867-97fd93512802">

### Drive-by changes

- Updated the neutron image along with a deploy there

### Related issues

### Backward compatibility

### Testing

Ran some ad-hoc unit tests of the new query by running against the prod db with a tokio::task. Tested both a new domain and an existing one.
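A sketch of what such an ad-hoc check might look like, reusing the hypothetical entity and `store_payments` from the sketch above and using `#[tokio::test]` in place of a hand-rolled `tokio::task`; the connection string, domain id, and `sample_payments` helper are all placeholders, not the actual test code:

```rust
use sea_orm::{Database, Set};

// Hypothetical helper: build `n` bare-bones payment rows for `domain`.
fn sample_payments(domain: i32, n: usize) -> Vec<ActiveModel> {
    (0..n)
        .map(|_| ActiveModel {
            domain: Set(domain),
            ..Default::default()
        })
        .collect()
}

#[tokio::test]
#[ignore] // ad-hoc only: needs a live database, so don't run in CI
async fn counts_new_rows_for_fresh_and_existing_domains() {
    // Placeholder connection string - point this at a real database.
    let db = Database::connect("postgres://user:pass@localhost/scraper")
        .await
        .expect("failed to connect");

    // A domain with no prior rows: every payment should count as new.
    let inserted = store_payments(&db, 99_999, sample_payments(99_999, 3))
        .await
        .unwrap();
    assert_eq!(inserted, 3);

    // An existing domain: a second batch should count only its own rows,
    // not the rows that were already there before it.
    let more = store_payments(&db, 99_999, sample_payments(99_999, 2))
        .await
        .unwrap();
    assert_eq!(more, 2);
}
```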