feat(api): make API calls non-blocking #6733

Open
xxo1shine wants to merge 1 commit into tronprotocol:develop from xxo1shine:feat/api-rate-limiter-opt

Conversation

Collaborator

@xxo1shine xxo1shine commented Apr 29, 2026

What does this PR do?

Convert the HTTP and gRPC rate-limiter paths from blocking acquire() to non-blocking tryAcquire(), and fix the related issues that surfaced along the way.

  1. Core API switched to non-blocking

    • GlobalRateLimiter.acquire() → GlobalRateLimiter.tryAcquire() (returns boolean)
    • All IRateLimiter.acquire() / strategies / adapters updated to tryAcquire
    • Over-limit requests now return HTTP 429 / gRPC RESOURCE_EXHAUSTED immediately instead of holding the worker thread in a wait queue
  2. Reordered limit checks: per-endpoint first

    • Both RateLimiterServlet.service and RateLimiterInterceptor.interceptCall now check the per-endpoint limit first, and only consult the global limiter on
      success
    • acquireResource = perEndpointAcquired && GlobalRateLimiter.tryAcquire(...)
  3. Fixed semaphore leaks in IPreemptibleRateLimiter

    • HTTP path: the per-endpoint permit must be released even when the global limiter rejects (the finally block now keys off perEndpointAcquired instead of
      acquireResource)
    • gRPC path: explicit release() added on the early-return branch (global reject after per-endpoint acquired) and in the catch block when next.startCall()
      throws
  4. Hardened cache-load failure handling in GlobalRateLimiter / IPQpsStrategy

    • catch (ExecutionException) broadened to catch (Exception) (covers UncheckedExecutionException, etc.)
    • Added logger.warn with the offending IP and e.getMessage()
    • Fail-closed on exception (return false) — the previous code silently fail-opened by creating a fresh, non-cached RateLimiter per call, which effectively
      bypassed rate limiting for that IP after the first failure
  5. Cleanup + test coverage

    • GlobalPreemptibleStrategy: removed the unused @Slf4j and an obsolete timeout constant
    • Deleted AdaptorThread.java (the 20-thread timing-based assertion); replaced by deterministic sequential tests
    • Added RateLimiterServletTest and RateLimiterInterceptorTest; expanded GlobalRateLimiterTest
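The ordering and release discipline described in items 2 and 3 can be sketched with stand-ins (all names here are illustrative, not java-tron's: a java.util.concurrent.Semaphore models the preemptible per-endpoint limiter, and the global limiter is simulated as always rejecting):

```java
import java.util.concurrent.Semaphore;

public class NonBlockingOrderingSketch {
  static final Semaphore perEndpoint = new Semaphore(1); // hypothetical capacity
  static boolean globalTryAcquire() { return false; }    // simulate a global reject

  static boolean handle() {
    // 1. Per-endpoint check first: a throttled endpoint never touches global quota.
    boolean perEndpointAcquired = perEndpoint.tryAcquire();
    boolean acquired = perEndpointAcquired && globalTryAcquire();
    try {
      if (!acquired) {
        return false; // caller maps this to HTTP 429 / RESOURCE_EXHAUSTED
      }
      // ... serve the request ...
      return true;
    } finally {
      // 2. Release keys off perEndpointAcquired, not acquired: the per-endpoint
      // permit must come back even when only the global limiter rejected.
      if (perEndpointAcquired) {
        perEndpoint.release();
      }
    }
  }

  public static void main(String[] args) {
    handle(); // global rejects, but the permit is returned
    System.out.println(perEndpoint.availablePermits()); // prints 1: no leak
  }
}
```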

Why are these changes required?

  1. Blocking acquire() is a tail-latency hazard. Under load, request threads pile up waiting for permits. Once the bounded Netty/Jetty worker pool is exhausted
    by such waiters, the node stops responding to healthy traffic too — and the impact persists past the original burst. Non-blocking tryAcquire returns immediately,
    so threads remain available for legitimate requests.

  2. The old ordering wasted global quota. A request was charged a global token before the per-endpoint check ran. A hot endpoint that was itself throttled could
    still drain the global budget, letting attackers exhaust shared capacity through a single endpoint they were already capped on. Per-endpoint-first closes this gap.

  3. Permit leaks in IPreemptibleRateLimiter are a slow-motion outage. The previous code skipped release() in two paths:

    • the per-endpoint permit on global rejection;
    • the per-endpoint permit when next.startCall() threw in the gRPC path.

    The semaphore drifts monotonically downward; once it reaches zero, every subsequent gRPC call blocks indefinitely. This is a latent availability bug that
    accumulates with time.

  4. Silent fail-open is an anti-abuse anti-pattern. When cache.get(loader) threw, the original code created a fresh RateLimiter without inserting it into the
    cache, so every subsequent call from the failing path got a full burst budget — effectively no rate limit, and no operational signal. Fail-closed plus a warn log
    makes the failure both safe and observable.

  5. The old AdaptorTest was a flaky test. 20 concurrent threads plus assertTrue(t1 - t0 > 4000) is highly sensitive to CI scheduler jitter and produced
    sporadic failures. Replaced with sequential tryAcquire calls that assert deterministic burst behavior of Guava's SmoothBursty, so CI is stable.
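The deterministic test pattern in point 5 can be illustrated without Guava on the classpath (a Semaphore with 2 permits stands in for a limiter with a burst budget of 2; the real tests exercise SmoothBursty itself):

```java
import java.util.concurrent.Semaphore;

public class DeterministicLimiterTest {
  // Sequential tryAcquire calls against a fixed burst budget: the results are
  // booleans, so the assertions need no sleeping and no wall-clock thresholds.
  static boolean[] acquireThree() {
    Semaphore burst = new Semaphore(2); // stand-in burst budget of 2
    return new boolean[] {
      burst.tryAcquire(),  // true: 1st permit
      burst.tryAcquire(),  // true: 2nd permit
      burst.tryAcquire()   // false: budget exhausted, rejected immediately
    };
  }

  public static void main(String[] args) {
    boolean[] r = acquireThree();
    System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints "true true false"
  }
}
```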

This PR has been tested by:

  • Unit Tests
  • Manual Testing

Follow up

Extra details

Fixes #6363

Collaborator

@waynercheung waynercheung left a comment


LGTM

super.service(req, resp);
Metrics.histogramObserve(requestTimer);
} else {
resp.getWriter()
Collaborator


[MUST] The rejection branch never sets a status code, so HttpServletResponse defaults to 200 OK — clients receive:

  HTTP/1.1 200 OK
  Content-Type: application/json; charset=utf-8

  {"Error":"class java.lang.IllegalAccessException : lack of computing resources"}

This contradicts the expected non-blocking behavior where rate-limited requests should return 429 Too Many Requests.
Standard HTTP clients (OkHttp, axios, fetch, …) treat 200 as success and won't trigger retry / back-pressure logic.

Suggested fix:

    resp.setStatus(HttpServletResponse.SC_TOO_MANY_REQUESTS); // 429
    resp.setHeader("Retry-After", "1"); // retry time
    resp.getWriter().println(Util.printErrorMsg(
        new IllegalAccessException("lack of computing resources")));

Collaborator


JSON-RPC rejection response is not JSON-RPC 2.0 compliant

JsonRpcServlet, JsonRpcOnSolidityServlet, and JsonRpcOnPBFTServlet all extend RateLimiterServlet and don't override service(). When rate-limited, they hit the same rejection branch in RateLimiterServlet.service.

Spec-compliant JSON-RPC clients expect:
{"jsonrpc":"2.0","error":{"code":-32005,"message":"Limit exceeded"},"id":<request id>}

Suggested approach:

Extract the rejection handling logic in RateLimiterServlet into a protected hook method. This allows subclasses such as JsonRpcServlet to override it and return a JSON-RPC 2.0–compliant error response.
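The error body such an override would emit can be sketched as a pure helper (this is the reviewer's suggestion, not existing java-tron code; the field names follow the JSON-RPC 2.0 spec, and -32005 "Limit exceeded" is the conventional rate-limit error code):

```java
public class JsonRpcErrorBody {
  // Builds a JSON-RPC 2.0-compliant rate-limit error. The request id is passed
  // through verbatim (already-serialized JSON: a number, quoted string, or null).
  static String limitExceeded(String requestId) {
    return String.format(
        "{\"jsonrpc\":\"2.0\",\"error\":{\"code\":-32005,"
            + "\"message\":\"Limit exceeded\"},\"id\":%s}",
        requestId == null ? "null" : requestId);
  }

  public static void main(String[] args) {
    System.out.println(limitExceeded("1"));
    // prints {"jsonrpc":"2.0","error":{"code":-32005,"message":"Limit exceeded"},"id":1}
  }
}
```

A hypothetical protected hook in RateLimiterServlet would then call this from the JSON-RPC subclasses while the base class keeps the plain-JSON body.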

Collaborator Author


Thank you for the suggestion. However, it is unrelated to the issue this PR fixes; it can be discussed in a separate issue later.

Collaborator


I'd respectfully push back — this is the PR's stated behavior, just not implemented yet.
The PR description (bullet 1) reads:

"Over-limit requests now return HTTP 429 / gRPC RESOURCE_EXHAUSTED immediately instead of holding the worker thread in a wait queue"

@0xbigapple
Collaborator

[SHOULD] invalid global.qps brings down all APIs with opaque 500s.

Not introduced by this PR. Flagging here because this PR is the right window to fix it (small change, already touching this logic).

Reproduction

Setting rate.limiter.global.qps = 0.001 (or just 0) in config.conf:

  • Node starts cleanly (no error in logs)
  • First HTTP / gRPC request → HTTP 500
  • Every subsequent request → HTTP 500
  • Limiter never functions

Root cause

Two issues compound:

(a) RateLimiterConfig.GlobalConfig.qps is declared as int (common/src/main/java/org/tron/core/config/args/RateLimiterConfig.java:28,35,41). typesafe config silently truncates 0.001 to 0 via (long) 0.001 cast in ConfigDouble.longValue():

"0.001" → ConfigDouble(0.001)
        → SimpleConfig.getInt → intValueRangeChecked → (long) 0.001 = 0L → return 0
        → PARAMETER.rateLimiterGlobalQps = 0

intValueRangeChecked only validates the long-to-int range; it does not check whether the double is whole. No exception, no warning.
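The narrowing step in this trace is plain Java cast semantics, reproducible in isolation (names here are illustrative; this is not the typesafe-config source):

```java
public class TruncationDemo {
  // The same double-to-long-to-int narrowing the int accessor path performs:
  // Java casts round toward zero, so any fractional value below 1 becomes 0
  // with no exception and no warning.
  static int truncate(double configured) {
    return (int) (long) configured;
  }

  public static void main(String[] args) {
    System.out.println(truncate(0.001)); // prints 0
    System.out.println(truncate(0.5));   // prints 0
  }
}
```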

(b) GlobalRateLimiter initializes its static RateLimiter field at class-load time:

private static RateLimiter rateLimiter = RateLimiter.create(QPS);

RateLimiter.create(0) throws IllegalArgumentException("rate must be positive"). JVM wraps it as ExceptionInInitializerError, putting the class in a permanently erroneous state:

  Stage                                  Result
  Node startup                           ✅ Succeeds (GlobalRateLimiter is lazy-loaded)
  First request                          Triggers class init → ExceptionInInitializerError → 500
  Subsequent requests                    NoClassDefFoundError → 500
  Servlet catch (Exception unexpected)   Doesn't catch — ExceptionInInitializerError is an Error, not an Exception
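This failure mode is standard JVM class-initialization behavior and can be reproduced in a few lines (class and field names here are illustrative stand-ins, not java-tron's):

```java
public class InitFailureDemo {
  static class BadLimiter {
    // A static initializer that throws poisons the class permanently.
    static final double RATE = init();
    static double init() {
      throw new IllegalArgumentException("rate must be positive");
    }
  }

  // Touch the class twice and report what each access throws. The catch is
  // Throwable on purpose: both failures are Errors, so a servlet-style
  // catch (Exception) would never see them.
  static String[] touchTwice() {
    String[] results = new String[2];
    for (int i = 0; i < 2; i++) {
      try {
        results[i] = "rate=" + BadLimiter.RATE;
      } catch (Throwable t) {
        results[i] = t.getClass().getSimpleName();
      }
    }
    return results;
  }

  public static void main(String[] args) {
    String[] r = touchTwice();
    System.out.println(r[0]); // ExceptionInInitializerError on first access
    System.out.println(r[1]); // NoClassDefFoundError on every later access
  }
}
```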

Suggested fix

Must-have — fail-fast validation in Args.applyRateLimiterConfig:

private static void applyRateLimiterConfig(RateLimiterConfig rl) {
  int qps    = rl.getGlobal().getQps();
  int ipQps  = rl.getGlobal().getIp().getQps();
  int apiQps = rl.getGlobal().getApi().getQps();
  if (qps <= 0 || ipQps <= 0 || apiQps <= 0) {
    throw new TronError(
        String.format("rate.limiter.global qps must be positive: qps=%d, ip.qps=%d, api.qps=%d",
            qps, ipQps, apiQps),
        TronError.ErrCode.RATE_LIMITER_INIT);
  }
  ...
}

This alone resolves the catastrophic "node up, all APIs 500" mode: the same misconfiguration now produces a clear startup error instead of opaque runtime failures. Users who write 0.5 will see qps must be positive (slightly misleading because the typed value was positive, but truncated to 0) — acceptable trade-off for a one-line guard.

Nice-to-have — widen field type from int to double:

GlobalConfig.qps is currently int while per-endpoint adapters already accept fractional QPS via paramString = "qps=0.5". Making the global field a double removes the silent-truncation surprise and makes the stack consistent. But for global.qps specifically (default 50000, intended as a node-level safety cap), fractional values have no real use case — this can land as a follow-up if it complicates review.

Happy to follow up in a separate PR if it's out of scope here, but flagging because the failure mode (node up, all APIs 500, no startup error) is hard to diagnose in the field.

@halibobo1205 halibobo1205 added the topic:api rpc/http related issue label Apr 30, 2026
@halibobo1205 halibobo1205 added this to the GreatVoyage-v4.8.2 milestone Apr 30, 2026
@xxo1shine
Collaborator Author

This issue is unrelated to the current fix; administrators need to configure settings appropriately. Whether it needs fixing, and how, can be discussed in a separate issue.

// Check per-endpoint first to avoid consuming global IP/QPS quota for requests
// that would be rejected by the per-endpoint limiter anyway.
boolean perEndpointAcquired = rateLimiter == null || rateLimiter.tryAcquire(runtimeData);
boolean acquireResource = perEndpointAcquired && GlobalRateLimiter.tryAcquire(runtimeData);
Collaborator


(commenting on line 98 because the affected line, 113, is outside the diff hunk)

[SHOULD] HTTP rejection should set 429 status, not implicit 200

The PR description says "Over-limit requests now return HTTP 429", but this branch only writes an error JSON body via resp.getWriter().println(...) and never calls resp.setStatus(...). The HTTP status stays at the default 200, so client SDKs cannot back off based on status code, gateways and Prometheus templates classify rate-limited requests as success, and the behavior is asymmetric with the gRPC path that returns RESOURCE_EXHAUSTED.

Suggestion: call resp.setStatus(HttpServletResponse.SC_TOO_MANY_REQUESTS) here (and consider a Retry-After header), or update the PR description to remove the 429 claim.

}
}
};
} catch (Exception e) {
Collaborator


[SHOULD] catch block on startCall failure should close the gRPC call

When next.startCall(call, headers) throws, this catch block releases the permit, marks metrics, and logs, but never invokes call.close(...). The returned listener is the empty new ServerCall.Listener<ReqT>() {} from line 107, so the client receives no status / trailers and only fails when the transport-level deadline fires. This is a pre-existing issue, but since the catch block is being rewritten in this PR (release added), it is a good moment to close it properly.

Suggestion: call call.close(Status.fromCode(Code.INTERNAL).withDescription("rpc handler init failed"), new Metadata()); in the catch block, and add a test that verifies call.close is invoked on startCall exceptions.

Cache<String, RateLimiter> freshCache = CacheBuilder.newBuilder()
.maximumSize(10000).expireAfterWrite(1, TimeUnit.HOURS).build();
cacheField.set(null, freshCache);
}
Collaborator


[SHOULD] Reset GlobalRateLimiter static fields in @AfterClass to avoid singleton pollution

This class reflectively writes IP_QPS, rateLimiter, and cache in @Before, but @AfterClass only calls Args.clearParam(). After this class finishes, those static fields stay at the last test's values (e.g. globalQps=10 / ipQps=1). Any other test in the same surefire fork that calls GlobalRateLimiter.tryAcquire without mockStatic will read the polluted state and behave non-deterministically — exactly the kind of cross-class flake that surfaces in CI and not locally.

Suggestion: in @AfterClass, call a helper that restores the three static fields to production defaults (re-read from Args) so the JVM state on exit matches the state on entry.
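The shape of such a helper, on a stand-in class (the real code would target GlobalRateLimiter's fields and re-read the defaults from Args; all names below are illustrative):

```java
import java.lang.reflect.Field;

public class StaticFieldReset {
  static class FakeLimiter {
    static double IP_QPS = 50; // "production default" in this sketch
  }

  // Reflectively restore a static field to a known default, mirroring how the
  // test class writes it in @Before; called from @AfterClass in the real test.
  static void resetIpQps(double defaultValue) throws Exception {
    Field f = FakeLimiter.class.getDeclaredField("IP_QPS");
    f.setAccessible(true);
    f.set(null, defaultValue); // null receiver: static field
  }

  public static void main(String[] args) throws Exception {
    FakeLimiter.IP_QPS = 1;  // a test lowered the limit
    resetIpQps(50);          // @AfterClass restores the default
    System.out.println(FakeLimiter.IP_QPS); // prints 50.0
  }
}
```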

if (rateLimiter instanceof IPreemptibleRateLimiter && perEndpointAcquired) {
((IPreemptibleRateLimiter) rateLimiter).release();
}
call.close(Status.fromCode(Code.RESOURCE_EXHAUSTED), new Metadata());
Collaborator


[NIT] No metrics on rate-limit reject paths

The early-return path here (and the symmetric HTTP one in RateLimiterServlet.service) closes the call with RESOURCE_EXHAUSTED but emits no metrics. MetricsUtil.meterMark(NET_API_FAIL_QPS) only fires on startCall exceptions, so operators cannot observe rate-limit hit rate to tune thresholds. The PR claims observability as a benefit but reject metrics are still missing.

Suggestion: add a MetricsUtil.meterMark(...) (reuse NET_API_FAIL_QPS or introduce NET_API_RATE_LIMIT_REJECT_QPS) on both reject branches; can be a follow-up if out of scope.

if (rateLimiter instanceof IPreemptibleRateLimiter && acquireResource) {
// Release whenever the per-endpoint permit was acquired (covers both the normal
// completion path and the case where GlobalRateLimiter rejected the request).
if (rateLimiter instanceof IPreemptibleRateLimiter && perEndpointAcquired) {
Collaborator


(commenting on line 122 because the affected line, 113, is outside the diff hunk)

[NIT] IllegalAccessException is the wrong exception type for rate limiting

Util.printErrorMsg(new IllegalAccessException("lack of computing resources")) makes the response body contain java.lang.IllegalAccessException: lack of computing resources. IllegalAccessException is the standard JDK exception for reflective access failure and is unrelated to throttling, so the body misleads SDK authors and is operator-facing rather than client-facing. Since this branch is being rewritten anyway, swapping it now is the cheap moment.

Suggestion: use a clearer message or a dedicated RateLimitedException (or fold this into the setStatus(429) change so the body just says "rate limit exceeded").

try {
// cache.get is atomic: only one loader executes per key under concurrent requests,
// preventing multiple RateLimiter instances from being created for the same IP.
r = cache.get(ip, () -> RateLimiter.create(IP_QPS));
Collaborator


[NIT] cache.get atomicity claim has no concurrent regression test

The new cache.get(ip, () -> RateLimiter.create(IP_QPS)) (and the symmetric one in IPQpsStrategy.tryAcquire) relies on Guava's documented loader atomicity to fix the original getIfPresent + put race. The behavior is correct, but no test asserts it — only the documentation does. A future Guava upgrade or an accidental refactor back to getIfPresent + put would silently regress.

Suggestion: add a small concurrent test (CountDownLatch + 2+ threads + AtomicInteger counting loader invocations) that asserts the loader runs at most once per key under concurrent access.
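The shape of that test, sketched without Guava on the classpath (ConcurrentHashMap.computeIfAbsent stands in for cache.get since it gives the same per-key loader atomicity the comment relies on; the real test would target the Guava cache):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LoaderAtomicityTest {
  // Launch N threads against the same key at once and count how many times the
  // loader actually runs; with an atomic loader the answer must be exactly 1.
  static int countLoaderCalls(int threads) throws Exception {
    ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
    AtomicInteger loaderCalls = new AtomicInteger();
    CountDownLatch start = new CountDownLatch(1);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.submit(() -> {
        start.await(); // line up all threads before the first lookup
        cache.computeIfAbsent("203.0.113.7", ip -> {
          loaderCalls.incrementAndGet(); // count loader invocations
          return new Object();           // stand-in for RateLimiter.create(IP_QPS)
        });
        return null;
      });
    }
    start.countDown(); // release every thread at once
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    return loaderCalls.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countLoaderCalls(8)); // prints 1: loader ran once per key
  }
}
```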

@Slf4j
public class GlobalRateLimiter {

private static double QPS = Args.getInstance().getRateLimiterGlobalQps();
Collaborator


(commenting on line 14 because the affected line, 16, is outside the diff hunk)

[NIT] IP_QPS<=0 misconfiguration causes a silent fail-closed warn storm

If node.rateLimiter.global.IP.qps is misconfigured to 0 or a negative value, RateLimiter.create(IP_QPS) throws IllegalArgumentException. With the new fail-closed behavior, every new IP triggers the loader, fails, and emits a warn — so under hot traffic the log gets flooded and every client is rejected. The root cause (a bad config) is buried inside an in-flight code path instead of being reported at startup.

Suggestion: validate IP_QPS > 0 (and QPS > 0) at static initialization or Args parsing, and fail-fast if invalid, so misconfiguration surfaces before the first request.


verify(perEndpoint, times(1)).release();
}
}
Collaborator


[NIT] Missing test: handler RuntimeException must still release the permit

The new test class covers the per-endpoint reject, global reject, and happy path, but never lets super.service (i.e. the user handler) throw a RuntimeException. The finally block does release the permit on that path (because perEndpointAcquired is true), but no assertion guards against future regressions if someone reorders the catch / finally.

Suggestion: add a testHandlerExceptionStillReleasesPermit case where doGet throws RuntimeException and verify(perEndpoint, times(1)).release() runs.
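The structure under test, reduced to a runnable sketch (a Semaphore models the preemptible per-endpoint limiter and a Runnable models the handler; names are illustrative, mirroring the servlet's try/finally rather than calling it):

```java
import java.util.concurrent.Semaphore;

public class HandlerExceptionReleaseTest {
  static final Semaphore perEndpoint = new Semaphore(1);

  // Simplified mirror of RateLimiterServlet.service's acquire/handle/release
  // structure: the finally block keys off perEndpointAcquired, so the permit
  // comes back even when the handler throws.
  static void service(Runnable handler) {
    boolean perEndpointAcquired = perEndpoint.tryAcquire();
    try {
      if (perEndpointAcquired) {
        handler.run(); // may throw RuntimeException
      }
    } finally {
      if (perEndpointAcquired) {
        perEndpoint.release(); // must run even on the exception path
      }
    }
  }

  public static void main(String[] args) {
    try {
      service(() -> { throw new RuntimeException("handler blew up"); });
    } catch (RuntimeException expected) {
      // the exception propagates, but the permit is already back
    }
    System.out.println(perEndpoint.availablePermits()); // prints 1: no leak
  }
}
```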


Labels

topic:api rpc/http related issue


Development

Successfully merging this pull request may close these issues.

HTTP calls experience delay due to rate limiting

5 participants