feat(api): make API calls non-blocking #6733
xxo1shine wants to merge 1 commit into tronprotocol:develop from
Conversation
  super.service(req, resp);
  Metrics.histogramObserve(requestTimer);
} else {
  resp.getWriter()
[MUST] The rejection branch never sets a status code, so HttpServletResponse defaults to 200 OK — clients receive:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{"Error":"class java.lang.IllegalAccessException : lack of computing resources"}
This contradicts the expected non-blocking behavior where rate-limited requests should return 429 Too Many Requests.
Standard HTTP clients (OkHttp, axios, fetch, …) treat 200 as success and won't trigger retry / back-pressure logic.
Suggested fix:
resp.setStatus(429); // 429 Too Many Requests (javax.servlet HttpServletResponse defines no SC_ constant for it)
resp.setHeader("Retry-After", "1"); // seconds until the client may retry
resp.getWriter().println(Util.printErrorMsg(
    new IllegalAccessException("lack of computing resources")));
JSON-RPC rejection response is not JSON-RPC 2.0 compliant
JsonRpcServlet, JsonRpcOnSolidityServlet, and JsonRpcOnPBFTServlet all extend RateLimiterServlet and don't override service(). When rate-limited, they hit the same rejection branch in RateLimiterServlet.service.
Spec-compliant JSON-RPC clients expect:
{"jsonrpc":"2.0","error":{"code":-32005,"message":"Limit exceeded"},"id":<request id>}
Suggested approach:
Extract the rejection handling logic in RateLimiterServlet into a protected hook method. This allows subclasses such as JsonRpcServlet to override it and return a JSON-RPC 2.0–compliant error response.
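The hook-method refactor suggested above can be sketched with simplified stand-ins for the servlet types. All names below (BaseServlet, handleRejection, the return-a-string service) are hypothetical; the real change would add a protected method to RateLimiterServlet and override it in JsonRpcServlet:

```java
// Sketch: template-method hook for rate-limit rejection handling.
// The base class emits a generic error body by default; a JSON-RPC
// subclass overrides the hook to emit a JSON-RPC 2.0 error object.
public class RejectionHookSketch {
  static class BaseServlet {
    // Hook: subclasses override this to customize the rejection payload.
    protected String handleRejection() {
      return "{\"Error\":\"rate limit exceeded\"}";
    }

    final String service(boolean permitAcquired) {
      if (!permitAcquired) {
        return handleRejection(); // rejection branch stays in one place
      }
      return "OK";
    }
  }

  static class JsonRpcServlet extends BaseServlet {
    @Override
    protected String handleRejection() {
      // JSON-RPC 2.0 compliant error object; -32005 = "Limit exceeded"
      return "{\"jsonrpc\":\"2.0\",\"error\":{\"code\":-32005,"
          + "\"message\":\"Limit exceeded\"},\"id\":null}";
    }
  }

  public static void main(String[] args) {
    System.out.println(new BaseServlet().service(false));
    System.out.println(new JsonRpcServlet().service(false));
  }
}
```

The base class keeps the rate-limiting decision; subclasses only customize the payload, so the HTTP 429 change and the JSON-RPC body change stay independent.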
Thank you for your suggestion. However, it is unrelated to the fix in this PR; it can be discussed in a separate issue later.
I'd respectfully push back — this is the PR's stated behavior, just not implemented yet.
The PR description (bullet 1) reads:
"Over-limit requests now return HTTP 429 / gRPC RESOURCE_EXHAUSTED immediately instead of holding the worker thread in a wait queue"
[SHOULD] Not introduced by this PR. Flagging here because the PR is the right window to fix it (small change, already touching this logic).

Reproduction
Setting

Root cause
Two issues compound:
(a)
(b) private static RateLimiter rateLimiter = RateLimiter.create(QPS);

Suggested fix
Must-have — fail-fast validation in:

private static void applyRateLimiterConfig(RateLimiterConfig rl) {
  int qps = rl.getGlobal().getQps();
  int ipQps = rl.getGlobal().getIp().getQps();
  int apiQps = rl.getGlobal().getApi().getQps();
  if (qps <= 0 || ipQps <= 0 || apiQps <= 0) {
    throw new TronError(
        String.format("rate.limiter.global qps must be positive: qps=%d, ip.qps=%d, api.qps=%d",
            qps, ipQps, apiQps),
        TronError.ErrCode.RATE_LIMITER_INIT);
  }
  ...
}

This alone resolves the catastrophic "node up, all APIs 500" mode: the same misconfiguration now produces a clear startup error instead of opaque runtime failures. Users who write

Nice-to-have — widen field type from
Happy to follow up in a separate PR if it's out of scope here, but flagging because the failure mode (node up, all APIs 500, no startup error) is hard to diagnose in the field.
This issue is unrelated to the current fix; administrators need to configure settings appropriately. It can be discussed in a separate issue to decide whether it needs fixing, and how.
// Check per-endpoint first to avoid consuming global IP/QPS quota for requests
// that would be rejected by the per-endpoint limiter anyway.
boolean perEndpointAcquired = rateLimiter == null || rateLimiter.tryAcquire(runtimeData);
boolean acquireResource = perEndpointAcquired && GlobalRateLimiter.tryAcquire(runtimeData);
(commenting on line 98 because line 113, where the issue occurs, is outside the diff hunk)
[SHOULD] HTTP rejection should set 429 status, not implicit 200
The PR description says "Over-limit requests now return HTTP 429", but this branch only writes an error JSON body via resp.getWriter().println(...) and never calls resp.setStatus(...). The HTTP status stays at the default 200, so client SDKs cannot back off based on status code, gateways and Prometheus templates classify rate-limited requests as success, and the behavior is asymmetric with the gRPC path that returns RESOURCE_EXHAUSTED.
Suggestion: call resp.setStatus(429) (Too Many Requests) here (and consider a Retry-After header), or update the PR description to remove the 429 claim.
      }
    }
  };
} catch (Exception e) {
[SHOULD] catch block on startCall failure should close the gRPC call
When next.startCall(call, headers) throws, this catch block releases the permit, marks metrics, and logs, but never invokes call.close(...). The returned listener is the empty new ServerCall.Listener<ReqT>() {} from line 107, so the client receives no status / trailers and only fails when the transport-level deadline fires. This is a pre-existing issue, but since the catch block is being rewritten in this PR (release added), it is a good moment to close it properly.
Suggestion: call call.close(Status.fromCode(Code.INTERNAL).withDescription("rpc handler init failed"), new Metadata()); in the catch block, and add a test that verifies call.close is invoked on startCall exceptions.
Cache<String, RateLimiter> freshCache = CacheBuilder.newBuilder()
    .maximumSize(10000).expireAfterWrite(1, TimeUnit.HOURS).build();
cacheField.set(null, freshCache);
}
[SHOULD] Reset GlobalRateLimiter static fields in @AfterClass to avoid singleton pollution
This class reflectively writes IP_QPS, rateLimiter, and cache in @Before, but @AfterClass only calls Args.clearParam(). After this class finishes, those static fields stay at the last test's values (e.g. globalQps=10 / ipQps=1). Any other test in the same surefire fork that calls GlobalRateLimiter.tryAcquire without mockStatic will read the polluted state and behave non-deterministically — exactly the kind of cross-class flake that surfaces in CI and not locally.
Suggestion: in @AfterClass, call a helper that restores the three static fields to production defaults (re-read from Args) so the JVM state on exit matches the state on entry.
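The reset-helper idea can be sketched with plain reflection against a stand-in class. FakeGlobalRateLimiter, setStatic, and getStatic are hypothetical names; the real helper would target GlobalRateLimiter's IP_QPS, rateLimiter, and cache fields and re-read production defaults from Args:

```java
import java.lang.reflect.Field;

// Sketch: mutate a static field in @Before, then restore it in @AfterClass
// so the JVM-wide singleton state on exit matches the state on entry.
public class StaticResetSketch {
  static class FakeGlobalRateLimiter {
    private static double IP_QPS = 1000.0; // stand-in production default
  }

  // Write a private static field reflectively (as the test's @Before does).
  static void setStatic(Class<?> cls, String name, Object value) throws Exception {
    Field f = cls.getDeclaredField(name);
    f.setAccessible(true);
    f.set(null, value);
  }

  // Read it back, e.g. to assert the reset actually happened.
  static double getStatic(Class<?> cls, String name) throws Exception {
    Field f = cls.getDeclaredField(name);
    f.setAccessible(true);
    return (double) f.get(null);
  }

  public static void main(String[] args) throws Exception {
    setStatic(FakeGlobalRateLimiter.class, "IP_QPS", 1.0);    // test mutates state
    setStatic(FakeGlobalRateLimiter.class, "IP_QPS", 1000.0); // @AfterClass-style reset
    System.out.println(getStatic(FakeGlobalRateLimiter.class, "IP_QPS"));
  }
}
```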
if (rateLimiter instanceof IPreemptibleRateLimiter && perEndpointAcquired) {
  ((IPreemptibleRateLimiter) rateLimiter).release();
}
call.close(Status.fromCode(Code.RESOURCE_EXHAUSTED), new Metadata());
[NIT] No metrics on rate-limit reject paths
The early-return path here (and the symmetric HTTP one in RateLimiterServlet.service) closes the call with RESOURCE_EXHAUSTED but emits no metrics. MetricsUtil.meterMark(NET_API_FAIL_QPS) only fires on startCall exceptions, so operators cannot observe rate-limit hit rate to tune thresholds. The PR claims observability as a benefit but reject metrics are still missing.
Suggestion: add a MetricsUtil.meterMark(...) (reuse NET_API_FAIL_QPS or introduce NET_API_RATE_LIMIT_REJECT_QPS) on both reject branches; can be a follow-up if out of scope.
if (rateLimiter instanceof IPreemptibleRateLimiter && acquireResource) {
// Release whenever the per-endpoint permit was acquired (covers both the normal
// completion path and the case where GlobalRateLimiter rejected the request).
if (rateLimiter instanceof IPreemptibleRateLimiter && perEndpointAcquired) {
(commenting on line 122 because line 113, where the issue occurs, is outside the diff hunk)
[NIT] IllegalAccessException is the wrong exception type for rate limiting
Util.printErrorMsg(new IllegalAccessException("lack of computing resources")) makes the response body contain java.lang.IllegalAccessException: lack of computing resources. IllegalAccessException is the standard JDK exception for reflective access failure and is unrelated to throttling, so the body misleads SDK authors and is operator-facing rather than client-facing. Since this branch is being rewritten anyway, swapping it now is the cheap moment.
Suggestion: use a clearer message or a dedicated RateLimitedException (or fold this into the setStatus(429) change so the body just says "rate limit exceeded").
try {
  // cache.get is atomic: only one loader executes per key under concurrent requests,
  // preventing multiple RateLimiter instances from being created for the same IP.
  r = cache.get(ip, () -> RateLimiter.create(IP_QPS));
[NIT] cache.get atomicity claim has no concurrent regression test
The new cache.get(ip, () -> RateLimiter.create(IP_QPS)) (and the symmetric one in IPQpsStrategy.tryAcquire) relies on Guava's documented loader atomicity to fix the original getIfPresent + put race. The behavior is correct, but no test asserts it — only the documentation does. A future Guava upgrade or an accidental refactor back to getIfPresent + put would silently regress.
Suggestion: add a small concurrent test (CountDownLatch + 2+ threads + AtomicInteger counting loader invocations) that asserts the loader runs at most once per key under concurrent access.
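The suggested concurrent test can be sketched as follows. Since Guava may not be on the classpath here, ConcurrentHashMap.computeIfAbsent stands in for cache.get(key, loader) — both guarantee at most one loader execution per key under concurrent access:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: N threads race on the same key; the loader must run exactly once.
public class LoaderAtomicitySketch {
  static final AtomicInteger loaderCalls = new AtomicInteger();
  static final Map<String, Object> cache = new ConcurrentHashMap<>();

  static int race(int threads) throws Exception {
    loaderCalls.set(0);
    cache.clear();
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.submit(() -> {
        try {
          start.await(); // line everyone up before racing
          cache.computeIfAbsent("203.0.113.7", ip -> {
            loaderCalls.incrementAndGet(); // count loader invocations
            return new Object();           // stand-in for RateLimiter.create(IP_QPS)
          });
        } catch (InterruptedException ignored) {
        } finally {
          done.countDown();
        }
      });
    }
    start.countDown();
    done.await();
    pool.shutdown();
    return loaderCalls.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(race(8)); // loader ran exactly once for the key
  }
}
```

A regression back to getIfPresent + put would make loaderCalls exceed 1 under this race, failing the assertion deterministically enough to catch in CI.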
@Slf4j
public class GlobalRateLimiter {

  private static double QPS = Args.getInstance().getRateLimiterGlobalQps();
(commenting on line 14 because line 16, where the issue occurs, is outside the diff hunk)
[NIT] IP_QPS<=0 misconfiguration causes a silent fail-closed warn storm
If node.rateLimiter.global.IP.qps is misconfigured to 0 or a negative value, RateLimiter.create(IP_QPS) throws IllegalArgumentException. With the new fail-closed behavior, every new IP triggers the loader, fails, and emits a warn — so under hot traffic the log gets flooded and every client is rejected. The root cause (a bad config) is buried inside an in-flight code path instead of being reported at startup.
Suggestion: validate IP_QPS > 0 (and QPS > 0) at static initialization or Args parsing, and fail-fast if invalid, so misconfiguration surfaces before the first request.
    verify(perEndpoint, times(1)).release();
  }
}
[NIT] Missing test: handler RuntimeException must still release the permit
The new test class covers the per-endpoint reject, global reject, and happy path, but never lets super.service (i.e. the user handler) throw a RuntimeException. The finally block does release the permit on that path (because perEndpointAcquired is true), but no assertion guards against future regressions if someone reorders the catch / finally.
Suggestion: add a testHandlerExceptionStillReleasesPermit case where doGet throws RuntimeException and verify(perEndpoint, times(1)).release() runs.
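The property the suggested test should pin down can be sketched with a plain Semaphore standing in for the per-endpoint limiter (service and the handler lambda are illustrative, not the real servlet API):

```java
import java.util.concurrent.Semaphore;

// Sketch: even when the handler throws, the finally block must release
// the per-endpoint permit, or the semaphore drifts toward zero over time.
public class ReleaseOnExceptionSketch {
  static final Semaphore perEndpoint = new Semaphore(1);

  static void service(Runnable handler) {
    boolean perEndpointAcquired = perEndpoint.tryAcquire();
    if (!perEndpointAcquired) {
      return; // rejected: nothing was acquired, nothing to release
    }
    try {
      handler.run(); // may throw, like doGet in the suggested test case
    } finally {
      perEndpoint.release(); // keyed off perEndpointAcquired in the real code
    }
  }

  public static void main(String[] args) {
    try {
      service(() -> { throw new RuntimeException("handler failure"); });
    } catch (RuntimeException expected) {
      // the exception propagates to the caller, but the permit is back
    }
    System.out.println(perEndpoint.availablePermits()); // 1 => no leak
  }
}
```

Reordering the catch/finally or switching the release condition back to acquireResource would leave availablePermits at 0 here, which is exactly the regression the proposed testHandlerExceptionStillReleasesPermit should catch.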
What does this PR do?

Convert the HTTP and gRPC rate-limiter paths from blocking acquire() to non-blocking tryAcquire(), and fix the related issues that surfaced along the way.

Core API switched to non-blocking
- GlobalRateLimiter.acquire() → GlobalRateLimiter.tryAcquire() (returns boolean)
- IRateLimiter.acquire() / strategies / adapters updated to tryAcquire
- Over-limit requests now return HTTP 429 / gRPC RESOURCE_EXHAUSTED immediately instead of holding the worker thread in a wait queue

Reordered limit checks: per-endpoint first
- RateLimiterServlet.service and RateLimiterInterceptor.interceptCall now check the per-endpoint limit first, and only consult the global limiter on success:
  acquireResource = perEndpointAcquired && GlobalRateLimiter.tryAcquire(...)

Fixed semaphore leaks in IPreemptibleRateLimiter
- finally block now keys off perEndpointAcquired instead of acquireResource
- release() added on the early-return branch (global reject after per-endpoint acquired) and in the catch block when next.startCall() throws

Hardened cache-load failure handling in GlobalRateLimiter / IPQpsStrategy
- catch (ExecutionException) broadened to catch (Exception) (covers UncheckedExecutionException, etc.)
- logger.warn with the offending IP and e.getMessage()
- Fail-closed (return false) — the previous code silently fail-opened by creating a fresh, non-cached RateLimiter per call, which effectively bypassed rate limiting for that IP after the first failure

Cleanup + test coverage
- GlobalPreemptibleStrategy: removed the unused @Slf4j and an obsolete timeout constant
- AdaptorThread.java (the 20-thread timing-based assertion) removed; replaced by deterministic sequential tests
- New RateLimiterServletTest and RateLimiterInterceptorTest; expanded GlobalRateLimiterTest

Why are these changes required?
Blocking acquire() is a tail-latency hazard. Under load, request threads pile up waiting for permits. Once the bounded Netty/Jetty worker pool is exhausted by such waiters, the node stops responding to healthy traffic too — and the impact persists past the original burst. Non-blocking tryAcquire returns immediately, so threads remain available for legitimate requests.
The old ordering wasted global quota. A request was charged a global token before the per-endpoint check ran. A hot endpoint that was itself throttled could still drain the global budget, letting attackers exhaust shared capacity through a single endpoint they were already capped on. Per-endpoint-first closes this gap.
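The quota-wasting effect of the old ordering can be sketched with a simple token counter standing in for the global limiter (all names here are illustrative; the real code uses GlobalRateLimiter.tryAcquire):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a throttled endpoint (perEndpointAcquired == false) should not
// cost a global token. Java's && short-circuit gives that for free once
// the per-endpoint check comes first.
public class OrderingSketch {
  static final AtomicInteger globalTokens = new AtomicInteger(5);

  // Consume one global token if any remain; report whether we got one.
  static boolean tryGlobal() {
    return globalTokens.getAndUpdate(t -> t > 0 ? t - 1 : t) > 0;
  }

  public static void main(String[] args) {
    boolean perEndpointAcquired = false; // hot endpoint already throttled

    // Old order: the global limiter is charged first, wasting a token
    // on a request that the per-endpoint check rejects anyway.
    boolean oldOrder = tryGlobal() && perEndpointAcquired;

    // New order: && short-circuits, so the global limiter is never consulted.
    boolean newOrder = perEndpointAcquired && tryGlobal();

    System.out.println(oldOrder + " " + newOrder + " " + globalTokens.get());
  }
}
```

After one rejected request, only the old ordering has spent a token from the shared budget; the new ordering leaves the global counter untouched.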
Permit leaks in IPreemptibleRateLimiter are a slow-motion outage. The previous code skipped release() in two paths: the early-return branch where the global limiter rejected after the per-endpoint permit was acquired, and the path where next.startCall() threw in the gRPC path. The semaphore drifts monotonically downward; once it reaches zero, every subsequent gRPC call blocks indefinitely. This is a latent availability bug that accumulates with time.
Silent fail-open is an anti-abuse anti-pattern. When cache.get(loader) threw, the original code created a fresh RateLimiter without inserting it into the cache, so every subsequent call from the failing path got a full burst budget — effectively no rate limit, and no operational signal. Fail-closed plus a warn log makes the failure both safe and observable.
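The fail-open vs fail-closed distinction can be sketched with a loader that always throws, standing in for a failing cache.get(ip, () -> RateLimiter.create(...)) call (method names here are illustrative):

```java
// Sketch: contrasting the old and new behavior when the cache loader fails.
public class FailClosedSketch {
  interface Loader { Object load() throws Exception; }

  // Old behavior: on failure, grant the request anyway. In the real code this
  // took the form of a fresh, non-cached RateLimiter per call — a full burst
  // budget every time, i.e. effectively no rate limit for that IP.
  static boolean failOpen(Loader loader) {
    try {
      loader.load();
      return true;
    } catch (Exception e) {
      return true; // silent bypass, no operational signal
    }
  }

  // New behavior: on failure, log with context and reject the request.
  static boolean failClosed(Loader loader) {
    try {
      loader.load();
      return true;
    } catch (Exception e) {
      System.err.println("rate limiter load failed: " + e.getMessage());
      return false; // deny rather than silently bypass the limit
    }
  }

  public static void main(String[] args) {
    Loader broken = () -> { throw new IllegalStateException("bad qps"); };
    System.out.println(failOpen(broken));   // request slips through
    System.out.println(failClosed(broken)); // request is rejected, failure logged
  }
}
```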
The old AdaptorTest was flaky. 20 concurrent threads plus assertTrue(t1 - t0 > 4000) is highly sensitive to CI scheduler jitter and produced sporadic failures. It was replaced with sequential tryAcquire calls that assert the deterministic burst behavior of Guava's SmoothBursty, so CI is stable.

This PR has been tested by:
Follow up
Extra details
Fixes #6363