feat(api): make API calls non-blocking #6733

Open
xxo1shine wants to merge 1 commit into tronprotocol:develop from xxo1shine:feat/api-rate-limiter-opt

Conversation

Collaborator

@xxo1shine xxo1shine commented Apr 29, 2026

What does this PR do?

Convert the HTTP and gRPC rate-limiter paths from blocking acquire() to non-blocking tryAcquire(), and fix the related issues that surfaced along the way.

  1. Core API switched to non-blocking

    • GlobalRateLimiter.acquire() → GlobalRateLimiter.tryAcquire() (returns boolean)
    • All IRateLimiter.acquire() / strategies / adapters updated to tryAcquire
    • Over-limit requests now return HTTP 429 / gRPC RESOURCE_EXHAUSTED immediately instead of holding the worker thread in a wait queue
  2. Reordered limit checks: per-endpoint first

    • Both RateLimiterServlet.service and RateLimiterInterceptor.interceptCall now check the per-endpoint limit first, and only consult the global limiter on
      success
    • acquireResource = perEndpointAcquired && GlobalRateLimiter.tryAcquire(...)
  3. Fixed semaphore leaks in IPreemptibleRateLimiter

    • HTTP path: the per-endpoint permit must be released even when the global limiter rejects (the finally block now keys off perEndpointAcquired instead of
      acquireResource)
    • gRPC path: explicit release() added on the early-return branch (global reject after per-endpoint acquired) and in the catch block when next.startCall()
      throws
  4. Hardened cache-load failure handling in GlobalRateLimiter / IPQpsStrategy

    • catch (ExecutionException) broadened to catch (Exception) (covers UncheckedExecutionException, etc.)
    • Added logger.warn with the offending IP and e.getMessage()
    • Fail-closed on exception (return false) — the previous code silently fail-opened by creating a fresh, non-cached RateLimiter per call, which effectively
      bypassed rate limiting for that IP after the first failure
  5. Cleanup + test coverage

    • GlobalPreemptibleStrategy: removed the unused @Slf4j and an obsolete timeout constant
    • Deleted AdaptorThread.java (the 20-thread timing-based assertion); replaced by deterministic sequential tests
    • Added RateLimiterServletTest and RateLimiterInterceptorTest; expanded GlobalRateLimiterTest
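The ordering and release discipline described in items 2 and 3 can be sketched with stand-ins (all names here are illustrative, not java-tron's: a java.util.concurrent.Semaphore models the preemptible per-endpoint limiter, and the global limiter is simulated as always rejecting):

```java
import java.util.concurrent.Semaphore;

public class NonBlockingOrderingSketch {
  static final Semaphore perEndpoint = new Semaphore(1); // hypothetical capacity
  static boolean globalTryAcquire() { return false; }    // simulate a global reject

  static boolean handle() {
    // 1. Per-endpoint check first: a throttled endpoint never touches global quota.
    boolean perEndpointAcquired = perEndpoint.tryAcquire();
    boolean acquired = perEndpointAcquired && globalTryAcquire();
    try {
      if (!acquired) {
        return false; // caller maps this to HTTP 429 / RESOURCE_EXHAUSTED
      }
      // ... serve the request ...
      return true;
    } finally {
      // 2. Release keys off perEndpointAcquired, not acquired: the per-endpoint
      // permit must come back even when only the global limiter rejected.
      if (perEndpointAcquired) {
        perEndpoint.release();
      }
    }
  }

  public static void main(String[] args) {
    handle(); // global rejects, but the permit is returned
    System.out.println(perEndpoint.availablePermits()); // prints 1: no leak
  }
}
```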

Why are these changes required?

  1. Blocking acquire() is a tail-latency hazard. Under load, request threads pile up waiting for permits. Once the bounded Netty/Jetty worker pool is exhausted
    by such waiters, the node stops responding to healthy traffic too — and the impact persists past the original burst. Non-blocking tryAcquire returns immediately,
    so threads remain available for legitimate requests.

  2. The old ordering wasted global quota. A request was charged a global token before the per-endpoint check ran. A hot endpoint that was itself throttled could
    still drain the global budget, letting attackers exhaust shared capacity through a single endpoint they were already capped on. Per-endpoint-first closes this gap.

  3. Permit leaks in IPreemptibleRateLimiter are a slow-motion outage. The previous code skipped release() in two paths:

    • the per-endpoint permit on global rejection;
    • the per-endpoint permit when next.startCall() threw in the gRPC path.

    The semaphore drifts monotonically downward; once it reaches zero, every subsequent gRPC call blocks indefinitely. This is a latent availability bug that
    accumulates with time.

  4. Silent fail-open is an anti-abuse anti-pattern. When cache.get(loader) threw, the original code created a fresh RateLimiter without inserting it into the
    cache, so every subsequent call from the failing path got a full burst budget — effectively no rate limit, and no operational signal. Fail-closed plus a warn log
    makes the failure both safe and observable.

  5. The old AdaptorTest was a flaky test. 20 concurrent threads plus assertTrue(t1 - t0 > 4000) is highly sensitive to CI scheduler jitter and produced
    sporadic failures. Replaced with sequential tryAcquire calls that assert deterministic burst behavior of Guava's SmoothBursty, so CI is stable.
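The deterministic test pattern in point 5 can be illustrated without Guava on the classpath (a Semaphore with 2 permits stands in for a limiter with a burst budget of 2; the real tests exercise SmoothBursty itself):

```java
import java.util.concurrent.Semaphore;

public class DeterministicLimiterTest {
  // Sequential tryAcquire calls against a fixed burst budget: the results are
  // booleans, so the assertions need no sleeping and no wall-clock thresholds.
  static boolean[] acquireThree() {
    Semaphore burst = new Semaphore(2); // stand-in burst budget of 2
    return new boolean[] {
      burst.tryAcquire(),  // true: 1st permit
      burst.tryAcquire(),  // true: 2nd permit
      burst.tryAcquire()   // false: budget exhausted, rejected immediately
    };
  }

  public static void main(String[] args) {
    boolean[] r = acquireThree();
    System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints "true true false"
  }
}
```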

This PR has been tested by:

  • Unit Tests
  • Manual Testing

Follow up

Extra details

Fixes #6363

Collaborator

@waynercheung waynercheung left a comment


LGTM

super.service(req, resp);
Metrics.histogramObserve(requestTimer);
} else {
resp.getWriter()
Collaborator


[MUST] The rejection branch never sets a status code, so HttpServletResponse defaults to 200 OK — clients receive:

  HTTP/1.1 200 OK
  Content-Type: application/json; charset=utf-8

  {"Error":"class java.lang.IllegalAccessException : lack of computing resources"}

This contradicts the expected non-blocking behavior where rate-limited requests should return 429 Too Many Requests.
Standard HTTP clients (OkHttp, axios, fetch, …) treat 200 as success and won't trigger retry / back-pressure logic.

Suggested fix:

    resp.setStatus(HttpServletResponse.SC_TOO_MANY_REQUESTS); // 429
    resp.setHeader("Retry-After", "1"); // retry time
    resp.getWriter().println(Util.printErrorMsg(
        new IllegalAccessException("lack of computing resources")));

Collaborator


JSON-RPC rejection response is not JSON-RPC 2.0 compliant

JsonRpcServlet, JsonRpcOnSolidityServlet, and JsonRpcOnPBFTServlet all extend RateLimiterServlet and don't override service(). When rate-limited, they hit the same rejection branch in RateLimiterServlet.service.

Spec-compliant JSON-RPC clients expect:
{"jsonrpc":"2.0","error":{"code":-32005,"message":"Limit exceeded"},"id":<request id>}

Suggested approach:

Extract the rejection handling logic in RateLimiterServlet into a protected hook method. This allows subclasses such as JsonRpcServlet to override it and return a JSON-RPC 2.0–compliant error response.
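The error body such an override would emit can be sketched as a pure helper (this is the reviewer's suggestion, not existing java-tron code; the field names follow the JSON-RPC 2.0 spec, and -32005 "Limit exceeded" is the conventional rate-limit error code):

```java
public class JsonRpcErrorBody {
  // Builds a JSON-RPC 2.0-compliant rate-limit error. The request id is passed
  // through verbatim (already-serialized JSON: a number, quoted string, or null).
  static String limitExceeded(String requestId) {
    return String.format(
        "{\"jsonrpc\":\"2.0\",\"error\":{\"code\":-32005,"
            + "\"message\":\"Limit exceeded\"},\"id\":%s}",
        requestId == null ? "null" : requestId);
  }

  public static void main(String[] args) {
    System.out.println(limitExceeded("1"));
    // prints {"jsonrpc":"2.0","error":{"code":-32005,"message":"Limit exceeded"},"id":1}
  }
}
```

A hypothetical protected hook in RateLimiterServlet would then call this from the JSON-RPC subclasses while the base class keeps the plain-JSON body.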

Collaborator Author


Thank you for the suggestion. However, it is unrelated to the issue this PR fixes; it can be discussed in a separate issue later.

Collaborator


I'd respectfully push back — this is the PR's stated behavior, just not implemented yet.
The PR description (bullet 1) reads:

"Over-limit requests now return HTTP 429 / gRPC RESOURCE_EXHAUSTED immediately instead of holding the worker thread in a wait queue"

@0xbigapple
Collaborator

[SHOULD] invalid global.qps brings down all APIs with opaque 500s.

Not introduced by this PR. Flagging here because this PR is the right window to fix it (small change, already touching this logic).

Reproduction

Setting rate.limiter.global.qps = 0.001 (or just 0) in config.conf:

  • Node starts cleanly (no error in logs)
  • First HTTP / gRPC request → HTTP 500
  • Every subsequent request → HTTP 500
  • Limiter never functions

Root cause

Two issues compound:

(a) RateLimiterConfig.GlobalConfig.qps is declared as int (common/src/main/java/org/tron/core/config/args/RateLimiterConfig.java:28,35,41). typesafe config silently truncates 0.001 to 0 via (long) 0.001 cast in ConfigDouble.longValue():

"0.001" → ConfigDouble(0.001)
        → SimpleConfig.getInt → intValueRangeChecked → (long) 0.001 = 0L → return 0
        → PARAMETER.rateLimiterGlobalQps = 0

intValueRangeChecked only validates the long-to-int range; it does not check whether the double is whole. No exception, no warning.
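The narrowing step in this trace is plain Java cast semantics, reproducible in isolation (names here are illustrative; this is not the typesafe-config source):

```java
public class TruncationDemo {
  // The same double-to-long-to-int narrowing the int accessor path performs:
  // Java casts round toward zero, so any fractional value below 1 becomes 0
  // with no exception and no warning.
  static int truncate(double configured) {
    return (int) (long) configured;
  }

  public static void main(String[] args) {
    System.out.println(truncate(0.001)); // prints 0
    System.out.println(truncate(0.5));   // prints 0
  }
}
```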

(b) GlobalRateLimiter initializes its static RateLimiter field at class-load time:

private static RateLimiter rateLimiter = RateLimiter.create(QPS);

RateLimiter.create(0) throws IllegalArgumentException("rate must be positive"). JVM wraps it as ExceptionInInitializerError, putting the class in a permanently erroneous state:

  Stage                                  Result
  Node startup                           ✅ Succeeds (GlobalRateLimiter is lazy-loaded)
  First request                          Triggers class init → ExceptionInInitializerError → 500
  Subsequent requests                    NoClassDefFoundError → 500
  Servlet catch (Exception unexpected)   Doesn't catch — ExceptionInInitializerError is an Error, not an Exception
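This failure mode is standard JVM class-initialization behavior and can be reproduced in a few lines (class and field names here are illustrative stand-ins, not java-tron's):

```java
public class InitFailureDemo {
  static class BadLimiter {
    // A static initializer that throws poisons the class permanently.
    static final double RATE = init();
    static double init() {
      throw new IllegalArgumentException("rate must be positive");
    }
  }

  // Touch the class twice and report what each access throws. The catch is
  // Throwable on purpose: both failures are Errors, so a servlet-style
  // catch (Exception) would never see them.
  static String[] touchTwice() {
    String[] results = new String[2];
    for (int i = 0; i < 2; i++) {
      try {
        results[i] = "rate=" + BadLimiter.RATE;
      } catch (Throwable t) {
        results[i] = t.getClass().getSimpleName();
      }
    }
    return results;
  }

  public static void main(String[] args) {
    String[] r = touchTwice();
    System.out.println(r[0]); // ExceptionInInitializerError on first access
    System.out.println(r[1]); // NoClassDefFoundError on every later access
  }
}
```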

Suggested fix

Must-have — fail-fast validation in Args.applyRateLimiterConfig:

private static void applyRateLimiterConfig(RateLimiterConfig rl) {
  int qps    = rl.getGlobal().getQps();
  int ipQps  = rl.getGlobal().getIp().getQps();
  int apiQps = rl.getGlobal().getApi().getQps();
  if (qps <= 0 || ipQps <= 0 || apiQps <= 0) {
    throw new TronError(
        String.format("rate.limiter.global qps must be positive: qps=%d, ip.qps=%d, api.qps=%d",
            qps, ipQps, apiQps),
        TronError.ErrCode.RATE_LIMITER_INIT);
  }
  ...
}

This alone resolves the catastrophic "node up, all APIs 500" mode: the same misconfiguration now produces a clear startup error instead of opaque runtime failures. Users who write 0.5 will see qps must be positive (slightly misleading because the typed value was positive, but truncated to 0) — acceptable trade-off for a one-line guard.

Nice-to-have — widen field type from int to double:

GlobalConfig.qps is currently int while per-endpoint adapters already accept fractional QPS via paramString = "qps=0.5". Making the global field a double removes the silent-truncation surprise and makes the stack consistent. But for global.qps specifically (default 50000, intended as a node-level safety cap), fractional values have no real use case — this can land as a follow-up if it complicates review.

Happy to follow up in a separate PR if it's out of scope here, but flagging because the failure mode (node up, all APIs 500, no startup error) is hard to diagnose in the field.

@halibobo1205 halibobo1205 added the topic:api rpc/http related issue label Apr 30, 2026
@halibobo1205 halibobo1205 added this to the GreatVoyage-v4.8.2 milestone Apr 30, 2026
@xxo1shine
Collaborator Author

This issue is unrelated to the current fix; administrators need to configure settings appropriately. Whether it needs fixing, and how, can be discussed in a separate issue.

// Check per-endpoint first to avoid consuming global IP/QPS quota for requests
// that would be rejected by the per-endpoint limiter anyway.
boolean perEndpointAcquired = rateLimiter == null || rateLimiter.tryAcquire(runtimeData);
boolean acquireResource = perEndpointAcquired && GlobalRateLimiter.tryAcquire(runtimeData);
Collaborator


(commenting on line 98 because the affected line, 113, is outside the diff hunk)

[SHOULD] HTTP rejection should set 429 status, not implicit 200

The PR description says "Over-limit requests now return HTTP 429", but this branch only writes an error JSON body via resp.getWriter().println(...) and never calls resp.setStatus(...). The HTTP status stays at the default 200, so client SDKs cannot back off based on status code, gateways and Prometheus templates classify rate-limited requests as success, and the behavior is asymmetric with the gRPC path that returns RESOURCE_EXHAUSTED.

Suggestion: call resp.setStatus(HttpServletResponse.SC_TOO_MANY_REQUESTS) here (and consider a Retry-After header), or update the PR description to remove the 429 claim.

}
}
};
} catch (Exception e) {
Collaborator


[SHOULD] catch block on startCall failure should close the gRPC call

When next.startCall(call, headers) throws, this catch block releases the permit, marks metrics, and logs, but never invokes call.close(...). The returned listener is the empty new ServerCall.Listener<ReqT>() {} from line 107, so the client receives no status / trailers and only fails when the transport-level deadline fires. This is a pre-existing issue, but since the catch block is being rewritten in this PR (release added), it is a good moment to close it properly.

Suggestion: call call.close(Status.fromCode(Code.INTERNAL).withDescription("rpc handler init failed"), new Metadata()); in the catch block, and add a test that verifies call.close is invoked on startCall exceptions.

Cache<String, RateLimiter> freshCache = CacheBuilder.newBuilder()
.maximumSize(10000).expireAfterWrite(1, TimeUnit.HOURS).build();
cacheField.set(null, freshCache);
}
Collaborator


[SHOULD] Reset GlobalRateLimiter static fields in @AfterClass to avoid singleton pollution

This class reflectively writes IP_QPS, rateLimiter, and cache in @Before, but @AfterClass only calls Args.clearParam(). After this class finishes, those static fields stay at the last test's values (e.g. globalQps=10 / ipQps=1). Any other test in the same surefire fork that calls GlobalRateLimiter.tryAcquire without mockStatic will read the polluted state and behave non-deterministically — exactly the kind of cross-class flake that surfaces in CI and not locally.

Suggestion: in @AfterClass, call a helper that restores the three static fields to production defaults (re-read from Args) so the JVM state on exit matches the state on entry.
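The shape of such a helper, on a stand-in class (the real code would target GlobalRateLimiter's fields and re-read the defaults from Args; all names below are illustrative):

```java
import java.lang.reflect.Field;

public class StaticFieldReset {
  static class FakeLimiter {
    static double IP_QPS = 50; // "production default" in this sketch
  }

  // Reflectively restore a static field to a known default, mirroring how the
  // test class writes it in @Before; called from @AfterClass in the real test.
  static void resetIpQps(double defaultValue) throws Exception {
    Field f = FakeLimiter.class.getDeclaredField("IP_QPS");
    f.setAccessible(true);
    f.set(null, defaultValue); // null receiver: static field
  }

  public static void main(String[] args) throws Exception {
    FakeLimiter.IP_QPS = 1;  // a test lowered the limit
    resetIpQps(50);          // @AfterClass restores the default
    System.out.println(FakeLimiter.IP_QPS); // prints 50.0
  }
}
```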

if (rateLimiter instanceof IPreemptibleRateLimiter && perEndpointAcquired) {
((IPreemptibleRateLimiter) rateLimiter).release();
}
call.close(Status.fromCode(Code.RESOURCE_EXHAUSTED), new Metadata());
Collaborator


[NIT] No metrics on rate-limit reject paths

The early-return path here (and the symmetric HTTP one in RateLimiterServlet.service) closes the call with RESOURCE_EXHAUSTED but emits no metrics. MetricsUtil.meterMark(NET_API_FAIL_QPS) only fires on startCall exceptions, so operators cannot observe rate-limit hit rate to tune thresholds. The PR claims observability as a benefit but reject metrics are still missing.

Suggestion: add a MetricsUtil.meterMark(...) (reuse NET_API_FAIL_QPS or introduce NET_API_RATE_LIMIT_REJECT_QPS) on both reject branches; can be a follow-up if out of scope.

if (rateLimiter instanceof IPreemptibleRateLimiter && acquireResource) {
// Release whenever the per-endpoint permit was acquired (covers both the normal
// completion path and the case where GlobalRateLimiter rejected the request).
if (rateLimiter instanceof IPreemptibleRateLimiter && perEndpointAcquired) {
Collaborator


(commenting on line 122 because the affected line, 113, is outside the diff hunk)

[NIT] IllegalAccessException is the wrong exception type for rate limiting

Util.printErrorMsg(new IllegalAccessException("lack of computing resources")) makes the response body contain java.lang.IllegalAccessException: lack of computing resources. IllegalAccessException is the standard JDK exception for reflective access failure and is unrelated to throttling, so the body misleads SDK authors and is operator-facing rather than client-facing. Since this branch is being rewritten anyway, swapping it now is the cheap moment.

Suggestion: use a clearer message or a dedicated RateLimitedException (or fold this into the setStatus(429) change so the body just says "rate limit exceeded").

try {
// cache.get is atomic: only one loader executes per key under concurrent requests,
// preventing multiple RateLimiter instances from being created for the same IP.
r = cache.get(ip, () -> RateLimiter.create(IP_QPS));
Collaborator


[NIT] cache.get atomicity claim has no concurrent regression test

The new cache.get(ip, () -> RateLimiter.create(IP_QPS)) (and the symmetric one in IPQpsStrategy.tryAcquire) relies on Guava's documented loader atomicity to fix the original getIfPresent + put race. The behavior is correct, but no test asserts it — only the documentation does. A future Guava upgrade or an accidental refactor back to getIfPresent + put would silently regress.

Suggestion: add a small concurrent test (CountDownLatch + 2+ threads + AtomicInteger counting loader invocations) that asserts the loader runs at most once per key under concurrent access.
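The shape of that test, sketched without Guava on the classpath (ConcurrentHashMap.computeIfAbsent stands in for cache.get since it gives the same per-key loader atomicity the comment relies on; the real test would target the Guava cache):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LoaderAtomicityTest {
  // Launch N threads against the same key at once and count how many times the
  // loader actually runs; with an atomic loader the answer must be exactly 1.
  static int countLoaderCalls(int threads) throws Exception {
    ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
    AtomicInteger loaderCalls = new AtomicInteger();
    CountDownLatch start = new CountDownLatch(1);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.submit(() -> {
        start.await(); // line up all threads before the first lookup
        cache.computeIfAbsent("203.0.113.7", ip -> {
          loaderCalls.incrementAndGet(); // count loader invocations
          return new Object();           // stand-in for RateLimiter.create(IP_QPS)
        });
        return null;
      });
    }
    start.countDown(); // release every thread at once
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    return loaderCalls.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countLoaderCalls(8)); // prints 1: loader ran once per key
  }
}
```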

@Slf4j
public class GlobalRateLimiter {

private static double QPS = Args.getInstance().getRateLimiterGlobalQps();
Collaborator


(commenting on line 14 because the affected line, 16, is outside the diff hunk)

[NIT] IP_QPS<=0 misconfiguration causes a silent fail-closed warn storm

If node.rateLimiter.global.IP.qps is misconfigured to 0 or a negative value, RateLimiter.create(IP_QPS) throws IllegalArgumentException. With the new fail-closed behavior, every new IP triggers the loader, fails, and emits a warn — so under hot traffic the log gets flooded and every client is rejected. The root cause (a bad config) is buried inside an in-flight code path instead of being reported at startup.

Suggestion: validate IP_QPS > 0 (and QPS > 0) at static initialization or Args parsing, and fail-fast if invalid, so misconfiguration surfaces before the first request.


verify(perEndpoint, times(1)).release();
}
}
Collaborator


[NIT] Missing test: handler RuntimeException must still release the permit

The new test class covers the per-endpoint reject, global reject, and happy path, but never lets super.service (i.e. the user handler) throw a RuntimeException. The finally block does release the permit on that path (because perEndpointAcquired is true), but no assertion guards against future regressions if someone reorders the catch / finally.

Suggestion: add a testHandlerExceptionStillReleasesPermit case where doGet throws RuntimeException and verify(perEndpoint, times(1)).release() runs.
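The structure under test, reduced to a runnable sketch (a Semaphore models the preemptible per-endpoint limiter and a Runnable models the handler; names are illustrative, mirroring the servlet's try/finally rather than calling it):

```java
import java.util.concurrent.Semaphore;

public class HandlerExceptionReleaseTest {
  static final Semaphore perEndpoint = new Semaphore(1);

  // Simplified mirror of RateLimiterServlet.service's acquire/handle/release
  // structure: the finally block keys off perEndpointAcquired, so the permit
  // comes back even when the handler throws.
  static void service(Runnable handler) {
    boolean perEndpointAcquired = perEndpoint.tryAcquire();
    try {
      if (perEndpointAcquired) {
        handler.run(); // may throw RuntimeException
      }
    } finally {
      if (perEndpointAcquired) {
        perEndpoint.release(); // must run even on the exception path
      }
    }
  }

  public static void main(String[] args) {
    try {
      service(() -> { throw new RuntimeException("handler blew up"); });
    } catch (RuntimeException expected) {
      // the exception propagates, but the permit is already back
    }
    System.out.println(perEndpoint.availablePermits()); // prints 1: no leak
  }
}
```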


Labels

topic:api rpc/http related issue


Development

Successfully merging this pull request may close these issues.

HTTP calls experience delay due to rate limiting

5 participants