Assessment: AI Assessment Pipeline#788

Merged
vprashrex merged 39 commits into main from feature/assessment
May 6, 2026

Conversation

@vprashrex
Collaborator

@vprashrex vprashrex commented Apr 27, 2026

Summary

Target issue is #791
Explain the motivation for making this change. What existing problem does the pull request solve?

This PR introduces a shared assessment pipeline for partner programs, replacing manual and fragmented evaluation workflows. It adds multimodal dataset ingestion, config-driven batch assessment runs, status tracking, retries, cron-based processing, and exportable results for easier comparison and operations.

Note

  • Updated the config table to include a tag column with enum ['ASSESSMENT']. Only configs and versions with tag = 'ASSESSMENT' are shown in the assessment module. This separation is intentional: assessment configs don’t include system instructions, and general configs should not appear in the assessment config UI (see the sketch below).
  • Used type='assessment' on evaluation dataset batches so that assessment datasets are not mixed up with other evaluation datasets.
  • Assessment datasets are stored in the existing evaluation dataset table.
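
A minimal sketch of how that filter could look when listing configs for the assessment UI, assuming a Config SQLModel table with the new tag column (the model name, field name, and import path are illustrative, not the actual schema):

from enum import StrEnum

from sqlmodel import Session, select

from app.models import Config  # hypothetical import path for the config table


class ConfigTag(StrEnum):
    ASSESSMENT = "ASSESSMENT"


def list_assessment_configs(session: Session) -> list[Config]:
    """Return only configs tagged for the assessment module."""
    statement = select(Config).where(Config.tag == ConfigTag.ASSESSMENT)
    return list(session.exec(statement).all())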

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add any other information the reviewer might need here.

Summary by CodeRabbit

  • New Features

    • Full assessment system for multi-configuration batch evaluations (create/retry flows, parent/child manager views).
    • Dataset upload/management for CSV/XLSX with validation and size limits.
    • Result export/download in JSON, CSV, XLSX, or ZIP per run.
    • Enhanced batch processing and richer attachment handling (images/files).
  • Behavioral Change

    • Cron job now also polls and reports assessment processing totals alongside regular evaluations.
  • Documentation

    • Added API docs for assessment endpoints and flows.
  • Tests

    • New comprehensive tests covering assessment utils, processing, parsing, mappers, validators, and cron aggregation.

@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough


Adds a full "assessment" subsystem: DB models and migration, dataset upload/validation, batch payload builders and submitters for OpenAI/Google, async polling/processing, SSE event broker, export utilities, mappers/crud/service layers, FastAPI routes, and cron/router integration.

Changes

  • API Routing & Cron Integration (backend/app/api/main.py, backend/app/api/routes/cron.py): Registers the new assessment router in the main API; converts the cron handler to async, awaits primary evaluation processing and then calls poll_all_pending_assessment_evaluations, merging assessment counters and capturing polling errors into the cron response.
  • Assessment Models & DB Migration (backend/app/assessment/models.py, backend/app/alembic/versions/055_add_assessment_manager_table.py, backend/app/models/__init__.py): Adds Assessment and AssessmentRun SQLModel tables, related Pydantic schemas, indexes, an alembic migration, and re-exports models from the package init.
  • CRUD & Status Management (backend/app/assessment/crud.py): New data-access APIs: create/list/fetch assessments and runs, update run status, recompute parent assessment status and run stats with transactional commits and error handling.
  • Batch Submission & Attachments (backend/app/assessment/batch.py): New batch builder handling CSV/XLSX rows, prompt construction, attachment resolution (Drive URLs, data: URIs, base64), provider-specific JSONL for OpenAI and Google, and batch submission via provider clients.
  • Polling & Processing (backend/app/assessment/processing.py, backend/app/assessment/cron.py): Async polling for pending assessments and runs, provider batch-state checks, result download, JSON sanitization/parsing, run status updates, SSE events, uploads of raw results, and aggregate summaries.
  • Service Orchestration & Routes (backend/app/assessment/service.py, backend/app/assessment/routes.py): Start/retry orchestration, creation of manager/run records, batch submission and run status updates; new FastAPI /assessment router for dataset CRUD, evaluation creation/retry, SSE /events, listings, and result export endpoints (JSON/CSV/XLSX/ZIP).
  • Dataset Handling & Validation (backend/app/assessment/dataset.py, backend/app/assessment/validators.py): Dataset upload with row counting (CSV encoding fallbacks, Excel via openpyxl), object-store upload, filename sanitization, and file validation (10MB limit, allowed extensions).
  • Export & Parsing Utilities (backend/app/assessment/utils/export.py, backend/app/assessment/utils/parsing.py, backend/app/assessment/utils/__init__.py): Export row flattening/serialization (JSON/CSV/XLSX), parsing stored JSON/JSONL results, usage/token totals, loading export rows for runs, and public re-exports.
  • Mappers & Normalization (backend/app/assessment/mappers.py): Maps Kaapi LLM params to OpenAI/Google request params, normalizes instruction text, enforces schema strictness, handles reasoning/effort mapping, and transforms output schemas.
  • Events Broker (backend/app/assessment/events.py): SSE broker singleton with per-subscriber asyncio queues, keep-alives, and non-blocking publish.
  • Module Init & Minor Model Changes (backend/app/assessment/__init__.py, backend/app/models/batch_job.py, backend/app/models/llm/request.py): Adds a module docstring; extends BatchJobType with ASSESSMENT; adds effort, summary, top_p, and max_output_tokens to TextLLMParams.
  • Tests & Docs (backend/app/tests/..., backend/app/api/docs/assessment/*, backend/pyproject.toml): New tests for assessment features (export, mappers, parsing, processing, validators), updated cron tests to async mocks, docs for assessment endpoints, and adds the openpyxl dependency.

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as API Routes
    participant Service as Assessment Service
    participant DB as Database
    participant Provider as Batch Provider
    participant Storage as Object Storage
    participant Cron as Cron Poller

    Client->>API: POST /assessment (dataset, configs)
    API->>Service: start_assessment(request)
    Service->>DB: create Assessment (manager)
    Service->>DB: resolve configs

    loop per config
        Service->>DB: create AssessmentRun
        Service->>Provider: submit_batch(payload)
        Provider-->>Service: batch_job_id
        Service->>DB: update run -> processing
    end
    Service-->>Client: AssessmentResponse

    Cron->>DB: fetch pending runs
    Cron->>Provider: check_batch_status(job_id)
    alt batch complete
        Provider-->>Cron: results
        Cron->>Storage: upload_raw_results()
        Cron->>Service: parse_assessment_output
        Cron->>DB: update run -> completed
        Cron->>API: publish SSE (results_ready)
    else batch failed
        Cron->>DB: update run -> failed
    else in-progress
        Cron-->>Cron: keep processing
    end

    Client->>API: GET /assessment/{id}/results?format=csv|json|xlsx
    API->>Service: build_export_response
    Service->>Storage: fetch_results()
    Service-->>Client: StreamingResponse (file)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Suggested reviewers

  • kartpop
  • AkhileshNegi

🐰 I hopped through rows and configs bright,
Built batches, polled throughout the night,
SSE beacons, CSV dreams,
Providers fed my rabbit schemes—
A bunny cheers: assessments take flight! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 38.86%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): The title accurately reflects the main change: introducing a comprehensive assessment pipeline with dataset ingestion, batch processing, status tracking, and result export capabilities.
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (15)
backend/app/assessment/utils/parsing.py-25-26 (1)

25-26: ⚠️ Potential issue | 🟡 Minor

Use explicit None checks to avoid masking zero token counts.

usage.get("input_tokens") or usage.get("prompt_tokens") will fall through to the alternate key when the primary key is present but 0. While provider payloads with zero input/output tokens are uncommon, the fallthrough also makes the precedence between the two aliases inconsistent (only when truthy). Prefer an explicit None check so the _tokens key always wins when present.

♻️ Proposed fix
-    input_tokens = usage.get("input_tokens") or usage.get("prompt_tokens")
-    output_tokens = usage.get("output_tokens") or usage.get("completion_tokens")
+    input_tokens = (
+        usage.get("input_tokens")
+        if usage.get("input_tokens") is not None
+        else usage.get("prompt_tokens")
+    )
+    output_tokens = (
+        usage.get("output_tokens")
+        if usage.get("output_tokens") is not None
+        else usage.get("completion_tokens")
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/parsing.py` around lines 25 - 26, The current
token extraction uses truthy fallback which treats 0 as missing; replace the
"or" fallbacks with explicit None checks so that input_tokens prefers
usage["input_tokens"] when the key exists (even if 0) and only uses
usage["prompt_tokens"] when usage.get("input_tokens") is None, and do the
analogous change for output_tokens so usage["output_tokens"] takes precedence
over usage["completion_tokens"]; update the two assignments in parsing.py that
set input_tokens and output_tokens accordingly (use explicit is not None checks
against usage.get(...)).
backend/app/models/user.py-81-81 (1)

81-81: ⚠️ Potential issue | 🟡 Minor

PATCH /me should compute and return features like GET /me, or document the inconsistency.

The features field defaults to [] in most endpoints but is populated by flag computation only in GET /me. This creates an inconsistency for PATCH /me (line 97), which returns the user directly without computing features—clients calling PATCH /me will receive features: [] even if the user has feature flags enabled, contradicting the behavior of GET /me.

Recommend either:

  • Computing features in PATCH /me before returning, or
  • Documenting that features reflects the current state only after GET /me.
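
One hedged way to implement the first option is to extract the flag computation into a helper and call it from both handlers (compute_user_features and the import paths below are assumptions for illustration, not the existing code):

from sqlmodel import Session

from app.core.feature_flags import FeatureFlag, is_enabled  # flag enum and check described in this review
from app.models import User  # hypothetical re-export of the user row model


def compute_user_features(session: Session, user: User) -> list[str]:
    """Return the feature-flag keys enabled for the user's organization."""
    if user.organization_id is None:
        return []
    return [
        flag.name
        for flag in FeatureFlag
        if is_enabled(session=session, flag=flag, organization_id=user.organization_id)
    ]

Both GET /me and PATCH /me would then assign the helper's result to the response's features field before returning, keeping the two endpoints consistent.
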
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/user.py` at line 81, PATCH `/me` returns the User model
with features defaulting to [] because it doesn't run the same flag computation
used by GET `/me`; modify the PATCH `/me` handler to compute and populate the
features field before returning (reuse the same flag computation logic used by
GET `/me`, or extract that logic into a helper like compute_user_features(user)
and call it from both GET `/me` and PATCH `/me`) so the returned User.features
matches GET `/me`.
backend/app/api/permissions.py-92-101 (1)

92-101: ⚠️ Potential issue | 🟡 Minor

Misleading 403 detail when organization context is absent.

When org_id is None (caller has no organization on the auth context), the message still reads "is not enabled for this organization", which conflates "feature disabled" with "no organization scope". Consider distinguishing the two for clearer client-side handling.

🛡️ Suggested differentiation
-        if org_id is None or not is_enabled(
-            session=session,
-            flag=flag,
-            organization_id=org_id,
-            project_id=project_id,
-        ):
-            raise HTTPException(
-                status_code=403,
-                detail=f"Feature '{flag.name}' is not enabled for this organization.",
-            )
+        if org_id is None:
+            raise HTTPException(
+                status_code=403,
+                detail=f"Feature '{flag.name}' requires an organization context.",
+            )
+        if not is_enabled(
+            session=session,
+            flag=flag,
+            organization_id=org_id,
+            project_id=project_id,
+        ):
+            raise HTTPException(
+                status_code=403,
+                detail=f"Feature '{flag.name}' is not enabled for this organization.",
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/permissions.py` around lines 92 - 101, The current 403
message conflates missing org context and a disabled feature; update the check
around org_id and is_enabled in the block containing org_id, is_enabled(...),
and the raised HTTPException so that if org_id is None you raise a 403 (or
appropriate 400/401 if preferred) with a clear message like "No organization
context provided" (referencing org_id), otherwise keep the 403 with detail
f"Feature '{flag.name}' is not enabled for this organization." (use session,
project_id and flag.name as currently used); ensure the two paths are distinct
and use the same HTTPException symbol for consistency.
backend/app/assessment/cron.py-144-182 (1)

144-182: ⚠️ Potential issue | 🟡 Minor

Failure-path itself is unprotected — a single bad cleanup aborts the entire poll.

update_assessment_run_status (commits, may raise on DB error / rollback) and recompute_assessment_status (raises ValueError if the assessment was concurrently deleted) are both invoked inside the except handler with no further protection. If either raises, the exception escapes the per-run loop and skips every remaining assessment in this cron tick.

Wrap the cleanup so a failure on one run can't stall the cron:

🛡️ Proposed defensive cleanup
             except Exception as e:
                 logger.error(
                     "[poll_all_pending_assessment_evaluations] "
                     f"Failed run {run.id} | "
                     ...
                     exc_info=True,
                 )
-                update_assessment_run_status(
-                    session=session,
-                    run=run,
-                    status="failed",
-                    error_message="Processing failed. Check server logs for details.",
-                )
-                refreshed_assessment = recompute_assessment_status(
-                    session=session, assessment_id=assessment.id
-                )
+                try:
+                    update_assessment_run_status(
+                        session=session,
+                        run=run,
+                        status="failed",
+                        error_message="Processing failed. Check server logs for details.",
+                    )
+                    refreshed_assessment = recompute_assessment_status(
+                        session=session, assessment_id=assessment.id
+                    )
+                except Exception as cleanup_err:
+                    logger.error(
+                        "[poll_all_pending_assessment_evaluations] "
+                        f"Cleanup failed for run {run.id} | error={cleanup_err}",
+                        exc_info=True,
+                    )
+                    failed += 1
+                    continue
                 failure_result = {
                     ...
                 }
                 all_results.append(failure_result)
                 assessment_event_broker.publish(
                     _build_callback_payload(
                         refreshed_assessment,
                         run,
                         failure_result,
                     )
                 )
                 failed += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/cron.py` around lines 144 - 182, The except handler
for run processing must be made defensive so cleanup errors don't abort the
whole poll: wrap calls to update_assessment_run_status,
recompute_assessment_status, and assessment_event_broker.publish (and any
related cleanup logic that may commit/raise) in their own try/except block(s),
catch and log any exceptions (including the run.id, run.run_name and
assessment_id) with exc_info=True, ensure you still append a failure_result and
increment failed for this run even if cleanup fails, and swallow the exception
so the per-run loop continues to the next run.
backend/app/assessment/models.py-32-36 (1)

32-36: ⚠️ Potential issue | 🟡 Minor

id: int = SQLField(default=None, ...) mismatches the type — should be int | None.

Both Assessment.id and AssessmentRun.id are typed int but defaulted to None (until the DB assigns one on insert). Pydantic v2 + SQLModel will raise validation errors when constructing instances pre-flush, and static type-checkers will (correctly) flag every place that reads assessment.id assuming non-None when the row is unflushed.

🛠️ Proposed fix
-    id: int = SQLField(
+    id: int | None = SQLField(
         default=None,
         primary_key=True,
         sa_column_kwargs={"comment": "Unique identifier for the assessment"},
     )
@@
-    id: int = SQLField(
+    id: int | None = SQLField(
         default=None,
         primary_key=True,
         sa_column_kwargs={"comment": "Unique identifier for the assessment run"},
     )

Also applies to: 144-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/models.py` around lines 32 - 36, The id fields are
declared as `id: int = SQLField(default=None, ...)` but default None means the
type should allow None; update the type annotations of Assessment.id and
AssessmentRun.id to be nullable (e.g., `int | None` or `Optional[int]`) to match
the SQLField default and avoid Pydantic/typing errors, and then review any call
sites that assume non-None to handle the unflushed case or assert/coerce as
needed.
backend/app/assessment/mappers.py-10-22 (1)

10-22: ⚠️ Potential issue | 🟡 Minor

Type signature doesn't match call sites — text may be None.

normalize_llm_text is declared as text: str -> str, but it's called with kaapi_params.get("instructions") (lines 100, 185), which returns Optional[str]. The body already handles non-str via the isinstance(text, str) early return, so widening the annotation correctly will keep the code honest and silence type-checkers.

🛠️ Proposed fix
-def normalize_llm_text(text: str) -> str:
-    if not isinstance(text, str) or not text:
-        return text
+def normalize_llm_text(text: str | None) -> str | None:
+    if not isinstance(text, str) or not text:
+        return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 10 - 22, The function
normalize_llm_text has a too-narrow type signature: change its annotation from
def normalize_llm_text(text: str) -> str to accept and return Optional[str]
(e.g., def normalize_llm_text(text: Optional[str]) -> Optional[str]) and add the
necessary Optional import from typing; this matches call sites like
kaapi_params.get("instructions") and the existing early-return behavior that
handles non-str/None values.
backend/app/api/routes/features.py-26-30 (1)

26-30: ⚠️ Potential issue | 🟡 Minor

Missing return type annotation on helper.

_parse_flag_or_400 returns a FeatureFlag enum value but has no return annotation.

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code".

🛠️ Proposed fix
-def _parse_flag_or_400(flag_key: str):
+from app.core.feature_flags import FeatureFlag as FeatureFlagEnum
+
+def _parse_flag_or_400(flag_key: str) -> FeatureFlagEnum:
     try:
         return parse_feature_flag(flag_key)
     except ValueError as exc:
         raise HTTPException(status_code=400, detail=str(exc)) from exc

(Use whatever import path matches your existing parse_feature_flag return type.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/features.py` around lines 26 - 30, add a return type
annotation to the helper _parse_flag_or_400 indicating it returns the
FeatureFlag enum returned by parse_feature_flag; update the function signature
to include -> FeatureFlag and import FeatureFlag from its module (or use the
correct fully-qualified import used elsewhere) so the helper's return type is
explicitly annotated and matches parse_feature_flag's return type.
backend/app/assessment/batch.py-107-135 (1)

107-135: ⚠️ Potential issue | 🟡 Minor

Workbook handle is leaked on exceptions.

openpyxl.load_workbook(...) opens a file handle (especially in read_only=True mode where it streams). If next(rows_iter, None), the column build, or the row iteration raises, wb.close() is skipped and the underlying ZIP/file handle is leaked until GC. Wrap the body in try/finally or use a context-manager-style pattern.

🛠️ Proposed fix
-    wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
-    ws = wb.active
-    if ws is None:
-        wb.close()
-        return []
-
-    rows_iter = ws.iter_rows(values_only=True)
-    header = next(rows_iter, None)
-    if header is None:
-        wb.close()
-        return []
-
-    columns = [str(h) if h is not None else f"col_{i}" for i, h in enumerate(header)]
-    result = []
-    for row in rows_iter:
-        if row and any(cell is not None for cell in row):
-            row_dict = {
-                columns[i]: str(cell) if cell is not None else ""
-                for i, cell in enumerate(row)
-                if i < len(columns)
-            }
-            result.append(row_dict)
-
-    wb.close()
-    return result
+    wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
+    try:
+        ws = wb.active
+        if ws is None:
+            return []
+
+        rows_iter = ws.iter_rows(values_only=True)
+        header = next(rows_iter, None)
+        if header is None:
+            return []
+
+        columns = [str(h) if h is not None else f"col_{i}" for i, h in enumerate(header)]
+        result: list[dict[str, str]] = []
+        for row in rows_iter:
+            if row and any(cell is not None for cell in row):
+                result.append(
+                    {
+                        columns[i]: str(cell) if cell is not None else ""
+                        for i, cell in enumerate(row)
+                        if i < len(columns)
+                    }
+                )
+        return result
+    finally:
+        wb.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 107 - 135, The workbook handle
opened by openpyxl.load_workbook in _parse_excel_rows can leak if any subsequent
operation raises; wrap workbook usage in a try/finally or context-manager
pattern so wb.close() always runs: after creating wb, enter a try block that
performs ws = wb.active, header = next(rows_iter, ...), column construction and
row iteration, and put wb.close() in the finally (or use
openpyxl.load_workbook’s context manager if available) to ensure the file/ZIP
handle is closed on success or exception.
backend/app/assessment/processing.py-37-70 (1)

37-70: ⚠️ Potential issue | 🟡 Minor

_sanitize_json_output doesn't escape \b or \f.

JSON spec (RFC 8259) treats every U+0000–U+001F character as illegal inside string values. The current implementation handles \n, \r, \t, but model outputs occasionally contain literal \b (0x08) and \f (0x0C) — most often when they're hallucinated. If those slip through, the second json.loads will still fail.

Tightening the sanitizer to escape any ord(ch) < 0x20 while inside a string covers all illegal control characters in one shot.

🛠️ Proposed fix
-        elif in_string and ch == "\n":
-            result.append("\\n")
-        elif in_string and ch == "\r":
-            result.append("\\r")
-        elif in_string and ch == "\t":
-            result.append("\\t")
-        else:
-            result.append(ch)
+        elif in_string and ord(ch) < 0x20:
+            escapes = {"\n": "\\n", "\r": "\\r", "\t": "\\t",
+                       "\b": "\\b", "\f": "\\f"}
+            result.append(escapes.get(ch, f"\\u{ord(ch):04x}"))
+        else:
+            result.append(ch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` around lines 37 - 70, The sanitizer
_sanitize_json_output currently only replaces newline, carriage return and tab
inside JSON strings and misses other control characters like \b and \f; update
the loop so that when inside a string (in_string is True) and the current
character is not an escaped character (escape_next is False) you check if
ord(ch) < 0x20 and, if so, append a JSON-safe escape (e.g. a specific mapping
for '\n','\r','\t','\b','\f' or generically append "\\u{ord(ch):04x}") instead
of the raw character; keep the existing handling for backslash (escape_next) and
quote toggling so previously escaped characters are preserved.
backend/app/crud/feature_flag.py-37-63 (1)

37-63: ⚠️ Potential issue | 🟡 Minor

Concurrent create can leak IntegrityError (500) instead of clean conflict.

Between get_feature_flag(...) and session.commit(), two concurrent requests with the same (key, organization_id, project_id) can both pass the existence check and try to insert. The DB's uq_feature_flag_key_org_project constraint correctly rejects the second one, but as an uncaught IntegrityError, which surfaces as HTTP 500 in create_feature_flag_route (it expects None to translate to 409).

Catching the integrity error and returning None (or re-fetching) keeps behavior consistent with the existing None-on-conflict contract.

🛡️ Proposed fix
+from sqlalchemy.exc import IntegrityError
@@
     feature_flag = FeatureFlag(
         key=key,
         organization_id=organization_id,
         project_id=project_id,
         enabled=enabled,
     )
     session.add(feature_flag)
-    session.commit()
-    session.refresh(feature_flag)
-    return feature_flag
+    try:
+        session.commit()
+    except IntegrityError:
+        session.rollback()
+        return None
+    session.refresh(feature_flag)
+    return feature_flag
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/feature_flag.py` around lines 37 - 63, The
create_feature_flag function can raise an uncaught IntegrityError when two
requests concurrently pass get_feature_flag and both attempt to insert the same
(key, organization_id, project_id); wrap the insert/commit in a try/except
catching sqlalchemy.exc.IntegrityError, call session.rollback() in the except,
then return None (or optionally re-fetch via get_feature_flag and return that)
so the function preserves the contract of returning None on conflict instead of
letting a 500 bubble up; update create_feature_flag to import IntegrityError and
handle the exception around session.commit()/session.add() accordingly.
backend/app/assessment/routes.py-332-336 (1)

332-336: ⚠️ Potential issue | 🟡 Minor

Missing return type annotation on event_stream.

-    async def event_stream():
+    async def event_stream() -> AsyncIterator[str]:
         async for chunk in assessment_event_broker.subscribe():
             yield chunk
             await asyncio.sleep(0)

You'll need from collections.abc import AsyncIterator (or from typing import AsyncIterator on older Pythons).

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/routes.py` around lines 332 - 336, Add a return type
annotation to the async generator by changing the signature of event_stream to
specify AsyncIterator[bytes] (e.g. async def event_stream() ->
AsyncIterator[bytes]:) and import AsyncIterator from collections.abc at the top
of the module; keep the body using assessment_event_broker.subscribe() unchanged
so the generator type matches the yielded chunk type.
backend/app/assessment/batch.py-395-420 (1)

395-420: ⚠️ Potential issue | 🟡 Minor

google_params["reasoning"] is set by the mapper but never consumed here.

map_kaapi_to_google_params (mappers.py line 205–207) writes google_params["reasoning"] = reasoning, but the builder at lines 395–420 only reads instructions, temperature, top_p, max_output_tokens, thinking_config, and output_schema. The reasoning value is silently dropped and never included in the Gemini request.

Either map reasoning into the appropriate Gemini field (e.g., merge with thinkingConfig) or remove the assignment from the mapper to eliminate dead code.
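
If the first option is taken, one rough way to fold reasoning into the same block might look like this (a sketch only; thinking_config and reasoning are the keys named in this comment, and the exact Gemini generationConfig shape should be confirmed against the provider docs):

def merge_thinking_config(google_params: dict[str, object]) -> dict[str, object]:
    """Combine thinking_config with the otherwise-dropped reasoning settings."""
    thinking_config = google_params.get("thinking_config")
    reasoning = google_params.get("reasoning")
    merged: dict[str, object] = dict(thinking_config) if isinstance(thinking_config, dict) else {}
    if isinstance(reasoning, dict):
        # Fold reasoning in so the intent reaches the request instead of being discarded.
        merged.update(reasoning)
    return merged

The builder would then set generation_config["thinkingConfig"] = merge_thinking_config(google_params) whenever the result is non-empty.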

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 395 - 420, The
google_params["reasoning"] value produced by map_kaapi_to_google_params is never
used in the Gemini request builder; update the builder in batch.py to consume
google_params["reasoning"] by merging it into the outgoing generationConfig
(e.g., if thinking_config exists, combine or extend it with reasoning and set
generation_config["thinkingConfig"] accordingly; if not, set
generation_config["thinkingConfig"] = reasoning) so the reasoning intent is sent
as part of the request, or alternatively remove the assignment of
google_params["reasoning"] in map_kaapi_to_google_params to eliminate dead code.
backend/app/assessment/mappers.py-93-106 (1)

93-106: ⚠️ Potential issue | 🟡 Minor

Add guard clause for model before calling litellm.supports_reasoning().

If kaapi_params lacks "model", the code calls litellm.supports_reasoning(model="openai/None"), which returns False (or raises an error depending on litellm version). This causes misleading warnings at lines 131–134 that claim the model "does not support reasoning" when the actual problem is that no model was specified.

Fix
-    support_reasoning = litellm.supports_reasoning(model=f"openai/{model}")
+    support_reasoning = (
+        litellm.supports_reasoning(model=f"openai/{model}") if model else False
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 93 - 106, The code calls
litellm.supports_reasoning with model=f"openai/{model}" even when model is None;
add a guard that checks whether model is present in kaapi_params before invoking
litellm.supports_reasoning: if model is truthy call support_reasoning =
litellm.supports_reasoning(model=f"openai/{model}"), otherwise set
support_reasoning = False (and optionally log or raise a clear "model missing"
message) so you don't pass "openai/None" into supports_reasoning or produce
misleading reasoning-support warnings; update any subsequent logic that uses
support_reasoning accordingly (look for the model variable and support_reasoning
usage in this function).
backend/app/assessment/utils/export.py-395-417 (1)

395-417: ⚠️ Potential issue | 🟡 Minor

assessment_id=run.assessment_id or 0 uses 0 as a sentinel for runs without a parent assessment, inconsistent with other models.

AssessmentExportRow.assessment_id is typed as int, forcing the conversion of None to 0. However, AssessmentRunPublic, AssessmentRunSummary, and the source field run.assessment_id all use int | None. Change AssessmentExportRow.assessment_id to int | None to match the source type and existing patterns in the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 395 - 417,
AssessmentExportRow.assessment_id is currently coerced to 0 in the export
(assessment_id=run.assessment_id or 0) and typed as int; change the
AssessmentExportRow dataclass/typing to accept int | None for assessment_id and
stop converting None to 0 by passing run.assessment_id directly in the
export_rows append. Update the type annotation
(AssessmentExportRow.assessment_id: int | None), any related imports/typing
declarations, and adjust any callers that assumed 0 to handle None where
necessary (references: AssessmentExportRow and the export loop building
export_rows using run.assessment_id).
backend/app/assessment/crud.py-275-292 (1)

275-292: ⚠️ Potential issue | 🟡 Minor

Status taxonomy is inconsistent; "in_progress" is checked but never set, causing silent data loss.

processing_runs accepts both "processing" and "in_progress", while the other counters and the model documentation accept single literals. The documented statuses are only "pending", "processing", "completed", "failed" — yet the aggregator checks for a fifth value that is never assigned anywhere in the codebase.

If any run has a status outside the expected set (typo, legacy value, or other unexpected value), it is excluded from all four counters but still contributes to total_runs = len(runs). When _determine_assessment_status receives mismatched totals, it falls through to "processing", silently masking the inconsistency.

Recommend (a) consolidating on a single canonical set of statuses (StrEnum) enforced at assignment time, and (b) asserting when a run's status falls outside that set.
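
A sketch of option (a), assuming the canonical set lives alongside the assessment models (names are illustrative):

from enum import StrEnum


class AssessmentRunStatus(StrEnum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


def count_runs_by_status(statuses: list[str]) -> dict[AssessmentRunStatus, int]:
    """Aggregate run statuses; unknown values raise instead of vanishing."""
    counts = {status: 0 for status in AssessmentRunStatus}
    for value in statuses:
        counts[AssessmentRunStatus(value)] += 1  # ValueError on anything outside the set
    return counts

Writers such as update_assessment_run_status could then accept AssessmentRunStatus instead of a bare str, so invalid values fail at assignment time rather than being silently excluded from the counters.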

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 275 - 292, The aggregator is
mixing an unexpected "in_progress" literal into processing_runs and not counting
unexpected statuses, causing total_runs mismatch and silent masking in
_determine_assessment_status; fix by consolidating to the canonical status set
(pending, processing, completed, failed) and stop checking "in_progress" in the
processing_runs calculation (only check r.status == "processing"), and add a
validation/assertion step when iterating runs (e.g., validate_run_status or
inline check inside the block that computes
pending_runs/processing_runs/completed_runs/failed_runs) that logs or raises if
r.status is not in the canonical set so invalid/legacy values are caught at
assignment time rather than ignored. Ensure assessment.total_runs still equals
len(runs) and that the counters sum to total_runs unless a validation error is
raised; update references to processing_runs and _determine_assessment_status
accordingly.
🧹 Nitpick comments (16)
backend/app/assessment/utils/parsing.py (1)

13-17: Consider hardening JSON parsing against malformed payloads.

json.loads on the array branch and per-line in the JSONL branch will raise JSONDecodeError straight to callers. If batch outputs from providers occasionally contain a malformed line, the entire result load fails. Depending on caller expectations, you may want to wrap line parsing with a try/except + logger warning so a single bad record doesn’t poison the whole run. Skip if upstream callers are already expected to handle this.
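
If they are not, a lenient per-line variant is straightforward (a sketch; the function name is hypothetical):

import json
import logging

logger = logging.getLogger(__name__)


def parse_jsonl_lenient(content: str) -> list[dict[str, object]]:
    """Parse JSONL content, logging and skipping malformed lines."""
    records: list[dict[str, object]] = []
    for line_no, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            logger.warning("[parse_jsonl_lenient] Skipping malformed line %d: %s", line_no, exc)
    return records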

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/parsing.py` around lines 13 - 17, The JSON
parsing currently calls json.loads directly on content (array branch using
parsed) and per-line (JSONL branch), which will raise JSONDecodeError for
malformed payloads; wrap the top-level json.loads(content) in a try/except
JSONDecodeError and log a warning then return an empty list (or appropriate
fallback), and for the JSONL branch catch JSONDecodeError around each
json.loads(line) so you log the bad line and skip it instead of letting one bad
record fail the whole parse; use the existing logger (or add one) to emit
warnings that include the offending line and the exception.
backend/app/assessment/validators.py (1)

53-67: Simplify by reading once and checking len(content).

The file.file.seek/tell/seek dance reaches into the underlying SpooledTemporaryFile synchronously, which works but bypasses Starlette's async UploadFile API and is brittle if the upload backend changes. Reading first and validating size on the bytes avoids the seek-around entirely and is also cheaper since read() is being called anyway. The 10 MB upper bound provides a natural memory cap.

-    file.file.seek(0, 2)
-    file_size = file.file.tell()
-    file.file.seek(0)
-
-    if file_size > MAX_FILE_SIZE:
+    content = await file.read()
+    file_size = len(content)
+
+    if file_size > MAX_FILE_SIZE:
         raise HTTPException(
             status_code=413,
             detail=f"File too large. Maximum size: {MAX_FILE_SIZE / (1024 * 1024):.0f}MB",
         )
 
     if file_size == 0:
         raise HTTPException(status_code=422, detail="Empty file uploaded")
 
-    content = await file.read()
     return content, file_ext

Note: if you ever need to enforce the limit before reading (to prevent OOM on very large uploads), prefer streaming chunked reads with a running counter rather than the seek/tell approach — but for a 10 MB cap this is fine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/validators.py` around lines 53 - 67, Replace the
synchronous file.file.seek/tell/seek usage with a single async read: call await
file.read() once (using UploadFile.read()), then use len(content) to enforce
MAX_FILE_SIZE and detect empty uploads instead of file_size; remove references
to file.file.seek/tell and keep the existing HTTPException behavior and return
(content, file_ext) from the same validator function.
backend/app/main.py (1)

70-75: Add a return type annotation to lifespan.

Per coding guidelines for **/*.py, all function parameters and return values should have type hints. The app: FastAPI parameter is annotated, but the return is not. Ruff's ARG001 warning about unused app is a false positive — FastAPI's lifespan protocol requires this signature.

-@asynccontextmanager
-async def lifespan(app: FastAPI):
-    """App startup / shutdown lifecycle."""
+@asynccontextmanager
+async def lifespan(app: FastAPI) -> AsyncIterator[None]:
+    """App startup / shutdown lifecycle."""

(Add from collections.abc import AsyncIterator to imports.) As per coding guidelines: "Always add type hints to all function parameters and return values in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/main.py` around lines 70 - 75, The lifespan function lacks a
return type hint; update its signature to annotate the return as an async
iterator (e.g., AsyncIterator[None]) and add the required import from
collections.abc (import AsyncIterator) so the signature becomes async def
lifespan(app: FastAPI) -> AsyncIterator[None]:; keep the existing app parameter
annotation (to avoid Ruff false-positive) and no other behavior changes.
backend/app/models/__init__.py (1)

100-100: Consider moving Assessment/AssessmentRun re-exports to a dedicated location or removing them entirely.

The import of Assessment and AssessmentRun from app.assessment.models at line 100 is currently safe (no circular import exists), but consolidating external model re-exports in a central __init__ can create tight coupling if the assessment subpackage ever needs to import from app.models for FK relationships. A cleaner approach:

  1. Remove this re-export and have callers from app.assessment.models import Assessment, AssessmentRun directly (preferred — keeps subpackage models self-contained).
  2. If table discovery is required for Alembic, register models in your env.py/metadata module instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/__init__.py` at line 100, The __init__.py currently
re-exports Assessment and AssessmentRun (from app.assessment.models) which can
create tight coupling; remove the import line exporting Assessment and
AssessmentRun from app.assessment.models in app.models.__init__ and update
callers to import directly from app.assessment.models instead (or, if you need
automatic table discovery for Alembic, register those models in your Alembic
env.py/metadata module rather than re-exporting them here); ensure no remaining
references rely on the central re-export and run tests/migrations to verify
imports still resolve.
backend/app/models/feature_flag.py (1)

61-78: FeatureFlagCreate and FeatureFlagUpdate are identical — collapse them.

They define the exact same fields and semantics; either alias or have one inherit from the other so that future changes to one don't silently diverge.

♻️ Proposed simplification
 class FeatureFlagCreate(SQLModel):
     key: str = Field(min_length=1, max_length=128)
     organization_id: int
     project_id: int | None = None
     enabled: bool


-class FeatureFlagUpdate(SQLModel):
-    key: str = Field(min_length=1, max_length=128)
-    organization_id: int
-    project_id: int | None = None
-    enabled: bool
+class FeatureFlagUpdate(FeatureFlagCreate):
+    """Same payload shape as create; kept distinct for OpenAPI clarity."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/feature_flag.py` around lines 61 - 78, FeatureFlagCreate
and FeatureFlagUpdate are identical; collapse them by introducing a single
shared model (e.g., FeatureFlagBase or keep FeatureFlagCreate as canonical) and
have the other name inherit or alias it to avoid duplication—update references
to use the unified model (FeatureFlagCreate or FeatureFlagUpdate) and leave
FeatureFlagDelete unchanged; ensure the shared model preserves the existing
Field validations (key, organization_id, project_id, enabled) so future changes
apply in one place.
backend/app/assessment/events.py (1)

25-26: Use builtin TimeoutError (Ruff UP041).

In Python 3.11+, asyncio.TimeoutError is an alias of the builtin TimeoutError. Per the project's Python 3.11+ stance and the static analysis hint, prefer the builtin.

-                except asyncio.TimeoutError:
+                except TimeoutError:
                     yield ": keep-alive\n\n"
                     continue

As per coding guidelines: "Use Python 3.11+ with type hints throughout the codebase".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` around lines 25 - 26, Replace the except
that catches asyncio.TimeoutError with the builtin TimeoutError in the async
block where you await queue.get() (the code that sets payload = await
asyncio.wait_for(queue.get(), timeout=15)); locate the try/except around that
await in events.py and change the exception class to TimeoutError to follow
Python 3.11+ conventions and satisfy the Ruff UP041 lint.
backend/app/core/feature_flags/__init__.py (1)

22-23: Misleading "unused" discard.

organization_id and project_id are actually used below — only user_id is unused. The _ = (organization_id, project_id, user_id) line falsely signals all three are unused. If the intent is to silence linters for user_id, prefer renaming the parameter.

♻️ Proposed cleanup
 def is_enabled(
     session: Session,
     flag: FeatureFlag,
     organization_id: int,
     project_id: int | None = None,
-    user_id: int | None = None,
+    user_id: int | None = None,  # noqa: ARG001 — reserved for future user-scoped flags
 ) -> bool:
     """Check whether *flag* is enabled for the given scope."""
-    _ = (organization_id, project_id, user_id)
     resolved_flag = get_feature_flag_enabled(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/feature_flags/__init__.py` around lines 22 - 23, The tuple
assignment "_ = (organization_id, project_id, user_id)" falsely marks all three
parameters as unused; instead keep real usage of organization_id and project_id
and only silence the unused user_id by renaming it (e.g. _user_id) or explicitly
assigning _ = user_id, and remove the tuple assignment; update the function
signature or the single-use discard so that organization_id and project_id are
no longer masked as unused while only user_id is ignored (refer to the
parameters organization_id, project_id, user_id and the surrounding check/flag
function in __init__.py).
backend/app/assessment/service.py (2)

57-62: Redundant UUID(str(...)) round-trip.

run.config_id is already typed as UUID | None on the model; converting to str and back to UUID is a no-op (and would crash on None, but you already validate non-None on lines 52–56).

-        configs.append(
-            AssessmentConfigRef(
-                config_id=UUID(str(run.config_id)),
-                config_version=run.config_version,
-            )
-        )
+        configs.append(
+            AssessmentConfigRef(
+                config_id=run.config_id,
+                config_version=run.config_version,
+            )
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 57 - 62, The code
unnecessarily wraps run.config_id in UUID(str(...)) before constructing
AssessmentConfigRef; since run.config_id is already a UUID (and you validate
it's non-None earlier in the block around lines 52–56), replace
UUID(str(run.config_id)) with run.config_id when creating AssessmentConfigRef to
avoid the redundant conversion and potential crash on None.

144-195: Redundant recompute_assessment_status calls inside the per-run loop.

recompute_assessment_status is invoked after each successful submit (line 178), after each failed submit (line 191), and again once at the end (line 195). For an assessment with N configs, that's N + 1 full re-aggregations of the same rows; only the final one is observable to clients (the others are immediately overwritten).

Move the call out of the loop:

♻️ Proposed simplification
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="processing",
                 batch_job_id=batch_job.id,
                 total_items=batch_job.total_items,
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)

         except Exception as e:
             ...
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="failed",
                 error_message="Batch submission failed. Please try again or contact support.",
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)

         runs.append(run)

     recompute_assessment_status(session=session, assessment_id=assessment.id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 144 - 195, The code currently
calls recompute_assessment_status() inside the per-config loop (after successful
submit and in the except block) causing redundant re-aggregations; remove the
two in-loop calls to recompute_assessment_status (the ones immediately after
update_assessment_run_status in both the try and except paths around
submit_assessment_batch and update_assessment_run_status) and keep only the
single recompute_assessment_status(session=session, assessment_id=assessment.id)
that runs once after the loop finishes so aggregation runs once for the final
state.
backend/app/api/routes/features.py (1)

33-170: Endpoints don't load Swagger descriptions from external markdown.

None of these new endpoints (/features, /feature-flags GET/POST/PATCH/DELETE) supply description=load_description("features/<action>.md"). The assessment dataset routes in this same PR (e.g., backend/app/assessment/routes.py:93) already follow this pattern.

As per coding guidelines: "Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/features.py` around lines 33 - 170, The router
endpoints get_features, get_feature_flags, create_feature_flag_route,
patch_feature_flag, and remove_feature_flag are missing Swagger descriptions
loaded from external Markdown; update each `@router`.<method> decorator to include
description=load_description("features/<action>.md") (use names like
"features/get.md", "features/list.md" or
"features/create.md"/"patch.md"/"delete.md" matching your docs convention) so
the OpenAPI docs pull content from external files rather than inline strings;
ensure you import or have access to load_description and use the exact function
names (get_features, get_feature_flags, create_feature_flag_route,
patch_feature_flag, remove_feature_flag) to locate the decorators to modify.
backend/app/assessment/processing.py (1)

253-253: Log prefix uses run id rather than function name.

log_prefix = f"[assessment_run={run.id}]" is useful context, but the repo convention is to lead with the function name. Combining both keeps the run-scoped context while satisfying the guideline.

-    log_prefix = f"[assessment_run={run.id}]"
+    log_prefix = f"[check_and_process_assessment][run={run.id}]"

As per coding guidelines: "Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message ...")".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` at line 253, Update the log_prefix
assignment so it begins with the current function name in square brackets
followed by the existing run-scoped context (i.e., include both the function
name and assessment run id); change the log_prefix variable (currently set as
log_prefix = f"[assessment_run={run.id}]") to the repo-standard pattern like
"[<function_name>][assessment_run=<run.id>]" (use the actual function name where
this line lives rather than the literal "<function_name>") so all subsequent
logger calls follow the required convention.
backend/app/assessment/routes.py (1)

348-436: Repetitive AssessmentPublic(...) / AssessmentRunPublic(...) construction.

list_assessment_managers, get_assessment_manager, list_evaluations, and get_evaluation each manually copy 14–15 fields one by one. Since both AssessmentPublic and AssessmentRunPublic are pure Pydantic projections of fields already on the SQLModel rows, you can replace these blocks with AssessmentPublic.model_validate(a, from_attributes=True) (Pydantic v2) — or set model_config = ConfigDict(from_attributes=True) on the public schemas and pass the row directly.

This shrinks four endpoints to one-liners and removes the risk of drift when fields are added.
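
For example, with Pydantic v2 the projection collapses to a single call (a sketch; the listed fields are a subset and AssessmentPublic is assumed to mirror the row's columns):

from pydantic import BaseModel, ConfigDict


class AssessmentPublic(BaseModel):
    model_config = ConfigDict(from_attributes=True)

    id: int
    status: str
    total_runs: int
    # ...plus the remaining mirrored columns


def to_public(row: object) -> AssessmentPublic:
    """Replace the manual field-by-field copy with one validation call."""
    return AssessmentPublic.model_validate(row, from_attributes=True)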

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/routes.py` around lines 348 - 436, The endpoints
list_assessment_managers, get_assessment_manager (and the analogous
list_evaluations/get_evaluation) repetitively map fields into
AssessmentPublic/AssessmentRunPublic; replace the manual field-by-field
construction by either enabling attribute parsing on the Pydantic models (set
model_config = ConfigDict(from_attributes=True) on AssessmentPublic and
AssessmentRunPublic) and pass the SQLModel row directly, or call
AssessmentPublic.model_validate(a, from_attributes=True) (and
AssessmentRunPublic.model_validate(...)) when building the APIResponse; update
imports if needed and replace each block in list_assessment_managers,
get_assessment_manager, list_evaluations, and get_evaluation with the single
model_validate (or direct model instantiation) call to return the Pydantic
projection.
backend/app/assessment/crud.py (2)

196-227: Constrain status to known values.

status: str allows any string to be persisted to AssessmentRun.status, which couples directly into the aggregation in recompute_assessment_status (see comment above). Tightening to a Literal["pending", "processing", "completed", "failed", ...] (or the same StrEnum used elsewhere) gives compile-time safety and prevents drift between writers and the aggregator.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 196 - 227, The
update_assessment_run_status function accepts status: str which lets arbitrary
values be persisted to AssessmentRun.status and can drift from the aggregator
logic in recompute_assessment_status; change the status parameter to the
project-wide enum type (or a typing.Literal of the allowed values) used
elsewhere (the same StrEnum/StatusEnum used for assessment runs) and update any
call sites to pass that enum/value so only known statuses are stored and
compile-time/type-checking will prevent drift; ensure type imports and
annotations are updated where update_assessment_run_status is defined and
referenced.

1-12: Place CRUD operations in backend/app/crud/ per project convention.

This new module is a pure data-access layer for the Assessment / AssessmentRun tables but lives at backend/app/assessment/crud.py instead of the central backend/app/crud/ package used elsewhere. Consider moving it (e.g. to backend/app/crud/assessment.py) so future contributors can locate DB code consistently and so it composes with the existing CRUD test/fixtures conventions.

Based on learnings: "Use CRUD pattern for database access operations located in backend/app/crud/."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 1 - 12, The CRUD code for
Assessment and AssessmentRun lives in app.assessment.crud but should be placed
in the central CRUD package; move the module into the project's crud package
(e.g., create app.crud.assessment) and update any imports to reference
app.crud.assessment instead of app.assessment.crud, keeping the same symbols
(Assessment, AssessmentRun, logger, and any CRUD functions/classes) and ensure
the new module exports the same public API so existing callers continue to work.
backend/app/assessment/utils/export.py (2)

160-171: Hard-coded base_fields.index("result_status") will raise ValueError if the field is renamed.

If AssessmentExportRow is later refactored to drop or rename result_status, every CSV/XLSX export call crashes with an opaque ValueError. Consider guarding it (fall back to len(base_fields)) or anchor the column ordering on a stable constant declared in this module so the dependency is explicit.

♻️ Suggested guard
-    output_idx = base_fields.index("result_status") + 1  # after result_status
+    try:
+        output_idx = base_fields.index("result_status") + 1  # place output_keys after result_status
+    except ValueError:
+        output_idx = len(base_fields)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 160 - 171, The code uses
base_fields.index("result_status") which will raise ValueError if that field is
renamed; update the logic in the block that computes output_idx/fieldnames
(referencing base_fields, output_idx, fieldnames, has_unparsed_output) to guard
the lookup: either define and use a stable constant (e.g., RESULT_STATUS_FIELD)
and check membership before calling index, or wrap the index call in a
try/except and fall back to len(base_fields) (or another sensible insertion
point) so exports don't crash; ensure the subsequent insertion of "output_raw"
uses the guarded output_idx when computing the insert position.

200-228: Pin the Excel engine explicitly to "openpyxl".

When pd.ExcelWriter(buf) is called with a BytesIO object, pandas infers the file format as .xlsx and selects an engine. Without an explicit engine parameter, pandas prefers xlsxwriter if installed, otherwise falls back to openpyxl. Since the codebase only declares openpyxl>=3.1.0 as a dependency and the error message explicitly references openpyxl support, the engine should be pinned to ensure deterministic behavior that matches the stated runtime requirement.

♻️ Suggested change
-    with pd.ExcelWriter(buf) as writer:
+    with pd.ExcelWriter(buf, engine="openpyxl") as writer:
         data_frame.to_excel(writer, index=False, sheet_name="results")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 200 - 228, The Excel
writer call should explicitly pin the engine to openpyxl to ensure deterministic
.xlsx output; update the pd.ExcelWriter usage (the block that creates buf,
data_frame = pd.DataFrame(expanded, columns=excel_fields) and the with
pd.ExcelWriter(buf) as writer: ... block) to pass engine="openpyxl" so pandas
uses openpyxl rather than xlsxwriter or another engine, keeping the rest of the
export flow (_drop_empty_columns, expanded, excel_fields,
data_frame.to_excel(...)) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 091c4f08-e02a-48f6-b5b8-7905058d0c10

📥 Commits

Reviewing files that changed from the base of the PR and between 7d4da2e and ec2eeda.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
  • .gitignore
  • backend/app/alembic/versions/050_add_assessment_manager_table.py
  • backend/app/alembic/versions/055_create_feature_flag_table.py
  • backend/app/api/main.py
  • backend/app/api/permissions.py
  • backend/app/api/routes/cron.py
  • backend/app/api/routes/features.py
  • backend/app/api/routes/users.py
  • backend/app/assessment/__init__.py
  • backend/app/assessment/batch.py
  • backend/app/assessment/cron.py
  • backend/app/assessment/crud.py
  • backend/app/assessment/dataset.py
  • backend/app/assessment/events.py
  • backend/app/assessment/mappers.py
  • backend/app/assessment/models.py
  • backend/app/assessment/processing.py
  • backend/app/assessment/routes.py
  • backend/app/assessment/service.py
  • backend/app/assessment/utils/__init__.py
  • backend/app/assessment/utils/export.py
  • backend/app/assessment/utils/parsing.py
  • backend/app/assessment/validators.py
  • backend/app/core/feature_flags/__init__.py
  • backend/app/core/feature_flags/constants.py
  • backend/app/crud/feature_flag.py
  • backend/app/main.py
  • backend/app/models/__init__.py
  • backend/app/models/batch_job.py
  • backend/app/models/feature_flag.py
  • backend/app/models/llm/request.py
  • backend/app/models/user.py
  • backend/app/tests/api/test_permissions.py
  • backend/pyproject.toml

Comment thread backend/app/alembic/versions/055_create_feature_flag_table.py Outdated
Comment thread backend/app/crud/assessment/batch.py Outdated
Comment thread backend/app/crud/assessment/batch.py Outdated
Comment thread backend/app/assessment/dataset.py Outdated
Comment thread backend/app/services/assessment/dataset.py
Comment thread backend/app/assessment/routes.py Outdated
Comment thread backend/app/assessment/utils/export.py Outdated
Comment thread backend/app/main.py Outdated
Comment thread backend/app/models/feature_flag.py Outdated
Comment thread backend/app/models/llm/request.py Outdated
- Added `service.py` to handle assessment evaluation orchestration, including starting and retrying assessments.
- Introduced utility functions for exporting assessment results in various formats (CSV, XLSX, JSON) in `export.py`.
- Created parsing utilities for handling batch results in `parsing.py`.
- Implemented validation for dataset file uploads in `validators.py`.
- Updated batch job model to include assessment type.
- Enhanced evaluation run model to reference parent assessments.
- Added new fields for reasoning effort and summary preferences in LLM request model.
@vprashrex vprashrex force-pushed the feature/assessment branch from b0f5463 to bce0944 on April 27, 2026 15:54
…n model

- Introduced AssessmentRun model to replace EvaluationRun for assessment-specific runs.
- Updated CRUD operations to manage Assessment and AssessmentRun entities.
- Refactored processing logic to accommodate new AssessmentRun structure.
- Adjusted logging and error handling to reflect changes in run management.
- Modified routes and services to utilize AssessmentRun for assessment evaluations.
- Enhanced export functionality to support new AssessmentRun model.
- Removed EvaluationRun references where applicable and ensured backward compatibility.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

♻️ Duplicate comments (3)
backend/app/assessment/dataset.py (2)

78-101: ⚠️ Potential issue | 🟠 Major

Resource leak: wb.close() is not in a finally block.

If iter_rows/next or the count comprehension raises (corrupt xlsx, unexpected types), control jumps to except at line 99 and wb.close() is never reached, leaking the underlying zip file handle. Move cleanup into finally (or use contextlib.closing).

🛡️ Proposed fix
 def _count_excel_rows(content: bytes) -> int:
     """Count data rows in an Excel file (excluding header)."""
+    wb = None
     try:
         import openpyxl
 
         wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
         ws = wb.active
         if ws is None:
             return 0
 
         rows_iter = ws.iter_rows(values_only=True)
         header = next(rows_iter, None)
         if header is None:
-            wb.close()
             return 0
 
-        count = sum(
+        return sum(
             1 for row in rows_iter if row and any(cell is not None for cell in row)
         )
-        wb.close()
-        return count
     except Exception as e:
         logger.warning(f"[_count_excel_rows] Failed to count rows | {e}")
         return 0
+    finally:
+        if wb is not None:
+            try:
+                wb.close()
+            except Exception:
+                pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/dataset.py` around lines 78 - 101, The
_count_excel_rows function can leak the workbook handle because wb.close() is
only called on normal return; ensure the workbook is always closed by moving
wb.close() into a finally block (or use contextlib.closing or openpyxl.Workbook
context manager) so that after creating wb = openpyxl.load_workbook(...) and
before returning from any path (including exceptions) the workbook is closed;
update the logic around ws, header, and the count comprehension to use this
guaranteed cleanup and remove duplicate close calls inside the try block.

140-163: ⚠️ Potential issue | 🟠 Major

Silent partial failure: dataset row persisted even when object-store upload fails.

_upload_file_to_object_store swallows exceptions and returns None, after which create_evaluation_dataset is invoked unconditionally — leaving an "EvaluationDataset" record whose object_store_url is None. Downstream batch processing (backend/app/assessment/batch.py::_load_dataset_rows raises ValueError(f"Dataset {dataset.id} has no object_store_url")) will then fail mysteriously, and the caller of upload_dataset has no signal of the failure.

Fail loudly on upload failure rather than persisting an orphan metadata row.

🛡️ Proposed behavior
     object_store_url = _upload_file_to_object_store(
         session=session,
         project_id=project_id,
         file_content=file_content,
         file_ext=file_ext,
         dataset_name=dataset_name,
     )
+    if object_store_url is None:
+        raise HTTPException(
+            status_code=502,
+            detail="Failed to upload dataset to object storage. Please retry.",
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/dataset.py` around lines 140 - 163, The code currently
calls _upload_file_to_object_store and then unconditionally creates the DB row
via create_evaluation_dataset even when the upload fails or returns None; change
upload_dataset so it treats upload failures as fatal: call
_upload_file_to_object_store inside a try/except, propagate or raise a clear
exception when the upload raises or when the returned object_store_url is
falsy/None, and only call create_evaluation_dataset when object_store_url is a
valid non-empty string; reference _upload_file_to_object_store and
create_evaluation_dataset to locate the change and ensure no EvaluationDataset
is persisted if the upload did not succeed.
backend/app/assessment/events.py (1)

13-48: ⚠️ Potential issue | 🟠 Major

Unbounded subscriber queues — slow/idle SSE clients can OOM the process.

asyncio.Queue() defaults to unlimited size, and put_nowait here never blocks/drops. A subscriber that connects but stops draining (e.g., its TCP send buffer is backed up) will accumulate every published assessment event indefinitely. Under cron-driven publish bursts across many subscribers this is a real memory-pressure / reliability issue.

Use a bounded queue with explicit overflow handling (drop oldest, or evict the slow subscriber).

🛡️ Proposed bounded queue with overflow drop
+_QUEUE_MAXSIZE = 256
+
 class AssessmentEventBroker:
     def __init__(self) -> None:
         self._subscribers: set[asyncio.Queue[dict]] = set()
 
     async def subscribe(self) -> AsyncIterator[str]:
-        queue: asyncio.Queue[dict] = asyncio.Queue()
+        queue: asyncio.Queue[dict] = asyncio.Queue(maxsize=_QUEUE_MAXSIZE)
         self._subscribers.add(queue)
@@
         for queue in list(self._subscribers):
-            queue.put_nowait(payload)
+            try:
+                queue.put_nowait(payload)
+            except asyncio.QueueFull:
+                logger.warning(
+                    "[publish] Subscriber queue full, dropping oldest | type=%s",
+                    payload.get("type"),
+                )
+                try:
+                    queue.get_nowait()
+                    queue.put_nowait(payload)
+                except (asyncio.QueueEmpty, asyncio.QueueFull):
+                    pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` around lines 13 - 48, The subscriber queues
are unbounded (asyncio.Queue()) so slow/idle SSE clients can cause OOM; fix by
creating bounded queues in subscribe (e.g., queue: asyncio.Queue[dict] =
asyncio.Queue(maxsize=XX)) and update publish to handle overflow: if
queue.put_nowait(payload) raises asyncio.QueueFull, either drop the oldest item
from that queue (queue.get_nowait() then retry put_nowait) or treat the
subscriber as dead by discarding the queue from self._subscribers and
logging/removing it; change symbols: modify subscribe (queue creation) and
publish (the for loop over self._subscribers and use of queue.put_nowait) to
implement the chosen overflow strategy and add appropriate debug/info logs.
🧹 Nitpick comments (18)
backend/app/api/routes/cron.py (2)

79-95: Be defensive about the shape of result before mutating it.

If process_all_pending_evaluations ever returns something other than a dict (e.g., None after an internal error path), result["assessment"] = ... will raise a TypeError outside the inner try/except, and the original error context will be lost. A quick isinstance(result, dict) guard (or initializing result = result or {}) keeps the merge robust.
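A short sketch of the guard, assuming result and assessment_result are the local names already used in the handler:

result = await process_all_pending_evaluations(session=session)
if not isinstance(result, dict):  # normalize unexpected return shapes before merging
    result = {}
result["assessment"] = assessment_result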

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/cron.py` around lines 79 - 95, The code mutates result
assuming it's a dict which can raise a TypeError if
process_all_pending_evaluations (or the variable result) is None or not a dict;
guard and normalize it before merging by checking isinstance(result, dict) (or
using result = result or {}) and then safely set result["assessment"] =
assessment_result and update totals from assessment_result; reference the
variables result and assessment_result and the surrounding merge block so the
normalization happens immediately before the lines that assign
result["assessment"] and update
total_processed/total_failed/total_still_processing.

71-77: Lazy import inside the request handler — surface the reason or hoist it.

The runtime from app.assessment.cron import poll_all_pending_assessment_evaluations inside the handler is typically a circular-import workaround. If that's the case, leave a brief comment explaining why; otherwise hoist the import to module scope so it's loaded once at startup rather than on every cron invocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/cron.py` around lines 71 - 77, The inline runtime
import of poll_all_pending_assessment_evaluations inside the cron request
handler should be resolved: either hoist the import to module scope so the
function is imported once at startup (remove the in-handler from
app.assessment.cron import ... and add it at top-level), or if this was done to
avoid a circular import, add a concise comment above the inline import
explaining the circular-import reason and why the import must remain lazy.
Ensure you keep the existing call to
poll_all_pending_assessment_evaluations(session=session) unchanged and reference
that symbol in your change.
backend/app/assessment/cron.py (1)

73-76: Stray adjacent string literals leave a needlessly split log message.

"[poll_all_pending_assessment_evaluations] " "No active assessments found" (and the same pattern on line 75) is a split literal left over from line wrapping; the two adjacent strings concatenate into a single message at runtime, so merge them into one literal for readability.

♻️ Proposed change
-        logger.info(
-            "[poll_all_pending_assessment_evaluations] " "No active assessments found"
-        )
+        logger.info(
+            "[poll_all_pending_assessment_evaluations] No active assessments found"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/cron.py` around lines 73 - 76, The log call in
poll_all_pending_assessment_evaluations uses two adjacent string literals
("[poll_all_pending_assessment_evaluations] " "No active assessments found")
which concatenates into a single message from a needlessly split literal; update the logger.info
invocation in cron.py to use a single combined string (e.g.,
"[poll_all_pending_assessment_evaluations] No active assessments found") or an
f-string so the prefix and message are one contiguous string with a single
space, ensuring the message reads cleanly.
backend/app/assessment/events.py (1)

26-26: Use the builtin TimeoutError alias.

Per Ruff UP041, asyncio.TimeoutError is a deprecated alias of the builtin TimeoutError (since Python 3.11). Since this repo targets 3.11+, use the builtin directly.

♻️ Proposed change
-                except asyncio.TimeoutError:
+                except TimeoutError:
                     yield ": keep-alive\n\n"
                     continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` at line 26, Replace the deprecated alias
asyncio.TimeoutError with the builtin TimeoutError in the exception handler
inside backend.app.assessment.events (the except block currently using
asyncio.TimeoutError); update the except clause in the function/method where
that try/except appears (look for the "except asyncio.TimeoutError:" line) to
use "except TimeoutError:" so it relies on the built-in exception instead of the
asyncio alias.
backend/app/alembic/versions/055_add_assessment_manager_table.py (1)

121-132: Consider DB-level server_default for inserted_at/updated_at.

inserted_at and updated_at are NOT NULL but have no server_default. The model relies on SQLModel's default_factory=now, so any direct SQL INSERT (e.g., from psql, a fixture script, or an Alembic data migration) without these columns set will fail. Adding server_default=sa.func.now() keeps both ORM and raw inserts safe.
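A sketch of what the column definitions could look like, assuming plain DateTime columns:

sa.Column(
    "inserted_at",
    sa.DateTime(),
    nullable=False,
    server_default=sa.func.now(),  # raw SQL INSERTs can now omit the column
),
sa.Column(
    "updated_at",
    sa.DateTime(),
    nullable=False,
    server_default=sa.func.now(),
    # server_onupdate=sa.func.now() documents intent, but PostgreSQL still needs a trigger
    # (or ORM-side onupdate) to refresh this on UPDATE.
),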

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/alembic/versions/055_add_assessment_manager_table.py` around
lines 121 - 132, The migration defines NOT NULL columns inserted_at and
updated_at without DB defaults, which breaks raw INSERTs; update the Column
definitions for "inserted_at" and "updated_at" to include a database server
default (server_default=sa.func.now()) and add server_onupdate=sa.func.now() for
"updated_at" so the DB will populate timestamps for direct SQL inserts and
automatic updates from the database side.
backend/app/assessment/service.py (4)

10-10: Importing a private symbol (_resolve_config) across modules.

The leading underscore signals _resolve_config is batch.py's implementation detail; importing it from service.py couples the two modules to that internal contract. Either rename it (e.g., resolve_config) and treat it as a real public helper, or move it to a shared assessment._utils / assessment.config module so both call sites depend on a stable name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` at line 10, service.py is importing the
private function _resolve_config from batch.py which couples modules to an
internal symbol; change this by promoting the helper to a public API (rename
_resolve_config to resolve_config) or move it into a shared module (e.g.,
assessment.config or assessment_utils) and update both callers to import the new
public name, then remove the underscore import from batch.py and service.py and
ensure submit_assessment_batch and any other callers reference the new
resolve_config location.

145-193: start_assessment does multi-row creates/updates without an explicit transactional boundary — partial state on failure.

Each create_assessment, create_assessment_run, and update_assessment_run_status commits on its own (per the CRUD helpers). If something between iterations raises an unhandled error (e.g., a network blip during submit_assessment_batch that escapes the try, or a DB error during recompute_assessment_status), the parent Assessment plus a partial set of AssessmentRun rows will be persisted with the user receiving a 500. The cron will then try to process orphaned pending/processing rows.

Consider either (a) wrapping the orchestration in an explicit transaction with session.begin() plus a single final commit, or (b) ensuring the failure path here marks the whole parent assessment as failed with a clear error_message so cron + the UI can reason about it. The current per-iteration try/except only handles submit_assessment_batch failures, not the surrounding CRUD ops.
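A sketch of option (b), assuming the parent Assessment row has status and error_message fields; _submit_all_configured_runs is a hypothetical extraction of the existing per-config loop:

def _mark_assessment_failed(session, assessment, message: str) -> None:
    # Put the parent into a terminal, explainable state so cron and the UI
    # do not pick up orphaned pending/processing runs.
    assessment.status = "failed"
    assessment.error_message = message
    session.add(assessment)
    session.commit()

# in start_assessment:
try:
    runs = _submit_all_configured_runs(session, assessment, payload)
except Exception:
    _mark_assessment_failed(session, assessment, "Batch submission failed during setup")
    raise
recompute_assessment_status(session=session, assessment_id=assessment.id)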

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 145 - 193, The
start_assessment flow performs multiple DB creates/updates (create_assessment,
create_assessment_run, update_assessment_run_status,
recompute_assessment_status) without an explicit transaction, allowing partial
commits if an error occurs (e.g., during submit_assessment_batch or
recompute_assessment_status); wrap the multi-step orchestration in a single
transactional boundary (use session.begin() / session.begin_nested() as
appropriate) so all create/update calls are atomic and commit once at the end,
or alternatively, on any uncaught exception ensure you mark the parent
Assessment as failed (update_assessment_run_status + update parent assessment
status and error_message via the same session) before re-raising/returning to
avoid leaving orphaned pending/processing runs; locate these changes in
start_assessment and the exception handling around
submit_assessment_batch/recompute_assessment_status to implement the chosen
approach.

159-195: Per-iteration recompute_assessment_status is redundant and multiplies the DB writes on the parent row.

Each loop iteration calls recompute_assessment_status on the success path (line 178) or the failure path (line 191), then there's a final one at line 195. For an N-config assessment, that's N+1 recomputes (each does a SELECT over runs + an UPDATE/COMMIT on the parent). The final one alone is sufficient since the parent's status only needs to be correct after the loop completes — none of the per-iteration values are read between iterations.

Drop the inner calls and rely on line 195.

♻️ Proposed change
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="processing",
                 batch_job_id=batch_job.id,
                 total_items=batch_job.total_items,
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)
 
         except Exception as e:
             ...
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="failed",
                 error_message="Batch submission failed. Please try again or contact support.",
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)
 
         runs.append(run)
 
     recompute_assessment_status(session=session, assessment_id=assessment.id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 159 - 195, The loop currently
calls recompute_assessment_status inside both the try (after
update_assessment_run_status) and the except blocks, causing redundant DB
writes; remove the per-iteration calls to recompute_assessment_status within the
try/except and keep only the final recompute_assessment_status after the loop
completes so the parent's status is updated once; locate the calls around
submit_assessment_batch, update_assessment_run_status, and the except block
handling the exception for run processing and delete those inner
recompute_assessment_status invocations while leaving the final
recompute_assessment_status(session=session, assessment_id=assessment.id) at the
end.

51-62: UUID(str(run.config_id)) is a redundant round-trip.

AssessmentRun.config_id is already typed as UUID | None per the model, and the if not run.config_id guard above ensures it's a UUID here. UUID(str(uuid_obj)) is a no-op conversion.

♻️ Proposed change
         configs.append(
             AssessmentConfigRef(
-                config_id=UUID(str(run.config_id)),
+                config_id=run.config_id,
                 config_version=run.config_version,
             )
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 51 - 62, The code is
redundantly wrapping an already-UUID config_id with UUID(str(...)) in the loop
building AssessmentConfigRef; change the AssessmentConfigRef call to use the
existing UUID directly (e.g. config_id=run.config_id) and if the type checker
complains, assert or cast the non-None type before use (e.g. assert
run.config_id is not None then use run.config_id) so you remove the unnecessary
UUID(str(run.config_id)) round-trip; update the instantiation in the loop that
creates AssessmentConfigRef accordingly.
backend/app/assessment/mappers.py (3)

10-22: Type hint inaccuracy: normalize_llm_text can return None.

The signature is (text: str) -> str, but lines 11–12 return text unchanged when text is falsy or non-str — so a None input yields None. Callers at lines 100 and 185 pass kaapi_params.get("instructions") (possibly None) and rely on this. Tighten the annotation.

♻️ Proposed change
-def normalize_llm_text(text: str) -> str:
-    if not isinstance(text, str) or not text:
+def normalize_llm_text(text: str | None) -> str | None:
+    if not isinstance(text, str) or not text:
         return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 10 - 22, The type hint for
normalize_llm_text is inaccurate because it can accept and return None; update
its annotation to accept Optional[str] and return Optional[str] (from typing
import Optional), i.e. change def normalize_llm_text(text: str) -> str to def
normalize_llm_text(text: Optional[str]) -> Optional[str], and keep the existing
early-return behavior so callers that pass kaapi_params.get("instructions")
(which may be None) are correctly typed.

25-59: Strict-schema/strip-schema helpers don't recurse into anyOf/oneOf/allOf.

_ensure_openai_strict_schema and _strip_additional_properties only walk properties and items. JSON Schemas that compose subschemas via anyOf/oneOf/allOf (or nested under $defs) won't have additionalProperties: false injected (OpenAI strict mode will reject them) and won't have additionalProperties stripped on the Google side. Worth handling these for parity with the OpenAI strict-schema spec.
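One way to cover the composition keywords is a shared recursive walker; a sketch (helper names are illustrative):

def _walk_schema(node, transform) -> None:
    if isinstance(node, dict):
        transform(node)
        for key in ("properties", "$defs", "definitions"):
            for sub in (node.get(key) or {}).values():
                _walk_schema(sub, transform)
        if "items" in node:
            _walk_schema(node["items"], transform)
        for key in ("anyOf", "oneOf", "allOf"):
            for sub in node.get(key) or []:
                _walk_schema(sub, transform)
    elif isinstance(node, list):
        for sub in node:
            _walk_schema(sub, transform)

def _enforce_openai_strict(node: dict) -> None:
    if node.get("type") == "object":
        node["additionalProperties"] = False

# _ensure_openai_strict_schema can then call _walk_schema(schema, _enforce_openai_strict),
# and the Google-side strip can pass a transform that pops "additionalProperties".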

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 25 - 59, Both
_ensure_openai_strict_schema and _strip_additional_properties only recurse into
"properties" and "items", so they miss subschemas under composition keywords and
definitions; update both functions to also traverse "anyOf", "oneOf", "allOf"
(iterating and recursively processing each subschema when they're lists) and
"$defs" / "definitions" (recursively processing dict values), ensuring you treat
nested dicts/lists consistently (use _ensure_openai_strict_schema in the former
and _strip_additional_properties in the latter) so additionalProperties is
injected or removed throughout composed/nested schemas.

5-5: Avoid depending on google.genai._transformers (private module).

The leading underscore signals it's not part of the public API; minor SDK upgrades can rename, restructure, or remove it without notice, breaking assessment schema conversion silently. The Google Genai SDK now natively supports JSON Schema via the documented response_json_schema parameter (announced Nov 2025)—pass raw JSON Schema directly without converting through t_schema. Alternatively, hand-roll the conversion in _convert_json_schema_to_google by implementing the transformation logic (uppercase types, resolve $ref from $defs, convert anyOf[null, X] to nullable) and pin the SDK version explicitly.
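If the hand-rolled route is taken, a partial sketch of the conversion, assuming the Gemini schema shape of uppercase type names plus a nullable flag ($ref/$defs resolution is omitted here):

def _convert_json_schema_to_google(schema: dict) -> dict:
    subs = schema.get("anyOf") or []
    non_null = [s for s in subs if s.get("type") != "null"]
    if subs and len(non_null) == 1 and len(non_null) < len(subs):
        out = _convert_json_schema_to_google(non_null[0])  # anyOf[X, null] -> nullable X
        out["nullable"] = True
        return out

    out: dict = {}
    if "type" in schema:
        out["type"] = str(schema["type"]).upper()  # "string" -> "STRING"
    if "enum" in schema:
        out["enum"] = list(schema["enum"])
    if "required" in schema:
        out["required"] = list(schema["required"])
    if "items" in schema:
        out["items"] = _convert_json_schema_to_google(schema["items"])
    if "properties" in schema:
        out["properties"] = {
            name: _convert_json_schema_to_google(sub)
            for name, sub in schema["properties"].items()
        }
    return out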

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` at line 5, The import of
google.genai._transformers in backend.app.assessment.mappers should be removed
and the assessment->GenAI schema flow updated to not rely on the private module;
instead pass the raw JSON Schema into the SDK's documented response_json_schema
parameter when building requests, or implement a robust internal converter
_convert_json_schema_to_google that performs the necessary transformations
(uppercase/openAI-style type names, resolve $ref entries from $defs, convert
anyOf [ "null", X ] into a nullable representation) and pin the SDK version if
you must keep an internal transformer; update any references in this file (e.g.,
where genai_transformers was used) to call response_json_schema or the new
_convert_json_schema_to_google helper.
backend/app/assessment/batch.py (2)

289-304: Shallow copy of openai_params shares nested text/reasoning/tools dicts across all rows.

body = dict(openai_params) is a shallow copy: every JSONL line built in this loop ends up referencing the same nested text.format.json_schema, reasoning, and tools objects. Today nothing mutates them after this point, but any future code that does (e.g. per-row schema override) would silently affect every other row.

Use copy.deepcopy(openai_params) here (and on the Gemini side at the equivalent assignment) if these payloads are ever expected to diverge per row.
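A minimal sketch of the per-row copy (everything besides openai_params is illustrative):

import copy

for row in rows:
    body = copy.deepcopy(openai_params)  # nested text/reasoning/tools dicts are no longer shared
    body["input"] = build_row_input(row)  # hypothetical per-row payload
    jsonl_data.append({"custom_id": row["row_id"], "body": body})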

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 289 - 304, The current shallow
copy "body = dict(openai_params)" causes nested objects (like
text.format.json_schema, reasoning, tools) to be shared across all rows; replace
that with a deep copy (use copy.deepcopy(openai_params)) when constructing body
inside the loop so each JSONL row gets an independent payload; apply the same
deep-copy pattern at the equivalent Gemini-side assignment as well to avoid
cross-row mutation later (look for the "body" variable, "openai_params", and the
jsonl_data.append block/BATCH_KEY usage to locate the spots to change).

138-161: Placeholder-in-value injection in _build_text_prompt.

prompt.replace("{col}", row[col]) replaces all occurrences in sequence. If a column value itself contains another placeholder (e.g., column name has the literal text {policy}), and policy is also a substituted column processed later, the value gets re-expanded. Equivalent to chained substitution. For Indic-text or user-uploaded data this is unlikely, but worth either documenting the contract or using str.format_map with a defaultdict to do a single-pass substitution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 138 - 161, The
_build_text_prompt function currently does sequential replace which allows
placeholder-in-value re-expansion (e.g., a column value containing "{other}"
gets re-substituted when that column is processed later); update
_build_text_prompt to perform a single-pass, safe substitution by building a
mapping of column->normalized value and using a single-call formatting approach
(e.g., str.format_map with a default/missing-safe dict) instead of iterative
prompt.replace, ensuring you still call normalize_llm_text for each value before
substitution; keep the original behavior for the no-template branch unchanged.
backend/app/assessment/models.py (1)

32-36: Primary-key fields are typed as non-optional but default to None.

id: int = SQLField(default=None, primary_key=True) is internally inconsistent with strict type checking — the value is None until the DB assigns it. SQLModel's idiomatic form is id: int | None = SQLField(default=None, primary_key=True).

Low-priority cleanup; behavior is unchanged.

Also applies to: 144-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/models.py` around lines 32 - 36, The primary-key
fields (e.g., the id field declared as id: int = SQLField(default=None,
primary_key=True)) are typed non-optional while defaulting to None; update the
type annotations to allow None (use int | None or Optional[int]) so the
signature matches the SQLField(default=None) behavior, e.g., change id: int to
id: int | None = SQLField(...); apply the same fix to the other primary-key
field(s) with the same pattern elsewhere in the file (the block around the
second id declaration).
backend/app/assessment/crud.py (1)

255-315: Recompute is read‑modify‑write without locking — concurrent updates can leave stale counters.

recompute_assessment_status reads child runs, computes counters, and writes them back to the parent. If two workers run this concurrently for the same assessment_id (e.g. cron + an SSE-triggered recompute), the later commit overwrites the earlier one based on a possibly stale snapshot of child rows.

Today's call sites appear to be single-threaded (cron + sequential processing), so this is informational. If parallel polling is ever introduced, consider SELECT ... FOR UPDATE on the parent row, or recompute counters in SQL with a single UPDATE ... FROM (SELECT ...) aggregation.
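If parallel polling does land, a sketch of the locking variant (the counter fields here are assumptions, not the actual model attributes):

from sqlmodel import select

def recompute_assessment_status(*, session, assessment_id: int) -> None:
    assessment = session.exec(
        select(Assessment).where(Assessment.id == assessment_id).with_for_update()
    ).one()  # row lock held until commit, so concurrent recomputes serialize
    runs = session.exec(
        select(AssessmentRun).where(AssessmentRun.assessment_id == assessment_id)
    ).all()
    assessment.completed_runs = sum(1 for r in runs if r.status == "completed")
    assessment.failed_runs = sum(1 for r in runs if r.status == "failed")
    session.add(assessment)
    session.commit()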

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 255 - 315,
recompute_assessment_status currently does a read-modify-write on the Assessment
row which can be lost under concurrent recomputes; to fix, acquire a row lock on
the parent Assessment (e.g., SELECT ... FOR UPDATE) before recomputing in
recompute_assessment_status or replace the logic with a single atomic UPDATE ...
FROM (SELECT ...) aggregation that computes totals from AssessmentRun and
updates Assessment in one statement; reference the recompute_assessment_status
function and the Assessment/AssessmentRun models and ensure you commit while
holding the lock or use a transaction so concurrent workers cannot overwrite
each other's results.
backend/app/assessment/utils/export.py (1)

395-417: assessment_id=run.assessment_id or 0 masks orphan rows.

Coercing a missing assessment_id to 0 produces an export row that looks like it belongs to assessment 0. If the schema invariant is that exported runs always have a parent, fail loud instead; otherwise relax AssessmentExportRow.assessment_id to int | None in models.py and pass the real value through.
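A sketch of the fail-loud variant (option A):

if run.assessment_id is None:
    raise ValueError(
        f"AssessmentRun {run.id} has no parent assessment; refusing to export"
    )
# ...then build the row with the real value:
# AssessmentExportRow(assessment_id=run.assessment_id, ...)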

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 395 - 417, The export is
masking missing parent relationships by coercing assessment_id to 0; update the
export to pass the real value (use assessment_id=run.assessment_id) and either
(A) if a missing parent should be an error, add an explicit check in the export
routine (raise ValueError or similar when run.assessment_id is None) before
building AssessmentExportRow, or (B) if exports may legitimately have no parent,
update AssessmentExportRow.assessment_id type in models.py to allow None (int |
None) and keep passing run.assessment_id through; locate the construction of
AssessmentExportRow in export_rows.append and implement one of these two fixes
around that symbol.
backend/app/assessment/processing.py (1)

482-486: LGTM on the wrapper, but the local import is doing real work — call it out.

Importing poll_all_pending_assessment_evaluations inside the function avoids a circular import with app.assessment.cron. A short inline comment will save the next maintainer a debugging session if they "clean up" the import to module scope.
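Something as small as the comment itself does the job; a sketch, assuming the cycle runs through app.assessment.cron as described:

# Local import on purpose: app.assessment.cron imports from this module, so a
# module-level import here would be circular. Do not hoist.
from app.assessment.cron import poll_all_pending_assessment_evaluations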

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` around lines 482 - 486, The local
import of poll_all_pending_assessment_evaluations inside
poll_all_pending_assessments is intentional to avoid a circular import; add a
concise inline comment above the from app.assessment.cron import
poll_all_pending_assessment_evaluations line explaining that the import must
remain local (not module-level) to prevent circular imports and to warn future
maintainers not to move it to top-level.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/alembic/versions/055_add_assessment_manager_table.py`:
- Around line 15-16: The migration's revision identifier is not zero-padded and
mismatches the filename and down_revision; update the module-level variable
revision from "55" to "055" so it matches the filename and existing
down_revision = "054", keeping the three-digit convention used across migrations
and ensuring subsequent revisions align.

In `@backend/app/assessment/batch.py`:
- Around line 107-135: The workbook created in _parse_excel_rows (wb returned by
openpyxl.load_workbook) may never be closed if iter_rows, next(header) or the
row-to-dict conversion raises; wrap the workbook usage in a try/finally (or
context manager if available) so wb.close() is always called in the finally
block after opening wb and before returning, and move any early returns (e.g.,
when ws is None or header is None) into that try so they still hit the finally
cleanup; ensure variables referenced (wb, ws, rows_iter, header, result) are
initialized inside the try scope.
- Around line 481-569: The code is defaulting unknown provider_name to "openai"
via _NATIVE_PROVIDERS.get(..., "openai"), causing unknown providers to silently
use the OpenAI path; change this to fail fast by using
_NATIVE_PROVIDERS.get(provider_name) (no default) or explicitly checking if
provider_name not in _NATIVE_PROVIDERS and raising ValueError, then use
base_provider for the if/elif branches (referencing the variables base_provider
and provider_name and the mapping _NATIVE_PROVIDERS); ensure the unreachable
else that raises ValueError remains reachable so start_batch_job is only called
for known providers.

In `@backend/app/assessment/cron.py`:
- Around line 144-182: The cleanup steps inside the broad except should be
isolated so a secondary DB error can't abort the whole cron loop: wrap the calls
to update_assessment_run_status, recompute_assessment_status, the
construction/appending of failure_result, assessment_event_broker.publish (and
the failed += 1) in their own try/except block; on exception catch and
logger.error with context (include run.id/run.assessment_id/run.run_name) and
exc_info=True, then continue the for-loop without re-raising so one failed
cleanup won't stop processing remaining runs.
- Around line 41-61: The payload returned by _build_callback_payload nests
"type" inside APIResponse.success_response(...).data so events.py's subscribe()
(which reads payload.get("type", "message")) always falls back to "message"; fix
by promoting the event type to the root of the published payload: in
_build_callback_payload (or immediately after calling
APIResponse.success_response(...).model_dump()) ensure the returned dict
contains a top-level "type": "assessment.child_status_changed" key (while
keeping the detailed run/assessment info under "data"), so subscribe() can
detect the event without changing events.py.

In `@backend/app/assessment/crud.py`:
- Around line 192-223: The update_assessment_run_status function currently
treats None as "do not change", preventing callers from clearing fields; fix
this by introducing a unique sentinel (e.g., UNSET = object()) as the default
for error_message, batch_job_id, total_items, and object_store_url, then change
the checks to "is not UNSET" and assign the value (which may be None) to
run.error_message, run.batch_job_id, run.total_items, and run.object_store_url
inside update_assessment_run_status so callers can explicitly clear fields by
passing None; update any call sites accordingly to use the sentinel omission vs
explicit None semantics.

In `@backend/app/assessment/processing.py`:
- Line 253: Update the log_prefix to include the current function name so all
subsequent logger calls follow the project's convention; replace the line
setting log_prefix = f"[assessment_run={run.id}]" with something like log_prefix
= f"[{inspect.currentframe().f_code.co_name}][assessment_run={run.id}]" (and add
"import inspect" if missing) so functions like the one containing run.id will
prefix logs with the function name and the assessment_run id used in
logger.info/logger.error calls.
- Around line 198-231: In the Google/Gemini branch (the block handling
provider_name in ("google","google-native")), populate the "usage" field from
the response instead of hard-coding None (e.g., extract response.get("usage")
or, if Gemini batch responses use a different shape, parse the batch structure
to pull usage/metadata) and treat an empty string from
extract_text_from_response_dict(response) as an error by setting error="Empty
response output" (mirror the Anthropic branch behavior); update the logic around
response, text = extract_text_from_response_dict(response) and the
results.append entries to include the extracted usage and to set error when text
is empty.

In `@backend/app/assessment/utils/export.py`:
- Around line 60-66: The expansion loop in export.py silently overwrites input
columns that share names with export fields (e.g. output, error, row_id, run_id,
result_status, response_id) by setting new_row[k] from input_data then calling
new_row.update(row); modify the loop that builds new_row (the variables
row_payload, input_data, input_keys, new_row are present) to detect collisions
against a reserved_fields set and instead namespace conflicting input keys (e.g.
store as "input_<k>") or, alternatively, emit a single warning when any
collision is detected; ensure namespacing logic also avoids clobbering an
existing "input_<k>" in row and preserves original row fields when calling
new_row.update(row).
- Around line 422-433: The sort_export_rows function currently sorts by
row.row_id lexicographically which makes "row_10" come before "row_2"; update
sort_export_rows (and any callers expecting stable order like
load_export_rows_for_run / AssessmentExportRow) to extract the numeric index
from row.row_id (e.g. parse the suffix after "row_") and use that integer in the
sort key instead of the raw string, keeping the existing tie-breakers
(row.config_version or 0, row.run_id); ensure the parser handles missing/invalid
suffixes by falling back to 0 so sorting remains stable.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 072190d8-1f9b-4ad9-9f98-d507cb197d57

📥 Commits

Reviewing files that changed from the base of the PR and between ec2eeda and f2e512e.

📒 Files selected for processing (21)
  • backend/app/alembic/versions/055_add_assessment_manager_table.py
  • backend/app/api/main.py
  • backend/app/api/routes/cron.py
  • backend/app/assessment/__init__.py
  • backend/app/assessment/batch.py
  • backend/app/assessment/cron.py
  • backend/app/assessment/crud.py
  • backend/app/assessment/dataset.py
  • backend/app/assessment/events.py
  • backend/app/assessment/mappers.py
  • backend/app/assessment/models.py
  • backend/app/assessment/processing.py
  • backend/app/assessment/routes.py
  • backend/app/assessment/service.py
  • backend/app/assessment/utils/__init__.py
  • backend/app/assessment/utils/export.py
  • backend/app/assessment/utils/parsing.py
  • backend/app/assessment/validators.py
  • backend/app/models/__init__.py
  • backend/app/models/batch_job.py
  • backend/app/models/llm/request.py
✅ Files skipped from review due to trivial changes (5)
  • backend/app/models/batch_job.py
  • backend/app/assessment/__init__.py
  • backend/app/api/main.py
  • backend/app/assessment/utils/parsing.py
  • backend/app/models/__init__.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • backend/app/models/llm/request.py
  • backend/app/assessment/utils/__init__.py
  • backend/app/assessment/routes.py

Comment thread backend/app/alembic/versions/055_add_assessment_manager_table.py Outdated
Comment thread backend/app/assessment/batch.py Outdated
Comment thread backend/app/assessment/batch.py Outdated
Comment thread backend/app/assessment/cron.py Outdated
Comment thread backend/app/assessment/cron.py Outdated
Comment thread backend/app/crud/assessment/core.py
Comment thread backend/app/assessment/processing.py Outdated
Comment thread backend/app/assessment/processing.py Outdated
Comment thread backend/app/services/assessment/utils/export.py
Comment thread backend/app/services/assessment/utils/export.py
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/assessment/dataset.py`:
- Around line 22-26: Remove support for legacy .xls by deleting the ".xls" entry
from the _MIME_TYPES dict and from ALLOWED_EXTENSIONS in validators.py so .xls
files are rejected at validation (do not route them to openpyxl). Also update
the validation routine that checks ALLOWED_EXTENSIONS to explicitly reject
".xls" (and return a clear validation error). Finally, harden the parsing code
that calls openpyxl (the read/parse functions in dataset.py and batch.py where
openpyxl is invoked) by replacing the broad silent except that returns 0 rows
with either re-raising the caught InvalidFileException or logging the exception
and returning a proper error response so legacy-file parsing failures aren’t
swallowed.
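
A sketch of the suggested tightening, assuming helper names like validate_extension and parse_xlsx_rows (the real functions in validators.py/dataset.py may be named differently):

    from pathlib import Path

    from fastapi import HTTPException
    from openpyxl import load_workbook
    from openpyxl.utils.exceptions import InvalidFileException

    ALLOWED_EXTENSIONS = {".csv", ".xlsx"}  # legacy .xls removed

    def validate_extension(filename: str) -> str:
        ext = Path(filename).suffix.lower()
        if ext == ".xls":
            raise HTTPException(
                status_code=400,
                detail="Legacy .xls files are not supported; upload .xlsx or .csv instead",
            )
        if ext not in ALLOWED_EXTENSIONS:
            raise HTTPException(status_code=400, detail=f"Unsupported file type: {ext}")
        return ext

    def parse_xlsx_rows(path: str) -> list[dict]:
        try:
            workbook = load_workbook(path, read_only=True)
        except InvalidFileException:
            # Re-raise instead of silently returning 0 rows so the caller sees the failure.
            raise
        sheet = workbook.active
        rows = sheet.iter_rows(values_only=True)
        header = [str(value) for value in next(rows)]
        return [dict(zip(header, row)) for row in rows]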

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f0dd174-276b-4d14-9951-73c8a226e4b3

📥 Commits

Reviewing files that changed from the base of the PR and between f2e512e and 0c04860.

📒 Files selected for processing (3)
  • backend/app/assessment/batch.py
  • backend/app/assessment/dataset.py
  • backend/app/models/llm/request.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/assessment/batch.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/models/llm/request.py

Comment thread backend/app/services/assessment/dataset.py
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (7)
backend/app/assessment/events.py (2)

36-48: Tighten the payload type annotation.

dict without parameters loses information for callers and static analyzers. Since payloads are JSON-serialized and the type field is read via .get("type"), prefer dict[str, Any] to match the queue type and align with the Python 3.11+ typing convention used elsewhere in the codebase.

♻️ Proposed fix
-    def publish(self, payload: dict) -> None:
+    def publish(self, payload: dict[str, Any]) -> None:

Add the import at the top:

-from collections.abc import AsyncIterator
+from collections.abc import AsyncIterator
+from typing import Any

Optionally also update the queue annotation on Line 13 / 16 to asyncio.Queue[dict[str, Any]] for consistency.

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code" and "Use Python 3.11+ with type hints throughout the codebase".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` around lines 36 - 48, Update the publish
method signature to use a precise payload type: change payload: dict to payload:
dict[str, Any] and add from typing import Any to the module imports; also update
any asyncio.Queue annotations used for subscribers (e.g.,
asyncio.Queue[dict[str, Any]] or the _subscribers collection element type) so
the queue and publish agree on dict[str, Any] payloads, keeping the logic
unchanged.

24-24: Use builtin TimeoutError instead of the deprecated asyncio.TimeoutError alias.

Since Python 3.11, asyncio.TimeoutError is a deprecated alias for the builtin TimeoutError (per Ruff UP041). Given the codebase targets Python 3.11+, prefer the builtin.

♻️ Proposed fix
-                except asyncio.TimeoutError:
+                except TimeoutError:
                     yield ": keep-alive\n\n"
                     continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` at line 24, Replace the deprecated
asyncio.TimeoutError alias with the builtin TimeoutError in the exception
handler: change the except clause currently written as "except
asyncio.TimeoutError" in backend/app/assessment/events.py to "except
TimeoutError" and remove any now-unused reference to asyncio if that's the only
usage in this module; ensure the surrounding function/method (the exception
handling block in events.py) still behaves the same.
backend/app/tests/api/routes/test_cron.py (1)

28-47: Consider asserting the new assessment merge behavior and aligning the mock shape with the production contract.

The three updated tests patch poll_all_pending_assessment_evaluations but never assert on the new behavior the cron handler adds (merging assessment counts into the totals and exposing result["assessment"]). With all assessment counts mocked to 0, the existing total_* assertions would still pass even if the merge logic in evaluation_cron_job regressed (e.g., assessment counts ignored, key renamed, exception swallowed into assessment_error). Adding at least one case where assessment counts are non-zero would actually exercise the new code path this PR is testing.

Additionally, the real poll_all_pending_assessment_evaluations always returns total and details alongside the three counters (see backend/app/assessment/cron.py lines 189-195). The mock omitting those fields drifts from the production contract; if a future change reads assessment["details"] or assessment["total"], these tests will silently keep passing.

♻️ Suggested tightening (apply to one or more scenarios)
     with patch(
         "app.api.routes.cron.process_all_pending_evaluations",
         new=AsyncMock(return_value=mock_result),
     ), patch(
         "app.assessment.cron.poll_all_pending_assessment_evaluations",
         new=AsyncMock(
-            return_value={"processed": 0, "failed": 0, "still_processing": 0}
+            return_value={
+                "total": 2,
+                "processed": 1,
+                "failed": 1,
+                "still_processing": 0,
+                "details": [],
+            }
         ),
     ):
         response = client.get(
             f"{settings.API_V1_STR}/cron/evaluations",
             headers={"X-API-KEY": superuser_api_key.key},
         )

     assert response.status_code == 200
     data = response.json()
     assert data["status"] == "success"
-    assert data["total_processed"] == 5
-    assert data["total_failed"] == 0
-    assert data["total_still_processing"] == 1
+    # Verify assessment counts are merged into the aggregated totals
+    assert data["total_processed"] == 5 + 1
+    assert data["total_failed"] == 0 + 1
+    assert data["total_still_processing"] == 1 + 0
+    assert data["assessment"]["processed"] == 1
+    assert data["assessment"]["failed"] == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_cron.py` around lines 28 - 47, Update the
tests that patch poll_all_pending_assessment_evaluations to return a realistic
shape (include "processed", "failed", "still_processing" with at least one
non-zero count plus the production fields "total" and "details") and then assert
that evaluation_cron_job/cron handler merges those assessment counts into the
response (check response.json()["assessment"] exists and that
total_processed/total_failed/total_still_processing include the assessment
counts). Ensure the mock uses AsyncMock(return_value=...) for
poll_all_pending_assessment_evaluations and add assertions referencing the
merged totals and the assessment sub-dictionary.
backend/app/assessment/crud.py (1)

1-12: CRUD module should be placed under backend/app/crud/ per project convention.

This file is located at backend/app/assessment/crud.py, but the project convention places CRUD modules in backend/app/crud/ (as noted in learnings and confirmed by the existing backend/app/crud/ directory). If your team is intentionally adopting feature-folder organization for the assessment subsystem, document this exception in CLAUDE.md or contributor guidelines to keep the convention consistent for reviewers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 1 - 12, The CRUD module for
Assessment is in the wrong location; move backend/app/assessment/crud.py into
the shared CRUD package under backend/app/crud/ (keep filename crud.py) so that
the Assessment and AssessmentRun operations live alongside other CRUD modules;
update any imports that reference app.assessment.crud to app.crud.crud (or
equivalent) and ensure logger = logging.getLogger(__name__) and references to
Assessment and AssessmentRun continue to resolve; if you intentionally want
feature-folder organization instead, add a clear exception note to CLAUDE.md or
the contributor guidelines describing why assessment CRUD lives under
app/assessment and list the impacted symbols (Assessment, AssessmentRun,
functions in crud.py) so reviewers are aware.
backend/app/assessment/mappers.py (3)

117-118: Dead branch — "null" is no longer a valid summary value.

TextLLMParams.summary now only accepts "auto" | "detailed" | "concise" | None. The summary == "null" check can never be True for validated params and only adds confusion. Drop the special-case.

♻️ Suggested simplification
-        if summary is not None:
-            reasoning_payload["summary"] = None if summary == "null" else summary
+        if summary is not None:
+            reasoning_payload["summary"] = summary
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 117 - 118, The `"null"`
branch is dead because TextLLMParams.summary no longer allows "null"; inside the
block that checks `if summary is not None` (where `reasoning_payload` and
`summary` are used) remove the `summary == "null"` special-case and assign the
`summary` value directly (e.g., set `reasoning_payload["summary"]` to
`summary`), leaving None handling as-is; update any related comments if present.

96-98: Implicit aliasing of reasoning → effort is non-obvious.

effort = kaapi_params.get("effort") or reasoning quietly promotes TextLLMParams.reasoning (Literal low/medium/high) into the OpenAI reasoning.effort slot when effort isn't set. The two fields have different declared enums and semantics (reasoning is described as “Reasoning configuration or instructions,” while effort is the OpenAI-specific reasoning effort). At minimum add a short comment explaining the fallback so future readers don’t conflate them; ideally, decide whether reasoning is a deprecated alias for effort and document/log it.
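
A hedged sketch of option (b); the function name in the log prefix is assumed, the rest follows the variables named above:

    effort = kaapi_params.get("effort")
    if effort is None and reasoning is not None:
        # Treat "reasoning" (low/medium/high) as a legacy alias for the
        # OpenAI-specific reasoning.effort and make the aliasing visible.
        logger.warning(
            "[kaapi_to_openai_params] 'effort' not set; using 'reasoning'=%s as the reasoning effort",
            reasoning,
        )
        effort = reasoning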

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 96 - 98, The current fallback
"effort = kaapi_params.get('effort') or reasoning" silently reuses
TextLLMParams.reasoning for the OpenAI-specific reasoning.effort, which
conflates different enums/semantics; update the mapping in
backend/app/assessment/mappers.py to make this explicit by either (a) adding a
clear comment above the lines referencing kaapi_params explaining that reasoning
is used as a fallback for effort and why, or (b) preferably implement an
explicit conversion/logging path: if kaapi_params.get("effort") is missing but
reasoning is present, map/translate reasoning to the allowed effort values (or
log a deprecation/warning that reasoning is being used as an alias) before
assigning to effort; reference the symbols kaapi_params, reasoning, effort and
ensure any enum translation or warning message clarifies the intended semantics.

1-7: Unused logger.

logger = logging.getLogger(__name__) is defined but never used in this module. Either drop it, or replace the silent warnings.append(...) calls with structured logs prefixed with the function name (per the repo logging convention) so warnings are observable in the runtime, not only in the returned tuple.

As per coding guidelines: “Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message ...")”.
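
For illustration, one of the silent warnings rewritten in that convention (the function name and message wording are made up):

    logger.warning(
        "[kaapi_to_openai_params] Unsupported parameter %r ignored for model %s",
        param_name,
        model,
    )
    warnings.append(f"Unsupported parameter {param_name!r} ignored")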

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 1 - 7, The module defines
logger = logging.getLogger(__name__) but never uses it; update the module to
either remove the unused logger variable or (preferred) replace the silent
warnings.append(...) calls in the functions in this file with structured logging
using that logger (e.g., logger.warning) and follow repo convention by prefixing
messages with the function name in square brackets (use the function's actual
name in the message), e.g., logger.warning(f"[function_name] descriptive
warning..."); keep the existing return tuple behavior but emit warnings via
logger instead of only appending to the warnings list so runtime logs are
observable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/assessment/crud.py`:
- Around line 270-289: Remove the dead "in_progress" branch from status checks:
anywhere the code computes processing_runs or inspects AssessmentRun.status
using r.status in {"processing", "in_progress"} (e.g., the processing_runs
calculation and the other conditional in the same module) change the set to just
"processing". Update the expressions that sum or filter runs (symbol:
processing_runs) and any related conditionals that pass values into
_determine_assessment_status so they only check for "processing" to eliminate
the unreachable "in_progress" case.

In `@backend/app/assessment/mappers.py`:
- Around line 203-213: Remove the silent passthrough of the unused "reasoning"
field (delete the assignment to google_params["reasoning"]) and instead emit
warnings when unsupported fields appear: check kaapi_params for "effort" and
"summary" and call the module logger's warning (e.g., logger.warning) to notify
users those params are ignored; preserve the existing handling for
thinking_level -> google_params["thinking_config"] and output_schema ->
google_params["output_schema"] (using _convert_json_schema_to_google) unchanged.
- Around line 95-108: The OpenAI mapper must validate the required model early:
check kaapi_params.get("model") (variable model) and if it's missing/None
immediately return the same error behavior used in the Google mapper (i.e.,
surface a validation error rather than continuing); do this before calling
litellm.supports_reasoning(f"openai/{model}") so you never call
supports_reasoning with None and so model will be present when building
openai_params. Locate the OpenAI mapper's use of model, support_reasoning, and
openai_params and add that early guard/return to mirror the Google mapper's
validation logic (a sketch follows after this list).

In `@backend/app/models/llm/request.py`:
- Around line 38-43: Update the Field description for the summary attribute in
the request model to remove the obsolete "null/" wording; change the text that
currently reads "Use null/None to disable." to something like "Use None to
disable." so callers aren’t misled into passing the string "null" — locate the
summary: Literal["auto", "detailed", "concise"] | None = Field(...) definition
and edit only the description string accordingly.
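
A sketch of the early guard for the OpenAI mapper mentioned above; the (params, errors, warnings) return shape is an assumption and should mirror whatever the Google mapper actually returns:

    model = kaapi_params.get("model")
    if not model:
        # Fail fast like the Google mapper instead of calling
        # litellm.supports_reasoning with None.
        errors.append("'model' is required for OpenAI assessment configs")
        return None, errors, warnings

    support_reasoning = litellm.supports_reasoning(f"openai/{model}")
    openai_params = {"model": model}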


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 259f6741-a7b6-438c-973c-4239e35527ce

📥 Commits

Reviewing files that changed from the base of the PR and between 0c04860 and 593eea3.

📒 Files selected for processing (7)
  • backend/app/assessment/crud.py
  • backend/app/assessment/events.py
  • backend/app/assessment/mappers.py
  • backend/app/assessment/routes.py
  • backend/app/models/__init__.py
  • backend/app/models/llm/request.py
  • backend/app/tests/api/routes/test_cron.py
✅ Files skipped from review due to trivial changes (2)
  • backend/app/models/__init__.py
  • backend/app/assessment/routes.py

Comment thread backend/app/assessment/crud.py Outdated
Comment thread backend/app/assessment/mappers.py Outdated
Comment thread backend/app/services/assessment/mappers.py
Comment thread backend/app/models/llm/request.py
@vprashrex vprashrex changed the title Feature/assessment Assessment: Ai Assessment Pipeline Apr 28, 2026
@vprashrex vprashrex self-assigned this Apr 28, 2026
@vprashrex vprashrex added the enhancement New feature or request label Apr 28, 2026
@vprashrex vprashrex moved this to In Review in Kaapi-dev Apr 28, 2026
@vprashrex vprashrex linked an issue Apr 28, 2026 that may be closed by this pull request
Comment thread backend/app/services/assessment/service.py
Comment thread backend/app/tests/api/routes/test_cron.py
Comment thread backend/app/assessment/mappers.py Outdated
Comment thread backend/app/models/assessment.py
Comment thread backend/app/assessment/models.py Outdated
Comment thread backend/app/assessment/batch.py Outdated
Comment thread backend/app/assessment/routes.py Outdated
Comment thread backend/app/alembic/versions/055_add_assessment_manager_table.py Outdated
…nctionality with new routes and models

refactor: Clean up and optimize assessment-related code and imports
test: Add tests for evaluation cron job and assessment polling
chore: Update dependencies in pyproject.toml and uv.lock
@vprashrex vprashrex requested a review from AkhileshNegi April 29, 2026 02:21
vprashrex added 5 commits May 4, 2026 10:18
- Updated assessment service to reject configs not tagged for assessment use.
- Enhanced ConfigCrud to handle default tags and added methods for tag scope filtering.
- Modified DocumentCrud to support tag filtering in read operations.
- Adjusted models to ensure default tags are applied where necessary.
- Added tests to verify behavior for default and explicit tag scenarios in config and document CRUD operations.
- Improved error handling and messaging for invalid config tags in assessments.
@vprashrex vprashrex requested review from Ayush8923 and kartpop and removed request for Ayush8923 May 4, 2026 11:44
Comment on lines +1 to +4
"""add tag column to config table

Revision ID: 056
Revises: 055
Collaborator

do we need two migrations?

Comment on lines +174 to +178
if not assessment:
raise HTTPException(
status_code=404,
detail=f"Assessment {assessment_id} not found or not accessible",
)
Collaborator

this should be inside get_assessment_by_id so we don't have to repeat this check in other places
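
A minimal sketch of pushing the check into the CRUD helper (parameter types and the exact signature are assumptions):

    from fastapi import HTTPException
    from sqlmodel import Session

    def get_assessment_by_id(session: Session, assessment_id: int, project_id: int) -> Assessment:
        assessment = session.get(Assessment, assessment_id)
        if not assessment or assessment.project_id != project_id:
            raise HTTPException(
                status_code=404,
                detail=f"Assessment {assessment_id} not found or not accessible",
            )
        return assessment

Callers can then drop the repeated "if not assessment: raise HTTPException(...)" branch and rely on the helper.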

response_model=APIResponse[AssessmentDatasetResponse],
dependencies=[Depends(require_permission(Permission.REQUIRE_PROJECT))],
)
async def upload_dataset(
Collaborator

While this is inside the assessment folder, can we name it upload_assessment_dataset for better readability? upload_dataset can be confused with the evaluation dataset upload. We should follow a similar convention there too so there's no confusion. Not something to pick up immediately, just food for thought.

Comment on lines +120 to +122
f
for f in AssessmentExportRow.model_fields.keys()
if f not in ("output", "input_data")
Collaborator

replace f with a more descriptive variable name for readability

Comment on lines +152 to +173
for k in parsed:
if k not in seen_keys:
seen_keys[k] = None
output_keys.append(k)

if not output_keys:
# Keep original layout with output as a single column
fieldnames = input_col_names + list(AssessmentExportRow.model_fields.keys())
fieldnames = [f for f in fieldnames if f != "input_data"]
return row_payload, fieldnames

# Build expanded rows
expanded: list[dict[str, Any]] = []
for row, parsed in zip(row_payload, parsed_outputs, strict=True):
new_row = {k: v for k, v in row.items() if k != "output"}
if parsed:
for k in output_keys:
new_row[k] = parsed.get(k)
else:
for k in output_keys:
new_row[k] = None
if row.get("output") is not None:
Collaborator

lots of single-letter variables here, which aren't clean to read

Comment thread backend/app/crud/assessment/batch.py Outdated

logger = logging.getLogger(__name__)

_IMAGE_MIME_BY_EXT = {
Collaborator

This handles MIME type detection, base64 decoding, Google Drive URL normalization, and data-URL parsing; that shouldn't live in CRUD, so maybe move it to services/assessment/utils/.
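
Hypothetical shape of that move, e.g. a services/assessment/utils/attachments.py module (file name and function names are illustrative):

    import mimetypes

    _IMAGE_MIME_BY_EXT = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".webp": "image/webp"}

    def detect_image_mime(filename: str) -> str | None:
        """Best-effort MIME detection by extension, falling back to mimetypes."""
        dot = filename.rfind(".")
        ext = filename[dot:].lower() if dot != -1 else ""
        return _IMAGE_MIME_BY_EXT.get(ext) or mimetypes.guess_type(filename)[0]

    # base64 decoding, data-URL parsing and Google Drive URL normalization would
    # move here too, leaving the CRUD module to database access only.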

Comment on lines +83 to +89
["organization_id"],
["organization.id"],
ondelete="CASCADE",
),
sa.ForeignKeyConstraint(
["project_id"],
["project.id"],
Collaborator

the name argument is missing here; it is provided on line 79 above
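
For reference, the named form (the constraint names here are illustrative, and ondelete on the project FK is assumed to match the others):

    sa.ForeignKeyConstraint(
        ["organization_id"],
        ["organization.id"],
        name="fk_assessment_organization_id",
        ondelete="CASCADE",
    ),
    sa.ForeignKeyConstraint(
        ["project_id"],
        ["project.id"],
        name="fk_assessment_project_id",
        ondelete="CASCADE",
    ),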

return sanitized or "assessment_results"


def _expand_input_columns(
Collaborator

add type hints in this file

Comment on lines +52 to +53
for k in input_data:
if k not in seen_keys:
Collaborator

use better variable names

Comment on lines +64 to +68
for k in input_keys:
col = f"input_{k}" if k in reserved_fields else k
key_map[k] = col

collisions = {k: v for k, v in key_map.items() if k != v}
Collaborator

better variable names

default=None,
description="Reasoning configuration or instructions",
)
effort: Literal["none", "minimal", "low", "medium", "high", "xhigh"] | None = Field(
Collaborator Author

It defaults to None, so it won't have much impact on the regular LLM call.

vprashrex added 4 commits May 5, 2026 12:55
… safety

- Added HTTPException raises for not found cases in `get_assessment_by_id`, `get_assessment_run_by_id`, and `get_assessment_dataset_by_id` functions to improve error handling.
- Updated return types for several functions to ensure they always return the expected model instead of None.
- Refactored logging statements in assessment evaluation polling to use parameterized logging for better readability and performance.
- Introduced utility functions for attachment resolution, including MIME type detection and URL normalization.
- Improved the handling of assessment run statuses and batch processing, ensuring proper error logging and status updates.
- Enhanced test coverage for CRUD operations and mappers, ensuring that exceptions are raised and handled correctly.
- Updated assessment models to use stricter type definitions for status fields and optional attributes.
@vprashrex vprashrex requested a review from AkhileshNegi May 5, 2026 07:39
Comment thread backend/app/crud/assessment/dataset.py Outdated
)
raise HTTPException(
status_code=500,
detail=f"Failed to save assessment dataset metadata: {e}",
Collaborator

In the API response, we should return a human-readable error without leaking internal information about the schema and constraints.
This is the error this code currently produces:

 {
    "success": false,
    "error": {
      "message": "Failed to save assessment dataset metadata: (psycopg2.errors.NotNullViolation) null value in column \"organization_id\" of relation \"evaluation_dataset\"
  violates not-null constraint\nDETAIL:  Failing row contains (null, 'My Dataset', 'description', 'ASSESSMENT', '{}', null, null, ...).\n[SQL: INSERT INTO evaluation_dataset
   (name, description, type, dataset_metadata, ...) VALUES (%(name)s, %(description)s, ...)]\n[parameters: {'name': 'My Dataset', ...}]"
    }
  }

Instead, log this information and only return a human-readable error in the API response.
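
A sketch of that pattern (the function name in the log prefix is illustrative):

    from fastapi import HTTPException
    from sqlalchemy.exc import SQLAlchemyError

    try:
        session.add(dataset)
        session.commit()
    except SQLAlchemyError as e:
        session.rollback()
        # Keep the constraint/SQL details in the server log only.
        logger.exception("[create_assessment_dataset] Failed to save dataset metadata: %s", e)
        raise HTTPException(
            status_code=500,
            detail="Failed to save assessment dataset. Please try again or contact support.",
        )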

@vprashrex vprashrex merged commit 5f8c4c3 into main May 6, 2026
3 checks passed
@vprashrex vprashrex deleted the feature/assessment branch May 6, 2026 10:00
@github-project-automation github-project-automation Bot moved this from In Review to Closed in Kaapi-dev May 6, 2026

Labels

enhancement New feature or request ready-for-review

Projects

Status: Closed

Development

Successfully merging this pull request may close these issues.

Assessment: AI assessment pipeline

3 participants