[feat] add lance#3564

Open
morningman wants to merge 1 commit into apache:master from morningman:v20260421
Conversation

@morningman morningman commented Apr 22, 2026

Versions

  • dev
  • 4.x
  • 3.x
  • 2.1

Languages

  • Chinese
  • English

Docs Checklist

  • Checked by AI
  • Test Cases Built

## Limitations

- **TVF only**: Only the `s3` and `local` TVFs are supported. `CREATE CATALOG` is not supported yet.
- **Single data file per glob**: The `file_path` / `uri` must match exactly one `.lance` data file per dataset. If a glob matches multiple `.lance` files within the same multi-fragment dataset, each scan range will reopen the full dataset and produce duplicate rows.

The merged code (`be/src/format/lance/lance_rust_reader.cpp:89-108`) already extracts `fragment_file` from the scan-range path and passes it to the Rust side, which filters to exactly that fragment. Glob-matching multiple `.lance` files in a multi-fragment dataset therefore works correctly, with no duplication. I verified this end-to-end locally: `multi.lance/data/*.lance` returned 15 rows (correct), not 45.
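
To make the path handling concrete, here is a minimal Python sketch of splitting a scan-range path into a dataset root and a fragment file name. It is not the code in `lance_rust_reader.cpp`; the function name `split_scan_range_path`, the `ScanTarget` type, and the file name `frag-0.lance` are hypothetical, and it assumes fragment files sit directly under the dataset's `data/` directory.

```python
from typing import NamedTuple


class ScanTarget(NamedTuple):
    dataset_root: str   # path ending in ".lance" that holds the dataset
    fragment_file: str  # file name of the single fragment to read


def split_scan_range_path(path: str) -> ScanTarget:
    """Split '<root>.lance/data/<fragment>.lance' into root and fragment name.

    Assumption: fragment files live directly inside the dataset's 'data/'
    subdirectory, as the docs under review describe.
    """
    root, sep, fragment = path.rpartition("/data/")
    if not sep or not root.endswith(".lance") or not fragment.endswith(".lance"):
        raise ValueError(f"not a Lance fragment path: {path!r}")
    if "/" in fragment:
        raise ValueError(f"fragment must sit directly under data/: {path!r}")
    return ScanTarget(dataset_root=root, fragment_file=fragment)


# Hypothetical fragment name; real datasets use generated names.
target = split_scan_range_path("s3://bucket/path/to/large.lance/data/frag-0.lance")
print(target.dataset_root)   # s3://bucket/path/to/large.lance
print(target.fragment_file)  # frag-0.lance
```

With the root and the fragment name separated like this, a reader can open the dataset once per scan range and restrict the read to the named fragment.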

└── ...
```

When querying via TVF, the `uri` / `file_path` must point to a single `.lance` data file inside the `data/` subdirectory of the dataset. Doris automatically resolves the dataset root from this path and reads all fragments belonging to the dataset.

But the Limitations section says the path must match exactly one `.lance` file, and that a glob over several files makes each scan range reopen the full dataset and return duplicate rows. These can't both be true. With the merged `fragment_file` logic, the accurate statement is: each scan range reads exactly one fragment; a glob over `data/*.lance` assigns one fragment to each scan range, producing the full dataset.
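
The difference between the two behaviors can be shown with a toy simulation. This is only a sketch under assumed numbers: three fragments of five rows each, chosen to reproduce the 15-versus-45 row counts reported earlier in this review. `FRAGMENTS` and `read_scan_range` are hypothetical names, not Doris or Lance APIs.

```python
# Toy dataset: 3 fragments of 5 rows each (15 rows total). Sizes are made up.
FRAGMENTS = {
    "frag-0.lance": list(range(0, 5)),
    "frag-1.lance": list(range(5, 10)),
    "frag-2.lance": list(range(10, 15)),
}


def read_scan_range(fragment_file=None):
    """Return the row ids that one scan range reads.

    With fragment_file=None the whole dataset is reopened (the behavior the
    Limitations text describes); otherwise only the named fragment is read.
    """
    if fragment_file is None:
        return [row for rows in FRAGMENTS.values() for row in rows]
    return list(FRAGMENTS[fragment_file])


# A glob over data/*.lance yields one scan range per matched file.
scan_ranges = sorted(FRAGMENTS)

unfiltered = [row for _ in scan_ranges for row in read_scan_range()]
filtered = [row for name in scan_ranges for row in read_scan_range(name)]

print(len(unfiltered))  # 45: every range re-reads all 15 rows
print(len(filtered))    # 15: each range reads only its own fragment
```

Only the filtered variant returns each row exactly once, which is why the docs should describe the glob as the way to read the full dataset.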

) ORDER BY id LIMIT 10;
```

### Aggregation over a Multi-Fragment Dataset

SELECT count(*), min(id), max(id) FROM s3(
"uri" = "s3://bucket/path/to/large.lance/data/fragment.lance",
...

`data/fragment.lance` points at one specific file, so this reads one fragment, not the whole dataset. To show multi-fragment aggregation, use a glob:

"uri" = "s3://bucket/path/to/large.lance/data/*.lance",

The same applies to the local example: real Lance datasets have UUID-named fragment files, so pointing at one by name is fragile.
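
As a rough illustration of why the glob is more robust than a fixed file name, the sketch below matches both patterns against a made-up directory listing. The UUID-style names are invented, and Python's `fnmatchcase` stands in for Doris's own glob matching, which may differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical listing of a dataset directory; the names are invented
# stand-ins for the generated names real datasets use.
listing = [
    "large.lance/data/0b1e6c1e-4a7d-4a55-9d6b-2f1c3a9e7f10.lance",
    "large.lance/data/5c2d9a34-8e1f-4b0a-a6c7-91d3e4f5a6b7.lance",
    "large.lance/data/f3a7b8c9-1d2e-4f60-8a9b-0c1d2e3f4a5b.lance",
    "large.lance/_versions/1.manifest",
]

glob_pattern = "large.lance/data/*.lance"
fixed_name = "large.lance/data/fragment.lance"

glob_hits = [p for p in listing if fnmatchcase(p, glob_pattern)]
fixed_hits = [p for p in listing if fnmatchcase(p, fixed_name)]

print(len(glob_hits))   # 3: every fragment file, but not the manifest
print(len(fixed_hits))  # 0: no file is literally named fragment.lance
```

The glob picks up every fragment regardless of its generated name, while the hard-coded name only works if the reader happens to know one.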
