Wildvine API — What Changed

Optimization deep-dive: what was done in the code, with snippets. Generated 2026-05-28.

TL;DR

~6.8s → 0.8s: filtered /v3/trends (the slow case), including ~200 ms network

2–6s → 0.1–0.4s: other v3/BDS endpoints, end-to-end

~0.6s exec: filtered Trend in-Lambda after the rewrite

byte-identical: verified vs pandas (modulo the inherent ±RM0.01 price float-order)

The lever

Latency was dominated by two costs: Athena's ~1–2s per-query floor plus pulling 215k raw rows for Trend, followed by heavy single-threaded pandas work (enrich, cross-year census, aggregation). The fix: replace Athena with in-process DuckDB over a Parquet copy baked into the Lambda image, then run the entire Trend pipeline, census included, in DuckDB SQL (multi-threaded C++). The Lambda now gets back only tens of rows from the query and does almost no pandas.

1. `src/query.py` — Athena tuning + new DuckDB engine

Tuned the remaining Athena calls (most of the surface area moved off Athena entirely) and added query_gus(), which runs SQL in-process against the bundled Parquet:

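import pathlib
import threading

import awswrangler as wr
import duckdb

# ATHENA_DATABASE, ATHENA_S3_OUTPUT and ATHENA_RESULT_REUSE_MINUTES are module-level config (not shown).
def query_athena(query_name, sql_query):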
    return wr.athena.read_sql_query(
        sql_query.get_sql(), ctas_approach=False, database=ATHENA_DATABASE,
        s3_output='{}/{}'.format(ATHENA_S3_OUTPUT, query_name),
        athena_query_wait_polling_delay=0.25,                 # was the 1.0s default
        result_reuse_configuration={"ResultReuseByAgeConfiguration":
            {"Enabled": True, "MaxAgeInMinutes": ATHENA_RESULT_REUSE_MINUTES}},
    )

GUS_TABLE = 'dataset_gus_1_new_2ha'
GUS_PARQUET_DIR = pathlib.Path(__file__).parent / 'data' / 'dataset_gus'
_duck_con = None               # created lazily, then shared for the life of the process
_duck_lock = threading.Lock()  # guards first-time initialization

def _gus_connection():
    global _duck_con
    if _duck_con is None:
        with _duck_lock:
            if _duck_con is None:
                con = duckdb.connect(database=':memory:')
                glob = str(GUS_PARQUET_DIR / '**' / '*.parquet')
                con.execute(f'CREATE VIEW "{GUS_TABLE}" AS '
                            f"SELECT * FROM read_parquet('{glob}', hive_partitioning=true)")
                _duck_con = con
    return _duck_con

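def query_gus(sql_query):
    # Accept either a query-builder object (anything with get_sql()) or a raw SQL string.
    sql = sql_query.get_sql() if hasattr(sql_query, 'get_sql') else sql_query
    # A fresh cursor per call keeps concurrent callers off the shared connection object.
    return _gus_connection().cursor().execute(sql).df()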
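
The lazy connection above relies on double-checked locking: an unlocked check for the common case, then a second check under the lock so only one thread pays the setup cost. The following is a minimal sketch of that pattern in isolation. `_expensive_connect` and `init_calls` are hypothetical stand-ins (no DuckDB involved), added only to make the behaviour observable.

```python
import threading

_con = None
_lock = threading.Lock()
init_calls = 0  # hypothetical counter, only here to observe how often setup runs


def _expensive_connect():
    """Stand-in for the costly one-time setup (assumption: connect + CREATE VIEW)."""
    global init_calls
    init_calls += 1
    return object()


def get_connection():
    """Return the shared connection, creating it at most once across threads."""
    global _con
    if _con is None:            # fast path: no lock needed once initialized
        with _lock:
            if _con is None:    # re-check: another thread may have won the race
                _con = _expensive_connect()
    return _con


# Eight threads ask for the connection at roughly the same time.
results = []
threads = [threading.Thread(target=lambda: results.append(get_connection()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the threads finish, every entry in `results` is the same object and the setup function has run once, which is the property a warm Lambda container wants from its in-memory database handle.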

2. `src/v3/repository.py`

(a) Swapped Athena → DuckDB at every v3/BDS call site (7 + 2)

# before
df = prepare_dataframe(query_athena("v3-trends", q))
# after
df = prepare_dataframe(query_gus(q))

(b) Vectorized Trend aggregation + deterministic sort

Replaced ~1,100 re-filter passes (_metric per metric per group) with one group pass (_metric_from + _aggregate_trends), and added a stable sort so float sums are reproducible across separate fetches:

work = df.copy()
for c in cast_cols:
    work[c] = work[c].fillna(0).astype(float)
# Deterministic row order so float summation is reproducible across
# separate Athena fetches (result-page order is not guaranteed).
sort_keys = [c for c in ('year','dbh_size_class','species','tag') if c in work.columns]
if sort_keys:
    work = work.sort_values(sort_keys, kind='stable').reset_index(drop=True)
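
Why the sort matters: floating-point addition is not associative, so the same rows summed in a different arrival order can give a different total. The sketch below shows the effect in plain Python rather than pandas, with made-up values chosen to exaggerate the rounding. The row layout `(year, tag, value)` is an assumption for illustration.

```python
def running_sum(values):
    """Plain left-to-right float accumulation, no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same four rows, delivered in two different orders (made-up data).
fetch_a = [(1986, "T1", 1e16), (1986, "T2", 1.0), (1986, "T3", -1e16), (1986, "T4", 1.0)]
fetch_b = [fetch_a[1], fetch_a[3], fetch_a[0], fetch_a[2]]

raw_a = running_sum(v for _, _, v in fetch_a)
raw_b = running_sum(v for _, _, v in fetch_b)


def sort_key(row):
    """Order by (year, tag), mirroring the idea of sorting on stable row keys."""
    return (row[0], row[1])


sorted_a = running_sum(v for _, _, v in sorted(fetch_a, key=sort_key))
sorted_b = running_sum(v for _, _, v in sorted(fetch_b, key=sort_key))
```

Here `raw_a` and `raw_b` differ even though the inputs contain the same rows, while `sorted_a` and `sorted_b` agree. Real measurements drift by far less than this toy data, which is consistent with the ±RM0.01 price variation noted elsewhere in this report.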

(c) Three-tier dispatch: cache → SQL fast path → pandas fallback

def get_trends(view_by, dbh_size_classes, location_ids, ...):
    if _trend_cache_eligible(...):                       # unfiltered → DynamoDB
        cached = _serve_trends_from_cache(...)
        if cached is not None: return cached
    return _get_trends_live(...)

def _get_trends_live(view_by, dbh_size_classes, location_ids, ...):
    if location_ids:                                     # SQL fast path
        total_ha = sum(LOCATION_AREA.get(int(l), 0) for l in location_ids)
        if total_ha > 0:
            res = trends_sql.compute(..., V3_METRICS, float(total_ha), _lookup_species_ids_by_name)
            if res is not None: return res
    # ---- pandas fallback (for dup-tag locations) ----
    q = _query(location_ids, None, species_id, dbh_size_classes, skip_year_filter=True)
    ...
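
The dispatch above depends on one convention: a tier declines by returning None, and the next tier is tried. A minimal sketch of that convention, using hypothetical stand-in functions rather than the real cache, SQL, and pandas code paths:

```python
from typing import Callable, Optional

Tier = Callable[[], Optional[dict]]


def first_non_none(tiers: list[tuple[str, Tier]]) -> tuple[str, Optional[dict]]:
    """Run tiers in order; a tier declines by returning None."""
    for name, tier in tiers:
        result = tier()
        if result is not None:
            return name, result
    return "none", None


# Hypothetical stand-ins for the three tiers.
def cache_miss() -> Optional[dict]:
    return None                    # e.g. a filtered request that is not cacheable


def sql_declines() -> Optional[dict]:
    return None                    # e.g. the SQL path refuses the input


def sql_ok() -> Optional[dict]:
    return {"source": "sql"}


def pandas_fallback() -> Optional[dict]:
    return {"source": "pandas"}    # the last tier always answers


served_by_sql = first_non_none(
    [("cache", cache_miss), ("sql", sql_ok), ("pandas", pandas_fallback)])
served_by_pandas = first_non_none(
    [("cache", cache_miss), ("sql", sql_declines), ("pandas", pandas_fallback)])
```

Keeping the slow path as the final tier means a declined fast path costs latency but never correctness.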

(d) Flora — deterministic `ORDER BY` so pagination is stable + engine-independent

q = q.orderby(t.location_id, t.tag, t.year)
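
To illustrate why an explicit ordering makes LIMIT/OFFSET pagination stable, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in engine (not DuckDB or Athena), and the `flora` table and its rows are made up. The ordering is only total when (location_id, tag, year) is unique per row; duplicate keys would still tie.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flora (location_id INTEGER, tag TEXT, year INTEGER)")
rows = [(2, "B7", 1990), (1, "A1", 1986), (2, "B7", 1986), (1, "A2", 1986), (1, "A1", 1990)]
con.executemany("INSERT INTO flora VALUES (?, ?, ?)", rows)


def page(limit, offset):
    """Fetch one page using the same ORDER BY on every request."""
    return con.execute(
        "SELECT location_id, tag, year FROM flora "
        "ORDER BY location_id, tag, year LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


# Three consecutive pages of size 2 should cover every row exactly once.
pages = page(2, 0) + page(2, 2) + page(2, 4)
```

Without the ORDER BY, an engine is free to return rows in any order per request, so consecutive pages could overlap or skip rows.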

3. `src/v3/trends_sql.py` (new) — the entire Trend pipeline in SQL

The pipeline stages all run in DuckDB SQL: enrich (CSV join), price (CASE on wood group and DBH tier), census (window functions and self-joins), and aggregation (GROUP BY GROUPING SETS). The Lambda gets back ~tens of rows, so the hot path does almost no pandas.

The non-trivial part is the per-tree census, computed in SQL via a self-join to the immediately preceding census year, with the rate columns broadcast per (loc, yr):

-- preceding census year per location (LAG over the DISTINCT (loc, yr))
yrs  AS (SELECT loc, yr, LAG(yr) OVER (PARTITION BY loc ORDER BY yr) pyear
         FROM (SELECT DISTINCT loc, yr FROM base)),

-- self-join curr → prev by (loc, tag, pyear): matched = survivor
j    AS (SELECT c.*, p.dbh p_data, p.basal_area p_ba, p.volume p_vol,
                CASE WHEN p.tag IS NOT NULL THEN 1 ELSE 0 END surv
         FROM cur c LEFT JOIN base p
              ON p.loc=c.loc AND p.tag=c.tag AND p.yr=c.pyear),

-- per-row census columns; rates broadcast per (loc, yr)
rows AS (SELECT ...,
           CASE WHEN surv=1 THEN dbh - p_data ELSE 0.0 END growth_rate,
           CASE WHEN pyear IS NULL THEN 0.0
                ELSE CAST(n_recruit AS DOUBLE)/n_curr END recruitment_rate,
           CASE WHEN pyear IS NULL THEN 0.0
                ELSE CAST(pc.n_prev - n_surv AS DOUBLE)/pc.n_prev END mortality_rate,
           ...)

-- final: both views + per-year 'sum', flagged by GROUPING() (not a NULL key)
SELECT yr, {gcol} AS grp, GROUPING({gcol}) AS is_sum,
       COUNT(*) n, COUNT(DISTINCT species) nuniq,
       SUM(...), AVG(mortality_rate), ...
FROM rows GROUP BY GROUPING SETS ((yr, {gcol}), (yr))
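
A reduced, runnable version of the LAG plus self-join idea, as a sketch only. It uses Python's built-in sqlite3 instead of DuckDB, toy data, and an assumed definition of the `cur` CTE (base rows joined to their preceding census year), since the original definition is not shown. It computes survivors and summed growth per year and leaves out pricing, rates, and GROUPING SETS.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE base (loc INTEGER, yr INTEGER, tag TEXT, dbh REAL)")
con.executemany(
    "INSERT INTO base VALUES (?, ?, ?, ?)",
    [
        (1, 1986, "A", 10.0), (1, 1986, "B", 20.0), (1, 1986, "C", 30.0),
        (1, 1990, "A", 12.0), (1, 1990, "B", 23.0), (1, 1990, "D", 5.0),
    ],
)

CENSUS_SQL = """
WITH yrs AS (
    -- preceding census year per location
    SELECT loc, yr, LAG(yr) OVER (PARTITION BY loc ORDER BY yr) AS pyear
    FROM (SELECT DISTINCT loc, yr FROM base)
),
cur AS (
    -- assumed shape: every row tagged with its location's preceding year
    SELECT b.*, y.pyear
    FROM base b JOIN yrs y ON y.loc = b.loc AND y.yr = b.yr
),
j AS (
    -- a match in the preceding year means the tree survived
    SELECT c.yr,
           CASE WHEN p.tag IS NOT NULL THEN 1 ELSE 0 END AS surv,
           CASE WHEN p.tag IS NOT NULL THEN c.dbh - p.dbh ELSE 0.0 END AS growth
    FROM cur c LEFT JOIN base p
         ON p.loc = c.loc AND p.tag = c.tag AND p.yr = c.pyear
)
SELECT yr, COUNT(*) AS n_curr, SUM(surv) AS n_surv, SUM(growth) AS growth
FROM j GROUP BY yr ORDER BY yr
"""

census = con.execute(CENSUS_SQL).fetchall()
```

In this toy data, tree C disappears after 1986 and tree D is new in 1990, so 1990 has 2 survivors among 3 current trees. The first census year has no preceding year, so nothing matches and its survivor count is 0.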

The Python reshape that consumes the GROUPING SETS result identifies the sum row by is_sum = 1 (the bug fix) and skips real-NULL groups, matching pandas' dropna behavior:

for _, r in df.iterrows():
    y = int(r['yr'])
    if int(r['is_sum']) == 1:
        sum_by_year[y] = r
    else:
        g = r['grp']
        if g is None or (isinstance(g, float) and pd.isna(g)):
            continue                          # real-NULL group dropped, like pandas dropna
        grp_by_year.setdefault(y, {})[g] = r

A gate at the top of compute() returns None, which triggers the pandas fallback, when the requested locations contain duplicate (tag, year) rows: the SQL census can't reproduce pandas' arbitrary, order-dependent behavior in that edge case. loc2 (the slow 50 ha plot) is clean, so it takes the SQL path; loc1 falls back to pandas, which is already fast at 2 ha.
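
One way such a gate could look, as a sketch under stated assumptions: rows are plain dicts, duplicates are detected on the key (loc, tag, yr), and the check runs in Python. The helper name `has_duplicate_tag_year` and the placeholder return value are hypothetical; the real gate may well run as a SQL query instead.

```python
from collections import Counter


def has_duplicate_tag_year(rows):
    """True if any (loc, tag, yr) key appears more than once."""
    counts = Counter((r["loc"], r["tag"], r["yr"]) for r in rows)
    return any(n > 1 for n in counts.values())


def compute(rows):
    """Return None to decline, so the caller can fall back to another path."""
    if has_duplicate_tag_year(rows):
        return None
    return {"n": len(rows)}  # placeholder for the real pipeline result


clean = [
    {"loc": 2, "tag": "A", "yr": 1986},
    {"loc": 2, "tag": "A", "yr": 1990},
]
dirty = clean + [{"loc": 2, "tag": "A", "yr": 1990}]  # same tree recorded twice in one year
```

With these inputs, `compute(clean)` returns a result and `compute(dirty)` returns None.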

4. Supporting

File	Change
`src/models.py`	New `TrendCache` Dyntastic model (`wildvine_trend_cache`)
`src/v3/trend_precompute.py`	(new) precomputes the unfiltered cache
`src/requirements.txt`	added `duckdb`
`template.yaml`	added `DynamoDBReadPolicy` for `wildvine_trend_cache` (memory left at 3008 MB, the account cap)
`scripts/refresh_gus_parquet.sh`	(new) re-pull Parquet + redeploy on data refresh
`src/data/dataset_gus/`	4.6 MB Parquet baked into the image

The bug we hit (and fixed)

The first production rollout returned tree_count sum = 0 for filtered Trend. Root cause: for some result shapes, the NULL group key of the GROUPING SETS super-aggregate row deserialized as NaN under the Lambda's Python 3.12/pyarrow, so the reshape couldn't find the sum row. Per-group values were correct; only the sum broke. Fix: identify the sum row with GROUPING() instead of relying on the NULL key, which is robust across environments. Verified live: tree_count sum for 1986 = 40329 ✓.
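
A small sketch of how this class of bug can arise. The exact pre-fix lookup is not shown in this report, so `find_sum_by_null_key` is only one plausible form of a NULL-keyed lookup, and the rows are made up. The point is that None and NaN are different values in Python, and NaN does not even equal itself.

```python
def find_sum_by_null_key(rows):
    """Fragile: assumes the super-aggregate row's key arrives as None."""
    return next((r for r in rows if r["grp"] is None), None)


def find_sum_by_flag(rows):
    """Robust: relies on the explicit GROUPING() flag column."""
    return next((r for r in rows if int(r["is_sum"]) == 1), None)


# The same logical result set, with the NULL key deserialized two different ways.
rows_none = [{"grp": "A", "is_sum": 0, "n": 10}, {"grp": None, "is_sum": 1, "n": 25}]
rows_nan = [{"grp": "A", "is_sum": 0, "n": 10}, {"grp": float("nan"), "is_sum": 1, "n": 25}]

null_key_on_none = find_sum_by_null_key(rows_none)  # finds the sum row
null_key_on_nan = find_sum_by_null_key(rows_nan)    # misses it
flag_on_none = find_sum_by_flag(rows_none)
flag_on_nan = find_sum_by_flag(rows_nan)
```

The flag-based lookup finds the sum row in both cases, while the key-based lookup only works when the NULL happens to arrive as None.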

Final live latency (deployed Lambda, warm round-trip incl. ~200 ms network)

Endpoint	Before	Now
`/v3/trends` filtered (frontend's case)	~6.8s	0.80s
`/v3/trends` unfiltered	~6.8s	~0.7s (cache)
`/v3/species-breakdown`	~2s	0.35s
`/v3/overview`	~4–6s	0.38s
`/v3/bds/simulation`	~2s	0.14s
`/v3/flora`	~2s	0.13s

Deployed live at api.wildvine.kotaksakti.com (Lambda wildvine-api, ap-southeast-1). Output verified byte-identical to the pandas baseline across 14+ endpoint cases (modulo the inherent ±RM0.01 price float-order, which already varied in the pre-existing system).
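
A comparison of this kind can be sketched as a recursive payload diff that demands exact equality everywhere except on price fields, where a ±0.01 tolerance applies. Everything below is hypothetical: the function names, the `price` key, and the sample payloads are illustrative and are not taken from the actual verification harness.

```python
def payloads_match(a, b, price_keys=("price",), tol=0.01):
    """Exact structural equality, except numeric price fields may differ by tol."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            _values_match(k, a[k], b[k], price_keys, tol) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            payloads_match(x, y, price_keys, tol) for x, y in zip(a, b))
    return a == b


def _values_match(key, x, y, price_keys, tol):
    """Apply the tolerance only to numeric values stored under a price key."""
    numeric = isinstance(x, (int, float)) and isinstance(y, (int, float))
    if key in price_keys and numeric:
        return abs(x - y) <= tol
    return payloads_match(x, y, price_keys, tol)


baseline = {"year": 1986, "groups": [{"tree_count": 100, "price": 1234.56}]}
within_tol = {"year": 1986, "groups": [{"tree_count": 100, "price": 1234.565}]}
price_off = {"year": 1986, "groups": [{"tree_count": 100, "price": 1234.61}]}
count_off = {"year": 1986, "groups": [{"tree_count": 101, "price": 1234.56}]}
```

Only `within_tol` matches the baseline: a price drift of half a cent passes, while a five-cent drift or any change to a count fails.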
