/v3/trends (the slow case), incl. ~200 ms network

Latency was dominated by Athena's ~1–2 s per-query floor and by pulling 215k raw rows for Trend, followed by heavy single-threaded pandas work (enrich + cross-year census + aggregation). Fix: replace Athena with in-process DuckDB over a Parquet copy baked into the Lambda image, then run the entire Trend pipeline, including the census, in DuckDB SQL (multi-threaded C++). The Lambda gets back ~tens of rows and does almost no pandas.

src/query.py: Athena tuning + new DuckDB engine

Tuned the remaining Athena calls (most surface area moved off Athena entirely) and added query_gus(), which runs SQL in-process against the bundled Parquet:

    import pathlib
    import threading

    import awswrangler as wr
    import duckdb

    # ATHENA_DATABASE, ATHENA_S3_OUTPUT and ATHENA_RESULT_REUSE_MINUTES are
    # settings defined elsewhere (not shown).

    def query_athena(query_name, sql_query):
        return wr.athena.read_sql_query(
            sql_query.get_sql(), ctas_approach=False, database=ATHENA_DATABASE,
            s3_output='{}/{}'.format(ATHENA_S3_OUTPUT, query_name),
            athena_query_wait_polling_delay=0.25,  # was the 1.0s default
            result_reuse_configuration={"ResultReuseByAgeConfiguration":
                {"Enabled": True, "MaxAgeInMinutes": ATHENA_RESULT_REUSE_MINUTES}},
        )

    GUS_TABLE = 'dataset_gus_1_new_2ha'
    GUS_PARQUET_DIR = pathlib.Path(__file__).parent / 'data' / 'dataset_gus'
    _duck_con = None
    _duck_lock = threading.Lock()

    def _gus_connection():
        # Lazily build one shared in-memory connection (double-checked locking).
        global _duck_con
        if _duck_con is None:
            with _duck_lock:
                if _duck_con is None:
                    con = duckdb.connect(database=':memory:')
                    glob = str(GUS_PARQUET_DIR / '**' / '*.parquet')
                    con.execute(f'CREATE VIEW "{GUS_TABLE}" AS '
                                f"SELECT * FROM read_parquet('{glob}', hive_partitioning=true)")
                    _duck_con = con
        return _duck_con

    def query_gus(sql_query):
        # Accepts either a query-builder object or a raw SQL string.
        sql = sql_query.get_sql() if hasattr(sql_query, 'get_sql') else sql_query
        return _gus_connection().cursor().execute(sql).df()

src/v3/repository.py

    # before
    df = prepare_dataframe(query_athena("v3-trends", q))
    # after
    df = prepare_dataframe(query_gus(q))

Replaced ~1,100 re-filter passes (_metric per metric per group) with one group pass (_metric_from + _aggregate_trends), and added a stable sort so float sums are reproducible across separate fetches:

    work = df.copy()
    for c in cast_cols:
        work[c] = work[c].fillna(0).astype(float)
    # Deterministic row order so float summation is reproducible across
    # separate Athena fetches (result-page order is not guaranteed).
    sort_keys = [c for c in ('year', 'dbh_size_class', 'species', 'tag') if c in work.columns]
    if sort_keys:
        work = work.sort_values(sort_keys, kind='stable').reset_index(drop=True)

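Why the sort matters: floating-point addition is not associative, so the same rows summed in a different order can give a different total. The sketch below shows this in plain Python with made-up rows; it does not use pandas, and `naive_sum` and `sorted_sum` are illustrative names. An explicit loop is used instead of the built-in sum(), which applies compensated summation to floats on recent Python versions and would hide the effect.

```python
def naive_sum(values):
    """Left-to-right float accumulation (what a plain running total does)."""
    total = 0.0
    for v in values:
        total += v
    return total


# Two "fetches" returning the same rows in a different order.
# Each row is (year, tag, value); the data is made up for illustration.
fetch_a = [(1986, "t1", 0.1), (1986, "t2", 0.2), (1986, "t3", 0.3)]
fetch_b = [(1986, "t3", 0.3), (1986, "t2", 0.2), (1986, "t1", 0.1)]

unsorted_a = naive_sum(v for _, _, v in fetch_a)
unsorted_b = naive_sum(v for _, _, v in fetch_b)


def sorted_sum(rows):
    """Sort by a deterministic key first, then accumulate."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # sorted() is stable
    return naive_sum(v for _, _, v in ordered)


if __name__ == "__main__":
    print(unsorted_a == unsorted_b)                    # False: order-dependent
    print(sorted_sum(fetch_a) == sorted_sum(fetch_b))  # True: same order, same sum
```

Sorting does not make the sum more accurate; it only fixes the order of accumulation so that two fetches of the same data agree with each other.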

    def get_trends(view_by, dbh_size_classes, location_ids, ...):
        if _trend_cache_eligible(...):  # unfiltered → DynamoDB
            cached = _serve_trends_from_cache(...)
            if cached is not None:
                return cached
        return _get_trends_live(...)

    def _get_trends_live(view_by, dbh_size_classes, location_ids, ...):
        if location_ids:  # SQL fast path
            total_ha = sum(LOCATION_AREA.get(int(l), 0) for l in location_ids)
            if total_ha > 0:
                res = trends_sql.compute(..., V3_METRICS, float(total_ha), _lookup_species_ids_by_name)
                if res is not None:
                    return res
        # ---- pandas fallback (for dup-tag locations) ----
        q = _query(location_ids, None, species_id, dbh_size_classes, skip_year_filter=True)
        ...

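The control flow above is a chain of strategies (cache, SQL fast path, pandas fallback) in which None means "not applicable, try the next one". A minimal sketch of that pattern follows; `first_available` and the three strategy functions are hypothetical stand-ins, and the dict results are placeholders rather than the real response shape.

```python
from typing import Callable, Optional, Sequence

Result = dict  # placeholder for the real response shape (assumption)


def first_available(strategies: Sequence[Callable[[], Optional[Result]]]) -> Result:
    """Try each strategy in order; None means "declined, try the next one"."""
    for strategy in strategies:
        result = strategy()
        if result is not None:   # explicit None check: an empty dict is still an answer
            return result
    raise RuntimeError("no strategy produced a result")


def from_cache() -> Optional[Result]:
    return None                  # e.g. filtered request: cache not eligible


def from_sql() -> Optional[Result]:
    return None                  # e.g. duplicate-row gate declined


def from_sql_empty() -> Optional[Result]:
    return {}                    # a valid but empty answer from the fast path


def from_pandas() -> Optional[Result]:
    return {"source": "pandas"}  # final fallback always answers


if __name__ == "__main__":
    print(first_available([from_cache, from_sql, from_pandas]))
    print(first_available([from_cache, from_sql_empty, from_pandas]))
```

Testing `is not None` rather than truthiness is the important detail: a legitimately empty result should be returned as-is instead of triggering the slower fallback.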
ORDER BY so pagination is stable and engine-independent:

    q = q.orderby(t.location_id, t.tag, t.year)

src/v3/trends_sql.py (new): the entire Trend pipeline in SQL

Enrich (CSV join) + price (CASE on wood-group + DBH tier) + census via window functions / self-joins + aggregation (GROUP BY GROUPING SETS). The Lambda gets back ~tens of rows; near-zero pandas on the hot path.
The non-trivial bit — per-tree census computed in SQL via a self-join to the immediately preceding census year, with broadcast rate columns:

    -- preceding census year per location (LAG over the DISTINCT (loc, yr))
    yrs AS (SELECT loc, yr, LAG(yr) OVER (PARTITION BY loc ORDER BY yr) pyear
            FROM (SELECT DISTINCT loc, yr FROM base)),
    -- self-join curr → prev by (loc, tag, pyear): matched = survivor
    j AS (SELECT c.*, p.dbh p_data, p.basal_area p_ba, p.volume p_vol,
                 CASE WHEN p.tag IS NOT NULL THEN 1 ELSE 0 END surv
          FROM cur c LEFT JOIN base p
            ON p.loc=c.loc AND p.tag=c.tag AND p.yr=c.pyear),
    -- per-row census columns; rates broadcast per (loc, yr)
    rows AS (SELECT ...,
                    CASE WHEN surv=1 THEN dbh - p_data ELSE 0.0 END growth_rate,
                    CASE WHEN pyear IS NULL THEN 0.0
                         ELSE CAST(n_recruit AS DOUBLE)/n_curr END recruitment_rate,
                    CASE WHEN pyear IS NULL THEN 0.0
                         ELSE CAST(pc.n_prev - n_surv AS DOUBLE)/pc.n_prev END mortality_rate,
             ...)
    -- final: both views + per-year 'sum', flagged by GROUPING() (not a NULL key)
    SELECT yr, {gcol} AS grp, GROUPING({gcol}) AS is_sum,
           COUNT(*) n, COUNT(DISTINCT species) nuniq,
           SUM(...), AVG(mortality_rate), ...
    FROM rows GROUP BY GROUPING SETS ((yr, {gcol}), (yr))

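To make the self-join concrete, here is a small runnable sketch of the same idea on a toy table. SQLite (via Python's sqlite3) stands in for DuckDB so the example needs no extra dependency; window functions require SQLite 3.25 or newer. The table layout, the sample trees, the definition of the `cur` CTE, and the choice to count a recruit as a current tree with no match in the previous census are all assumptions made for illustration, not the production SQL.

```python
import sqlite3

CENSUS_SQL = """
WITH yrs AS (  -- preceding census year per location
  SELECT loc, yr, LAG(yr) OVER (PARTITION BY loc ORDER BY yr) AS pyear
  FROM (SELECT DISTINCT loc, yr FROM base)
),
cur AS (       -- assumed shape: every tree row plus its preceding census year
  SELECT b.loc, b.tag, b.yr, b.dbh, y.pyear
  FROM base b JOIN yrs y ON y.loc = b.loc AND y.yr = b.yr
),
j AS (         -- self-join current -> previous: a match means the tree survived
  SELECT c.loc, c.yr, c.tag, c.dbh, c.pyear, p.dbh AS p_dbh,
         CASE WHEN p.tag IS NOT NULL THEN 1 ELSE 0 END AS surv
  FROM cur c LEFT JOIN base p
    ON p.loc = c.loc AND p.tag = c.tag AND p.yr = c.pyear
)
SELECT yr, tag, surv,
       CASE WHEN surv = 1 THEN dbh - p_dbh ELSE 0.0 END AS growth,
       CASE WHEN pyear IS NULL THEN 0.0
            ELSE CAST(n_curr - n_surv AS REAL) / n_curr END AS recruitment_rate,
       CASE WHEN pyear IS NULL THEN 0.0
            ELSE CAST(n_prev - n_surv AS REAL) / n_prev END AS mortality_rate
FROM (
  SELECT j.*,
         SUM(surv) OVER (PARTITION BY loc, yr) AS n_surv,  -- broadcast per (loc, yr)
         COUNT(*)  OVER (PARTITION BY loc, yr) AS n_curr,
         (SELECT COUNT(*) FROM base b
           WHERE b.loc = j.loc AND b.yr = j.pyear) AS n_prev
  FROM j
)
ORDER BY yr, tag
"""

# Made-up trees: (loc, tag, yr, dbh). C dies after 1986; D is recruited in 1990.
SAMPLE = [
    (1, "A", 1986, 10.0), (1, "B", 1986, 20.0), (1, "C", 1986, 30.0),
    (1, "A", 1990, 12.0), (1, "B", 1990, 21.5), (1, "D", 1990, 5.0),
]


def census(rows):
    """Load rows into an in-memory table and return the per-tree census rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE base (loc INTEGER, tag TEXT, yr INTEGER, dbh REAL)")
    con.executemany("INSERT INTO base VALUES (?, ?, ?, ?)", rows)
    out = con.execute(CENSUS_SQL).fetchall()
    con.close()
    return out


if __name__ == "__main__":
    for row in census(SAMPLE):
        print(row)
```

In the first census year pyear is NULL, so the join matches nothing and all rates are reported as 0.0. In 1990 two of three previous trees survive, which gives a mortality rate of 1/3, and one of three current trees is new, which gives a recruitment rate of 1/3 under the assumed definition.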
And the matching Python reshape — keys the sum row by is_sum=1 (the bug fix), skips real-NULL groups (matching pandas dropna):

    for _, r in df.iterrows():
        y = int(r['yr'])
        if int(r['is_sum']) == 1:
            sum_by_year[y] = r
        else:
            g = r['grp']
            if g is None or (isinstance(g, float) and pd.isna(g)):
                continue  # real-NULL group dropped, like pandas dropna
            grp_by_year.setdefault(y, {})[g] = r

A gate at the top of compute() returns None (→ pandas fallback) when the requested locations contain duplicate (tag, year) rows — the SQL census can't reproduce pandas' arbitrary order-dependent behavior in that edge case. loc2 (the slow 50 ha plot) is clean → SQL; loc1 → pandas (and it's already fast at 2 ha).
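A gate of this kind can be sketched in a few lines. The version below is a hypothetical illustration, not the code in trends_sql.py: it works on plain dict rows, keys duplicates by (location_id, tag, year) as an assumption, and the returned dict is a placeholder for the real result.

```python
from collections import Counter


def has_duplicate_tag_year(rows):
    """True if any (location_id, tag, year) key appears more than once."""
    counts = Counter((r["location_id"], r["tag"], r["year"]) for r in rows)
    return any(n > 1 for n in counts.values())


def compute(rows):
    """Hypothetical gate: return None so the caller falls back to pandas."""
    if has_duplicate_tag_year(rows):
        return None
    return {"engine": "sql", "n_rows": len(rows)}  # placeholder result


# Made-up rows: the second list repeats tag "A" within the same year.
clean = [
    {"location_id": 2, "tag": "A", "year": 1986},
    {"location_id": 2, "tag": "A", "year": 1990},
    {"location_id": 2, "tag": "B", "year": 1986},
]
dirty = clean + [{"location_id": 2, "tag": "A", "year": 1986}]

if __name__ == "__main__":
    print(compute(clean))
    print(compute(dirty))
```

Returning None instead of raising keeps the gate compatible with the `if res is not None: return res` check in the caller, so the fallback needs no special error handling.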
| File | Change |
|---|---|
| src/models.py | New TrendCache Dyntastic model (wildvine_trend_cache) |
| src/v3/trend_precompute.py | (new) precomputes the unfiltered cache |
| src/requirements.txt | added duckdb |
| template.yaml | added DynamoDBReadPolicy for wildvine_trend_cache (memory left at 3008 MB, the account cap) |
| scripts/refresh_gus_parquet.sh | (new) re-pull Parquet + redeploy on data refresh |
| src/data/dataset_gus/ | 4.6 MB Parquet baked into the image |

First production rollout returned tree_count sum = 0 for filtered Trend. Root cause: the GROUPING SETS super-aggregate row's NULL group key deserialized as NaN on the Lambda's Python 3.12/pyarrow for some result shapes — so the reshape couldn't find the sum row. Per-group values were correct; only sum broke. Fix: identify the sum row with GROUPING() instead of relying on the NULL key — environment-robust. Verified live: tree_count 1986 sum = 40329 ✓.
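The failure mode is easy to reproduce without a database. The sketch below uses made-up rows shaped like the GROUPING SETS output and compares two ways of locating the per-year sum row; `find_sum_by_null_key` and `find_sum_by_flag` are illustrative names, and the NaN variant only imitates how the NULL key reportedly arrived in the Lambda environment.

```python
import math


def is_missing(value):
    """True for None and for float NaN, the two shapes a SQL NULL may arrive in."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_sum_by_null_key(rows):
    """Fragile: assumes the super-aggregate row's key is exactly None."""
    return [r for r in rows if r["grp"] is None]


def find_sum_by_flag(rows):
    """Robust: uses the GROUPING() flag, independent of how NULL deserializes."""
    return [r for r in rows if r["is_sum"] == 1]


def make_rows(null_value):
    """One year with group 'A', a real-NULL group, and the per-year sum row."""
    return [
        {"yr": 1986, "grp": "A", "is_sum": 0, "n": 3},
        {"yr": 1986, "grp": null_value, "is_sum": 0, "n": 2},  # real-NULL group
        {"yr": 1986, "grp": null_value, "is_sum": 1, "n": 5},  # super-aggregate
    ]


rows_none = make_rows(None)           # NULL deserialized as None
rows_nan = make_rows(float("nan"))    # NULL deserialized as NaN

if __name__ == "__main__":
    print(len(find_sum_by_null_key(rows_none)))  # ambiguous: two candidates
    print(len(find_sum_by_null_key(rows_nan)))   # nothing found: sum goes missing
    print(len(find_sum_by_flag(rows_none)), len(find_sum_by_flag(rows_nan)))
```

The flag also resolves a second ambiguity: a group whose key really is NULL is indistinguishable from the sum row when only the key is inspected, whichever way NULL deserializes.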
| Endpoint | Before | Now |
|---|---|---|
| /v3/trends filtered (frontend's case) | ~6.8s | 0.80s |
| /v3/trends unfiltered | ~6.8s | ~0.7s (cache) |
| /v3/species-breakdown | ~2s | 0.35s |
| /v3/overview | ~4–6s | 0.38s |
| /v3/bds/simulation | ~2s | 0.14s |
| /v3/flora | ~2s | 0.13s |