Latency is not dominated by data volume or pandas processing — it is dominated by Athena's fixed per-query overhead. Proof: /v3/bds/years returns a 109-byte payload from a trivial GROUP BY, yet still took ~2.0s. A big driver was that the awswrangler client polled Athena for completion only once per second (athena_query_wait_polling_delay = 1.0), so each query wasted up to a full second just waiting.
The question was whether "some endpoints take raw data and process it, some just straight query." Confirmed — and the split is lopsided:
- Raw: SELECT <per-tree columns> FROM dataset_gus_1_new_2ha WHERE … with no aggregation, pull every matching per-tree row, then enrich/filter/aggregate in pandas: overview, trends, trends-v2, species-breakdown, flora, summary, bds/simulation.
- Model: forecast-by-class-view, forecast-by-summary-view, forecast (alias).
- SQL-aggregated: /v3/bds/years, which runs SELECT year … GROUP BY year.
- No Athena: /v3/bds/compartments (static constant), /, /species (reads species.csv).

Note: species metadata (name, group, redlist, timber group) is enriched locally from species.csv rather than via an Athena JOIN. This is a deliberate choice to avoid INNER-JOIN row loss.
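The row-loss concern in the note can be illustrated with a small pandas sketch. The column names (species_code, species_group) and the sample rows are hypothetical, not taken from the real dataset; the point is only that a left merge keeps trees whose species is missing from the lookup, whereas an inner join drops them silently.

```python
import pandas as pd

# Hypothetical per-tree rows as they might come back from Athena.
trees = pd.DataFrame({
    "tree_id": [1, 2, 3],
    "species_code": ["SHOR", "DIPT", "UNKN"],  # "UNKN" is absent from the lookup
    "dbh_cm": [34.0, 12.5, 41.2],
})

# Hypothetical stand-in for the contents of species.csv.
species = pd.DataFrame({
    "species_code": ["SHOR", "DIPT"],
    "species_group": ["Dipterocarp", "Dipterocarp"],
})

inner = trees.merge(species, on="species_code", how="inner")
left = trees.merge(species, on="species_code", how="left")

print(len(inner))                          # 2: the unmatched tree is gone
print(len(left))                           # 3: every tree is kept
print(left["species_group"].isna().sum())  # 1: unmatched metadata is NaN
```

With the left merge, unmatched trees still count toward per-hectare totals and can be reported as "unknown" instead of disappearing from the result.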
| Endpoint | Type | Athena queries | How it gets & processes data |
|---|---|---|---|
| /v3/overview | raw + model | 2 | Pulls raw per-tree rows (filtered by year/loc/species/dbh) → enrich species → per-ha summary + group-by species_group & DBH class. Then a 2nd query fetches baseline-year rows to run the forecast projection. The double query is why it is the slowest. |
| /v3/trends | raw | 1 (large) | Always pulls ALL years (skip_year_filter) because census metrics compare consecutive censuses. compute_census_metrics() does per-location, per-year Python set-diffs for growth/mortality/recruitment, then aggregates per year × metric × group. |
| /v3/trends-v2 | raw | 1 | Same shape as trends but gated by metrics. Only pulls all years if a census metric is requested; otherwise filters years in SQL. Faster than trends for non-census metrics. |
| /v3/species-breakdown | raw | 1 | Raw rows → enrich → groupby('species') → per-species per-ha metrics → top-N ranking per metric + a red-list breakdown by DBH class. |
| /v3/flora | raw | 1 | Pulls all matching rows (incl. coordinates), enriches, then paginates in memory (df.iloc[start:end]). Fetches the full result set even for page 1, which is inefficient for large filters. |
| /v3/summary | raw | 1 | Raw rows (year-filtered unless census) → enrich → optional census → groupby('dbh_size_class') per requested metric, with percentage-of-total. |
| /v3/forecast-by-class-view | model | 1 | Fetch baseline-year raw rows → build_stand_table() → run_projection() iterates +5y…+30y applying DIF growth / mortality / ingrowth coefficients in NumPy. Large payload (~230 KB). |
| /v3/forecast-by-summary-view | model | 1 | Same projection, then groupby(forecast_year, species_group, is_protected) and sum to stand totals. |
| /v3/forecast | model | 1 | Legacy alias → forecast-by-class-view. |
| /v3/bds/simulation | raw | 1 | Fetch one location+year raw → enrich → price → pre-felling filter (DBH ≥ 30 cm) → distributions + candidate-tree logic (commercial groups, cutting limits, red-list exclusion). |
| /v3/bds/years | SQL | 1 | The only SQL-aggregated endpoint. SELECT year … GROUP BY year ORDER BY year returns just the distinct census years. |
| /v3/bds/compartments | none | 0 | Pure Python constant from LOCATION_AREA. No I/O. |
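The "set-diffs" mentioned for the census metrics can be sketched as follows. This is a minimal illustration, not the actual compute_census_metrics() implementation: the function name census_changes and the tree-ID inputs are hypothetical, and it is an assumption here that growth is measured on the surviving trees.

```python
def census_changes(prev_ids, curr_ids):
    """Compare tree IDs from two consecutive censuses of one location."""
    prev_ids, curr_ids = set(prev_ids), set(curr_ids)
    return {
        "survivors": prev_ids & curr_ids,    # in both censuses (growth candidates)
        "mortality": prev_ids - curr_ids,    # present before, missing now
        "recruitment": curr_ids - prev_ids,  # new in the later census
    }


changes = census_changes(prev_ids=[1, 2, 3, 4], curr_ids=[2, 3, 4, 5, 6])
print(sorted(changes["survivors"]))    # [2, 3, 4]
print(sorted(changes["mortality"]))    # [1]
print(sorted(changes["recruitment"]))  # [5, 6]
```

Each comparison needs the previous census as well as the current one, which is why the year filter cannot be pushed into SQL when a census metric is requested.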
3 runs each; cold-start & API-Gateway/Cognito overhead excluded (this isolates app + Athena time). "Before" = original query layer; "After" = optimized query layer (see below). Values are the fastest of 3 runs, in seconds.
| Endpoint | Before | After | Improvement | Payload |
|---|---|---|---|---|
| / (no Athena) | 0.003 | 0.003 | n/a | 50 B |
| /species (CSV) | 0.015 | 0.013 | n/a | 189 KB |
| /v3/overview (loc 1, 2 ha) | 4.11 | 2.49 | −39% | 6.8 KB |
| /v3/overview (loc 2, 50 ha) | 5.84 | 4.18 | −28% | 7.3 KB |
| /v3/species-breakdown | 2.08 | 1.35 | −35% | 6.5 KB |
| /v3/trends (all years, census) | 3.07 | 2.01 | −35% | 12.8 KB |
| /v3/trends-v2 (no census) | 2.08 | 0.98 | −53% | 353 B |
| /v3/trends-v2 (census) | 2.60 | 1.95 | −25% | 332 B |
| /v3/flora (page 10) | 2.11 | 1.07 | −49% | 4.6 KB |
| /v3/summary (basal_area) | 2.11 | 1.33 | −37% | 637 B |
| /v3/summary (census) | 2.57 | 1.62 | −37% | 641 B |
| /v3/forecast-by-class-view | 2.12 | 1.39 | −34% | 230 KB |
| /v3/forecast-by-summary-view | 2.12 | 1.35 | −36% | 11.4 KB |
| /v3/bds/simulation | 2.05 | 1.14 | −44% | 1.7 KB |
| /v3/bds/years | 1.97 | 0.83 | −58% | 109 B |
Payload size barely correlates with latency — further proof the cost is Athena's fixed per-query overhead, not data transfer or pandas.
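For reference, the measurement method (fastest of 3 runs) and the Improvement column can be reproduced with a few lines of Python. The helper names best_of and improvement_pct are hypothetical; the benchmark harness actually used is not shown in this report.

```python
import time


def best_of(fn, runs=3):
    """Return the fastest wall-clock time in seconds over `runs` calls of fn."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def improvement_pct(before, after):
    """Relative latency change in percent; negative means faster."""
    return round((after - before) / before * 100)


print(improvement_pct(4.11, 2.49))  # -39, matches /v3/overview (loc 1)
print(improvement_pct(1.97, 0.83))  # -58, matches /v3/bds/years
```

Taking the minimum rather than the mean filters out one-off noise such as network jitter, which suits a comparison that is meant to isolate app and Athena time.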
/v3/trends is the slowest, and will a bigger Lambda help?

/v3/trends is the worst case. Profiled against compartment 50 (the 50 ha plot), pulling all census years:
| Stage | Time | CPU/mem sensitive? |
|---|---|---|
| Rows fetched | 215,926 | 50 ha × 6 census years, raw per-tree rows |
| 1 — Athena fetch | 5.26s | No — server-side query + paginated row transfer |
| 2 — enrich + price | 0.31s | single-threaded pandas |
| 3 — census metrics | 0.83s | Barely — single-threaded pandas |
| 4 — aggregation loop | 1.90s | Barely — single-threaded pandas |
| TOTAL | 8.30s | |
Two root causes: (1) /v3/trends forces skip_year_filter=True — it pulls every census year regardless of the year picker, because census metrics compare consecutive censuses tree-by-tree; (2) the aggregation is a triple nested loop (~6 years × 23 metrics × 8 groups ≈ 1,100 passes) that re-filters and re-sums slices of a 215k-row DataFrame.
Lambda CPU scales with memory: the function is at 3008 MB ≈ 1.78 vCPU; the ceiling is 10240 MB ≈ 6 vCPU. But 63% of the time is the Athena fetch, which is server-side and completely insensitive to Lambda size. The remaining 37% is single-threaded pandas — and at 3008 MB a single thread already gets a full core, so adding cores doesn't make it faster. Estimate: ~3.4× the cost for maybe 10–20% latency improvement. Not worth it.
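The arithmetic behind this estimate can be checked with a short Amdahl-style calculation using the stage timings from the profile. The speedup factors tried for the pandas stages (1.5x and 2x) are hypothetical; they only show what range of gains is plausible when the Athena fetch stays fixed.

```python
# Stage timings from the /v3/trends profile, in seconds.
athena_fetch = 5.26
pandas_stages = 0.31 + 0.83 + 1.90
total = athena_fetch + pandas_stages

athena_share = athena_fetch / total  # the part a bigger Lambda cannot touch
cost_ratio = 10240 / 3008            # memory ratio; price scales with memory


def latency_after(pandas_speedup):
    """Total latency if only the pandas stages run `pandas_speedup` times faster."""
    return athena_fetch + pandas_stages / pandas_speedup


print(f"Athena share: {athena_share:.0%}")  # Athena share: 63%
print(f"Cost ratio: {cost_ratio:.1f}x")     # Cost ratio: 3.4x
for speedup in (1.5, 2.0):  # hypothetical pandas speedups
    gain = 1 - latency_after(speedup) / total
    print(f"pandas {speedup}x faster: {gain:.0%} lower latency")
```

Even a 2x speedup of the pandas stages would cut total latency by only about 18%, in line with the 10–20% estimate, while the configured memory (and therefore the price per millisecond) rises by about 3.4x.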
What actually fixes Trend, in order of value-for-effort:
GROUP BY year, dbh_size_class returns ~42 rows instead of 215,926, cutting both Athena transfer and pandas time.

| Stage | Before | After |
|---|---|---|
| Athena fetch | 5.26s | ~5.0s |
| census metrics | 0.83s | ~0.8s |
| aggregation loop | 1.66–1.90s | 0.19s |
| TOTAL | 8.30s | 6.45s |
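The report does not show how the aggregation loop was rewritten. One common way to get a reduction of this kind is to replace nested filter-and-sum loops with a single groupby, sketched below. The column names (year, group, basal_area, volume) and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical per-tree rows; the real frame has ~215k rows and more columns.
df = pd.DataFrame({
    "year": [2010, 2010, 2015, 2015, 2015],
    "group": ["A", "B", "A", "A", "B"],
    "basal_area": [0.10, 0.20, 0.15, 0.05, 0.30],
    "volume": [1.0, 2.0, 1.5, 0.5, 3.0],
})
metrics = ["basal_area", "volume"]

# Nested-loop style: one boolean filter and one sum per (year, metric, group).
looped = {}
for year in sorted(df["year"].unique()):
    for metric in metrics:
        for group in sorted(df["group"].unique()):
            mask = (df["year"] == year) & (df["group"] == group)
            looped[(year, group, metric)] = df.loc[mask, metric].sum()

# Vectorized style: group the rows once and sum every metric in one pass.
grouped = df.groupby(["year", "group"])[metrics].sum()

# Both approaches give the same totals.
for (year, group, metric), value in looped.items():
    assert abs(grouped.loc[(year, group), metric] - value) < 1e-12
```

The looped version rescans the whole frame once per combination, so its cost grows with rows × years × metrics × groups. The grouped version touches each row once regardless of how many combinations exist.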
A two-line change in the shared query_athena() helper (src/query.py), affecting all endpoints at once:
- Lower athena_query_wait_polling_delay from the default 1.0s to 0.25s. Pure win, no downside: saves up to ~0.75s on every query.
- Enable Athena query result reuse (ResultReuseByAgeConfiguration). The dataset is static between Glue ETL runs, so identical queries return Athena's cached result instead of re-scanning. The window is bounded by ATHENA_RESULT_REUSE_MINUTES (default 60) to cap staleness.

| Configuration | Min latency |
|---|---|
| A: current (polling=1.0, no reuse) | 2.01s |
| B: polling=0.25 | 1.55s |
| C: polling=0.25 + result reuse | 1.02s |
Result reuse means a re-run of the ETL won't be reflected until the reuse window expires (default 60 min). If you need fresh data immediately after re-ingest, lower ATHENA_RESULT_REUSE_MINUTES or set it to 0 to disable. The poll-interval change has no such tradeoff.
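To make the two settings concrete, here is a minimal sketch written against the underlying Athena API (as exposed by a boto3 Athena client) rather than awswrangler, so it is not the code in query_athena(). The function names and the injected client argument are hypothetical; the request fields (ResultReuseConfiguration, ResultReuseByAgeConfiguration, MaxAgeInMinutes) are the ones Athena's StartQueryExecution accepts.

```python
import os
import time

POLL_SECONDS = 0.25  # polling interval discussed above
REUSE_MINUTES = int(os.environ.get("ATHENA_RESULT_REUSE_MINUTES", "60"))


def build_start_kwargs(sql, database, output_location, reuse_minutes=REUSE_MINUTES):
    """Build StartQueryExecution arguments, with optional result reuse."""
    kwargs = {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }
    if reuse_minutes > 0:  # 0 disables reuse, so results are always fresh
        kwargs["ResultReuseConfiguration"] = {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": reuse_minutes,
            }
        }
    return kwargs


def run_query(client, sql, database, output_location, poll_seconds=POLL_SECONDS):
    """Start a query and poll until it leaves the QUEUED/RUNNING states."""
    response = client.start_query_execution(
        **build_start_kwargs(sql, database, output_location)
    )
    query_id = response["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state not in ("QUEUED", "RUNNING"):
            return query_id, state  # SUCCEEDED, FAILED or CANCELLED
        time.sleep(poll_seconds)
```

In real use the client would be boto3.client("athena"). A query that finishes just after a status check waits one full polling interval before the next check, which is why shortening the interval removes up to ~0.75s per query.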
Most-called endpoints: /v3/overview (184), /v3/species-breakdown (183), /years-by-location (146), /v3/trends (106), /v3/flora (103). A handful of 500s were seen on /v3/forecast-by-*.
| Impact | Recommendation | Notes |
|---|---|---|
| HIGH | Drop Athena from the hot path — bake the dataset into the image as Parquet and query with DuckDB/pandas, or pre-compute endpoint responses to S3/DynamoDB. | Data is tiny (2 ha + 50 ha plots) and static. Removing the ~2s Athena floor entirely would put most endpoints in the tens-to-low-hundreds of ms. |
| HIGH | Address cold starts (23.5% of traffic, ~1.75s each). | Provisioned concurrency, or slim the package / lazy-import awswrangler (pandas+pyarrow dominate init). |
| MED | /v3/overview issues two Athena queries (main + forecast). | Make the forecast block lazy/optional, or reuse already-fetched rows. Would roughly halve latency for overview, the most-called endpoint. |
| MED | /v3/flora fetches the full result set then paginates in memory. | Push LIMIT/OFFSET (and a COUNT(*)) into SQL so page 1 doesn't scan everything. |
| MED | Push aggregation into SQL for overview / summary / species-breakdown. | GROUP BY in Athena instead of pulling raw rows. Smaller transfer; complements (not replaces) caching. |
| LOW | Steer clients from /v3/trends to /v3/trends-v2 with an explicit metrics list. | trends always pulls all years; trends-v2 only does so when a census metric is requested. |
| LOW | Investigate the 500s on /v3/forecast-by-*. | Seen in the 7-day window; likely a param/edge-case path. |
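As an illustration of the /v3/flora recommendation, the sketch below builds a COUNT(*) query and a paged query instead of fetching every row and slicing with df.iloc. The function name build_page_queries, the filter, and the sort column are hypothetical, and the inputs are assumed to be validated by the caller since they are interpolated directly into SQL.

```python
def build_page_queries(table, where_sql, order_by, page, page_size):
    """Return (count_sql, page_sql) so only one page of rows leaves Athena."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    count_sql = f"SELECT COUNT(*) AS total FROM {table} WHERE {where_sql}"
    # Trino-style SQL puts OFFSET before LIMIT; ORDER BY keeps pages stable.
    page_sql = (
        f"SELECT * FROM {table} WHERE {where_sql} "
        f"ORDER BY {order_by} OFFSET {offset} LIMIT {page_size}"
    )
    return count_sql, page_sql


count_sql, page_sql = build_page_queries(
    table="dataset_gus_1_new_2ha",
    where_sql="year = 2020",  # hypothetical filter
    order_by="tree_id",       # hypothetical sort column
    page=10,
    page_size=50,
)
print(count_sql)
print(page_sql)
```

This turns one Athena query into two, and each carries the fixed per-query overhead measured above. The gain is in transfer and memory rather than in the latency floor, so the count is worth caching or running concurrently with the page query.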