Tutorial
A guided, end-to-end tour of Firefly DataScience — from booting the app to serving a model.
This tutorial mirrors the runnable script samples/tutorial.py,
which is covered by a test (tests/samples/test_tutorial.py), so every step shown here is exercised
by the test suite. It runs offline with no LLM key: the GenAI steps use deterministic stand-ins, and
the final section shows how to switch on a real LLM.
We use a synthetic credit-risk dataset whose default risk is driven by debt-to-income — a ratio deliberately withheld from the model, so feature engineering has something real to discover.
The pattern every step rests on — the LLM proposes; the classical engine decides
Generative AI only ever proposes candidates here: feature code (step 4) and model choices (step 5). A deterministic classical engine cross-validates each one and a cost-benefit gate keeps it only if it measurably beats the current baseline. The LLM never touches the score — the data does. That is why the tour runs identically with or without an API key.
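The pattern can be sketched in a few lines of plain Python. This is an illustrative sketch, not the framework's implementation: `decide` and `FAKE_CV_SCORES` are hypothetical names, and the scores are hard-coded stand-ins for real cross-validation results.

```python
from __future__ import annotations

from typing import Callable, Iterable


def decide(
    candidates: Iterable[str],
    score: Callable[[str], float],
    baseline: float,
) -> tuple[list[str], float]:
    """Keep a candidate only if its measured score strictly beats the running baseline."""
    accepted: list[str] = []
    for name in candidates:
        measured = score(name)      # a measurement, never the proposer's opinion
        if measured > baseline:     # strict improvement is required
            accepted.append(name)
            baseline = measured     # later candidates must beat the new baseline
    return accepted, baseline


# Hypothetical stand-in for cross-validated scores.
FAKE_CV_SCORES = {"ratio_feature": 0.81, "constant_feature": 0.79, "weak_feature": 0.80}

accepted, final = decide(
    candidates=["ratio_feature", "constant_feature", "weak_feature"],
    score=FAKE_CV_SCORES.__getitem__,
    baseline=0.79,
)
print(accepted, final)
```

Whoever produces the candidate list, a fixed stand-in or an LLM, has no say in the outcome: only `score` and the comparison against `baseline` decide what is kept.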
1. Boot the application
from fireflyframework_datascience import FireflyDataScienceApplication
app = FireflyDataScienceApplication.run() # (1)!
- Pass print_output=False to suppress the banner (the script does this so its test output stays clean).
This prints the banner and a wiring summary, loads configuration, builds the dependency-injection
container, and discovers every adapter via entry-point auto-configuration. app.bean_count and
app.applied_auto_configurations tell you what got wired.
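Entry-point discovery is a standard Python packaging mechanism. The sketch below shows the general idea using only `importlib.metadata`; the group name is hypothetical, and the snippet makes no claim about which group or loading logic the framework actually uses.

```python
from importlib.metadata import entry_points


def discover(group: str) -> list:
    """Return the sorted names of installed entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):           # Python 3.10+ selection API
        selected = eps.select(group=group)
    else:                                # Python 3.8/3.9 return a dict keyed by group
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Hypothetical group name, used only for illustration.
print(discover("example.hypothetical_autoconfig"))
```

A plugin package advertises itself under a group in its packaging metadata, and the host application lists that group at startup, which is why installing an extra can change what gets wired without any code change.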
Expected
On a fresh [tabular] install the container wires a couple dozen beans from roughly a dozen
auto-configurations (exact counts depend on which extras are installed):
See Architecture.
2. Build, load, and validate the data
The script generates the credit dataset with make_credit_dataset(). Default risk is a logistic
function of debt_to_income = loan_amount / income, plus prior defaults and employment, but only the
four raw columns are handed to the model — debt_to_income is the hidden driver feature engineering
must rediscover.
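To make the label mechanism concrete, here is a small worked example using the coefficients from make_credit_dataset(). It is a sketch that omits the random noise term, and `default_probability` is a hypothetical helper that is not part of the library.

```python
import math


def default_probability(income, loan_amount, employment_years, num_prior_defaults):
    """Noise-free default probability from the logistic formula used to build the labels."""
    dti = loan_amount / income
    logit = -2.6 + 5.0 * dti + 1.3 * num_prior_defaults - 0.05 * employment_years
    return 1.0 / (1.0 + math.exp(-logit))


low = default_probability(60_000, 18_000, 15, 0)    # dti = 0.30
high = default_probability(30_000, 18_000, 15, 0)   # dti = 0.60
print(f"dti=0.30 -> {low:.3f}, dti=0.60 -> {high:.3f}")
```

Doubling the ratio from 0.30 to 0.60 moves the probability from roughly 0.14 to roughly 0.41. Neither raw column carries that signal on its own, which is what makes the ratio worth rediscovering.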
import numpy as np
import pandas as pd
from fireflyframework_datascience.core.types import TaskType
from fireflyframework_datascience.datasets import Dataset
from fireflyframework_datascience.validation.adapters import BasicValidator
def make_credit_dataset(n: int = 800, seed: int = 11) -> Dataset:
    rng = np.random.RandomState(seed)
    income = rng.normal(60_000, 18_000, n).clip(15_000, None)
    loan_amount = rng.normal(18_000, 10_000, n).clip(1_000, None)
    employment_years = rng.uniform(0, 30, n).round(1)
    num_prior_defaults = rng.poisson(0.6, n)
    dti = loan_amount / income # (1)!
    logit = -2.6 + 5.0 * dti + 1.3 * num_prior_defaults - 0.05 * employment_years + rng.normal(0, 0.25, n)
    default = (rng.uniform(0, 1, n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    X = pd.DataFrame(
        {
            "income": income.round(2),
            "loan_amount": loan_amount.round(2),
            "employment_years": employment_years,
            "num_prior_defaults": num_prior_defaults, # (2)!
        }
    )
    return Dataset(
        "credit_applicants",
        X,
        pd.Series(default, name="default"),
        task=TaskType.BINARY,
        target_name="default",
        feature_names=list(X.columns),
    )
dataset = make_credit_dataset()
report = BasicValidator().validate(dataset.X, dataset.y)
assert report.ok # no all-null columns, no null target, etc.
train, test = dataset.train_test_split(test_size=0.25, random_state=0) # (3)!
- dti drives the label but is never put into X; that is the signal step 4 has to recover.
- Only these four raw columns reach the model; debt_to_income is deliberately absent.
- train_test_split stratifies on the target for classification; here it yields 600 train / 200 test rows.
The BasicValidator catches empty data, all-null/constant columns, duplicate rows, and null targets
before you waste time training.
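The kinds of checks listed above can be sketched over plain Python lists. This is an illustration of the idea only: `basic_checks` is a hypothetical function, and the real BasicValidator's rules, thresholds, and report format are not shown in this tutorial.

```python
from __future__ import annotations


def basic_checks(columns: dict[str, list], target: list) -> list[str]:
    """Return human-readable problems; an empty list means the data looks usable."""
    if len(target) == 0:
        return ["dataset is empty"]
    problems: list[str] = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present:
            problems.append(f"column {name!r} is all null")
        elif len(set(present)) == 1:
            problems.append(f"column {name!r} is constant")
    rows = list(zip(*columns.values()))
    if len(set(rows)) < len(rows):       # at least two rows are identical
        problems.append("duplicate rows present")
    if any(t is None for t in target):
        problems.append("target contains nulls")
    return problems


problems = basic_checks(
    {"income": [50_000, 50_000, 72_000], "flag": [1, 1, 1], "notes": [None, None, None]},
    target=[0, 0, None],
)
print(problems)
```

Each of these problems would otherwise surface later as a confusing training failure or a silently useless feature, so checking first is cheap insurance.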
Expected
The dataset is 800 rows × 4 features, a TaskType.BINARY task; the split gives 600 train / 200 test.
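Stratifying on the target means splitting each class separately so that train and test keep roughly the same class mix. A minimal sketch of that idea, assuming a simple per-class shuffle (`stratified_split` is a hypothetical helper, not the library's train_test_split):

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Split row indices so each class keeps roughly the same share in train and test."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # randomize within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 8 + [1] * 4
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 8 negatives and 4 positives, the 25% test set gets 2 negatives and 1 positive, so the one-third positive rate is the same on both sides of the split.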
See Datasets.
3. Classical AutoML
from fireflyframework_datascience.automl import AutoML
result = AutoML(cv=4).fit(train)
print(result.leaderboard_table())
print(result.evaluate(test)) # holdout metrics
AutoML cross-validates each candidate trainer (RandomForestTrainer, LinearTrainer,
HistGradientBoostingTrainer, plus the boosting libraries if installed), ranks them on a
task-appropriate metric (roc_auc for binary), and refits the winner on the full training set. Each
candidate is wrapped in an impute-and-scale preprocessing pipeline before scoring.
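The cross-validate, rank, and refit cycle can be sketched without any ML library. The example below is a simplified illustration under stated assumptions: the two candidates are toy models, the metric is accuracy rather than roc_auc, there is no preprocessing pipeline, and every name is hypothetical.

```python
from statistics import mean


def k_fold(n, k):
    """Yield (train, valid) index lists for k contiguous folds over n rows."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        yield [i for i in range(n) if not lo <= i < hi], list(range(lo, hi))


def fit_majority(xs, ys):
    """Toy candidate: always predict the most common training label."""
    label = max(set(ys), key=ys.count)
    return lambda x: label


def fit_threshold(xs, ys):
    """Toy candidate: split at the midpoint between the two class means."""
    cut = (mean(x for x, y in zip(xs, ys) if y == 0)
           + mean(x for x, y in zip(xs, ys) if y == 1)) / 2
    return lambda x: int(x > cut)


def leaderboard(candidates, xs, ys, cv=4):
    """Cross-validate every candidate and rank by mean accuracy, best first."""
    rows = []
    for name, fit in candidates.items():
        scores = []
        for train, valid in k_fold(len(xs), cv):
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            scores.append(sum(model(xs[i]) == ys[i] for i in valid) / len(valid))
        rows.append((name, sum(scores) / len(scores)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


candidates = {"majority": fit_majority, "threshold": fit_threshold}
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75, 0.35, 0.65]
ys = [0, 1] * 6
board = leaderboard(candidates, xs, ys)
final_model = candidates[board[0][0]](xs, ys)   # refit the winner on all training rows
print(board, final_model(0.95))
```

The ranking uses only held-out fold scores, and the final refit happens once, after the winner is known, so the winner benefits from every training row.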
Expected
A leaderboard topped by linear, and a holdout roc_auc ≈ 0.85:
Note
The leaderboard prints cross-validation scores on the training data, while evaluate(test)
reports metrics on the untouched holdout, so the two figures need not match: here the holdout
roc_auc (≈0.85) is higher than the CV figure (≈0.79). Both are real; they measure different things.
See Classical AutoML.
4. GenAI feature engineering
from fireflyframework_datascience.features import StaticFeatureProposer, FeatureProposal
from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
proposer = StaticFeatureProposer([
    FeatureProposal("debt_to_income", "df['debt_to_income'] = df['loan_amount'] / (df['income'] + 1)", "DTI"), # (1)!
    FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected"), # (2)!
])
engineered = GenAIFeatureEngineer(proposer, cv=4).engineer(train)
print(engineered.summary())
- The hidden driver: its CV lift clears CostBenefitGate(min_gain=0.0), so it is accepted.
- A constant column adds nothing, so the gate rejects it; the LLM never overrides that decision.
The loop is propose → execute (safely) → measure CV lift → gate. debt_to_income is accepted
because it lifts the score; the constant noise feature is rejected. The LLM never decides — the
measured score does.
Expected
engineered.accepted lists debt_to_income; engineered.rejected lists noise with the reason
no lift (0.7889 <= 0.7889). The lift from debt_to_income is small but positive and real; the gate
rejects anything that does not strictly beat the running baseline.
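The strict comparison is what turns an unchanged score into a rejection. A minimal sketch of such a gate follows; `cost_benefit_gate` is a hypothetical function, the reason format is modeled on the message quoted above, and the 0.7901 score is invented for illustration.

```python
def cost_benefit_gate(baseline: float, candidate: float, min_gain: float = 0.0):
    """Accept only when the candidate beats the baseline by strictly more than min_gain."""
    gain = candidate - baseline
    if gain > min_gain:
        return True, f"lift {gain:+.4f}"
    return False, f"no lift ({candidate:.4f} <= {baseline + min_gain:.4f})"


print(cost_benefit_gate(0.7889, 0.7889))   # a constant column leaves the score unchanged
print(cost_benefit_gate(0.7889, 0.7901))   # hypothetical score for a useful feature
```

With min_gain=0.0 a tie is still a rejection, so a feature has to earn its place rather than merely do no harm. Raising min_gain would demand a larger lift before paying the cost of an extra column.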
StaticFeatureProposer stands in for the LLM so the tutorial runs offline with a fixed,
reproducible set of proposals — exactly what the snippet above uses.
With a real model you swap in AgentFeatureProposer, which wraps a FireflyAgent and is built
lazily (no LLM client is created at startup):
from fireflyframework_datascience.features.genai import AgentFeatureProposer, GenAIFeatureEngineer
proposer = AgentFeatureProposer(model="openai:gpt-4o")
engineered = GenAIFeatureEngineer(proposer, cv=4).engineer(train)
See Configuring the LLM.
See GenAI Feature Engineering.
5. The agentic ML-engineering loop
from fireflyframework_datascience.engineering import SequenceProposer, SolutionCandidate
from fireflyframework_datascience.engineering.loop import AgenticAutoML
proposer = SequenceProposer([
    SolutionCandidate("linear"),
    SolutionCandidate("random_forest"),
    SolutionCandidate("hist_gradient_boosting"),
])
run = AgenticAutoML(proposer, cv=3, max_iterations=4).solve(train) # (1)!
print(run.summary())
- AgenticAutoML seeds the population, then reflects on the attempt history up to max_iterations times; a patience budget (default 3) stops the search once it stalls.
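The interaction between an iteration cap and a patience budget can be sketched as follows. This is an assumption about how such a stopping rule typically works, not a description of AgenticAutoML's internals; `search` and the score sequence are hypothetical.

```python
def search(scores_per_iteration, max_iterations=4, patience=3):
    """Stop at max_iterations, or earlier once `patience` rounds in a row fail to improve."""
    best, stale, history = float("-inf"), 0, []
    for iteration, score in enumerate(scores_per_iteration[:max_iterations], start=1):
        history.append((iteration, score))
        if score > best:
            best, stale = score, 0      # progress resets the stall counter
        else:
            stale += 1
            if stale >= patience:       # stalled for too long: give up early
                break
    return best, history


best, history = search([0.78, 0.79, 0.79, 0.78, 0.80], max_iterations=10, patience=2)
print(best, len(history))
```

In this run the search stops after the fourth attempt and never sees the fifth score, which illustrates the trade-off: a small patience saves compute but can miss a late improvement.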
Each candidate is trained, cross-validated, and verified by a DeterministicVerifier — it must
beat a trivial DummyClassifier(strategy="prior") baseline, not merely run (the "correctness ≠ ran"
principle) — before the best one is selected. run.attempts is the full audited trail and
run.valid_attempts are the ones that passed verification.
Expected
All three seeded candidates clear the roc_auc=0.5000 trivial baseline, so all three are verified;
linear wins.
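The 0.5000 baseline is no accident: a model that gives every row the same score, such as the class prior, cannot rank positives above negatives, so its ROC AUC is exactly 0.5. The sketch below computes AUC by pairwise comparison and applies a "must beat the baseline" check; the function and the candidate scores are illustrative, not the DeterministicVerifier's code.

```python
def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative; ties count as half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 0, 1, 1]
prior = sum(y_true) / len(y_true)
baseline_auc = roc_auc(y_true, [prior] * len(y_true))             # same score for every row
candidate_auc = roc_auc(y_true, [0.1, 0.4, 0.35, 0.2, 0.8, 0.9])  # hypothetical model scores
verified = candidate_auc > baseline_auc                           # running is not enough
print(f"{baseline_auc:.4f} {candidate_auc:.4f} {verified}")
```

A candidate that trains without error but scores at or below 0.5 would fail this check, which is the point of separating "it ran" from "it is correct".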
See Agentic Loop.
6. Serve the model
from fireflyframework_datascience.serving import LocalModelServer
server = LocalModelServer()
server.load(result.best_model)
prediction = server.predict(test.X.iloc[[0]]) # score one applicant
print(int(prediction[0]))
LocalModelServer is the default, dependency-free server: it loads a fitted Model in the host
process and answers predict / predict_proba. Heavier servers (e.g. BentoMLModelServer) live
behind the serving extra.
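An in-process server is little more than a holder for a fitted model that forwards prediction calls. The sketch below shows that shape with a toy model; `TinyModelServer` and `RatioModel` are hypothetical and do not reflect how LocalModelServer is implemented.

```python
class TinyModelServer:
    """Hypothetical in-process server: holds one fitted model and forwards calls to it."""

    def __init__(self):
        self._model = None

    def load(self, model):
        self._model = model

    def predict_proba(self, rows):
        if self._model is None:
            raise RuntimeError("no model loaded; call load() first")
        return self._model.predict_proba(rows)

    def predict(self, rows, threshold=0.5):
        # Turn probabilities into 0/1 labels at the given threshold.
        return [int(p >= threshold) for p in self.predict_proba(rows)]


class RatioModel:
    """Toy 'fitted model': the score is loan_amount / income, capped at 1.0."""

    def predict_proba(self, rows):
        return [min(1.0, row["loan_amount"] / row["income"]) for row in rows]


server = TinyModelServer()
server.load(RatioModel())
print(server.predict([{"income": 60_000, "loan_amount": 18_000}]))
```

Because nothing leaves the host process, there is no network hop or serialization step, which suits tests and notebooks. A heavier server trades that simplicity for isolation and scaling.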
Expected
The first holdout applicant is scored as a non-default:
See Serving & Lineage.
Turn on a real LLM
Steps 4 and 5 ran offline with deterministic stand-ins. To let a real model do the proposing, set your key and enable GenAI, then swap in the agent-backed proposers:
export OPENAI_API_KEY=sk-... # or ANTHROPIC_API_KEY=...
export FIREFLY_DATASCIENCE_GENAI__ENABLED=true
export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=openai:gpt-4o # or anthropic:claude-sonnet-4-5
from fireflyframework_datascience.features.genai import AgentFeatureProposer
from fireflyframework_datascience.engineering.loop import AgentSolutionProposer
Use AgentFeatureProposer in place of StaticFeatureProposer (step 4) and AgentSolutionProposer in
place of SequenceProposer (step 5). Nothing else changes — the cost-benefit gate and the verifier
still decide. The full guide, including providers, keys, cost gating, and secure execution, is in
Configuring the LLM.
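The variable names above suggest a common convention in which a double underscore separates a settings section from a key inside it. The sketch below shows how such names could map onto nested settings. That mapping is an assumption made for illustration; `read_settings` is a hypothetical helper, and the framework's own parsing is not described in this tutorial.

```python
def read_settings(environ, prefix="FIREFLY_DATASCIENCE_", delimiter="__"):
    """Turn PREFIX_SECTION__KEY=value variables into a nested dict with lowercase keys."""
    settings = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue                          # unrelated variables are ignored
        path = name[len(prefix):].lower().split(delimiter)
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})  # descend, creating sections as needed
        node[path[-1]] = value
    return settings


env = {
    "FIREFLY_DATASCIENCE_GENAI__ENABLED": "true",
    "FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL": "openai:gpt-4o",
    "OPENAI_API_KEY": "sk-...",   # no matching prefix, so it is skipped
}
settings = read_settings(env)
print(settings)
```

In real use the mapping passed in would be `os.environ`. Values arrive as strings, so a flag like "true" still needs converting to a boolean by whatever consumes the settings.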