Recipe API

AI Ingredient Extraction Needs Product-Grade Evals

Recent Open Food Facts LLM evaluation work shows why recipe, grocery, and nutrition APIs should test ingredient extraction with multilingual ground truth, invalid-image handling, exact-span preservation, and model/version metadata before trusting AI-generated food data.

Tags: food-ai, ingredients, api-design, evaluation

The trend: food AI is moving from demos to measured extraction

The most useful food-AI systems are not the ones that can describe a package photo in fluent prose. They are the ones that can turn difficult food inputs into stable, auditable fields: ingredient lists, languages, allergens, additives, quantities, serving sizes, and confidence signals that downstream products can safely use.

That is why a small set of recent Open Food Facts AI commits is worth paying attention to. On June 23, 2026, the project added new standalone LLM evaluations for OCR and ingredient extraction. On June 24, it improved the standalone evaluation harness, added an ingredient_extraction.yaml dataset, added a generated JSON schema, and then updated the ingredient-extraction eval again. The files are not a glossy product announcement, but they are a strong signal: serious food AI work is becoming evaluation-first.

For developers building recipe, nutrition, grocery, meal-planning, or food-data APIs, this matters because ingredient extraction is often treated as a single AI feature. Upload an image, receive ingredients, move on. In production, it is a chain of decisions: detect whether an ingredient list exists, identify its language, preserve the visible text, avoid hallucinating occluded text, handle multilingual labels, separate ingredient-list allergens from unrelated package claims, and expose enough provenance for later correction.

If your API returns only ingredients: "...", you have hidden most of the risk from your customers.

What the recent Open Food Facts evals reveal

The June 24 Open Food Facts ingredient-extraction schema defines a compact but instructive target shape. Each expected output contains a list of ingredient lists. Each list has:

  • language: an ISO 639-1 language code.
  • ingredients: the full ingredient-list text, without prefixes like “Ingredients:”.
  • invalid: whether the list is truncated, occluded, or too blurry to read accurately.
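The three fields above can be mirrored in a small typed model. This is a minimal sketch, not the Open Food Facts schema itself: the class names, the `ingredient_lists` container name, and the short list of label prefixes are assumptions made for illustration, and the language check only tests for a two-letter shape rather than membership in ISO 639-1.

```python
from dataclasses import dataclass, field
import re

# Shape check only: two lowercase letters. It does not verify that the
# code is actually assigned in ISO 639-1.
_TWO_LETTER_CODE = re.compile(r"^[a-z]{2}$")
# Hypothetical, incomplete set of label prefixes that should be stripped.
_LABEL_PREFIX = re.compile(r"^\s*(ingredients|ingrédients|zutaten)\s*:", re.IGNORECASE)


@dataclass(frozen=True)
class IngredientList:
    language: str      # ISO 639-1 code, e.g. "fr"
    ingredients: str   # full list text, without a leading "Ingredients:" prefix
    invalid: bool      # True if truncated, occluded, or too blurry to read

    def problems(self) -> list[str]:
        """Return shape problems as readable strings; empty means none found."""
        found = []
        if not _TWO_LETTER_CODE.match(self.language):
            found.append(f"language {self.language!r} is not a two-letter code")
        if _LABEL_PREFIX.match(self.ingredients):
            found.append("ingredients text still carries a label prefix")
        return found


@dataclass(frozen=True)
class ExtractionOutput:
    # An empty list is a valid answer: it means no ingredient list was found.
    ingredient_lists: list[IngredientList] = field(default_factory=list)
```

Keeping `invalid` on each list, rather than on the whole output, lets one readable language panel survive even when another panel on the same image is unreadable.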

The prompt instructions embedded in the evaluator are equally important. The model is told to extract all ingredient lists from the image, return an empty list if no food package is present, include allergen mentions only when they appear immediately after the ingredient list, avoid reconstructing missing parts of an occluded list, and avoid modifying the ingredient list text.

Those constraints are not arbitrary. They map directly to product failures that recipe and grocery APIs need to prevent:

| Failure mode | Why it matters for an API | Better contract |
| --- | --- | --- |
| Hallucinated missing text | Creates false allergens, additives, or nutrition assumptions | Mark list invalid or partial; do not complete from prior knowledge |
| Language collapse | Breaks localization, search, and regulatory display | Return one ingredient list per detected language |
| Prefix/noise capture | Pollutes normalization and ingredient matching | Preserve the list text but strip label boilerplate consistently |
| Allergen overreach | Confuses ingredient allergens with package-wide claims | Track where the allergen mention appeared |
| OCR punctuation drift | Changes parsing of nested additives and sub-ingredients | Preserve spans and expose extraction/version metadata |

The dataset itself includes multilingual examples and comments about OCR errors: missing semicolons, substituted words, and package variants. Those are exactly the cases that make food data harder than generic document extraction. A semicolon can separate additives. A language-specific plural can change an ingredient entity. A variant marker can represent ambiguity that should survive evaluation instead of being flattened away.

Recipe APIs should evaluate tasks, not vibes

A common anti-pattern in food AI product design is to test a model by manually inspecting a handful of outputs. That can be useful during exploration, but it is not enough for an API that other products will build on. The moment your extraction result feeds search, meal planning, allergy warnings, grocery substitution, or nutrition estimation, the evaluation target needs to become part of the product architecture.

A recipe-data API should distinguish at least four layers:

  1. Raw input: image URL, OCR text, scraped recipe text, receipt photo, or user-entered ingredient line.
  2. Extracted text: what the model believes is visibly present.
  3. Normalized entities: canonical ingredients, quantities, units, preparation forms, allergens, and product links.
  4. Product decisions: search facets, dietary flags, meal-plan suitability, shopping-list items, warnings, or recommendations.

LLM extraction sits at the boundary between layers 1 and 2: it reads raw input and produces extracted text. Normalization and product decisions (layers 3 and 4) should not silently inherit the model’s uncertainty. If an ingredient list is flagged as blurry, truncated, or multilingual, downstream endpoints should be able to see that and decide whether to continue, degrade gracefully, or ask for review.
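One way to keep that uncertainty visible is to carry the flags on the extracted-text layer and make every downstream consumer choose an action explicitly. The sketch below is illustrative only: `ExtractedText`, `Action`, and the `safety_critical` switch are hypothetical names, and the policy itself (invalid always goes to review, partial degrades unless the feature is safety-critical) is an assumption, not a rule taken from any project.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"   # use the extraction as-is
    DEGRADE = "degrade"     # use it, but only for low-risk features
    REVIEW = "review"       # hold it for human review


@dataclass(frozen=True)
class ExtractedText:
    """Layer 2: what the model believes is visibly present, plus its flags."""
    text: str
    language: str
    invalid: bool = False
    partial: bool = False


def decide(extracted: ExtractedText, safety_critical: bool) -> Action:
    """Pick a downstream action from the flags carried on layer 2."""
    if extracted.invalid:
        return Action.REVIEW
    if extracted.partial:
        # Assumed policy: partial text may feed drafts, never safety claims.
        return Action.REVIEW if safety_critical else Action.DEGRADE
    return Action.CONTINUE
```

A shopping-list draft might call `decide(extracted, safety_critical=False)`, while an allergy warning would pass `True` and get a stricter answer from the same extraction.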

That is the deeper lesson from the recent Open Food Facts work: an eval file is not just a test suite. It is a statement about the API contract you wish the model would satisfy.

A practical schema sketch for AI-extracted ingredients

Here is a minimal shape that recipe and grocery APIs can adapt when accepting AI-derived ingredient data:

{
  "source": {
    "type": "image",
    "url": "https://example.com/package.jpg",
    "captured_at": "2026-06-30T00:00:00Z"
  },
  "extraction": {
    "task": "ingredient_extraction",
    "model": "vendor:model-name",
    "model_version": "2026-06-24",
    "prompt_version": "ingredients-v3",
    "eval_dataset_version": "openfoodfacts-ai-2026-06-24",
    "created_at": "2026-06-30T00:00:02Z"
  },
  "ingredient_lists": [
    {
      "language": "fr",
      "text": "Farine de BLE..., huile d'ARACHIDE...",
      "invalid": false,
      "partial": false,
      "text_preservation": "verbatim_visible_text",
      "confidence": 0.91
    }
  ],
  "warnings": [
    {
      "code": "ocr_punctuation_uncertain",
      "message": "One separator near an additive list may be uncertain."
    }
  ]
}

The exact field names matter less than the separation of concerns. The extraction block makes the AI step auditable. The ingredient list block keeps text and language together. The warning block gives clients a non-binary way to handle uncertainty.

For Recipe API-style use cases, this structure also protects adjacent features. A meal planner can exclude uncertain allergy data from hard safety claims. A grocery workflow can still create a shopping-list draft while marking low-confidence entities for review. A nutrition estimator can avoid pretending that a partial ingredient list supports precise macro calculations.
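As a rough illustration of the meal-planner case, a client could gate hard allergen claims on the flags in the payload sketched above. The function name, the 0.9 threshold, and the rule that any warning blocks a claim are all assumptions chosen for the example; a real product would tune them to its own risk tolerance.

```python
def allergen_claims_allowed(payload: dict, min_confidence: float = 0.9) -> bool:
    """Return True only if every ingredient list is safe to base claims on.

    Assumed policy: no lists, any warning, any invalid or partial list,
    or any confidence below the threshold blocks hard allergen claims.
    """
    lists = payload.get("ingredient_lists", [])
    if not lists:
        return False
    if payload.get("warnings"):
        return False
    return all(
        # Missing flags are treated as unsafe rather than assumed clean.
        not item.get("invalid", True)
        and not item.get("partial", True)
        and item.get("confidence", 0.0) >= min_confidence
        for item in lists
    )
```

Under this policy the sample payload above would be rejected for hard claims, because it carries an `ocr_punctuation_uncertain` warning even though its confidence is 0.91.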

What to put in your ingredient-extraction eval set

If you are building or buying food AI infrastructure, do not start with a huge benchmark. Start with the cases that break your product. A useful first evaluation set should include:

  • Clear single-language ingredient labels.
  • Multilingual labels where each language needs its own extracted list.
  • Blurry, cropped, or occluded ingredient panels that must be marked invalid or partial.
  • Images with no food package or no ingredient list.
  • Ingredient lists followed by allergen traces that should be included.
  • Allergen, nutrition, marketing, or recycling text elsewhere on the package that should not be included.
  • Nested additives and parenthetical sub-ingredients.
  • Punctuation-sensitive examples where commas, semicolons, and colons affect parsing.
  • Known OCR errors from your existing pipeline.
  • Region-specific language, spelling, and additive notation.

Then define expected outputs as structured data, not prose. The Open Food Facts evals use YAML cases and a JSON schema. That is a sensible pattern because product managers, data reviewers, and engineers can all inspect the cases, while CI can still validate the output shape.
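A minimal sketch of that pattern follows. The cases are written as JSON only so the example runs on the standard library; YAML cases would need a third-party parser. The case ids, image paths, and the hand-written `shape_errors` check are hypothetical stand-ins for a real dataset and a real JSON Schema validator.

```python
import json

# Hypothetical cases: an image with no package, and a truncated panel whose
# visible text is preserved and flagged invalid rather than completed.
CASES_JSON = """
[
  {"id": "no-package", "image": "cases/no_package.jpg", "expected": []},
  {"id": "truncated-panel", "image": "cases/truncated.jpg",
   "expected": [{"language": "en",
                 "ingredients": "wheat flour, sugar, palm",
                 "invalid": true}]}
]
"""

REQUIRED = {"language": str, "ingredients": str, "invalid": bool}


def shape_errors(output) -> list[str]:
    """Check that output is a list of objects with the three required fields."""
    if not isinstance(output, list):
        return ["output must be a list of ingredient lists"]
    errors = []
    for i, item in enumerate(output):
        if not isinstance(item, dict):
            errors.append(f"[{i}] must be an object")
            continue
        for key, expected_type in REQUIRED.items():
            if not isinstance(item.get(key), expected_type):
                errors.append(f"[{i}].{key} must be {expected_type.__name__}")
    return errors


cases = json.loads(CASES_JSON)
for case in cases:
    # The expected outputs themselves must pass the same check as model output.
    assert shape_errors(case["expected"]) == [], case["id"]
```

Running the same shape check over both the expected outputs and the model outputs catches a drifting dataset as early as a drifting model.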

Evaluation metrics should match downstream risk

Exact string match is tempting, but food extraction needs more nuance. Some fields should be strict; others should allow controlled variants.

For example, language codes should usually be exact. The invalid flag should be exact because it controls downstream trust. Ingredient text should be checked for preservation, but your evaluator may need to tolerate known image ambiguity or accepted variants. The Open Food Facts examples include variant markers in expected strings, which is a practical way to represent ambiguity without giving the model unlimited freedom.
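The strict-versus-tolerant split can be sketched as a per-list scorer. This example does not reproduce the Open Food Facts variant-marker syntax; it assumes a hypothetical `accepted_texts` field that enumerates every acceptable string for a case, which keeps the tolerance explicit and finite.

```python
def score_list(expected: dict, actual: dict) -> dict:
    """Score one extracted list against one expected list, field by field."""
    return {
        # Strict: the language code must match exactly.
        "language": actual.get("language") == expected["language"],
        # Strict: the invalid flag controls downstream trust.
        "invalid": actual.get("invalid") is expected["invalid"],
        # Controlled tolerance: text must equal one of the accepted variants.
        "text": actual.get("ingredients") in expected["accepted_texts"],
    }


expected = {
    "language": "en",
    "invalid": False,
    # Hypothetical case where the image leaves one separator ambiguous.
    "accepted_texts": [
        "sugar, emulsifier: soy lecithin; salt",
        "sugar, emulsifier: soy lecithin, salt",
    ],
}
actual = {
    "language": "en",
    "invalid": False,
    "ingredients": "sugar, emulsifier: soy lecithin, salt",
}
result = score_list(expected, actual)
```

A paraphrase such as "sugar, soy lecithin (emulsifier), salt" would fail the text check here, which is the intended behavior when preservation matters more than meaning.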

A production API team can combine several measures:

  • Schema validity: did the model return parseable structured output?
  • Task completeness: did it return all ingredient lists present in the image?
  • Language accuracy: did it correctly label each list?
  • Text preservation: did it avoid paraphrasing, translating, or completing missing text?
  • Invalid/partial accuracy: did it correctly flag unreadable inputs?
  • Downstream parse impact: did any extraction error change normalized ingredients, allergens, additives, or nutrition assumptions?
  • Latency and cost: can the extraction run within the product’s SLA?

The last of these, latency and cost, is especially important for API buyers. A model that is 2% better on exact text match but 5x slower may be the wrong choice for interactive recipe capture. A model that is slightly worse on punctuation but much better at detecting invalid images may be the better choice for allergy-sensitive workflows.
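To compare models on more than one axis, per-case results can be rolled up into a small report. The sketch below assumes each result is a dict with hypothetical boolean keys (`schema_valid`, `language_ok`, `text_ok`, `invalid_ok`) and a `latency_ms` number; it covers only some of the measures listed above and is not a complete evaluator.

```python
from statistics import median


def summarize(results: list[dict]) -> dict:
    """Aggregate per-case results into rates plus a median latency."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")

    def rate(key: str) -> float:
        # Fraction of cases where the named check passed.
        return sum(1 for r in results if r[key]) / n

    return {
        "cases": n,
        "schema_valid_rate": rate("schema_valid"),
        "language_accuracy": rate("language_ok"),
        "text_preservation_rate": rate("text_ok"),
        "invalid_flag_accuracy": rate("invalid_ok"),
        "median_latency_ms": median(r["latency_ms"] for r in results),
    }


report = summarize([
    {"schema_valid": True, "language_ok": True, "text_ok": True,
     "invalid_ok": True, "latency_ms": 800},
    {"schema_valid": True, "language_ok": True, "text_ok": False,
     "invalid_ok": True, "latency_ms": 1200},
])
```

Reporting the rates side by side, instead of collapsing them into one score, makes the trade-off between text accuracy, invalid-image detection, and latency visible to whoever chooses the model.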

Build-versus-buy questions for technical buyers

When evaluating a recipe API, grocery API, or food AI provider, ask questions that reveal whether extraction is measured or merely demonstrated:

  • What is the target schema for AI-extracted ingredient data?
  • Are model name, version, prompt version, and extraction timestamp exposed?
  • Can the API represent multiple ingredient lists from one image?
  • Does it return invalid, partial, or equivalent uncertainty states?
  • Does it distinguish visible ingredient text from normalized ingredient entities?
  • How are allergens adjacent to ingredient lists handled?
  • What happens when the image is not a food package?
  • Are evaluation datasets multilingual and refreshed with real failures?
  • Can customers see confidence, warnings, or review status?
  • Are breaking model changes versioned like API changes?

These questions are not procurement theater. They determine whether a food-data API can safely support real product features. Without eval-backed contracts, AI extraction becomes an invisible dependency that can change when a model vendor updates weights, a prompt changes, or an OCR component improves one class of images while regressing another.
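The "improves one class while regressing another" risk can be checked mechanically by comparing per-category scores between two model or prompt versions. This is a sketch under stated assumptions: the category names, the scores, and the 0.01 tolerance are hypothetical, and the scores are assumed to be accuracies between 0 and 1.

```python
def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return categories where the candidate scores worse than the baseline.

    A category missing from the candidate counts as a regression, so a
    dropped test category cannot hide behind a better overall average.
    """
    return sorted(
        category
        for category, base_score in baseline.items()
        if candidate.get(category, 0.0) < base_score - tolerance
    )


baseline = {"clear": 0.95, "blurry": 0.80, "multilingual": 0.90}
candidate = {"clear": 0.99, "blurry": 0.70, "multilingual": 0.90}
flagged = regressions(baseline, candidate)
```

Here the candidate looks better on clear labels but is flagged for the blurry category, which is the kind of change that would merit a version bump or a release note under an eval-backed contract.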

The Recipe API implication

Recipe and food-data products increasingly need AI assistance, but the winning APIs will not be the ones that merely wrap a vision model. They will be the ones that turn AI output into durable structured data: versioned, reviewable, confidence-aware, and connected to stable ingredient and nutrition models.

The recent Open Food Facts evaluation work is a useful reminder that food AI quality is operational. You need cases, schemas, task-specific evaluators, and downstream risk checks. That discipline is just as relevant to generated recipes as it is to package photos. If an AI creates or extracts a recipe, the API still needs to know which ingredients are real, which quantities are inferred, which allergens are supported by evidence, and which fields are safe to expose as filters or claims.

For developers, the practical move is simple: treat AI extraction as a first-class data source, not a magic parser. Give it a schema. Give it provenance. Give it evals. Then let product features consume it according to their risk tolerance.
