Source Data

All source data originates from the U.S. Alcohol and Tobacco Tax and Trade Bureau (TTB), a bureau of the U.S. Department of the Treasury.
  • Primary source: TTB COLA Public Registry — the federal database of approved alcohol product labels
  • Supplementary sources: TTB permit holder records, TTB production statistics
  • Legal status: All source data is in the public domain under 17 U.S.C. Section 105 (U.S. government works)
Customers are free to access the TTB’s public data directly. COLA Cloud’s value is in the collection, structuring, enrichment, and ongoing maintenance of this data at scale.

Coverage

  • ~2.9M COLAs: Label approval records dating back to 2005
  • ~5M Images: Front and back label images (typically 2 per COLA)
  • ~575K Barcodes: Extracted from label images for product matching
  • ~2,500/week: New records added from TTB on a daily basis
Permit holder records are continuously updated from TTB filings.

Collection

COLA Cloud collects new and updated records from the TTB on a daily basis, including label images and all associated structured fields — brand name, product type, origin, alcohol content, approval dates, applicant information, and more. Raw source materials are retained in durable storage for auditability and reprocessing.

Enrichments

Each record is enriched through a proprietary pipeline that produces the following additional fields:

Text Extraction

Label images are processed with optical character recognition to extract all visible text from the label artwork. This captures information printed on the label that isn’t part of the TTB’s structured data — tasting notes, marketing copy, ABV declarations, volume statements, and more.
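
As an illustration of how extracted label text can be used downstream, the following is a minimal sketch that pulls an ABV declaration out of raw OCR output with a regular expression. The function name `extract_abv` and the pattern are assumptions made for this example; they are not part of COLA Cloud's pipeline or API, and real labels use more phrasings than this pattern covers.

```python
import re
from typing import Optional

# Illustrative pattern only: matches "13.5% BY VOLUME", "40% ALC./VOL.", "5.2% ABV".
# It deliberately ignores percentages that are not followed by an alcohol keyword
# (for example "100% AGAVE").
ABV_PATTERN = re.compile(
    r"(\d{1,3}(?:\.\d+)?)\s*%\s*(?:ALC|ABV|BY\s+VOL)",
    re.IGNORECASE,
)


def extract_abv(ocr_text: str) -> Optional[float]:
    """Return the first ABV percentage found in the OCR text, or None."""
    match = ABV_PATTERN.search(ocr_text)
    if match is None:
        return None
    return float(match.group(1))


if __name__ == "__main__":
    sample = "NAPA VALLEY CABERNET\nALCOHOL 13.5% BY VOLUME 750 ML"
    print(extract_abv(sample))
```

Because OCR output is imperfect, a value obtained this way is best treated as a candidate to cross-check against the structured alcohol content field rather than as a replacement for it.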

Barcode Identification

Label images are scanned for standard barcode formats. When found, the decoded value, barcode type, and position on the label are recorded.
Barcode coverage is approximately 30% of records. This reflects the fact that barcodes appear incidentally on label submissions, not that 70% of products lack barcodes.
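
Consumers who use decoded barcode values for product matching may want to sanity-check them first. The sketch below applies the standard GS1 check-digit rule used by UPC-A, EAN-13, EAN-8, and GTIN-14 codes. It is an example of downstream validation, not a description of how COLA Cloud decodes or verifies barcodes, and the function names are hypothetical.

```python
def gtin_check_digit(body: str) -> int:
    """Compute the GS1 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for position, char in enumerate(reversed(body)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def is_valid_gtin(value: str) -> bool:
    """Return True for an 8-, 12-, 13- or 14-digit code with a valid check digit.

    Compressed UPC-E codes are not handled here; they must be expanded
    to UPC-A before the check digit can be verified.
    """
    if not (value.isascii() and value.isdigit()):
        return False
    if len(value) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(value[:-1]) == int(value[-1])


if __name__ == "__main__":
    print(is_valid_gtin("036000291452"))   # UPC-A example
    print(is_valid_gtin("4006381333931"))  # EAN-13 example
```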

AI-Powered Feature Extraction

Each label is processed through proprietary models to extract structured fields that would otherwise require manual review:
Category | Extracted Fields
Classification | Hierarchical product category (e.g., Spirits > Whiskey > Bourbon), container type
Description | Free-text product description, tasting notes, flavor profile
Wine | Appellation, vintage, varietal, designation (Reserve, Estate, etc.)
Beer | IBU, hop varieties
Spirits | Age statement, finishing process, grain bill
Other | Brand established year, artwork credits, certifications (organic, kosher, etc.)
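
A hierarchical category is convenient for filtering at any level of the tree. The sketch below assumes the category arrives as a single string delimited by ">" (as in the example "Spirits > Whiskey > Bourbon"); the delivered schema may represent the hierarchy differently, and both helper names are hypothetical.

```python
def split_category(path: str) -> list[str]:
    """Split a '>'-delimited category path into its levels."""
    return [level.strip() for level in path.split(">") if level.strip()]


def is_within(path: str, ancestor: str) -> bool:
    """Return True if `path` equals `ancestor` or sits anywhere beneath it."""
    levels = split_category(path)
    prefix = split_category(ancestor)
    return levels[: len(prefix)] == prefix


if __name__ == "__main__":
    category = "Spirits > Whiskey > Bourbon"
    print(split_category(category))
    print(is_within(category, "Spirits > Whiskey"))
```

Comparing level by level avoids the false positives that a plain substring test would produce when one category name contains another.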

Address Normalization

Applicant addresses from TTB records are parsed and normalized into structured components for consistent matching and geocoding.
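
To make "structured components" concrete, here is a deliberately simple sketch that parses a single-line U.S. address of the form "street, city, ST ZIP". The component names (`street`, `city`, `state`, `zip`) and the regular expression are assumptions for illustration; COLA Cloud's normalization is proprietary and handles far more variation than this does.

```python
import re
from typing import Optional

# Assumed input shape: "<street>, <city>, <2-letter state> <5-digit ZIP>[-<4 digits>]".
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>.+?),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?$"
)


def normalize_address(raw: str) -> Optional[dict]:
    """Return uppercase address components, or None if the input does not parse."""
    cleaned = " ".join(raw.split())  # collapse runs of whitespace
    match = ADDRESS_PATTERN.match(cleaned)
    if match is None:
        return None
    return {
        "street": match["street"].upper(),
        "city": match["city"].upper(),
        "state": match["state"].upper(),
        "zip": match["zip"],
    }


if __name__ == "__main__":
    print(normalize_address("123  main st,  napa, ca 94558"))
```

Normalizing case and whitespace before comparison is what lets two differently typed versions of the same applicant address match each other.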

Data Freshness

Commitment | Detail
New records | Available within days of TTB publication
Enrichments | Applied to new records within one week
Data license delivery | Weekly incremental updates
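
For consumers who maintain a local copy, a weekly incremental update is typically applied as an upsert: new records are inserted and existing ones are overwritten. The sketch below shows one way to do that in memory. The key name `ttb_id` and the row shape are hypothetical, so consult the delivered schema for the actual identifier.

```python
def apply_incremental(snapshot: dict, update: list) -> dict:
    """Upsert `update` rows into `snapshot` in place.

    Returns the IDs that were newly added and the IDs whose values changed,
    so callers can audit what moved between deliveries.
    """
    added, changed = [], []
    for row in update:
        key = row["ttb_id"]  # hypothetical primary key name
        if key not in snapshot:
            added.append(key)
        elif snapshot[key] != row:
            changed.append(key)
        snapshot[key] = row
    return {"added": added, "changed": changed}


if __name__ == "__main__":
    local = {"A1": {"ttb_id": "A1", "category": "Spirits > Whiskey"}}
    weekly = [
        {"ttb_id": "A1", "category": "Spirits > Whiskey > Bourbon"},
        {"ttb_id": "B2", "category": "Wine"},
    ]
    print(apply_incremental(local, weekly))
```

Tracking changed IDs separately from added ones is useful because enriched field values can be revised for records that already exist.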

Known Limitations

We believe in transparency about what automated enrichment can and cannot do:
  1. Source data quality — The TTB registry occasionally contains errors, omissions, or delays. COLA Cloud reflects what the TTB publishes.
  2. Text extraction accuracy — Extraction from label images is automated and imperfect. Stylized fonts, low-resolution images, and overlapping text can reduce accuracy. Field-level accuracy is not guaranteed.
  3. AI classification — Product categorization and feature extraction are probabilistic. Sub-category precision varies, and products may be misclassified when label information is ambiguous or incomplete.
  4. Barcode coverage — ~30% of records have extractable barcodes. This is a function of what appears on the label submission, not a gap in our processing.
  5. Methodology evolution — COLA Cloud continuously improves its enrichment methods. This may cause field values to change between updates as accuracy improves.

Data Delivery

Licensed data is available through multiple channels:
  • Snowflake Data Share — Direct access in your Snowflake account, updated weekly
  • S3 Export — Parquet or CSV files delivered to your S3 bucket
  • REST API — Programmatic access with structured queries and filtering
  • MCP Server — AI-native access for LLM-powered applications

Questions?

Contact help@colacloud.us or explore the API reference for full schema details.