Data Provenance & Methodology

Source Data

All source data originates from the U.S. Alcohol and Tobacco Tax and Trade Bureau (TTB), a bureau of the U.S. Department of the Treasury.

Primary source: TTB COLA Public Registry — the federal database of approved alcohol product labels
Supplementary sources: TTB permit holder records, TTB production statistics
Legal status: All source data is in the public domain under 17 U.S.C. Section 105 (U.S. government works)

Customers are free to access the TTB’s public data directly. COLA Cloud’s value is in the collection, structuring, enrichment, and ongoing maintenance of this data at scale.

Coverage

~2.9M COLAs

Label approval records dating back to 2005

~5M Images

Front and back label images (typically 2 per COLA)

~575K Barcodes

Extracted from label images for product matching

~2,500/week

New records per week, collected from the TTB daily

Permit holder records are continuously updated from TTB filings.

Collection

COLA Cloud collects new and updated records from the TTB on a daily basis, including label images and all associated structured fields — brand name, product type, origin, alcohol content, approval dates, applicant information, and more. Raw source materials are retained in durable storage for auditability and reprocessing.

Enrichments

Each record is enriched through a proprietary pipeline that produces the following additional fields:

Text Extraction

Label images are processed with optical character recognition to extract all visible text from the label artwork. This captures information printed on the label that isn’t part of the TTB’s structured data — tasting notes, marketing copy, ABV declarations, volume statements, and more.

Barcode Identification

Label images are scanned for standard barcode formats. When found, the decoded value, barcode type, and position on the label are recorded.

Barcode coverage is approximately 30% of records. This reflects the fact that barcodes appear incidentally on label submissions, not that 70% of products lack barcodes.
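Decoded values for GTIN-family barcodes (UPC-A, EAN-13, and related lengths) end in a check digit, which gives data consumers a quick sanity check before using a value for product matching. The sketch below illustrates that standard check; it is not a description of COLA Cloud’s pipeline, and the function name is hypothetical.

```python
def gtin_check_digit_ok(code: str) -> bool:
    """Return True if `code` is an 8-, 12-, 13-, or 14-digit GTIN
    whose final digit matches the standard GS1 check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False

    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]

    # Moving right to left across the body, weights alternate 3, 1, 3, 1, ...
    total = sum(d * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

A value that fails this check is more likely a misread than a real product code, so it may be worth excluding from joins against other product catalogs.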

AI-Powered Feature Extraction

Each label is processed through proprietary models to extract structured fields that would otherwise require manual review:

Category	Extracted Fields
Classification	Hierarchical product category (e.g., Spirits > Whiskey > Bourbon), container type
Description	Free-text product description, tasting notes, flavor profile
Wine	Appellation, vintage, varietal, designation (Reserve, Estate, etc.)
Beer	IBU, hop varieties
Spirits	Age statement, finishing process, grain bill
Other	Brand established year, artwork credits, certifications (organic, kosher, etc.)
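For consumers of the classification field, the following is a minimal Python sketch of splitting a hierarchical category into its levels. It assumes the category arrives as a single string delimited by ">", as in the example above; the delivered schema may represent the hierarchy differently, and both function names are illustrative.

```python
def split_category(path: str, sep: str = ">") -> list[str]:
    """Split a delimited category path into its levels, trimming whitespace."""
    return [part.strip() for part in path.split(sep) if part.strip()]


def category_ancestors(path: str) -> list[str]:
    """Return every prefix of the path, from broadest to most specific."""
    levels = split_category(path)
    return [" > ".join(levels[:i]) for i in range(1, len(levels) + 1)]


levels = split_category("Spirits > Whiskey > Bourbon")
prefixes = category_ancestors("Spirits > Whiskey > Bourbon")
```

Grouping on a broader prefix is one way to soften the effect of sub-category misclassification when only top-level rollups are needed.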

Address Normalization

Applicant addresses from TTB records are parsed and normalized into structured components for consistent matching and geocoding.
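To make "structured components" concrete, here is a deliberately simple sketch that parses one common U.S. address layout. It is an illustration only: the regular expression, the component names, and the sample address are all assumptions, and production address normalization handles far more variation than this.

```python
import re
from typing import Optional

# Matches "street, city, ST 12345" with an optional ZIP+4 suffix.
_ADDRESS_RE = re.compile(
    r"^\s*(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?\s*$"
)


def parse_us_address(raw: str) -> Optional[dict]:
    """Split a simple one-line U.S. address into normalized components.

    Returns None when the input does not fit the assumed layout.
    """
    m = _ADDRESS_RE.match(raw)
    if m is None:
        return None
    return {
        # Collapse repeated whitespace and uppercase for consistent matching.
        "street": " ".join(m["street"].split()).upper(),
        "city": " ".join(m["city"].split()).upper(),
        "state": m["state"].upper(),
        "zip": m["zip"],
    }
```

Returning None for unrecognized layouts, rather than guessing, keeps unparsed addresses visible for separate handling.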

Data Freshness

Commitment	Detail
New records	Available within days of TTB publication
Enrichments	Applied to new records within one week
Data license delivery	Weekly incremental updates
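One way to consume weekly incremental updates is to treat each delivery as a set of upserts against a local snapshot. The sketch below assumes each record is a dictionary carrying a unique identifier; the key name `ttb_id` is hypothetical, so check the API reference for the actual field.

```python
def apply_increment(snapshot: dict, increment: list, key: str = "ttb_id") -> dict:
    """Return a new snapshot with each incremental record inserted or replaced.

    `snapshot` maps record IDs to records; `increment` is a list of records.
    """
    merged = dict(snapshot)  # shallow copy so the previous snapshot is untouched
    for record in increment:
        merged[record[key]] = record
    return merged


last_week = {
    "A-001": {"ttb_id": "A-001", "brand_name": "Example Cellars"},
}
this_week = [
    {"ttb_id": "A-001", "brand_name": "Example Cellars", "vintage": "2021"},
    {"ttb_id": "A-002", "brand_name": "Sample Brewing"},
]
current = apply_increment(last_week, this_week)
```

Replacing the whole record, instead of merging field by field, means a later delivery fully supersedes an earlier one for the same identifier.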

Known Limitations

We believe in transparency about what automated enrichment can and cannot do:

Source data quality — The TTB registry occasionally contains errors, omissions, or delays. COLA Cloud reflects what the TTB publishes.
Text extraction accuracy — Extraction from label images is automated and imperfect. Stylized fonts, low-resolution images, and overlapping text can reduce accuracy. Field-level accuracy is not guaranteed.
AI classification — Product categorization and feature extraction are probabilistic. Sub-category precision varies, and products may be misclassified when label information is ambiguous or incomplete.
Barcode coverage — ~30% of records have extractable barcodes. This is a function of what appears on the label submission, not a gap in our processing.
Methodology evolution — COLA Cloud continuously improves its enrichment methods. This may cause field values to change between updates as accuracy improves.
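Because enriched values can change between updates, downstream users may want to detect which fields moved for a given record. The following sketch compares two versions of one record; the field names in the example are hypothetical and the function is an illustration, not part of any COLA Cloud tooling.

```python
def changed_fields(old: dict, new: dict, ignore: tuple = ()) -> dict:
    """Map each differing field name to its (old, new) pair of values.

    Fields missing from one side are reported with None on that side.
    """
    keys = (old.keys() | new.keys()) - set(ignore)
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(keys)
        if old.get(k) != new.get(k)
    }


before = {"category": "Spirits > Whiskey", "abv": "45", "updated_at": "week 1"}
after = {"category": "Spirits > Whiskey > Bourbon", "abv": "45", "updated_at": "week 2"}
diff = changed_fields(before, after, ignore=("updated_at",))
```

Logging these differences per delivery makes it easier to tell a methodology improvement apart from a change in the underlying TTB record.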

Data Delivery

Licensed data is available through multiple channels:

Snowflake Data Share — Direct access in your Snowflake account, updated weekly (learn more)
S3 Export — Parquet or CSV files delivered to your S3 bucket
REST API — Programmatic access with structured queries and filtering (quickstart)
MCP Server — AI-native access for LLM-powered applications (learn more)
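As an example of working with a CSV export, the sketch below indexes rows by identifier using only the Python standard library. The column names (`ttb_id`, `brand_name`, `barcode_value`) and the sample rows are assumptions for illustration; consult the API reference for the real schema.

```python
import csv
import io


def index_by_key(csv_file, key: str = "ttb_id") -> dict:
    """Read a CSV file object and return its rows keyed by the `key` column."""
    reader = csv.DictReader(csv_file)
    return {row[key]: row for row in reader}


# In-memory stand-in for an exported file; the rows are made up.
sample = io.StringIO(
    "ttb_id,brand_name,barcode_value\n"
    "A-001,Example Cellars,036000291452\n"
    "A-002,Sample Brewing,\n"
)
records = index_by_key(sample)

# Rows without an extracted barcode carry an empty string in this layout.
with_barcode = [r for r in records.values() if r["barcode_value"]]
```

For a real file, pass `open(path, newline="")` in place of the in-memory sample, as the `csv` module documentation recommends.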

Questions?

Contact help@colacloud.us or explore the API reference for full schema details.
