Code
library(tidyverse)
library(magrittr)
library(ggplot2)
library(scales)
library(reticulate)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(scales)
library(reticulate)
import os
import polars as pl
import re
import ast
from tqdm import tqdm
import math
import itertools
import openai
import pickle
from dotenv import load_dotenv
from openai import OpenAI
from joblib import Memory
#
= load_dotenv()
tmp
# save the results from the LLM model to a cache
= Memory(location="cache_summary", verbose=0)
memory_summary = Memory(location="cache_pricing", verbose=0) memory_pricing
This notebook presents a two-stage pipeline for eliciting plausible online retail price ranges for consumer products using large language models (LLMs). It supports pricing analysts in making initial pricing decisions when historical data or comparable products are unavailable or unreliable.
When pricing a new product for the first time, or re-evaluating an existing product’s pricing, pricing analysts are often flying blind. In many cases, there is no information about competitor pricing, limited or no sales history, and little variability in past price points. As a result, prices can’t be based on competitive benchmarks or estimated elasticities. Yet a price still needs to be set.
In these situations, the goal is not precision: it is to make a well-reasoned initial decision. The price should be high enough to protect margin, low enough to be competitive, and aligned with how a reasonably informed customer would perceive the product: what it does, what it’s made of, who it’s for, and how the brand signals quality or positioning.
To support this task, the notebook demonstrates a two-stage pipeline using large language models (LLMs) for eliciting reasonable price ranges. The method is applied to consumer product data from Mercado Libre sourced from here:
Product Attribute Extraction GPT-3.5-turbo is used to extract a structured set of pricing-relevant attributes — such as product type, components, materials, intended use, and product tier — from unstructured titles and descriptions. The output is a standardized schema of factual product properties.
Price Range Elicitation GPT-4o is then used on attributes to elicit a plausible retail price range — minimum, typical, and maximum — conditioned on regional market context (e.g., country, currency, and year).
The two-step approach has several advantages over one-shot elicitation:
Explicit control over inputs: By decoupling attribute extraction from price judgment, we can precisely define which aspects of the product are considered — such as function, materials, target user, or technical features — rather than letting the model base its pricing on irrelevant or misleading text. This reduces hallucinations and improves consistency.
Cost efficiency: Structured product summaries can be generated using a much smaller model and inexpensive model than the one used for eliciting price points, such as a local mt5-small or GPT-3.5-turbo, with only the pricing step, which benefits from broader and more recent market knowledge, being handled by GPT-4o. This separation helps control token usage and reduces overall API costs.
Transparency and auditability: Having an explicit intermediate representation allows analysts to inspect what the model “saw” before making a pricing judgment. This makes the process easier to debug, validate, or even override with human input if needed.
Both stages are implemented with batched processing, caching (joblib.Memory), and fallback logic to ensure reliability and scalability. The resulting price ranges can be used to support pricing diagnostics, detect potential over- or underpricing, and provide a structured starting point for analyst review.
The dataset (sourced from here) contains \(5,859\) product listings from Mercado Libre, a major Latin American e-commerce platform. The data was likely obtained via web scraping. The presence of both original and discounted prices suggests that pricing metadata was parsed directly from the listing structure, possibly through structured HTML extraction.
In the present notebook, we use the following columns:
product_title
: a free-text product title from the listing (in Spanish)product_description
: a potentially marketing description (in Spanish)price_usd
: the listed retail price, converted to USDprice_discounted_usd
: the discounted price (if applicable), also converted to USDproduct_url
: a direct link to the product listing (mainly used for debugging)product_id
: a synthetic ID assigned during preprocessing (mainly used for debugging)= pl.read_csv("./data/mercado_libre_products_cleaned.csv")
products = (products
products 'Product': 'product_title', 'Description': 'product_description',
.rename({'Product URL': 'product_url',
'MXN': 'price_mxn', 'USD': 'price_usd',
'Sale Price USD': 'price_discounted_usd'})
0, products.height).alias('product_id') )
.with_columns( pl.arange('product_id', 'product_title', 'product_description', 'price_usd', 'price_discounted_usd', 'product_url'])
.select([ )
$products$to_pandas() py