lead-gen-pipeline is a Python tool that extracts business records from Chamber of Commerce directories. Each chamber site is laid out differently — by category, by letter, by page — so brittle CSS selectors don’t work. The pipeline asks a local 7B-parameter model (Qwen2) to read the HTML and return structured JSON instead.
The trade is straightforward: a single LLM call replaces a custom adapter per site. On Palo Alto Chamber of Commerce it extracts 296 businesses across 26 categories in 9 minutes, with 100% capture on name and phone, 90% on email, 85% on website. Nothing leaves the local machine.
Source: github.com/Burton-David/lead-gen-pipeline
Three reasons, in order of how much they actually matter:
Temperature 0.1 plus a strict prompt produces consistent JSON. The model is good at structured extraction; it doesn’t need to be creative.
Chamber URL → LLM Analysis → Navigation → Extraction → Validation → SQLite
Four components, one per file:
llm_processor.py: sends HTML to the model, parses JSON, repairs malformed output
chamber_parser.py: walks the directory; handles category, alphabetical, and paginated layouts
crawler.py: fetches pages with retry and backoff; respects robots.txt
bulk_database.py: batched inserts, dedup, CSV export
The core idea: instead of a CSS selector that breaks the moment the chamber redesigns, ask the model to find the data.
import json

def extract_business_data(self, html_content: str) -> dict:
    """Use LLM to extract structured business data from HTML."""
    prompt = """
    Analyze this HTML and extract business information.
    Return JSON with: name, website, phone, email, address, categories.
    If a field is missing, use null.
    """
    response = self.llm.generate(
        prompt + html_content,
        max_tokens=1000,
        temperature=0.1,  # Low temperature for consistent extraction
    )
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Model output was not valid JSON; fall back to the repair step
        return self._repair_json(response)
Temperature 0.1 is the load-bearing parameter — high enough that the model handles varied HTML, low enough that the JSON structure stays consistent.
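The `_repair_json` fallback is not shown here. A minimal sketch of what such a repair step might do, assuming the common failure is extra text or Markdown fences wrapped around an otherwise valid object. The name `repair_json` and the empty-dict fallback are illustrative assumptions, not the project's code:

```python
import json


def repair_json(response: str) -> dict:
    """Best-effort recovery of a JSON object from noisy model output.

    Hypothetical helper: tries each '{' in the text as a possible start
    and returns the first span that decodes to a JSON object.
    """
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores trailing text
            value, _end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = response.find("{", start + 1)
    # Assumption: callers treat an empty dict as "nothing extracted"
    return {}
```

This does not help with truncated output or single-quoted keys; those would need a retry or a stricter prompt.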
Chamber sites organize their directories one of three ways:
categorical: /directory → [Categories] → [Businesses]
alphabetical: /directory → [A-Z] → [Businesses by letter]
paginated: /directory?page=1 → /directory?page=2 → ...
Rather than hardcode rules for each, the LLM classifies the layout from the main page:
def detect_directory_structure(self, main_page_html: str) -> str:
    """Ask LLM to identify directory organization pattern."""
    prompt = """
    Analyze this Chamber of Commerce directory page.
    Identify the structure: 'categorical', 'alphabetical', or 'paginated'.
    Return only the structure type.
    """
    structure = self.llm.generate(prompt + main_page_html, max_tokens=20)
    return structure.strip().lower()
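A small model will not always reply with exactly one of the three labels; it may add quotes, a period, or a full sentence. A sketch of a guard that maps the raw reply onto a known label, where `normalize_structure` and the `'categorical'` default are assumptions made for illustration:

```python
VALID_STRUCTURES = ("categorical", "alphabetical", "paginated")


def normalize_structure(raw: str, default: str = "categorical") -> str:
    """Map a free-form model reply onto one of the known layout labels."""
    text = raw.strip().lower()
    # Labels mentioned anywhere in the reply, in VALID_STRUCTURES order
    matches = [label for label in VALID_STRUCTURES if label in text]
    if len(matches) == 1:
        return matches[0]
    # No label found, or the reply named more than one: use the default
    return default
```

Falling back to a default keeps the crawl moving; raising an error instead would be the stricter choice if a wrong guess is costly.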
The LLM returns plausible records; the pipeline has to make them trustworthy.
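Dedup on the combination of website and name:
from typing import List

def deduplicate_businesses(self, businesses: List[dict]) -> List[dict]:
    """Remove records whose (website, name) pair was already seen."""
    seen = set()
    unique = []
    for biz in businesses:
        # `or ''` guards against explicit nulls, which the prompt asks for
        key = (
            (biz.get('website') or '').lower().strip(),
            (biz.get('name') or '').lower().strip(),
        )
        # any(key) skips records that have neither a website nor a name
        if key not in seen and any(key):
            seen.add(key)
            unique.append(biz)
    return unique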
Validate that each record has a name and at least one contact method:
def validate_business(self, business: dict) -> bool:
    """Ensure minimum required fields are present."""
    required = ['name']
    has_contact = any([
        business.get('phone'),
        business.get('email'),
        business.get('website')
    ])
    return all(business.get(field) for field in required) and has_contact
Normalize phone numbers to a single format:
from typing import Optional

def normalize_phone(self, phone: Optional[str]) -> Optional[str]:
    """Standardize US phone number format; return None for empty input."""
    if not phone:
        return None
    digits = ''.join(c for c in phone if c.isdigit())
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits[0] == '1':
        return f"+1 ({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    return phone  # Return original if format unknown
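One way the three steps above could be chained, written as free functions so the snippet runs on its own. The order (normalize, then validate, then dedup), the name `clean_records`, and the sample records are assumptions, not necessarily how the pipeline wires them:

```python
from typing import List, Optional


def normalize_phone(phone: Optional[str]) -> Optional[str]:
    """Same rules as the method above, as a standalone function."""
    if not phone:
        return None
    digits = "".join(c for c in phone if c.isdigit())
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    if len(digits) == 11 and digits[0] == "1":
        return f"+1 ({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    return phone


def is_valid(biz: dict) -> bool:
    """A record needs a name plus at least one contact field."""
    has_contact = any(biz.get(k) for k in ("phone", "email", "website"))
    return bool(biz.get("name")) and has_contact


def clean_records(records: List[dict]) -> List[dict]:
    """Normalize phones, drop invalid records, then drop duplicates."""
    seen = set()
    cleaned = []
    for biz in records:
        # Copy the record so the caller's input is not mutated
        biz = {**biz, "phone": normalize_phone(biz.get("phone"))}
        if not is_valid(biz):
            continue
        key = (
            (biz.get("website") or "").lower().strip(),
            (biz.get("name") or "").lower().strip(),
        )
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(biz)
    return cleaned


# Made-up sample records for illustration only
raw = [
    {"name": "Acme Bakery", "phone": "650.555.0100", "website": "https://acme.example"},
    {"name": "acme bakery ", "phone": "(650) 555-0100", "website": "https://ACME.example"},
    {"name": "No Contact LLC", "phone": None, "email": None, "website": None},
]
cleaned = clean_records(raw)
```

Here the second record is dropped as a duplicate of the first, and the third is dropped because it has no contact field.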
Individual inserts to SQLite run at ~10–20 records/second. executemany inside one transaction runs at 500+. The whole crawl is small enough that this still doesn’t matter — but at any larger scale it’s the one change that pays back the most.
import json
from typing import List

def bulk_insert_businesses(self, businesses: List[dict]):
    """Insert multiple businesses efficiently."""
    # Build one parameter tuple per business; categories are stored as JSON text
    records = [
        (
            biz['name'],
            biz.get('website'),
            biz.get('phone'),
            biz.get('email'),
            biz.get('address'),
            json.dumps(biz.get('categories', []))
        )
        for biz in businesses
    ]
    self.cursor.executemany('''
        INSERT OR IGNORE INTO businesses
        (name, website, phone, email, address, categories)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', records)
    # A single commit covers the whole batch
    self.conn.commit()
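`INSERT OR IGNORE` only skips a row when inserting it would violate a constraint, so the dedup behaviour depends on the table schema, which is not shown here. A self-contained sketch against an in-memory database, assuming a `UNIQUE (name, website)` constraint; the schema and sample rows are guesses for illustration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE businesses (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        website TEXT,
        phone TEXT,
        email TEXT,
        address TEXT,
        categories TEXT,
        UNIQUE (name, website)  -- assumed; the real schema may differ
    )
""")

# Made-up rows; the first two are identical on (name, website)
records = [
    ("Acme Bakery", "https://acme.example", "(650) 555-0100", None, None, json.dumps(["Food"])),
    ("Acme Bakery", "https://acme.example", "(650) 555-0100", None, None, json.dumps(["Food"])),
    ("Bolt Hardware", "https://bolt.example", None, "hi@bolt.example", None, json.dumps([])),
]

# `with conn` runs the batch in one transaction and commits on success
with conn:
    conn.executemany(
        """
        INSERT OR IGNORE INTO businesses
            (name, website, phone, email, address, categories)
        VALUES (?, ?, ?, ?, ?, ?)
        """,
        records,
    )

row_count = conn.execute("SELECT COUNT(*) FROM businesses").fetchone()[0]
print(row_count)
```

One caveat with this particular constraint: SQLite treats NULLs as distinct in a UNIQUE index, so two rows with the same name and no website would both be inserted.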
git clone https://github.com/Burton-David/lead-gen-pipeline
cd lead-gen-pipeline
./setup.sh
python cli.py init
python cli.py chambers --url https://www.paloaltochamber.com
python cli.py export --output leads.csv
Palo Alto Chamber of Commerce, single run:
Per-page timing breakdown:
The model is not the bottleneck. Network is. That means parallelizing across chambers should scale close to linearly — the LLM has headroom while the next request is in flight.
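Since most of the time goes to waiting on the network, threads are a reasonable first attempt at crawling several chambers at once. A sketch using the standard library's `ThreadPoolExecutor`; `crawl_all` is a hypothetical helper, `fake_crawl` stands in for whatever function runs one chamber end to end, and the URLs are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def crawl_all(
    urls: List[str],
    crawl_chamber: Callable[[str], List[dict]],
    max_workers: int = 4,
) -> Dict[str, List[dict]]:
    """Run crawl_chamber for each URL in a thread pool; collect results by URL."""
    results: Dict[str, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(crawl_chamber, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing chamber should not abort the others
                print(f"{url} failed: {exc}")
                results[url] = []
    return results


def fake_crawl(url: str) -> List[dict]:
    """Stand-in for a real crawl: sleeps to mimic network latency."""
    time.sleep(0.1)
    return [{"name": f"Business from {url}"}]


urls = [f"https://chamber-{i}.example" for i in range(4)]
results = crawl_all(urls, fake_crawl)
```

How close this gets to linear scaling depends on how the local model server handles concurrent requests, which the sketch does not model.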
A few things worth doing next: