A local-LLM scraper for Chamber of Commerce directories

August 14, 2024 Article

lead-gen-pipeline is a Python tool that extracts business records from Chamber of Commerce directories. Each chamber site is laid out differently — by category, by letter, by page — so brittle CSS selectors don’t work. The pipeline asks a local 7B-parameter model (Qwen2) to read the HTML and return structured JSON instead.

The trade is straightforward: a single LLM call replaces a custom adapter per site. On the Palo Alto Chamber of Commerce directory it extracted 296 businesses across 26 categories in 9 minutes, with 100% capture on name and phone, 90% on email, and 85% on website. Nothing leaves the local machine.

Source: github.com/Burton-David/lead-gen-pipeline

Why a local model

Three reasons, in order of how much they actually matter:

No per-call cost. Scraping a directory means hundreds to thousands of LLM calls. At API prices that adds up; at $0 it doesn’t.
No data leaves the machine. B2B contact data has its own sensitivity profile.
Deterministic enough. Temperature 0.1 plus a strict prompt produces consistent JSON. The model is good at structured extraction; it doesn’t need to be creative.

Pipeline shape

Chamber URL → LLM Analysis → Navigation → Extraction → Validation → SQLite

Four components, one per file:

llm_processor.py — sends HTML to the model, parses JSON, repairs malformed output
chamber_parser.py — walks the directory; handles category, alphabetical, paginated layouts
crawler.py — fetches pages with retry + backoff; respects robots.txt
bulk_database.py — batched inserts, dedup, CSV export
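The crawler's two duties, retrying with backoff and honoring robots.txt, can be sketched with the standard library alone. This is an illustration, not the repository's crawler.py: the function names, the user-agent string, the retry count, and the delays are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from urllib import robotparser

USER_AGENT = "lead-gen-sketch"  # hypothetical user-agent string


def allowed_by_robots(url, robots_txt, agent=USER_AGENT):
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def fetch_with_retry(url, fetch=None, retries=3, base_delay=1.0, sleep=time.sleep):
    """Fetch a URL, doubling the wait after each failed attempt."""
    if fetch is None:
        def fetch(u):
            req = urllib.request.Request(u, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")

    for attempt in range(retries):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries - 1:
                raise  # out of attempts; let the caller decide
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

Passing `fetch` and `sleep` as parameters keeps the retry logic testable without touching the network.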

Letting the LLM read the page

The core idea: instead of a CSS selector that breaks the moment the chamber redesigns, ask the model to find the data.

import json

def extract_business_data(self, html_content: str) -> dict:
    """Use LLM to extract structured business data from HTML."""

    prompt = """
    Analyze this HTML and extract business information.
    Return JSON with: name, website, phone, email, address, categories.
    If a field is missing, use null.
    """

    # The page HTML is appended directly after the instructions.
    response = self.llm.generate(
        prompt + html_content,
        max_tokens=1000,
        temperature=0.1  # Low temperature for consistent extraction
    )

    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Fall back to a repair pass when the model emits malformed JSON.
        return self._repair_json(response)

Temperature 0.1 is the load-bearing parameter: low enough that the JSON structure stays consistent from run to run, while still leaving the model a little room on messy or unusual HTML.
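The `_repair_json` fallback isn't shown here. One plausible shape for it, offered as a sketch rather than the repository's implementation, handles the three failures small models produce most often: code fences around the answer, chatter before or after the object, and trailing commas. The function name `repair_json` and the choice to return None on failure are assumptions.

```python
import json
import re


def repair_json(text):
    """Best-effort recovery of one JSON object from noisy model output."""
    # Drop Markdown code fences the model may wrap around its answer.
    text = re.sub(r"`{3}(?:json)?", "", text)

    # Keep only the span from the first '{' to the last '}'.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]

    # Remove trailing commas before a closing brace or bracket.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # give up rather than guess
```

The trailing-comma regex is deliberately crude: it would also rewrite a comma followed by a brace inside a string value. For business listings that is an unlikely case, but it is a known limit of the sketch.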

Three directory layouts, one detector

Chamber sites organize their directories one of three ways:

categorical:   /directory → [Categories] → [Businesses]
alphabetical:  /directory → [A-Z]        → [Businesses by letter]
paginated:     /directory?page=1 → /directory?page=2 → ...

Rather than hardcode rules for each, the LLM classifies the layout from the main page:

def detect_directory_structure(self, main_page_html: str) -> str:
    """Ask LLM to identify directory organization pattern."""

    prompt = """
    Analyze this Chamber of Commerce directory page.
    Identify the structure: 'categorical', 'alphabetical', or 'paginated'.
    Return only the structure type.
    """

    structure = self.llm.generate(prompt + main_page_html, max_tokens=20)
    return structure.strip().lower()
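The detector above returns whatever the model says, so a reply like "The structure is 'Categorical'." would not match any of the three labels downstream. A small guard can map free-form output onto the known set. The helper name `parse_structure_label` and the None fallback are assumptions for illustration.

```python
from typing import Optional

VALID_STRUCTURES = ("categorical", "alphabetical", "paginated")


def parse_structure_label(raw: str) -> Optional[str]:
    """Map free-form model output to one known label, or None if unclear."""
    text = raw.strip().lower()
    matches = [label for label in VALID_STRUCTURES if label in text]
    # Accept only an unambiguous answer; zero or several matches mean "unknown".
    return matches[0] if len(matches) == 1 else None
```

Returning None instead of guessing lets the caller retry the prompt or fall back to a default crawl strategy.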

Cleaning the output

The LLM returns plausible records; the pipeline has to make them trustworthy.

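Dedup on the (website, name) pair:

from typing import List

def deduplicate_businesses(self, businesses: List[dict]) -> List[dict]:
    """Remove records whose (website, name) pair was already seen."""
    seen = set()
    unique = []

    for biz in businesses:
        # `or ''` covers fields the LLM returned as null (None);
        # .get('website', '') alone would raise AttributeError on None.
        key = (
            (biz.get('website') or '').lower().strip(),
            (biz.get('name') or '').lower().strip()
        )

        # Skip records with neither field, and pairs already seen.
        if any(key) and key not in seen:
            seen.add(key)
            unique.append(biz)

    return unique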
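The key above compares website strings as written, so https://www.acme.com/ and acme.com count as different businesses. Normalizing the URL before building the key closes that gap. This is a sketch under assumptions: `website_key` is a hypothetical helper, and dropping the scheme, a leading "www.", and trailing slashes is one reasonable policy, not the only one.

```python
from urllib.parse import urlparse


def website_key(url):
    """Reduce a website URL to a comparable host + path string."""
    if not url:
        return ""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    parsed = urlparse(url)
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")
```

With this in place, the first element of the dedup key would be `website_key(biz.get('website'))` instead of the raw lowercased string.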

Validate that the record has at least one way to contact:

def validate_business(self, business: dict) -> bool:
    """Ensure minimum required fields are present."""
    required = ['name']
    has_contact = any([
        business.get('phone'),
        business.get('email'),
        business.get('website')
    ])

    return all(business.get(field) for field in required) and has_contact

Normalize US phone numbers to a single format:

from typing import Optional

def normalize_phone(self, phone: Optional[str]) -> Optional[str]:
    """Standardize 10- and 11-digit US numbers as (XXX) XXX-XXXX."""
    if not phone:
        return None

    digits = ''.join(c for c in phone if c.isdigit())

    # Drop a leading US country code so both forms share one format.
    if len(digits) == 11 and digits[0] == '1':
        digits = digits[1:]

    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

    return phone  # Return original if format unknown

Bulk insert, single commit

Individual inserts to SQLite run at ~10–20 records/second. executemany inside one transaction runs at 500+. The whole crawl is small enough that this still doesn’t matter — but at any larger scale it’s the one change that pays back the most.

def bulk_insert_businesses(self, businesses: List[dict]):
    """Insert multiple businesses efficiently."""

    records = [
        (
            biz['name'],
            biz.get('website'),
            biz.get('phone'),
            biz.get('email'),
            biz.get('address'),
            json.dumps(biz.get('categories', []))
        )
        for biz in businesses
    ]

    self.cursor.executemany('''
        INSERT OR IGNORE INTO businesses
        (name, website, phone, email, address, categories)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', records)

    self.conn.commit()
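INSERT OR IGNORE only skips a row when it would violate a constraint, so the table needs a UNIQUE or PRIMARY KEY definition for the statement to deduplicate anything. The schema isn't shown in this article; the one below is an assumption, including the choice of (name, website) as the unique pair, run against an in-memory database.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute("""
    CREATE TABLE businesses (
        name TEXT NOT NULL,
        website TEXT,
        phone TEXT,
        email TEXT,
        address TEXT,
        categories TEXT,
        UNIQUE (name, website)  -- assumed constraint; the real schema may differ
    )
""")

rows = [
    ("Acme Plumbing", "acme.example", "(650) 555-0100", None, None, json.dumps(["Services"])),
    ("Acme Plumbing", "acme.example", "(650) 555-0100", None, None, json.dumps(["Services"])),
    ("Bay Bakery", "baybakery.example", None, "hi@baybakery.example", None, json.dumps([])),
]

with conn:  # one transaction: commits on success, rolls back on error
    conn.executemany(
        "INSERT OR IGNORE INTO businesses VALUES (?, ?, ?, ?, ?, ?)", rows
    )

stored = conn.execute("SELECT COUNT(*) FROM businesses").fetchone()[0]
print(stored)  # the duplicate Acme row is ignored, leaving 2 rows
```

One caveat with this constraint: SQLite treats NULLs as distinct in UNIQUE columns, so two rows with the same name and no website would both be kept. Storing an empty string instead of NULL, or deduplicating in Python first, avoids that.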

Running it

git clone https://github.com/Burton-David/lead-gen-pipeline
cd lead-gen-pipeline
./setup.sh

python cli.py init
python cli.py chambers --url https://www.paloaltochamber.com
python cli.py export --output leads.csv

Results on one chamber

Palo Alto Chamber of Commerce, single run:

296 businesses across 26 categories
100% had name and phone
90% had email (266), 85% had website (252)
Top categories: Professional Services (42), Technology (38), Restaurants & Food (34), Retail (29), Healthcare (23)

Per-page timing breakdown:

Average total: 2.1s
Network: 0.9s (43%)
LLM inference: 0.8s (38%)
Validation + write: 0.4s (19%)

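The model is not the bottleneck. Network is, though only narrowly: 0.9s against 0.8s. Running several chambers concurrently should help because the model sits idle while each request is in flight, but the gain has a ceiling. Once fetches overlap, a single model instance at 0.8s per page becomes the limit.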
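The timing breakdown also puts a number on how far concurrency can go. A rough bound, assuming a single model instance that processes one page at a time while fetches and writes overlap freely (the article does not say how inference is scheduled, so treat this as an assumption):

```python
def speedup_ceiling(network: float, inference: float, other: float) -> float:
    """Upper bound on speedup when only inference must run one page at a time."""
    serial = network + inference + other  # stages run back to back today
    return serial / inference             # best case: inference never waits


# Per-page timings from the breakdown above, in seconds.
ceiling = speedup_ceiling(network=0.9, inference=0.8, other=0.4)
print(f"{ceiling:.1f}x")
```

With these figures the ceiling is about 2.6x on one model instance. Going past that would take a second instance, batched inference, or a smaller model.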

What I’d change

A few things worth doing next:

Run multiple chambers concurrently (the timing breakdown above is the argument).
Track per-record hashes so re-scrapes only update what changed.
Test 3B-parameter models — the 7B may be larger than this job needs.
Entity-resolve across chambers so the same business in two cities collapses.
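The per-record hash idea is simple to prototype. A minimal sketch, assuming the six fields from the extraction prompt define a record's content; `record_hash` and `HASHED_FIELDS` are hypothetical names, not part of the repository.

```python
import hashlib
import json

HASHED_FIELDS = ("name", "website", "phone", "email", "address", "categories")


def record_hash(business: dict) -> str:
    """Stable SHA-256 digest of the fields that define a record's content."""
    canonical = {field: business.get(field) for field in HASHED_FIELDS}
    # sort_keys and fixed separators make the serialization deterministic.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

On a re-scrape, a record whose hash matches the stored one can be skipped; only mismatches need an update. Hashing after normalization (phone format, website key) keeps cosmetic differences from registering as changes.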

Repo: github.com/Burton-David/lead-gen-pipeline.