Finding qualified B2B leads is time-consuming and expensive. Chamber of Commerce directories contain thousands of local businesses, but manually extracting this data is impractical. Traditional web scrapers fail because each chamber website has a unique structure.
I built lead-gen-pipeline, an AI-powered data extraction system that solves this problem by using a local LLM (Qwen2-7B) to intelligently navigate and scrape Chamber of Commerce directories.
GitHub Repository: lead-gen-pipeline
Chamber of Commerce directories present unique scraping challenges: every chamber builds its site differently, so traditional scrapers require custom code for each website. I needed a solution that could adapt to any directory structure automatically.
The pipeline uses a local 7B parameter language model (Qwen2) to analyze page structure and extract business data. Running locally means no per-request API costs and no scraped data leaving your machine.
Testing on the Palo Alto Chamber of Commerce, the pipeline extracted 296 businesses in about 9 minutes.
The system follows a classic ETL (Extract, Transform, Load) pattern with AI-powered extraction:
Chamber URL → LLM Analysis → Navigation → Extraction → Validation → SQLite Database
1. LLM Processor (llm_processor.py)
2. Chamber Parser (chamber_parser.py)
3. Web Crawler (crawler.py)
4. Database Layer (bulk_database.py)
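The four components above compose into a single run. Here is a minimal sketch of the orchestration, with stand-in callables for each stage (the real class interfaces in the repo may differ):

```python
from typing import Callable, List


def run_pipeline(
    chamber_url: str,
    fetch: Callable[[str], str],          # crawler.py: download a page
    analyze: Callable[[str], List[str]],  # llm_processor.py: find listing URLs
    extract: Callable[[str], dict],       # chamber_parser.py: page -> business dict
    store: Callable[[List[dict]], int],   # bulk_database.py: bulk insert
) -> int:
    """Extract -> Transform -> Load for one chamber directory."""
    main_html = fetch(chamber_url)
    listing_urls = analyze(main_html)
    businesses = [extract(fetch(url)) for url in listing_urls]
    return store(businesses)


# Toy run with stub stages, just to show the data flow
count = run_pipeline(
    "https://example-chamber.test",
    fetch=lambda url: f"<html>{url}</html>",
    analyze=lambda html: ["/biz/1", "/biz/2"],
    extract=lambda html: {"name": html},
    store=lambda rows: len(rows),
)
print(count)  # 2
```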
Traditional scrapers use CSS selectors that break when websites change. My approach asks the LLM to understand the page:
```python
def extract_business_data(self, html_content: str) -> dict:
    """Use LLM to extract structured business data from HTML."""
    prompt = """
    Analyze this HTML and extract business information.
    Return JSON with: name, website, phone, email, address, categories.
    If a field is missing, use null.
    """
    response = self.llm.generate(
        prompt + html_content,
        max_tokens=1000,
        temperature=0.1  # Low temperature for consistent extraction
    )
    # Parse JSON with fallback repair
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return self._repair_json(response)
```
Key insight: a temperature of 0.1 keeps the JSON output consistent enough to parse reliably, while still letting the LLM adapt to different HTML formats.
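The `_repair_json` fallback isn't shown in the post; a minimal version (my sketch, not necessarily the repo's implementation) strips the markdown fences models sometimes emit and re-parses the outermost JSON object:

```python
import json


def repair_json(response: str) -> dict:
    """Best-effort recovery of a JSON object from raw LLM output."""
    text = response.strip()
    # Models often wrap JSON in ```json ... ``` fences; strip them.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    # Fall back to the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start:end + 1])
    raise ValueError("no JSON object found in response")
```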
Chambers organize directories in three main patterns:
1. Category-based:
/directory → [Categories] → [Businesses in Category]
2. Alphabetical:
/directory → [A-Z Letters] → [Businesses starting with Letter]
3. Paginated:
/directory?page=1 → /directory?page=2 → ...
The LLM identifies which pattern is in use:
```python
def detect_directory_structure(self, main_page_html: str) -> str:
    """Ask LLM to identify directory organization pattern."""
    prompt = """
    Analyze this Chamber of Commerce directory page.
    Identify the structure: 'categorical', 'alphabetical', or 'paginated'.
    Return only the structure type.
    """
    structure = self.llm.generate(prompt + main_page_html, max_tokens=20)
    return structure.strip().lower()
```
This approach eliminates hardcoded navigation logic.
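Once the structure type comes back, the crawler can dispatch to a matching traversal strategy. A hedged sketch with illustrative handler names (the stubs stand in for real traversal code):

```python
def crawl_by_category(base_url: str) -> list:
    return [f"{base_url}/directory?category=all"]        # stub


def crawl_by_letter(base_url: str) -> list:
    return [f"{base_url}/directory?letter={c}" for c in "AB"]  # stub: first two letters


def crawl_by_page(base_url: str) -> list:
    return [f"{base_url}/directory?page=1"]              # stub


def crawl_directory(structure: str, base_url: str) -> list:
    """Route to a traversal strategy based on the detected pattern."""
    handlers = {
        "categorical": crawl_by_category,
        "alphabetical": crawl_by_letter,
        "paginated": crawl_by_page,
    }
    # Unexpected LLM output: fall back to pagination, the most common case.
    handler = handlers.get(structure, crawl_by_page)
    return handler(base_url)
```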
Raw scraped data requires cleaning and validation:
1. Deduplication
```python
def deduplicate_businesses(self, businesses: List[dict]) -> List[dict]:
    """Remove duplicates based on website or name."""
    seen = set()
    unique = []
    for biz in businesses:
        # Create composite key
        key = (
            biz.get('website', '').lower().strip(),
            biz.get('name', '').lower().strip()
        )
        if key not in seen and any(key):
            seen.add(key)
            unique.append(biz)
    return unique
```
2. Data Validation
```python
def validate_business(self, business: dict) -> bool:
    """Ensure minimum required fields are present."""
    required = ['name']
    has_contact = any([
        business.get('phone'),
        business.get('email'),
        business.get('website')
    ])
    return all(business.get(field) for field in required) and has_contact
```
3. Field Normalization
```python
from typing import Optional

def normalize_phone(self, phone: Optional[str]) -> Optional[str]:
    """Standardize phone number format."""
    if not phone:
        return None
    # Extract digits only
    digits = ''.join(c for c in phone if c.isdigit())
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits[0] == '1':
        return f"+1 ({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    return phone  # Return original if format unknown
```
Bulk inserts dramatically improve performance:
```python
def bulk_insert_businesses(self, businesses: List[dict]):
    """Insert multiple businesses efficiently."""
    # Prepare data for executemany()
    records = [
        (
            biz['name'],
            biz.get('website'),
            biz.get('phone'),
            biz.get('email'),
            biz.get('address'),
            json.dumps(biz.get('categories', []))
        )
        for biz in businesses
    ]
    # Single transaction for all inserts
    self.cursor.executemany('''
        INSERT OR IGNORE INTO businesses
        (name, website, phone, email, address, categories)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', records)
    self.conn.commit()
```
Performance improvement: 500+ records/second vs 10-20 with individual inserts.
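Note that `INSERT OR IGNORE` only skips duplicates if the table declares a uniqueness constraint. The post doesn't show the schema, but something like this (my assumption, not the repo's exact DDL) makes the bulk insert idempotent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('''
    CREATE TABLE businesses (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        website TEXT,
        phone TEXT,
        email TEXT,
        address TEXT,
        categories TEXT,
        UNIQUE (name, website)   -- lets INSERT OR IGNORE skip duplicates
    )
''')

# Two identical rows: the second is silently ignored.
rows = [('Acme Co', 'https://acme.test', None, None, None, '[]')] * 2
conn.executemany('''
    INSERT OR IGNORE INTO businesses
    (name, website, phone, email, address, categories)
    VALUES (?, ?, ?, ?, ?, ?)
''', rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM businesses").fetchone()[0])  # 1
```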
```bash
# Clone and setup
git clone https://github.com/Burton-David/lead-gen-pipeline
cd lead-gen-pipeline
./setup.sh

# Initialize database
python cli.py init

# Extract from chamber
python cli.py chambers --url https://www.paloaltochamber.com

# Export results
python cli.py export --output leads.csv
```
After processing the Palo Alto Chamber of Commerce directory:
Data Completeness:
Most Common Categories:
Performance Characteristics:
Key Insight: The LLM accounts for only 38% of processing time. Network latency (43%) is the actual bottleneck, meaning concurrent processing of multiple chambers would scale nearly linearly.
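Since the work is network-bound, fanning chambers out across a thread pool is the natural way to exploit that. A sketch of what this could look like (not part of the current repo; `process_chamber` is a placeholder for a full extraction run):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chamber(url: str) -> int:
    """Placeholder for the full extract/transform/load run on one chamber."""
    return len(url)  # stand-in for 'businesses extracted'


chambers = [
    "https://www.paloaltochamber.com",
    "https://example-chamber-a.test",
    "https://example-chamber-b.test",
]

# Threads overlap the network waits that dominate each chamber's runtime,
# so total wall-clock time approaches that of the slowest single chamber.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(process_chamber, url): url for url in chambers}
    total = sum(f.result() for f in as_completed(futures))
print(total)
```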
Qwen2-7B performs remarkably well for structured data extraction.
Switching from individual inserts to bulk operations improved database performance by 25x. When building data pipelines, always batch writes into a single transaction rather than committing row by row.
Web scraping is inherently unreliable, so the pipeline includes defensive error handling throughout.
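One concrete example of that defensive posture is retrying transient network failures with exponential backoff. A sketch of a retry decorator (my illustration, not necessarily how the repo implements it):

```python
import time
from functools import wraps


def with_retries(attempts: int = 3, base_delay: float = 0.01):
    """Retry a flaky call with exponential backoff between attempts."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of retries: surface the error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


calls = {"n": 0}

@with_retries(attempts=3)
def flaky_fetch(url: str) -> str:
    """Simulated fetch that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "<html>ok</html>"

result = flaky_fetch("https://example.test")
print(result)  # <html>ok</html>, after two retries
```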
There are still potential improvements to explore, most obviously the concurrent multi-chamber processing that the timing breakdown suggests would scale nearly linearly.
AI-powered data pipelines represent a paradigm shift in web scraping. Instead of brittle CSS selectors, we can use language models to understand and extract data like humans do.
This approach scales to any directory structure without custom code per site. The 9-minute extraction time for 296 businesses proves the concept works at practical speeds.
For B2B lead generation, business intelligence, or market research, combining local LLMs with solid pipeline engineering creates powerful, privacy-respecting data collection systems.
Try it yourself: clone the repository, run `./setup.sh`, and point the CLI at your local chamber's directory.
Build data pipelines that adapt and scale. Your leads are waiting to be discovered.
All code and performance metrics are from the lead-gen-pipeline GitHub repository.