// project

Lead Generation Engine

September 2025
Technologies
Python, Google Cloud Platform, Selenium, NLTK, spaCy, Tesseract OCR, Google Cloud SQL, MongoDB

I built this for the Florida sales team at BTL Industries. They needed enriched contact information on decision-makers at companies that fit their target profile — size, location, industry, the tech their websites were running. Doing it manually wasn’t feasible at the volume they needed.

The pipeline scrapes Google Business Profiles, extracts contact and structural data with NLP and OCR, validates and de-duplicates, and pushes the result into the CRM. The team using it won President’s Club 2023 — $30M in revenue, double the second-place team.

What the system does

Five stages, each pluggable:

  1. Discovery — parameterized search via the Google Business Profiles API across company size, geography, industry, product categories, and detected web technologies
  2. Collection — Selenium-driven scraping of company websites in parallel browser instances, adaptive to varied site structures
  3. NLP extraction — entity extraction, contact-pattern matching, data standardization, confidence scoring
  4. OCR fill-in — Tesseract on page screenshots to recover fields the DOM-based scraper missed
  5. Storage and sync — structured records to Cloud SQL, raw documents and screenshots to MongoDB, real-time sync into Smartsheet for the sales team
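To make "pluggable" concrete, here is a minimal sketch of how stages could compose, assuming each stage is a plain function that takes a lead record and returns it. The Lead type, the two regexes, and the fixed confidence values are all hypothetical illustrations; the production extraction logic is not shown here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical record: raw page text in, extracted fields plus per-field confidence out.
@dataclass
class Lead:
    company: str
    page_text: str
    fields: Dict[str, str] = field(default_factory=dict)
    confidence: Dict[str, float] = field(default_factory=dict)


Stage = Callable[[Lead], Lead]

# Illustrative patterns only; real contact matching needs far more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")


def extract_contacts(lead: Lead) -> Lead:
    """Contact-pattern matching; confidence values are made-up constants."""
    email = EMAIL_RE.search(lead.page_text)
    if email:
        lead.fields["email"] = email.group(0)
        lead.confidence["email"] = 0.9
    phone = PHONE_RE.search(lead.page_text)
    if phone:
        lead.fields["phone"] = phone.group(0)
        lead.confidence["phone"] = 0.8
    return lead


def standardize(lead: Lead) -> Lead:
    """Data standardization: lowercase emails, digits-only phone numbers."""
    if "email" in lead.fields:
        lead.fields["email"] = lead.fields["email"].lower()
    if "phone" in lead.fields:
        lead.fields["phone"] = re.sub(r"\D", "", lead.fields["phone"])
    return lead


def run_pipeline(lead: Lead, stages: List[Stage]) -> Lead:
    # Stages run in order; swapping or adding one is a change to this list only.
    for stage in stages:
        lead = stage(lead)
    return lead


lead = run_pipeline(
    Lead("Acme Dental", "Contact: Jane.Doe@AcmeDental.example or (305) 555-0142"),
    [extract_contacts, standardize],
)
print(lead.fields)
```

Because every stage shares one signature, an OCR fill-in step could be added as another function in the list without touching the others.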

Parallel scraping

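import queue
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver


class ParallelScraper:
    def __init__(self, max_workers=5):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        # One reusable browser per worker (Chrome is an assumption here).
        self.driver_pool = queue.Queue()
        for _ in range(max_workers):
            self.driver_pool.put(webdriver.Chrome())

    def scrape_company(self, company_url):
        driver = self.driver_pool.get()  # blocks until a browser is free
        try:
            driver.get(company_url)
            # Placeholder: the real field extraction is omitted.
            extracted_data = {"url": company_url, "title": driver.title}
            return extracted_data
        finally:
            self.driver_pool.put(driver)  # always return the browser to the pool

    def scrape_all(self, company_urls):
        # Fan the URLs out across the worker threads; results keep input order.
        return list(self.executor.map(self.scrape_company, company_urls))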

A driver pool reuses browser instances rather than spinning up a new Selenium session per company. Companies are processed concurrently across the pool; the pool size is tuned to the headroom of the GCP instance.
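The borrow-and-return pattern can be exercised without a browser. The sketch below is an illustration, not the production code: StubDriver is a hypothetical stand-in for a Selenium WebDriver, and DriverPool and borrow are names invented for this example. It wraps the get/put pair in a context manager so a driver always goes back to the pool, even if a page load raises.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class StubDriver:
    """Hypothetical stand-in for a WebDriver so the sketch runs without a browser."""

    def __init__(self, driver_id):
        self.driver_id = driver_id
        self.visited = []

    def get(self, url):
        self.visited.append(url)


class DriverPool:
    def __init__(self, drivers):
        self._free = queue.Queue()
        for driver in drivers:
            self._free.put(driver)

    @contextmanager
    def borrow(self):
        driver = self._free.get()  # blocks while every driver is in use
        try:
            yield driver
        finally:
            self._free.put(driver)  # returned even if the caller raised


drivers = [StubDriver(i) for i in range(3)]
pool = DriverPool(drivers)


def visit(url):
    with pool.borrow() as driver:
        driver.get(url)
        return url


urls = [f"https://example.com/company/{n}" for n in range(12)]
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(visit, urls))

# Twelve URLs were served by only three driver instances.
total_visits = sum(len(d.visited) for d in drivers)
print(total_visits)
```

Each driver is held by one thread at a time, so per-driver state needs no extra locking.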

Polyglot storage

Two storage backends because the data really is two kinds of data:

  • Cloud SQL (PostgreSQL) — the canonical record. Normalized schema, indexed on the fields the sales team queries, automated backups, read replicas for scaling.
  • MongoDB — raw scraped HTML, page screenshots, OCR output, confidence-score breakdowns. Flexible schema, GridFS for the large binaries, useful when a question comes up later about why a particular field was extracted the way it was.

Forcing scraped HTML into a relational schema would either be lossy or build a maze of nullable columns. Forcing the structured contact data into MongoDB would lose the queryability the sales team needs. Two stores, two jobs.
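A minimal sketch of that split, under loud assumptions: sqlite3 stands in for Cloud SQL (PostgreSQL), a plain dict stands in for a MongoDB collection, and the table layout, field names, and split_record helper are hypothetical. The point is only that both stores share one identifier, so a structured row can always be traced back to its raw evidence.

```python
import sqlite3
import uuid


def split_record(scraped: dict) -> tuple:
    """Split one scrape result into a relational row and a raw document
    that share the same lead_id."""
    lead_id = str(uuid.uuid4())
    row = (lead_id, scraped["company"], scraped["contact_name"], scraped["email"])
    document = {
        "_id": lead_id,
        "raw_html": scraped["raw_html"],
        "ocr_text": scraped.get("ocr_text"),
        "confidence": scraped.get("confidence", {}),
    }
    return row, document


conn = sqlite3.connect(":memory:")  # stand-in for Cloud SQL (PostgreSQL)
conn.execute(
    "CREATE TABLE leads ("
    "lead_id TEXT PRIMARY KEY, company TEXT NOT NULL, contact_name TEXT, email TEXT)"
)
conn.execute("CREATE INDEX idx_leads_company ON leads (company)")
raw_store = {}  # stand-in for a MongoDB collection keyed by _id

row, document = split_record({
    "company": "Acme Dental",
    "contact_name": "Jane Doe",
    "email": "jane.doe@acmedental.example",
    "raw_html": "<html><body>Contact: Jane Doe</body></html>",
    "confidence": {"email": 0.9},
})
conn.execute("INSERT INTO leads VALUES (?, ?, ?, ?)", row)
raw_store[document["_id"]] = document

# Structured lookup on the relational side, provenance lookup on the document side.
found = conn.execute(
    "SELECT lead_id, email FROM leads WHERE company = ?", ("Acme Dental",)
).fetchone()
provenance = raw_store[found[0]]
print(found[1], provenance["confidence"])
```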

What the sales team actually used

Smartsheet is what the team interacted with day-to-day. The pipeline pushed enriched leads into a sheet with role-based access, real-time API sync, and an audit trail. From the team’s perspective the pipeline was invisible: leads showed up in their sheet, qualified and contactable.

Behind that, Tableau and Power BI dashboards tracked the metrics the manager cared about: lead volume, data quality, processing status, time-from-discovery-to-CRM.

Results

  • 10,000+ companies processed across the engagement
  • Zero manual data entry by the sales team for these accounts
  • 50% more qualified leads vs. the prior manual workflow
  • 94% data accuracy by spot-check audit
  • 99.8% uptime on the production pipeline

The business outcome:

  • President’s Club 2023 for the Florida team
  • $30M revenue, ~2× the second-place team
  • Sales reps spent their time selling instead of building lists

What I’d build differently

A few things I’d change with the benefit of hindsight:

  • Streaming instead of batch. The pipeline ran on a schedule. A streaming variant would update records as websites change, not on a polling interval.
  • Better duplicate detection. The same company sometimes appeared under slightly different names from different sources. Entity resolution, for example embedding-based matching against a canonical index, would have caught those.
  • Predictive scoring. The pipeline produced leads of equal weight. Scoring them by expected conversion probability — based on past won/lost deals — would have given the team a prioritized queue.
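To illustrate the duplicate-detection idea, here is a small sketch of matching names against a canonical index. It deliberately swaps in plain string similarity from Python's standard-library difflib for the embedding-based matching mentioned above, so it runs with no model. The suffix list, the 0.85 threshold, and the function names are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

# Illustrative list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "company", "incorporated"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def resolve(names, threshold=0.85):
    """Map each raw name to the first canonical name it resembles closely enough."""
    canonical = []  # normalized names accepted so far
    mapping = {}
    for name in names:
        key = normalize_name(name)
        match = next(
            (c for c in canonical if SequenceMatcher(None, key, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(key)
            match = key
        mapping[name] = match
    return mapping


mapping = resolve(
    ["Acme Dental, Inc.", "ACME Dental LLC", "Acme Dentall", "Sunshine Medspa"]
)
for raw, canon in mapping.items():
    print(f"{raw!r} -> {canon!r}")
```

The linear scan over the canonical list is fine for a sketch; at the 10,000+ company scale an indexed or blocked lookup would be the more realistic choice.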

Specific client details and proprietary algorithms have been omitted.