// project
I built this for the Florida sales team at BTL Industries. They needed enriched contact information on decision-makers at companies that fit their target profile — size, location, industry, the tech their websites were running. Doing it manually wasn’t feasible at the volume they needed.
The pipeline scrapes Google Business Profiles, extracts contact and structural data with NLP and OCR, validates and de-duplicates, and pushes the result into the CRM. The team using it won President’s Club 2023 — $30M in revenue, double the second-place team.
Five stages, each pluggable:
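One way to read "pluggable" is that every stage is a callable with the same shape, so a stage can be swapped or reordered without touching the others. The sketch below is illustrative only, not the production code: the `validate` and `dedupe` rules, the `run_pipeline` helper, and the record fields are all hypothetical, and the sample leads are made up.

```python
def validate(records):
    """Keep only records that have a name and a plausible email."""
    for record in records:
        if record.get("name") and "@" in record.get("email", ""):
            yield record


def dedupe(records):
    """Drop repeats, keyed on the trimmed, lower-cased email address."""
    seen = set()
    for record in records:
        key = record["email"].strip().lower()
        if key not in seen:
            seen.add(key)
            yield record


def run_pipeline(records, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        records = stage(records)
    return list(records)


leads = [
    {"name": "A. Rivera", "email": "a.rivera@example.com"},
    {"name": "A. Rivera", "email": "A.Rivera@example.com "},  # duplicate
    {"name": "", "email": "nobody@example.com"},              # no name
    {"name": "J. Chen", "email": "not-an-email"},             # bad email
]
clean = run_pipeline(leads, [validate, dedupe])
print(clean)
```

Because each stage only consumes and yields records, replacing `dedupe` with a fuzzier matcher, or adding a stage between the two, is a change to the list passed to `run_pipeline` and nothing else.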
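from selenium import webdriver
from concurrent.futures import ThreadPoolExecutor
import queue


class ParallelScraper:
    def __init__(self, max_workers=5):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        # One browser per worker, created once and reused across companies.
        # Chrome is an assumption here; any Selenium driver would do.
        self.driver_pool = queue.Queue()
        for _ in range(max_workers):
            self.driver_pool.put(webdriver.Chrome())

    def scrape_company(self, company_url):
        driver = self.driver_pool.get()  # blocks until a browser is free
        try:
            driver.get(company_url)
            # Extraction logic omitted; these fields are placeholders.
            extracted_data = {"url": company_url, "title": driver.title}
            return extracted_data
        finally:
            # Always hand the browser back, even if the scrape raised.
            self.driver_pool.put(driver)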
A driver pool reuses browser instances rather than spinning up a new Selenium session per company. Companies are processed concurrently across the pool; the pool size is tuned to the headroom of the GCP instance.
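To show the pooling behaviour without needing a browser, here is a minimal sketch that swaps Selenium for a stand-in driver. `FakeDriver`, `PooledScraper`, `scrape_all`, and the example URLs are all hypothetical names introduced for illustration; only the queue-backed pool and the thread pool mirror the approach described above.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Stand-in for a Selenium WebDriver so the sketch runs without a browser."""

    def __init__(self, driver_id):
        self.driver_id = driver_id
        self.current_url = None

    def get(self, url):
        self.current_url = url


class PooledScraper:
    def __init__(self, max_workers=3, driver_factory=FakeDriver):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        self.driver_pool = queue.Queue()
        for i in range(max_workers):
            self.driver_pool.put(driver_factory(i))

    def scrape_company(self, url):
        driver = self.driver_pool.get()  # blocks until a driver is free
        try:
            driver.get(url)
            return {"url": driver.current_url, "driver": driver.driver_id}
        finally:
            self.driver_pool.put(driver)  # returned even if the scrape raises

    def scrape_all(self, urls):
        # map() runs the calls concurrently but yields results in input order.
        return list(self.executor.map(self.scrape_company, urls))

    def close(self):
        self.executor.shutdown(wait=True)


urls = [f"https://example.com/company/{n}" for n in range(10)]
scraper = PooledScraper(max_workers=3)
results = scraper.scrape_all(urls)
scraper.close()

drivers_used = {r["driver"] for r in results}
print(len(results), "pages scraped using", len(drivers_used), "distinct drivers")
```

Ten URLs go through at most three drivers, and every driver is back in the queue once the work is done. Sizing the thread pool and the driver pool to the same number means a worker never waits long for a browser.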
Two storage backends because the data really is two kinds of data:
Forcing scraped HTML into a relational schema would either lose information or produce a maze of nullable columns. Forcing the structured contact data into MongoDB would lose the queryability the sales team needs. Two stores, two jobs.
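A rough sketch of that split, under loud assumptions: a plain Python list stands in for the MongoDB collection, an in-memory SQLite table stands in for the relational store, and the `store` function, the column names, and the sample record are all made up for illustration.

```python
import sqlite3

# Stand-in for the document store: free-form dicts, no fixed shape.
raw_pages = []

# Stand-in for the relational store: a fixed, queryable schema.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE contacts ("
    " email TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " title TEXT,"
    " company TEXT NOT NULL,"
    " state TEXT)"
)


def store(scrape_result):
    """Send the raw capture to one store and the structured contact to the other."""
    raw_pages.append({
        "url": scrape_result["url"],
        "html": scrape_result["html"],
        "meta": scrape_result.get("meta", {}),  # shape varies page to page
    })
    c = scrape_result["contact"]
    db.execute(
        "INSERT OR REPLACE INTO contacts VALUES (?, ?, ?, ?, ?)",
        (c["email"], c["name"], c.get("title"), c["company"], c.get("state")),
    )


store({
    "url": "https://example.com/acme",
    "html": "<html>...</html>",
    "meta": {"cms": "WordPress", "ocr_blocks": 3},
    "contact": {"email": "j.chen@example.com", "name": "J. Chen",
                "title": "Owner", "company": "Acme Med Spa", "state": "FL"},
})

# The kind of question a sales team asks becomes a short SQL query.
rows = db.execute(
    "SELECT name, company FROM contacts WHERE state = ? AND title = ?",
    ("FL", "Owner"),
).fetchall()
print(rows)
```

The `meta` dict can grow new keys per page without a schema migration, while the contact table stays narrow enough to filter and join on.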
Smartsheet was what the team worked in day to day. The pipeline pushed enriched leads into a sheet with role-based access, real-time API sync, and an audit trail. From the team’s perspective the pipeline was invisible: leads showed up in their sheet, qualified and contactable.
Behind that, Tableau and Power BI dashboards tracked the metrics the manager cared about: lead volume, data quality, processing status, time-from-discovery-to-CRM.
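For a sense of how those figures could be derived from the lead records before a dashboard plots them, here is a minimal sketch. The field names (`discovered_at`, `synced_at`), the `REQUIRED` completeness rule, and the `dashboard_metrics` function are assumptions made for illustration, and the sample leads are invented.

```python
from datetime import datetime
from statistics import mean

# Hypothetical completeness rule: a lead counts as "quality" if these are filled.
REQUIRED = ("name", "email", "company")


def dashboard_metrics(leads):
    """Summarise a batch of leads into the figures a dashboard would plot."""
    complete = [lead for lead in leads if all(lead.get(f) for f in REQUIRED)]
    synced = [lead for lead in leads if lead.get("synced_at")]
    hours = [
        (lead["synced_at"] - lead["discovered_at"]).total_seconds() / 3600
        for lead in synced
    ]
    return {
        "lead_volume": len(leads),
        "data_quality": len(complete) / len(leads) if leads else 0.0,
        "pending": len(leads) - len(synced),
        "mean_hours_to_crm": mean(hours) if hours else None,
    }


leads = [
    {"name": "J. Chen", "email": "j.chen@example.com", "company": "Acme",
     "discovered_at": datetime(2023, 3, 1, 9, 0),
     "synced_at": datetime(2023, 3, 1, 11, 0)},
    {"name": "A. Rivera", "email": "", "company": "Beta",
     "discovered_at": datetime(2023, 3, 1, 9, 0),
     "synced_at": datetime(2023, 3, 1, 13, 0)},
    {"name": "S. Patel", "email": "s.patel@example.com", "company": "Gamma",
     "discovered_at": datetime(2023, 3, 1, 10, 0),
     "synced_at": None},
    {"name": "M. Okafor", "email": "m.okafor@example.com", "company": "Delta",
     "discovered_at": datetime(2023, 3, 1, 10, 0),
     "synced_at": None},
]
metrics = dashboard_metrics(leads)
print(metrics)
```

Computing the summary in the pipeline rather than in the BI tool keeps the definition of "data quality" in one place, whichever dashboard reads it.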
The business outcome:
A few things I’d change with the benefit of hindsight:
Specific client details and proprietary algorithms have been omitted.