


Pharma Web: Centralized Data Solution for NDC Management

18 January, 2024

The project involved developing a sophisticated web scraping solution to extract comprehensive data for each National Drug Code (NDC) from multiple sources, including SAM.gov, DailyMed, and the FDA Orange Book. The goal was an adaptable, robust solution that could handle dynamic content, varied data formats, and challenging anti-scraping mechanisms, ensuring the latest and most accurate data was consistently available for querying and analysis.

Key Features:

  • Advanced 'element-hunting' mechanism for dynamic layout handling
  • PDF and Word document data extraction
  • IP proxy rotation and anti-bot measures
  • Scheduled bi-monthly automated scraping
  • REST API integration for secure data access

Project Challenges:

  • Handling Dynamic Content and Diverse Layouts: Each source displayed unique layouts and data structures. To address this, we developed an AI-powered 'element-hunting' system that intelligently detected relevant data points based on text patterns, proximity, and hierarchy.
  • Extracting Data from PDFs and Word Files: Some data was embedded in downloadable files. We implemented a solution to detect and retrieve these files, using libraries like PyMuPDF, pdfplumber, and python-docx to extract relevant text, which required custom parsing to isolate the necessary information.
  • Managing IP Blocking and Proxy Rotation: To avoid IP bans, we established a proxy rotation system to balance requests across multiple IPs, with dynamic delays and retry mechanisms to mimic human behavior.
  • Automating Scheduled Scraping: To ensure regular updates, we set up cron jobs that triggered the scripts bi-monthly, reducing the need for manual intervention.
  • Storing and Serving Data: We stored all extracted data in a PostgreSQL database and created a Django REST API to serve data securely to a React frontend.
  • Frontend Responsiveness and API Integration: Building a responsive frontend that integrated seamlessly with the backend APIs posed challenges. Ensuring smooth data loading, handling API response delays, and creating a consistent, user-friendly experience across devices required optimizing both the frontend code and API response handling.
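
The scheduled scraping can be expressed as a single crontab entry. This assumes "bi-monthly" means twice a month (here the 1st and 15th at 02:00); the script path and log location are illustrative:

```shell
# Run the scraper suite at 02:00 on the 1st and 15th of every month.
# (Paths are placeholders; adjust to the deployment.)
0 2 1,15 * * /usr/bin/python3 /opt/pharma-web/run_scrapers.py >> /var/log/pharma_scrape.log 2>&1
```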
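
The 'element-hunting' idea described above can be illustrated with a minimal, stdlib-only sketch (the real system is described as AI-powered; this simplified version uses only text patterns and proximity, and all page snippets are hypothetical). It walks text nodes in document order and pairs a label matching a pattern with the nearest following text node, so it works regardless of whether a value sits in a `<td>`, `<span>`, or `<dd>`:

```python
import re
from html.parser import HTMLParser

class LabelValueHunter(HTMLParser):
    """Collect text nodes in document order; tag layout is deliberately ignored."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)

def hunt(html, label_pattern):
    """Return the text node immediately following the first node that
    matches label_pattern -- a simple proximity heuristic."""
    hunter = LabelValueHunter()
    hunter.feed(html)
    for i, text in enumerate(hunter.texts):
        if re.search(label_pattern, text, re.IGNORECASE) and i + 1 < len(hunter.texts):
            return hunter.texts[i + 1]
    return None

# The same call works across different source layouts:
page = "<table><tr><th>NDC Code</th><td>0002-3227-30</td></tr></table>"
print(hunt(page, r"NDC\s*Code"))  # 0002-3227-30
```

Because the heuristic keys on label text rather than CSS selectors, a layout change on the source site is less likely to silently break the scraper.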
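
For the document-extraction challenge, libraries like PyMuPDF, pdfplumber, and python-docx return raw text; the custom parsing step that isolates the needed information could look like the following sketch, which validates the three legal NDC segment shapes (4-4-2, 5-3-2, 5-4-1; the sample strings are illustrative):

```python
import re

# An NDC is 10 digits split into three segments in one of three patterns.
NDC_RE = re.compile(r"\b(\d{4,5})-(\d{3,4})-(\d{1,2})\b")
VALID_SHAPES = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}

def extract_ndcs(text):
    """Pull valid NDC codes out of raw text extracted from a PDF or DOCX."""
    found = []
    for m in NDC_RE.finditer(text):
        shape = tuple(len(g) for g in m.groups())
        if shape in VALID_SHAPES:  # reject look-alikes such as SSNs
            found.append(m.group(0))
    return found

sample = "Package NDC 0002-3227-30 replaces 12345-6789-1; ignore 123-45-6789."
print(extract_ndcs(sample))  # ['0002-3227-30', '12345-6789-1']
```

Validating segment lengths, not just the digit-dash pattern, is what keeps other hyphenated identifiers in the documents from leaking into the dataset.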
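
The proxy-rotation and retry behavior can be sketched as below; the proxy URLs are placeholders, and the HTTP call is injected as a function so the rotation logic stands alone (a real fetcher would issue the request through the selected proxy):

```python
import itertools
import random
import time

PROXIES = ["http://proxy-a:8080", "http://proxy-b:8080", "http://proxy-c:8080"]  # placeholders

def fetch_with_rotation(url, fetch, proxies=PROXIES, max_retries=3,
                        min_delay=1.0, max_delay=4.0, sleep=time.sleep):
    """Rotate across proxies, pausing a random human-like interval between
    attempts and retrying on failure before giving up."""
    pool = itertools.cycle(proxies)
    last_error = None
    for _attempt in range(max_retries):
        proxy = next(pool)
        try:
            return fetch(url, proxy)
        except Exception as exc:  # e.g. ban, timeout, connection reset
            last_error = exc
            sleep(random.uniform(min_delay, max_delay))  # jittered delay
    raise last_error

# Usage with a stub fetcher that simulates one banned proxy:
def stub_fetch(url, proxy):
    if proxy.endswith("a:8080"):
        raise ConnectionError("IP banned")
    return f"ok via {proxy}"

print(fetch_with_rotation("https://example.com", stub_fetch, sleep=lambda s: None))
# ok via http://proxy-b:8080
```

The randomized delay between attempts is the piece that mimics human pacing; a fixed interval is much easier for anti-bot systems to fingerprint.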

Technologies Used:

  • Python: Core language for backend data processing and automation
  • Selenium: Used for automated browser interaction to handle dynamic content
  • Beautiful Soup: HTML and XML parsing library for data extraction
  • Django: Backend framework to build secure APIs and manage data
  • React.js: Frontend library for an interactive and user-friendly interface
  • Services:

    Custom Software Solution, Web Scraping, API Development, Frontend Development
