Building an AI-Powered CV Parsing System with Multi-Format Support
Learn how to build a scalable CV parsing system that extracts candidate data from multiple formats using AI and helps recruiters identify the best candidates
During my time at HRFLOW.AI, I implemented an AI-powered CV parsing and candidate data extraction system that processes multiple file formats and helps recruiters at 850+ client companies identify the best candidates. In this article, I'll share the technical architecture and lessons learned.
The Challenge
Recruiters deal with hundreds of CVs daily in various formats - PDF, DOCX, TXT, and even scanned images. Manually reviewing each CV is time-consuming and prone to human bias. We needed a system that could:
- Parse CVs in multiple formats (PDF, DOCX, TXT, images)
- Extract structured data (skills, experience, education)
- Score candidates automatically
- Provide AI-powered recommendations
- Scale to handle thousands of CVs per day
System Architecture
1. File Upload & Processing Pipeline
The first challenge was handling different file formats. We built a processing pipeline using Python:
from typing import Dict, Any
import pypdf
from docx import Document
import pytesseract
from PIL import Image
class CVParser:
    def __init__(self):
        self.supported_formats = ['pdf', 'docx', 'txt', 'png', 'jpg']

    async def parse_cv(self, file_path: str) -> str:
        file_extension = file_path.split('.')[-1].lower()

        if file_extension == 'pdf':
            return await self.parse_pdf(file_path)
        elif file_extension == 'docx':
            return await self.parse_docx(file_path)
        elif file_extension == 'txt':
            return await self.parse_txt(file_path)
        elif file_extension in ['png', 'jpg', 'jpeg']:
            return await self.parse_image(file_path)
        else:
            raise ValueError(f"Unsupported format: {file_extension}")

    async def parse_pdf(self, file_path: str) -> str:
        text = ""
        with open(file_path, 'rb') as file:
            pdf_reader = pypdf.PdfReader(file)
            for page in pdf_reader.pages:
                text += page.extract_text()
        return text

    async def parse_image(self, file_path: str) -> str:
        # OCR for scanned documents
        image = Image.open(file_path)
        text = pytesseract.image_to_string(image)
        return text
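The DOCX and TXT branches above delegate to two helpers that aren't shown; here is a minimal sketch of how they can look, assuming python-docx for Word files and UTF-8-encoded text files:

    async def parse_docx(self, file_path: str) -> str:
        # python-docx exposes paragraphs; join their text into one block
        document = Document(file_path)
        return "\n".join(paragraph.text for paragraph in document.paragraphs)

    async def parse_txt(self, file_path: str) -> str:
        # Plain-text CVs only need decoding; tolerate odd encodings gracefully
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            return file.read()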
2. AI-Powered Data Extraction
Once we have the raw text, we use NLP models to extract structured information:
from typing import Any, Dict
from transformers import pipeline
import spacy
class DataExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_lg")
        self.ner_pipeline = pipeline("ner", model="dslim/bert-base-NER")

    def extract_entities(self, text: str) -> Dict[str, Any]:
        doc = self.nlp(text)
        entities = {
            'name': self.extract_name(doc),
            'email': self.extract_email(text),
            'phone': self.extract_phone(text),
            'skills': self.extract_skills(doc),
            'experience': self.extract_experience(doc),
            'education': self.extract_education(doc)
        }
        return entities
    def extract_skills(self, doc) -> list:
        # Simplified keyword matching; in production this was backed by a
        # custom NER model trained on technical skills
        skills = []
        skill_keywords = ['python', 'javascript', 'react', 'django',
                          'node.js', 'docker', 'aws', 'sql']
        for token in doc:
            if token.text.lower() in skill_keywords:
                skills.append(token.text)
        return list(set(skills))
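The email and phone extractors referenced in extract_entities can be simple regular expressions. A rough sketch follows; the patterns are illustrative assumptions, shown as standalone functions rather than class methods for brevity:

import re
from typing import Optional

def extract_email(text: str) -> Optional[str]:
    # First email-looking token in the raw CV text, if any
    match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
    return match.group(0) if match else None

def extract_phone(text: str) -> Optional[str]:
    # Loose pattern: optional +, then 8+ digits with common separators
    match = re.search(r'\+?\d[\d\s().-]{7,}\d', text)
    return match.group(0) if match else None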
3. Candidate Scoring System
We implemented an automated scoring system that evaluates candidates based on multiple criteria:
from dataclasses import dataclass
from typing import Any, Dict, List
@dataclass
class ScoringCriteria:
    required_skills: List[str]
    preferred_skills: List[str]
    min_experience_years: int
    education_level: str
class CandidateScorer:
    def __init__(self, criteria: ScoringCriteria):
        self.criteria = criteria

    def calculate_score(self, candidate: Dict[str, Any]) -> float:
        score = 0.0
        max_score = 100.0

        # Skills match (40 points)
        skills_score = self.score_skills(
            candidate['skills'],
            self.criteria.required_skills,
            self.criteria.preferred_skills
        )
        score += skills_score * 0.4

        # Experience (30 points)
        exp_score = self.score_experience(
            candidate['experience'],
            self.criteria.min_experience_years
        )
        score += exp_score * 0.3

        # Education (20 points)
        edu_score = self.score_education(
            candidate['education'],
            self.criteria.education_level
        )
        score += edu_score * 0.2

        # Additional factors (10 points)
        bonus_score = self.calculate_bonus_score(candidate)
        score += bonus_score * 0.1

        return min(score, max_score)
    def score_skills(self, candidate_skills: List[str],
                     required: List[str], preferred: List[str]) -> float:
        required_match = len(set(candidate_skills) & set(required))
        preferred_match = len(set(candidate_skills) & set(preferred))
        # Guard against empty criteria lists to avoid division by zero
        required_score = (required_match / len(required)) * 70 if required else 70
        preferred_score = (preferred_match / len(preferred)) * 30 if preferred else 30
        return required_score + preferred_score
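score_experience, score_education and calculate_bonus_score aren't shown above. Here is a minimal sketch of how the first two can look, assuming experience is expressed in years and education as a ranked level:

    EDUCATION_RANKS = {'highschool': 1, 'bachelor': 2, 'master': 3, 'phd': 4}

    def score_experience(self, experience_years: float, min_years: int) -> float:
        # Full marks at or above the required minimum, linear ramp below it
        if min_years <= 0:
            return 100.0
        return min(experience_years / min_years, 1.0) * 100

    def score_education(self, candidate_level: str, required_level: str) -> float:
        # Full marks when the candidate meets or exceeds the required level
        have = self.EDUCATION_RANKS.get(candidate_level.lower(), 0)
        need = self.EDUCATION_RANKS.get(required_level.lower(), 1)
        return 100.0 if have >= need else (have / need) * 100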
4. Building the Recruiter Copilot
The Recruiter Copilot uses the parsed data to provide intelligent recommendations:
from typing import Any, Dict, List
import openai
class RecruiterCopilot:
    def __init__(self, api_key: str):
        openai.api_key = api_key

    async def generate_candidate_summary(
        self,
        candidate: Dict[str, Any]
    ) -> str:
        prompt = f"""
        Generate a concise summary for this candidate:
        Name: {candidate['name']}
        Skills: {', '.join(candidate['skills'])}
        Experience: {candidate['experience_years']} years
        Education: {candidate['education']}
        Score: {candidate['score']}/100
        Highlight strengths and potential concerns.
        """

        response = await openai.ChatCompletion.acreate(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    async def suggest_interview_questions(
        self,
        candidate: Dict[str, Any],
        job_description: str
    ) -> List[str]:
        prompt = f"""
        Based on this candidate profile and job description,
        suggest 5 technical interview questions:
        Candidate Skills: {', '.join(candidate['skills'])}
        Job Requirements: {job_description}
        """

        response = await openai.ChatCompletion.acreate(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        questions = response.choices[0].message.content.split('\n')
        return [q.strip() for q in questions if q.strip()]
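Note that the snippet above uses the pre-1.0 openai SDK (openai.ChatCompletion.acreate). With openai>=1.0 the equivalent call goes through an explicit async client, roughly like this:

from openai import AsyncOpenAI

class RecruiterCopilot:
    def __init__(self, api_key: str):
        # openai>=1.0 style: an explicit async client instead of a module-level key
        self.client = AsyncOpenAI(api_key=api_key)

    async def _chat(self, prompt: str) -> str:
        # Shared helper the two methods above would call instead of acreate
        response = await self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content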
Database Schema with Prisma
We used Prisma ORM for type-safe database access:
model Candidate {
  id           String        @id @default(uuid())
  name         String
  email        String        @unique
  phone        String?
  resumePath   String
  parsedData   Json
  score        Float?
  createdAt    DateTime      @default(now())
  updatedAt    DateTime      @updatedAt
  skills       Skill[]
  experiences  Experience[]
  education    Education[]
  applications Application[]
}

model Skill {
  id          String    @id @default(uuid())
  name        String
  level       String?   // beginner, intermediate, advanced
  candidateId String
  candidate   Candidate @relation(fields: [candidateId], references: [id])
}

model Experience {
  id          String    @id @default(uuid())
  company     String
  position    String
  startDate   DateTime
  endDate     DateTime?
  description String?
  candidateId String
  candidate   Candidate @relation(fields: [candidateId], references: [id])
}
API Design with FastAPI
We built RESTful APIs to serve the frontend:
from fastapi import FastAPI, UploadFile, File, HTTPException, Query
from fastapi.responses import JSONResponse
from typing import List, Optional

app = FastAPI()
@app.post("/api/cv/upload")
async def upload_cv(file: UploadFile = File(...)):
    try:
        # Save file
        file_path = await save_upload_file(file)

        # Parse CV
        parser = CVParser()
        raw_text = await parser.parse_cv(file_path)

        # Extract data
        extractor = DataExtractor()
        candidate_data = extractor.extract_entities(raw_text)

        # Calculate score
        scorer = CandidateScorer(default_criteria)
        score = scorer.calculate_score(candidate_data)

        # Save to database
        candidate = await save_candidate(candidate_data, score)

        return {
            "id": candidate.id,
            "score": score,
            "data": candidate_data
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/candidates/search")
async def search_candidates(
    skills: Optional[List[str]] = Query(None),
    min_score: float = 0,
    limit: int = 20
):
    candidates = await search_candidates_db(
        skills=skills,
        min_score=min_score,
        limit=limit
    )
    return {"candidates": candidates}
Performance Optimization
1. Batch Processing
For handling bulk CV uploads, we implemented batch processing with queues:
from typing import List

from celery import Celery
from redis import Redis

celery_app = Celery('cv_parser', broker='redis://localhost:6379')

@celery_app.task
def process_cv_batch(file_paths: List[str]):
    results = []
    for file_path in file_paths:
        result = process_single_cv(file_path)
        results.append(result)
    return results
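From the API layer, a batch is enqueued rather than processed inline; a minimal example (the file paths are illustrative):

# Enqueue the batch; a Celery worker picks it up asynchronously
task = process_cv_batch.delay([
    "/uploads/cv_001.pdf",
    "/uploads/cv_002.docx",
])
print(task.id)  # store the task id to poll status or fetch results later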
2. Caching
We used Redis for caching frequently accessed data:
import redis
import json
redis_client = redis.Redis(host='localhost', port=6379, db=0)
def get_candidate_cache(candidate_id: str):
    cached = redis_client.get(f"candidate:{candidate_id}")
    if cached:
        return json.loads(cached)
    return None

def set_candidate_cache(candidate_id: str, data: dict):
    redis_client.setex(
        f"candidate:{candidate_id}",
        3600,  # 1 hour TTL
        json.dumps(data)
    )
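These helpers are used in a standard cache-aside pattern; a sketch, where load_candidate_from_db stands in for a hypothetical database helper:

def get_candidate(candidate_id: str) -> dict:
    # Cache-aside: try Redis first, fall back to the database and repopulate
    cached = get_candidate_cache(candidate_id)
    if cached is not None:
        return cached
    candidate = load_candidate_from_db(candidate_id)  # hypothetical DB helper
    set_candidate_cache(candidate_id, candidate)
    return candidate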
Results & Impact
After implementing this system at HRFLOW.AI:
- Processing Speed: Reduced CV processing time from 5 minutes to 30 seconds
- Accuracy: Achieved 92% accuracy in data extraction
- Scale: Successfully serving 850+ clients
- ROI: Reduced recruiter screening time by 60%
Key Takeaways
- Multi-format support is crucial - Real-world CVs come in many formats
- AI enhances but doesn't replace - Human review is still important
- Type safety matters - Using TypeScript/Prisma prevented many bugs
- Performance optimization is essential - Caching and batch processing are key
- Continuous improvement - AI models need regular retraining
Tech Stack Summary
- Backend: Python, Django, FastAPI
- AI/ML: Transformers, spaCy, OpenAI GPT-4
- Database: PostgreSQL with Prisma ORM
- Queue: Celery with Redis
- Frontend: React, Next.js
- Infrastructure: Docker, AWS
Conclusion
Building an AI-powered CV parsing system requires careful consideration of file formats, data extraction accuracy, and scalability. By combining traditional NLP with modern AI models, we created a system that significantly improves the recruitment process.
The key is to start simple, measure results, and iterate based on real-world feedback from recruiters.
Want to learn more about building AI-powered applications? Follow me for more technical deep dives!