Building an AI-Powered CV Parsing System with Multi-Format Support
Learn how to build a scalable CV parsing system that extracts candidate data from multiple formats using AI and helps recruiters identify the best candidates
During my time at HRFLOW.AI, I implemented an AI-powered CV parsing and candidate data extraction system that processes multiple file formats and helps recruiters at 850+ client companies identify the best candidates. In this article, I'll share the technical architecture and lessons learned.
The Challenge
Recruiters deal with hundreds of CVs daily in various formats - PDF, DOCX, TXT, and even scanned images. Manually reviewing each CV is time-consuming and prone to human bias. We needed a system that could:
- Parse CVs in multiple formats (PDF, DOCX, TXT, images)
- Extract structured data (skills, experience, education)
- Score candidates automatically
- Provide AI-powered recommendations
- Scale to handle thousands of CVs per day
System Architecture
1. File Upload & Processing Pipeline
The first challenge was handling different file formats. We built a processing pipeline using Python:
from typing import Dict, Any
import pypdf
from docx import Document
import pytesseract
from PIL import Image
class CVParser:
    def __init__(self):
        self.supported_formats = ['pdf', 'docx', 'txt', 'png', 'jpg']

    async def parse_cv(self, file_path: str) -> str:
        file_extension = file_path.split('.')[-1].lower()

        if file_extension == 'pdf':
            return await self.parse_pdf(file_path)
        elif file_extension == 'docx':
            return await self.parse_docx(file_path)
        elif file_extension == 'txt':
            return await self.parse_txt(file_path)
        elif file_extension in ['png', 'jpg', 'jpeg']:
            return await self.parse_image(file_path)
        else:
            raise ValueError(f"Unsupported format: {file_extension}")

    async def parse_pdf(self, file_path: str) -> str:
        text = ""
        with open(file_path, 'rb') as file:
            pdf_reader = pypdf.PdfReader(file)
            for page in pdf_reader.pages:
                text += page.extract_text()
        return text

    async def parse_image(self, file_path: str) -> str:
        # OCR for scanned documents
        image = Image.open(file_path)
        text = pytesseract.image_to_string(image)
        return text
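The DOCX and TXT branches above delegate to two helpers that aren't shown; here is a minimal sketch of how they can look, assuming python-docx for Word files and UTF-8-encoded text files:

    async def parse_docx(self, file_path: str) -> str:
        # python-docx exposes paragraphs; join their text into one block
        document = Document(file_path)
        return "\n".join(paragraph.text for paragraph in document.paragraphs)

    async def parse_txt(self, file_path: str) -> str:
        # Plain-text CVs only need decoding; tolerate odd encodings gracefully
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            return file.read()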
2. AI-Powered Data Extraction
Once we have the raw text, we use NLP models to extract structured information:
from typing import Any, Dict
from transformers import pipeline
import spacy
class DataExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_lg")
        self.ner_pipeline = pipeline("ner", model="dslim/bert-base-NER")

    def extract_entities(self, text: str) -> Dict[str, Any]:
        doc = self.nlp(text)
        entities = {
            'name': self.extract_name(doc),
            'email': self.extract_email(text),
            'phone': self.extract_phone(text),
            'skills': self.extract_skills(doc),
            'experience': self.extract_experience(doc),
            'education': self.extract_education(doc)
        }
        return entities
    def extract_skills(self, doc) -> list:
        # Simplified keyword matching; in production this was backed by a
        # custom NER model trained on technical skills
        skills = []
        skill_keywords = ['python', 'javascript', 'react', 'django',
                          'node.js', 'docker', 'aws', 'sql']
        for token in doc:
            if token.text.lower() in skill_keywords:
                skills.append(token.text)
        return list(set(skills))
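The email and phone extractors referenced in extract_entities can be simple regular expressions. A rough sketch follows; the patterns are illustrative assumptions, shown as standalone functions rather than class methods for brevity:

import re
from typing import Optional

def extract_email(text: str) -> Optional[str]:
    # First email-looking token in the raw CV text, if any
    match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
    return match.group(0) if match else None

def extract_phone(text: str) -> Optional[str]:
    # Loose pattern: optional +, then 8+ digits with common separators
    match = re.search(r'\+?\d[\d\s().-]{7,}\d', text)
    return match.group(0) if match else None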
3. Candidate Scoring System
We implemented an automated scoring system that evaluates candidates based on multiple criteria:
from dataclasses import dataclass
from typing import Any, Dict, List
@dataclass
class ScoringCriteria:
    required_skills: List[str]
    preferred_skills: List[str]
    min_experience_years: int
    education_level: str
class CandidateScorer:
    def __init__(self, criteria: ScoringCriteria):
        self.criteria = criteria

    def calculate_score(self, candidate: Dict[str, Any]) -> float:
        score = 0.0
        max_score = 100.0

        # Skills match (40 points)
        skills_score = self.score_skills(
            candidate['skills'],
            self.criteria.required_skills,
            self.criteria.preferred_skills
        )
        score += skills_score * 0.4

        # Experience (30 points)
        exp_score = self.score_experience(
            candidate['experience'],
            self.criteria.min_experience_years
        )
        score += exp_score * 0.3

        # Education (20 points)
        edu_score = self.score_education(
            candidate['education'],
            self.criteria.education_level
        )
        score += edu_score * 0.2

        # Additional factors (10 points)
        bonus_score = self.calculate_bonus_score(candidate)
        score += bonus_score * 0.1

        return min(score, max_score)
    def score_skills(self, candidate_skills: List[str],
                     required: List[str], preferred: List[str]) -> float:
        required_match = len(set(candidate_skills) & set(required))
        preferred_match = len(set(candidate_skills) & set(preferred))
        # Guard against empty criteria lists to avoid division by zero
        required_score = (required_match / len(required)) * 70 if required else 70
        preferred_score = (preferred_match / len(preferred)) * 30 if preferred else 30
        return required_score + preferred_score
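score_experience, score_education and calculate_bonus_score aren't shown above. Here is a minimal sketch of how the first two can look, assuming experience is expressed in years and education as a ranked level:

    EDUCATION_RANKS = {'highschool': 1, 'bachelor': 2, 'master': 3, 'phd': 4}

    def score_experience(self, experience_years: float, min_years: int) -> float:
        # Full marks at or above the required minimum, linear ramp below it
        if min_years <= 0:
            return 100.0
        return min(experience_years / min_years, 1.0) * 100

    def score_education(self, candidate_level: str, required_level: str) -> float:
        # Full marks when the candidate meets or exceeds the required level
        have = self.EDUCATION_RANKS.get(candidate_level.lower(), 0)
        need = self.EDUCATION_RANKS.get(required_level.lower(), 1)
        return 100.0 if have >= need else (have / need) * 100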
4. Building the Recruiter Copilot
The Recruiter Copilot uses the parsed data to provide intelligent recommendations:
from typing import Any, Dict, List
import openai
class RecruiterCopilot:
    def __init__(self, api_key: str):
        openai.api_key = api_key

    async def generate_candidate_summary(
        self,
        candidate: Dict[str, Any]
    ) -> str:
        prompt = f"""
        Generate a concise summary for this candidate:
        Name: {candidate['name']}
        Skills: {', '.join(candidate['skills'])}
        Experience: {candidate['experience_years']} years
        Education: {candidate['education']}
        Score: {candidate['score']}/100
        Highlight strengths and potential concerns.
        """

        response = await openai.ChatCompletion.acreate(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    async def suggest_interview_questions(
        self,
        candidate: Dict[str, Any],
        job_description: str
    ) -> List[str]:
        prompt = f"""
        Based on this candidate profile and job description,
        suggest 5 technical interview questions:
        Candidate Skills: {', '.join(candidate['skills'])}
        Job Requirements: {job_description}
        """

        response = await openai.ChatCompletion.acreate(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        questions = response.choices[0].message.content.split('\n')
        return [q.strip() for q in questions if q.strip()]
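Note that the snippet above uses the pre-1.0 openai SDK (openai.ChatCompletion.acreate). With openai>=1.0 the equivalent call goes through an explicit async client, roughly like this:

from openai import AsyncOpenAI

class RecruiterCopilot:
    def __init__(self, api_key: str):
        # openai>=1.0 style: an explicit async client instead of a module-level key
        self.client = AsyncOpenAI(api_key=api_key)

    async def _chat(self, prompt: str) -> str:
        # Shared helper the two methods above would call instead of acreate
        response = await self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content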
Database Schema with Prisma
We used Prisma ORM for type-safe database access:
model Candidate {
  id           String        @id @default(uuid())
  name         String
  email        String        @unique
  phone        String?
  resumePath   String
  parsedData   Json
  score        Float?
  createdAt    DateTime      @default(now())
  updatedAt    DateTime      @updatedAt
  skills       Skill[]
  experiences  Experience[]
  education    Education[]
  applications Application[]
}

model Skill {
  id          String    @id @default(uuid())
  name        String
  level       String?   // beginner, intermediate, advanced
  candidateId String
  candidate   Candidate @relation(fields: [candidateId], references: [id])
}

model Experience {
  id          String    @id @default(uuid())
  company     String
  position    String
  startDate   DateTime
  endDate     DateTime?
  description String?
  candidateId String
  candidate   Candidate @relation(fields: [candidateId], references: [id])
}
API Design with FastAPI
We built RESTful APIs to serve the frontend:
from fastapi import FastAPI, UploadFile, File, HTTPException, Query
from fastapi.responses import JSONResponse
from typing import List, Optional

app = FastAPI()
@app.post("/api/cv/upload")
async def upload_cv(file: UploadFile = File(...)):
    try:
        # Save file
        file_path = await save_upload_file(file)

        # Parse CV
        parser = CVParser()
        raw_text = await parser.parse_cv(file_path)

        # Extract data
        extractor = DataExtractor()
        candidate_data = extractor.extract_entities(raw_text)

        # Calculate score
        scorer = CandidateScorer(default_criteria)
        score = scorer.calculate_score(candidate_data)

        # Save to database
        candidate = await save_candidate(candidate_data, score)

        return {
            "id": candidate.id,
            "score": score,
            "data": candidate_data
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/candidates/search")
async def search_candidates(
    skills: Optional[List[str]] = Query(None),
    min_score: float = 0,
    limit: int = 20
):
    candidates = await search_candidates_db(
        skills=skills,
        min_score=min_score,
        limit=limit
    )
    return {"candidates": candidates}
Performance Optimization
1. Batch Processing
For handling bulk CV uploads, we implemented batch processing with queues:
from typing import List

from celery import Celery
from redis import Redis

celery_app = Celery('cv_parser', broker='redis://localhost:6379')

@celery_app.task
def process_cv_batch(file_paths: List[str]):
    results = []
    for file_path in file_paths:
        result = process_single_cv(file_path)
        results.append(result)
    return results
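From the API layer, a batch is enqueued rather than processed inline; a minimal example (the file paths are illustrative):

# Enqueue the batch; a Celery worker picks it up asynchronously
task = process_cv_batch.delay([
    "/uploads/cv_001.pdf",
    "/uploads/cv_002.docx",
])
print(task.id)  # store the task id to poll status or fetch results later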
2. Caching
We used Redis for caching frequently accessed data:
import redis
import json
redis_client = redis.Redis(host='localhost', port=6379, db=0)
def get_candidate_cache(candidate_id: str):
    cached = redis_client.get(f"candidate:{candidate_id}")
    if cached:
        return json.loads(cached)
    return None

def set_candidate_cache(candidate_id: str, data: dict):
    redis_client.setex(
        f"candidate:{candidate_id}",
        3600,  # 1 hour TTL
        json.dumps(data)
    )
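These helpers are used in a standard cache-aside pattern; a sketch, where load_candidate_from_db stands in for a hypothetical database helper:

def get_candidate(candidate_id: str) -> dict:
    # Cache-aside: try Redis first, fall back to the database and repopulate
    cached = get_candidate_cache(candidate_id)
    if cached is not None:
        return cached
    candidate = load_candidate_from_db(candidate_id)  # hypothetical DB helper
    set_candidate_cache(candidate_id, candidate)
    return candidate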
Results & Impact
After implementing this system at HRFLOW.AI:
- Processing Speed: Reduced CV processing time from 5 minutes to 30 seconds
- Accuracy: Achieved 92% accuracy in data extraction
- Scale: Successfully serving 850+ clients
- ROI: Reduced recruiter screening time by 60%
Key Takeaways
- Multi-format support is crucial - Real-world CVs come in many formats
- AI enhances but doesn't replace - Human review is still important
- Type safety matters - Using TypeScript/Prisma prevented many bugs
- Performance optimization is essential - Caching and batch processing are key
- Continuous improvement - AI models need regular retraining
Tech Stack Summary
- Backend: Python, Django, FastAPI
- AI/ML: Transformers, spaCy, OpenAI GPT-4
- Database: PostgreSQL with Prisma ORM
- Queue: Celery with Redis
- Frontend: React, Next.js
- Infrastructure: Docker, AWS
Conclusion
Building an AI-powered CV parsing system requires careful consideration of file formats, data extraction accuracy, and scalability. By combining traditional NLP with modern AI models, we created a system that significantly improves the recruitment process.
The key is to start simple, measure results, and iterate based on real-world feedback from recruiters.
Want to learn more about building AI-powered applications? Follow me for more technical deep dives!