Build a Python tool that provides:
Multi-format parsing (PDF, DOCX, HTML, CSV, JSON)
OCR capabilities with Arabic text support (EasyOCR + Tesseract)
Structured JSON output with metadata preservation
GPU-accelerated processing
Comprehensive error handling and logging
Technical Requirements:
Core Functionality:
PDF text extraction (OCR and native)
DOCX/HTML/Markdown parsing
Arabic language support with RTL handling
Metadata preservation and enhancement
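As a rough illustration of the core functionality listed above, the sketch below dispatches on file extension, returns one JSON-ready record per document, and tags each record with a text direction so Arabic (RTL) content can be handled downstream. It uses only the standard library, so PDF, DOCX, and OCR parsing are left out; they would plug into the same registry. All names here (PARSERS, parse_document, text_direction, the metadata keys) are hypothetical and not part of any agreed design.

```python
import csv
import io
import json
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def parse_html(raw):
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    return "\n".join(extractor.chunks)


def parse_csv(raw):
    # Flatten each row into one comma-separated line of text.
    return "\n".join(", ".join(row) for row in csv.reader(io.StringIO(raw)))


def parse_json(raw):
    # Re-serialize without escaping non-ASCII so Arabic text stays readable.
    return json.dumps(json.loads(raw), ensure_ascii=False, indent=2)


# Hypothetical registry; PDF/DOCX/OCR parsers would be registered the same way.
PARSERS = {".html": parse_html, ".csv": parse_csv, ".json": parse_json}


def text_direction(text):
    """Return 'rtl' if strong right-to-left characters outnumber left-to-right ones."""
    rtl = sum(unicodedata.bidirectional(c) in ("R", "AL") for c in text)
    ltr = sum(unicodedata.bidirectional(c) == "L" for c in text)
    return "rtl" if rtl > ltr else "ltr"


def parse_document(raw, suffix, source="<memory>"):
    parser = PARSERS.get(suffix.lower())
    if parser is None:
        raise ValueError(f"unsupported format: {suffix}")
    text = parser(raw)
    return {
        "text": text,
        "metadata": {
            "source": source,
            "format": suffix.lower().lstrip("."),
            "direction": text_direction(text),
            "char_count": len(text),
        },
    }
```

The direction check relies on Unicode bidirectional classes, so it only reports which script dominates; it does not reorder or reshape the text.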
Performance Features:
GPU acceleration via EasyOCR
Batch processing capabilities
Configurable output formats
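One possible shape for the batch processing requirement, combined with the error handling and logging requirement, is sketched below. It assumes each file can be processed independently and that a single failing file should be logged and reported without stopping the batch. The function name process_batch, the logger name, and the (results, errors) return shape are illustrative assumptions; GPU use is not shown and would live inside whatever worker is passed in.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("doc_parser.batch")  # hypothetical logger name


def process_batch(items, worker, max_workers=4):
    """Run worker on every item; return (results, errors), both keyed by item."""
    results, errors = {}, {}

    def _safe(item):
        try:
            return item, worker(item), None
        except Exception as exc:  # keep the batch alive on per-file failures
            logger.exception("failed to process %s", item)
            return item, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields outcomes in the same order as the input items.
        for item, output, exc in pool.map(_safe, items):
            if exc is None:
                results[item] = output
            else:
                errors[item] = str(exc)
    return results, errors
```

A thread pool is used here only to keep the sketch small; a process pool or a single GPU worker queue may be a better fit depending on where the OCR time is actually spent.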
Integration Ready:
Compatible with LangChain/LlamaIndex
Clean API for extension
Modular architecture
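To illustrate what "compatible with LangChain/LlamaIndex" could mean in practice, the sketch below keeps the package's own output type free of third-party imports and exposes keyword arguments matching the two libraries' document constructors: LangChain's Document takes page_content and metadata, and LlamaIndex's Document is assumed here to take text and metadata. The class name ParsedDocument and its method names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ParsedDocument:
    """Library-neutral result record (hypothetical name and fields)."""

    page_content: str
    metadata: dict = field(default_factory=dict)

    def to_json(self):
        # ensure_ascii=False keeps Arabic text readable in the output file.
        return json.dumps(asdict(self), ensure_ascii=False)

    def as_langchain_kwargs(self):
        # Intended for: Document(**doc.as_langchain_kwargs())
        return {"page_content": self.page_content, "metadata": dict(self.metadata)}

    def as_llamaindex_kwargs(self):
        # Assumes a LlamaIndex Document built from text= and metadata=.
        return {"text": self.page_content, "metadata": dict(self.metadata)}
```

Returning plain keyword dictionaries means neither framework has to be installed for the parser itself to work; the caller imports whichever one they use.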
Deliverables:
✅ Fully functional Python package
✅ Documentation (usage examples, API reference)
✅ Sample test files
✅ Benchmarking results
Job Milestones:
Project delivery
Deliver the project as agreed.
Required Skills:
Artificial Intelligence
Data Science
Data Integration
Deadline:
03-05-2025
Client Budget:
300 EGP