Latest Update: August 2023 - Added OpenAI integration for automated cover letter generation
A sophisticated Python application that streamlines your job search by intelligently scraping and filtering LinkedIn job postings. Features a web-based dashboard for job management with AI-powered cover letter generation capabilities.
LinkedIn Job Scraper addresses the common frustrations of job searching on LinkedIn by providing:
- Intelligent Filtering: Remove irrelevant job postings based on keywords in titles and descriptions
- Duplicate Prevention: Automatic detection and removal of duplicate listings
- Smart Sorting: Jobs sorted by actual posting date, not LinkedIn's relevance algorithm
- No Sponsored Content: Focus only on genuine job postings
- AI-Powered Cover Letters: Automated cover letter generation using OpenAI
- Web Dashboard: Intuitive interface for job management and tracking
Disclaimer: This application scrapes LinkedIn's website, which may violate their Terms of Service. Use at your own risk and consider implementing proxy servers to avoid potential IP blocking.
- Automated Job Scraping: Multi-threaded scraping with configurable search parameters
- Advanced Filtering: Filter by keywords, company names, job types, and languages
- Database Storage: SQLite-based storage with efficient querying
- Web Interface: Flask-based dashboard for job management
- Status Tracking: Mark jobs as applied, rejected, interview, or hidden
- Cover Letter Generation: OpenAI-powered automated cover letter creation
- Resume Analysis: PDF resume parsing for personalized content
- Smart Matching: AI-driven job-resume compatibility assessment
- Python 3.6 or higher
- Flask
- Requests
- BeautifulSoup
- Pandas
- SQLite3
- PySocks
1. Clone the repository

   ```bash
   git clone https://github.com/bigdata5911/Linked-in-Scraping.git
   cd Linked-in-Scraping
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Configure the application

   - Copy `config_example.json` to `config.json`
   - Update the configuration parameters (see the Configuration section below)

4. Initialize the database

   ```bash
   python main.py
   ```

5. Launch the web interface

   ```bash
   python app.py
   ```

6. Access the dashboard: open your browser and navigate to `http://127.0.0.1:5000`
The scraper component handles LinkedIn job data extraction:

```bash
python main.py
```
Key Features:
- Configurable search queries and filters
- Duplicate detection and removal (see the sketch after this list)
- Multi-round scraping for comprehensive coverage
- Proxy support for enhanced reliability
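How the scraper deduplicates isn't spelled out above; as a rough illustration, duplicate removal against the SQLite store could look like the minimal sketch below, where the `id` primary key and `job_url` column are assumed names, not taken from the repo:

```python
import sqlite3

def remove_duplicates(db_path: str, table: str = "jobs") -> int:
    """Delete rows whose job_url already appeared in an earlier row.

    Assumes an integer primary key `id` and a `job_url` column
    (hypothetical names; adjust to the actual schema).
    """
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            f"""
            DELETE FROM {table}
            WHERE id NOT IN (
                SELECT MIN(id) FROM {table} GROUP BY job_url
            )
            """
        )
        return cur.rowcount  # number of duplicate rows removed
```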
The Flask-based web interface provides job management capabilities:

```bash
python app.py
```
Dashboard Features:
- Job Status Management: Mark jobs as applied (blue), rejected (red), interview (green), or hidden
- Real-time Updates: Immediate database updates for all actions
- Filtered Views: Focus on relevant job postings
- Status Persistence: All changes saved to the SQLite database (see the sketch after this list)
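The dashboard's route layout isn't documented here; below is a minimal sketch of the kind of Flask endpoint that could persist a status change, with the route, table, and column names all assumed for illustration:

```python
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "jobs.db"  # assumed path; match your config's db_path

@app.route("/job/<int:job_id>/status", methods=["POST"])
def set_status(job_id: int):
    # Accept only the statuses the dashboard uses.
    status = request.form.get("status")
    if status not in {"applied", "rejected", "interview", "hidden"}:
        return jsonify({"error": "invalid status"}), 400
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "UPDATE jobs SET status = ? WHERE id = ?", (status, job_id)
        )
    return jsonify({"id": job_id, "status": status})

if __name__ == "__main__":
    app.run()
```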
The `config.json` file controls all application behavior.

Network settings (proxies and request headers):

```json
{
  "proxies": {
    "http": "http://proxy-server:port",
    "https": "https://proxy-server:port"
  },
  "headers": {
    "User-Agent": "Your User Agent String"
  }
}
```
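For context, here is a minimal sketch of how an application like this might load `config.json` and reuse the configured proxies and headers on every request; the URL below is illustrative only:

```python
import json
import requests

def load_config(path: str = "config.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

config = load_config()

# Every scraping request can reuse the configured proxies and headers.
response = requests.get(
    "https://www.linkedin.com/jobs/search/",  # illustrative URL only
    proxies=config.get("proxies"),
    headers=config.get("headers"),
    timeout=30,
)
response.raise_for_status()
```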
OpenAI integration settings:

```json
{
  "OpenAI_API_KEY": "your-openai-api-key",
  "OpenAI_Model": "gpt-4",
  "resume_path": "/path/to/your/resume.pdf"
}
```
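The project's actual prompt and client code aren't shown in this README; the sketch below assumes the `openai` v1 Python client and `pypdf` for resume text extraction (pypdf is not in the requirements list above, so substitute your own PDF parser if needed):

```python
import json

from openai import OpenAI  # assumes the openai>=1.0 client
from pypdf import PdfReader  # assumed parser; not in the requirements above

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

client = OpenAI(api_key=config["OpenAI_API_KEY"])

# Pull plain text out of the configured resume PDF.
reader = PdfReader(config["resume_path"])
resume_text = "\n".join(page.extract_text() or "" for page in reader.pages)

def generate_cover_letter(job_description: str) -> str:
    """Ask the configured model for a cover letter tailored to one posting."""
    response = client.chat.completions.create(
        model=config["OpenAI_Model"],
        messages=[
            {"role": "system",
             "content": "You write concise, tailored cover letters."},
            {"role": "user",
             "content": (f"Resume:\n{resume_text}\n\n"
                         f"Job posting:\n{job_description}\n\n"
                         "Write a cover letter.")},
        ],
    )
    return response.choices[0].message.content
```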
Search query definitions:

```json
{
  "search_queries": [
    {
      "keywords": "software engineer",
      "location": "San Francisco, CA",
      "f_WT": "2"
    }
  ]
}
```
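How these fields map onto a LinkedIn request isn't documented here; purely as an illustration, a query object could be turned into URL parameters like so (the endpoint is an assumption, not taken from this repo):

```python
from urllib.parse import urlencode

def build_search_url(query: dict, start: int = 0) -> str:
    # f_WT selects the work type; see the table below.
    params = {
        "keywords": query["keywords"],
        "location": query["location"],
        "f_WT": query.get("f_WT", ""),
        "start": start,  # pagination offset
    }
    base = "https://www.linkedin.com/jobs/search/"  # assumed endpoint
    return f"{base}?{urlencode(params)}"

print(build_search_url({"keywords": "software engineer",
                        "location": "San Francisco, CA",
                        "f_WT": "2"}))
```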
Additional configuration keys:

- `desc_words`: Keywords to exclude from job descriptions
- `title_include`: Required keywords in job titles
- `title_exclude`: Keywords to exclude from job titles
- `company_exclude`: Companies to filter out
- `languages`: Allowed job posting languages (e.g., `["en", "de"]`)
- `timespan`: Time range for job postings (`"r604800"`: past week; `"r86400"`: last 24 hours)
- `pages_to_scrape`: Number of pages per search query
- `rounds`: Number of scraping iterations
- `days_toscrape`: Maximum age of job postings to scrape
- `jobs_tablename`: Table for raw job data
- `filtered_jobs_tablename`: Table for filtered job data
- `db_path`: SQLite database file path
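As a rough illustration of how the keyword filters above might be applied, here is a minimal pandas sketch; the `title`, `description`, and `company` column names are assumptions:

```python
import pandas as pd

def apply_filters(jobs: pd.DataFrame, cfg: dict) -> pd.DataFrame:
    """Drop rows that violate the keyword filters from config.json.

    Column names (title, description, company) are assumed, not taken
    from the actual schema.
    """
    title = jobs["title"].fillna("").str.lower()
    desc = jobs["description"].fillna("").str.lower()

    keep = pd.Series(True, index=jobs.index)
    for word in cfg.get("desc_words", []):      # exclude by description
        keep &= ~desc.str.contains(word.lower(), regex=False)
    for word in cfg.get("title_exclude", []):   # exclude by title
        keep &= ~title.str.contains(word.lower(), regex=False)
    include = cfg.get("title_include", [])
    if include:                                 # require at least one match
        keep &= title.apply(lambda t: any(w.lower() in t for w in include))
    keep &= ~jobs["company"].isin(cfg.get("company_exclude", []))
    return jobs[keep]
```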
The `f_WT` search parameter controls the work type:

| Value | Description |
|-------|-------------|
| `0` | On-site positions |
| `1` | Hybrid positions |
| `2` | Remote positions |
| `""` | Any position type |
For enhanced reliability, configure proxy servers in your `config.json`:

```json
{
  "proxies": {
    "http": "http://username:password@proxy-server:port",
    "https": "https://username:password@proxy-server:port"
  }
}
```
Set up cron jobs for regular scraping:

```
# Run at the top of every hour, 9:00-17:00, Monday through Friday
0 9-17 * * 1-5 cd /path/to/Linked-in-Scraping && python main.py
```
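To keep a record of each run, redirect the output to a log file (the log path is illustrative):

```
0 9-17 * * 1-5 cd /path/to/Linked-in-Scraping && python main.py >> scraper.log 2>&1
```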
- Job Status Reversal: Add functionality to unhide and un-apply jobs
- Enhanced Sorting: Sort by database entry date for better job discovery
- Web Configuration: Frontend interface for search configuration
- Export Functionality: Export job data to various formats
- Advanced Analytics: Job application tracking and analytics
- Some job postings (~1-5%) may not appear in search results immediately due to LinkedIn's indexing delays
- Manual database modification required to reverse job status changes
- Configuration currently limited to JSON file editing
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
For major changes, please open an issue first to discuss the proposed changes.
This project is licensed under the MIT License - see the LICENSE file for details.
bigdata5911
Built with ❤️ for job seekers everywhere