# MSD Broken Link Checker (Daemon Version)

A command-line tool for checking broken links on MSD portal websites. This version is designed to run unattended as a scheduled job, for example from cron.

## Features

- Checks links within MSD portal pages using Playwright for browser automation
- Supports two modes:
  - Authenticated checking (requires login to the portal)
  - Public access checking (without login)
- Multithreaded processing for faster link checking
- Session management with automatic re-authentication when needed
- Email delivery of reports as the primary output method
- Detailed Excel report generation with formatted output
- Command-line interface for easy integration with scheduling tools
- Test mode for quick validation with limited URL sets
- Verbose output option for monitoring real-time progress
- Retry mechanism to avoid reporting slow or temporarily unavailable links as broken

## Requirements

- Python 3.7 or higher
- Required Python packages:
  - pandas
  - playwright
  - python-dotenv
  - requests
  - beautifulsoup4
  - xlsxwriter

## Installation

1. Clone the repository or download the script files

2. Install dependencies:
```bash
pip install pandas playwright python-dotenv requests beautifulsoup4 xlsxwriter
```

3. Install Playwright browsers:
```bash
python -m playwright install
```

4. Create a `.env` file in the same directory as the script with your credentials:
```
# MSD Portal credentials
MSD_USERNAME=your_username
MSD_PASSWORD=your_password

# Email settings
SMTP_SERVER=smtp.example.com
SMTP_PORT=587
EMAIL_SENDER=sender@example.com
# Username for SMTP authentication (if different from sender)
EMAIL_USERNAME=login_username
EMAIL_PASSWORD=your_email_password
EMAIL_RECIPIENTS=recipient1@example.com,recipient2@example.com
```

## Usage

The script can be run from the command line with various options:

```bash
python app_daemon.py --excel path/to/input.xlsx [options]
```

### Required Arguments

- `--excel`: Path to Excel file containing URLs to check (must have a 'URL' column)

### Optional Arguments

- `--output`: Path to save the output report (only needed for testing or if you want a local copy)
- `--without-login`: Run in public access check mode (no authentication)
- `--wait-time`: Wait time in seconds after page loads (default: 4)
- `--username`: Username for login (overrides .env file)
- `--password`: Password for login (overrides .env file)
- `--test`: Test mode - specify the number of URLs to process (e.g., `--test 5` to only check the first 5 URLs)
- `--verbose`, `-v`: Enable verbose output showing real-time progress in the console
- `--max-retries`: Maximum number of retries before a failing link is marked as broken (default: 2)
- `--retry-timeout-multiplier`: Multiplier for timeout on each retry (default: 2.0)

### Email Arguments

- `--send-email`: Explicitly enable sending the report via email (not needed if `--output` is omitted)
- `--smtp-server`: SMTP server address (overrides .env file)
- `--smtp-port`: SMTP server port (overrides .env file)
- `--email-sender`: Sender email address (overrides .env file)
- `--email-username`: Username for SMTP authentication (overrides .env file)
- `--email-password`: Sender email password (overrides .env file)
- `--email-recipients`: Comma-separated list of recipient email addresses (overrides .env file)
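
The override rule above (a command-line value wins over the `.env` value) and the delivery rule (email is sent when `--output` is omitted or `--send-email` is given) can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the helper names `build_parser`, `resolve`, and `should_send_email` are hypothetical, and only a few of the documented flags are declared.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a small subset of the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="app_daemon.py")
    parser.add_argument("--excel", required=True)
    parser.add_argument("--output")
    parser.add_argument("--send-email", action="store_true")
    parser.add_argument("--smtp-server")
    parser.add_argument("--smtp-port", type=int)
    return parser


def resolve(cli_value, env_name, env, default=None):
    """Prefer the command-line value, then the environment, then a default."""
    if cli_value is not None:
        return cli_value
    return env.get(env_name, default)


def should_send_email(args) -> bool:
    """Email is sent when requested explicitly or when no output path is given."""
    return args.send_email or args.output is None


# A plain dict stands in for os.environ after the .env file has been loaded.
env = {"SMTP_SERVER": "smtp.example.com", "SMTP_PORT": "587"}
args = build_parser().parse_args(
    ["--excel", "urls.xlsx", "--smtp-server", "smtp.internal"]
)

smtp_server = resolve(args.smtp_server, "SMTP_SERVER", env)   # taken from the CLI
smtp_port = int(resolve(args.smtp_port, "SMTP_PORT", env))    # taken from the env
print(smtp_server, smtp_port, should_send_email(args))
```

Environment values are always strings, which is why the port is converted with `int()` after resolution.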

### Examples

Basic usage with email delivery (using credentials from `.env` file):
```bash
python app_daemon.py --excel urls.xlsx
```

Test mode with verbose output:
```bash
python app_daemon.py --excel urls.xlsx --test 5 --verbose
```

More aggressive retry settings, to reduce the number of working links reported as broken:
```bash
python app_daemon.py --excel urls.xlsx --max-retries 3 --retry-timeout-multiplier 3.0
```

Public access checking with email delivery:
```bash
python app_daemon.py --excel urls.xlsx --without-login
```

Save report to local file (no email):
```bash
python app_daemon.py --excel urls.xlsx --output report.xlsx
```

Save report to local file AND send via email:
```bash
python app_daemon.py --excel urls.xlsx --output report.xlsx --send-email
```

Full custom configuration with different SMTP username:
```bash
python app_daemon.py --excel urls.xlsx \
  --username myuser --password mypass --wait-time 10 \
  --smtp-server smtp.gmail.com --smtp-port 587 \
  --email-sender your@gmail.com --email-username username@gmail.com \
  --email-password yourpassword \
  --email-recipients recipient1@example.com,recipient2@example.com
```

## Retry Mechanism

The script retries failed links before reporting them, so that working links are not wrongly marked as broken:

```bash
python app_daemon.py --excel urls.xlsx --max-retries 3 --retry-timeout-multiplier 2.5
```

This feature helps distinguish between genuinely broken links and those that might just need more time to load or are experiencing temporary issues:

- When a link initially fails, the script will retry checking it with progressively longer timeouts
- Each retry uses a timeout that's multiplied by the `--retry-timeout-multiplier` value
- A link is only marked as "Broken" after all retry attempts have failed

For example, with default settings (2 retries, 2.0x multiplier):
1. Initial check: 15-second timeout
2. First retry: 30-second timeout (15s × 2.0)
3. Second retry: 60-second timeout (30s × 2.0)

This reduces the number of links wrongly reported as broken, especially for:
- Slow servers or pages with heavy content
- Temporary network congestion
- Rate-limited resources
- Pages with complex loading processes
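
The timeout progression from the example above can be reproduced with a short sketch. The function names and the `check` callable are hypothetical, and the 15-second starting timeout is taken from the example rather than from the script's source.

```python
from typing import Callable, List


def retry_timeouts(base_timeout: float, max_retries: int, multiplier: float) -> List[float]:
    """Return the timeout for the initial check followed by each retry."""
    timeouts = [float(base_timeout)]
    for _ in range(max_retries):
        # Each retry multiplies the previous timeout.
        timeouts.append(timeouts[-1] * multiplier)
    return timeouts


def check_with_retries(
    check: Callable[[float], bool],
    base_timeout: float = 15.0,
    max_retries: int = 2,
    multiplier: float = 2.0,
) -> str:
    """Mark a link "Broken" only after every attempt has failed."""
    for timeout in retry_timeouts(base_timeout, max_retries, multiplier):
        if check(timeout):
            return "OK"
    return "Broken"


print(retry_timeouts(15, 2, 2.0))  # [15.0, 30.0, 60.0]

# A stand-in for a slow page that only responds within a 60-second timeout.
print(check_with_retries(lambda timeout: timeout >= 60))
```

Because the timeout grows geometrically, raising `--retry-timeout-multiplier` lengthens the worst-case time per broken link much faster than raising `--max-retries` by one.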

## Verbose Mode

The script includes a verbose mode that displays detailed progress information in the console:

```bash
python app_daemon.py --excel urls.xlsx --verbose
# Or using the short form:
python app_daemon.py --excel urls.xlsx -v
```

When verbose mode is enabled, you'll see:
- Progress indicators for each URL being processed
- Information about link extraction
- Real-time updates on link checking progress
- Status summaries for each parent URL
- Details about any broken or problematic links found
- Connection and authentication status updates
- Retry attempts for potentially broken links

This mode is useful for:
- Monitoring long-running processes
- Debugging issues with specific URLs
- Watching the progress in real-time
- Understanding what's happening at each step of the process

Verbose mode provides console output similar to the progress bars shown in the Streamlit version.

## Test Mode

The script includes a test mode that allows you to process only a subset of URLs for testing purposes:

```bash
python app_daemon.py --excel urls.xlsx --test 5
```

This will:
- Process only the first 5 URLs from the Excel file
- Mark reports and emails as test output so they are not mistaken for a full run
- Significantly reduce execution time for testing (the full run may take 1-2 hours)

This mode is useful for:
- Validating configuration changes
- Testing email delivery
- Verifying login functionality
- Debugging issues with specific URLs

Reports generated in test mode will be prefixed with "test_" in their filenames, and emails will have "[TEST]" in the subject line.
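
As a rough sketch of the behavior described in this section, the helper below limits the URL list and applies the test markers. The function name `apply_test_mode` is hypothetical, and the exact placement of "[TEST]" within the subject line is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def apply_test_mode(
    urls: Sequence[str],
    limit: Optional[int],
    filename: str,
    subject: str,
) -> Tuple[List[str], str, str]:
    """Return the URLs to process plus the report filename and email subject."""
    if limit is None:
        # Full run: nothing is truncated or renamed.
        return list(urls), filename, subject
    # Test run: keep the first `limit` URLs and mark the outputs.
    return list(urls[:limit]), f"test_{filename}", f"[TEST] {subject}"


all_urls = [f"https://profesionales.msd.es/page{n}" for n in range(1, 11)]
subset, report_name, email_subject = apply_test_mode(
    all_urls, 5, "report.xlsx", "Broken link report"
)
print(len(subset), report_name, email_subject)
```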

## Excel File Format

The input Excel file must have a column named 'URL' containing the URLs to check. Example format:

| URL                                |
|------------------------------------|
| https://profesionales.msd.es/page1 |
| https://profesionales.msd.es/page2 |
| https://profesionales.msd.es/page3 |
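
Before a long run it can help to validate the input. The sketch below is one possible pre-flight check, not part of the script: it assumes the sheet has already been loaded into a list of dictionaries (for example with pandas `DataFrame.to_dict("records")`), and the function name `extract_urls` is hypothetical.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def extract_urls(rows: Iterable[Dict[str, object]]) -> List[str]:
    """Collect well-formed http(s) URLs from the 'URL' column."""
    urls = []
    for index, row in enumerate(rows, start=1):
        if "URL" not in row:
            raise ValueError(f"row {index} has no 'URL' column")
        value = row["URL"]
        # Empty cells usually arrive as NaN (a float) or as blank strings.
        if not isinstance(value, str) or not value.strip():
            continue
        candidate = value.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


rows = [
    {"URL": "https://profesionales.msd.es/page1"},
    {"URL": "   "},
    {"URL": "not a url"},
    {"URL": float("nan")},
    {"URL": " https://profesionales.msd.es/page2 "},
]
print(extract_urls(rows))
```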

## Setting Up a Cronjob

To run the script automatically on a schedule, you can set up a cronjob:

1. Open your crontab file:
```bash
crontab -e
```

2. Add a line to schedule the script. For example, to run daily at 2 AM with email delivery:
```
0 2 * * * /usr/bin/python3 /full/path/to/app_daemon.py --excel /path/to/urls.xlsx
```

On Windows, use Task Scheduler instead of cron.

## Email Notifications

By default, the script sends the results by email when no `--output` parameter is provided. When email delivery is enabled, the script will:

1. Check for email configuration in the command-line arguments or the `.env` file
2. Generate a summary of the link checking results
3. Create an in-memory Excel report (no files saved to disk)
4. Attach the Excel report to the email
5. Send the email to the specified recipients

The email includes:
- A summary of URLs checked
- Counts of links by status (OK, Broken, etc.)
- Warnings about any broken links found
- The full Excel report as an attachment
- Information about retry settings used
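
The attach-and-send steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the script's implementation: the function names are hypothetical, the report bytes are a placeholder rather than a real workbook, and `send_report` assumes a server that accepts STARTTLS (typical for port 587).

```python
import io
import smtplib
from email.message import EmailMessage
from typing import Sequence

XLSX_SUBTYPE = "vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def build_report_email(
    sender: str,
    recipients: Sequence[str],
    subject: str,
    summary: str,
    report_bytes: bytes,
    filename: str = "report.xlsx",
) -> EmailMessage:
    """Build a message with the summary as body and the report attached."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = subject
    message.set_content(summary)
    message.add_attachment(
        report_bytes,
        maintype="application",
        subtype=XLSX_SUBTYPE,
        filename=filename,
    )
    return message


def send_report(message: EmailMessage, server: str, port: int,
                username: str, password: str) -> None:
    """Send the message over SMTP with STARTTLS (assumed, not confirmed)."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(message)


# The report lives only in memory; nothing is written to disk.
buffer = io.BytesIO()
buffer.write(b"placeholder bytes, not a real workbook")

message = build_report_email(
    "sender@example.com",
    ["recipient1@example.com", "recipient2@example.com"],
    "Broken link report",
    "URLs checked: 3\nOK: 40\nBroken: 2",
    buffer.getvalue(),
)
print([part.get_filename() for part in message.iter_attachments()])
```

Both xlsxwriter and pandas can write a workbook to a `BytesIO` object, which is what makes the no-files-on-disk approach possible.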

### Email Authentication

For many SMTP servers, the username for authentication might be different from the sender's email address:

- **EMAIL_SENDER**: The email address that will appear in the "From" field
- **EMAIL_USERNAME**: The username used for SMTP authentication (defaults to EMAIL_SENDER if not specified)

This is useful for services where you authenticate with a username that differs from your email address.
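
The fallback from `EMAIL_USERNAME` to `EMAIL_SENDER` can be sketched as below. The function name `smtp_settings` and the returned dictionary layout are hypothetical, and the default port of 587 is an assumption based on the `.env` example.

```python
from typing import Dict, Mapping


def smtp_settings(env: Mapping[str, str]) -> Dict[str, object]:
    """Read SMTP settings, falling back to the sender as the login name."""
    sender = env["EMAIL_SENDER"]
    recipients = [
        address.strip()
        for address in env.get("EMAIL_RECIPIENTS", "").split(",")
        if address.strip()
    ]
    return {
        "server": env["SMTP_SERVER"],
        "port": int(env.get("SMTP_PORT", "587")),  # assumed default
        "sender": sender,
        # An unset or empty EMAIL_USERNAME falls back to the sender address.
        "username": env.get("EMAIL_USERNAME") or sender,
        "password": env["EMAIL_PASSWORD"],
        "recipients": recipients,
    }


# A plain dict stands in for os.environ here.
settings = smtp_settings({
    "SMTP_SERVER": "smtp.example.com",
    "EMAIL_SENDER": "sender@example.com",
    "EMAIL_PASSWORD": "your_email_password",
    "EMAIL_RECIPIENTS": "recipient1@example.com, recipient2@example.com",
})
print(settings["username"], settings["recipients"])
```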

## Report Output

The script generates a formatted Excel report with the following columns:

- Parent URL: The URL from the input Excel file
- Child URL: Links found within the parent URL
- Link Type: Type of link (Article/Product or Image)
- Status: Status of the link (OK, Broken, Requires Authentication, etc.)
- Notes: Additional information
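
One way to picture a report row, and the per-status counts that appear in the email summary, is the sketch below. The `LinkResult` class and helper names are hypothetical; only the column names and example status values come from this README.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List

COLUMNS = ["Parent URL", "Child URL", "Link Type", "Status", "Notes"]


@dataclass
class LinkResult:
    """One row of the report, mirroring the documented columns."""
    parent_url: str
    child_url: str
    link_type: str
    status: str
    notes: str = ""


def to_row(result: LinkResult) -> List[str]:
    """Return the values in the same order as COLUMNS."""
    return [result.parent_url, result.child_url, result.link_type,
            result.status, result.notes]


def count_by_status(results: Iterable[LinkResult]) -> Counter:
    """Count links per status, as used for a summary."""
    return Counter(result.status for result in results)


results = [
    LinkResult("https://profesionales.msd.es/page1",
               "https://profesionales.msd.es/a", "Article/Product", "OK"),
    LinkResult("https://profesionales.msd.es/page1",
               "https://profesionales.msd.es/b.png", "Image", "Broken",
               "Failed after all retries"),
    LinkResult("https://profesionales.msd.es/page2",
               "https://profesionales.msd.es/c", "Article/Product", "OK"),
]
print(dict(zip(COLUMNS, to_row(results[1]))))
print(count_by_status(results))
```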

## Troubleshooting

- **Login Issues**: Ensure your credentials are correct. Check the log file for detailed error messages.
- **Timeout Errors**: Increase the `--wait-time` parameter if pages are slow to load.
- **Missing Libraries**: Make sure all dependencies are installed correctly.
- **Permission Errors**: When running as a cron job, make sure the user that owns the job can read the input Excel file and the `.env` file, and can write to the log file and any output location.
- **Email Issues**: Verify your SMTP settings and credentials. Some email providers require an app-specific password for SMTP logins; the older "Less secure app access" option may no longer be available.
- **Fallback Reports**: If email sending fails and no output file was specified, the script will save a report to the current directory as a fallback.
- **False Positive Broken Links**: Increase the `--max-retries` value and/or the `--retry-timeout-multiplier` to give links more chances and longer timeouts.
- **Long Execution Times**: For testing configuration changes, use the `--test` parameter to limit the number of URLs processed.
- **Progress Monitoring**: Use the `--verbose` flag to see real-time progress information.

## Logging

The script logs its activities to `link_checker_daemon.log`. Check this file for debugging information if you encounter issues.

## License

This tool is for internal use only.

## Contact

Contact the development team for support or feature requests.

## Deploying on Dokku

The key commands for automated deployment on a Dokku server are summarized below as a quick reference.

### Setup and Deployment

```bash
# Create the app
dokku apps:create msd-broken-link-checker

# Set resource limits
dokku resource:limit msd-broken-link-checker --memory 1G --memory-swap 2G

# Set up storage for logs
dokku storage:ensure-directory msd-broken-link-checker
dokku storage:mount msd-broken-link-checker /var/lib/dokku/data/storage/msd-broken-link-checker:/app/logs

# Deploy from GitHub
dokku git:sync --build msd-broken-link-checker https://github.com/josemedina/msd_broken_link_checker
```

### Environment Variables

Configure all required environment variables:

```bash
dokku config:set msd-broken-link-checker \
  MSD_USERNAME=your_username \
  MSD_PASSWORD=your_password \
  SMTP_SERVER=smtp.example.com \
  SMTP_PORT=587 \
  EMAIL_SENDER=sender@example.com \
  EMAIL_USERNAME=login_username \
  EMAIL_PASSWORD=your_email_password \
  EMAIL_RECIPIENTS=recipient1@example.com,recipient2@example.com
```

### Checking Status

Verify cron jobs are set up:

```bash
dokku enter msd-broken-link-checker
crontab -l
```

Check application logs:

```bash
dokku logs msd-broken-link-checker
```

### Forcing Rebuilds

If you encounter issues during deployment:

```bash
dokku cleanup
dokku ps:rebuild msd-broken-link-checker
```

For more detailed instructions, see the main project README.