# MSD Multiuser Homepage Broken Link Checker

## Overview

The `app_daemon_multiuser.py` script is a specialized version of the MSD Broken Link Checker designed to handle personalized content across multiple user accounts. It checks the homepage of MSD's professional portal (`https://profesionales.msd.es/`) with different user credentials to verify that each audience sees the correct content, then optionally checks additional URLs with default credentials.

## Features

- **Multi-user Homepage Verification**: Tests the homepage with multiple username:password combinations from a text file
- **Parallel Processing**: Uses multi-threading to process multiple users concurrently for faster execution
- **Flexible URL Input**: Accepts additional URLs from Excel (.xlsx/.xls) or text (.txt) files
- **External Link Handling**: Option to use browser navigation for external links to bypass popup confirmations
- **Link Caching**: In-memory cache prevents re-checking the same URL multiple times
- **Comprehensive Reporting**: Generates Excel reports with detailed link status information
- **Email Integration**: Automatically sends reports via email or saves locally

## Requirements

- Python 3.7+
- Required packages:
  - pandas
  - playwright
  - requests
  - beautifulsoup4
  - xlsxwriter
  - python-dotenv
  - smtplib (standard library, no installation needed)

## Installation

1. Install dependencies:
```bash
pip install pandas playwright requests beautifulsoup4 xlsxwriter python-dotenv
```

2. Install Playwright browsers:
```bash
playwright install
```

3. Copy `.env.example` to `.env` and set the environment variables described under Configuration.

## Configuration

Create a `.env` file in the project root with the following variables:

```env
# Default credentials for additional URL checks
MSD_USERNAME=your_default_username
MSD_PASSWORD=your_default_password

# Email configuration (optional, for sending reports)
SMTP_SERVER=smtp.yourserver.com
SMTP_PORT=587
EMAIL_SENDER=your_email@domain.com
EMAIL_USERNAME=your_smtp_username
EMAIL_PASSWORD=your_smtp_password
EMAIL_RECIPIENTS=user1@domain.com,user2@domain.com
```
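
As an illustration of how these variables might be consumed, the sketch below collects them into a dictionary. The function name `read_settings` and the fallback port of 587 are assumptions for this example, not taken from the script. In the real script, `load_dotenv()` from python-dotenv would normally be called first so that the `.env` values appear in `os.environ`.

```python
import os
from typing import Mapping, Optional


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the documented variables from an environment-like mapping."""
    env = os.environ if env is None else env
    # EMAIL_RECIPIENTS is a comma-separated list; drop empty entries.
    recipients = [
        address.strip()
        for address in env.get("EMAIL_RECIPIENTS", "").split(",")
        if address.strip()
    ]
    return {
        "username": env.get("MSD_USERNAME"),
        "password": env.get("MSD_PASSWORD"),
        "smtp_server": env.get("SMTP_SERVER"),
        # Assumption: fall back to 587 when SMTP_PORT is not set.
        "smtp_port": int(env.get("SMTP_PORT", "587")),
        "email_sender": env.get("EMAIL_SENDER"),
        "email_username": env.get("EMAIL_USERNAME"),
        "email_password": env.get("EMAIL_PASSWORD"),
        "email_recipients": recipients,
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.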

## Usage

### Basic Command

```bash
python app/app_daemon_multiuser.py --user-pass-file users.txt --additional-urls urls.xlsx --output report.xlsx --verbose
```

### Full Command with All Options

```bash
python app/app_daemon_multiuser.py \
  --user-pass-file PortalWPAs.txt \
  --additional-urls URLlist.txt \
  --output report.xlsx \
  --default-username admin \
  --default-password secret \
  --wait-time 5 \
  --max-retries 3 \
  --retry-timeout-multiplier 2.5 \
  --use-browser-for-external \
  --verbose \
  --send-email
```

## Arguments

### Required Arguments

- `--user-pass-file`: Path to text file containing username:password pairs (one per line, format: `username:password`)

### Optional Arguments

- `--additional-urls`: Path to Excel (.xlsx/.xls) or text (.txt) file with additional URLs to check
- `--output`: Path to save the Excel report (if omitted, report is sent via email)
- `--without-login`: Check additional URLs without authentication
- `--wait-time`: Seconds to wait after page loads (default: 4)
- `--default-username`: Override default username from `.env`
- `--default-password`: Override default password from `.env`
- `--test`: Limit processing to N URLs for testing
- `--verbose`: Enable detailed console output
- `--max-retries`: Maximum retry attempts for failed links (default: 2)
- `--retry-timeout-multiplier`: Timeout multiplier for retries (default: 2.0)
- `--use-browser-for-external`: Use browser navigation for external links
- `--send-email`: Force email sending (even with `--output`)
- Email arguments: `--smtp-server`, `--smtp-port`, `--email-sender`, etc.
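
The arguments above could be declared with `argparse` roughly as follows. This is a sketch based only on the list above: the function name `build_parser`, the argument types, and the help strings are assumptions, and the email arguments are omitted because their full set is not listed here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command-line options (email options omitted)."""
    parser = argparse.ArgumentParser(
        description="Multiuser homepage broken link checker"
    )
    parser.add_argument("--user-pass-file", required=True,
                        help="text file with one username:password pair per line")
    parser.add_argument("--additional-urls",
                        help="Excel or text file with additional URLs")
    parser.add_argument("--output", help="path of the Excel report")
    parser.add_argument("--without-login", action="store_true")
    # Assumption: wait time and multiplier accept fractional values.
    parser.add_argument("--wait-time", type=float, default=4)
    parser.add_argument("--default-username")
    parser.add_argument("--default-password")
    parser.add_argument("--test", type=int, metavar="N",
                        help="limit processing to N URLs")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--max-retries", type=int, default=2)
    parser.add_argument("--retry-timeout-multiplier", type=float, default=2.0)
    parser.add_argument("--use-browser-for-external", action="store_true")
    parser.add_argument("--send-email", action="store_true")
    return parser
```

For example, `build_parser().parse_args(["--user-pass-file", "users.txt", "--test", "5"])` yields a namespace with `test` set to 5 and the documented defaults for everything else.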

## Input Files

### User Credentials File

Text file with one username:password pair per line:
```
user1@msd.com:password1
user2@msd.com:password2
```
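
A minimal sketch of parsing this format is shown below. The name `parse_credentials` is hypothetical, and two behaviours are assumptions rather than documented facts: blank lines are skipped, and only the first colon separates the fields so that passwords may themselves contain colons.

```python
from typing import Iterable, List, Tuple


def parse_credentials(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn username:password lines into (username, password) tuples."""
    pairs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Split on the first colon only, so the password may contain ':'.
        username, separator, password = line.partition(":")
        if not separator or not username or not password:
            # Report the line number only, never the credential itself.
            raise ValueError(f"line {number}: expected username:password")
        pairs.append((username, password))
    return pairs
```

Because the function accepts any iterable of strings, an open file handle can be passed to it directly.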

### URL List Files

- **Excel Format**: Column 'URL' containing URLs
- **Text Format**: One URL per line
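
Both formats could be loaded with a helper along these lines. This is a sketch, not the script's code: `load_urls` is a hypothetical name, and skipping empty cells and blank lines is an assumption. Note that `pandas.read_excel` needs an Excel engine such as `openpyxl` to be installed for `.xlsx` files.

```python
from pathlib import Path
from typing import List


def load_urls(path: str) -> List[str]:
    """Read URLs from an Excel file (column 'URL') or a plain text file."""
    suffix = Path(path).suffix.lower()
    if suffix in (".xlsx", ".xls"):
        # Imported lazily so that text-only use needs no extra packages.
        import pandas as pd

        column = pd.read_excel(path)["URL"].dropna().astype(str)
        return [value.strip() for value in column if value.strip()]
    if suffix == ".txt":
        with open(path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    raise ValueError(f"Unsupported URL list format: {suffix!r}")
```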

## Output

### Excel Report Structure

The report contains the following columns:
- **User**: Username or 'Default' for additional URLs
- **Parent URL**: The page where the link was found
- **Child URL**: The checked link
- **Link Type**: 'Article/Product' or 'Image'
- **Status**: OK, Broken, Requires Authentication, etc.
- **Notes**: Additional information

### Status Values

- **OK**: Link accessible
- **Broken**: Link returns error status
- **Requires Authentication**: Link redirects to login
- **Login Failed**: User authentication failed
- **No Links Found**: Page has no checkable links
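
To make the link-level statuses concrete, the sketch below maps a response to one of the first three values. It is illustrative only: `classify_response` and `LOGIN_PATH_MARKER` are hypothetical names, and detecting a login redirect by looking for "login" in the final URL path is an assumption, since the portal's actual login URL is not documented here.

```python
from urllib.parse import urlparse

# Hypothetical marker; the real login URL pattern may differ.
LOGIN_PATH_MARKER = "login"


def classify_response(status_code: int, final_url: str) -> str:
    """Map an HTTP status code and final URL to a report status."""
    # A redirect that ends on the login page means the link needs a session.
    if LOGIN_PATH_MARKER in urlparse(final_url).path.lower():
        return "Requires Authentication"
    # Assumption: any 4xx or 5xx response counts as an error status.
    if status_code >= 400:
        return "Broken"
    return "OK"
```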

## Examples

### Check Homepages Only

```bash
python app/app_daemon_multiuser.py --user-pass-file users.txt --verbose
```

### Check Homepages + Additional URLs

```bash
python app/app_daemon_multiuser.py \
  --user-pass-file users.txt \
  --additional-urls urls.xlsx \
  --use-browser-for-external \
  --output results.xlsx
```

### Test Mode

```bash
python app/app_daemon_multiuser.py \
  --user-pass-file users.txt \
  --additional-urls urls.txt \
  --test 5 \
  --verbose
```

## Developer Documentation

### Code Structure

- `login_to_portal()`: Handles MSD portal authentication
- `extract_links()`: Scrapes links from authenticated pages
- `compile_results_for_homepage()`: Multi-threaded homepage checking
- `compile_results()`: Processes additional URLs
- `check_link_status()`: Checks individual link status with caching
- `check_link_status_browser()`: Browser-based external link checking
- `generate_report()`: Creates Excel output
- `send_email()`: Email delivery functionality

### Key Classes/Functions

#### Global Variables
- `checked_links`: Thread-safe cache for link statuses
- `checked_links_lock`: Threading lock for cache access

#### Threading
- Homepage checks: 5 concurrent threads per user batch
- Link checks: 5 concurrent threads per page
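
A pool of five workers, as described above, might be driven like this. The sketch uses `concurrent.futures.ThreadPoolExecutor`; the helper name `check_all` and the way exceptions are turned into a status string are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def check_all(links: Iterable[str],
              check: Callable[[str], str],
              max_workers: int = 5) -> Dict[str, str]:
    """Run `check` for every link on a bounded thread pool."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its link so results can be labelled.
        futures = {pool.submit(check, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                results[link] = future.result()
            except Exception as exc:
                # Assumption: a crashed check is reported as Broken.
                results[link] = f"Broken ({type(exc).__name__})"
    return results
```

Collecting results in the calling thread, rather than inside the workers, means the `results` dictionary needs no lock of its own.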

#### Caching Strategy
- Links are cached globally to avoid redundant checks
- Thread-safe: every read and write goes through `checked_links_lock`
- Persists for script execution duration
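
The caching strategy can be sketched as follows, assuming `checked_links` is a plain dictionary guarded by a standard `threading.Lock`. The wrapper name `cached_status` and the `check` callback are hypothetical and stand in for whatever the script does internally.

```python
import threading
from typing import Callable, Dict

checked_links: Dict[str, str] = {}
checked_links_lock = threading.Lock()


def cached_status(url: str, check: Callable[[str], str]) -> str:
    """Return the cached status for `url`, running `check` on a miss."""
    with checked_links_lock:
        if url in checked_links:
            return checked_links[url]
    # Run the slow check outside the lock so other threads are not blocked.
    status = check(url)
    with checked_links_lock:
        # setdefault keeps the first stored result if another thread won the race.
        return checked_links.setdefault(url, status)
```

With this design, two threads that miss on the same URL at the same moment may both perform the check. Holding the lock during the check would prevent that, at the cost of serializing all link checks.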

### Link Checking Logic

1. **Internal Links** (MSD domain): HTTP HEAD requests with session cookies
2. **External Links** (without flag): HTTP HEAD requests without cookies
3. **External Links** (with `--use-browser-for-external`): Browser navigation
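
The three cases above amount to a small dispatch decision, sketched here. The function name, the returned labels, and the choice of `msd.es` as the suffix that identifies the MSD domain are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Assumption: hosts at or under this domain count as internal.
INTERNAL_DOMAIN = "msd.es"


def choose_strategy(url: str, use_browser_for_external: bool) -> str:
    """Pick how a link should be checked, following the three cases above."""
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself or a true subdomain, not a look-alike suffix.
    is_internal = host == INTERNAL_DOMAIN or host.endswith("." + INTERNAL_DOMAIN)
    if is_internal:
        return "head_with_cookies"
    if use_browser_for_external:
        return "browser"
    return "head_without_cookies"
```

Comparing against `"." + INTERNAL_DOMAIN` avoids treating a host such as `notmsd.es` as internal.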

### Error Handling

- Login failures are logged, and the script moves on to the next user
- Network timeouts trigger retries with exponential backoff
- Browser crashes are caught and marked as Broken
- All exceptions are logged with stack traces
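
The retry behaviour can be illustrated with a loop in which each attempt gets a longer timeout, scaled by `--retry-timeout-multiplier`. This is a sketch under assumptions: the name `check_with_retries`, the 10-second base timeout, and the `check` callback are hypothetical. It catches the built-in `TimeoutError`; code based on `requests` or Playwright would catch that library's own timeout exception instead.

```python
from typing import Callable


def check_with_retries(check: Callable[[float], str],
                       base_timeout: float = 10.0,
                       max_retries: int = 2,
                       timeout_multiplier: float = 2.0) -> str:
    """Call `check(timeout)`, lengthening the timeout after each failure."""
    timeout = base_timeout
    # One initial attempt plus up to `max_retries` retries.
    for _ in range(max_retries + 1):
        try:
            return check(timeout)
        except TimeoutError:
            timeout *= timeout_multiplier
    # Assumption: a link that never answers in time is reported as Broken.
    return "Broken"
```

With the defaults shown, the attempts use timeouts of 10, 20, and 40 seconds.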

### Logging

- File: `link_checker_daemon_multiuser.log`
- Level: INFO
- Includes timestamps, user actions, and errors
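
A logging setup matching this description might look like the following. The logger name and the format string are assumptions; only the file name and the INFO level come from the list above.

```python
import logging


def configure_logging(
    log_path: str = "link_checker_daemon_multiuser.log",
) -> logging.Logger:
    """Create a file logger at INFO level with timestamped records."""
    # Hypothetical logger name chosen for this example.
    logger = logging.getLogger("link_checker_multiuser")
    logger.setLevel(logging.INFO)
    # Avoid attaching a second handler if this function is called again.
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(threadName)s %(message)s"
        ))
        logger.addHandler(handler)
    return logger
```

Including `%(threadName)s` in the format helps tell apart messages from the concurrent user and link checks.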

### Performance Considerations

- Multi-threading for concurrent user/homepage processing
- Link caching to avoid duplicate checks
- Browser reuse is not implemented: each external check launches a fresh browser
- Memory usage scales with cache size and concurrent threads

### Security Notes

- Credentials stored in .env (gitignored)
- Temporary browser storage files cleaned up
- No sensitive data logged
- HTTPS enforced for all communications

## Troubleshooting

### Common Issues

1. **Login Failures**: Check credentials and network connectivity
2. **Browser Not Found**: Run `playwright install`
3. **Email Not Sent**: Verify SMTP settings in .env
4. **External Links Broken**: Try `--use-browser-for-external`
5. **Memory Issues**: Reduce thread counts or disable caching

### Debug Mode

Use `--verbose` and check the log file for detailed information.

### Performance Tuning

- Adjust thread counts in ThreadPoolExecutor calls
- Modify retry counts and timeouts
- Use caching strategically for large URL sets

## Changelog

- **v1.0**: Initial multiuser version
  - Added parallel processing
  - Implemented link caching
  - Added browser-based external link checking
  - Enhanced error handling and logging

## License

[Add license information here]
