# MSD Portal Screenshot Capture

A Streamlit application that captures full-page screenshots of MSD Portal pages and converts them to PDF format. This tool is designed to work with the profesionales.msd.es portal, which requires authentication.

## Features

- Upload URLs from an Excel file
- Automatic login to the MSD Portal
- Full-page screenshot capture
- PDF conversion of screenshots
- Progress tracking in the Streamlit interface
- Comprehensive error logging
- Automatic cookie acceptance handling
- Compatible with Python 3.13

## Prerequisites

- Python 3.13 (recommended) or Python 3.8+
- Poppler (required for PDF conversion)
  - On macOS: `brew install poppler`
  - On Ubuntu/Debian: `sudo apt-get install poppler-utils`
  - On Windows: Download and install from [poppler releases](http://blog.alivate.com.au/poppler-windows/)

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd msd_broken_link_checker
```

2. Install Python dependencies:
```bash
pip install -r requirements.txt
```

3. Install Playwright browsers:
```bash
playwright install
```

## Usage

1. Prepare an Excel file with a column named "URL" containing the list of URLs to capture.

2. Run the Streamlit application:
```bash
streamlit run app/app_screenshot.py
```

3. In the web interface:
   - Upload your Excel file
   - Enter your MSD Portal credentials
   - Click "Start Capture"

4. The application will:
   - Process each URL in the Excel file
   - Take full-page screenshots
   - Convert them to PDF format
   - Save them in the "PDF" folder
   - Show progress and status updates

## Output

- PDFs are saved in the "PDF" folder
- Filenames are generated from page titles (converted to lowercase with underscores)
- A log file (`screenshot_capture.log`) tracks all operations and errors
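
The filename rule above (lowercase with underscores) could be implemented roughly as follows. This is a minimal sketch, not the application's actual code: the function name `title_to_filename`, the accent stripping, and the `untitled` fallback are assumptions made for illustration.

```python
import re
import unicodedata


def title_to_filename(title: str, extension: str = ".pdf") -> str:
    """Turn a page title into a lowercase, underscore-separated filename."""
    # Assumption: strip accents so Spanish titles become plain ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_title.lower()).strip("_")
    # Assumption: fall back to a fixed name when nothing usable remains.
    return (slug or "untitled") + extension


print(title_to_filename("Información de Producto | MSD"))
# informacion_de_producto_msd.pdf
```

Two pages with the same title would map to the same filename under this scheme, so a real implementation may need a suffix or counter to avoid overwriting files.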

## Error Handling

- The application logs all operations and errors to `screenshot_capture.log`
- Failed captures are logged but don't stop the entire process
- Session management and cookie handling are automated
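
The "log and continue" behaviour described above can be sketched like this. The names `configure_logging`, `capture_all`, and `capture_one` are hypothetical; the real capture routine is not shown in this README, so it is passed in as a callable.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("screenshot_capture")


def configure_logging(log_path: str = "screenshot_capture.log") -> None:
    """Send log records to the file named in this README."""
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def capture_all(urls: Iterable[str], capture_one: Callable[[str], None]) -> List[str]:
    """Run capture_one on each URL and return the URLs that failed."""
    failed = []
    for url in urls:
        try:
            capture_one(url)
            logger.info("Captured %s", url)
        except Exception:
            # Record the traceback, then move on to the next URL.
            logger.exception("Capture failed for %s", url)
            failed.append(url)
    return failed
```

Returning the failed URLs makes it straightforward to show a summary in the interface or retry only the failures.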

## Dependencies

- streamlit>=1.24.0
- pandas>=1.5.0
- playwright>=1.40.0
- Pillow>=9.5.0
- pdf2image>=1.16.3
- openpyxl>=3.1.0

## Notes

- The application runs in headless mode by default
- Make sure you have valid MSD Portal credentials
- The Excel file must contain a column named "URL"
- All URLs should be from profesionales.msd.es domain
- Python 3.13 offers an optional, experimental free-threaded build in which the Global Interpreter Lock (GIL) can be disabled; this may improve performance in threaded operations, but the default build keeps the GIL
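
The domain requirement above can be checked before a run starts. The following is a sketch using only the standard library; `split_urls` and `ALLOWED_HOST` are hypothetical names, and the exact-host match (no subdomains) is an assumption.

```python
from urllib.parse import urlparse

# Assumption: only this exact host is accepted, not its subdomains.
ALLOWED_HOST = "profesionales.msd.es"


def split_urls(urls):
    """Separate URLs on the allowed host from everything else."""
    valid, rejected = [], []
    for raw in urls:
        url = str(raw).strip()
        try:
            parsed = urlparse(url)
        except ValueError:
            # urlparse raises on some malformed inputs, e.g. a broken IPv6 host.
            rejected.append(url)
            continue
        if parsed.scheme in ("http", "https") and parsed.hostname == ALLOWED_HOST:
            valid.append(url)
        else:
            rejected.append(url)
    return valid, rejected
```

With pandas, the input list could come from `pd.read_excel(path)["URL"].dropna()`, which raises a `KeyError` if the "URL" column is missing.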

## Python 3.13 Compatibility

This application has been updated to support Python 3.13. Notable considerations:

- Python 3.13 removes several deprecated modules (PEP 594), but none that affect this application
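- Python 3.13 does not remove the Global Interpreter Lock (GIL) by default. It adds an optional, experimental free-threaded build (PEP 703) in which the GIL can be disabled, which may improve performance for our multithreaded link checking if every dependency supports that build
- On the default build, the application may still benefit from general interpreter performance improvements in Python 3.13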
- All dependencies have been tested and verified to work with Python 3.13
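
To see which kind of interpreter is actually running, the check below may help. It is a small sketch: `gil_status` is a hypothetical helper, while `sys._is_gil_enabled()` (added in Python 3.13) and the `Py_GIL_DISABLED` build variable are part of CPython.

```python
import sys
import sysconfig


def gil_status() -> str:
    """Describe whether this interpreter can, and does, run without the GIL."""
    # Py_GIL_DISABLED is 1 only for free-threaded builds (e.g. python3.13t).
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    if not free_threaded_build:
        return "standard build: GIL enabled"
    # sys._is_gil_enabled() exists from Python 3.13 onward.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    state = "re-enabled at runtime" if gil_enabled else "disabled"
    return "free-threaded build: GIL " + state


print(gil_status())
```

On a free-threaded build, importing an extension module that does not declare free-threading support re-enables the GIL at runtime, which is why the sketch reports both the build type and the current state.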

## Troubleshooting

1. If PDF conversion fails:
   - Verify Poppler is installed correctly
   - Check that the directory containing the Poppler binaries is on your system's PATH

2. If login fails:
   - Verify credentials are correct
   - Check if the portal is accessible
   - Review the log file for specific error messages

3. If screenshots are incomplete:
   - Check your internet connection
   - Verify the page has loaded completely
   - Review the log file for any timeout errors
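
For the first item, a quick way to confirm that Poppler is reachable from Python is to look for its command-line tools on the PATH. This is a sketch; `find_poppler` is a hypothetical helper, and `pdftoppm` and `pdfinfo` are the Poppler tools that pdf2image calls.

```python
import shutil


def find_poppler(tools=("pdftoppm", "pdfinfo")):
    """Map each Poppler tool name to its full path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_poppler().items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

If either tool is reported as not found, Poppler is either not installed or its directory is missing from the PATH seen by the Python process.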

## Deployment on Dokku

This application is designed to deploy easily on a Dokku server. Follow these steps to set up the application:

### Prerequisites

- A Dokku server
- Git installed on your local machine
- Access to the GitHub repository

### Deployment Steps

1. On your Dokku server, create a new app:

```bash
dokku apps:create msd-broken-link-checker
```

2. Set resource limits to ensure sufficient memory and swap space:

```bash
dokku resource:limit msd-broken-link-checker --memory 1G --memory-swap 2G
```

3. Set up the required environment variables:

```bash
dokku config:set msd-broken-link-checker \
  PYTHON_VERSION=3.13.0 \
  MSD_USERNAME=your_username \
  MSD_PASSWORD=your_password \
  SMTP_SERVER=smtp.example.com \
  SMTP_PORT=587 \
  EMAIL_SENDER=sender@example.com \
  EMAIL_USERNAME=login_username \
  EMAIL_PASSWORD=your_email_password \
  EMAIL_RECIPIENTS=recipient1@example.com,recipient2@example.com
```

4. Create a persistent storage volume for logs:

```bash
dokku storage:ensure-directory msd-broken-link-checker
dokku storage:mount msd-broken-link-checker /var/lib/dokku/data/storage/msd-broken-link-checker:/app/logs
```

5. Deploy the application from GitHub (recommended method):

```bash
# On your Dokku server
dokku git:sync --build msd-broken-link-checker https://github.com/josemedina/msd_broken_link_checker
```

Alternative method using a local clone:

```bash
# On your local machine
git clone https://github.com/josemedina/msd_broken_link_checker.git
cd msd_broken_link_checker
git remote add dokku dokku@your-dokku-server:msd-broken-link-checker
git push dokku main
```

6. Verify the cron jobs are set up properly:

```bash
dokku enter msd-broken-link-checker
crontab -l
```

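You should see two cron jobs scheduled to run at 5:00 AM and 5:30 AM on Monday, Wednesday, and Friday. If the container's crontab is empty, the jobs are expected to run from the Dokku host instead, as described in the Cron Jobs section.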
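
The environment variables set in step 3 can be validated at startup so that a missing value fails fast instead of surfacing mid-run. This README does not show how the application reads them, so the following is only a sketch; `load_settings` and `REQUIRED` are hypothetical names.

```python
import os

# Variable names taken from the `dokku config:set` command in step 3.
REQUIRED = (
    "MSD_USERNAME", "MSD_PASSWORD",
    "SMTP_SERVER", "SMTP_PORT",
    "EMAIL_SENDER", "EMAIL_USERNAME", "EMAIL_PASSWORD", "EMAIL_RECIPIENTS",
)


def load_settings(env=None):
    """Read required settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    # Convert the port to an integer and the recipients to a list.
    settings["SMTP_PORT"] = int(settings["SMTP_PORT"])
    settings["EMAIL_RECIPIENTS"] = [
        address.strip()
        for address in settings["EMAIL_RECIPIENTS"].split(",")
        if address.strip()
    ]
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to test without touching the real environment.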

### Updating the Application

To update the application:

```bash
# On your Dokku server (recommended method)
dokku git:sync --build msd-broken-link-checker https://github.com/josemedina/msd_broken_link_checker
```

Alternative method using a local clone:

```bash
# On your local machine, in the repository directory
git pull origin main
git push dokku main
```

### Checking Logs

To check the application logs:

```bash
dokku logs msd-broken-link-checker
```

To check the link checker logs:

```bash
dokku enter msd-broken-link-checker
cat logs/link_checker_daemon.log
```

### Troubleshooting

- If the cron jobs are not running, check if cron is running in the container:
  ```bash
  dokku enter msd-broken-link-checker
  service cron status
  ```

- If email delivery is not working, check the SMTP settings and test them manually:
  ```bash
  dokku enter msd-broken-link-checker
  cd app
  python -c "import smtplib; server = smtplib.SMTP('your-smtp-server', 587); server.starttls(); server.login('your-username', 'your-password'); server.quit()"
  ```

- If you encounter Python 3.13 compatibility issues:
  ```bash
  dokku enter msd-broken-link-checker
  python --version  # Verify Python version
  pip list  # Check installed packages
  ```

- If the deployment fails due to disk space issues:
  ```bash
  dokku cleanup
  dokku ps:rebuild msd-broken-link-checker
  ```

## Cron Jobs

**Important**: The application is designed to run scheduled link checking via cron jobs. For Dokku deployments, the recommended approach is to use server-side cron rather than container-managed cron.

### Setting Up Server-Side Cron Jobs (One-Time Setup)

The cron jobs need to be installed on the Dokku host server, not inside the container. Follow these steps:

1. After deploying your application, SSH into your Dokku server.

2. Enter the container to view the crontab file:
   ```bash
   # Enter the container
   dokku enter msd-broken-link-checker

   # Once inside the container, view the crontab file
   cat dokku-crontab
   ```

3. Exit the container (type `exit` or press Ctrl+D).

4. On the Dokku host (not inside the container), create the crontab file:
   ```bash
   # As the dokku user on the host server (NOT inside the container)
   cat > /tmp/msd-crontab << 'EOF'
   # MSD Broken Link Checker server cron jobs
   MAILTO="root@localhost"
   PATH=/usr/local/bin:/usr/bin:/bin
   SHELL=/bin/bash

   # Run the link checker with URLs.xlsx (Monday, Wednesday, Friday at 5:00 AM)
   0 5 * * 1,3,5 dokku dokku --rm run msd-broken-link-checker env bash -c "cd /app/app && python app_daemon.py --excel URLs.xlsx --max-retries 3 --retry-timeout-multiplier 2.0"

   # Run the link checker with URLs_without_login.xlsx (Monday, Wednesday, Friday at 5:30 AM)
   30 5 * * 1,3,5 dokku dokku --rm run msd-broken-link-checker env bash -c "cd /app/app && python app_daemon.py --excel URLs_without_login.xlsx --without-login"
   EOF

   # Use sudo to install it (this runs on the host, not in the container)
   sudo cp /tmp/msd-crontab /etc/cron.d/msd-broken-link-checker
   sudo chmod 644 /etc/cron.d/msd-broken-link-checker
   sudo chown root:root /etc/cron.d/msd-broken-link-checker
   ```

5. Test that the cron jobs work:
   ```bash
   # On the Dokku host (not in the container)
   sudo dokku run msd-broken-link-checker env bash -c "cd /app/app && python app_daemon.py --excel URLs.xlsx --test 1 --max-retries 1"
   ```
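
The cron commands above pass `--max-retries 3 --retry-timeout-multiplier 2.0` to `app_daemon.py`. The daemon's retry logic is not shown in this README; one plausible reading is that each retry multiplies the previous timeout by the given factor. The sketch below illustrates that interpretation only. The function `retry_timeouts` and the 30-second base timeout are assumptions.

```python
def retry_timeouts(base_timeout: float, max_retries: int, multiplier: float):
    """Timeout for the first attempt, followed by one timeout per retry."""
    return [base_timeout * multiplier ** attempt for attempt in range(max_retries + 1)]


# Hypothetical 30-second base timeout with the flags used in the cron job.
print(retry_timeouts(30.0, 3, 2.0))
# [30.0, 60.0, 120.0, 240.0]
```

Under this reading, a slow page gets progressively more time before it is reported as broken, at the cost of a longer worst-case run.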

### Important Notes

- This setup is only needed **once** after initial deployment.
- You do not need to reinstall the cron jobs after each deployment.
- You only need to reinstall if you change the cron schedule or move to a new server.
- The cron jobs run on the host and automatically use the latest version of your app, because each `dokku run` starts a fresh container from the currently deployed image.