# MSD Product Inventory Script

## Overview

`app_product_inventory.py` is a Python script that inventories the products displayed on the MSD professionals site (profesionales.msd.es) at three levels:

1. **Homes** - The same homepage URL, which shows different content per user/audience
2. **Therapeutic Areas** - Parent pages like `/areas_terapeuticas/oncologia/`
3. **Pathologies** - Child pages like `/areas_terapeuticas/oncologia/cancer-de-cervix/`

The script uses Playwright for browser automation and extracts product information from the HTML structure.

## Features

- **Multi-user Home Inventory**: Loads the homepage with different user credentials, each representing a different audience (e.g., WPAAnatomiaPatologica, WPAOncologia)
- **Automatic Audience Extraction**: Extracts audience name from usernames like `WPVipSpain+WPAAnatomiaPatologica@msd.com`
- **URL Structure Parsing**: Automatically identifies whether a URL is a Therapeutic Area or Pathology
- **Product Extraction**: Scrapes products from containers with classes such as `mhh-mcn-columns` and `mhh-mcn-v1-columns`, taking each product name from its `/productos/` URL
- **JSON Output**: Generates a structured JSON report with a timestamp

## Requirements

- Python 3.7+
- playwright
- beautifulsoup4
- pandas
- python-dotenv

Install dependencies:
```bash
pip install playwright beautifulsoup4 pandas python-dotenv
playwright install chromium
```

## Input Files

### 1. User/Password File (e.g., `PortalWPAs.txt`)

Contains username:password pairs, one per line:

```
WPVipSpain+WPAAnatomiaPatologica@msd.com:Portal1234*
WPVipSpain+WPAOncologia@msd.com:Portal1234*
WPVipSpain+WPAGinecologia@msd.com:Portal1234*
```
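
A minimal sketch of how a file in this format could be parsed. The function name `parse_credentials` is hypothetical, and skipping blank or malformed lines is an assumption; the actual script may handle them differently.

```python
from typing import List, Tuple


def parse_credentials(text: str) -> List[Tuple[str, str]]:
    """Parse `username:password` lines into (username, password) tuples."""
    pairs = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines (assumption)
        # Split on the first colon only, so a password may itself contain ':'.
        username, sep, password = line.partition(":")
        if not sep or not username or not password:
            continue  # skip malformed lines (assumption)
        pairs.append((username, password))
    return pairs
```

Splitting on the first colon only keeps the whole remainder of the line as the password.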

### 2. URLs File (e.g., `URLlist.txt`)

Contains therapeutic area and pathology URLs, one per line:

```
https://profesionales.msd.es/areas_terapeuticas/oncologia/
https://profesionales.msd.es/areas_terapeuticas/oncologia/cancer-de-cervix/
https://profesionales.msd.es/areas_terapeuticas/oncologia/cancer-de-colon-y-recto/
https://profesionales.msd.es/areas_terapeuticas/enfermedades-inmunomediadas/
```

## Usage

### Basic Usage

```bash
python app_product_inventory.py \
  --user-pass-file PortalWPAs.txt \
  --urls-file URLlist.txt \
  --default-username "your_default_user@msd.com" \
  --default-password "your_password"
```

### With Environment Variables

Create a `.env` file:
```
MSD_USERNAME=your_default_user@msd.com
MSD_PASSWORD=your_password
```

Then run:
```bash
python app_product_inventory.py \
  --user-pass-file PortalWPAs.txt \
  --urls-file URLlist.txt
```
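
The precedence between command-line values and `.env` values could be resolved roughly as follows. This is a sketch, not the script's actual code: the function name `resolve_default_credentials` is hypothetical, and it assumes the `.env` file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`).

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_default_credentials(
    cli_username: Optional[str],
    cli_password: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (username, password), preferring CLI values over the environment."""
    env = os.environ if env is None else env
    # A CLI value wins when given; otherwise fall back to MSD_USERNAME / MSD_PASSWORD.
    username = cli_username or env.get("MSD_USERNAME")
    password = cli_password or env.get("MSD_PASSWORD")
    if not username or not password:
        # Failing fast on missing credentials is an assumption of this sketch.
        raise ValueError("Default credentials are missing from both CLI and environment")
    return username, password
```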

### With Custom Output Path

```bash
python app_product_inventory.py \
  --user-pass-file PortalWPAs.txt \
  --urls-file URLlist.txt \
  --output reports/product_inventory.json
```

### With Verbose Output

```bash
python app_product_inventory.py \
  --user-pass-file PortalWPAs.txt \
  --urls-file URLlist.txt \
  --verbose
```

## Command Line Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `--user-pass-file` | Yes | Path to the text file with `username:password` pairs used for the homepage inventory |
| `--urls-file` | Yes | Path to the text file with therapeutic area and pathology URLs |
| `--output` | No | Path to save the JSON output (default: `product_inventory_TIMESTAMP.json`) |
| `--wait-time` | No | Seconds to wait after each page loads (default: 4) |
| `--default-username` | No | Default username for therapeutic areas/pathologies (overrides .env) |
| `--default-password` | No | Default password for therapeutic areas/pathologies (overrides .env) |
| `--verbose`, `-v` | No | Enable verbose output showing progress |

## Output Format

The script generates a JSON file with the following structure:

```json
{
  "timestamp": "2024-10-09T12:00:00.000000",
  "homes": [
    {
      "url": "https://profesionales.msd.es/",
      "user": "WPVipSpain+WPAAnatomiaPatologica@msd.com",
      "audience": "WPAAnatomiaPatologica",
      "products": ["keytruda", "bridion", "cubicin"]
    },
    {
      "url": "https://profesionales.msd.es/",
      "user": "WPVipSpain+WPAOncologia@msd.com",
      "audience": "WPAOncologia",
      "products": ["keytruda", "lynparza"]
    }
  ],
  "therapeutic_areas": [
    {
      "url": "https://profesionales.msd.es/areas_terapeuticas/oncologia/",
      "name": "oncologia",
      "products": ["keytruda", "lynparza", "lenvima"]
    }
  ],
  "pathologies": [
    {
      "url": "https://profesionales.msd.es/areas_terapeuticas/oncologia/cancer-de-cervix/",
      "therapeutic_area": "oncologia",
      "pathology_name": "cancer-de-cervix",
      "products": ["keytruda"]
    },
    {
      "url": "https://profesionales.msd.es/areas_terapeuticas/oncologia/cancer-de-colon-y-recto/",
      "therapeutic_area": "oncologia",
      "pathology_name": "cancer-de-colon-y-recto",
      "products": ["keytruda", "erbitux"]
    }
  ]
}
```
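
Because the report is plain JSON, it is easy to post-process. As an illustration, the sketch below inverts a loaded report into a mapping from product to the places where it appears. The function name `index_by_product` and the `home:`/`area:`/`pathology:` label format are hypothetical; only the field names come from the structure above.

```python
from collections import defaultdict
from typing import Dict, List


def index_by_product(report: dict) -> Dict[str, List[str]]:
    """Map each product name to labels for the pages that display it."""
    index = defaultdict(list)
    for home in report.get("homes", []):
        for product in home["products"]:
            index[product].append(f"home:{home['audience']}")
    for area in report.get("therapeutic_areas", []):
        for product in area["products"]:
            index[product].append(f"area:{area['name']}")
    for pathology in report.get("pathologies", []):
        label = f"pathology:{pathology['therapeutic_area']}/{pathology['pathology_name']}"
        for product in pathology["products"]:
            index[product].append(label)
    return dict(index)
```

A report file can be loaded with `json.load` and passed straight to this function.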

## How It Works

### 1. Home Page Inventory

- Logs in with each user from the user/password file
- Extracts the audience name from the username (e.g., `WPVipSpain+WPAAnatomiaPatologica@msd.com` → `WPAAnatomiaPatologica`)
- Navigates to the homepage `https://profesionales.msd.es/`
- Extracts products displayed for that specific audience
- Saves results with user, audience, and product list
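
The audience extraction step can be sketched as taking the text between `+` and `@`. The function name `extract_audience` is hypothetical, and returning `None` for usernames without a `+` tag is an assumption about how such cases are handled.

```python
from typing import Optional


def extract_audience(username: str) -> Optional[str]:
    """Return the part of the username between '+' and '@', if present."""
    local_part = username.split("@", 1)[0]
    _, sep, audience = local_part.partition("+")
    # No '+' or an empty tag means no audience can be derived (assumption).
    return audience if sep and audience else None
```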

### 2. Therapeutic Areas & Pathologies Inventory

- Logs in with the default user credentials
- For each URL in the URLs file:
  - Parses the URL structure to identify:
    - Therapeutic Area: The segment after `/areas_terapeuticas/` (e.g., `oncologia`)
    - Pathology: The segment after the therapeutic area (e.g., `cancer-de-cervix`)
  - Extracts products from the page
  - Classifies the page as a Therapeutic Area (when the URL has no pathology segment) or as a Pathology
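
The URL parsing described above can be sketched with the standard library. The function name `classify_url` and the `(None, None)` result for URLs outside `/areas_terapeuticas/` are assumptions made for this illustration.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def classify_url(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (therapeutic_area, pathology); pathology is None for area pages."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "areas_terapeuticas" not in segments:
        return None, None  # not an area or pathology URL (assumption)
    rest = segments[segments.index("areas_terapeuticas") + 1:]
    area = rest[0] if len(rest) >= 1 else None
    pathology = rest[1] if len(rest) >= 2 else None
    return area, pathology
```

A result with an area but no pathology corresponds to a Therapeutic Area page; a result with both corresponds to a Pathology page.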

### 3. Product Extraction

The script looks for the following HTML structure:

```html
<div class="mhh-mcn-columns mhh-mcn-v1-columns--07053b7240cb45e3c06b994b401def3a ...">
  <div class="mhh-mcn-columns-inner">
    <div class="mhh-mcn-v1-column mhh-mcn-column--6 ...">
      <div class="mhh-mcn-v1-image ...">
        <a class="mhh-mcn-anchor-for-image" href="https://profesionales.msd.es/productos/cubicin/">
          <img src="..." alt="">
        </a>
      </div>
    </div>
  </div>
</div>
```

It extracts the product name from the URL: `/productos/cubicin/` → `cubicin`
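
The idea can be illustrated with a standard-library parser. Note the swaps and assumptions: this sketch uses `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, it collects every `<a>` whose path contains `/productos/` without checking the surrounding `mhh-mcn-columns` container, and the names `ProductLinkParser` and `extract_products` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlparse


class ProductLinkParser(HTMLParser):
    """Collect product slugs from <a href=".../productos/<slug>/"> links."""

    def __init__(self) -> None:
        super().__init__()
        self.products: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        segments = [s for s in urlparse(href).path.split("/") if s]
        if "productos" not in segments:
            return
        idx = segments.index("productos")
        if idx + 1 < len(segments):
            name = segments[idx + 1]
            # Deduplicate while keeping first-seen order.
            if name not in self.products:
                self.products.append(name)


def extract_products(html: str) -> List[str]:
    """Return the deduplicated product slugs found in the given HTML."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.products
```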

## Logging

The script writes a log file, `product_inventory.log`, to the same directory as the script.

All operations, errors, and extracted data are logged for debugging purposes.
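
A logging setup of this kind could look like the sketch below. The logger name, the message format, and the choice to echo to the console only in verbose mode are assumptions; only the log filename comes from this README.

```python
import logging
from pathlib import Path


def configure_logging(log_dir: Path, verbose: bool = False) -> logging.Logger:
    """Log everything to product_inventory.log; echo to the console if verbose."""
    logger = logging.getLogger("product_inventory")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    file_handler = logging.FileHandler(log_dir / "product_inventory.log", encoding="utf-8")
    file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(file_handler)
    if verbose:
        logger.addHandler(logging.StreamHandler())
    return logger
```

To place the log next to the script, one would pass `Path(__file__).resolve().parent` as `log_dir`.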

## Example Workflow

```bash
# 1. Ensure dependencies are installed
pip install playwright beautifulsoup4 pandas python-dotenv
playwright install chromium

# 2. Create .env file with default credentials
echo "MSD_USERNAME=your_user@msd.com" > .env
echo "MSD_PASSWORD=your_password" >> .env

# 3. Run the script with verbose output
python app_product_inventory.py \
  --user-pass-file PortalWPAs.txt \
  --urls-file URLlist.txt \
  --output product_inventory_$(date +%Y%m%d).json \
  --verbose

# 4. View the generated JSON
cat product_inventory_*.json | jq .
```

## Troubleshooting

### Login Issues
- Check that credentials in the user/password file are correct
- Ensure the login URL hasn't changed
- Check the log file for detailed error messages

### Product Extraction Issues
- Use the `--verbose` flag to see what is being extracted
- Check if the HTML structure on the website has changed
- Verify that the pages load completely (increase `--wait-time` if needed)

### URL Parsing Issues
- Ensure URLs follow the pattern: `/areas_terapeuticas/{therapeutic_area}/{pathology}/`
- Check the log file for warnings about unparseable URLs

## Differences from app_daemon_multiuser.py

This script is based on `app_daemon_multiuser.py` but differs in several ways:

1. **Purpose**: Product inventory instead of broken link checking
2. **Output**: JSON format instead of Excel report
3. **Extraction**: Focuses on product names from `/productos/` links instead of checking link status
4. **Structure**: Organizes results by homes, therapeutic areas, and pathologies
5. **No Email**: Writes only the JSON file; there is no email functionality

## Notes

- The script uses a headless Chromium browser for all operations
- Session state is saved and reused to avoid multiple logins
- Products are deduplicated within each page
- The default user credentials are used for all therapeutic area and pathology pages
