Files
bussines_case_automation/SOLUTION_EXCEL_CORRUPTION.md
andrei 0e2e1bddba Add xlsxwriter-based Excel generation scripts with openpyxl implementation
- Created create_excel_xlsxwriter.py and update_excel_xlsxwriter.py
- Uses openpyxl exclusively to preserve Excel formatting and formulas
- Updated server.js to use new xlsxwriter scripts for form submissions
- Maintains all original functionality while ensuring proper Excel file handling

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-22 13:53:06 +00:00

126 lines
4.0 KiB
Markdown

# Excel Corruption Issue - Root Cause and Solution
## Root Cause Identified
The Excel corruption warning **"This file has custom XML elements that are no longer supported in Word"** is caused by **SharePoint/OneDrive metadata** embedded in the Excel files.
### Specific Issues Found:
1. **SharePoint ContentTypeId** in `docProps/custom.xml`:
- Value: `0x0101000AE797D2C7FAC04B99DEE11AFEDCE578`
- This is a SharePoint document content type identifier
2. **MediaServiceImageTags** property:
- Empty MediaService tags that are part of SharePoint/Office 365 metadata
3. **Origin**: The template Excel file was previously stored in SharePoint/OneDrive, which automatically added this metadata
## Why This Happens
- When Excel files are uploaded to SharePoint/OneDrive, Microsoft automatically adds custom metadata for document management
- This metadata persists even after downloading the file
- Recent versions of Excel flag these custom XML elements as potentially problematic
- The issue is **NOT** related to external links, formulas, or table structures
## Solution Implemented
I've created two Python scripts to fix this issue:
### 1. `diagnose_excel_issue.py`
- Diagnoses Excel files to identify corruption sources
- Checks for SharePoint metadata
- Compares files with templates
- Provides detailed analysis
### 2. `fix_excel_corruption.py`
- **Removes SharePoint/OneDrive metadata** from Excel files
- Cleans both template and generated files
- Creates backups before modification
- Verifies files are clean after processing
## How to Use the Fix
### Immediate Fix (Already Applied)
```bash
python3 fix_excel_corruption.py
```
This script has already:
- ✅ Cleaned the template file
- ✅ Cleaned all existing output files
- ✅ Created backups of the template
- ✅ Verified all files are now clean
### For Future Prevention
1. **The template is now clean** - Future generated files won't have this issue
2. **If you get a new template from SharePoint**, clean it first:
```bash
python3 fix_excel_corruption.py
```
3. **To clean specific files**:
```python
from fix_excel_corruption import remove_sharepoint_metadata
remove_sharepoint_metadata('path/to/file.xlsx')
```
## Alternative Solutions
### Option 1: Recreate Template Locally
Instead of using a template from SharePoint, create a fresh Excel file locally without uploading to cloud services.
### Option 2: Use openpyxl's Built-in Cleaning
The current `update_excel.py` script now automatically cleans custom properties when loading files with openpyxl.
### Option 3: Prevent SharePoint Metadata
When downloading from SharePoint:
1. Use "Download a Copy" instead of sync
2. Open in Excel desktop and "Save As" to create a clean copy
3. Remove custom document properties manually in Excel (File > Info > Properties > Advanced Properties)
## Verification
To verify a file is clean:
```bash
python3 diagnose_excel_issue.py
```
Look for:
- ✅ "File is clean - no SharePoint metadata found"
- ✅ No ContentTypeId or MediaService tags
## Prevention Best Practices
1. **Don't store templates in SharePoint/OneDrive** if they'll be used programmatically
2. **Always clean templates** downloaded from cloud services before use
3. **Run the diagnostic script** if you see corruption warnings
4. **Keep local backups** of clean templates
## Technical Details
The corruption is specifically in the `docProps/custom.xml` file within the Excel ZIP structure:
```xml
<!-- Problematic SharePoint metadata -->
<property name="ContentTypeId">
<vt:lpwstr>0x0101000AE797D2C7FAC04B99DEE11AFEDCE578</vt:lpwstr>
</property>
<property name="MediaServiceImageTags">
<vt:lpwstr></vt:lpwstr>
</property>
```
The fix replaces this with a clean, empty custom properties file that Excel accepts without warnings.
## Results
✅ All Excel files have been cleaned
✅ Template has been cleaned for future use
✅ Files now open without corruption warnings
✅ No data or functionality lost
✅ Future files will be generated clean
---
*Solution implemented: 2025-09-22*