DOC Input Normalization Protocol
Use this guide when the user provides a Word file that may be legacy binary .doc.
Objective
Normalize all template inputs to .docx before analysis or rewrite so downstream tooling always operates on OOXML.
Detect Container by Signature
Do not trust extension names alone.
| Format | Signature (hex) | Meaning |
|---|---|---|
| OOXML ZIP (.docx) | 50 4B 03 04 | Zip package with XML parts |
| OLE CFB (.doc) | D0 CF 11 E0 A1 B1 1A E1 | Legacy binary Word container |
Quick check:
xxd -l 8 <input-file>
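The same check can be sketched in Python for callers that prefer not to shell out. This is a minimal illustration using the two signatures from the table above; the function name `detect_container` and its return labels are hypothetical, not part of the skill.

```python
# Signatures copied from the table above.
ZIP_MAGIC = bytes.fromhex("504B0304")
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def detect_container(path):
    """Return 'zip', 'ole', or 'unknown' based on the first 8 bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE_MAGIC):
        return "ole"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Any ZIP archive starts with the same four bytes, so this only distinguishes the container type; it does not by itself prove the file is a valid Word document.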
Required Workflow
- If signature is ZIP, treat input as .docx and continue normally.
- If signature is OLE, convert to .docx before any content analysis.
- Keep original .doc read-only and preserve it for traceability.
- Run audit and residual on converted .docx before rewrite.
- Use only the converted .docx in from-template mode.
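One possible way to honor the "read-only and preserve" step is sketched below. It assumes a POSIX-style permission model, and the SHA-256 digest is an added assumption for traceability rather than something this guide requires; `preserve_original` is a hypothetical helper name.

```python
import hashlib
import os
import stat


def preserve_original(path):
    """Clear the write bits on the original file and return its SHA-256 hex digest."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    mode = os.stat(path).st_mode
    # Remove write permission for owner, group, and others.
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return digest
```

Recording the digest next to the converted file would make it possible to confirm later that the original was never modified.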
Conversion Command
soffice --headless --convert-to docx --outdir <tmp-dir> <input.doc>
Do not use textutil for template-driven .doc normalization. It is not accepted in this skill because structure fidelity is insufficient for downstream map/apply and gate checks.
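The conversion command above could be wrapped as follows. This is a sketch, not the skill's own implementation: the helper names and the timeout value are assumptions, and the output name relies on LibreOffice writing `<input-stem>.docx` into the `--outdir` directory.

```python
import subprocess
from pathlib import Path


def build_convert_command(input_doc, out_dir):
    """Return the soffice argument list used in this guide."""
    return ["soffice", "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(input_doc)]


def convert_to_docx(input_doc, out_dir, timeout=120):
    """Run the conversion and return the path of the converted .docx."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(input_doc, out_dir),
                   check=True, timeout=timeout)
    converted = out_dir / (Path(input_doc).stem + ".docx")
    # Do not rely on the exit status alone; confirm the file exists.
    if not converted.is_file():
        raise RuntimeError(f"conversion produced no output: {converted}")
    return converted
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.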
Validation after conversion:
python3 <skill-path>/docx_engine.py audit <tmp-dir>/<converted>.docx
python3 <skill-path>/docx_engine.py residual <tmp-dir>/<converted>.docx
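Both validation commands can be run in sequence from Python, as in the sketch below. It assumes that `docx_engine.py` signals a failed check with a non-zero exit status, which this guide does not state; `validation_commands` and `run_validation` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def validation_commands(skill_path, docx_path):
    """Build the audit and residual command lines shown in this guide."""
    engine = str(Path(skill_path) / "docx_engine.py")
    return [["python3", engine, sub, str(docx_path)]
            for sub in ("audit", "residual")]


def run_validation(skill_path, docx_path):
    """Return True only if every command exits with status 0 (assumed to mean pass)."""
    for cmd in validation_commands(skill_path, docx_path):
        if subprocess.run(cmd).returncode != 0:
            return False
    return True
```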
Failure Handling
- If soffice is unavailable, install LibreOffice or request user-provided .docx.
- In restricted sandbox environments, LibreOffice conversion may require elevated execution permission.
- If conversion fails or produced .docx cannot pass audit, stop and request a clean .docx.
- If content is visibly damaged after conversion, do not continue template rewrite on that file.
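The automatable part of these rules can be expressed as a small decision helper. The sketch below is illustrative only: the function names and message strings are hypothetical, and visible content damage still needs a human or a separate check.

```python
import shutil


def soffice_available(which=shutil.which):
    """Return True if a soffice executable is found on PATH."""
    return which("soffice") is not None


def failure_action(soffice_found, converted, audit_passed):
    """Map the failure rules above to a next step; None means continue."""
    if not soffice_found:
        return "install LibreOffice or request a user-provided .docx"
    if not converted or not audit_passed:
        return "stop and request a clean .docx"
    return None
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses `shutil.which`.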
Practical Caveats
- Old .doc with embedded objects/macros may partially degrade during conversion.
- Formatting drift is expected in some edge cases; prioritize structural fidelity and user-specified edits.
- Do not convert final deliverable back to .doc unless user explicitly asks.