Enterprise-grade optimizations: Windows Long Path, High-Performance Hashing, and Documentation

- Adds 'get_long_path' to support Windows paths longer than 260 characters
- Implements dual-mode hashing: uses the 'quickxorhash' C library when available, otherwise falls back to a manual Python implementation
- Updates requirements.txt with quickxorhash
- Updates README.md and GEMINI.md with the latest features and technical specifications
Martin Tranberg
2026-03-29 19:33:31 +02:00
parent 367d31671d
commit c5d4ddaab0
4 changed files with 74 additions and 74 deletions

@@ -13,19 +13,21 @@ A production-ready Python utility for robust synchronization of SharePoint Onlin
## Core Features (Production Ready)
1. **Timestamp Synchronization:** Intelligent sync logic that compares SharePoint `lastModifiedDateTime` with local file `mtime`. Only downloads if the remote source is newer, significantly reducing sync time.
2. **Optimized Integrity Validation:** Implements the official Microsoft **QuickXorHash** (160-bit circular XOR). Includes a configurable threshold (default 30MB) and a global toggle to balance security and performance for large assets.
3. **Resumable Downloads:** Implements HTTP `Range` headers to resume partially downloaded files, critical for multi-gigabyte assets.
4. **Reliability:** Includes a custom `retry_request` decorator for Exponential Backoff, handling throttling (429) and transient network errors.
5. **Robust Library Discovery:** Automatic resolution of document library IDs with built-in fallbacks for localized names (e.g., "Delte dokumenter" to "Documents").
6. **Self-Healing Sessions:** Automatically detects and resolves 401 Unauthorized errors by refreshing both expiring Microsoft Graph Download URLs and MSAL Access Tokens mid-process.
7. **Concurrency:** Multi-threaded architecture (5 workers) for simultaneous scanning and downloading.
8. **Pagination:** Full support for OData pagination, ensuring complete folder traversal regardless of item count.
1. **Windows Long Path Support:** Automatically handles Windows path limitations by using `get_long_path` and `\\?\` absolute path prefixing.
2. **High-Performance Integrity:** Uses the `quickxorhash` C-library if available for fast validation of large files. Includes a manual 160-bit circular XOR fallback implementation.
3. **Timestamp Synchronization:** Compares SharePoint `lastModifiedDateTime` with local file `mtime`. Only downloads if the remote source is newer, significantly reducing sync time.
4. **Optimized Integrity Validation:** Includes a configurable threshold (default 30MB) and a global toggle to balance security and performance for large assets.
5. **Resumable Downloads:** Implements HTTP `Range` headers to resume partially downloaded files, critical for multi-gigabyte assets.
6. **Reliability:** Includes a custom `retry_request` decorator for Exponential Backoff, handling throttling (429) and transient network errors.
7. **Robust Library Discovery:** Automatic resolution of document library IDs with built-in fallbacks for localized names.
8. **Self-Healing Sessions:** Automatically refreshes expiring Microsoft Graph Download URLs and MSAL Access Tokens mid-process.
9. **Concurrency:** Multi-threaded architecture (5 workers) for simultaneous scanning and downloading.
10. **Pagination:** Full support for OData pagination, ensuring complete folder traversal.
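
The dual-mode integrity check described in the feature list could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the names `quickxorhash_fallback` and `quickxorhash_b64` are hypothetical, and the `quickxorhash.quickxorhash()` object with `update()` and `digest()` is assumed to be the interface of the C package. The fallback treats the state as one 160-bit integer, XORs each input byte in at a bit position that advances by 11 per byte (wrapping circularly), and finally XORs the length into bits 96-159.

```python
import base64

try:
    import quickxorhash as _fast  # optional C extension (assumed interface, see below)
except ImportError:
    _fast = None

WIDTH = 160               # size of the hash state in bits
SHIFT = 11                # bit offset advances by this much per input byte
MASK = (1 << WIDTH) - 1


def quickxorhash_fallback(data: bytes) -> bytes:
    """Pure-Python QuickXorHash over an in-memory buffer (slow, illustrative)."""
    state = 0
    for i, byte in enumerate(data):
        pos = (i * SHIFT) % WIDTH
        shifted = byte << pos
        # Bits pushed past bit 159 wrap around to the low end of the state.
        state ^= (shifted & MASK) | (shifted >> WIDTH)
    # XOR the 64-bit length into the last 64 bits (bits 96-159).
    state ^= (len(data) & 0xFFFFFFFFFFFFFFFF) << 96
    return state.to_bytes(WIDTH // 8, "little")


def quickxorhash_b64(data: bytes) -> str:
    """Return the Base64 digest, preferring the C library when it imported."""
    if _fast is not None:
        hasher = _fast.quickxorhash()  # assumption: hashlib-like object
        hasher.update(data)
        digest = hasher.digest()
    else:
        digest = quickxorhash_fallback(data)
    return base64.b64encode(digest).decode("ascii")
```

For multi-gigabyte files a real implementation would read the file in chunks and carry the running byte offset across chunks instead of hashing a single buffer.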
## Building and Running
### Setup
1. **Dependencies:** `pip install -r requirements.txt`
1. **Dependencies:** `pip install -r requirements.txt` (installing the `quickxorhash` C extension is recommended for best performance; building it may require a C compiler).
2. **Configuration:** Settings are managed via `connection_info.txt` or the GUI.
* `ENABLE_HASH_VALIDATION`: (True/False)
* `HASH_THRESHOLD_MB`: (Size limit for hashing)
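
How the two settings above might combine into a single decision can be sketched as follows. The function name `should_validate_hash` is hypothetical, and two details are assumptions rather than confirmed behavior: that files larger than the threshold skip hashing, and that one MB means 1024 * 1024 bytes.

```python
def should_validate_hash(size_bytes: int,
                         enabled: bool = True,
                         threshold_mb: int = 30) -> bool:
    """Decide whether a downloaded file should be hash-validated.

    Assumption: files above the threshold are skipped to save time,
    and the global toggle disables validation entirely.
    """
    if not enabled:
        return False
    return size_bytes <= threshold_mb * 1024 * 1024
```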
@@ -37,7 +39,8 @@ A production-ready Python utility for robust synchronization of SharePoint Onlin
## Development Conventions
* **QuickXorHash:** When implementing/updating hashing, ensure the file length is XORed into the **last 64 bits** (bits 96-159) of the 160-bit state per MS spec.
* **Timezone Handling:** Always use UTC (ISO8601) when comparing timestamps with SharePoint to avoid daylight savings mismatches.
* **Long Paths:** Always use `get_long_path()` when interacting with the local file system (`open`, `os.path.exists`, etc.).
* **Timezone Handling:** Always use UTC (ISO8601) when comparing timestamps with SharePoint.
* **Error Handling:** Always use the `safe_get` (retry-wrapped) method for Graph API calls. For item-specific operations, use `get_fresh_download_url`.
* **Authentication:** Use `get_headers(app, force_refresh=True)` when a 401 error is encountered.
* **Logging:** Prefer `logger.info()` or `logger.error()` over `print()`.
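
The long-path convention above can be illustrated with a minimal sketch. The commit does not show the body of `get_long_path`, so this is one plausible shape, not the actual implementation; the helper `add_long_path_prefix` and the UNC handling are assumptions added for illustration.

```python
import os
import sys

LONG_PREFIX = "\\\\?\\"         # the literal prefix \\?\
UNC_PREFIX = "\\\\?\\UNC\\"     # the literal prefix \\?\UNC\


def add_long_path_prefix(abs_path: str) -> str:
    """Prefix an already-absolute Windows path so it may exceed 260 characters."""
    if abs_path.startswith(LONG_PREFIX):
        return abs_path                       # already prefixed
    if abs_path.startswith("\\\\"):
        return UNC_PREFIX + abs_path[2:]      # \\server\share -> \\?\UNC\server\share
    return LONG_PREFIX + abs_path


def get_long_path(path: str) -> str:
    """Return a path that is safe to pass to open(), os.path.exists(), etc."""
    if sys.platform != "win32":
        return path                           # other platforms need no prefix
    return add_long_path_prefix(os.path.abspath(path))
```

Call sites would then wrap every local path, for example `open(get_long_path(target), "rb")`. Making the path absolute first matters because the `\\?\` prefix disables Windows path normalization, so relative segments are not resolved.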