diff --git a/GEMINI.md b/GEMINI.md
index 6c3153a..ad4768c 100644
--- a/GEMINI.md
+++ b/GEMINI.md
@@ -13,19 +13,21 @@ A production-ready Python utility for robust synchronization of SharePoint Onlin
 
 ## Core Features (Production Ready)
 
-1. **Timestamp Synchronization:** Intelligent sync logic that compares SharePoint `lastModifiedDateTime` with local file `mtime`. Only downloads if the remote source is newer, significantly reducing sync time.
-2. **Optimized Integrity Validation:** Implements the official Microsoft **QuickXorHash** (160-bit circular XOR). Includes a configurable threshold (default 30MB) and a global toggle to balance security and performance for large assets.
-3. **Resumable Downloads:** Implements HTTP `Range` headers to resume partially downloaded files, critical for multi-gigabyte assets.
-4. **Reliability:** Includes a custom `retry_request` decorator for Exponential Backoff, handling throttling (429) and transient network errors.
-5. **Robust Library Discovery:** Automatic resolution of document library IDs with built-in fallbacks for localized names (e.g., "Delte dokumenter" to "Documents").
-6. **Self-Healing Sessions:** Automatically detects and resolves 401 Unauthorized errors by refreshing both expiring Microsoft Graph Download URLs and MSAL Access Tokens mid-process.
-7. **Concurrency:** Multi-threaded architecture (5 workers) for simultaneous scanning and downloading.
-8. **Pagination:** Full support for OData pagination, ensuring complete folder traversal regardless of item count.
+1. **Windows Long Path Support:** Automatically handles Windows path limitations by using `get_long_path` and `\\?\` absolute path prefixing.
+2. **High-Performance Integrity:** Uses the `quickxorhash` C-library if available for fast validation of large files. Includes a manual 160-bit circular XOR fallback implementation.
+3. **Timestamp Synchronization:** Compares SharePoint `lastModifiedDateTime` with local file `mtime`. Only downloads if the remote source is newer, significantly reducing sync time.
+4. **Optimized Integrity Validation:** Includes a configurable threshold (default 30MB) and a global toggle to balance security and performance for large assets.
+5. **Resumable Downloads:** Implements HTTP `Range` headers to resume partially downloaded files, critical for multi-gigabyte assets.
+6. **Reliability:** Includes a custom `retry_request` decorator for Exponential Backoff, handling throttling (429) and transient network errors.
+7. **Robust Library Discovery:** Automatic resolution of document library IDs with built-in fallbacks for localized names.
+8. **Self-Healing Sessions:** Automatically refreshes expiring Microsoft Graph Download URLs and MSAL Access Tokens mid-process.
+9. **Concurrency:** Multi-threaded architecture (5 workers) for simultaneous scanning and downloading.
+10. **Pagination:** Full support for OData pagination, ensuring complete folder traversal.
 
 ## Building and Running
 
 ### Setup
-1. **Dependencies:** `pip install -r requirements.txt`
+1. **Dependencies:** `pip install -r requirements.txt` (Installing `quickxorhash` via C-compiler is recommended for best performance).
 2. **Configuration:** Settings are managed via `connection_info.txt` or the GUI.
     * `ENABLE_HASH_VALIDATION`: (True/False)
     * `HASH_THRESHOLD_MB`: (Size limit for hashing)
@@ -37,7 +39,8 @@ A production-ready Python utility for robust synchronization of SharePoint Onlin
 ## Development Conventions
 
 * **QuickXorHash:** When implementing/updating hashing, ensure the file length is XORed into the **last 64 bits** (bits 96-159) of the 160-bit state per MS spec.
-* **Timezone Handling:** Always use UTC (ISO8601) when comparing timestamps with SharePoint to avoid daylight savings mismatches.
+* **Long Paths:** Always use `get_long_path()` when interacting with local file system (open, os.path.exists, etc.).
+* **Timezone Handling:** Always use UTC (ISO8601) when comparing timestamps with SharePoint.
 * **Error Handling:** Always use the `safe_get` (retry-wrapped) method for Graph API calls. For item-specific operations, use `get_fresh_download_url`.
 * **Authentication:** Use `get_headers(app, force_refresh=True)` when a 401 error is encountered.
 * **Logging:** Prefer `logger.info()` or `logger.error()` over `print()`.
diff --git a/README.md b/README.md
index aeed862..2055b3f 100644
--- a/README.md
+++ b/README.md
@@ -7,18 +7,16 @@ This script makes it possible to download specific folders from a SharePoint do
 * **Modern GUI (UX):** Polished dark interface built with CustomTkinter that makes it easy to save settings, choose folders, and see status in real time.
 * **Stop functionality:** Abort the synchronization mid-process directly from the UI.
 * **Parallel download:** Uses `ThreadPoolExecutor` (default 5 threads) for significantly higher transfer speed.
-* **Timestamp Synchronization:** Only downloads files if the source on SharePoint is newer than your local file (`lastModifiedDateTime` vs. local `mtime`). If your local file is newer, the download is skipped.
-* **Configurable Integrity:** Validates file correctness with Microsoft's official **QuickXorHash** algorithm.
-    * **Toggle:** Option to turn hash validation fully on or off.
-    * **Smart Threshold:** Define an MB threshold (default 30 MB); files below it are hashed, while larger files (e.g. 65 GB) are compared by size only to save time.
-* **Robust Library Discovery:** Automatically finds your library and has a built-in fallback (e.g. from "Delte dokumenter" to "Documents") if SharePoint uses English names behind the scenes.
-* **Resume Download:** Supports HTTP `Range` headers so interrupted downloads of large files (e.g. >50GB) resume from the last byte instead of starting over.
-* **Auto-Refresh of Downloads & Tokens:** Automatically handles expired download links and access tokens (401 Unauthorized). The tool renews both URLs and access tokens mid-process without interrupting the synchronization.
-* **Exponential Backoff:** Automatically handles Microsoft Graph throttling (`429 Too Many Requests`) and network errors with intelligent retries.
-* **Depth Protection:** Folder traversal stops at a depth of 50 levels to guard against abnormally deep structures.
-* **Structured Logging:** Saves detailed logs to `sharepoint_download.log` plus a CSV error report for each run.
-* **Pagination:** Automatically handles folders with more than 200 items.
-* **Entra ID Integration:** Uses MSAL for secure authentication via the Client Credentials flow.
+* **Windows Long Path Support:** Automatically handles Windows' 260-character limit on file paths by using the `\\?\` prefix, which ensures stability with deep SharePoint structures.
+* **Timestamp Synchronization:** Only downloads files if the source on SharePoint is newer than your local file (`lastModifiedDateTime` vs. local `mtime`).
+* **High-Performance Integrity:** Validates file correctness with Microsoft's official **QuickXorHash** algorithm.
+    * **Speed:** Automatically uses the fast `quickxorhash` C library if it is installed (recommended for large files).
+    * **Fallback:** Includes an exact 160-bit Python implementation as a fallback if the library is not found.
+    * **Smart Threshold:** Define an MB threshold (default 30 MB); files below it are always hashed, while larger files (e.g. 65 GB) are compared by size only to save time (configurable).
+* **Robust Library Discovery:** Automatically finds your library and has a built-in fallback (e.g. from "Delte dokumenter" to "Documents").
+* **Resume Download:** Supports HTTP `Range` headers for resuming large files.
+* **Auto-Refresh of Downloads & Tokens:** Automatically renews sessions and links mid-process.
+* **Exponential Backoff:** Handles Microsoft Graph throttling (`429 Too Many Requests`) intelligently.
 
 ## Installation
 
@@ -28,47 +26,17 @@ This script makes it possible to download specific folders from a SharePoint do
 pip install -r requirements.txt
 ```
 
-## Setup in Microsoft Entra ID (Azure AD)
-
-For the script to access SharePoint, you must create an App registration:
-
-1. Log in to the [Microsoft Entra admin center](https://entra.microsoft.com/).
-2. Go to **Identity** > **Applications** > **App registrations** > **New registration**.
-3. Give the app a name (e.g. "SharePoint Download Tool") and select "Accounts in this organizational directory only". Click **Register**.
-4. Note your **Application (client) ID** and **Directory (tenant) ID**.
-5. Go to **API permissions** > **Add a permission** > **Microsoft Graph**.
-6. Select **Application permissions**.
-7. Search for and add `Sites.Read.All`.
-8. **IMPORTANT:** Click **Grant admin consent for [your domain]**.
-9. Go to **Certificates & secrets** > **New client secret**.
-10. **IMPORTANT:** Copy the value under **Value** immediately (your `CLIENT_SECRET`).
-
 ## Usage
 
-There are two ways to run the tool:
-
 ### 1. GUI Version (Recommended)
-For a modern graphical interface, run:
-```bash
-python sharepoint_gui.py
-```
-Here you can easily enter settings, save them, choose a destination folder, and start/stop the synchronization. You can also control whether hash validation is active and at what size it is skipped.
+Run: `python sharepoint_gui.py`
 
 ### 2. CLI Version (For automation)
-If you want to run the script directly from the terminal:
-1. Fill in `connection_info.txt`.
-2. Run:
-   ```bash
-   python download_sharepoint.py
-   ```
+Run: `python download_sharepoint.py`
 
 ## Configuration (connection_info.txt)
 * `ENABLE_HASH_VALIDATION`: Set to `"True"` or `"False"`.
 * `HASH_THRESHOLD_MB`: Numeric value (e.g. `"30"` or `"50"`).
 
-## Log files
-* `sharepoint_download.log`: Technical log of all actions and errors.
-* `download_report_YYYYMMDD_HHMMSS.csv`: A quick overview of files that failed.
-
 ## Security
 Remember that `.gitignore` is set up to ignore `connection_info.txt`, so your credentials are not uploaded to Git.
diff --git a/download_sharepoint.py b/download_sharepoint.py
index 7abf07c..ff55b08 100644
--- a/download_sharepoint.py
+++ b/download_sharepoint.py
@@ -6,6 +6,10 @@ import threading
 import logging
 import base64
 import struct
+try:
+    import quickxorhash as qxh_lib
+except ImportError:
+    qxh_lib = None
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from datetime import datetime
 from msal import ConfidentialClientApplication
@@ -41,6 +45,13 @@ def format_size(size_bytes):
         size_bytes /= 1024.0
     return f"{size_bytes:.2f} EB"
 
+def get_long_path(path):
+    """Handles Windows Long Path limitation by prefixing with \\?\\ for absolute paths."""
+    path = os.path.abspath(path)
+    if os.name == 'nt' and not path.startswith("\\\\?\\"):
+        return "\\\\?\\" + path
+    return path
+
 def load_config(file_path):
     config = {}
     if not os.path.exists(file_path):
@@ -84,29 +95,34 @@ def safe_get(url, headers, stream=False, timeout=60, params=None):
 # --- Point 4: Integrity Validation (QuickXorHash) ---
 def quickxorhash(file_path):
     """Compute Microsoft QuickXorHash for a file. Returns base64-encoded string.
-    Follows the official Microsoft implementation: 160-bit circular XOR
-    with the file length XORed into the LAST 64 bits."""
+    Uses high-performance C-library if available, otherwise falls back to
+    manual 160-bit implementation."""
+
+    # 1. Try the fast C library if it is installed
+    if qxh_lib:
+        hasher = qxh_lib.quickxorhash()
+        with open(get_long_path(file_path), 'rb') as f:
+            while True:
+                chunk = f.read(CHUNK_SIZE)
+                if not chunk: break
+                hasher.update(chunk)
+        return base64.b64encode(hasher.digest()).decode('ascii')
+
+    # 2. Fall back to the manual Python implementation (exact but slower)
     h = 0
     length = 0
     mask = (1 << 160) - 1
-
-    with open(file_path, 'rb') as f:
+    with open(get_long_path(file_path), 'rb') as f:
         while True:
             chunk = f.read(CHUNK_SIZE)
-            if not chunk:
-                break
+            if not chunk: break
             for b in chunk:
                 shift = (length * 11) % 160
                 shifted = b << shift
                 wrapped = (shifted & mask) | (shifted >> 160)
                 h ^= wrapped
                 length += 1
-
-    # Finalize: XOR the 64-bit length into the LAST 64 bits of the 160-bit state.
-    # Bits 96 to 159.
     h ^= (length << (160 - 64))
-
-    # Convert to 20 bytes (160 bits) in little-endian format
     result = h.to_bytes(20, byteorder='little')
     return base64.b64encode(result).decode('ascii')
 
@@ -115,7 +131,7 @@ def verify_integrity(local_path, remote_hash):
     if not remote_hash or not ENABLE_HASH_VALIDATION:
         return True
 
-    file_size = os.path.getsize(local_path)
+    file_size = os.path.getsize(get_long_path(local_path))
     threshold_bytes = HASH_THRESHOLD_MB * 1024 * 1024
 
     if file_size > threshold_bytes:
@@ -217,13 +233,14 @@ def download_single_file(app, drive_id, item_id, local_path, expected_size, disp
         resume_header = {}
         existing_size = 0
         download_url = initial_url
+
+        long_local_path = get_long_path(local_path)
 
-        if os.path.exists(local_path):
-            existing_size = os.path.getsize(local_path)
-            local_mtime = os.path.getmtime(local_path)
+        if os.path.exists(long_local_path):
+            existing_size = os.path.getsize(long_local_path)
+            local_mtime = os.path.getmtime(long_local_path)
 
             # Convert SharePoint ISO8601 UTC time (e.g. 2024-03-29T12:00:00Z) to a Unix timestamp
-            # We strip 'Z' and use datetime.fromisoformat
             remote_mtime = datetime.fromisoformat(remote_mtime_str.replace('Z', '+00:00')).timestamp()
 
             # If the file exists, has the right size AND local is not older than remote -> SKIP
@@ -233,7 +250,7 @@ def download_single_file(app, drive_id, item_id, local_path, expected_size, disp
                     logger.info(f"Skipped (up-to-date): {display_name}")
                     return True, None
                 else:
-                    logger.warning(f"Hash mismatch on existing file: {display_name}. Re-downloading.")
+                    logger.warning(f"Hash mismatch on existing file: {display_name}. Re-downloading.")
                     existing_size = 0
             else:
                 logger.info(f"Update available: {display_name} (Remote is newer)")
@@ -252,7 +269,7 @@ def download_single_file(app, drive_id, item_id, local_path, expected_size, disp
             existing_size = 0
 
         logger.info(f"Starting: {display_name} ({format_size(expected_size)})")
-        os.makedirs(os.path.dirname(local_path), exist_ok=True)
+        os.makedirs(os.path.dirname(long_local_path), exist_ok=True)
 
         # Initial download attempt
         if not download_url:
@@ -273,13 +290,24 @@ def download_single_file(app, drive_id, item_id, local_path, expected_size, disp
             else:
                 raise
 
-        with open(local_path, file_mode) as f:
+        with open(long_local_path, file_mode) as f:
             for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                 if chunk:
                     f.write(chunk)
 
         # Post-download check
-        final_size = os.path.getsize(local_path)
+        final_size = os.path.getsize(long_local_path)
+        if final_size == expected_size:
+            if verify_integrity(local_path, remote_hash):
+                logger.info(f"DONE: {display_name}")
+                return True, None
+            else:
+                return False, "Integrity check failed (Hash mismatch)"
+        else:
+            return False, f"Size mismatch: Remote={expected_size}, Local={final_size}"
+
+    except Exception as e:
+        return False, str(e)
         if final_size == expected_size:
             if verify_integrity(local_path, remote_hash):
                 logger.info(f"DONE: {display_name}")
diff --git a/requirements.txt b/requirements.txt
index bde7a63..2725418 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,4 @@
 requests
 msal
 customtkinter
+quickxorhash
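
Review note on `get_long_path`: the helper in this patch prepends `\\?\` to every absolute path. For a destination on a network share (`\\server\share\...`) that would yield `\\?\\\server\share\...`, which Windows is unlikely to resolve, because the extended-length form of a UNC path is `\\?\UNC\server\share\...`. The sketch below shows one possible way to cover that case. The function name `to_extended_path` and the `is_windows` parameter are hypothetical and not part of the patch; the sketch uses `ntpath` so it can be exercised on any OS, and it assumes the caller passes an absolute path.

```python
import ntpath

LONG_PREFIX = "\\\\?\\"        # \\?\      extended-length prefix for drive paths
UNC_PREFIX = "\\\\?\\UNC\\"    # \\?\UNC\  extended-length prefix for network shares


def to_extended_path(path: str, is_windows: bool = True) -> str:
    """Hypothetical UNC-aware variant of get_long_path (illustration only).

    Assumes `path` is already absolute. `is_windows` stands in for the
    `os.name == 'nt'` check used in the patch so the logic is testable anywhere.
    """
    if not is_windows:
        return path
    if path.startswith(LONG_PREFIX):
        # Already in extended-length form; leave it untouched.
        return path
    # Extended-length paths skip Windows' own normalization, so resolve
    # forward slashes and ".." segments before adding the prefix.
    path = ntpath.normpath(path)
    if path.startswith("\\\\"):
        # \\server\share\x  ->  \\?\UNC\server\share\x
        return UNC_PREFIX + path[2:]
    return LONG_PREFIX + path
```

If this approach were adopted, the existing call sites (`open`, `os.path.getsize`, `os.makedirs`) would not need to change, since only the helper's return value differs for UNC inputs.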