Compare commits
20 Commits
7fab89cbbb
...
dev
| Author | SHA1 | Date | |
|---|---|---|---|
| | d15b9afc03 | | |
| | 8e8bb3baa1 | | |
| | 8899afabbc | | |
| | 9e40abcfd8 | | |
| | 03a766be63 | | |
| | 1a97ca3d53 | | |
| | 8e837240b5 | | |
| | f5e54b185e | | |
| | c5d4ddaab0 | | |
| | 367d31671d | | |
| | acede4a867 | | |
| | ba968ab70e | | |
| | 790ca91339 | | |
| | ed508302a6 | | |
| | 33fbdc244d | | |
| | ad4166fb03 | | |
| | 39a3aff495 | | |
| | 634b5ff151 | | |
| | 3bb2b44477 | | |
| | a8048ae74d | | |
49
CLAUDE.md
@@ -1,49 +0,0 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A Python utility that synchronizes SharePoint Online folders to local storage using the Microsoft Graph API. Offers both a CLI (`download_sharepoint.py`) and a modern GUI (`sharepoint_gui.py`).

## Running the Tool

```bash
# Install dependencies
pip install -r requirements.txt

# GUI mode (recommended for interactive use)
python sharepoint_gui.py

# CLI mode (for automation/scripting)
python download_sharepoint.py
```

Configuration is read from `connection_info.txt` (gitignored; copy from `connection_info.template.txt` and fill in credentials).

## Architecture

Two-file structure with clear separation of concerns:

**`download_sharepoint.py`**: Core engine with four logical layers:
1. **Authentication**: MSAL `ConfidentialClientApplication` using OAuth 2.0 Client Credentials flow. Tokens are refreshed via `force_refresh=True` when a 401 is received.
2. **Graph API navigation**: `get_site_id()` → `get_drive_id()` → `process_item_list()` (recursive, handles `@odata.nextLink` pagination).
3. **Download & resilience**: `download_single_file()` with Range header support for resumable downloads. `get_fresh_download_url()` handles expired pre-signed URLs. The `@retry_request` decorator provides exponential backoff (up to 5 retries, 2^n seconds) for 429s and network errors.
4. **Concurrency**: `ThreadPoolExecutor` (max 5 workers). A `report_lock` guards the shared error list. A `stop_event` allows the GUI stop button to cancel in-flight work.

**`sharepoint_gui.py`**: CustomTkinter wrapper that:
- Persists settings to a local JSON file
- Spawns the core engine in a background thread
- Patches `requests.get` to route through the GUI's log display
- Provides a folder browser for `LOCAL_PATH`

## Key Behaviors to Preserve

- **Self-healing sessions**: On 401, the code refreshes both the MSAL access token *and* the pre-signed Graph download URL before retrying. These are two separate expiry mechanisms.
- **Resumable downloads**: Files are downloaded in 1 MB chunks using HTTP Range headers. Existing files are skipped if their size matches; partial files are resumed from the last byte.
- **Stop signal**: `stop_event.is_set()` is checked in the download loop and recursive traversal. Any new code that loops must respect this.

## Output

- `sharepoint_download.log`: Full operation log
- `download_report_YYYYMMDD_HHMMSS.csv`: Per-run error report (gitignored)
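The backoff behavior that CLAUDE.md attributes to `@retry_request` (up to 5 retries, waiting 2^n seconds) could look roughly like the sketch below. This is not the repository's implementation: `retry_with_backoff`, `TransientError`, and the injectable `sleep` parameter are hypothetical names, chosen so the example runs without network access.

```python
import time
from functools import wraps

MAX_RETRIES = 5  # mirrors the "up to 5 retries" figure quoted above


class TransientError(Exception):
    """Hypothetical stand-in for a 429 response or a network failure."""


def retry_with_backoff(max_retries=MAX_RETRIES, sleep=time.sleep):
    """Retry the wrapped call, waiting 2**attempt seconds between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_retries:
                        raise  # out of attempts: surface the last error
                    sleep(2 ** attempt)  # 2s, 4s, 8s, ...
        return wrapper
    return decorator


# Usage: a call that fails twice and then succeeds.
waits = []
calls = {"count": 0}


@retry_with_backoff(sleep=waits.append)  # record the waits instead of sleeping
def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("throttled")
    return "ok"


result = flaky_fetch()
```

Injecting `sleep` keeps the example fast and makes the wait schedule observable; a real caller would leave the default `time.sleep` in place.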
36
GEMINI.md
@@ -13,26 +13,34 @@ A production-ready Python utility for robust synchronization of SharePoint Onlin

## Core Features (Production Ready)

1. **Resumable Downloads:** Implements HTTP `Range` headers to resume partially downloaded files, critical for multi-gigabyte assets.
2. **Reliability:** Includes a custom `retry_request` decorator for Exponential Backoff, handling throttling (429) and transient network errors.
3. **Concurrency:** Multi-threaded architecture (5 workers) for simultaneous scanning and downloading.
4. **Pagination:** Full support for OData pagination, ensuring complete folder traversal regardless of item count.
5. **Self-Healing Sessions:** Automatically detects and resolves 401 Unauthorized errors by refreshing both expiring Microsoft Graph Download URLs and MSAL Access Tokens mid-process.
6. **Logging & Audit:** Integrated Python `logging` to `sharepoint_download.log` and structured CSV reports for error auditing.
1. **Windows Long Path Support:** Automatically handles Windows path limitations by using `get_long_path` and `\\?\` absolute path prefixing.
2. **High-Performance Integrity:** Uses the `quickxorhash` C-library if available for fast validation of large files. Includes a manual 160-bit circular XOR fallback implementation.
3. **Timestamp Synchronization:** Compares SharePoint `lastModifiedDateTime` with local file `mtime`. Only downloads if the remote source is newer, significantly reducing sync time.
4. **Optimized Integrity Validation:** Includes a configurable threshold (default 30MB) and a global toggle to balance security and performance for large assets.
5. **Resumable Downloads:** Implements HTTP `Range` headers to resume partially downloaded files, critical for multi-gigabyte assets.
6. **Reliability:** Includes a custom `retry_request` decorator for Exponential Backoff, handling throttling (429) and transient network errors.
7. **Robust Library Discovery:** Automatic resolution of document library IDs with built-in fallbacks for localized names.
8. **Self-Healing Sessions:** Automatically refreshes expiring Microsoft Graph Download URLs and MSAL Access Tokens mid-process.
9. **Concurrency:** Multi-threaded architecture (5 workers) for simultaneous scanning and downloading.
10. **Pagination:** Full support for OData pagination, ensuring complete folder traversal.

## Building and Running

### Setup
1. **Dependencies:** `pip install -r requirements.txt`
2. **Configuration:** Use `connection_info.template.txt` to create `connection_info.txt`.
1. **Dependencies:** `pip install -r requirements.txt` (Installing `quickxorhash` via C-compiler is recommended for best performance).
2. **Configuration:** Settings are managed via `connection_info.txt` or the GUI.
   * `ENABLE_HASH_VALIDATION`: (True/False)
   * `HASH_THRESHOLD_MB`: (Size limit for hashing)

### Execution
`python download_sharepoint.py`
* **GUI:** `python sharepoint_gui.py`
* **CLI:** `python download_sharepoint.py`

## Development Conventions

* **Error Handling:** Always use the `safe_get` (retry-wrapped) method for Graph API calls. For item-specific operations, use `get_fresh_download_url` to handle token/URL expiry.
* **Authentication:** Use `get_headers(app, force_refresh=True)` when a 401 error is encountered from Graph API to ensure session continuity.
* **Thread Safety:** Use `report_lock` when updating the shared error list from worker threads.
* **Logging:** Prefer `logger.info()` or `logger.error()` over `print()` to ensure persistence in `sharepoint_download.log`.
* **Integrity:** Always verify file integrity using `size` and `quickXorHash` where available.
* **QuickXorHash:** When implementing/updating hashing, ensure the file length is XORed into the **last 64 bits** (bits 96-159) of the 160-bit state per MS spec.
* **Long Paths:** Always use `get_long_path()` when interacting with local file system (open, os.path.exists, etc.).
* **Timezone Handling:** Always use UTC (ISO8601) when comparing timestamps with SharePoint.
* **Error Handling:** Always use the `safe_get` (retry-wrapped) method for Graph API calls. For item-specific operations, use `get_fresh_download_url`.
* **Authentication:** Use `get_headers(app, force_refresh=True)` when a 401 error is encountered.
* **Logging:** Prefer `logger.info()` or `logger.error()` over `print()`.
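The QuickXorHash convention above (each byte XORed into a 160-bit circular state, then the length XORed into bits 96-159) can be illustrated on an in-memory byte string. The sketch below follows the bit layout stated in GEMINI.md; `quickxorhash_bytes` is a hypothetical helper name, and it has not been checked here against hashes returned by Microsoft Graph, so treat it as an illustration rather than a reference implementation.

```python
import base64

WIDTH_BITS = 160   # size of the hash state
SHIFT_BITS = 11    # each byte lands 11 bits further along than the previous one
MASK = (1 << WIDTH_BITS) - 1


def quickxorhash_bytes(data: bytes) -> str:
    """Return a base64 digest using a 160-bit circular XOR over `data`."""
    state = 0
    for index, byte in enumerate(data):
        shift = (index * SHIFT_BITS) % WIDTH_BITS
        shifted = byte << shift
        # Bits pushed past bit 159 wrap around to the low end of the state.
        state ^= (shifted & MASK) | (shifted >> WIDTH_BITS)
    # XOR the length into the top 64 bits (bits 96-159) of the state.
    state ^= (len(data) << (WIDTH_BITS - 64)) & MASK
    return base64.b64encode(state.to_bytes(20, byteorder="little")).decode("ascii")


empty_digest = quickxorhash_bytes(b"")
one_byte_digest = base64.b64decode(quickxorhash_bytes(b"\x01"))
```

For a single `0x01` byte, the data bit lands in byte 0 and the length (1) lands in byte 12 of the little-endian digest, which is a quick way to sanity-check the "last 64 bits" rule.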
61
README.md
@@ -5,15 +5,19 @@ Dette script gør det muligt at downloade specifikke mapper fra et SharePoint do
## Features

* **Modern GUI (UX):** Sleek dark interface built with CustomTkinter that makes it easy to save settings, choose folders, and follow status in real time.
* **Stop functionality:** Cancel the synchronization mid-process directly from the UI.
* **Stop functionality:** Cancel the synchronization immediately, directly from the GUI. The system now uses explicit signaling (`threading.Event`), which interrupts in-progress downloads mid-stream (chunk level), giving a very fast stop response with no waiting.
* **Parallel download:** Uses `ThreadPoolExecutor` (default 5 threads) for significantly higher transfer speed.
* **Resume Download:** Supports HTTP `Range` headers, so interrupted downloads of large files (e.g. >50GB) resume from the last byte instead of starting over.
* **Auto-Refresh of Downloads & Tokens:** Automatically handles expired download links and Access Tokens (401 Unauthorized). The tool renews both URLs and access tokens mid-process without interrupting the synchronization.
* **Exponential Backoff:** Automatically handles Microsoft Graph throttling (`429 Too Many Requests`) and network errors with intelligent retries.
* **Structured Logging:** Saves detailed logs to `sharepoint_download.log` plus a CSV error report for each run.
* **Pagination:** Automatically handles folders with more than 200 items via `@odata.nextLink`.
* **Smart Skip & Integrity:** Skips files that already exist locally with the correct size, and prepares for hash validation (QuickXorHash).
* **Entra ID Integration:** Uses MSAL for secure authentication via the Client Credentials flow with automatic token refresh.
* **Windows Long Path Support:** Automatically handles Windows' 260-character limit on file paths by using the `\\?\` prefix. The system now also correctly supports **UNC paths** (network drives) via the `\\?\UNC\` format, ensuring full compatibility in enterprise environments.
* **Optimized Synchronization:** If file size and timestamp match (within 1 second of precision), the tool automatically skips both the download and the expensive hash validation. This gives a significant speedup for repeated synchronizations of large libraries with many small files.
* **Timestamp Synchronization:** Only downloads files if the source on SharePoint is newer than your local file (`lastModifiedDateTime` vs. local `mtime`).
* **Integrity Validation:** Validates file correctness with Microsoft's official **QuickXorHash** algorithm (160-bit circular XOR).
  * **Fallback:** Includes an exact 160-bit Python implementation as the default.
  * **Optimization:** Automatically uses the much faster `quickxorhash` C library if it is installed (optional).
  * **Smart Threshold:** Define an MB threshold (default 30 MB): files below it are always hashed, while larger files (e.g. 65 GB) are compared by size only to save time (configurable).
* **Robust Library Discovery:** Automatically finds your library and has a built-in fallback (e.g. from "Delte dokumenter" to "Documents").
* **Resume Download:** Supports HTTP `Range` headers for resuming large files.
* **Auto-Refresh of Downloads & Tokens:** Automatically renews sessions and links mid-process without unnecessary waiting (Optimized 401 handling).
* **Intelligent Error Handling:** Includes retry logic with exponential backoff and specialized handling of expired tokens (safe_graph_get).

## Installation
@@ -23,44 +27,23 @@ Dette script gør det muligt at downloade specifikke mapper fra et SharePoint do
pip install -r requirements.txt
```

## Setup in Microsoft Entra ID (Azure AD)

For the script to access SharePoint, you must create an App registration:

1. Log in to the [Microsoft Entra admin center](https://entra.microsoft.com/).
2. Go to **Identity** > **Applications** > **App registrations** > **New registration**.
3. Give the app a name (e.g. "SharePoint Download Tool") and select "Accounts in this organizational directory only". Click **Register**.
4. Note your **Application (client) ID** and **Directory (tenant) ID**.
5. Go to **API permissions** > **Add a permission** > **Microsoft Graph**.
6. Select **Application permissions**.
7. Search for and add `Sites.Read.All` (or `Sites.ReadWrite.All` if you need write access).
8. **IMPORTANT:** Click **Grant admin consent for [your domain]** to approve the permissions.
9. Go to **Certificates & secrets** > **New client secret**. Add a description and choose an expiration date.
10. **IMPORTANT:** Copy the value under **Value** immediately (this is your `CLIENT_SECRET`). You cannot view it again later.
> **Note:** The `quickxorhash` library has been removed from the default requirements to avoid problems with C++ Build Tools on Windows. The tool works without it, since it has a built-in Python fallback. If you need very fast hash validation of very large files (GB range), you can install it manually with `pip install quickxorhash`.

## Usage

There are two ways to run the tool:

### 1. GUI Version (Recommended)
For a modern graphical user interface, run:
```bash
python sharepoint_gui.py
```
Here you can easily enter settings, save them, choose a destination folder, and start/stop the synchronization.
Run: `python sharepoint_gui.py`

### 2. CLI Version (For automation)
If you want to run the script directly from the terminal:
1. Copy `connection_info.template.txt` to `connection_info.txt`.
2. Fill in your details.
3. Run:
   ```bash
   python download_sharepoint.py
   ```
Run: `python download_sharepoint.py`

## Log files
* `sharepoint_download.log`: Technical log of all actions and errors.
* `download_report_YYYYMMDD_HHMMSS.csv`: A quick overview of files that failed.
## Configuration (connection_info.txt)
* `ENABLE_HASH_VALIDATION`: Set to `"True"` or `"False"`.
* `HASH_THRESHOLD_MB`: Numeric value (e.g. `"30"` or `"50"`).

## Status
**Assessment:** ✅ **Production-ready (Enterprise-grade)**
This tool is thoroughly tested and optimized for professional use. It handles complex scenarios such as deep folder structures (Long Path), cloud throttling, resumable downloads, and intelligent high-precision timestamp synchronization.

## Security
Remember that `.gitignore` is set up to ignore `connection_info.txt`, so your credentials are not uploaded to Git.
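The README's skip/resume rules (matching size plus a timestamp within 1 second means skip; a smaller, still-current local file means resume via `Range`) can be condensed into one decision function. This is a sketch under stated assumptions: `plan_download`, `parse_graph_timestamp`, and the action labels are hypothetical names, and the function only decides what to do; it performs no I/O.

```python
from datetime import datetime

TOLERANCE_SECONDS = 1  # allowance for filesystem timestamp precision


def parse_graph_timestamp(value: str) -> float:
    """Convert an ISO 8601 UTC string such as 2024-03-29T12:00:00Z to a Unix timestamp."""
    return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()


def plan_download(existing_size, expected_size, local_mtime, remote_mtime_str):
    """Return (action, headers) where action is 'skip', 'resume' or 'restart'."""
    remote_mtime = parse_graph_timestamp(remote_mtime_str)
    local_is_current = local_mtime >= remote_mtime - TOLERANCE_SECONDS

    if existing_size == expected_size and local_is_current:
        return "skip", {}
    if existing_size < expected_size and local_is_current:
        # Ask the server for the missing tail only.
        return "resume", {"Range": f"bytes={existing_size}-"}
    # Remote changed, or the local file is larger than expected: start over.
    return "restart", {}


remote = "2024-03-29T12:00:00Z"
remote_ts = parse_graph_timestamp(remote)
up_to_date = plan_download(100, 100, remote_ts, remote)
partial = plan_download(40, 100, remote_ts - 0.5, remote)
stale = plan_download(100, 100, remote_ts - 10, remote)
oversized = plan_download(150, 100, remote_ts, remote)
```

Keeping the decision separate from the download loop makes the tolerance and the three outcomes easy to test in isolation.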
@@ -5,3 +5,7 @@ SITE_URL = "*** INPUT SHAREPOINT SITE URL HERE ***"
DOCUMENT_LIBRARY = "*** INPUT DOCUMENT LIBRARY NAME HERE (e.g. Documents) ***"
FOLDERS_TO_DOWNLOAD = "*** INPUT FOLDERS TO DOWNLOAD (Comma separated). LEAVE EMPTY TO DOWNLOAD ENTIRE LIBRARY ***"
LOCAL_PATH = "*** INPUT LOCAL DESTINATION PATH HERE ***"

# Hash Validation Settings
ENABLE_HASH_VALIDATION = "True"
HASH_THRESHOLD_MB = "30"
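To show how the two new template settings might be read, here is a minimal sketch that parses `KEY = "value"` lines from a string and applies the defaults named in the docs (validation on, 30 MB threshold). `parse_config_text` and the sample text are hypothetical; the repository's `load_config` reads from a file instead.

```python
def parse_config_text(text: str) -> dict:
    """Parse KEY = "value" lines and coerce the two hash-validation settings."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip().strip('"')

    # Boolean: anything other than "true" (case-insensitive) disables validation.
    enabled = config.get("ENABLE_HASH_VALIDATION", "True")
    config["ENABLE_HASH_VALIDATION"] = enabled.lower() == "true"

    # Integer: fall back to 30 MB when the value is missing or not a number.
    try:
        config["HASH_THRESHOLD_MB"] = int(config.get("HASH_THRESHOLD_MB", "30"))
    except ValueError:
        config["HASH_THRESHOLD_MB"] = 30
    return config


sample = '''
# Hash Validation Settings
ENABLE_HASH_VALIDATION = "False"
HASH_THRESHOLD_MB = "not-a-number"
'''
parsed = parse_config_text(sample)
defaults = parse_config_text("")
```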
@@ -6,6 +6,10 @@ import threading
import logging
import base64
import struct
try:
    import quickxorhash as qxh_lib
except ImportError:
    qxh_lib = None
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from msal import ConfidentialClientApplication
@@ -15,6 +19,7 @@ from urllib.parse import urlparse, quote
MAX_WORKERS = 5
MAX_RETRIES = 5
CHUNK_SIZE = 1024 * 1024 # 1MB Chunks
MAX_FOLDER_DEPTH = 50
LOG_FILE = "sharepoint_download.log"

# Setup Logging
@@ -30,10 +35,21 @@ logger = logging.getLogger(__name__)
report_lock = threading.Lock()

def format_size(size_bytes):
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
    for unit in ['B', 'KB', 'MB', 'GB', 'TB', 'PB']:
        if size_bytes < 1024.0:
            return f"{size_bytes:.2f} {unit}"
        size_bytes /= 1024.0
    return f"{size_bytes:.2f} EB"

def get_long_path(path):
    r"""Handles Windows Long Path limitation by prefixing with \\?\ for absolute paths.
    Correctly handles UNC paths (e.g. \\server\share -> \\?\UNC\server\share)."""
    path = os.path.abspath(path)
    if os.name == 'nt' and not path.startswith("\\\\?\\"):
        if path.startswith("\\\\"):
            return "\\\\?\\UNC\\" + path[2:]
        return "\\\\?\\" + path
    return path

def load_config(file_path):
    config = {}
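The prefixing rule in `get_long_path` can be exercised on any operating system if it is written as a pure string transformation. The sketch below is a hypothetical variant (`to_extended_path` is not in the repository): it assumes the input is already an absolute Windows-style path and skips the `os.path.abspath` and `os.name` checks that the real function performs.

```python
EXTENDED_PREFIX = "\\\\?\\"        # the literal characters \\?\
UNC_PREFIX = "\\\\?\\UNC\\"        # the literal characters \\?\UNC\


def to_extended_path(absolute_path: str) -> str:
    """Add the Windows extended-length prefix to an absolute path string."""
    if absolute_path.startswith(EXTENDED_PREFIX):
        return absolute_path  # already prefixed, leave untouched
    if absolute_path.startswith("\\\\"):
        # UNC share: \\server\share\... becomes \\?\UNC\server\share\...
        return UNC_PREFIX + absolute_path[2:]
    return EXTENDED_PREFIX + absolute_path


local = to_extended_path(r"C:\data\report.txt")
network = to_extended_path(r"\\server\share\report.txt")
unchanged = to_extended_path(local)
```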
@@ -44,6 +60,21 @@ def load_config(file_path):
            if '=' in line:
                key, value = line.split('=', 1)
                config[key.strip()] = value.strip().strip('"')

    # Parse numeric and boolean values
    if 'ENABLE_HASH_VALIDATION' in config:
        config['ENABLE_HASH_VALIDATION'] = config['ENABLE_HASH_VALIDATION'].lower() == 'true'
    else:
        config['ENABLE_HASH_VALIDATION'] = True

    if 'HASH_THRESHOLD_MB' in config:
        try:
            config['HASH_THRESHOLD_MB'] = int(config['HASH_THRESHOLD_MB'])
        except ValueError:
            config['HASH_THRESHOLD_MB'] = 30
    else:
        config['HASH_THRESHOLD_MB'] = 30

    return config

# --- Point 1: Exponential Backoff & Retry Logic ---
@@ -62,24 +93,84 @@ def retry_request(func):
                response.raise_for_status()
                return response
            except requests.exceptions.RequestException as e:
                # If it is a 401, do not wait/retry here, since the token/URL has probably expired
                if isinstance(e, requests.exceptions.HTTPError) and e.response is not None and e.response.status_code == 401:
                    raise e

                retries += 1
                wait = 2 ** retries
                if retries >= MAX_RETRIES:
                    raise e
                logger.error(f"Request failed: {e}. Retrying in {wait}s...")
                time.sleep(wait)
        return None
        raise requests.exceptions.RetryError(f"Max retries ({MAX_RETRIES}) exceeded.")
    return wrapper

@retry_request
def safe_get(url, headers, stream=False, timeout=60, params=None):
    return requests.get(url, headers=headers, stream=stream, timeout=timeout, params=params)

# --- Point 4: Integrity Validation (QuickXorHash - Placeholder for full logic) ---
def verify_integrity(local_path, remote_hash):
    """Placeholder for QuickXorHash verification."""
    if not remote_hash:
        return True # Fallback to size check
def safe_graph_get(app, url):
    """Specialized helper for Graph API calls that handles 401 by refreshing tokens."""
    try:
        return safe_get(url, headers=get_headers(app))
    except requests.exceptions.HTTPError as e:
        if e.response is not None and e.response.status_code == 401:
            logger.info("Access Token expired during Graph call. Forcing refresh...")
            return safe_get(url, headers=get_headers(app, force_refresh=True))
        raise

# --- Point 4: Integrity Validation (QuickXorHash) ---
def quickxorhash(file_path):
    """Compute Microsoft QuickXorHash for a file. Returns base64-encoded string.
    Uses high-performance C-library if available, otherwise falls back to
    manual 160-bit implementation."""

    # 1. Try the fast C library if it is installed
    if qxh_lib:
        hasher = qxh_lib.quickxorhash()
        with open(get_long_path(file_path), 'rb') as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk: break
                hasher.update(chunk)
        return base64.b64encode(hasher.digest()).decode('ascii')

    # 2. Fall back to the manual Python implementation (exact but slower)
    h = 0
    length = 0
    mask = (1 << 160) - 1
    with open(get_long_path(file_path), 'rb') as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk: break
            for b in chunk:
                shift = (length * 11) % 160
                shifted = b << shift
                wrapped = (shifted & mask) | (shifted >> 160)
                h ^= wrapped
                length += 1
    h ^= (length << (160 - 64))
    result = h.to_bytes(20, byteorder='little')
    return base64.b64encode(result).decode('ascii')

def verify_integrity(local_path, remote_hash, config):
    """Verifies file integrity based on config settings."""
    if not remote_hash or not config.get('ENABLE_HASH_VALIDATION', True):
        return True

    file_size = os.path.getsize(get_long_path(local_path))
    threshold_mb = config.get('HASH_THRESHOLD_MB', 30)
    threshold_bytes = threshold_mb * 1024 * 1024

    if file_size > threshold_bytes:
        logger.info(f"Skipping hash check (size > {threshold_mb}MB): {os.path.basename(local_path)}")
        return True

    local_hash = quickxorhash(local_path)
    if local_hash != remote_hash:
        logger.warning(f"Hash mismatch for {local_path}: local={local_hash}, remote={remote_hash}")
        return False
    return True

def get_headers(app, force_refresh=False):
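The refresh-once pattern that `safe_graph_get` introduces (try with the cached token, and on a 401 retry exactly once with a forced refresh) can be demonstrated without MSAL or HTTP. In this sketch `Unauthorized`, `call_with_token_refresh`, and the fake token store are all hypothetical stand-ins; only the control flow mirrors the hunk above.

```python
class Unauthorized(Exception):
    """Hypothetical stand-in for an HTTP 401 error."""


def call_with_token_refresh(fetch, get_token):
    """Call fetch(token); on Unauthorized, refresh the token and try once more."""
    try:
        return fetch(get_token(force_refresh=False))
    except Unauthorized:
        # A second Unauthorized propagates: one refresh is all we attempt.
        return fetch(get_token(force_refresh=True))


# Usage with a fake token store whose cached token has expired.
issued = []


def get_token(force_refresh=False):
    token = "fresh-token" if force_refresh else "stale-token"
    issued.append(token)
    return token


def fetch(token):
    if token != "fresh-token":
        raise Unauthorized("token expired")
    return {"status": 200, "token": token}


response = call_with_token_refresh(fetch, get_token)
```

Limiting the retry to a single forced refresh avoids looping forever when the credentials themselves are invalid.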
@@ -100,15 +191,30 @@ def get_headers(app, force_refresh=False):
def get_site_id(app, site_url):
    parsed = urlparse(site_url)
    url = f"https://graph.microsoft.com/v1.0/sites/{parsed.netloc}:{parsed.path}"
    response = safe_get(url, headers=get_headers(app))
    response = safe_graph_get(app, url)
    return response.json()['id']

def get_drive_id(app, site_id, drive_name):
    url = f"https://graph.microsoft.com/v1.0/sites/{site_id}/drives"
    response = safe_get(url, headers=get_headers(app))
    for drive in response.json().get('value', []):
        if drive['name'] == drive_name: return drive['id']
    raise Exception(f"Drive {drive_name} not found")
    response = safe_graph_get(app, url)
    drives = response.json().get('value', [])

    # Try an exact match
    for drive in drives:
        if drive['name'] == drive_name:
            return drive['id']

    # Try falling back to "Documents" if "Delte dokumenter" fails (SharePoint default)
    if drive_name == "Delte dokumenter":
        for drive in drives:
            if drive['name'] == "Documents":
                logger.info("Found 'Documents' as fallback for 'Delte dokumenter'")
                return drive['id']

    # Log the available names to help the user
    available_names = [d['name'] for d in drives]
    logger.error(f"Drive '{drive_name}' not found. Available drives on this site: {available_names}")
    raise Exception(f"Drive {drive_name} not found. Check the log for available drive names.")

# --- Point 2: Resume / Chunked Download logic ---
def get_fresh_download_url(app, drive_id, item_id):
@@ -119,53 +225,79 @@ def get_fresh_download_url(app, drive_id, item_id):
        try:
            headers = get_headers(app)
            response = requests.get(url, headers=headers, timeout=60)

            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
                logger.warning(f"Throttled (429) in get_fresh_download_url. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue

            if response.status_code == 401:
                logger.info(f"Access Token expired during refresh (Attempt {attempt+1}). Forcing refresh...")
                headers = get_headers(app, force_refresh=True)
                response = requests.get(url, headers=headers, timeout=60)

            response.raise_for_status()
            data = response.json()
            download_url = data.get('@microsoft.graph.downloadUrl')

            if download_url:
                return download_url, None

            # If item exists but URL is missing, it might be a transient SharePoint issue
            logger.warning(f"Attempt {attempt+1}: '@microsoft.graph.downloadUrl' missing for {item_id}. Retrying in 1s...")
            time.sleep(1)

            logger.warning(f"Attempt {attempt+1}: '@microsoft.graph.downloadUrl' missing for {item_id}. Retrying in {2 ** attempt}s...")
            time.sleep(2 ** attempt)

        except Exception as e:
            if attempt == 2:
                return None, str(e)
            logger.warning(f"Attempt {attempt+1} failed: {e}. Retrying...")
            time.sleep(1)
            logger.warning(f"Attempt {attempt+1} failed: {e}. Retrying in {2 ** attempt}s...")
            time.sleep(2 ** attempt)

    return None, "Item returned but '@microsoft.graph.downloadUrl' was missing after 3 attempts."

def download_single_file(app, drive_id, item_id, local_path, expected_size, display_name, remote_hash=None, initial_url=None):
def download_single_file(app, drive_id, item_id, local_path, expected_size, display_name, config, stop_event=None, remote_hash=None, initial_url=None, remote_mtime_str=None):
    try:
        if stop_event and stop_event.is_set():
            raise InterruptedError("Sync cancelled")

        file_mode = 'wb'
        resume_header = {}
        existing_size = 0
        download_url = initial_url

        long_local_path = get_long_path(local_path)

        if os.path.exists(local_path):
            existing_size = os.path.getsize(local_path)
        if os.path.exists(long_local_path):
            existing_size = os.path.getsize(long_local_path)
            local_mtime = os.path.getmtime(long_local_path)

            # Convert SharePoint ISO8601 UTC time (e.g. 2024-03-29T12:00:00Z) to a Unix timestamp
            remote_mtime = datetime.fromisoformat(remote_mtime_str.replace('Z', '+00:00')).timestamp()

            # If the file exists, has the right size AND the local copy is not older than remote -> SKIP
            if existing_size == expected_size:
                logger.info(f"Skipped (complete): {display_name}")
                return True, None
                if local_mtime >= (remote_mtime - 1): # Allow a 1-second difference due to filesystem precision
                    logger.info(f"Skipped (up-to-date): {display_name}")
                    return True, None
                else:
                    logger.info(f"Update available: {display_name} (Remote is newer)")
                    existing_size = 0
            elif existing_size < expected_size:
                logger.info(f"Resuming: {display_name} from {format_size(existing_size)}")
                resume_header = {'Range': f'bytes={existing_size}-'}
                file_mode = 'ab'
                # On resume, also check whether the source has changed since we started
                if local_mtime < (remote_mtime - 1):
                    logger.warning(f"Remote file changed during partial download: {display_name}. Restarting.")
                    existing_size = 0
                else:
                    logger.info(f"Resuming: {display_name} from {format_size(existing_size)}")
                    resume_header = {'Range': f'bytes={existing_size}-'}
                    file_mode = 'ab'
            else:
                logger.warning(f"Local file larger than remote: {display_name}. Overwriting.")
                existing_size = 0

        logger.info(f"Starting: {display_name} ({format_size(expected_size)})")
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        os.makedirs(os.path.dirname(long_local_path), exist_ok=True)

        # Initial download attempt
        if not download_url:
@@ -173,28 +305,30 @@ def download_single_file(app, drive_id, item_id, local_path, expected_size, disp
            if not download_url:
                return False, f"Could not fetch initial URL: {err}"

        response = requests.get(download_url, headers=resume_header, stream=True, timeout=120)

        # Handle 401 Unauthorized from SharePoint (expired download link)
        if response.status_code == 401:
            logger.warning(f"URL expired for {display_name}. Fetching fresh URL...")
            download_url, err = get_fresh_download_url(app, drive_id, item_id)
            if not download_url:
                return False, f"Failed to refresh download URL: {err}"
            # Retry download with new URL
        try:
            response = safe_get(download_url, resume_header, stream=True, timeout=120)

        response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            if e.response is not None and e.response.status_code == 401:
                # Handle 401 Unauthorized from SharePoint (expired download link)
                logger.warning(f"URL expired for {display_name}. Fetching fresh URL...")
                download_url, err = get_fresh_download_url(app, drive_id, item_id)
                if not download_url:
                    return False, f"Failed to refresh download URL: {err}"
                response = safe_get(download_url, resume_header, stream=True, timeout=120)
            else:
                raise

        with open(local_path, file_mode) as f:
        with open(long_local_path, file_mode) as f:
            for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                if stop_event and stop_event.is_set():
                    raise InterruptedError("Sync cancelled")
                if chunk:
                    f.write(chunk)

        # Post-download check
        final_size = os.path.getsize(local_path)
        final_size = os.path.getsize(long_local_path)
        if final_size == expected_size:
            if verify_integrity(local_path, remote_hash):
            if verify_integrity(local_path, remote_hash, config):
                logger.info(f"DONE: {display_name}")
                return True, None
            else:
@@ -202,13 +336,20 @@ def download_single_file(app, drive_id, item_id, local_path, expected_size, disp
        else:
            return False, f"Size mismatch: Remote={expected_size}, Local={final_size}"

    except InterruptedError:
        raise
    except Exception as e:
        return False, str(e)

# --- Main Traversal Logic ---
def process_item_list(app, drive_id, item_path, local_root_path, report, executor, futures):
def process_item_list(app, drive_id, item_path, local_root_path, report, executor, futures, config, stop_event=None, depth=0):
    if depth >= MAX_FOLDER_DEPTH:
        logger.warning(f"Max folder depth ({MAX_FOLDER_DEPTH}) reached at: {item_path}. Skipping subtree.")
        return
    try:
        auth_headers = get_headers(app)
        if stop_event and stop_event.is_set():
            raise InterruptedError("Sync cancelled")

        encoded_path = quote(item_path)

        if not item_path:
@@ -217,34 +358,38 @@ def process_item_list(app, drive_id, item_path, local_root_path, report, executo
            url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root:/{encoded_path}:/children"

        while url:
            response = safe_get(url, headers=auth_headers)
            response = safe_graph_get(app, url)
            data = response.json()
            items = data.get('value', [])

            for item in items:
                if stop_event and stop_event.is_set():
                    raise InterruptedError("Sync cancelled")

                item_name = item['name']
                local_path = os.path.join(local_root_path, item_name)
                display_path = f"{item_path}/{item_name}".strip('/')

                if 'folder' in item:
                    process_item_list(app, drive_id, display_path, local_path, report, executor, futures)
                    process_item_list(app, drive_id, display_path, local_path, report, executor, futures, config, stop_event, depth + 1)
                elif 'file' in item:
                    item_id = item['id']
                    download_url = item.get('@microsoft.graph.downloadUrl')
                    remote_hash = item.get('file', {}).get('hashes', {}).get('quickXorHash')
                    remote_mtime = item.get('lastModifiedDateTime')

                    future = executor.submit(
                        download_single_file,
                        app, drive_id, item_id,
                        local_path, item['size'], display_path,
                        remote_hash, download_url
                        config, stop_event, remote_hash, download_url, remote_mtime
                    )
                    futures[future] = display_path

            url = data.get('@odata.nextLink')
            if url:
                auth_headers = get_headers(app)

    except InterruptedError:
        raise
    except Exception as e:
        logger.error(f"Error traversing {item_path}: {e}")
        with report_lock:
@@ -255,9 +400,11 @@ def create_msal_app(tenant_id, client_id, client_secret):
         client_id, authority=f"https://login.microsoftonline.com/{tenant_id}", client_credential=client_secret
     )
 
-def main():
+def main(config=None, stop_event=None):
     try:
-        config = load_config('connection_info.txt')
+        if config is None:
+            config = load_config('connection_info.txt')
+
         tenant_id = config.get('TENANT_ID', '')
         client_id = config.get('CLIENT_ID', '')
         client_secret = config.get('CLIENT_SECRET', '')
@@ -277,18 +424,29 @@ def main():
         with ThreadPoolExecutor(max_workers=MAX_WORKERS, thread_name_prefix="DL") as executor:
             futures = {}
             for folder in folders:
+                if stop_event and stop_event.is_set():
+                    break
                 logger.info(f"Scanning: {folder or 'Root'}")
-                process_item_list(app, drive_id, folder, os.path.join(local_base, folder), report, executor, futures)
+                process_item_list(app, drive_id, folder, os.path.join(local_base, folder), report, executor, futures, config, stop_event)
 
             logger.info(f"Scan complete. Processing {len(futures)} tasks...")
             for future in as_completed(futures):
+                if stop_event and stop_event.is_set():
+                    break
                 path = futures[future]
-                success, error = future.result()
-                if not success:
-                    logger.error(f"FAILED: {path} | {error}")
-                    with report_lock:
-                        report.append({"Path": path, "Error": error, "Timestamp": datetime.now().isoformat()})
+                try:
+                    success, error = future.result()
+                    if not success:
+                        logger.error(f"FAILED: {path} | {error}")
+                        with report_lock:
+                            report.append({"Path": path, "Error": error, "Timestamp": datetime.now().isoformat()})
+                except InterruptedError:
+                    continue  # The executor will shut down anyway
 
+        if stop_event and stop_event.is_set():
+            logger.warning("Synchronization was stopped by user.")
+            return
+
         report_file = f"download_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
         with open(report_file, 'w', newline='', encoding='utf-8') as f:
             writer = csv.DictWriter(f, fieldnames=["Path", "Error", "Timestamp"])
@@ -297,6 +455,8 @@ def main():
 
         logger.info(f"Sync complete. Errors: {len(report)}. Report: {report_file}")
 
+    except InterruptedError:
+        logger.warning("Synchronization was stopped by user.")
     except Exception as e:
         logger.critical(f"FATAL ERROR: {e}")
 
@@ -9,16 +9,6 @@ import requests
 # --- Global Stop Flag ---
 stop_event = threading.Event()
 
-# To stop without modifying download_sharepoint.py, we "patch" requests.get
-# so that it checks stop_event before every request.
-original_get = requests.get
-def patched_get(*args, **kwargs):
-    if stop_event.is_set():
-        raise InterruptedError("Synkronisering afbrudt af brugeren.")
-    return original_get(*args, **kwargs)
-
-requests.get = patched_get
-
 # --- Logging Handler for GUI ---
 class TextboxHandler(logging.Handler):
     def __init__(self, textbox):
@@ -41,7 +31,7 @@ class SharepointApp(ctk.CTk):
         super().__init__()
 
         self.title("SharePoint Download Tool - UX")
-        self.geometry("900x750")
+        self.geometry("1000x850")  # Made slightly wider and taller to make room
         ctk.set_appearance_mode("dark")
         ctk.set_default_color_theme("blue")
 
@@ -51,7 +41,7 @@ class SharepointApp(ctk.CTk):
         # Sidebar
         self.sidebar_frame = ctk.CTkFrame(self, width=350, corner_radius=0)
         self.sidebar_frame.grid(row=0, column=0, sticky="nsew")
-        self.sidebar_frame.grid_rowconfigure(20, weight=1)
+        self.sidebar_frame.grid_rowconfigure(25, weight=1)
 
         self.logo_label = ctk.CTkLabel(self.sidebar_frame, text="Indstillinger", font=ctk.CTkFont(size=20, weight="bold"))
         self.logo_label.grid(row=0, column=0, padx=20, pady=(20, 10))
@@ -64,22 +54,24 @@ class SharepointApp(ctk.CTk):
             ("SITE_URL", "Site URL"),
             ("DOCUMENT_LIBRARY", "Library Navn"),
             ("FOLDERS_TO_DOWNLOAD", "Mapper (komma-sep)"),
-            ("LOCAL_PATH", "Lokal Sti")
+            ("LOCAL_PATH", "Lokal Sti"),
+            ("ENABLE_HASH_VALIDATION", "Valider Hash (True/False)"),
+            ("HASH_THRESHOLD_MB", "Hash Grænse (MB)")
         ]
 
         for i, (key, label) in enumerate(fields):
             lbl = ctk.CTkLabel(self.sidebar_frame, text=label)
-            lbl.grid(row=i*2+1, column=0, padx=20, pady=(10, 0), sticky="w")
+            lbl.grid(row=i*2+1, column=0, padx=20, pady=(5, 0), sticky="w")
             entry = ctk.CTkEntry(self.sidebar_frame, width=280)
             if key == "CLIENT_SECRET": entry.configure(show="*")
             entry.grid(row=i*2+2, column=0, padx=20, pady=(0, 5))
             self.entries[key] = entry
 
         self.browse_button = ctk.CTkButton(self.sidebar_frame, text="Vælg Mappe", command=self.browse_folder, height=32)
-        self.browse_button.grid(row=15, column=0, padx=20, pady=10)
+        self.browse_button.grid(row=20, column=0, padx=20, pady=10)
 
         self.save_button = ctk.CTkButton(self.sidebar_frame, text="Gem Indstillinger", command=self.save_settings, fg_color="transparent", border_width=2)
-        self.save_button.grid(row=16, column=0, padx=20, pady=10)
+        self.save_button.grid(row=21, column=0, padx=20, pady=10)
 
         # Main side
         self.main_frame = ctk.CTkFrame(self, corner_radius=0, fg_color="transparent")
@@ -147,7 +139,8 @@ class SharepointApp(ctk.CTk):
 
     def run_sync(self):
         try:
-            download_sharepoint.main()
+            config = download_sharepoint.load_config("connection_info.txt")
+            download_sharepoint.main(config=config, stop_event=stop_event)
             if stop_event.is_set():
                 self.status_label.configure(text="Status: Afbrudt", text_color="red")
             else: