Hi, whoever experiences a similar problem: below is a description of it and the procedure to "solve" it. Until it is fixed in core, the procedure is only a workaround, but it should let you keep the disk at a proper size.
Let me thank Ken Task and Visvanath Ratnaweera here for their initial guidance.
I did the work below with the full help of AI, so I don't want to take the credit for it.
Moodle: Massive Disk Growth Caused by Orphaned H5P Export Files
Summary
A Moodle installation (version 4.5, released October 2024) has been experiencing significant and unexpected disk growth in the `moodledata/filedir` directory over the past year. Investigation revealed that the primary cause is a large accumulation of orphaned H5P export files that are never cleaned up automatically.
---
Background: How Moodle Stores Files
Moodle stores all files in a content-addressable storage system under `moodledata/filedir`. Each file is stored once on disk, named by its SHA1 hash of the content. Multiple database records in `mdl_files` can reference the same physical file via `contenthash` — this is Moodle's built-in deduplication mechanism.
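As an illustration, the on-disk location can be derived from the content alone. The layout below (first two hex characters, next two, then the full hash) matches Moodle's `filedir` scheme; the sample content is of course made up:

```shell
# Derive the filedir location of a blob from the SHA1 hash of its content.
hash=$(printf 'hello world' | sha1sum | awk '{print $1}')
echo "contenthash: $hash"
echo "stored at:   moodledata/filedir/${hash:0:2}/${hash:2:2}/${hash}"
```

Because the path depends only on the bytes, uploading the same file twice produces one physical copy, no matter how many `mdl_files` rows point at it.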
---
How H5P Export Files Are Generated
When an H5P activity is viewed for the first time, Moodle automatically generates an export `.h5p` file (a ZIP package) and stores it in `mdl_files` with:
- `component = core_h5p`
- `filearea = export`
- `contextlevel = 10` (system context)
This export file is generated once and cached. It is not regenerated on subsequent views unless the cache is cleared. This behavior is by design and functions correctly.
---
The Problem: Orphaned Export Files Never Deleted
When an H5P activity is deleted from a course, Moodle removes the activity record and its associated entries from `mdl_h5p` — but **does not delete the corresponding export file** from `mdl_files` or from disk.
The scheduled task `\core\task\h5p_clean_orphaned_records_task` exists in Moodle's cron system but does **not** clean up these orphaned export file records from `mdl_files`.
Scale of the Problem
Querying `mdl_files` for export files whose `pathnamehash` no longer exists in `mdl_h5p`:
```sql
SELECT COUNT(*), SUM(filesize)/1024/1024 AS MB
FROM mdl_files
WHERE component = 'core_h5p'
AND filearea = 'export'
AND pathnamehash NOT IN (SELECT pathnamehash FROM mdl_h5p);
```
**Result: ~17,800 orphaned records occupying approximately 8.5 GB on disk.**
These files have accumulated since the platform started scaling (early 2023) and have never been cleaned up.
---
Contributing Factors
- The platform uses a model where each student has their own copy of a course (created via "Copy Course"). This means many H5P activities are created and later deleted as courses are retired.
- File deduplication works correctly for shared content (e.g., book chapters, resources) — the same physical file is referenced by many courses with no disk overhead.
- H5P export files, however, are unique per activity instance and are not deduplicated across courses. Each deleted H5P activity leaves behind its own orphaned export file.
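The dedup boundary follows directly from the hashing: byte-identical content always collapses to one `contenthash`, while any difference at all (and each generated `.h5p` ZIP differs per activity) yields a separate physical file. A small sketch with stand-in content:

```shell
# Identical bytes -> identical SHA1 -> one physical file in filedir;
# any difference -> a new hash -> a new physical file.
a=$(printf 'chapter text' | sha1sum | awk '{print $1}')
b=$(printf 'chapter text' | sha1sum | awk '{print $1}')
c=$(printf 'chapter text (per-activity export)' | sha1sum | awk '{print $1}')
[ "$a" = "$b" ] && echo "shared content: deduplicated"
[ "$a" != "$c" ] && echo "unique export: stored separately"
```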
---
Current Disk Usage Breakdown (last 12 months, actual unique files on disk)
| Component | File Area | Unique Files | Actual MB on Disk |
|---|---|---|---|
| core_h5p | export | 7,008 | ~3,133 MB |
| mod_resource | content | 57 | ~91 MB |
| core_h5p | libraries | 842 | ~6 MB |
| others | various | — | minimal |
Of the ~7,000 active export files, the majority are legitimate (one per existing H5P activity). The ~17,800 orphaned exports represent the disk waste.
---
What Is NOT the Problem
- File deduplication is working correctly for all other file types.
- H5P export caching works correctly — files are not regenerated on every view.
- The disk growth is not caused by course copying or student course proliferation (those files are properly deduplicated).
---
Proposed Solution
1. **Identify orphaned export records** in `mdl_files` where `pathnamehash` no longer exists in `mdl_h5p`.
2. **Delete the corresponding physical files** from `moodledata/filedir`.
3. **Delete the orphaned records** from `mdl_files`.
4. **Implement a regular cleanup** — either via a custom cron script or by confirming whether a future Moodle version addresses this in `h5p_clean_orphaned_records_task`.
**Estimated space recovery: ~8.5 GB**
---
Question for the Community
Has anyone else observed this behavior in Moodle 4.x? Is `h5p_clean_orphaned_records_task` supposed to handle cleanup of `mdl_files` export records, or is this a known gap? Is there a safe, supported way to clean up these orphaned export files without risking data integrity?
THE SOLUTION PROCEDURE
Long story short: a zip of /filedir was 9.2 GB before this procedure and 1.25 GB after.
Everything was tested afterwards; the platform works.
First, trace the problem down.
Create a script that tracks size changes inside /filedir and place it at:
/usr/local/bin/moodle_size_monitor.sh
(The script itself was AI work; I just prompted a lot.)
Script:

```bash
#!/bin/bash
# Configuration
MOODLE_DIR="/var/www/moodledata/filedir"
OUTPUT_DIR="/path/to/backups/moodle_backups"
CURRENT_DATE=$(date '+%Y-%m-%d')
OUTPUT_FILE="$OUTPUT_DIR/moodle_files_size_${CURRENT_DATE}.txt"
STATE_FILE="$OUTPUT_DIR/.moodle_size_state.db"
TEMP_CURRENT="$OUTPUT_DIR/.moodle_current_${CURRENT_DATE}.tmp"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Format a byte count as a human-readable size
format_size() {
    local size=$1
    if [ "$size" -ge 1073741824 ]; then
        awk "BEGIN {printf \"%.2f GB\", $size/1073741824}"
    elif [ "$size" -ge 1048576 ]; then
        awk "BEGIN {printf \"%.2f MB\", $size/1048576}"
    elif [ "$size" -ge 1024 ]; then
        awk "BEGIN {printf \"%.2f KB\", $size/1024}"
    else
        echo "${size} B"
    fi
}

# Check that the Moodle directory exists
if [ ! -d "$MOODLE_DIR" ]; then
    echo "ERROR: Directory $MOODLE_DIR does not exist!" | tee -a "$OUTPUT_FILE"
    exit 1
fi

echo "=== Starting Moodle directory scan: $(date '+%Y-%m-%d %H:%M:%S') ===" >&2
echo "This may take several minutes for large directories..." >&2

# Scan directory sizes; du aggregates per subdirectory, -maxdepth limits scan depth
echo "Calculating directory sizes (this may take a few minutes)..." >&2
find "$MOODLE_DIR" -maxdepth 3 -type d -exec du -sb {} \; 2>/dev/null | sort -rn > "$TEMP_CURRENT"

if [ ! -s "$TEMP_CURRENT" ]; then
    echo "ERROR: No data collected from $MOODLE_DIR" | tee -a "$OUTPUT_FILE"
    exit 1
fi

echo "Found $(wc -l < "$TEMP_CURRENT") directories to analyze..." >&2
echo "Scan completed. Analyzing changes..." >&2

# If this is the first run, save the initial state and exit
if [ ! -f "$STATE_FILE" ]; then
    echo "First run - saving initial state..." >&2
    cp "$TEMP_CURRENT" "$STATE_FILE"
    echo "=== FIRST RUN - $(date '+%Y-%m-%d %H:%M:%S') ===" > "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    echo "Initial state saved for $(wc -l < "$TEMP_CURRENT") directories." >> "$OUTPUT_FILE"
    echo "Next run will show size changes." >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    echo "Top 10 largest directories:" >> "$OUTPUT_FILE"
    head -10 "$TEMP_CURRENT" | while read -r size path; do
        echo "  $(format_size "$size") - $path" >> "$OUTPUT_FILE"
    done
    rm -f "$TEMP_CURRENT"
    exit 0
fi

# Compare states and report growth
echo "=== CHANGES DETECTED - $(date '+%Y-%m-%d %H:%M:%S') ===" > "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"

changes_found=0
total_increase=0

while read -r current_size current_path; do
    # Skip empty lines or malformed entries
    if [ -z "$current_size" ] || [ -z "$current_path" ] || ! [[ "$current_size" =~ ^[0-9]+$ ]]; then
        continue
    fi

    # Look up the previous size of the same path (du output is tab-separated)
    previous_size=$(awk -F'\t' -v p="$current_path" '$2 == p {print $1; exit}' "$STATE_FILE")

    if [ -z "$previous_size" ]; then
        # New directory
        size_diff=$current_size
        formatted_diff=$(format_size "$size_diff")
        formatted_current=$(format_size "$current_size")
        echo "[$formatted_diff] NEW DIRECTORY" >> "$OUTPUT_FILE"
        echo "  Location: $current_path" >> "$OUTPUT_FILE"
        echo "  Size: $formatted_current" >> "$OUTPUT_FILE"
        echo "" >> "$OUTPUT_FILE"
        changes_found=$((changes_found + 1))
        total_increase=$((total_increase + size_diff))
    elif [[ "$previous_size" =~ ^[0-9]+$ ]] && [ "$current_size" -gt "$previous_size" ]; then
        # Directory grew
        size_diff=$((current_size - previous_size))
        formatted_diff=$(format_size "$size_diff")
        formatted_previous=$(format_size "$previous_size")
        formatted_current=$(format_size "$current_size")
        echo "[$formatted_diff] GROWTH" >> "$OUTPUT_FILE"
        echo "  Location: $current_path" >> "$OUTPUT_FILE"
        echo "  Previous size: $formatted_previous" >> "$OUTPUT_FILE"
        echo "  Current size: $formatted_current" >> "$OUTPUT_FILE"
        echo "" >> "$OUTPUT_FILE"
        changes_found=$((changes_found + 1))
        total_increase=$((total_increase + size_diff))
    fi
done < "$TEMP_CURRENT"

# Summary
if [ $changes_found -eq 0 ]; then
    echo "=== NO CHANGES ===" >> "$OUTPUT_FILE"
    echo "No directories detected that increased in size." >> "$OUTPUT_FILE"
else
    echo "=== SUMMARY ===" >> "$OUTPUT_FILE"
    echo "Number of changed/new directories: $changes_found" >> "$OUTPUT_FILE"
    echo "Total increase: $(format_size "$total_increase")" >> "$OUTPUT_FILE"
fi

echo "" >> "$OUTPUT_FILE"
echo "========================================" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"

# Update the state file and clean up
cp "$TEMP_CURRENT" "$STATE_FILE"
rm -f "$TEMP_CURRENT"

echo "Analysis completed successfully. Results in: $OUTPUT_FILE" >&2
echo "Total changes: $changes_found directories, $(format_size "$total_increase") increase" >&2
```
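To collect data points over time, the monitor can be scheduled, for example daily. The 02:00 slot below is just a suggestion; add it via `crontab -e`:

```
# run the size monitor every day at 02:00
0 2 * * * /usr/local/bin/moodle_size_monitor.sh
```

Comparing two consecutive daily reports is what pointed us at `core_h5p` / `export` as the growing area.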
CLEANUP
Orphaned H5P Export Files — Root Cause, Investigation & Complete Cleanup Procedure
**Moodle version:** 4.5 (2024100710.03)
**Related bug:** MDL-70700 (open, unresolved as of Moodle 4.5)
---
The Problem
Our Moodle installation was experiencing significant and unexplained disk growth in the
`moodledata/filedir` directory. The platform uses a student-per-course model where each
student receives their own copy of a template course via the "Copy Course" admin function.
Investigation revealed that **17,869 orphaned H5P export files** were accumulating on disk,
occupying approximately **8.5 GB** with no automatic cleanup mechanism removing them.
---
How H5P Export Files Work
When an H5P activity is viewed for the first time, Moodle automatically generates an export
`.h5p` file and stores it in `mdl_files` with:
- `component = core_h5p`
- `filearea = export`
- `contextlevel = 10` (system context)
This is by design and functions correctly. The problem occurs when an H5P activity is
**deleted** — Moodle removes the activity record from `mdl_h5p` but does **not** delete
the corresponding export file from `mdl_files` or from disk.
The scheduled task `\core\task\h5p_clean_orphaned_records_task` exists but does **not**
handle cleanup of these orphaned export file records. This is the known gap described
in MDL-70700.
---
Confirming the Problem
Run this query to count orphaned export files — records in `mdl_files` whose `pathnamehash`
no longer exists in `mdl_h5p`:
```sql
SELECT COUNT(*) AS orphaned_records, SUM(filesize)/1024/1024 AS MB
FROM mdl_files
WHERE component = 'core_h5p'
AND filearea = 'export'
AND pathnamehash NOT IN (SELECT pathnamehash FROM mdl_h5p);
```
In our case this returned **17,869 records / 8,493 MB**.
---
## Important Notes Before Cleanup
**Do NOT use raw SQL DELETE + manual `rm` to clean these files.**
Moodle stores files using content-addressable storage — multiple `mdl_files` records can
point to the same physical file via `contenthash`. Deleting a physical file while other
records still reference it will break Moodle.
The only safe approach is **Moodle File API** (`stored_file->delete()`), which automatically
checks whether any other record references the same `contenthash` before removing the
physical file.
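The guarantee can be pictured as reference counting. The sketch below is a toy model of that behaviour, not Moodle code (the record names and hash are made up): a "record" is removed, and the "physical file" is unlinked only once no record references its hash anymore.

```shell
# Toy model of what stored_file->delete() guarantees: the physical blob is
# unlinked only when the last record referencing its hash is gone.
tmp=$(mktemp -d)
records="$tmp/records.txt"                  # stand-in for mdl_files rows
hash="abc123"
printf 'rec1 %s\nrec2 %s\n' "$hash" "$hash" > "$records"
touch "$tmp/$hash"                          # the single physical file

delete_record() {                           # usage: delete_record <recid> <hash>
    grep -v "^$1 " "$records" > "$records.tmp"; mv "$records.tmp" "$records"
    # unlink the blob only if no remaining record references the hash
    grep -q " $2\$" "$records" || rm -f "$tmp/$2"
}

delete_record rec1 "$hash"
[ -f "$tmp/$hash" ] && echo "blob kept: still referenced"
delete_record rec2 "$hash"
[ -f "$tmp/$hash" ] || echo "blob removed: last reference gone"
rm -rf "$tmp"
```

A raw `DELETE` plus `rm` skips exactly this check, which is why it can destroy files still referenced by live courses.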
---
## Complete Cleanup Procedure
Step 1 — Back up the database
```bash
mysqldump -u root -p your_moodle_database > /root/backup_before_h5p_cleanup.sql
```
Step 2 — Create the log directory
```bash
mkdir -p /var/log/h5p_cleanup
chown www-data:www-data /var/log/h5p_cleanup
chmod 750 /var/log/h5p_cleanup
```
Step 3 — Create the cleanup script
IMPORTANT: The script must be saved as plain ASCII, with no special characters in the file.
Use the `cat` heredoc method below exactly as shown, rather than a text editor: encoding
issues (such as a UTF-8 BOM) make the script fail with only the generic message
"Error reading from database".
```bash
cat > /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php << 'ENDOFSCRIPT'
<?php
define('CLI_SCRIPT', true);
require(__DIR__ . '/../../config.php');
require_once($CFG->libdir . '/clilib.php');

list($options, $unrecognized) = cli_get_params(
    array('delete' => false, 'limit' => 0, 'offset' => 0, 'days' => 0,
          'log' => '/var/log/h5p_cleanup/h5p_cleanup.csv', 'help' => false),
    array('d' => 'delete', 'l' => 'limit', 'o' => 'offset', 'D' => 'days', 'g' => 'log', 'h' => 'help')
);

if ($options['help']) {
    echo "Usage: php cleanup_h5p_exports.php [--delete] [--limit=N] [--offset=N] [--days=N] [--log=PATH]\n";
    echo "Default: dry-run. Add --delete to actually remove files.\n";
    exit(0);
}

global $DB;
$fs = get_file_storage();
$cutoff = ($options['days'] > 0) ? time() - ($options['days'] * 86400) : 0;

$sql = "SELECT f.id, f.contenthash, f.pathnamehash, f.filesize, f.timecreated
          FROM {files} f
     LEFT JOIN {h5p} h ON f.pathnamehash = h.pathnamehash
         WHERE f.component = :component
           AND f.filearea = :filearea
           AND h.pathnamehash IS NULL";
$params = ['component' => 'core_h5p', 'filearea' => 'export'];
if ($cutoff) {
    $sql .= " AND f.timecreated < :cutoff";
    $params['cutoff'] = $cutoff;
}
$sql .= " ORDER BY f.timecreated ASC";

$limitnum = (int)$options['limit'];
$limitfrom = (int)$options['offset'];
$records = $DB->get_records_sql($sql, $params, $limitfrom, $limitnum);

$total = count($records);
$sumbytes = array_sum(array_column((array)$records, 'filesize'));
echo "Records found: {$total}\n";
echo "Total size: " . round($sumbytes / 1024 / 1024, 2) . " MB\n";
echo ($options['delete'] ? "MODE: DELETE\n\n" : "MODE: DRY-RUN (nothing will be removed)\n\n");
if ($total === 0) {
    echo "Nothing to do.\n";
    exit(0);
}

$logpath = $options['log'];
$logdir = dirname($logpath);
if (!is_dir($logdir)) {
    mkdir($logdir, 0750, true);
}
$fh = fopen($logpath, 'a');
if (!$fh) {
    echo "ERROR: cannot open log file {$logpath}\n";
    exit(1);
}
if (filesize($logpath) === 0) {
    fputcsv($fh, ['timestamp', 'action', 'fileid', 'contenthash', 'filesize', 'timecreated', 'note']);
}

$counter = 0;
foreach ($records as $r) {
    $counter++;
    $tc = date('Y-m-d H:i:s', $r->timecreated);
    echo sprintf("%4d) fileid=%d size=%d created=%s\n", $counter, $r->id, $r->filesize, $tc);
    $logrow = [date('c'), ($options['delete'] ? 'delete_attempt' : 'dry_run'),
               $r->id, $r->contenthash, $r->filesize, $tc, 'dry-run'];
    if ($options['delete']) {
        $file = $fs->get_file_by_id($r->id);
        if ($file) {
            try {
                $file->delete();
                echo "  -> deleted\n";
                $logrow[1] = 'deleted';
                $logrow[6] = '';
            } catch (Exception $e) {
                echo "  -> ERROR: " . $e->getMessage() . "\n";
                $logrow[6] = 'error: ' . $e->getMessage();
            }
        } else {
            echo "  -> not found, skipping\n";
            $logrow[6] = 'not found';
        }
    }
    fputcsv($fh, $logrow);
}
fclose($fh);

echo "\n";
if ($options['delete']) {
    echo "Done. Log saved to: {$logpath}\n";
} else {
    echo "Dry-run complete. Log saved to: {$logpath}\nRun with --delete to remove files.\n";
}
ENDOFSCRIPT
```
Step 4 — Set permissions
```bash
chown root:www-data /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php
chmod 750 /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php
```
Step 5 — Verify the script file is plain ASCII
```bash
file /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php
```
Expected output: `PHP script, ASCII text`
If it says `Unicode text` — delete the file and recreate it using the `cat` command in
Step 3. A Unicode-encoded file will fail with "Error reading from database" and no further
error detail, which is misleading.
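One common cause of a `Unicode text` verdict is a UTF-8 byte-order mark (bytes `EF BB BF`) at the start of the file, which some editors add invisibly. A quick standalone check (the sample file here is deliberately created with a BOM for the demo):

```shell
# Detect a UTF-8 BOM at the start of a PHP file.
f=$(mktemp)
printf '\xef\xbb\xbf<?php echo 1;' > "$f"   # deliberately BOM-prefixed sample
if head -c 3 "$f" | od -An -tx1 | grep -q 'ef bb bf'; then
    echo "BOM present: recreate the file via the heredoc"
else
    echo "no BOM"
fi
rm -f "$f"
```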
Step 6 — Run a dry-run first
```bash
sudo -u www-data php /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php --limit=50 --log=/tmp/dryrun.csv
```
Review the output and the CSV log to confirm the files listed are genuinely orphaned exports.
Step 7 — Delete in batches
Run the script with `--delete` in progressively larger batches. Check remaining count
between batches if you want to monitor progress.
```bash
sudo -u www-data php /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php --delete --limit=100 --log=/var/log/h5p_cleanup/run1.csv
sudo -u www-data php /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php --delete --limit=1000 --log=/var/log/h5p_cleanup/run2.csv
sudo -u www-data php /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php --delete --limit=5000 --log=/var/log/h5p_cleanup/run3.csv
```
Repeat the last command until the remaining count reaches zero.
To check remaining count at any point, run in your database admin tool:
```sql
SELECT COUNT(*) AS remaining_orphans, SUM(filesize)/1024/1024 AS MB
FROM mdl_files
WHERE component = 'core_h5p'
AND filearea = 'export'
AND pathnamehash NOT IN (SELECT pathnamehash FROM mdl_h5p);
```
Step 8 — Verify disk space recovered
```bash
df -h /
du -sh /path/to/moodledata/filedir/
```
---
Results
After running the full cleanup on our installation:
- Orphaned records removed: **17,869**
- Disk space recovered: **~8.5 GB**
- Active export files remaining: **13** (legitimately cached, regenerated on demand)
- Moodle functioning normally — H5P activities, Content Bank, course copying all verified
---
Key Technical Notes
- The script uses `get_file_storage()` and `stored_file->delete()` — Moodle's own File API
— ensuring database and filesystem stay in sync
- The script must be run as the web server user (`www-data`) so Moodle's database
abstraction layer initialises correctly
- Moodle's `get_records_sql()` requires LIMIT to be passed via the `$limitfrom` and
`$limitnum` parameters, NOT as inline SQL `LIMIT :limit OFFSET :offset` — this is a
critical detail; using inline LIMIT causes a silent "Error reading from database" failure
- The `{files}` and `{h5p}` notation in the SQL lets Moodle handle your table prefix
automatically — no need to hardcode it
- Export files are regenerated automatically by Moodle the next time a user views an H5P
activity, so removing them causes no data loss
---
Workaround: Automated Cleanup via Cron
> **Note:** This is a workaround for MDL-70700, not a proper fix. The correct solution
> is for Moodle core to handle this in `h5p_clean_orphaned_records_task`. Until that
> happens, the cron job below prevents unbounded disk growth by removing export files
> older than 90 days every 3 months.
>
> Removing export files causes no data loss — Moodle regenerates them automatically
> the next time a user opens the H5P activity. The 90-day window is a reasonable
> assumption that students are unlikely to return to an activity after that period,
> and that the Download/Reuse buttons are not actively used.
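The `--days=90` flag maps to a simple epoch-seconds cutoff, the same `time() - days * 86400` computed in the PHP script above. A quick sketch of the arithmetic:

```shell
# How the --days flag becomes a timecreated cutoff (mirrors the PHP script).
days=90
now=$(date +%s)
cutoff=$((now - days * 86400))            # 90 days = 7,776,000 seconds
echo "cutoff epoch: $cutoff"
# records with timecreated < cutoff are eligible for deletion
```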
Step 1 — Create the cron wrapper script
```bash
cat > /usr/local/bin/moodle_h5p_cleanup.sh << 'ENDOFSCRIPT'
#!/bin/bash
MOODLE=/var/www/html/moodle
LOG=/var/log/h5p_cleanup/cron_$(date +%Y%m%d_%H%M%S).csv
PHPSCRIPT=${MOODLE}/admin/cli/cleanup_h5p_exports.php
echo "H5P export cleanup started: $(date)"
echo "Log: ${LOG}"
sudo -u www-data php ${PHPSCRIPT} --delete --days=90 --log=${LOG}
echo "H5P export cleanup finished: $(date)"
ENDOFSCRIPT
```
Step 2 — Set permissions
```bash
chmod 750 /usr/local/bin/moodle_h5p_cleanup.sh
```
Step 3 — Add cron job
```bash
crontab -e
```
Add this line at the bottom (runs at 03:00 on the 1st day of every 3rd month):
```
0 3 1 */3 * /usr/local/bin/moodle_h5p_cleanup.sh >> /var/log/h5p_cleanup/cron.log 2>&1
```
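For reference, the five scheduling fields in that crontab line read as follows:

```
0 3 1 */3 *
| | |  |  +-- day of week: any
| | |  +----- month: every 3rd (Jan, Apr, Jul, Oct)
| | +-------- day of month: 1st
| +---------- hour: 03
+------------ minute: 00
```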
Step 4 — Verify cron job is installed
```bash
crontab -l
```
Step 5 — Test manually before first scheduled run
You can verify the script runs without errors by doing a dry-run with a short age window:
```bash
sudo -u www-data php /var/www/html/moodle/admin/cli/cleanup_h5p_exports.php --days=1 --log=/tmp/cron_test.csv
```
This will show how many files would be removed if they were older than 1 day, without
actually deleting anything. If the script reports "Records found" or "Nothing to do"
— it is working correctly.
Note: if you run the full cron wrapper manually (`/usr/local/bin/moodle_h5p_cleanup.sh`),
it will execute with `--days=90` and will report "Nothing to do" if no files are older
than 90 days — this is expected and correct behaviour.
To review results of any scheduled run:
```bash
tail -20 /var/log/h5p_cleanup/cron.log
ls -lh /var/log/h5p_cleanup/
```
---
Request to the Community
Has anyone found a better way to prevent this accumulation going forward, short of
waiting for MDL-70700 to be resolved? The cron workaround above works but feels like
it should not be necessary. Any insight into whether a core fix is planned or if there
is a configuration option that prevents export file generation in the first place would
be welcome.