While working on a research project involving scanned documents, we ran into an issue: we had multiple duplicates of the same file under different names. The obvious solution was to deduplicate the files by hashing them, yet we found we still had multiple copies of the same document without a single hash collision. These weren't duplicate scans of the same physical document but files that appeared identical in every way, except that their MD5 hashes differed. How unusual!

After failing to find any visual differences in PDF viewers, I opened the files in a hex editor and discovered that the only difference between seemingly identical files was a pair of hash-like strings near the end of the file. Checking the official PDF specification (Section 7.5.5), I learned that this is the 'ID' of the file, which is stored in the file's trailer and changes every time the file is saved, even if the content isn't modified.

Deeper dive

The trailer of a PDF file contains, among other data, a series of key-value pairs enclosed in double angle brackets. The key I was interested in is 'ID', an array containing two byte strings. Per the documentation (Section 14.4), the first element is the ID of the original file and the second element is the ID of the current file. If both elements are identical, the PDF file has never been modified since it was created.

trailer
<< 
/ID [
    <778D4EA8E819D6270F74AC4151CE91E0>
    <45BFB43647181865852A03255F3B25FB> 
    ] 
/Root 1 0 R 
/Size 43 
/Prev 5067797 
>>
startxref 5069166
%%EOF
Example of a PDF file trailer (line breaks added for clarity).
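To make the structure concrete, here is a minimal sketch that pulls both IDs out of a file's raw bytes. `extract_pdf_ids` is my own name, and the regex only handles hex-string IDs like the ones above (the spec also permits literal strings, which this ignores):

```python
import re

def extract_pdf_ids(data: bytes):
    """Pull the two hex strings out of the /ID array, if present.
    Illustrative helper, not a standard API; returns None when no /ID is found."""
    m = re.search(rb'/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]', data)
    if m is None:
        return None
    return m.group(1).decode('ascii'), m.group(2).decode('ascii')
```

Comparing the two returned strings tells you whether the file has been modified since creation.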

This value is computed using a message digest algorithm (e.g. an MD5 hash). The specification suggests seeding it with the current time, a string representation of the file's location (usually a pathname), the size of the file in bytes, and the values of all entries in the file's document information dictionary. Because the current time is part of the input, the ID changes on every save even when the content does not.

This means that if you scan a document once but save it twice, the two copies will end up with different hashes. That is exactly what happened in our case.
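A rough sketch of that recipe, assuming the inputs suggested by the spec (`compute_pdf_id` and `info_dict` are illustrative names, not part of any real API):

```python
import hashlib
import os
import time

def compute_pdf_id(path, info_dict):
    """Hypothetical sketch of how a writer might compute a file ID.
    Seeds an MD5 digest with the inputs the PDF spec suggests."""
    h = hashlib.md5()
    h.update(str(time.time()).encode('ascii'))                     # current time
    h.update(os.path.abspath(path).encode('latin-1', 'replace'))   # file location
    h.update(str(os.path.getsize(path)).encode('ascii'))           # size in bytes
    for value in info_dict.values():                               # info dictionary values
        h.update(str(value).encode('latin-1', 'replace'))
    return h.hexdigest().upper()
```

Since the clock is part of the seed, calling this twice on the same unmodified file yields two different IDs, which is exactly the behavior we observed.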

Note

Different applications add different metadata to PDF files, which means the output of two different applications won't differ only in the /ID entry but also in other metadata. For example, the Preview app on macOS adds the following metadata when 'duplicating' a file:

<< /Producer (macOS Version 15.1 \(Build 24B2083\) Quartz PDFContext, AppendMode 1.1) /CreationDate (D:20250610185924Z00'00') /ModDate (D:20250610185945Z00'00') /Title (duplicate_pdf) /Creator (Pages) >>
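As a rough illustration, such fields can be pulled out of two files and diffed. `extract_info_fields` is a hypothetical helper; it only handles simple parenthesized literal strings with backslash escapes, not the full PDF string syntax:

```python
import re

def extract_info_fields(data: bytes, keys=(b'Producer', b'CreationDate', b'ModDate')):
    r"""Grab selected metadata values written as literal strings, e.g. /ModDate (D:2025...).
    Handles backslash-escaped characters such as \( and \), nothing fancier."""
    found = {}
    for key in keys:
        m = re.search(rb'/' + key + rb'\s*\(((?:\\.|[^\\()])*)\)', data)
        if m:
            found[key.decode('ascii')] = m.group(1).decode('latin-1')
    return found
```

Running this over two "identical" copies quickly shows which metadata fields (such as /ModDate) differ besides /ID.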

Calculating hash without /ID

For our purposes I wrote a Python script that excludes this part of the file before calculating the hash, and it worked swimmingly.

# run using python pdf_hasher.py <pdf_file_path.pdf>
# add -v argument to print the portions of the file which
# are excluded in hash calculation. 

import hashlib
import argparse

def hash_file_exclude_id(file_path, verbose=False):
    """Hashes a file excluding the /ID entry in the PDF trailer and optionally prints excluded parts."""
    hasher = hashlib.sha256()
    with open(file_path, 'rb') as file:
        content = file.read()

    # latin-1 decodes every byte one-to-one, so string indices match byte offsets
    content_str = content.decode('latin-1')
    trailer_start_index = max(content_str.rfind('trailer'), 0)
    id_start_index = content_str.find('/ID', trailer_start_index)
    id_end_index = content_str.find(']', id_start_index)

    if id_start_index != -1 and id_end_index != -1:
        eof_index = id_end_index + 1
        content_before_id = content[:id_start_index]
        content_after_id = content[eof_index:]

        if verbose:
            excluded_content = content[id_start_index:eof_index]
            print(f"Excluded from hash in '{file_path}':")
            print(excluded_content.decode('latin-1', errors='ignore'))

        hasher.update(content_before_id)
        hasher.update(content_after_id)
    else:
        hasher.update(content)
        if verbose:
            print(f"No /ID found to exclude in '{file_path}'. Hashing entire file.")

    return hasher.hexdigest()

def main():
    parser = argparse.ArgumentParser(description="Hash a PDF file excluding its /ID entry.")
    parser.add_argument("file_path", help="Path to the PDF file to be hashed.")
    parser.add_argument("-v", "--verbose", action="store_true", help="Print the parts of the file that were excluded from the hash.")
    args = parser.parse_args()

    hash_value = hash_file_exclude_id(args.file_path, verbose=args.verbose)
    print(f"{hash_value}  {args.file_path}")

if __name__ == "__main__":
    main()
This code is an example of how a PDF hash can be calculated while ignoring the '/ID' identifier in the file's trailer. Since the '/ID' entry is rewritten each time the file is saved, even when the contents are untouched, files with identical contents can generate different hashes. The script reads a given PDF file in binary mode and excludes the '/ID' entry before calculating the hash.
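For the actual deduplication step, the hashing function can be used to group files by their ID-excluded digest. This is a sketch with my own names (`find_duplicates`, `hash_fn`); in practice you would pass the script's `hash_file_exclude_id` as `hash_fn`:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(directory, hash_fn):
    """Group PDFs in a directory by hash_fn(path); return groups with more than one file.
    Illustrative helper: hash_fn is any callable mapping a path to a digest string."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith('.pdf'):
            path = os.path.join(directory, name)
            groups[hash_fn(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each returned group is a set of paths that are byte-identical outside the excluded region, i.e. candidates for deletion.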

Conclusion

It is surprisingly easy to accidentally modify a PDF, since almost any file operation rewrites the trailer. In the Preview app on a Mac, export, duplicate, and print each produce a different file even when no content is modified. And that is how I learned that sometimes a plain file hash is inadequate for a deduplication task.