[Scummvm-git-logs] scummvm-sites integrity -> 10a017e1decd951d170037e35ac6ae726c3f39c7
sev-
noreply at scummvm.org
Wed Jul 2 21:43:46 UTC 2025
This automated email contains information about 38 new commits which have been
pushed to the 'scummvm-sites' repo located at https://api.github.com/repos/scummvm/scummvm-sites .
Summary:
f0907b8158 INTEGRITY: Add file filtering and update checksum calculation logic for 6 mac file variants
4b03ab9445 INTEGRITY: Add support for 7th mac variant - actual resource fork
6770ee4f45 INTEGRITY: Additional error handling for raw rsrc and actual mac fork
9bec3ae8f5 INTEGRITY: Remove punycode encoding from compute_hash.py
0253dcdbab INTEGRITY: Update requirements.txt
31cc77e765 INTEGRITY: Fix punycode
2890822c72 INTEGRITY: Add tests for macbinary
fd7fbe9a75 INTEGRITY: Extend file table with two more size fields
998539035d INTEGRITY: Changes in web app
eebd97c6d5 INTEGRITY: Initial seeding and set.dat fixes.
b56087660f INTEGRITY: Identify size type from checksum prefix
0a9c916f0f INTEGRITY: Fix fileset deletion logic when matching set filesets with detection.
61efc527f3 INTEGRITY: Correct fileset redirection link in log in case of matching
77494265ee INTEGRITY: Add all md5 variants for files with size less than 5000B.
0e8f5bb45f INTEGRITY: Add pre-commit hook for ruff.
d84d2c502d INTEGRITY: Add warning log and skip duplicate detection entry.
0c9d84560f INTEGRITY: Drop support for generating m prefix checksums for macfiles
3f248a44ce INTEGRITY: Remove sha1 and crc checksum addition to database
109f206728 INTEGRITY: Parse 'f' prefix checksum as normal front bytes checksum
338f379667 INTEGRITY: Rewrite set.dat processing logic.
65b64934f2 INTEGRITY: Handle multiple fileset references in log entries
809be572a6 INTEGRITY: Drop set filesets if no candidates for matching - e.g mac files
52f6e7da92 INTEGRITY: Add re-updation logic for set.dat
22e5e5c853 INTEGRITY: Create extra indices in db for faster query
63c6b57769 INTEGRITY: Add additional log details for number of filesets in different categories
97c6dbb438 INTEGRITY: Filter all candidates in descending order of matches instead of only the max one
94aef02edd INTEGRITY: Remove set filesets with many to one mapping with a single fileset
6b41b1b708 INTEGRITY: Stop processing for the fileset if it already exists - checked by key
91bb26cf74 Skip certain logs while processing set.dat with skiplog flag
d923ae6a10 INTEGRITY: Replace detection duplicate check by filename, size and checksum instead of megakey
ca894b7af4 INTEGRITY: Improve filtering and navigation in fileset search
394c098b7a INTEGRITY: Remove punycode encoding while loading to database, further convert \ to / in filepaths for filesystem independence
d335d91a55 INTEGRITY: Iteratively look for extra files if romof or cloneof field is present in the set.dat metadata. Filtering upda
22d913150f INTEGRITY: Add possible merge button inside filesets dashboard
365b4f210a INTEGRITY: Filter files by filename instead of entire path, as detections do not necessarily store the entire filepath.
4a3626afe5 INTEGRITY: Avoid adding duplicate files from detection entries
961e678cdc INTEGRITY: Avoid detection file overriding on a match, when a similar file exists in a different directory. Add it as a n
10a017e1de INTEGRITY: Create copy of the game data during lookup map creation to avoid issues due to mutability of python dictionaries
Commit: f0907b8158407631bdf1df522121bd654320253d
https://github.com/scummvm/scummvm-sites/commit/f0907b8158407631bdf1df522121bd654320253d
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add file filtering and update checksum calculation logic for 6 mac file variants
* Checksum calculations for the data fork and the data section of the resource fork have been added for 6 macfile variants: macbinary with .bin, macbinary without .bin, appledouble with .rsrc, appledouble with ._, appledouble with __MACOSX, and raw rsrc (a sketch of the MacBinary case follows below).
* File filtering has been added to make sure only a single file entry is added for macfiles. For example, an appledouble pair consists of <._appledouble_file> (contains the resource fork) and <appledouble_file> (contains the data fork); the second file, <appledouble_file>, is removed from the checksum calculation dictionary, since all the checksums can be calculated from <._appledouble_file>'s filepath.
* All files are marked with one of the following categories: NON_MAC, MAC_BINARY (both .bin and without .bin are handled together), APPLE_DOUBLE_RSRC, APPLE_DOUBLE_MACOSX, APPLE_DOUBLE_DOT_, and RAW_RSRC.
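For reference, a minimal sketch of how the resource fork's data section is located inside a MacBinary file, following the same layout that macbin_get_resfork_data() uses in this commit. The standalone helper name macbinary_resfork_data() is illustrative only and not part of the patch.

import struct

def macbinary_resfork_data(path):
    """Return (data_section, data_length) of the resource fork in a MacBinary file."""
    with open(path, "rb") as f:
        buf = f.read()
    (datalen,) = struct.unpack(">I", buf[0x53:0x57])   # data fork length from the 128-byte header
    datalen_padded = ((datalen + 127) >> 7) << 7       # data fork is padded to a 128-byte boundary
    rsrc = 0x80 + datalen_padded                       # resource fork starts after header + padded data fork
    data_offset = int.from_bytes(buf[rsrc:rsrc + 4], "big")
    data_length = int.from_bytes(buf[rsrc + 8:rsrc + 12], "big")
    start = rsrc + data_offset
    return buf[start:start + data_length], data_length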
Changed paths:
.gitignore
compute_hash.py
diff --git a/.gitignore b/.gitignore
index 15f33bf..fa6c9e5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,4 +2,5 @@
mysql_config.json
__pycache__
.DS_Store
-.pre-commit-config.yaml
\ No newline at end of file
+.pre-commit-config.yaml
+dumps
diff --git a/compute_hash.py b/compute_hash.py
index e5e723f..d716d2e 100644
--- a/compute_hash.py
+++ b/compute_hash.py
@@ -3,6 +3,15 @@ import os
import argparse
import struct
import sys
+from enum import Enum
+
+class FileType(Enum):
+ NON_MAC = "non_mac"
+ MAC_BINARY = "macbinary"
+ APPLE_DOUBLE_RSRC = "apple_double_rsrc"
+ APPLE_DOUBLE_MACOSX = "apple_double_macosx"
+ APPLE_DOUBLE_DOT_ = "apple_double_dot_"
+ RAW_RSRC = "raw_rsrc"
script_version = "0.1"
@@ -49,12 +58,10 @@ def crc16xmodem(data, crc=0):
(crc >> 8) & 0xff) ^ byte]
return crc & 0xffff
-
def filesize(filepath):
""" Returns size of file """
return os.stat(filepath).st_size
-
def get_dirs_at_depth(directory, depth):
directory = directory.rstrip(os.path.sep)
assert os.path.isdir(directory)
@@ -65,7 +72,6 @@ def get_dirs_at_depth(directory, depth):
if depth == num_sep_this - num_sep:
yield root
-
def escape_string(s: str) -> str:
"""
Escape strings
@@ -85,7 +91,6 @@ def escape_string(s: str) -> str:
new_name += char
return new_name
-
def needs_punyencoding(orig: str) -> bool:
"""
A filename needs to be punyencoded when it:
@@ -99,7 +104,6 @@ def needs_punyencoding(orig: str) -> bool:
return True
return False
-
def punyencode(orig: str) -> str:
"""
Punyencode strings
@@ -118,7 +122,6 @@ def punyencode(orig: str) -> str:
return "xn--" + encoded
return orig
-
def punyencode_filepath(filepath):
filepath = filepath.rstrip("/")
path_components = filepath.split(os.path.sep)
@@ -127,18 +130,49 @@ def punyencode_filepath(filepath):
return os.path.join(*path_components)
-
def read_be_32(byte_stream):
""" Return unsigned integer of size_in_bits, assuming the data is big-endian """
(uint,) = struct.unpack(">I", byte_stream[:32//8])
return uint
-
def read_be_16(byte_stream):
""" Return unsigned integer of size_in_bits, assuming the data is big-endian """
(uint,) = struct.unpack(">H", byte_stream[:16//8])
return uint
+def is_raw_rsrc(filepath):
+ """ Returns boolean, checking whether the given .rsrc file is a raw .rsrc file and not appledouble."""
+ filename = os.path.basename(filepath)
+ if filename.endswith(".rsrc"):
+ with open(filepath, "rb") as f:
+ return not is_appledouble(f.read())
+ return False
+
+def is_appledouble_rsrc(filepath):
+ """ Returns boolean, checking whether the given .rsrc file is an appledouble or not."""
+ filename = os.path.basename(filepath)
+ if filename.endswith(".rsrc"):
+ with open(filepath, "rb") as f:
+ return is_appledouble(f.read())
+ return False
+
+def is_appledouble_in_dot_(filepath):
+ """ Returns boolean, checking whether the given ._ file is an appledouble or not. It also checks that the parent directory is not __MACOSX as that case is handled differently """
+ filename = os.path.basename(filepath)
+ parent_dir = os.path.basename(os.path.dirname(filepath))
+ if filename.startswith("._") and parent_dir != "__MACOSX":
+ with open(filepath, "rb") as f:
+ return is_appledouble(f.read())
+ return False
+
+def is_appledouble_in_macosx(filepath):
+ """ Returns boolean, checking whether the given ._ file in __MACOSX folder is an appledouble or not. """
+ filename = os.path.basename(filepath)
+ parent_dir = os.path.basename(os.path.dirname(filepath))
+ if filename.startswith("._") and parent_dir == "__MACOSX":
+ with open(filepath, "rb") as f:
+ return is_appledouble(f.read())
+ return False
def is_macbin(filepath):
with open(filepath, "rb") as file:
@@ -173,8 +207,9 @@ def is_macbin(filepath):
return True
+def macbin_get_resfork_data(file_byte_stream):
+ """ Returns the byte stream of the data section of the resource fork of a macbinary file as well as its size """
-def macbin_get_resfork(file_byte_stream):
if not file_byte_stream:
return file_byte_stream
@@ -182,8 +217,13 @@ def macbin_get_resfork(file_byte_stream):
datalen_padded = ((datalen + 127) >> 7) << 7
(rsrclen,) = struct.unpack(">I", file_byte_stream[0x57:0x5B])
- return file_byte_stream[0x80 + datalen_padded: 0x80 + datalen_padded + rsrclen]
+ resoure_fork_offset = 128 + datalen_padded
+ data_offset = int.from_bytes(file_byte_stream[resoure_fork_offset+0 : resoure_fork_offset+4])
+ # map_offset = int.from_bytes(file_byte_stream[resoure_fork_offset+4 : resoure_fork_offset+8])
+ data_length = int.from_bytes(file_byte_stream[resoure_fork_offset+8 : resoure_fork_offset+12])
+ # map_length = int.from_bytes(file_byte_stream[resoure_fork_offset+12 : resoure_fork_offset+16])
+ return (file_byte_stream[resoure_fork_offset + data_offset: resoure_fork_offset + data_offset + data_length], data_length)
def macbin_get_datafork(file_byte_stream):
if not file_byte_stream:
@@ -192,29 +232,45 @@ def macbin_get_datafork(file_byte_stream):
(datalen,) = struct.unpack(">I", file_byte_stream[0x53:0x57])
return file_byte_stream[0x80: 0x80 + datalen]
-
def is_appledouble(file_byte_stream):
- # Check AppleDouble magic number
+ """
+ Appledouble Structure -
+
+ Header:
+ +$00 / 4: signature (0x00 0x05 0x16 0x00)
+ +$04 / 4: version (0x00 0x01 0x00 0x00 (v1) -or- 0x00 0x02 0x00 0x00 (v2))
+ +$08 /16: home file system string (v1) -or- zeroes (v2)
+ +$18 / 2: number of entries
+
+ Entries:
+ +$00 / 4: entry ID (1-15)
+ +$04 / 4: offset to data from start of file
+ +$08 / 4: length of entry in bytes; may be zero
+ """
if (not file_byte_stream or read_be_32(file_byte_stream) != 0x00051607):
return False
return True
-
-def appledouble_get_resfork(file_byte_stream):
- entry_count = read_be_32(file_byte_stream[24:])
- for _ in range(entry_count):
- id = read_be_32(file_byte_stream[28:])
- offset = read_be_32(file_byte_stream[32:])
- length = read_be_32(file_byte_stream[36:])
+def appledouble_get_resfork_data(file_byte_stream):
+ """ Returns the byte stream of the data section of the resource fork of an appledouble file as well as its size """
+
+ entry_count = read_be_16(file_byte_stream[24:])
+ for entry in range(entry_count):
+ start_index = 26 + entry*12
+ id = read_be_32(file_byte_stream[start_index:])
+ offset = read_be_32(file_byte_stream[start_index+4:])
+ length = read_be_32(file_byte_stream[start_index+8:])
if id == 2:
- return file_byte_stream[offset:offset+length]
-
- return b''
+ resource_fork_stream = file_byte_stream[offset:offset+length]
+ data_offset = int.from_bytes(resource_fork_stream[0:4])
+ data_length = int.from_bytes(resource_fork_stream[8:12])
+ return (resource_fork_stream[data_offset: data_offset+data_length], data_length)
-def appledouble_get_datafork(filepath):
+def appledouble_get_datafork(filepath, fileinfo):
+ """ Returns data fork byte stream of appledouble file if found, otherwise empty byte string """
try:
index = filepath.index("__MACOSX")
except ValueError:
@@ -223,48 +279,56 @@ def appledouble_get_datafork(filepath):
if index is not None:
# Remove '__MACOSX/' from filepath
filepath = filepath[:index] + filepath[index+8+1:]
+ parent_filepath = os.path.dirname(filepath)
+ data_fork_path = os.path.join(parent_filepath, fileinfo[1])
- # Remove '._' from filepath
- filename = os.path.basename(filepath)
- filepath = filepath[:-len(filename)] + (filename[2:]
- if filename.startswith('._') else filename)
-
- return filepath
- # return open(filepath, "rb")
-
- return b''
-
-
-def rsrc_get_datafork(filepath):
- # Data fork is the same filename without the .rsrc extension
- return open(filepath[:-5], "rb")
-
-
-def file_checksum(filepath, alg, size):
- filename = os.path.basename(filepath)
-
- # If it is Apple file with 2 forks
- if (filepath.endswith('.rsrc') or is_macbin(filepath) or
- filename.startswith('._') or filename.startswith('__MACOS')):
- res = []
- resfork = b''
- datafork = b''
-
- with open(filepath, "rb") as f:
- if filepath.endswith('.rsrc'):
- resfork = f.read()
- datafork = rsrc_get_datafork(filepath)
-
- if is_appledouble(f.read()):
+ try:
+ with open(data_fork_path, "rb") as f:
+ return f.read()
+ except (FileNotFoundError, IsADirectoryError):
+ return b''
+
+def raw_rsrc_get_datafork(filepath):
+ """ Returns data fork byte stream corresponding to raw rsrc file. """
+ with open(filepath[:-5]+".data", "rb") as f:
+ return f.read()
+
+def raw_rsrc_get_resource_fork_data(filepath):
+ """ Returns the byte stream of the data section of the resource fork of a raw rsrc file as well as its size """
+ with open(filepath, "rb") as f:
+ resource_fork_stream = f.read()
+ data_offset = int.from_bytes(resource_fork_stream[0:4])
+ data_length = int.from_bytes(resource_fork_stream[8:12])
+
+ return (resource_fork_stream[data_offset: data_offset+data_length], data_length)
+
+def file_checksum(filepath, alg, size, file_info):
+ size = 0
+ with open(filepath, "rb") as f:
+ if file_info[0] == FileType.NON_MAC:
+ return (create_checksum_pairs(checksum(f, alg, size, filepath), alg, size), filesize(filepath))
+
+ # Processing mac files
+ else:
+ res = []
+ resfork = b''
+ datafork = b''
+ if file_info[0] == FileType.MAC_BINARY:
+ f.seek(0)
+ (resfork, size) = macbin_get_resfork_data(f.read())
+ f.seek(0)
+ datafork = macbin_get_datafork(f.read())
+ elif file_info[0] == FileType.APPLE_DOUBLE_DOT_ or file_info[0] == FileType.APPLE_DOUBLE_RSRC or file_info[0] == FileType.APPLE_DOUBLE_MACOSX:
+ f.seek(0)
+ (resfork, size) = appledouble_get_resfork_data(f.read())
f.seek(0)
- resfork = appledouble_get_resfork(f.read())
- datafork = appledouble_get_datafork(filepath)
+ datafork = appledouble_get_datafork(filepath, file_info)
- if is_macbin(filepath):
+ elif file_info[0] == FileType.RAW_RSRC:
f.seek(0)
- resfork = macbin_get_resfork(f.read())
+ (resfork, size) = raw_rsrc_get_resource_fork_data(filepath)
f.seek(0)
- datafork = macbin_get_datafork(f.read())
+ datafork = raw_rsrc_get_datafork(filepath)
combined_forks = datafork + resfork
@@ -281,12 +345,7 @@ def file_checksum(filepath, alg, size):
prefix = 'm'
res.extend(create_checksum_pairs(hashes, alg, size, prefix))
- return res
-
- # If it is a normal file
- with open(filepath, "rb") as file:
- return create_checksum_pairs(checksum(file, alg, size, filepath), alg, size)
-
+ return (res, size)
def create_checksum_pairs(hashes, alg, size, prefix=None):
res = []
@@ -312,7 +371,6 @@ def create_checksum_pairs(hashes, alg, size, prefix=None):
return res
-
def checksum(file, alg, size, filepath):
""" Returns checksum value of file buffer using a specific algoritm """
# Will contain 5 elements:
@@ -365,7 +423,7 @@ def checksum(file, alg, size, filepath):
hashes[0].update(bytes_stream)
hashes[1].update(bytes_stream[:5000])
hashes[2].update(bytes_stream[:1024 * 1024])
- if filesize(filepath) >= 5000:
+ if len(bytes_stream) >= 5000:
hashes[3].update(bytes_stream[-5000:])
else:
hashes[3] = hashes[0]
@@ -379,6 +437,94 @@ def checksum(file, alg, size, filepath):
hashes = [h.hexdigest() for h in hashes if h]
return hashes
+def extract_macbin_filename_from_header(file):
+ """ Extracts the filename from the header of the macbinary. """
+ with open(file, "rb") as f:
+ header = f.read(128)
+ name_len = header[1]
+ filename_bytes = header[2:2+name_len]
+ return filename_bytes.decode("utf-8")
+
+def file_classification(filepath):
+ """ Returns [ Filetype, Filename ]. Filetype is an enum value - NON_MAC, MAC_BINARY, APPLE_DOUBLE, MAC_RSRC
+ Filename for a normal file is the same as the original. Extensions are dropped for macfiles. """
+
+ # 1. Macbinary
+ if is_macbin(filepath):
+ return [FileType.MAC_BINARY, extract_macbin_filename_from_header(filepath)]
+
+ # 2. Appledouble .rsrc
+ if is_appledouble_rsrc(filepath):
+ base_name, _ = os.path.splitext(os.path.basename(filepath))
+ return [FileType.APPLE_DOUBLE_RSRC, base_name]
+
+ # 3. Raw .rsrc
+ if is_raw_rsrc(filepath):
+ base_name, _ = os.path.splitext(os.path.basename(filepath))
+ return [FileType.RAW_RSRC, base_name]
+
+ # 4. Appledouble in ._
+ if is_appledouble_in_dot_(filepath):
+ filename = os.path.basename(filepath)
+ actual_filename = filename[2:]
+ return [FileType.APPLE_DOUBLE_DOT_, actual_filename]
+
+ # 5. Appledouble in __MACOSX folder
+ if is_appledouble_in_macosx(filepath):
+ filename = os.path.basename(filepath)
+ actual_filename = filename[2:]
+ return [FileType.APPLE_DOUBLE_MACOSX, actual_filename]
+
+ # Normal file
+ else:
+ return [FileType.NON_MAC, os.path.basename(filepath)]
+
+def file_filter(files):
+ """ Removes extra macfiles from the given dictionary of files that are not needed for fork calculation.
+ This avoids extra checksum calculation of these mac files in form of non-mac files """
+
+ to_be_deleted = []
+
+ for filepath, file_info in files.items():
+ # For filename.rsrc (apple double rsrc), corresponding filename file (data fork) will be removed from the files dictionary
+ if (file_info[0] == FileType.APPLE_DOUBLE_RSRC):
+ parent_dir_path = os.path.dirname(filepath)
+ expected_data_fork_path = os.path.join(parent_dir_path, file_info[1])
+ if (expected_data_fork_path in files):
+ to_be_deleted.append(expected_data_fork_path)
+ # else:
+ # print(f"Resource-fork-only appledouble file: {filepath}")
+
+ # For ._filename, corresponding filename file (data fork) will be removed from the files dictionary
+ elif (file_info[0] == FileType.APPLE_DOUBLE_DOT_):
+ parent_dir_path = os.path.dirname(filepath)
+ expected_data_fork_path = os.path.join(parent_dir_path, file_info[1])
+ if (expected_data_fork_path in files):
+ to_be_deleted.append(expected_data_fork_path)
+ # else:
+ # print(f"Resource-fork-only appledouble file: {filepath}")
+
+ # For ._filename, corresponding ../filename file (data fork) will be removed from the files dictionary
+ elif (file_info[0] == FileType.APPLE_DOUBLE_MACOSX):
+ grand_parent_dir_path = os.path.dirname(os.path.dirname(filepath))
+ expected_data_fork_path = os.path.join(grand_parent_dir_path, file_info[1])
+ if (expected_data_fork_path in files):
+ to_be_deleted.append(expected_data_fork_path)
+ # else:
+ # print(f"Resource-fork-only appledouble file: {filepath}")
+
+ # For filename.rsrc (raw rsrc), corresponding filename.data file (data fork) and filename.finf file (finder info) will be removed from the files dictionary
+ elif (file_info[0] == FileType.RAW_RSRC):
+ parent_dir_path = os.path.dirname(filepath)
+ expected_data_fork_path = os.path.join(parent_dir_path, file_info[1]) + ".data"
+ expected_finf_path = os.path.join(parent_dir_path, file_info[1]) + ".finf"
+ if (expected_data_fork_path in files):
+ to_be_deleted.append(expected_data_fork_path)
+ if (expected_finf_path in files):
+ to_be_deleted.append(expected_finf_path)
+
+ for file in to_be_deleted:
+ del files[file]
def compute_hash_of_dirs(root_directory, depth, size=0, alg="md5"):
""" Return dictionary containing checksums of all files in directory """
@@ -387,19 +533,38 @@ def compute_hash_of_dirs(root_directory, depth, size=0, alg="md5"):
for directory in get_dirs_at_depth(root_directory, depth):
hash_of_dir = dict()
files = []
-
+ # Dictionary with key : path and value : [ Filetype, Filename ]
+ file_collection = dict()
# Getting only files of directory and subdirectories recursively
- for root, dirs, contents in os.walk(directory):
+ for root, _, contents in os.walk(directory):
files.extend([os.path.join(root, f) for f in contents])
- for file in files:
- hash_of_dir[os.path.relpath(file, directory)] = (file_checksum(
- file, alg, size), filesize(file))
+ # Produce filetype and filename(name to be used in game entry) for each file
+ for filepath in files:
+ file_collection[filepath] = file_classification(filepath)
- res.append(hash_of_dir)
+ # Remove extra entries of macfiles to avoid extra checksum calculation in form of non mac files
+ # Checksum for both the forks are calculated using a single file, so other files should be removed from the collection
+ # print(file_collection)
+ file_filter(file_collection)
+ # print(file_collection)
- return res
+ # Calculate checksum of files
+ for file_path, file_info in file_collection.items():
+ # relative_path is used for the name field in game entry
+ relative_path = os.path.relpath(file_path, directory)
+ base_name = file_info[1]
+ relative_dir = os.path.dirname(relative_path)
+ relative_path = os.path.join(relative_dir, base_name)
+
+ if (file_info[0] == FileType.APPLE_DOUBLE_MACOSX):
+ relative_dir = os.path.dirname(os.path.dirname(relative_path))
+ relative_path = os.path.join(relative_dir, base_name)
+ hash_of_dir[relative_path] = file_checksum(file_path, alg, size, file_info)
+
+ res.append(hash_of_dir)
+ return res
def create_dat_file(hash_of_dirs, path, checksum_size=0):
with open(f"{os.path.basename(path)}.dat", "w") as file:
@@ -415,8 +580,7 @@ def create_dat_file(hash_of_dirs, path, checksum_size=0):
for hash_of_dir in hash_of_dirs:
file.write("game (\n")
for filename, (hashes, filesize) in hash_of_dir.items():
- filename = (punyencode_filepath(filename)
- if needs_punyencoding(filename) else filename)
+ filename = (punyencode_filepath(filename) if needs_punyencoding(filename) else filename)
data = f"name \"{filename}\" size {filesize}"
for key, value in hashes:
data += f" {key} {value}"
Commit: 4b03ab944576b87643e4aa6aebb9ad33c8c87a26
https://github.com/scummvm/scummvm-sites/commit/4b03ab944576b87643e4aa6aebb9ad33c8c87a26
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add support for 7th mac variant - actual resource fork
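The 7th variant reads the resource fork straight from the filesystem: on macOS (HFS+/APFS), a file's resource fork is exposed at the special path <file>/..namedfork/rsrc, which is the path the new helpers in this commit open. A minimal sketch of the idea (the function name read_resource_fork() is illustrative):

import os

def read_resource_fork(filepath):
    """Return the raw resource fork bytes of a file on macOS, or b'' if it has none."""
    rsrc_path = os.path.join(filepath, "..namedfork", "rsrc")
    if not os.path.exists(rsrc_path):
        return b""
    with open(rsrc_path, "rb") as f:
        return f.read()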
Changed paths:
compute_hash.py
diff --git a/compute_hash.py b/compute_hash.py
index d716d2e..848a45a 100644
--- a/compute_hash.py
+++ b/compute_hash.py
@@ -12,6 +12,7 @@ class FileType(Enum):
APPLE_DOUBLE_MACOSX = "apple_double_macosx"
APPLE_DOUBLE_DOT_ = "apple_double_dot_"
RAW_RSRC = "raw_rsrc"
+ ACTUAL_FORK_MAC = "actual_fork_mac"
script_version = "0.1"
@@ -141,7 +142,7 @@ def read_be_16(byte_stream):
return uint
def is_raw_rsrc(filepath):
- """ Returns boolean, checking whether the given .rsrc file is a raw .rsrc file and not appledouble."""
+ """ Returns boolean, checking if the given .rsrc file is a raw .rsrc file and not appledouble."""
filename = os.path.basename(filepath)
if filename.endswith(".rsrc"):
with open(filepath, "rb") as f:
@@ -207,8 +208,35 @@ def is_macbin(filepath):
return True
+def is_actual_resource_fork_mac(filepath):
+ """ Returns boolean, checking the actual mac fork if it exists. """
+
+ resource_fork_path = os.path.join(filepath, "..namedfork", "rsrc")
+ print(resource_fork_path)
+ return os.path.exists(resource_fork_path)
+
+def is_appledouble(file_byte_stream):
+ """
+ Appledouble Structure -
+
+ Header:
+ +$00 / 4: signature (0x00 0x05 0x16 0x00)
+ +$04 / 4: version (0x00 0x01 0x00 0x00 (v1) -or- 0x00 0x02 0x00 0x00 (v2))
+ +$08 /16: home file system string (v1) -or- zeroes (v2)
+ +$18 / 2: number of entries
+
+ Entries:
+ +$00 / 4: entry ID (1-15)
+ +$04 / 4: offset to data from start of file
+ +$08 / 4: length of entry in bytes; may be zero
+ """
+ if (not file_byte_stream or read_be_32(file_byte_stream) != 0x00051607):
+ return False
+
+ return True
+
def macbin_get_resfork_data(file_byte_stream):
- """ Returns the byte stream of the data section of the resource fork of a macbinary file as well as its size """
+ """ Returns the resource fork's data section as bytes of a macbinary file as well as its size """
if not file_byte_stream:
return file_byte_stream
@@ -219,9 +247,7 @@ def macbin_get_resfork_data(file_byte_stream):
resoure_fork_offset = 128 + datalen_padded
data_offset = int.from_bytes(file_byte_stream[resoure_fork_offset+0 : resoure_fork_offset+4])
- # map_offset = int.from_bytes(file_byte_stream[resoure_fork_offset+4 : resoure_fork_offset+8])
data_length = int.from_bytes(file_byte_stream[resoure_fork_offset+8 : resoure_fork_offset+12])
- # map_length = int.from_bytes(file_byte_stream[resoure_fork_offset+12 : resoure_fork_offset+16])
return (file_byte_stream[resoure_fork_offset + data_offset: resoure_fork_offset + data_offset + data_length], data_length)
@@ -253,7 +279,7 @@ def is_appledouble(file_byte_stream):
return True
def appledouble_get_resfork_data(file_byte_stream):
- """ Returns the byte stream of the data section of the resource fork of an appledouble file as well as its size """
+ """ Returns the resource fork's data section as bytes of an appledouble file as well as its size """
entry_count = read_be_16(file_byte_stream[24:])
for entry in range(entry_count):
@@ -270,7 +296,7 @@ def appledouble_get_resfork_data(file_byte_stream):
return (resource_fork_stream[data_offset: data_offset+data_length], data_length)
def appledouble_get_datafork(filepath, fileinfo):
- """ Returns data fork byte stream of appledouble file if found, otherwise empty byte string """
+ """ Returns data fork's content as bytes of appledouble file if found, otherwise empty byte string """
try:
index = filepath.index("__MACOSX")
except ValueError:
@@ -289,13 +315,28 @@ def appledouble_get_datafork(filepath, fileinfo):
return b''
def raw_rsrc_get_datafork(filepath):
- """ Returns data fork byte stream corresponding to raw rsrc file. """
+ """ Returns the data fork's content as bytes corresponding to raw rsrc file. """
with open(filepath[:-5]+".data", "rb") as f:
return f.read()
def raw_rsrc_get_resource_fork_data(filepath):
- """ Returns the byte stream of the data section of the resource fork of a raw rsrc file as well as its size """
+ """ Returns the resource fork's data section as bytes of a raw rsrc file as well as its size """
+ with open(filepath, "rb") as f:
+ resource_fork_stream = f.read()
+ data_offset = int.from_bytes(resource_fork_stream[0:4])
+ data_length = int.from_bytes(resource_fork_stream[8:12])
+
+ return (resource_fork_stream[data_offset: data_offset+data_length], data_length)
+
+def actual_mac_fork_get_data_fork(filepath):
+ """ Returns the data fork's content as bytes if the actual mac fork exists """
with open(filepath, "rb") as f:
+ return f.read()
+
+def actual_mac_fork_get_resource_fork_data(filepath):
+ """ Returns the resource fork's data section as bytes of the actual mac fork as well as its size """
+ resource_fork_path = os.path.join(filepath, "..namedfork", "rsrc")
+ with open(resource_fork_path, "rb") as f:
resource_fork_stream = f.read()
data_offset = int.from_bytes(resource_fork_stream[0:4])
data_length = int.from_bytes(resource_fork_stream[8:12])
@@ -303,49 +344,46 @@ def raw_rsrc_get_resource_fork_data(filepath):
return (resource_fork_stream[data_offset: data_offset+data_length], data_length)
def file_checksum(filepath, alg, size, file_info):
- size = 0
+ cur_file_size = 0
with open(filepath, "rb") as f:
if file_info[0] == FileType.NON_MAC:
return (create_checksum_pairs(checksum(f, alg, size, filepath), alg, size), filesize(filepath))
# Processing mac files
- else:
- res = []
- resfork = b''
- datafork = b''
- if file_info[0] == FileType.MAC_BINARY:
- f.seek(0)
- (resfork, size) = macbin_get_resfork_data(f.read())
- f.seek(0)
- datafork = macbin_get_datafork(f.read())
- elif file_info[0] == FileType.APPLE_DOUBLE_DOT_ or file_info[0] == FileType.APPLE_DOUBLE_RSRC or file_info[0] == FileType.APPLE_DOUBLE_MACOSX:
- f.seek(0)
- (resfork, size) = appledouble_get_resfork_data(f.read())
- f.seek(0)
- datafork = appledouble_get_datafork(filepath, file_info)
-
- elif file_info[0] == FileType.RAW_RSRC:
- f.seek(0)
- (resfork, size) = raw_rsrc_get_resource_fork_data(filepath)
- f.seek(0)
- datafork = raw_rsrc_get_datafork(filepath)
-
- combined_forks = datafork + resfork
-
- hashes = checksum(resfork, alg, size, filepath)
- prefix = 'r'
- if len(resfork):
- res.extend(create_checksum_pairs(hashes, alg, size, prefix))
-
- hashes = checksum(datafork, alg, size, filepath)
- prefix = 'd'
+ res = []
+ resfork = b''
+ datafork = b''
+ file_data = f.read()
+
+ if file_info[0] == FileType.MAC_BINARY:
+ (resfork, cur_file_size) = macbin_get_resfork_data(file_data)
+ datafork = macbin_get_datafork(file_data)
+ elif file_info[0] in {FileType.APPLE_DOUBLE_DOT_, FileType.APPLE_DOUBLE_RSRC, FileType.APPLE_DOUBLE_MACOSX}:
+ (resfork, cur_file_size) = appledouble_get_resfork_data(file_data)
+ datafork = appledouble_get_datafork(filepath, file_info)
+ elif file_info[0] == FileType.RAW_RSRC:
+ (resfork, cur_file_size) = raw_rsrc_get_resource_fork_data(filepath)
+ datafork = raw_rsrc_get_datafork(filepath)
+ elif file_info[0] == FileType.ACTUAL_FORK_MAC:
+ (resfork, cur_file_size) = actual_mac_fork_get_resource_fork_data(filepath)
+ datafork = actual_mac_fork_get_data_fork(filepath)
+
+ combined_forks = datafork + resfork
+
+ hashes = checksum(resfork, alg, size, filepath)
+ prefix = 'r'
+ if len(resfork):
res.extend(create_checksum_pairs(hashes, alg, size, prefix))
- hashes = checksum(combined_forks, alg, size, filepath)
- prefix = 'm'
- res.extend(create_checksum_pairs(hashes, alg, size, prefix))
+ hashes = checksum(datafork, alg, size, filepath)
+ prefix = 'd'
+ res.extend(create_checksum_pairs(hashes, alg, size, prefix))
+
+ hashes = checksum(combined_forks, alg, size, filepath)
+ prefix = 'm'
+ res.extend(create_checksum_pairs(hashes, alg, size, prefix))
- return (res, size)
+ return (res, cur_file_size)
def create_checksum_pairs(hashes, alg, size, prefix=None):
res = []
@@ -446,7 +484,7 @@ def extract_macbin_filename_from_header(file):
return filename_bytes.decode("utf-8")
def file_classification(filepath):
- """ Returns [ Filetype, Filename ]. Filetype is an enum value - NON_MAC, MAC_BINARY, APPLE_DOUBLE, MAC_RSRC
+ """ Returns [ Filetype, Filename ]. Filetype is an enum value - NON_MAC, MAC_BINARY, APPLE_DOUBLE_RSRC, APPLE_DOUBLE_MACOSX, APPLE_DOUBLE_DOT_, RAW_RSRC
Filename for a normal file is the same as the original. Extensions are dropped for macfiles. """
# 1. Macbinary
@@ -475,6 +513,11 @@ def file_classification(filepath):
actual_filename = filename[2:]
return [FileType.APPLE_DOUBLE_MACOSX, actual_filename]
+ # 6. Actual resource fork of mac
+ if is_actual_resource_fork_mac(filepath):
+ filename = os.path.basename(filepath)
+ return [FileType.ACTUAL_FORK_MAC, filename]
+
# Normal file
else:
return [FileType.NON_MAC, os.path.basename(filepath)]
@@ -492,8 +535,6 @@ def file_filter(files):
expected_data_fork_path = os.path.join(parent_dir_path, file_info[1])
if (expected_data_fork_path in files):
to_be_deleted.append(expected_data_fork_path)
- # else:
- # print(f"Resource-fork-only appledouble file: {filepath}")
# For ._filename, corresponding filename file (data fork) will be removed from the files dictionary
elif (file_info[0] == FileType.APPLE_DOUBLE_DOT_):
@@ -501,8 +542,6 @@ def file_filter(files):
expected_data_fork_path = os.path.join(parent_dir_path, file_info[1])
if (expected_data_fork_path in files):
to_be_deleted.append(expected_data_fork_path)
- # else:
- # print(f"Resource-fork-only appledouble file: {filepath}")
# For ._filename, corresponding ../filename file (data fork) will be removed from the files dictionary
elif (file_info[0] == FileType.APPLE_DOUBLE_MACOSX):
@@ -510,8 +549,6 @@ def file_filter(files):
expected_data_fork_path = os.path.join(grand_parent_dir_path, file_info[1])
if (expected_data_fork_path in files):
to_be_deleted.append(expected_data_fork_path)
- # else:
- # print(f"Resource-fork-only appledouble file: {filepath}")
# For filename.rsrc (raw rsrc), corresponding filename.data file (data fork) and filename.finf file (finder info) will be removed from the files dictionary
elif (file_info[0] == FileType.RAW_RSRC):
@@ -545,9 +582,7 @@ def compute_hash_of_dirs(root_directory, depth, size=0, alg="md5"):
# Remove extra entries of macfiles to avoid extra checksum calculation in form of non mac files
# Checksum for both the forks are calculated using a single file, so other files should be removed from the collection
- # print(file_collection)
file_filter(file_collection)
- # print(file_collection)
# Calculate checksum of files
for file_path, file_info in file_collection.items():
Commit: 6770ee4f4557007f1d1b4098c37c5790ed7a51e5
https://github.com/scummvm/scummvm-sites/commit/6770ee4f4557007f1d1b4098c37c5790ed7a51e5
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Additional error handling for raw rsrc and actual mac fork
Changed paths:
compute_hash.py
diff --git a/compute_hash.py b/compute_hash.py
index 848a45a..0ed2b2d 100644
--- a/compute_hash.py
+++ b/compute_hash.py
@@ -316,8 +316,11 @@ def appledouble_get_datafork(filepath, fileinfo):
def raw_rsrc_get_datafork(filepath):
""" Returns the data fork's content as bytes corresponding to raw rsrc file. """
- with open(filepath[:-5]+".data", "rb") as f:
- return f.read()
+ try:
+ with open(filepath[:-5]+".data", "rb") as f:
+ return f.read()
+ except (FileNotFoundError, IsADirectoryError):
+ return b''
def raw_rsrc_get_resource_fork_data(filepath):
""" Returns the resource fork's data section as bytes of a raw rsrc file as well as its size """
@@ -327,11 +330,14 @@ def raw_rsrc_get_resource_fork_data(filepath):
data_length = int.from_bytes(resource_fork_stream[8:12])
return (resource_fork_stream[data_offset: data_offset+data_length], data_length)
-
+
def actual_mac_fork_get_data_fork(filepath):
""" Returns the data fork's content as bytes if the actual mac fork exists """
- with open(filepath, "rb") as f:
- return f.read()
+ try:
+ with open(filepath, "rb") as f:
+ return f.read()
+ except (FileNotFoundError, IsADirectoryError):
+ return b''
def actual_mac_fork_get_resource_fork_data(filepath):
""" Returns the resource fork's data section as bytes of the actual mac fork as well as its size """
Commit: 9bec3ae8f5e9e315658039b6fadc0468b066e356
https://github.com/scummvm/scummvm-sites/commit/9bec3ae8f5e9e315658039b6fadc0468b066e356
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Remove punycode encoding from compute_hash.py
Punycode encoding is already handled by the dat parser, i.e. during upload of dat files to the database.
Changed paths:
compute_hash.py
diff --git a/compute_hash.py b/compute_hash.py
index 0ed2b2d..f74f67c 100644
--- a/compute_hash.py
+++ b/compute_hash.py
@@ -73,64 +73,6 @@ def get_dirs_at_depth(directory, depth):
if depth == num_sep_this - num_sep:
yield root
-def escape_string(s: str) -> str:
- """
- Escape strings
-
- Escape the following:
- - escape char: \x81
- - unallowed filename chars: https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
- - control chars < 0x20
- """
- new_name = ""
- for char in s:
- if char == "\x81":
- new_name += "\x81\x79"
- elif char in '/":*|\\?%<>\x7f' or ord(char) < 0x20 or (ord(char) & 0x80):
- new_name += "\x81" + chr(0x80 + ord(char))
- else:
- new_name += char
- return new_name
-
-def needs_punyencoding(orig: str) -> bool:
- """
- A filename needs to be punyencoded when it:
-
- - contains a char that should be escaped or
- - ends with a dot or a space.
- """
- if orig != escape_string(orig):
- return True
- if orig[-1] in " .":
- return True
- return False
-
-def punyencode(orig: str) -> str:
- """
- Punyencode strings
-
- - escape special characters and
- - ensure filenames can't end in a space or dot
- """
- s = escape_string(orig)
- encoded = s.encode("punycode").decode("ascii")
- # punyencoding adds an '-' at the end when there are no special chars
- # don't use it for comparing
- compare = encoded
- if encoded.endswith("-"):
- compare = encoded[:-1]
- if orig != compare or compare[-1] in " .":
- return "xn--" + encoded
- return orig
-
-def punyencode_filepath(filepath):
- filepath = filepath.rstrip("/")
- path_components = filepath.split(os.path.sep)
- for i, component in enumerate(path_components):
- path_components[i] = punyencode(component)
-
- return os.path.join(*path_components)
-
def read_be_32(byte_stream):
""" Return unsigned integer of size_in_bits, assuming the data is big-endian """
(uint,) = struct.unpack(">I", byte_stream[:32//8])
@@ -621,7 +563,6 @@ def create_dat_file(hash_of_dirs, path, checksum_size=0):
for hash_of_dir in hash_of_dirs:
file.write("game (\n")
for filename, (hashes, filesize) in hash_of_dir.items():
- filename = (punyencode_filepath(filename) if needs_punyencoding(filename) else filename)
data = f"name \"{filename}\" size {filesize}"
for key, value in hashes:
data += f" {key} {value}"
Commit: 0253dcdbab0626d056f27e19d2aff344ded6378a
https://github.com/scummvm/scummvm-sites/commit/0253dcdbab0626d056f27e19d2aff344ded6378a
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Update requirements.txt
Changed paths:
requirements.txt
diff --git a/requirements.txt b/requirements.txt
index 41cebfb..8486da7 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,10 +1,18 @@
blinker==1.9.0
+cffi==1.17.1
click==8.1.8
+cryptography==45.0.3
Flask==3.1.0
+iniconfig==2.1.0
itsdangerous==2.2.0
Jinja2==3.1.5
MarkupSafe==3.0.2
+packaging==25.0
+pluggy==1.6.0
+pycparser==2.22
+Pygments==2.19.1
PyMySQL==1.1.1
+pytest==8.4.0
setuptools==75.8.0
Werkzeug==3.1.3
wheel==0.45.1
Commit: 31cc77e765eebdecad65099196da2bfbd701cad2
https://github.com/scummvm/scummvm-sites/commit/31cc77e765eebdecad65099196da2bfbd701cad2
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Fix punycode
* Match punycode with dumper companion
* Escape single quote in encoded punycode for sql query
* Add unit tests for punycode_need_encode and encode_punycode (see the encoding example below)
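As a quick illustration of what these tests exercise: after escaping reserved characters, encode_punycode() (like the punyencode() helper removed from compute_hash.py earlier in this batch) wraps Python's built-in RFC 3492 "punycode" codec and adds the "xn--" prefix only when the name actually changes. A small example using one entry from the test table; no escaping is involved for this name, so the codec output maps directly onto the expected value:

name = "Jönssonligan.exe"
encoded = name.encode("punycode").decode("ascii")   # 'Jnssonligan.exe-8sb'
print("xn--" + encoded)                             # xn--Jnssonligan.exe-8sb, as in tests/test_punycode.py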
Changed paths:
A tests/test_punycode.py
.gitignore
db_functions.py
diff --git a/.gitignore b/.gitignore
index fa6c9e5..a06a587 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,6 @@ __pycache__
.DS_Store
.pre-commit-config.yaml
dumps
+.pytest_cache
+mac_dats
+macresfork 2
\ No newline at end of file
diff --git a/db_functions.py b/db_functions.py
index 54606a2..93c0ce0 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -188,11 +188,9 @@ def insert_file(file, detection, src, conn):
if not detection:
checktype = "None"
detection = 0
- detection_type = (
- f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
- )
- if punycode_need_encode(file["name"]):
- query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{encode_punycode(file['name'])}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
+ detection_type = f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
+ if punycode_need_encode(file['name']):
+ query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(encode_punycode(file['name']))}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
else:
query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(file['name'])}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
with conn.cursor() as cursor:
@@ -238,12 +236,11 @@ def my_escape_string(s: str) -> str:
for char in s:
if char == "\x81":
new_name += "\x81\x79"
- elif char in '/":*|\\?%<>\x7f' or ord(char) < 0x20 or (ord(char) & 0x80):
+ elif char in SPECIAL_SYMBOLS or ord(char) < 0x20:
new_name += "\x81" + chr(0x80 + ord(char))
else:
new_name += char
- return escape_string(new_name)
-
+ return new_name
def encode_punycode(orig):
"""
@@ -995,12 +992,9 @@ def populate_file(fileset, fileset_id, conn, detection):
if not detection:
checktype = "None"
detection = 0
- detection_type = (
- f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
- )
- if punycode_need_encode(file["name"]):
- print(encode_punycode(file["name"]))
- query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{encode_punycode(file['name'])}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
+ detection_type = f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
+ if punycode_need_encode(file['name']):
+ query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(encode_punycode(file['name']))}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
else:
query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(file['name'])}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
cursor.execute(query)
diff --git a/tests/test_punycode.py b/tests/test_punycode.py
new file mode 100644
index 0000000..d940b0a
--- /dev/null
+++ b/tests/test_punycode.py
@@ -0,0 +1,54 @@
+import pytest
+
+from db_functions import punycode_need_encode, encode_punycode
+
+
+def test_needs_punyencoding():
+ checks = [
+ ["Icon\r", True],
+ ["ascii", False],
+ ["ends with dot .", True],
+ ["ends with space ", True],
+ ["ããããã¤(Power PC)", True],
+ ["Hello*", True],
+ ["File I/O", True],
+ ["HDã«ï½ºï¾ï¾ï½°ãã¦ä¸ãããG3", True],
+ ["Buried in Time™ Demo", True],
+ ["•Main Menu", True],
+ ["Spaceship Warlock™", True],
+ ["ã¯ããã¼ã¸ã£ãã¯ã®å¤§åéº<ãã¢>", True],
+ ["Jönssonligan går på djupet.exe", True],
+ ["Jönssonligan.exe", True],
+ ["G3ãã©ã«ã", True],
+ ["Big[test]", False],
+ ["Where \\ Do <you> Want / To: G* ? ;Unless=nowhere,or|\"(everything)/\":*|\\?%<>,;=", True],
+ ["Buried in Timeェ Demo", True]
+ ]
+ for input, expected in checks:
+ assert punycode_need_encode(input) == expected
+
+def test_punycode_encode():
+ checks = [
+ ["Icon\r", "xn--Icon-ja6e"],
+ ["ascii", "ascii"],
+ ["ends with dot .", "xn--ends with dot .-"],
+ ["ends with space ", "xn--ends with space -"],
+ ["ããããã¤(Power PC)", "xn--(Power PC)-jx4ilmwb1a7h"],
+ ["Hello*", "xn--Hello-la10a"],
+ ["File I/O", "xn--File IO-oa82b"],
+ ["HDã«ï½ºï¾ï¾ï½°ãã¦ä¸ãããG3", "xn--HDG3-rw3c5o2dpa9kzb2170dd4tzyda5j4k"],
+ ["Buried in Time™ Demo", "xn--Buried in Time Demo-eo0l"],
+ ["•Main Menu", "xn--Main Menu-zd0e"],
+ ["Spaceship Warlock™", "xn--Spaceship Warlock-306j"],
+ ["ã¯ããã¼ã¸ã£ãã¯ã®å¤§åéº<ãã¢>", "xn--baa0pja0512dela6bueub9gshf1k1a1rt742c060a2x4u"],
+ ["Jönssonligan går på djupet.exe", "xn--Jnssonligan gr p djupet.exe-glcd70c"],
+ ["Jönssonligan.exe", "xn--Jnssonligan.exe-8sb"],
+ ["G3ãã©ã«ã", "xn--G3-3g4axdtexf"],
+ ["Big[test]", "Big[test]"],
+ ["Where \\ Do <you> Want / To: G* ? ;Unless=nowhere,or|\"(everything)/\":*|\\?%<>,;=", "xn--Where Do you Want To G ;Unless=nowhere,or(everything),;=-5baedgdcbtamaaaaaaaaa99woa3wnnmb82aqb71ekb9g3c1f1cyb7bx6rfcv2pxa"],
+ ["Buried in Timeェ Demo", "xn--Buried in Time Demo-yp97h"]
+ ]
+ for input, expected in checks:
+ assert encode_punycode(input) == expected
+
+
Commit: 2890822c72ead0ad597e3f3a1cbf0111128017ec
https://github.com/scummvm/scummvm-sites/commit/2890822c72ead0ad597e3f3a1cbf0111128017ec
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add tests for macbinary
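The valid/invalid fixtures below revolve around the MacBinary II header checksum: bytes 124-125 of the 128-byte header hold a CRC-16/XMODEM of the first 124 header bytes, which is presumably what is_macbin() verifies (bad_checksum.bin stores a deliberately wrong value). A minimal bitwise sketch of that checksum; it should produce the same results as the table-driven crc16xmodem() in compute_hash.py and is written out here only for illustration:

def crc16_xmodem(data: bytes, crc: int = 0) -> int:
    """CRC-16/XMODEM: polynomial 0x1021, MSB-first, initial value 0."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

# For a valid header, the stored checksum matches the recomputed one:
# header[124:126] == crc16_xmodem(header[:124]).to_bytes(2, "big")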
Changed paths:
A tests/create/create_binary.py
A tests/data/invalid_mac_binary/bad_checksum.bin
A tests/data/invalid_mac_binary/forks_mismatch.bin
A tests/data/invalid_mac_binary/len_less_than_128_bytes.bin
A tests/data/invalid_mac_binary/name_length_too_large.bin
A tests/data/invalid_mac_binary/zero_len_fields.bin
A tests/data/valid_mac_binary/AFM Read Me!
A tests/data/valid_mac_binary/AFM Read Me!.bin
A tests/data/valid_mac_binary/valid_macbinary.bin
A tests/test_compute_hash.py
tests/test_punycode.py
diff --git a/tests/create/create_binary.py b/tests/create/create_binary.py
new file mode 100644
index 0000000..d3d9842
--- /dev/null
+++ b/tests/create/create_binary.py
@@ -0,0 +1,69 @@
+import struct
+import os
+
+def generate_macbinary_test_files():
+ output_dir = "../data/invalid_mac_binary/"
+
+ with open(os.path.join(output_dir, "len_less_than_128_bytes.bin"), "wb") as f:
+ f.write(b'\x12')
+
+
+ header = bytearray([1]*128)
+ # name length, data fork len, resource fork len, type/creator
+ header[1] = 0
+ header[83:87] = struct.pack(">I", 0)
+ header[87:91] = struct.pack(">I", 0)
+ header[69:73] = struct.pack(">I", 0)
+
+ with open(os.path.join(output_dir, "zero_len_fields.bin"), "wb") as f:
+ f.write(header)
+
+ header = bytearray([1]*128)
+ header[1] = 10
+ header[124:126] = struct.pack(">H", 0xFFFF)
+ with open(os.path.join(output_dir, "bad_checksum.bin"), "wb") as f:
+ f.write(header)
+
+
+ header[124:126] = struct.pack(">H", 4263)
+ header[1] = 100
+ with open(os.path.join(output_dir, "name_length_too_large.bin"), "wb") as f:
+ f.write(header)
+
+
+ header[1] = 10
+ header[83:87] = struct.pack(">I", 100)
+ header[87:91] = struct.pack(">I", 50)
+ with open(os.path.join(output_dir, "forks_mismatch.bin"), "wb") as f:
+ f.write(header)
+ f.write(b'\x00' * 50)
+
+
+ output_dir = "../data/valid_mac_binary/"
+
+ # Valid
+ header = bytearray(128)
+ header[0] = 0
+ header[1] = 5
+ header[2:7] = b'test\x00'
+ header[69:73] = b'1111'
+ header[73:77] = b'1111'
+ header[74] = 0
+ header[82] = 0
+ data_fork_len = 10
+ header[83:87] = struct.pack(">I", data_fork_len)
+ res_fork_len = 0
+ header[87:91] = struct.pack(">I", res_fork_len)
+ data_fork = b'0123456789'
+ data_fork_len_padded = (((data_fork_len + 127) >> 7) << 7)
+ data_fork_padded = data_fork + b'\x00' * (data_fork_len_padded - data_fork_len)
+ header[124:126] = struct.pack(">H", 27858)
+ file_size = 128 + data_fork_len_padded + res_fork_len
+
+
+ with open(os.path.join(output_dir, "valid_macbinary.bin"), "wb") as f:
+ f.write(header)
+ f.write(data_fork_padded)
+
+if __name__ == "__main__":
+ generate_macbinary_test_files()
\ No newline at end of file
diff --git a/tests/data/invalid_mac_binary/bad_checksum.bin b/tests/data/invalid_mac_binary/bad_checksum.bin
new file mode 100644
index 0000000..97ebd2b
--- /dev/null
+++ b/tests/data/invalid_mac_binary/bad_checksum.bin
@@ -0,0 +1,2 @@
+
+ÿÿ
\ No newline at end of file
diff --git a/tests/data/invalid_mac_binary/forks_mismatch.bin b/tests/data/invalid_mac_binary/forks_mismatch.bin
new file mode 100644
index 0000000..071ecb0
Binary files /dev/null and b/tests/data/invalid_mac_binary/forks_mismatch.bin differ
diff --git a/tests/data/invalid_mac_binary/len_less_than_128_bytes.bin b/tests/data/invalid_mac_binary/len_less_than_128_bytes.bin
new file mode 100644
index 0000000..a4ceb35
--- /dev/null
+++ b/tests/data/invalid_mac_binary/len_less_than_128_bytes.bin
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/tests/data/invalid_mac_binary/name_length_too_large.bin b/tests/data/invalid_mac_binary/name_length_too_large.bin
new file mode 100644
index 0000000..8c89634
--- /dev/null
+++ b/tests/data/invalid_mac_binary/name_length_too_large.bin
@@ -0,0 +1 @@
+d§
\ No newline at end of file
diff --git a/tests/data/invalid_mac_binary/zero_len_fields.bin b/tests/data/invalid_mac_binary/zero_len_fields.bin
new file mode 100644
index 0000000..c1ee79d
Binary files /dev/null and b/tests/data/invalid_mac_binary/zero_len_fields.bin differ
diff --git a/tests/data/valid_mac_binary/AFM Read Me! b/tests/data/valid_mac_binary/AFM Read Me!
new file mode 100644
index 0000000..d5c2d53
Binary files /dev/null and b/tests/data/valid_mac_binary/AFM Read Me! differ
diff --git a/tests/data/valid_mac_binary/AFM Read Me!.bin b/tests/data/valid_mac_binary/AFM Read Me!.bin
new file mode 100644
index 0000000..d5c2d53
Binary files /dev/null and b/tests/data/valid_mac_binary/AFM Read Me!.bin differ
diff --git a/tests/data/valid_mac_binary/valid_macbinary.bin b/tests/data/valid_mac_binary/valid_macbinary.bin
new file mode 100644
index 0000000..f2a099c
Binary files /dev/null and b/tests/data/valid_mac_binary/valid_macbinary.bin differ
diff --git a/tests/test_compute_hash.py b/tests/test_compute_hash.py
new file mode 100644
index 0000000..30ef94e
--- /dev/null
+++ b/tests/test_compute_hash.py
@@ -0,0 +1,17 @@
+import sys
+import os
+sys.path.insert(0, ".")
+
+from compute_hash import is_macbin
+
+def test_is_macbin():
+ invalid_mac_dir = "tests/data/invalid_mac_binary"
+ valid_mac_dir = "tests/data/valid_mac_binary"
+ checks = []
+ for file in os.listdir(valid_mac_dir):
+ checks.append([os.path.join(valid_mac_dir, file), True])
+ for file in os.listdir(invalid_mac_dir):
+ checks.append([os.path.join(invalid_mac_dir, file), False])
+
+ for input, expected in checks:
+ assert is_macbin(input) == expected
diff --git a/tests/test_punycode.py b/tests/test_punycode.py
index d940b0a..bff37b1 100644
--- a/tests/test_punycode.py
+++ b/tests/test_punycode.py
@@ -1,5 +1,3 @@
-import pytest
-
from db_functions import punycode_need_encode, encode_punycode
Commit: fd7fbe9a75ecddad3f8edb4757bb78a43a6db497
https://github.com/scummvm/scummvm-sites/commit/fd7fbe9a75ecddad3f8edb4757bb78a43a6db497
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Extend file table with two more size fields
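Because the new column names contain a hyphen, they must be backtick-quoted in every MySQL statement that touches them (as the ALTER TABLE statements in the diff below do). A minimal, hypothetical read helper showing the quoting; fetch_file_sizes() is not part of the repo, and the repo itself builds its queries as f-strings rather than with parameters:

def fetch_file_sizes(conn, fileset_id):
    """Fetch the base size and the two new resource-fork size columns for one fileset."""
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT name, size, `size-r`, `size-rd` FROM file WHERE fileset = %s",
            (fileset_id,),
        )
        return cursor.fetchall()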
Changed paths:
schema.py
diff --git a/schema.py b/schema.py
index 8244743..438bb12 100644
--- a/schema.py
+++ b/schema.py
@@ -175,13 +175,16 @@ except Exception:
cursor.execute("ALTER TABLE file MODIFY COLUMN punycode_name VARCHAR(200);")
try:
- cursor.execute(
- "ALTER TABLE file ADD COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';"
- )
-except Exception:
- cursor.execute(
- "ALTER TABLE file MODIFY COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';"
- )
+ cursor.execute("ALTER TABLE file ADD COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';")
+except:
+ cursor.execute("ALTER TABLE file MODIFY COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';")
+
+try:
+ cursor.execute("ALTER TABLE file ADD COLUMN `size-r` BIGINT DEFAULT 0, ADD COLUMN `size-rd` BIGINT DEFAULT 0;")
+except:
+ cursor.execute("ALTER TABLE file MODIFY COLUMN `size-r` BIGINT DEFAULT 0, MODIFY COLUMN `size-rd` BIGINT DEFAULT 0;")
+
+
for index, definition in indices.items():
try:
Commit: 998539035d1111b260fb310f290b1128a9f96f71
https://github.com/scummvm/scummvm-sites/commit/998539035d1111b260fb310f290b1128a9f96f71
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Changes in web app
* Add size-r and size-rd fields in files
* Add a Clear Database button for the development phase
Changed paths:
fileset.py
diff --git a/fileset.py b/fileset.py
index 9455624..32489cd 100644
--- a/fileset.py
+++ b/fileset.py
@@ -67,13 +67,42 @@ def index():
<ul>
<li><a href="{{ url_for('logs') }}">Logs</a></li>
</ul>
+ <form action="{{ url_for('clear_database') }}" method="POST">
+ <button style="margin:100px 0 0 0; background-color:red" type="submit"> Clear Database </button>
+ </form>
</body>
</html>
"""
return render_template_string(html)
+@app.route('/clear_database', methods=['POST'])
+def clear_database():
+ try:
+ conn = db_connect()
+ with conn.cursor() as cursor:
+ cursor.execute("SET FOREIGN_KEY_CHECKS = 0;")
+ cursor.execute("TRUNCATE TABLE filechecksum")
+ cursor.execute("TRUNCATE TABLE file")
+ cursor.execute("TRUNCATE TABLE fileset")
+ cursor.execute("TRUNCATE TABLE history")
+ cursor.execute("TRUNCATE TABLE game")
+ cursor.execute("TRUNCATE TABLE engine")
+ cursor.execute("TRUNCATE TABLE log")
+ cursor.execute("TRUNCATE TABLE queue")
+ cursor.execute("TRUNCATE TABLE transactions")
+ cursor.execute("SET FOREIGN_KEY_CHECKS = 1;")
+ conn.commit()
+ print("DATABASE CLEARED")
+ except Exception as e:
+ print(f"Error clearing database: {e}")
+ finally:
+ conn.close()
-@app.route("/fileset", methods=["GET", "POST"])
+ return redirect('/')
+
+
+
+@app.route('/fileset', methods=['GET', 'POST'])
def fileset():
id = request.args.get("id", default=1, type=int)
widetable = request.args.get("widetable", default="partial", type=str)
@@ -206,16 +235,10 @@ def fileset():
if "desc" in sort:
order += " DESC"
- columns_to_select = (
- "file.id, name, size, checksum, detection, detection_type, `timestamp`"
- )
+ columns_to_select = "file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp`"
columns_to_select += ", ".join(md5_columns)
- print(
- f"SELECT file.id, name, size, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}"
- )
- cursor.execute(
- f"SELECT file.id, name, size, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}"
- )
+ print(f"SELECT file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}")
+ cursor.execute(f"SELECT file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}")
result = cursor.fetchall()
all_columns = list(result[0].keys()) if result else []
@@ -1047,4 +1070,4 @@ def delete_files(id):
if __name__ == "__main__":
app.secret_key = secret_key
- app.run(debug=True, host="0.0.0.0")
+ app.run(port=5001,debug=True, host='0.0.0.0')
Commit: eebd97c6d51c96ad890d78cafdeebbdfe7400e5d
https://github.com/scummvm/scummvm-sites/commit/eebd97c6d51c96ad890d78cafdeebbdfe7400e5d
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Initial seeding and set.dat fixes.
* Sort files by name before megakey calculation (see the sketch below).
* Correctly merge matched set.dat filesets into the original detection fileset.
* Make the size field check conditional, as some dats may carry three sizes (size, size-r, size-rd).
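A condensed sketch of the megakey after this change, covering only the "rom" case (the real calc_megakey() in db_functions.py also handles "files" entries): platform and language are followed by every attribute of every file, with the files sorted by name first so that the order they appear in the dat does not affect the hash.

import hashlib

def calc_megakey(fileset):
    key_string = f":{fileset['platform']}:{fileset['language']}"
    for file in sorted(fileset.get("rom", []), key=lambda f: f["name"]):
        for value in file.values():
            key_string += ":" + str(value)
    return hashlib.md5(key_string.strip(":").encode()).hexdigest()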
Changed paths:
dat_parser.py
db_functions.py
diff --git a/dat_parser.py b/dat_parser.py
index eab7fae..e66886c 100644
--- a/dat_parser.py
+++ b/dat_parser.py
@@ -28,6 +28,12 @@ def map_checksum_data(content_string):
elif tokens[i] == "size":
current_rom["size"] = int(tokens[i + 1])
i += 2
+ elif tokens[i] == 'size-r':
+ current_rom['size-r'] = int(tokens[i + 1])
+ i += 2
+ elif tokens[i] == 'size-rd':
+ current_rom['size-rd'] = int(tokens[i + 1])
+ i += 2
else:
checksum_key = tokens[i]
checksum_value = tokens[i + 1] if len(tokens) >= 6 else "0"
@@ -64,7 +70,7 @@ def map_key_values(content_string, arr):
def match_outermost_brackets(input):
"""
- Parse DAT file and separate the contents each segment into an array
+ Parse DAT file and separate the contents each segment into an array.
Segments are of the form `scummvm ( )`, `game ( )` etc.
"""
matches = []
@@ -107,6 +113,7 @@ def parse_dat(dat_filepath):
resources = {}
matches = match_outermost_brackets(content)
+ # print(matches)
if matches:
for data_segment in matches:
if (
@@ -122,7 +129,7 @@ def parse_dat(dat_filepath):
temp = {}
temp = map_key_values(data_segment[0], temp)
resources[temp["name"]] = temp
- # print(header, game_data, resources)
+ # print(header, game_data, resources, dat_filepath)
return header, game_data, resources, dat_filepath
@@ -148,6 +155,7 @@ def main():
if args.match:
for filepath in args.match:
+ # print(parse_dat(filepath)[2])
match_fileset(parse_dat(filepath), args.user)
diff --git a/db_functions.py b/db_functions.py
index 93c0ce0..e6574bb 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -50,6 +50,8 @@ def get_checksum_props(checkcode, checksum):
checksum = checksum.split(":")[1]
+ # print(checksize)
+
return checksize, checktype, checksum
@@ -189,10 +191,23 @@ def insert_file(file, detection, src, conn):
checktype = "None"
detection = 0
detection_type = f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
- if punycode_need_encode(file['name']):
- query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(encode_punycode(file['name']))}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
- else:
- query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(file['name'])}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
+
+ extended_file_size = True if 'size-r' in file else False
+
+ name = encode_punycode(file['name']) if punycode_need_encode(file['name']) else file['name']
+ escaped_name = escape_string(name)
+
+ columns = ['name', 'size']
+ values = [f"'{escaped_name}'", f"'{file['size']}'"]
+
+ if extended_file_size:
+ columns.extend(['`size-r`', '`size-rd`'])
+ values.extend([f"'{file['size-r']}'", f"'{file['size-rd']}'"])
+
+ columns.extend(['checksum', 'fileset', 'detection', 'detection_type', '`timestamp`'])
+ values.extend([f"'{checksum}'", '@fileset_last', str(detection), f"'{detection_type}'", 'NOW()'])
+
+ query = f"INSERT INTO file ({', '.join(columns)}) VALUES ({', '.join(values)})"
with conn.cursor() as cursor:
cursor.execute(query)
@@ -380,8 +395,11 @@ def calc_key(fileset):
def calc_megakey(fileset):
key_string = f":{fileset['platform']}:{fileset['language']}"
- if "rom" in fileset.keys():
- for file in fileset["rom"]:
+ # print(fileset.keys())
+ if 'rom' in fileset.keys():
+ files = fileset['rom']
+ files.sort(key=lambda x : x['name'])
+ for file in fileset['rom']:
for key, value in file.items():
key_string += ":" + str(value)
elif "files" in fileset.keys():
@@ -389,7 +407,8 @@ def calc_megakey(fileset):
for key, value in file.items():
key_string += ":" + str(value)
- key_string = key_string.strip(":")
+ key_string = key_string.strip(':')
+ # print(key_string)
return hashlib.md5(key_string.encode()).hexdigest()
@@ -467,7 +486,7 @@ def db_insert(data_arr, username=None, skiplog=False):
for file in fileset["rom"]:
insert_file(file, detection, src, conn)
for key, value in file.items():
- if key not in ["name", "size"]:
+ if key not in ["name", "size", "size-r", "size-rd"]:
insert_filechecksum(file, key, conn)
if detection:
@@ -492,9 +511,9 @@ def db_insert(data_arr, username=None, skiplog=False):
def compare_filesets(id1, id2, conn):
with conn.cursor() as cursor:
- cursor.execute(f"SELECT name, size, checksum FROM file WHERE fileset = '{id1}'")
+ cursor.execute(f"SELECT name, size, `size-r`, `size-rd`, checksum FROM file WHERE fileset = '{id1}'")
fileset1 = cursor.fetchall()
- cursor.execute(f"SELECT name, size, checksum FROM file WHERE fileset = '{id2}'")
+ cursor.execute(f"SELECT name, size, `size-r`, `size-rd`, checksum FROM file WHERE fileset = '{id2}'")
fileset2 = cursor.fetchall()
# Sort filesets on checksum
@@ -828,7 +847,7 @@ def find_matching_filesets(fileset, conn, status):
for file in fileset["rom"]:
matched_set = set()
for key, value in file.items():
- if key not in ["name", "size", "sha1", "crc"]:
+ if key not in ["name", "size", "size-r", "size-rd", "sha1", "crc"]:
checksum = file[key]
checktype = key
checksize, checktype, checksum = get_checksum_props(
@@ -993,17 +1012,30 @@ def populate_file(fileset, fileset_id, conn, detection):
checktype = "None"
detection = 0
detection_type = f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
- if punycode_need_encode(file['name']):
- query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(encode_punycode(file['name']))}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
- else:
- query = f"INSERT INTO file (name, size, checksum, fileset, detection, detection_type, `timestamp`) VALUES ('{escape_string(file['name'])}', '{file['size']}', '{checksum}', @fileset_last, {detection}, '{detection_type}', NOW())"
+
+ extended_file_size = True if 'size-r' in file else False
+
+ name = encode_punycode(file['name']) if punycode_need_encode(file['name']) else file['name']
+ escaped_name = escape_string(name)
+
+ columns = ['name', 'size']
+ values = [f"'{escaped_name}'", f"'{file['size']}'"]
+
+ if extended_file_size:
+ columns.extend(['`size-r`', '`size-rd`'])
+ values.extend([f"'{file['size-r']}'", f"'{file['size-rd']}'"])
+
+ columns.extend(['checksum', 'fileset', 'detection', 'detection_type', '`timestamp`'])
+ values.extend([f"'{checksum}'", str(fileset_id), str(detection), f"'{detection_type}'", 'NOW()'])
+
+ query = f"INSERT INTO file ({', '.join(columns)}) VALUES ({', '.join(values)})"
cursor.execute(query)
cursor.execute("SET @file_last = LAST_INSERT_ID()")
cursor.execute("SELECT @file_last AS file_id")
file_id = cursor.fetchone()["file_id"]
target_id = None
for key, value in file.items():
- if key not in ["name", "size"]:
+ if key not in ["name", "size", "size-r", "size-rd"]:
insert_filechecksum(file, key, conn)
if value in target_files_dict and not file_exists:
file_exists = True
@@ -1040,7 +1072,7 @@ def insert_new_fileset(
for file in fileset["rom"]:
insert_file(file, detection, src, conn)
for key, value in file.items():
- if key not in ["name", "size", "sha1", "crc"]:
+ if key not in ["name", "size", "size-r", "size-rd" "sha1", "crc"]:
insert_filechecksum(file, key, conn)
@@ -1076,7 +1108,12 @@ def user_integrity_check(data, ip, game_metadata=None):
new_files = []
for file in data["files"]:
- new_file = {"name": file["name"], "size": file["size"]}
+ new_file = {
+ "name": file["name"],
+ "size": file["size"],
+ "size-r": file["size-r"],
+ "size-rd": file["size-rd"]
+ }
for checksum in file["checksums"]:
checksum_type = checksum["type"]
checksum_value = checksum["checksum"]
Commit: b56087660fa87ea5d431eecd3307de3bfbac4b71
https://github.com/scummvm/scummvm-sites/commit/b56087660fa87ea5d431eecd3307de3bfbac4b71
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Identify size type from checksum prefix
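A hedged sketch of the idea behind this change, inferred from the diff below (the dict keys and the 'r'/'m' checksum prefixes follow the conventions used elsewhere in this log; the helper name is made up for illustration): when a file's checksums carry an 'r' or 'm' prefix, the declared size describes the resource fork, so it belongs in `size-rd` and the plain `size` is set to 0.

def split_declared_size(file_entry):
    # file_entry: parsed dat entry, e.g. {"name": ..., "size": 123, "md5-5000": "r:abcd..."}
    sizes = {"size": file_entry.get("size", 0), "size-r": 0, "size-rd": 0}
    for key, value in file_entry.items():
        if key in ("name", "size", "size-r", "size-rd") or ":" not in str(value):
            continue
        if str(value).split(":", 1)[0] in ("r", "m"):
            # Checksum was computed over the resource fork, so the declared
            # size is the resource-fork data size, not the data-fork size.
            sizes["size-rd"], sizes["size"] = sizes["size"], 0
            break
    return sizes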
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index e6574bb..ff97b72 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -197,12 +197,21 @@ def insert_file(file, detection, src, conn):
name = encode_punycode(file['name']) if punycode_need_encode(file['name']) else file['name']
escaped_name = escape_string(name)
- columns = ['name', 'size']
- values = [f"'{escaped_name}'", f"'{file['size']}'"]
+ columns = ['name', 'size', '`size-r`', '`size-rd`']
+ values = [f"'{escaped_name}'"]
if extended_file_size:
- columns.extend(['`size-r`', '`size-rd`'])
- values.extend([f"'{file['size-r']}'", f"'{file['size-rd']}'"])
+ values.extend([f"'{file['size']}'", f"'{file['size-r']}'", f"'{file['size-rd']}'"])
+ else:
+ values.extend([f"'{file['size']}'", f"'0'", f"'0'"])
+ for key, value in file.items():
+ if key not in ["name", "size", "size-r", "size-rd"] and ':' in file[key]:
+ c = file[key]
+ prefix = c.split(':')[0]
+ if prefix == 'r' or prefix == 'm':
+ values[1] = f"'0'"
+ values[3] = f"'{file['size']}'"
+ break
columns.extend(['checksum', 'fileset', 'detection', 'detection_type', '`timestamp`'])
values.extend([f"'{checksum}'", '@fileset_last', str(detection), f"'{detection_type}'", 'NOW()'])
Commit: 0a9c916f0f756f646867b2e507b73711a3d5b108
https://github.com/scummvm/scummvm-sites/commit/0a9c916f0f756f646867b2e507b73711a3d5b108
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Fix fileset deletion logic when matching set filesets with detection.
Previously, the logic attempted to fetch the current fileset ID using the last insert ID, which was incorrect since the last insert was into the `filechecksum` table, not `fileset`. Now, the correct fileset ID is returned directly from the `insert_fileset` function.
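A minimal sketch of the pattern the fix adopts (a pymysql-style cursor and a simplified query are assumed; the actual change lives in insert_fileset and insert_new_fileset): capture cursor.lastrowid immediately after the INSERT into `fileset` and return that id, instead of reading LAST_INSERT_ID() after later inserts into other tables have overwritten it.

def insert_fileset_sketch(cursor, game, status, src, key, megakey):
    cursor.execute(
        "INSERT INTO fileset (game, status, src, `key`, megakey, `timestamp`) "
        "VALUES (%s, %s, %s, %s, %s, NOW())",
        (game, status, src, key, megakey),
    )
    # Grab the id of the fileset row itself, before any file/filechecksum inserts.
    return cursor.lastrowid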
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index ff97b72..031c156 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -48,10 +48,7 @@ def get_checksum_props(checkcode, checksum):
prefix = checksum.split(":")[0]
checktype += "-" + prefix
- checksum = checksum.split(":")[1]
-
- # print(checksize)
-
+ checksum = checksum.split(':')[1]
return checksize, checktype, checksum
@@ -140,13 +137,15 @@ def insert_fileset(
escape_string(category_text), user, escape_string(log_text), conn
)
update_history(existing_entry, existing_entry, conn, log_last)
-
- return True
+
+ return existing_entry
# $game and $key should not be parsed as a mysql string, hence no quotes
query = f"INSERT INTO fileset (game, status, src, `key`, megakey, `timestamp`) VALUES ({game}, '{status}', '{src}', {key}, {megakey}, FROM_UNIXTIME(@fileset_time_last))"
+ fileset_id = -1
with conn.cursor() as cursor:
cursor.execute(query)
+ fileset_id = cursor.lastrowid
cursor.execute("SET @fileset_last = LAST_INSERT_ID()")
category_text = f"Uploaded from {src}"
@@ -171,7 +170,7 @@ def insert_fileset(
f"INSERT INTO transactions (`transaction`, fileset) VALUES ({transaction}, {fileset_last})"
)
- return True
+ return fileset_id
def insert_file(file, detection, src, conn):
@@ -181,6 +180,7 @@ def insert_file(file, detection, src, conn):
checktype = "None"
if "md5" in file:
checksum = file["md5"]
+ checksum = checksum.split(':')[1] if ':' in checksum else checksum
else:
for key, value in file.items():
if "md5" in key:
@@ -811,29 +811,18 @@ def process_fileset(
matched_map = find_matching_filesets(fileset, conn, src)
else:
matched_map = matching_set(fileset, conn)
-
- insert_new_fileset(
- fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
- )
- with conn.cursor() as cursor:
- cursor.execute("SET @fileset_last = LAST_INSERT_ID()")
- cursor.execute("SELECT LAST_INSERT_ID()")
- fileset_last = cursor.fetchone()["LAST_INSERT_ID()"]
+
+ fileset_id = insert_new_fileset(fileset, conn, detection, src, key, megakey, transaction_id, log_text, user)
+ # with conn.cursor() as cursor:
+ # cursor.execute("SET @fileset_last = LAST_INSERT_ID()")
+ # cursor.execute("SELECT LAST_INSERT_ID()")
+ # fileset_last_old = cursor.fetchone()['LAST_INSERT_ID()']
+ # fileset_last = cursor.lastrowid
+ # print(fileset_last_old)
+ # print(fileset_last)
+
if matched_map:
- handle_matched_filesets(
- fileset_last,
- matched_map,
- fileset,
- conn,
- detection,
- src,
- key,
- megakey,
- transaction_id,
- log_text,
- user,
- )
-
+ handle_matched_filesets(fileset_id, matched_map, fileset, conn, detection, src, key, megakey, transaction_id, log_text, user)
def insert_game_data(fileset, conn):
engine_name = fileset["engine"]
@@ -887,13 +876,16 @@ def matching_set(fileset, conn):
matched_set = set()
if "md5" in file:
checksum = file["md5"]
+ if ':' in checksum:
+ checksum = checksum.split(':')[1]
size = file["size"]
+
query = f"""
SELECT DISTINCT fs.id AS fileset_id
FROM fileset fs
JOIN file f ON fs.id = f.fileset
JOIN filechecksum fc ON f.id = fc.file
- WHERE fc.checksum = '{checksum}' AND fc.checktype = 'md5'
+ WHERE fc.checksum = '{checksum}' AND fc.checktype LIKE 'md5%'
AND fc.checksize > {size}
AND fs.status = 'detection'
"""
@@ -977,10 +969,11 @@ def handle_matched_filesets(
def delete_original_fileset(fileset_id, conn):
with conn.cursor() as cursor:
+ print(fileset_id)
cursor.execute(f"DELETE FROM file WHERE fileset = {fileset_id}")
cursor.execute(f"DELETE FROM fileset WHERE id = {fileset_id}")
-
-
+ conn.commit()
+
def update_fileset_status(cursor, fileset_id, status):
cursor.execute(f"""
UPDATE fileset SET
@@ -1063,26 +1056,15 @@ def populate_file(fileset, fileset_id, conn, detection):
f"UPDATE file SET detection_type = 'None' WHERE id = {file_id}"
)
-
-def insert_new_fileset(
- fileset, conn, detection, src, key, megakey, transaction_id, log_text, user, ip=""
-):
- if insert_fileset(
- src,
- detection,
- key,
- megakey,
- transaction_id,
- log_text,
- conn,
- username=user,
- ip=ip,
- ):
+def insert_new_fileset(fileset, conn, detection, src, key, megakey, transaction_id, log_text, user, ip=''):
+ fileset_id = insert_fileset(src, detection, key, megakey, transaction_id, log_text, conn, username=user, ip=ip)
+ if fileset_id:
for file in fileset["rom"]:
insert_file(file, detection, src, conn)
for key, value in file.items():
if key not in ["name", "size", "size-r", "size-rd" "sha1", "crc"]:
insert_filechecksum(file, key, conn)
+ return fileset_id
def log_matched_fileset(src, fileset_last, fileset_id, state, user, conn):
Commit: 61efc527f39eea70879f7755862ca2c55c4e161a
https://github.com/scummvm/scummvm-sites/commit/61efc527f39eea70879f7755862ca2c55c4e161a
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Correct fileset redirection link in log in case of matching
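A short sketch of the redirect lookup this commit introduces (the query is the one from the diff below; the helper name is illustrative, and a dict-returning cursor is assumed as elsewhere in the web app): when a fileset has been merged away, the history table maps the old id to the surviving one, and log links should follow that mapping.

def resolve_fileset_id(cursor, fileset_id):
    cursor.execute(
        "SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset",
        (fileset_id,),
    )
    row = cursor.fetchone()
    return row["fileset"] if row else fileset_id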
Changed paths:
fileset.py
pagination.py
diff --git a/fileset.py b/fileset.py
index 32489cd..bb4483e 100644
--- a/fileset.py
+++ b/fileset.py
@@ -169,6 +169,13 @@ def fileset():
<button type='submit'>Mark as full</button>
</form>
"""
+
+
+ cursor.execute("SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset" , (id,))
+ row = cursor.fetchone()
+ print(row)
+ if row:
+ id = row['fileset']
cursor.execute(f"SELECT * FROM fileset WHERE id = {id}")
result = cursor.fetchone()
print(result)
@@ -1031,7 +1038,7 @@ def fileset_search():
SELECT extra, platform, language, game.gameid, megakey,
status, fileset.id as fileset
FROM fileset
- JOIN game ON game.id = fileset.game
+ LEFT JOIN game ON game.id = fileset.game
"""
order = "ORDER BY fileset.id"
filters = {
diff --git a/pagination.py b/pagination.py
index f9f8ecd..c28a2ee 100644
--- a/pagination.py
+++ b/pagination.py
@@ -195,10 +195,14 @@ def create_page(
if matches:
fileset_id = matches.group(1)
fileset_text = matches.group(0)
- value = value.replace(
- fileset_text,
- f"<a href='fileset?id={fileset_id}'>{fileset_text}</a>",
- )
+ with conn.cursor() as cursor:
+ cursor.execute("SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset", (fileset_id,))
+ row = cursor.fetchone()
+ print(row)
+ if row:
+ fileset_id = row['fileset']
+
+ value = value.replace(fileset_text, f"<a href='fileset?id={fileset_id}'>{fileset_text}</a>")
html += f"<td>{value}</td>\n"
html += "</tr>\n"
Commit: 77494265eef7559d5c149779ba816f4f8ae443cb
https://github.com/scummvm/scummvm-sites/commit/77494265eef7559d5c149779ba816f4f8ae443cb
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add all md5 variants for files with size less than 5000B.
If a file is smaller than 5000 bytes, its checksum is identical for all variants: md5-full_size, md5-5000, md5-t-5000 and md5-1M.
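A hedged sketch of the rule (the variant names are taken from the diff below, with "md5-0" standing for the full-file md5; the helper is illustrative): once one md5 variant of a small file is known, the same value can be stored under the remaining variant types as well.

MD5_VARIANTS = ["md5-0", "md5-1M", "md5-5000", "md5-t-5000"]

def extra_md5_rows(inserted_type, file_size, checksum):
    # For files under 5000 bytes every prefix-limited md5 equals the full md5.
    if file_size > 5000:
        return []
    rows = []
    for variant in MD5_VARIANTS:
        if variant == inserted_type:
            continue
        checktype, _, checksize = variant.rpartition("-")
        rows.append((checksize, checktype, checksum))  # one filechecksum row each
    return rows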
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 031c156..ef32ef7 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -236,9 +236,38 @@ def insert_filechecksum(file, checktype, conn):
checksum = file[checktype]
checksize, checktype, checksum = get_checksum_props(checktype, checksum)
+
query = f"INSERT INTO filechecksum (file, checksize, checktype, checksum) VALUES (@file_last, '{checksize}', '{checktype}', '{checksum}')"
with conn.cursor() as cursor:
cursor.execute(query)
+ if "md5" not in checktype:
+ return
+ if (checktype[-1] == 'm' or checktype[-1] == 'd' or checktype[-1] == 'r'):
+ return
+
+ cursor.execute("SELECT size FROM file WHERE id = @file_last")
+ result = cursor.fetchone()
+ if not result:
+ return
+ file_size = result['size']
+ if file_size != -1 and (int(file_size) <= int(checksize) or int(checksize) == 0) and file_size <= 5000:
+ md5_variants = ['md5-0', 'md5-1M', 'md5-5000', 'md5-t-5000']
+ inserted_checksum_type = checktype + "-" + checksize
+ for cs in md5_variants:
+ if cs != inserted_checksum_type:
+ exploded_checksum = cs.split('-')
+ c_size = exploded_checksum.pop()
+ c_type = '-'.join(exploded_checksum)
+
+ query = f"INSERT INTO filechecksum (file, checksize, checktype, checksum) VALUES (@file_last, '{c_size}', '{c_type}', '{checksum}')"
+ with conn.cursor() as cursor:
+ cursor.execute(query)
+
+
+
+
+
+
def delete_filesets(conn):
@@ -1010,6 +1039,7 @@ def populate_file(fileset, fileset_id, conn, detection):
if "md5" in key:
checksize, checktype, checksum = get_checksum_props(key, value)
break
+
if not detection:
checktype = "None"
detection = 0
Commit: 0e8f5bb45fa3fcb522c208799889779ba0bd961a
https://github.com/scummvm/scummvm-sites/commit/0e8f5bb45fa3fcb522c208799889779ba0bd961a
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add pre-commit hook for ruff.
Set up and ran the ruff formatter and linter on the codebase.
Changed paths:
A .pre-commit-config.yaml
dat_parser.py
db_functions.py
fileset.py
pagination.py
schema.py
tests/create/create_binary.py
tests/test_compute_hash.py
tests/test_punycode.py
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
new file mode 100644
index 0000000..3c9cbcc
--- /dev/null
+++ b/.pre-commit-config.yaml
@@ -0,0 +1,9 @@
+repos:
+- repo: https://github.com/astral-sh/ruff-pre-commit
+ # Ruff version.
+ rev: v0.11.13
+ hooks:
+ # Run the linter.
+ - id: ruff-check
+ # Run the formatter.
+ - id: ruff-format
\ No newline at end of file
diff --git a/dat_parser.py b/dat_parser.py
index e66886c..5fefbce 100644
--- a/dat_parser.py
+++ b/dat_parser.py
@@ -14,7 +14,6 @@ def remove_quotes(string):
def map_checksum_data(content_string):
arr = []
-
content_string = content_string.strip().strip("()").strip()
tokens = re.split(r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)', content_string)
@@ -28,11 +27,11 @@ def map_checksum_data(content_string):
elif tokens[i] == "size":
current_rom["size"] = int(tokens[i + 1])
i += 2
- elif tokens[i] == 'size-r':
- current_rom['size-r'] = int(tokens[i + 1])
+ elif tokens[i] == "size-r":
+ current_rom["size-r"] = int(tokens[i + 1])
i += 2
- elif tokens[i] == 'size-rd':
- current_rom['size-rd'] = int(tokens[i + 1])
+ elif tokens[i] == "size-rd":
+ current_rom["size-rd"] = int(tokens[i + 1])
i += 2
else:
checksum_key = tokens[i]
diff --git a/db_functions.py b/db_functions.py
index ef32ef7..8db29d5 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -38,17 +38,25 @@ def get_checksum_props(checkcode, checksum):
if "-" in checkcode:
exploded_checkcode = checkcode.split("-")
last = exploded_checkcode.pop()
+
+ # For type md5-t-5000
if last == "1M" or last.isdigit():
checksize = last
-
checktype = "-".join(exploded_checkcode)
+ # # Of type md5-5000-t
+ # else:
+ # second_last = exploded_checkcode.pop()
+ # print(second_last)
+ # checksize = second_last
+ # checktype = exploded_checkcode[0]+'-'+last
+
# Detection entries have checktypes as part of the checksum prefix
if ":" in checksum:
prefix = checksum.split(":")[0]
checktype += "-" + prefix
- checksum = checksum.split(':')[1]
+ checksum = checksum.split(":")[1]
return checksize, checktype, checksum
@@ -137,7 +145,7 @@ def insert_fileset(
escape_string(category_text), user, escape_string(log_text), conn
)
update_history(existing_entry, existing_entry, conn, log_last)
-
+
return existing_entry
# $game and $key should not be parsed as a mysql string, hence no quotes
@@ -180,7 +188,7 @@ def insert_file(file, detection, src, conn):
checktype = "None"
if "md5" in file:
checksum = file["md5"]
- checksum = checksum.split(':')[1] if ':' in checksum else checksum
+ checksum = checksum.split(":")[1] if ":" in checksum else checksum
else:
for key, value in file.items():
if "md5" in key:
@@ -190,35 +198,32 @@ def insert_file(file, detection, src, conn):
if not detection:
checktype = "None"
detection = 0
- detection_type = f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
+ detection_type = (
+ f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
+ )
- extended_file_size = True if 'size-r' in file else False
+ name = (
+ encode_punycode(file["name"])
+ if punycode_need_encode(file["name"])
+ else file["name"]
+ )
- name = encode_punycode(file['name']) if punycode_need_encode(file['name']) else file['name']
- escaped_name = escape_string(name)
+ values = [name]
- columns = ['name', 'size', '`size-r`', '`size-rd`']
- values = [f"'{escaped_name}'"]
+ values.append(file["size"] if "size" in file else "0")
+ values.append(file["size-r"] if "size-r" in file else "0")
+ values.append(file["size-rd"] if "size-rd" in file else "0")
- if extended_file_size:
- values.extend([f"'{file['size']}'", f"'{file['size-r']}'", f"'{file['size-rd']}'"])
- else:
- values.extend([f"'{file['size']}'", f"'0'", f"'0'"])
- for key, value in file.items():
- if key not in ["name", "size", "size-r", "size-rd"] and ':' in file[key]:
- c = file[key]
- prefix = c.split(':')[0]
- if prefix == 'r' or prefix == 'm':
- values[1] = f"'0'"
- values[3] = f"'{file['size']}'"
- break
+ values.extend([checksum, detection, detection_type])
- columns.extend(['checksum', 'fileset', 'detection', 'detection_type', '`timestamp`'])
- values.extend([f"'{checksum}'", '@fileset_last', str(detection), f"'{detection_type}'", 'NOW()'])
+ # Parameterised Query
+ placeholders = (
+ ["%s"] * (len(values[:5])) + ["@fileset_last"] + ["%s"] * 2 + ["NOW()"]
+ )
+ query = f"INSERT INTO file ( name, size, `size-r`, `size-rd`, checksum, fileset, detection, detection_type, `timestamp` ) VALUES ({', '.join(placeholders)})"
- query = f"INSERT INTO file ({', '.join(columns)}) VALUES ({', '.join(values)})"
with conn.cursor() as cursor:
- cursor.execute(query)
+ cursor.execute(query, values)
if detection:
with conn.cursor() as cursor:
@@ -236,38 +241,50 @@ def insert_filechecksum(file, checktype, conn):
checksum = file[checktype]
checksize, checktype, checksum = get_checksum_props(checktype, checksum)
-
query = f"INSERT INTO filechecksum (file, checksize, checktype, checksum) VALUES (@file_last, '{checksize}', '{checktype}', '{checksum}')"
with conn.cursor() as cursor:
cursor.execute(query)
if "md5" not in checktype:
return
- if (checktype[-1] == 'm' or checktype[-1] == 'd' or checktype[-1] == 'r'):
- return
-
- cursor.execute("SELECT size FROM file WHERE id = @file_last")
+
+ size_name = "size"
+ if checktype[-1] == "r":
+ size_name += "-rd"
+ if checktype[-1] == "s":
+ size_name += "-d"
+
+ cursor.execute(f"SELECT `{size_name}` FROM file WHERE id = @file_last")
result = cursor.fetchone()
if not result:
return
- file_size = result['size']
- if file_size != -1 and (int(file_size) <= int(checksize) or int(checksize) == 0) and file_size <= 5000:
- md5_variants = ['md5-0', 'md5-1M', 'md5-5000', 'md5-t-5000']
- inserted_checksum_type = checktype + "-" + checksize
- for cs in md5_variants:
- if cs != inserted_checksum_type:
- exploded_checksum = cs.split('-')
- c_size = exploded_checksum.pop()
- c_type = '-'.join(exploded_checksum)
-
- query = f"INSERT INTO filechecksum (file, checksize, checktype, checksum) VALUES (@file_last, '{c_size}', '{c_type}', '{checksum}')"
+ file_size = result[size_name]
+ c_size = checksize
+ if checksize == "1M":
+ c_size = 1024 * 1024
+ if (
+ file_size != -1
+ and (int(file_size) <= int(c_size) or int(c_size) == 0)
+ and file_size <= 5000
+ ):
+ md5_variants_map = {
+ "d": ["md5-d-0", "md5-d-1M", "md5-d-5000", "md5-dt-5000"],
+ "r": ["md5-r-0", "md5-r-1M", "md5-r-5000", "md5-rt-5000"],
+ "default": ["md5-0", "md5-1M", "md5-5000", "md5-t-5000"],
+ }
+
+ key = checktype[-1] if checktype[-1] in md5_variants_map else "default"
+ variants = md5_variants_map[key]
+ inserted_checksum_type = f"{checktype}-{checksize}"
+
+ for checksum_name in variants:
+ if checksum_name != inserted_checksum_type:
+ exploded = checksum_name.split("-")
+ checksum_size = exploded.pop()
+ checksum_type = "-".join(exploded)
+
+ query = "INSERT INTO filechecksum (file, checksize, checktype, checksum) VALUES (@file_last, %s, %s, %s)"
with conn.cursor() as cursor:
- cursor.execute(query)
-
-
-
-
-
-
+ cursor.execute(query, (checksum_size, checksum_type, checksum))
def delete_filesets(conn):
@@ -295,12 +312,13 @@ def my_escape_string(s: str) -> str:
new_name += char
return new_name
+
def encode_punycode(orig):
"""
Punyencode strings
- escape special characters and
- - ensure filenames can't end in a space or dot
+ - ensure filenames can't end in a space or dotif temp == None:
"""
s = my_escape_string(orig)
encoded = s.encode("punycode").decode("ascii")
@@ -434,10 +452,10 @@ def calc_key(fileset):
def calc_megakey(fileset):
key_string = f":{fileset['platform']}:{fileset['language']}"
# print(fileset.keys())
- if 'rom' in fileset.keys():
- files = fileset['rom']
- files.sort(key=lambda x : x['name'])
- for file in fileset['rom']:
+ if "rom" in fileset.keys():
+ files = fileset["rom"]
+ files.sort(key=lambda x: x["name"])
+ for file in fileset["rom"]:
for key, value in file.items():
key_string += ":" + str(value)
elif "files" in fileset.keys():
@@ -445,7 +463,7 @@ def calc_megakey(fileset):
for key, value in file.items():
key_string += ":" + str(value)
- key_string = key_string.strip(':')
+ key_string = key_string.strip(":")
# print(key_string)
return hashlib.md5(key_string.encode()).hexdigest()
@@ -549,9 +567,13 @@ def db_insert(data_arr, username=None, skiplog=False):
def compare_filesets(id1, id2, conn):
with conn.cursor() as cursor:
- cursor.execute(f"SELECT name, size, `size-r`, `size-rd`, checksum FROM file WHERE fileset = '{id1}'")
+ cursor.execute(
+ f"SELECT name, size, `size-r`, `size-rd`, checksum FROM file WHERE fileset = '{id1}'"
+ )
fileset1 = cursor.fetchall()
- cursor.execute(f"SELECT name, size, `size-r`, `size-rd`, checksum FROM file WHERE fileset = '{id2}'")
+ cursor.execute(
+ f"SELECT name, size, `size-r`, `size-rd`, checksum FROM file WHERE fileset = '{id2}'"
+ )
fileset2 = cursor.fetchall()
# Sort filesets on checksum
@@ -840,8 +862,10 @@ def process_fileset(
matched_map = find_matching_filesets(fileset, conn, src)
else:
matched_map = matching_set(fileset, conn)
-
- fileset_id = insert_new_fileset(fileset, conn, detection, src, key, megakey, transaction_id, log_text, user)
+
+ fileset_id = insert_new_fileset(
+ fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
+ )
# with conn.cursor() as cursor:
# cursor.execute("SET @fileset_last = LAST_INSERT_ID()")
# cursor.execute("SELECT LAST_INSERT_ID()")
@@ -849,9 +873,22 @@ def process_fileset(
# fileset_last = cursor.lastrowid
# print(fileset_last_old)
# print(fileset_last)
-
+
if matched_map:
- handle_matched_filesets(fileset_id, matched_map, fileset, conn, detection, src, key, megakey, transaction_id, log_text, user)
+ handle_matched_filesets(
+ fileset_id,
+ matched_map,
+ fileset,
+ conn,
+ detection,
+ src,
+ key,
+ megakey,
+ transaction_id,
+ log_text,
+ user,
+ )
+
def insert_game_data(fileset, conn):
engine_name = fileset["engine"]
@@ -905,8 +942,8 @@ def matching_set(fileset, conn):
matched_set = set()
if "md5" in file:
checksum = file["md5"]
- if ':' in checksum:
- checksum = checksum.split(':')[1]
+ if ":" in checksum:
+ checksum = checksum.split(":")[1]
size = file["size"]
query = f"""
@@ -1002,7 +1039,8 @@ def delete_original_fileset(fileset_id, conn):
cursor.execute(f"DELETE FROM file WHERE fileset = {fileset_id}")
cursor.execute(f"DELETE FROM fileset WHERE id = {fileset_id}")
conn.commit()
-
+
+
def update_fileset_status(cursor, fileset_id, status):
cursor.execute(f"""
UPDATE fileset SET
@@ -1039,60 +1077,127 @@ def populate_file(fileset, fileset_id, conn, detection):
if "md5" in key:
checksize, checktype, checksum = get_checksum_props(key, value)
break
-
+
if not detection:
checktype = "None"
detection = 0
- detection_type = f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
+ detection_type = (
+ f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
+ )
- extended_file_size = True if 'size-r' in file else False
+ extended_file_size = True if "size-r" in file else False
- name = encode_punycode(file['name']) if punycode_need_encode(file['name']) else file['name']
+ name = (
+ encode_punycode(file["name"])
+ if punycode_need_encode(file["name"])
+ else file["name"]
+ )
escaped_name = escape_string(name)
- columns = ['name', 'size']
+ columns = ["name", "size"]
values = [f"'{escaped_name}'", f"'{file['size']}'"]
if extended_file_size:
- columns.extend(['`size-r`', '`size-rd`'])
+ columns.extend(["`size-r`", "`size-rd`"])
values.extend([f"'{file['size-r']}'", f"'{file['size-rd']}'"])
- columns.extend(['checksum', 'fileset', 'detection', 'detection_type', '`timestamp`'])
- values.extend([f"'{checksum}'", str(fileset_id), str(detection), f"'{detection_type}'", 'NOW()'])
+ columns.extend(
+ ["checksum", "fileset", "detection", "detection_type", "`timestamp`"]
+ )
+ values.extend(
+ [
+ f"'{checksum}'",
+ str(fileset_id),
+ str(detection),
+ f"'{detection_type}'",
+ "NOW()",
+ ]
+ )
- query = f"INSERT INTO file ({', '.join(columns)}) VALUES ({', '.join(values)})"
+ query = (
+ f"INSERT INTO file ({', '.join(columns)}) VALUES ({', '.join(values)})"
+ )
cursor.execute(query)
cursor.execute("SET @file_last = LAST_INSERT_ID()")
cursor.execute("SELECT @file_last AS file_id")
+
file_id = cursor.fetchone()["file_id"]
- target_id = None
+ d_type = 0
+ previous_checksums = {}
+
for key, value in file.items():
if key not in ["name", "size", "size-r", "size-rd"]:
insert_filechecksum(file, key, conn)
if value in target_files_dict and not file_exists:
+ cursor.execute(
+ f"SELECT detection_type FROM file WHERE id = {target_files_dict[value]['id']}"
+ )
+ d_type = cursor.fetchone()["detection_type"]
file_exists = True
- target_id = target_files_dict[value]["id"]
+ cursor.execute(
+ f"SELECT * FROM file WHERE fileset = {fileset_id}"
+ )
+ target_files = cursor.fetchall()
+ for target_file in target_files:
+ cursor.execute(
+ f"SELECT * FROM filechecksum WHERE file = {target_file['id']}"
+ )
+ target_checksums = cursor.fetchall()
+ for checksum in target_checksums:
+ previous_checksums[
+ f"{checksum['checktype']}-{checksum['checksize']}"
+ ] = checksum["checksum"]
cursor.execute(
f"DELETE FROM file WHERE id = {target_files_dict[value]['id']}"
)
if file_exists:
+ cursor.execute(
+ f"SELECT checktype, checksize FROM filechecksum WHERE file = {file_id}"
+ )
+ existing_checks = cursor.fetchall()
+ existing_checksum = []
+ for existing_check in existing_checks:
+ existing_checksum.append(
+ existing_check["checktype"] + "-" + existing_check["checksize"]
+ )
+ for key, value in previous_checksums.items():
+ if key not in existing_checksum:
+ checksize, checktype, checksum = get_checksum_props(key, value)
+ cursor.execute(
+ "INSERT INTO filechecksum (file, checksize, checktype, checksum) VALUES (%s, %s, %s, %s)",
+ (file_id, checksize, checktype, checksum),
+ )
+
cursor.execute(f"UPDATE file SET detection = 1 WHERE id = {file_id}")
cursor.execute(
- f"UPDATE file SET detection_type = '{target_files_dict[target_id]}' WHERE id = {file_id}"
+ f"UPDATE file SET detection_type = '{d_type}' WHERE id = {file_id}"
)
else:
cursor.execute(
f"UPDATE file SET detection_type = 'None' WHERE id = {file_id}"
)
-def insert_new_fileset(fileset, conn, detection, src, key, megakey, transaction_id, log_text, user, ip=''):
- fileset_id = insert_fileset(src, detection, key, megakey, transaction_id, log_text, conn, username=user, ip=ip)
+
+def insert_new_fileset(
+ fileset, conn, detection, src, key, megakey, transaction_id, log_text, user, ip=""
+):
+ fileset_id = insert_fileset(
+ src,
+ detection,
+ key,
+ megakey,
+ transaction_id,
+ log_text,
+ conn,
+ username=user,
+ ip=ip,
+ )
if fileset_id:
for file in fileset["rom"]:
insert_file(file, detection, src, conn)
for key, value in file.items():
- if key not in ["name", "size", "size-r", "size-rd" "sha1", "crc"]:
+ if key not in ["name", "size", "size-r", "size-rd", "sha1", "crc"]:
insert_filechecksum(file, key, conn)
return fileset_id
@@ -1133,7 +1238,7 @@ def user_integrity_check(data, ip, game_metadata=None):
"name": file["name"],
"size": file["size"],
"size-r": file["size-r"],
- "size-rd": file["size-rd"]
+ "size-rd": file["size-rd"],
}
for checksum in file["checksums"]:
checksum_type = checksum["type"]
diff --git a/fileset.py b/fileset.py
index bb4483e..1a6add7 100644
--- a/fileset.py
+++ b/fileset.py
@@ -75,7 +75,8 @@ def index():
"""
return render_template_string(html)
-@app.route('/clear_database', methods=['POST'])
+
+@app.route("/clear_database", methods=["POST"])
def clear_database():
try:
conn = db_connect()
@@ -98,11 +99,10 @@ def clear_database():
finally:
conn.close()
- return redirect('/')
-
+ return redirect("/")
-@app.route('/fileset', methods=['GET', 'POST'])
+@app.route("/fileset", methods=["GET", "POST"])
def fileset():
id = request.args.get("id", default=1, type=int)
widetable = request.args.get("widetable", default="partial", type=str)
@@ -169,13 +169,15 @@ def fileset():
<button type='submit'>Mark as full</button>
</form>
"""
-
-
- cursor.execute("SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset" , (id,))
+
+ cursor.execute(
+ "SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset",
+ (id,),
+ )
row = cursor.fetchone()
print(row)
if row:
- id = row['fileset']
+ id = row["fileset"]
cursor.execute(f"SELECT * FROM fileset WHERE id = {id}")
result = cursor.fetchone()
print(result)
@@ -228,6 +230,8 @@ def fileset():
share_columns = [
"name",
"size",
+ "size-r",
+ "size-rd",
"checksum",
"detection",
"detection_type",
@@ -244,8 +248,12 @@ def fileset():
columns_to_select = "file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp`"
columns_to_select += ", ".join(md5_columns)
- print(f"SELECT file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}")
- cursor.execute(f"SELECT file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}")
+ print(
+ f"SELECT file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}"
+ )
+ cursor.execute(
+ f"SELECT file.id, name, size, `size-r`, `size-rd`, checksum, detection, detection_type, `timestamp` FROM file WHERE fileset = {id} {order}"
+ )
result = cursor.fetchall()
all_columns = list(result[0].keys()) if result else []
@@ -651,7 +659,6 @@ def confirm_merge(id):
WHERE
fs.id = {target_id}
""")
- target_fileset = cursor.fetchone()
def highlight_differences(source, target):
diff = difflib.ndiff(source, target)
@@ -682,6 +689,9 @@ def confirm_merge(id):
<table border="1">
<tr><th>Field</th><th>Source Fileset</th><th>Target Fileset</th></tr>
"""
+
+ target_fileset = cursor.fetchone()
+
for column in source_fileset.keys():
source_value = str(source_fileset[column])
target_value = str(target_fileset[column])
@@ -1077,4 +1087,4 @@ def delete_files(id):
if __name__ == "__main__":
app.secret_key = secret_key
- app.run(port=5001,debug=True, host='0.0.0.0')
+ app.run(port=5001, debug=True, host="0.0.0.0")
diff --git a/pagination.py b/pagination.py
index c28a2ee..f7b593e 100644
--- a/pagination.py
+++ b/pagination.py
@@ -196,13 +196,19 @@ def create_page(
fileset_id = matches.group(1)
fileset_text = matches.group(0)
with conn.cursor() as cursor:
- cursor.execute("SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset", (fileset_id,))
+ cursor.execute(
+ "SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset",
+ (fileset_id,),
+ )
row = cursor.fetchone()
print(row)
if row:
- fileset_id = row['fileset']
+ fileset_id = row["fileset"]
- value = value.replace(fileset_text, f"<a href='fileset?id={fileset_id}'>{fileset_text}</a>")
+ value = value.replace(
+ fileset_text,
+ f"<a href='fileset?id={fileset_id}'>{fileset_text}</a>",
+ )
html += f"<td>{value}</td>\n"
html += "</tr>\n"
diff --git a/schema.py b/schema.py
index 438bb12..98e2536 100644
--- a/schema.py
+++ b/schema.py
@@ -175,15 +175,22 @@ except Exception:
cursor.execute("ALTER TABLE file MODIFY COLUMN punycode_name VARCHAR(200);")
try:
- cursor.execute("ALTER TABLE file ADD COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';")
-except:
- cursor.execute("ALTER TABLE file MODIFY COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';")
-
-try:
- cursor.execute("ALTER TABLE file ADD COLUMN `size-r` BIGINT DEFAULT 0, ADD COLUMN `size-rd` BIGINT DEFAULT 0;")
-except:
- cursor.execute("ALTER TABLE file MODIFY COLUMN `size-r` BIGINT DEFAULT 0, MODIFY COLUMN `size-rd` BIGINT DEFAULT 0;")
+ cursor.execute(
+ "ALTER TABLE file ADD COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';"
+ )
+except Exception:
+ cursor.execute(
+ "ALTER TABLE file MODIFY COLUMN encoding_type VARCHAR(20) DEFAULT 'UTF-8';"
+ )
+try:
+ cursor.execute(
+ "ALTER TABLE file ADD COLUMN `size-r` BIGINT DEFAULT 0, ADD COLUMN `size-rd` BIGINT DEFAULT 0;"
+ )
+except Exception:
+ cursor.execute(
+ "ALTER TABLE file MODIFY COLUMN `size-r` BIGINT DEFAULT 0, MODIFY COLUMN `size-rd` BIGINT DEFAULT 0;"
+ )
for index, definition in indices.items():
diff --git a/tests/create/create_binary.py b/tests/create/create_binary.py
index d3d9842..4001fa6 100644
--- a/tests/create/create_binary.py
+++ b/tests/create/create_binary.py
@@ -1,16 +1,16 @@
import struct
import os
+
def generate_macbinary_test_files():
output_dir = "../data/invalid_mac_binary/"
with open(os.path.join(output_dir, "len_less_than_128_bytes.bin"), "wb") as f:
- f.write(b'\x12')
-
+ f.write(b"\x12")
- header = bytearray([1]*128)
+ header = bytearray([1] * 128)
# name length, data fork len, resource fork len, type/creator
- header[1] = 0
+ header[1] = 0
header[83:87] = struct.pack(">I", 0)
header[87:91] = struct.pack(">I", 0)
header[69:73] = struct.pack(">I", 0)
@@ -18,26 +18,23 @@ def generate_macbinary_test_files():
with open(os.path.join(output_dir, "zero_len_fields.bin"), "wb") as f:
f.write(header)
- header = bytearray([1]*128)
+ header = bytearray([1] * 128)
header[1] = 10
header[124:126] = struct.pack(">H", 0xFFFF)
with open(os.path.join(output_dir, "bad_checksum.bin"), "wb") as f:
f.write(header)
-
header[124:126] = struct.pack(">H", 4263)
- header[1] = 100
+ header[1] = 100
with open(os.path.join(output_dir, "name_length_too_large.bin"), "wb") as f:
f.write(header)
-
header[1] = 10
header[83:87] = struct.pack(">I", 100)
header[87:91] = struct.pack(">I", 50)
with open(os.path.join(output_dir, "forks_mismatch.bin"), "wb") as f:
f.write(header)
- f.write(b'\x00' * 50)
-
+ f.write(b"\x00" * 50)
output_dir = "../data/valid_mac_binary/"
@@ -45,25 +42,24 @@ def generate_macbinary_test_files():
header = bytearray(128)
header[0] = 0
header[1] = 5
- header[2:7] = b'test\x00'
- header[69:73] = b'1111'
- header[73:77] = b'1111'
+ header[2:7] = b"test\x00"
+ header[69:73] = b"1111"
+ header[73:77] = b"1111"
header[74] = 0
header[82] = 0
data_fork_len = 10
header[83:87] = struct.pack(">I", data_fork_len)
res_fork_len = 0
header[87:91] = struct.pack(">I", res_fork_len)
- data_fork = b'0123456789'
- data_fork_len_padded = (((data_fork_len + 127) >> 7) << 7)
- data_fork_padded = data_fork + b'\x00' * (data_fork_len_padded - data_fork_len)
+ data_fork = b"0123456789"
+ data_fork_len_padded = ((data_fork_len + 127) >> 7) << 7
+ data_fork_padded = data_fork + b"\x00" * (data_fork_len_padded - data_fork_len)
header[124:126] = struct.pack(">H", 27858)
- file_size = 128 + data_fork_len_padded + res_fork_len
-
with open(os.path.join(output_dir, "valid_macbinary.bin"), "wb") as f:
f.write(header)
f.write(data_fork_padded)
+
if __name__ == "__main__":
- generate_macbinary_test_files()
\ No newline at end of file
+ generate_macbinary_test_files()
diff --git a/tests/test_compute_hash.py b/tests/test_compute_hash.py
index 30ef94e..b1cab92 100644
--- a/tests/test_compute_hash.py
+++ b/tests/test_compute_hash.py
@@ -1,16 +1,18 @@
import sys
import os
+
sys.path.insert(0, ".")
from compute_hash import is_macbin
+
def test_is_macbin():
invalid_mac_dir = "tests/data/invalid_mac_binary"
valid_mac_dir = "tests/data/valid_mac_binary"
checks = []
- for file in os.listdir(valid_mac_dir):
+ for file in os.listdir(valid_mac_dir):
checks.append([os.path.join(valid_mac_dir, file), True])
- for file in os.listdir(invalid_mac_dir):
+ for file in os.listdir(invalid_mac_dir):
checks.append([os.path.join(invalid_mac_dir, file), False])
for input, expected in checks:
diff --git a/tests/test_punycode.py b/tests/test_punycode.py
index bff37b1..1affd0c 100644
--- a/tests/test_punycode.py
+++ b/tests/test_punycode.py
@@ -3,50 +3,58 @@ from db_functions import punycode_need_encode, encode_punycode
def test_needs_punyencoding():
checks = [
- ["Icon\r", True],
- ["ascii", False],
- ["ends with dot .", True],
- ["ends with space ", True],
- ["ããããã¤(Power PC)", True],
- ["Hello*", True],
- ["File I/O", True],
- ["HDã«ï½ºï¾ï¾ï½°ãã¦ä¸ãããG3", True],
- ["Buried in Time⢠Demo", True],
- ["â¢Main Menu", True],
- ["Spaceship Warlockâ¢", True],
- ["ã¯ããã¼ã¸ã£ãã¯ã®å¤§åéº<ãã¢>", True],
- ["Jönssonligan går på djupet.exe", True],
- ["Jönssonligan.exe", True],
- ["G3ãã©ã«ã", True],
- ["Big[test]", False],
- ["Where \\ Do <you> Want / To: G* ? ;Unless=nowhere,or|\"(everything)/\":*|\\?%<>,;=", True],
- ["Buried in Timeェ Demo", True]
+ ["Icon\r", True],
+ ["ascii", False],
+ ["ends with dot .", True],
+ ["ends with space ", True],
+ ["ããããã¤(Power PC)", True],
+ ["Hello*", True],
+ ["File I/O", True],
+ ["HDã«ï½ºï¾ï¾ï½°ãã¦ä¸ãããG3", True],
+ ["Buried in Time⢠Demo", True],
+ ["â¢Main Menu", True],
+ ["Spaceship Warlockâ¢", True],
+ ["ã¯ããã¼ã¸ã£ãã¯ã®å¤§åéº<ãã¢>", True],
+ ["Jönssonligan går på djupet.exe", True],
+ ["Jönssonligan.exe", True],
+ ["G3ãã©ã«ã", True],
+ ["Big[test]", False],
+ [
+ 'Where \\ Do <you> Want / To: G* ? ;Unless=nowhere,or|"(everything)/":*|\\?%<>,;=',
+ True,
+ ],
+ ["Buried in Timeェ Demo", True],
]
for input, expected in checks:
assert punycode_need_encode(input) == expected
+
def test_punycode_encode():
checks = [
- ["Icon\r", "xn--Icon-ja6e"],
- ["ascii", "ascii"],
- ["ends with dot .", "xn--ends with dot .-"],
- ["ends with space ", "xn--ends with space -"],
- ["ããããã¤(Power PC)", "xn--(Power PC)-jx4ilmwb1a7h"],
- ["Hello*", "xn--Hello-la10a"],
- ["File I/O", "xn--File IO-oa82b"],
- ["HDã«ï½ºï¾ï¾ï½°ãã¦ä¸ãããG3", "xn--HDG3-rw3c5o2dpa9kzb2170dd4tzyda5j4k"],
- ["Buried in Time⢠Demo", "xn--Buried in Time Demo-eo0l"],
- ["â¢Main Menu", "xn--Main Menu-zd0e"],
- ["Spaceship Warlockâ¢", "xn--Spaceship Warlock-306j"],
- ["ã¯ããã¼ã¸ã£ãã¯ã®å¤§åéº<ãã¢>", "xn--baa0pja0512dela6bueub9gshf1k1a1rt742c060a2x4u"],
- ["Jönssonligan går på djupet.exe", "xn--Jnssonligan gr p djupet.exe-glcd70c"],
- ["Jönssonligan.exe", "xn--Jnssonligan.exe-8sb"],
- ["G3ãã©ã«ã", "xn--G3-3g4axdtexf"],
- ["Big[test]", "Big[test]"],
- ["Where \\ Do <you> Want / To: G* ? ;Unless=nowhere,or|\"(everything)/\":*|\\?%<>,;=", "xn--Where Do you Want To G ;Unless=nowhere,or(everything),;=-5baedgdcbtamaaaaaaaaa99woa3wnnmb82aqb71ekb9g3c1f1cyb7bx6rfcv2pxa"],
- ["Buried in Timeェ Demo", "xn--Buried in Time Demo-yp97h"]
+ ["Icon\r", "xn--Icon-ja6e"],
+ ["ascii", "ascii"],
+ ["ends with dot .", "xn--ends with dot .-"],
+ ["ends with space ", "xn--ends with space -"],
+ ["ããããã¤(Power PC)", "xn--(Power PC)-jx4ilmwb1a7h"],
+ ["Hello*", "xn--Hello-la10a"],
+ ["File I/O", "xn--File IO-oa82b"],
+ ["HDã«ï½ºï¾ï¾ï½°ãã¦ä¸ãããG3", "xn--HDG3-rw3c5o2dpa9kzb2170dd4tzyda5j4k"],
+ ["Buried in Time⢠Demo", "xn--Buried in Time Demo-eo0l"],
+ ["â¢Main Menu", "xn--Main Menu-zd0e"],
+ ["Spaceship Warlockâ¢", "xn--Spaceship Warlock-306j"],
+ [
+ "ã¯ããã¼ã¸ã£ãã¯ã®å¤§åéº<ãã¢>",
+ "xn--baa0pja0512dela6bueub9gshf1k1a1rt742c060a2x4u",
+ ],
+ ["Jönssonligan går på djupet.exe", "xn--Jnssonligan gr p djupet.exe-glcd70c"],
+ ["Jönssonligan.exe", "xn--Jnssonligan.exe-8sb"],
+ ["G3ãã©ã«ã", "xn--G3-3g4axdtexf"],
+ ["Big[test]", "Big[test]"],
+ [
+ 'Where \\ Do <you> Want / To: G* ? ;Unless=nowhere,or|"(everything)/":*|\\?%<>,;=',
+ "xn--Where Do you Want To G ;Unless=nowhere,or(everything),;=-5baedgdcbtamaaaaaaaaa99woa3wnnmb82aqb71ekb9g3c1f1cyb7bx6rfcv2pxa",
+ ],
+ ["Buried in Timeェ Demo", "xn--Buried in Time Demo-yp97h"],
]
for input, expected in checks:
assert encode_punycode(input) == expected
-
-
Commit: d84d2c502da5794750c369f6b6b0782c1adab7c2
https://github.com/scummvm/scummvm-sites/commit/d84d2c502da5794750c369f6b6b0782c1adab7c2
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add warning log and skip duplicate detection entry.
Some entries in scummvm.dat share the same megakey even though their gameid or title differs. During initial seeding, such an entry is now skipped if one with the same megakey has already been added, and its engineid, gameid, platform, and language are logged.
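A minimal sketch of the duplicate check, assuming a dict-returning cursor as used elsewhere in db_functions.py (in db_insert the check only runs during initial seeding, i.e. while the file table is still empty):

def find_duplicate_megakey(cursor, megakey):
    cursor.execute("SELECT id FROM fileset WHERE megakey = %s", (megakey,))
    existing = cursor.fetchone()
    return existing["id"] if existing else None

If a duplicate id comes back, db_insert logs a warning with the entry's engineid, gameid, platform and language and continues with the next fileset.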
Changed paths:
R .pre-commit-config.yaml
.gitignore
db_functions.py
diff --git a/.gitignore b/.gitignore
index a06a587..29610c1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,8 +2,4 @@
mysql_config.json
__pycache__
.DS_Store
-.pre-commit-config.yaml
-dumps
.pytest_cache
-mac_dats
-macresfork 2
\ No newline at end of file
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
deleted file mode 100644
index 3c9cbcc..0000000
--- a/.pre-commit-config.yaml
+++ /dev/null
@@ -1,9 +0,0 @@
-repos:
-- repo: https://github.com/astral-sh/ruff-pre-commit
- # Ruff version.
- rev: v0.11.13
- hooks:
- # Run the linter.
- - id: ruff-check
- # Run the formatter.
- - id: ruff-format
\ No newline at end of file
diff --git a/db_functions.py b/db_functions.py
index 8db29d5..4ff8774 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -487,6 +487,10 @@ def db_insert(data_arr, username=None, skiplog=False):
print(f"Missing key in header: {e}")
return
+ cursor = conn.cursor()
+ cursor.execute("SELECT COUNT(*) FROM file;")
+ is_db_empty = list(cursor.fetchone().values())[0] == 0
+
src = "dat" if author not in ["scan", "scummvm"] else author
detection = src == "scummvm"
@@ -508,6 +512,9 @@ def db_insert(data_arr, username=None, skiplog=False):
create_log(escape_string(category_text), user, escape_string(log_text), conn)
for fileset in game_data:
+ key = calc_key(fileset) if not detection else ""
+ megakey = calc_megakey(fileset) if detection else ""
+
if detection:
engine_name = fileset["engine"]
engineid = fileset["sourcefile"]
@@ -517,6 +524,18 @@ def db_insert(data_arr, username=None, skiplog=False):
platform = fileset["platform"]
lang = fileset["language"]
+ if is_db_empty:
+ with conn.cursor() as cursor:
+ cursor.execute(
+ "SELECT id FROM fileset WHERE megakey = %s", (megakey,)
+ )
+ existing_entry = cursor.fetchone()
+ if existing_entry is not None:
+ log_text = f"Skipping Entry as megakey already exsits in Fileset:{existing_entry['id']} : engineid = {engineid}, gameid = {gameid}, platform = {platform}, language = {lang}"
+ create_log("Warning", user, escape_string(log_text), conn)
+ print(log_text)
+ continue
+
insert_game(
engine_name, engineid, title, gameid, extra, platform, lang, conn
)
@@ -524,8 +543,6 @@ def db_insert(data_arr, username=None, skiplog=False):
if "romof" in fileset and fileset["romof"] in resources:
fileset["rom"] = fileset["rom"] + resources[fileset["romof"]]["rom"]
- key = calc_key(fileset) if not detection else ""
- megakey = calc_megakey(fileset) if detection else ""
log_text = f"size {os.path.getsize(filepath)}, author {author}, version {version}. State {status}."
if insert_fileset(
@@ -1035,7 +1052,6 @@ def handle_matched_filesets(
def delete_original_fileset(fileset_id, conn):
with conn.cursor() as cursor:
- print(fileset_id)
cursor.execute(f"DELETE FROM file WHERE fileset = {fileset_id}")
cursor.execute(f"DELETE FROM fileset WHERE id = {fileset_id}")
conn.commit()
Commit: 0c9d84560fd279cfd4f214e7c4aabf8e7d948cfe
https://github.com/scummvm/scummvm-sites/commit/0c9d84560fd279cfd4f214e7c4aabf8e7d948cfe
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Drop support for generating m prefix checksums for macfiles
Changed paths:
compute_hash.py
diff --git a/compute_hash.py b/compute_hash.py
index f74f67c..0067cfd 100644
--- a/compute_hash.py
+++ b/compute_hash.py
@@ -316,8 +316,6 @@ def file_checksum(filepath, alg, size, file_info):
(resfork, cur_file_size) = actual_mac_fork_get_resource_fork_data(filepath)
datafork = actual_mac_fork_get_data_fork(filepath)
- combined_forks = datafork + resfork
-
hashes = checksum(resfork, alg, size, filepath)
prefix = 'r'
if len(resfork):
@@ -327,10 +325,6 @@ def file_checksum(filepath, alg, size, file_info):
prefix = 'd'
res.extend(create_checksum_pairs(hashes, alg, size, prefix))
- hashes = checksum(combined_forks, alg, size, filepath)
- prefix = 'm'
- res.extend(create_checksum_pairs(hashes, alg, size, prefix))
-
return (res, cur_file_size)
def create_checksum_pairs(hashes, alg, size, prefix=None):
Commit: 3f248a44ce7e1bcebee748899b651229e981f253
https://github.com/scummvm/scummvm-sites/commit/3f248a44ce7e1bcebee748899b651229e981f253
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Remove sha1 and crc checksum addition to database
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 4ff8774..60c02ac 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -559,7 +559,7 @@ def db_insert(data_arr, username=None, skiplog=False):
for file in fileset["rom"]:
insert_file(file, detection, src, conn)
for key, value in file.items():
- if key not in ["name", "size", "size-r", "size-rd"]:
+ if key not in ["name", "size", "size-r", "size-rd", "sha1", "crc"]:
insert_filechecksum(file, key, conn)
if detection:
@@ -1142,7 +1142,7 @@ def populate_file(fileset, fileset_id, conn, detection):
previous_checksums = {}
for key, value in file.items():
- if key not in ["name", "size", "size-r", "size-rd"]:
+ if key not in ["name", "size", "size-r", "size-rd", "sha1", "crc"]:
insert_filechecksum(file, key, conn)
if value in target_files_dict and not file_exists:
cursor.execute(
Commit: 109f206728e3cb65ecff376b96cebfbbcd8462af
https://github.com/scummvm/scummvm-sites/commit/109f206728e3cb65ecff376b96cebfbbcd8462af
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Parse 'f' prefix checksum as normal front bytes checksum
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 60c02ac..92cbc8b 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -44,17 +44,11 @@ def get_checksum_props(checkcode, checksum):
checksize = last
checktype = "-".join(exploded_checkcode)
- # # Of type md5-5000-t
- # else:
- # second_last = exploded_checkcode.pop()
- # print(second_last)
- # checksize = second_last
- # checktype = exploded_checkcode[0]+'-'+last
-
# Detection entries have checktypes as part of the checksum prefix
if ":" in checksum:
prefix = checksum.split(":")[0]
- checktype += "-" + prefix
+ if prefix != "f":
+ checktype += "-" + prefix
checksum = checksum.split(":")[1]
return checksize, checktype, checksum
Commit: 338f379667922152e332e8fb47493960909dc46d
https://github.com/scummvm/scummvm-sites/commit/338f379667922152e332e8fb47493960909dc46d
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Rewrite set.dat processing logic.
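A high-level, self-contained sketch of the new flow described by the docstrings in the diff below (the real logic lives in set_process, set_filter_candidate_filesets and set_perform_match; the callables here are stand-ins):

def process_set_entry(fileset, insert_fileset, filter_candidates, perform_match):
    # 1. Always insert the set.dat entry as a new fileset first.
    fileset_id = insert_fileset(fileset)
    # 2. Collect detection filesets whose detection files all match this entry.
    candidates = filter_candidates(fileset_id, fileset)
    # 3. Merge into a single strong candidate, or log for manual merge; on a
    #    successful merge the fileset inserted in step 1 is deleted again.
    perform_match(fileset, fileset_id, candidates)
    return fileset_id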
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 92cbc8b..b082b5a 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -797,6 +797,15 @@ def populate_matching_games():
def match_fileset(data_arr, username=None):
+ """
+ data_arr -> tuple : (header, game_data, resources, filepath).
+ header -> dict : Information like author, version, description, etc.
+ game_data -> list[dict] : List of individual game entry as dictionary.
+ rom -> list[dict] : A key from one of the dict values from game_data. Contains all the game files as dict.
+ resources -> dict : Some extra files in case of set.dats
+ filepath -> str : Path of the dat file.
+ """
+
header, game_data, resources, filepath = data_arr
try:
@@ -806,7 +815,7 @@ def match_fileset(data_arr, username=None):
return
try:
- author = header["author"]
+ author = header["author"] if "author" in header else "Unkown author"
version = header["version"]
except KeyError as e:
print(f"Missing key in header: {e}")
@@ -829,9 +838,9 @@ def match_fileset(data_arr, username=None):
user = f"cli:{getpass.getuser()}" if username is None else username
create_log(escape_string(category_text), user, escape_string(log_text), conn)
- for fileset in game_data:
- process_fileset(
- fileset,
+ if src == "dat":
+ set_process(
+ game_data,
resources,
detection,
src,
@@ -843,11 +852,206 @@ def match_fileset(data_arr, username=None):
source_status,
user,
)
+ else:
+ for fileset in game_data:
+ process_fileset(
+ fileset,
+ resources,
+ detection,
+ src,
+ conn,
+ transaction_id,
+ filepath,
+ author,
+ version,
+ source_status,
+ user,
+ )
finalize_fileset_insertion(
conn, transaction_id, src, filepath, author, version, source_status, user
)
+def set_process(
+ game_data,
+ resources,
+ detection,
+ src,
+ conn,
+ transaction_id,
+ filepath,
+ author,
+ version,
+ source_status,
+ user,
+):
+ """
+ Entry point for processing set.dat.
+ -> Creates a new fileset for every fileset (delete later in case of a match).
+ -> set_filter_candidate_filesets(...) : Returns possible candidates for match
+ -> set_perform_match(...) : Handles different kind of scenarios for candidates
+ """
+
+ for fileset in game_data:
+ if "romof" in fileset and fileset["romof"] in resources:
+ fileset["rom"] += resources[fileset["romof"]]["rom"]
+ key = calc_key(fileset)
+ megakey = ""
+ log_text = f"size {os.path.getsize(filepath)}, author {author}, version {version}. State {source_status}."
+
+ fileset_id = insert_new_fileset(
+ fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
+ )
+
+ candidate_filesets = set_filter_candidate_filesets(fileset_id, fileset, conn)
+
+ set_perform_match(
+ fileset, src, user, fileset_id, detection, candidate_filesets, conn
+ )
+
+
+def set_perform_match(
+ fileset, src, user, fileset_id, detection, candidate_filesets, conn
+):
+ """
+ TODO
+ """
+ with conn.cursor() as cursor:
+ if len(candidate_filesets) == 1:
+ matched_fileset_id = candidate_filesets[0]
+ cursor.execute(
+ "SELECT status FROM fileset WHERE id = %s", (matched_fileset_id,)
+ )
+ status = cursor.fetchone()["status"]
+ if status == "detection":
+ update_fileset_status(cursor, matched_fileset_id, "partial")
+ set_populate_file(fileset, matched_fileset_id, conn, detection)
+ log_matched_fileset(
+ src,
+ fileset_id,
+ matched_fileset_id,
+ "partial",
+ user,
+ conn,
+ )
+ delete_original_fileset(fileset_id, conn)
+ else:
+ pass
+
+ elif len(candidate_filesets) > 1:
+ strong_match_candidate_filesets = []
+ for candidate_fileset in candidate_filesets:
+ if is_full_checksum_match(candidate_fileset, fileset, conn):
+ strong_match_candidate_filesets.append(candidate_fileset)
+
+ if len(strong_match_candidate_filesets) == 1:
+ update_fileset_status(cursor, matched_fileset_id, "partial")
+ set_populate_file(fileset, matched_fileset_id, conn, detection)
+ log_matched_fileset(
+ src,
+ fileset_id,
+ matched_fileset_id,
+ "partial",
+ user,
+ conn,
+ )
+ delete_original_fileset(fileset_id, conn)
+ else:
+ if len(strong_match_candidate_filesets) > 1:
+ print("Many strong match candidates")
+ category_text = "Manual Merge Required"
+ log_text = f"Merge Fileset:{fileset_id} manually. Possible matches are: {', '.join(f'Fileset:{id}' for id in candidate_filesets)}."
+ print(log_text)
+ create_log(
+ escape_string(category_text), user, escape_string(log_text), conn
+ )
+
+
+def is_full_checksum_match(candidate_fileset, fileset, conn):
+ """
+ Return type - Boolean
+ Checks if all the files in the candidate fileset has a matching checksum with the set fileset.
+ """
+ with conn.cursor() as cursor:
+ cursor.execute(
+ "SELECT id, name FROM file WHERE fileset = %s", (candidate_fileset,)
+ )
+ target_files = cursor.fetchall()
+ candidate_files = {
+ target_file["name"]: target_file["id"] for target_file in target_files
+ }
+ set_checksums = set()
+ for file in fileset["rom"]:
+ if "md5" in file:
+ set_checksums.add((file["name"].lower(), file["md5"]))
+
+ for fname, fid in candidate_files.items():
+ cursor.execute("SELECT checksum FROM filechecksum WHERE file = %s", (fid,))
+ candidate_checksums = cursor.fetchall()
+ if candidate_checksums:
+ found = False
+ for candidate_checksum in candidate_checksums:
+ if (fname.lower(), candidate_checksum["checksum"]) in set_checksums:
+ found = True
+ break
+ if not found:
+ return False
+ return True
+
+
+def set_filter_candidate_filesets(fileset_id, fileset, conn):
+ """
+ Returns a list of candidate filesets that can be merged
+ """
+ with conn.cursor() as cursor:
+ # Returns those filesets which have the maximum number of all detection files matching in the set fileset filtered by engine, file name and file size(if not -1).
+ # Returns multiple filesets if multiple filesets have same max number of matching files
+
+ query = """
+ WITH candidate_fileset AS (
+ SELECT fs.id AS fileset_id, f.name, f.size
+ FROM file f
+ JOIN fileset fs ON f.fileset = fs.id
+ JOIN game g ON g.id = fs.game
+ JOIN engine e ON e.id = g.engine
+ WHERE fs.id != %s
+ AND e.engineid = %s
+ AND f.detection = 1
+ ),
+ total_detection_files AS (
+ SELECT cf.fileset_id, COUNT(*) AS detection_files_found
+ FROM candidate_fileset cf
+ GROUP BY fileset_id
+ ),
+ set_fileset AS (
+ SELECT name, size FROM file
+ WHERE fileset = %s
+ ),
+ matched_detection_files AS (
+ SELECT cf.fileset_id, COUNT(*) AS match_files_count
+ FROM candidate_fileset cf
+ JOIN set_fileset sf ON cf.name = sf.name AND (cf.size = sf.size OR cf.size = -1)
+ GROUP BY cf.fileset_id
+ ),
+ max_match_count AS (
+ SELECT MAX(match_files_count) AS max_count FROM matched_detection_files
+ )
+ SELECT mdf.fileset_id
+ FROM matched_detection_files mdf
+ JOIN total_detection_files tdf ON mdf.fileset_id = tdf.fileset_id
+ JOIN max_match_count mmc ON mdf.match_files_count = mmc.max_count
+ WHERE mdf.match_files_count = tdf.detection_files_found;
+ """
+ cursor.execute(query, (fileset_id, fileset["sourcefile"], fileset_id))
+ rows = cursor.fetchall()
+ candidates = []
+ if rows:
+ for row in rows:
+ candidates.append(row["fileset_id"])
+
+ return candidates
+
+
def process_fileset(
fileset,
resources,
@@ -877,13 +1081,6 @@ def process_fileset(
fileset_id = insert_new_fileset(
fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
)
- # with conn.cursor() as cursor:
- # cursor.execute("SET @fileset_last = LAST_INSERT_ID()")
- # cursor.execute("SELECT LAST_INSERT_ID()")
- # fileset_last_old = cursor.fetchone()['LAST_INSERT_ID()']
- # fileset_last = cursor.lastrowid
- # print(fileset_last_old)
- # print(fileset_last)
if matched_map:
handle_matched_filesets(
@@ -1189,6 +1386,73 @@ def populate_file(fileset, fileset_id, conn, detection):
)
+def set_populate_file(fileset, fileset_id, conn, detection):
+ """
+ Add new files from the set fileset to the matched fileset, or update the size and insert the checksum for files already present (matched by filename).
+ """
+ with conn.cursor() as cursor:
+ cursor.execute(f"SELECT id, name FROM file WHERE fileset = {fileset_id}")
+ target_files = cursor.fetchall()
+ candidate_files = {
+ target_file["name"].lower(): target_file["id"]
+ for target_file in target_files
+ }
+
+ for file in fileset["rom"]:
+ if "md5" not in file:
+ continue
+ checksize, checktype, checksum = get_checksum_props("md5", file["md5"])
+
+ if file["name"].lower() not in candidate_files:
+ name = (
+ encode_punycode(file["name"])
+ if punycode_need_encode(file["name"])
+ else file["name"]
+ )
+
+ values = [name]
+
+ values.append(file["size"] if "size" in file else "0")
+ values.append(file["size-r"] if "size-r" in file else "0")
+ values.append(file["size-rd"] if "size-rd" in file else "0")
+
+ values.extend([checksum, fileset_id, detection, "None"])
+
+ placeholders = (
+ ["%s"] * (len(values[:5])) + ["%s"] + ["%s"] * 2 + ["NOW()"]
+ )
+ query = f"INSERT INTO file ( name, size, `size-r`, `size-rd`, checksum, fileset, detection, detection_type, `timestamp` ) VALUES ({', '.join(placeholders)})"
+
+ cursor.execute(query, values)
+ cursor.execute("SET @file_last = LAST_INSERT_ID()")
+ cursor.execute("SELECT @file_last AS file_id")
+
+ insert_filechecksum(file, "md5", conn)
+
+ else:
+ query = """
+ UPDATE file
+ SET size = %s
+ WHERE id = %s
+ """
+ cursor.execute(
+ query, (file["size"], candidate_files[file["name"].lower()])
+ )
+ query = """
+ INSERT INTO filechecksum (file, checksize, checktype, checksum)
+ VALUES (%s, %s, %s, %s)
+ """
+ cursor.execute(
+ query,
+ (
+ candidate_files[file["name"].lower()],
+ checksize,
+ checktype,
+ checksum,
+ ),
+ )
+
+
def insert_new_fileset(
fileset, conn, detection, src, key, megakey, transaction_id, log_text, user, ip=""
):
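As a reading aid for the candidate query introduced above: a fileset qualifies as a candidate only when every one of its detection files is matched by the incoming set fileset on name and size, with a stored size of -1 acting as a wildcard on size. A minimal pure-Python sketch of that rule (the real code evaluates it as a single SQL query; the file names and sizes below are invented):

def is_candidate(detection_files, set_files):
    # detection_files / set_files: lists of (name, size) tuples.
    set_pairs = set(set_files)
    set_names = {name for name, _ in set_files}
    for name, size in detection_files:
        if size == -1:
            # -1 acts as a size wildcard: only the name has to match.
            if name not in set_names:
                return False
        elif (name, size) not in set_pairs:
            return False
    return True

detection = [("game.exe", -1), ("sample.cga", 81128)]
incoming = [("game.exe", 372657), ("sample.cga", 81128), ("readme.txt", 10)]
print(is_candidate(detection, incoming))  # True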
Commit: 65b64934f29753e3a0c9d142ed254d2c4aaad016
https://github.com/scummvm/scummvm-sites/commit/65b64934f29753e3a0c9d142ed254d2c4aaad016
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Handle multiple fileset references in log entries
Changed paths:
fileset.py
pagination.py
diff --git a/fileset.py b/fileset.py
index 1a6add7..3012408 100644
--- a/fileset.py
+++ b/fileset.py
@@ -83,14 +83,14 @@ def clear_database():
with conn.cursor() as cursor:
cursor.execute("SET FOREIGN_KEY_CHECKS = 0;")
cursor.execute("TRUNCATE TABLE filechecksum")
+ cursor.execute("TRUNCATE TABLE history")
+ cursor.execute("TRUNCATE TABLE transactions")
+ cursor.execute("TRUNCATE TABLE queue")
cursor.execute("TRUNCATE TABLE file")
cursor.execute("TRUNCATE TABLE fileset")
- cursor.execute("TRUNCATE TABLE history")
cursor.execute("TRUNCATE TABLE game")
cursor.execute("TRUNCATE TABLE engine")
cursor.execute("TRUNCATE TABLE log")
- cursor.execute("TRUNCATE TABLE queue")
- cursor.execute("TRUNCATE TABLE transactions")
cursor.execute("SET FOREIGN_KEY_CHECKS = 1;")
conn.commit()
print("DATABASE CLEARED")
diff --git a/pagination.py b/pagination.py
index f7b593e..04fbfb4 100644
--- a/pagination.py
+++ b/pagination.py
@@ -191,17 +191,16 @@ def create_page(
# Add links to fileset in logs table
if isinstance(value, str):
- matches = re.search(r"Fileset:(\d+)", value)
- if matches:
- fileset_id = matches.group(1)
- fileset_text = matches.group(0)
+ matches = re.findall(r"Fileset:(\d+)", value)
+ for fileset_id in matches:
+ fileset_text = f"Fileset:{fileset_id}"
+
with conn.cursor() as cursor:
cursor.execute(
"SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset",
(fileset_id,),
)
row = cursor.fetchone()
- print(row)
if row:
fileset_id = row["fileset"]
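The pagination change above switches from re.search to re.findall so that every Fileset:<id> reference in a log message becomes a link, not just the first one. A quick sketch of the difference on a made-up log string:

import re

log_line = "Matched Fileset:101 with Fileset:202. State partial."

# Old behaviour: re.search stops at the first reference.
print(re.search(r"Fileset:(\d+)", log_line).group(1))  # 101

# New behaviour: re.findall returns every referenced id.
print(re.findall(r"Fileset:(\d+)", log_line))          # ['101', '202']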
Commit: 809be572a6b59bb6f50d95a0a48ce9d6b58fc379
https://github.com/scummvm/scummvm-sites/commit/809be572a6b59bb6f50d95a0a48ce9d6b58fc379
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Drop set filesets if no candidates for matching - e.g mac files
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index b082b5a..3a72193 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -905,6 +905,19 @@ def set_process(
candidate_filesets = set_filter_candidate_filesets(fileset_id, fileset, conn)
+ # Mac files in set.dat are not represented properly and they won't find a candidate fileset for a match, so we can drop them.
+ if len(candidate_filesets) == 0:
+ category_text = "Drop set fileset"
+ fileset_name = fileset["name"] if "name" in fileset else ""
+ fileset_description = (
+ fileset["description"] if "description" in fileset else ""
+ )
+ log_text = f"Drop fileset as no matching candidates. Name: {fileset_name}, Description: {fileset_description}"
+ create_log(
+ escape_string(category_text), user, escape_string(log_text), conn
+ )
+ delete_original_fileset(fileset_id, conn)
+
set_perform_match(
fileset, src, user, fileset_id, detection, candidate_filesets, conn
)
Commit: 52f6e7da927721f8f605f67527237837b2da51f6
https://github.com/scummvm/scummvm-sites/commit/52f6e7da927721f8f605f67527237837b2da51f6
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add re-updation logic for set.dat
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 3a72193..ccb755b 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -867,9 +867,9 @@ def match_fileset(data_arr, username=None):
source_status,
user,
)
- finalize_fileset_insertion(
- conn, transaction_id, src, filepath, author, version, source_status, user
- )
+ finalize_fileset_insertion(
+ conn, transaction_id, src, filepath, author, version, source_status, user
+ )
def set_process(
@@ -892,6 +892,9 @@ def set_process(
-> set_perform_match(...) : Handles different kind of scenarios for candidates
"""
+ # Keeps count of filesets that were already present
+ fully_matched_filesets = 0
+
for fileset in game_data:
if "romof" in fileset and fileset["romof"] in resources:
fileset["rom"] += resources[fileset["romof"]]["rom"]
@@ -903,7 +906,9 @@ def set_process(
fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
)
- candidate_filesets = set_filter_candidate_filesets(fileset_id, fileset, conn)
+ candidate_filesets = set_filter_candidate_filesets(
+ fileset_id, fileset, transaction_id, conn
+ )
# Mac files in set.dat are not represented properly and they won't find a candidate fileset for a match, so we can drop them.
if len(candidate_filesets) == 0:
@@ -918,13 +923,43 @@ def set_process(
)
delete_original_fileset(fileset_id, conn)
- set_perform_match(
- fileset, src, user, fileset_id, detection, candidate_filesets, conn
+ fully_matched_filesets = set_perform_match(
+ fileset,
+ src,
+ user,
+ fileset_id,
+ detection,
+ candidate_filesets,
+ fully_matched_filesets,
+ conn,
)
+ # Final log
+ with conn.cursor() as cursor:
+ query = """
+ UPDATE fileset
+ SET status='partial'
+ WHERE status='partial_pending'
+ """
+ cursor.execute(query)
+ cursor.execute(
+ f"SELECT COUNT(fileset) from transactions WHERE `transaction` = {transaction_id}"
+ )
+ fileset_insertion_count = cursor.fetchone()["COUNT(fileset)"]
+ category_text = f"Uploaded from {src}"
+ log_text = f"Completed loading DAT file, filename {filepath}, size {os.path.getsize(filepath)}, author {author}, version {version}. State {source_status}. Number of filesets: {fileset_insertion_count}. Number of filesets already present: {fully_matched_filesets}. Transaction: {transaction_id}"
+ create_log(escape_string(category_text), user, escape_string(log_text), conn)
+
def set_perform_match(
- fileset, src, user, fileset_id, detection, candidate_filesets, conn
+ fileset,
+ src,
+ user,
+ fileset_id,
+ detection,
+ candidate_filesets,
+ fully_matched_filesets,
+ conn,
):
"""
TODO
@@ -937,7 +972,7 @@ def set_perform_match(
)
status = cursor.fetchone()["status"]
if status == "detection":
- update_fileset_status(cursor, matched_fileset_id, "partial")
+ update_fileset_status(cursor, matched_fileset_id, "partial_pending")
set_populate_file(fileset, matched_fileset_id, conn, detection)
log_matched_fileset(
src,
@@ -948,17 +983,46 @@ def set_perform_match(
conn,
)
delete_original_fileset(fileset_id, conn)
- else:
- pass
+ elif status == "partial" or status == "full":
+ (is_match, unmatched_files) = is_full_checksum_match(
+ matched_fileset_id, fileset, conn
+ )
+ if is_match:
+ category_text = "Already present"
+ log_text = f"Already present as - Fileset:{matched_fileset_id}. Deleting Fileset:{fileset_id}"
+ log_last = create_log(
+ escape_string(category_text),
+ user,
+ escape_string(log_text),
+ conn,
+ )
+ update_history(fileset_id, matched_fileset_id, conn, log_last)
+ fully_matched_filesets += 1
+ delete_original_fileset(fileset_id, conn)
+
+ else:
+ category_text = "Mismatch"
+ log_text = f"Fileset:{fileset_id} mismatched with Fileset:{matched_fileset_id} with status:{status}. Try manual merge."
+ print(
+ f"Merge Fileset:{fileset_id} manually with Fileset:{matched_fileset_id}. Unmatched files: {len(unmatched_files)}."
+ )
+ # print(f"Merge Fileset:{fileset_id} manually with Fileset:{matched_fileset_id}. Unmatched files: {', '.join(filename for filename in unmatched_files)}.")
+ create_log(
+ escape_string(category_text),
+ user,
+ escape_string(log_text),
+ conn,
+ )
elif len(candidate_filesets) > 1:
strong_match_candidate_filesets = []
for candidate_fileset in candidate_filesets:
- if is_full_checksum_match(candidate_fileset, fileset, conn):
+ (is_match, _) = is_full_checksum_match(candidate_fileset, fileset, conn)
+ if is_match:
strong_match_candidate_filesets.append(candidate_fileset)
if len(strong_match_candidate_filesets) == 1:
- update_fileset_status(cursor, matched_fileset_id, "partial")
+ update_fileset_status(cursor, matched_fileset_id, "partial_pending")
set_populate_file(fileset, matched_fileset_id, conn, detection)
log_matched_fileset(
src,
@@ -970,8 +1034,6 @@ def set_perform_match(
)
delete_original_fileset(fileset_id, conn)
else:
- if len(strong_match_candidate_filesets) > 1:
- print("Many strong match candidates")
category_text = "Manual Merge Required"
log_text = f"Merge Fileset:{fileset_id} manually. Possible matches are: {', '.join(f'Fileset:{id}' for id in candidate_filesets)}."
print(log_text)
@@ -979,13 +1041,16 @@ def set_perform_match(
escape_string(category_text), user, escape_string(log_text), conn
)
+ return fully_matched_filesets
+
def is_full_checksum_match(candidate_fileset, fileset, conn):
"""
- Return type - Boolean
+ Return type - (Boolean, List of unmatched files)
Checks if all the files in the candidate fileset have a matching checksum with the set fileset.
"""
with conn.cursor() as cursor:
+ unmatched_files = []
cursor.execute(
"SELECT id, name FROM file WHERE fileset = %s", (candidate_fileset,)
)
@@ -996,7 +1061,12 @@ def is_full_checksum_match(candidate_fileset, fileset, conn):
set_checksums = set()
for file in fileset["rom"]:
if "md5" in file:
- set_checksums.add((file["name"].lower(), file["md5"]))
+ name = (
+ encode_punycode(file["name"])
+ if punycode_need_encode(file["name"])
+ else file["name"]
+ )
+ set_checksums.add((name.lower(), file["md5"]))
for fname, fid in candidate_files.items():
cursor.execute("SELECT checksum FROM filechecksum WHERE file = %s", (fid,))
@@ -1008,11 +1078,12 @@ def is_full_checksum_match(candidate_fileset, fileset, conn):
found = True
break
if not found:
- return False
- return True
+ unmatched_files.append(fname)
+
+ return (len(unmatched_files) == 0, unmatched_files)
-def set_filter_candidate_filesets(fileset_id, fileset, conn):
+def set_filter_candidate_filesets(fileset_id, fileset, transaction_id, conn):
"""
Returns a list of candidate filesets that can be merged
"""
@@ -1027,9 +1098,12 @@ def set_filter_candidate_filesets(fileset_id, fileset, conn):
JOIN fileset fs ON f.fileset = fs.id
JOIN game g ON g.id = fs.game
JOIN engine e ON e.id = g.engine
+ JOIN transactions t ON t.fileset = fs.id
WHERE fs.id != %s
AND e.engineid = %s
AND f.detection = 1
+ AND t.transaction != %s
+ AND fs.status != 'partial_pending'
),
total_detection_files AS (
SELECT cf.fileset_id, COUNT(*) AS detection_files_found
@@ -1055,7 +1129,9 @@ def set_filter_candidate_filesets(fileset_id, fileset, conn):
JOIN max_match_count mmc ON mdf.match_files_count = mmc.max_count
WHERE mdf.match_files_count = tdf.detection_files_found;
"""
- cursor.execute(query, (fileset_id, fileset["sourcefile"], fileset_id))
+ cursor.execute(
+ query, (fileset_id, fileset["sourcefile"], transaction_id, fileset_id)
+ )
rows = cursor.fetchall()
candidates = []
if rows:
@@ -1512,7 +1588,6 @@ def finalize_fileset_insertion(
create_log(
escape_string(category_text), user, escape_string(log_text), conn
)
- # conn.close()
def user_integrity_check(data, ip, game_metadata=None):
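With this commit the checksum comparison returns a pair, (is_match, unmatched_files), so callers can both branch on the result and report how many files failed. A toy illustration of that return convention on in-memory data (no database; file names and md5 values are invented):

def full_checksum_match(candidate_md5s, set_md5s):
    # Both arguments: dicts mapping lower-cased file name -> set of md5 strings.
    unmatched = []
    for name, md5s in candidate_md5s.items():
        if not (md5s & set_md5s.get(name, set())):
            unmatched.append(name)
    return (len(unmatched) == 0, unmatched)

candidate = {"game.exe": {"aaa111"}, "data.001": {"bbb222"}}
incoming = {"game.exe": {"aaa111"}, "data.001": {"ccc333"}}
print(full_checksum_match(candidate, incoming))  # (False, ['data.001'])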
Commit: 22e5e5c8535d8bdc4a890fa2d0cdea66348c33b2
https://github.com/scummvm/scummvm-sites/commit/22e5e5c8535d8bdc4a890fa2d0cdea66348c33b2
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Create extra indices in db for faster query
Changed paths:
schema.py
diff --git a/schema.py b/schema.py
index 98e2536..00cb975 100644
--- a/schema.py
+++ b/schema.py
@@ -149,6 +149,8 @@ indices = {
"key": "CREATE INDEX fileset_key ON fileset (`key`)",
"status": "CREATE INDEX status ON fileset (status)",
"fileset": "CREATE INDEX fileset ON history (fileset)",
+ "file_name_size": "CREATE INDEX file_name_size ON file (name, size)",
+ "file_fileset_detection": "CREATE INDEX file_fileset_detection ON file (fileset, detection)",
}
try:
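The two composite indices above back the name/size join and the fileset/detection filter used by the candidate query. A small optional check, sketched with hypothetical connection parameters (the real credentials come from mysql_config.json), to confirm the optimiser can use file_name_size:

import pymysql

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       db="integrity", cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cursor:
    # EXPLAIN reports which index MySQL would pick for a name+size lookup.
    cursor.execute("EXPLAIN SELECT id FROM file WHERE name = %s AND size = %s",
                   ("sample.cga", 81128))
    for row in cursor.fetchall():
        print(row["possible_keys"], row["key"])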
Commit: 63c6b57769b6bbe267db22fea7444f0fc80b7de2
https://github.com/scummvm/scummvm-sites/commit/63c6b57769b6bbe267db22fea7444f0fc80b7de2
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add additional log details for number of filesets in different categories
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index ccb755b..30a7f0c 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -894,13 +894,17 @@ def set_process(
# Keeps count of filesets that were already present
fully_matched_filesets = 0
+ auto_merged_filesets = 0
+ manual_merged_filesets = 0
+ mismatch_filesets = 0
+ dropped_early_filesets = 0
for fileset in game_data:
if "romof" in fileset and fileset["romof"] in resources:
fileset["rom"] += resources[fileset["romof"]]["rom"]
key = calc_key(fileset)
megakey = ""
- log_text = f"size {os.path.getsize(filepath)}, author {author}, version {version}. State {source_status}."
+ log_text = f"State {source_status}."
fileset_id = insert_new_fileset(
fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
@@ -921,9 +925,15 @@ def set_process(
create_log(
escape_string(category_text), user, escape_string(log_text), conn
)
+ dropped_early_filesets += 1
delete_original_fileset(fileset_id, conn)
- fully_matched_filesets = set_perform_match(
+ (
+ fully_matched_filesets,
+ auto_merged_filesets,
+ manual_merged_filesets,
+ mismatch_filesets,
+ ) = set_perform_match(
fileset,
src,
user,
@@ -931,23 +941,23 @@ def set_process(
detection,
candidate_filesets,
fully_matched_filesets,
+ auto_merged_filesets,
+ manual_merged_filesets,
+ mismatch_filesets,
conn,
)
# Final log
with conn.cursor() as cursor:
- query = """
- UPDATE fileset
- SET status='partial'
- WHERE status='partial_pending'
- """
- cursor.execute(query)
cursor.execute(
f"SELECT COUNT(fileset) from transactions WHERE `transaction` = {transaction_id}"
)
fileset_insertion_count = cursor.fetchone()["COUNT(fileset)"]
category_text = f"Uploaded from {src}"
- log_text = f"Completed loading DAT file, filename {filepath}, size {os.path.getsize(filepath)}, author {author}, version {version}. State {source_status}. Number of filesets: {fileset_insertion_count}. Number of filesets already present: {fully_matched_filesets}. Transaction: {transaction_id}"
+ log_text = f"Completed loading DAT file, filename {filepath}, size {os.path.getsize(filepath)}. State {source_status}. Number of filesets: {fileset_insertion_count}. Transaction: {transaction_id}"
+ create_log(escape_string(category_text), user, escape_string(log_text), conn)
+ category_text = "Upload information"
+ log_text = f"Number of filesets: {fileset_insertion_count}. Filesets automatically merged: {auto_merged_filesets}. Filesets dropped early(no candidate) - {dropped_early_filesets}. Filesets requiring manual merge: {manual_merged_filesets}. Partial/Full filesets already present: {fully_matched_filesets}. Partial/Full filesets with mismatch {mismatch_filesets}."
create_log(escape_string(category_text), user, escape_string(log_text), conn)
@@ -959,6 +969,9 @@ def set_perform_match(
detection,
candidate_filesets,
fully_matched_filesets,
+ auto_merged_filesets,
+ manual_merged_filesets,
+ mismatch_filesets,
conn,
):
"""
@@ -972,8 +985,9 @@ def set_perform_match(
)
status = cursor.fetchone()["status"]
if status == "detection":
- update_fileset_status(cursor, matched_fileset_id, "partial_pending")
+ update_fileset_status(cursor, matched_fileset_id, "partial")
set_populate_file(fileset, matched_fileset_id, conn, detection)
+ auto_merged_filesets += 1
log_matched_fileset(
src,
fileset_id,
@@ -1006,6 +1020,7 @@ def set_perform_match(
print(
f"Merge Fileset:{fileset_id} manually with Fileset:{matched_fileset_id}. Unmatched files: {len(unmatched_files)}."
)
+ mismatch_filesets += 1
# print(f"Merge Fileset:{fileset_id} manually with Fileset:{matched_fileset_id}. Unmatched files: {', '.join(filename for filename in unmatched_files)}.")
create_log(
escape_string(category_text),
@@ -1022,8 +1037,9 @@ def set_perform_match(
strong_match_candidate_filesets.append(candidate_fileset)
if len(strong_match_candidate_filesets) == 1:
- update_fileset_status(cursor, matched_fileset_id, "partial_pending")
+ update_fileset_status(cursor, matched_fileset_id, "partial")
set_populate_file(fileset, matched_fileset_id, conn, detection)
+ auto_merged_filesets += 1
log_matched_fileset(
src,
fileset_id,
@@ -1040,8 +1056,14 @@ def set_perform_match(
create_log(
escape_string(category_text), user, escape_string(log_text), conn
)
+ manual_merged_filesets += 1
- return fully_matched_filesets
+ return (
+ fully_matched_filesets,
+ auto_merged_filesets,
+ manual_merged_filesets,
+ mismatch_filesets,
+ )
def is_full_checksum_match(candidate_fileset, fileset, conn):
@@ -1103,7 +1125,6 @@ def set_filter_candidate_filesets(fileset_id, fileset, transaction_id, conn):
AND e.engineid = %s
AND f.detection = 1
AND t.transaction != %s
- AND fs.status != 'partial_pending'
),
total_detection_files AS (
SELECT cf.fileset_id, COUNT(*) AS detection_files_found
Commit: 97c6dbb438fdbf02e7fe16e7fd354b55b4ada8d6
https://github.com/scummvm/scummvm-sites/commit/97c6dbb438fdbf02e7fe16e7fd354b55b4ada8d6
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Filter all candidates in descending order of matches instead of only the max one
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 30a7f0c..fb62576 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -1030,26 +1030,26 @@ def set_perform_match(
)
elif len(candidate_filesets) > 1:
- strong_match_candidate_filesets = []
+ found_match = False
for candidate_fileset in candidate_filesets:
(is_match, _) = is_full_checksum_match(candidate_fileset, fileset, conn)
if is_match:
- strong_match_candidate_filesets.append(candidate_fileset)
+ update_fileset_status(cursor, matched_fileset_id, "partial")
+ set_populate_file(fileset, matched_fileset_id, conn, detection)
+ auto_merged_filesets += 1
+ log_matched_fileset(
+ src,
+ fileset_id,
+ matched_fileset_id,
+ "partial",
+ user,
+ conn,
+ )
+ delete_original_fileset(fileset_id, conn)
+ found_match = True
+ break
- if len(strong_match_candidate_filesets) == 1:
- update_fileset_status(cursor, matched_fileset_id, "partial")
- set_populate_file(fileset, matched_fileset_id, conn, detection)
- auto_merged_filesets += 1
- log_matched_fileset(
- src,
- fileset_id,
- matched_fileset_id,
- "partial",
- user,
- conn,
- )
- delete_original_fileset(fileset_id, conn)
- else:
+ if not found_match:
category_text = "Manual Merge Required"
log_text = f"Merge Fileset:{fileset_id} manually. Possible matches are: {', '.join(f'Fileset:{id}' for id in candidate_filesets)}."
print(log_text)
@@ -1110,8 +1110,7 @@ def set_filter_candidate_filesets(fileset_id, fileset, transaction_id, conn):
Returns a list of candidate filesets that can be merged
"""
with conn.cursor() as cursor:
- # Returns those filesets which have the maximum number of all detection files matching in the set fileset filtered by engine, file name and file size(if not -1).
- # Returns multiple filesets if multiple filesets have same max number of matching files
+ # Returns those filesets which have all detection files matching in the set fileset filtered by engine, file name and file size(if not -1) sorted in descending order of matches
query = """
WITH candidate_fileset AS (
@@ -1140,15 +1139,12 @@ def set_filter_candidate_filesets(fileset_id, fileset, transaction_id, conn):
FROM candidate_fileset cf
JOIN set_fileset sf ON cf.name = sf.name AND (cf.size = sf.size OR cf.size = -1)
GROUP BY cf.fileset_id
- ),
- max_match_count AS (
- SELECT MAX(match_files_count) AS max_count FROM matched_detection_files
)
SELECT mdf.fileset_id
FROM matched_detection_files mdf
JOIN total_detection_files tdf ON mdf.fileset_id = tdf.fileset_id
- JOIN max_match_count mmc ON mdf.match_files_count = mmc.max_count
- WHERE mdf.match_files_count = tdf.detection_files_found;
+ WHERE mdf.match_files_count = tdf.detection_files_found
+ ORDER BY mdf.match_files_count DESC;
"""
cursor.execute(
query, (fileset_id, fileset["sourcefile"], transaction_id, fileset_id)
Commit: 94aef02eddcbfaa10f9382d7e26a6f2670896361
https://github.com/scummvm/scummvm-sites/commit/94aef02eddcbfaa10f9382d7e26a6f2670896361
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Remove set filesets with many to one mapping with a single fileset
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index fb62576..d8dc5d9 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -899,6 +899,10 @@ def set_process(
mismatch_filesets = 0
dropped_early_filesets = 0
+ # A mapping from set filesets to candidate filesets list
+ set_to_candidate_dict = defaultdict(list)
+ id_to_fileset_dict = defaultdict(dict)
+
for fileset in game_data:
if "romof" in fileset and fileset["romof"] in resources:
fileset["rom"] += resources[fileset["romof"]]["rom"]
@@ -928,6 +932,24 @@ def set_process(
dropped_early_filesets += 1
delete_original_fileset(fileset_id, conn)
+ id_to_fileset_dict[fileset_id] = fileset
+ set_to_candidate_dict[fileset_id].extend(candidate_filesets)
+
+ # Remove all set filesets that have a many-to-one mapping with a single candidate; those are extra variants.
+ value_to_keys = defaultdict(list)
+ for set_fileset, candidates in set_to_candidate_dict.items():
+ if len(candidates) == 1:
+ value_to_keys[candidates[0]].append(set_fileset)
+ for candidate, set_filesets in value_to_keys.items():
+ if len(set_filesets) > 1:
+ for set_fileset in set_filesets:
+ dropped_early_filesets += 1
+ delete_original_fileset(set_fileset, conn)
+ del set_to_candidate_dict[set_fileset]
+ del id_to_fileset_dict[set_fileset]
+
+ for fileset_id, candidate_filesets in set_to_candidate_dict.items():
+ fileset = id_to_fileset_dict[fileset_id]
(
fully_matched_filesets,
auto_merged_filesets,
@@ -957,7 +979,7 @@ def set_process(
log_text = f"Completed loading DAT file, filename {filepath}, size {os.path.getsize(filepath)}. State {source_status}. Number of filesets: {fileset_insertion_count}. Transaction: {transaction_id}"
create_log(escape_string(category_text), user, escape_string(log_text), conn)
category_text = "Upload information"
- log_text = f"Number of filesets: {fileset_insertion_count}. Filesets automatically merged: {auto_merged_filesets}. Filesets dropped early(no candidate) - {dropped_early_filesets}. Filesets requiring manual merge: {manual_merged_filesets}. Partial/Full filesets already present: {fully_matched_filesets}. Partial/Full filesets with mismatch {mismatch_filesets}."
+ log_text = f"Number of filesets: {fileset_insertion_count}. Filesets automatically merged: {auto_merged_filesets}. Filesets dropped early(no candidate/ extra variant) - {dropped_early_filesets}. Filesets requiring manual merge: {manual_merged_filesets}. Partial/Full filesets already present: {fully_matched_filesets}. Partial/Full filesets with mismatch {mismatch_filesets}."
create_log(escape_string(category_text), user, escape_string(log_text), conn)
@@ -1034,13 +1056,13 @@ def set_perform_match(
for candidate_fileset in candidate_filesets:
(is_match, _) = is_full_checksum_match(candidate_fileset, fileset, conn)
if is_match:
- update_fileset_status(cursor, matched_fileset_id, "partial")
- set_populate_file(fileset, matched_fileset_id, conn, detection)
+ update_fileset_status(cursor, candidate_fileset, "partial")
+ set_populate_file(fileset, candidate_fileset, conn, detection)
auto_merged_filesets += 1
log_matched_fileset(
src,
fileset_id,
- matched_fileset_id,
+ candidate_fileset,
"partial",
user,
conn,
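The dedup step above inverts the set-to-candidate mapping: whenever several set filesets each map to exactly one and the same candidate, they are treated as extra variants and dropped. A toy run of that inversion with invented ids:

from collections import defaultdict

# set fileset id -> list of candidate fileset ids (hypothetical numbers)
set_to_candidate = {11: [5], 12: [5], 13: [7], 14: [5, 9]}

value_to_keys = defaultdict(list)
for set_id, candidates in set_to_candidate.items():
    if len(candidates) == 1:            # only unambiguous mappings are inverted
        value_to_keys[candidates[0]].append(set_id)

to_drop = [set_id
           for candidate, set_ids in value_to_keys.items()
           if len(set_ids) > 1
           for set_id in set_ids]
print(to_drop)  # [11, 12] -- both map onto candidate 5 alone, so both are dropped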
Commit: 6b41b1b7084dabbbfb949d99de3161a85a3104e3
https://github.com/scummvm/scummvm-sites/commit/6b41b1b7084dabbbfb949d99de3161a85a3104e3
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Stop processing for the fileset if it already exists - checked by key
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index d8dc5d9..f97edb5 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -140,7 +140,7 @@ def insert_fileset(
)
update_history(existing_entry, existing_entry, conn, log_last)
- return existing_entry
+ return (existing_entry, True)
# $game and $key should not be parsed as a mysql string, hence no quotes
query = f"INSERT INTO fileset (game, status, src, `key`, megakey, `timestamp`) VALUES ({game}, '{status}', '{src}', {key}, {megakey}, FROM_UNIXTIME(@fileset_time_last))"
@@ -172,7 +172,7 @@ def insert_fileset(
f"INSERT INTO transactions (`transaction`, fileset) VALUES ({transaction}, {fileset_last})"
)
- return fileset_id
+ return (fileset_id, False)
def insert_file(file, detection, src, conn):
@@ -910,9 +910,11 @@ def set_process(
megakey = ""
log_text = f"State {source_status}."
- fileset_id = insert_new_fileset(
+ (fileset_id, existing) = insert_new_fileset(
fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
)
+ if existing:
+ continue
candidate_filesets = set_filter_candidate_filesets(
fileset_id, fileset, transaction_id, conn
@@ -1206,7 +1208,7 @@ def process_fileset(
else:
matched_map = matching_set(fileset, conn)
- fileset_id = insert_new_fileset(
+ (fileset_id, _) = insert_new_fileset(
fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
)
@@ -1584,7 +1586,7 @@ def set_populate_file(fileset, fileset_id, conn, detection):
def insert_new_fileset(
fileset, conn, detection, src, key, megakey, transaction_id, log_text, user, ip=""
):
- fileset_id = insert_fileset(
+ (fileset_id, existing) = insert_fileset(
src,
detection,
key,
@@ -1601,7 +1603,7 @@ def insert_new_fileset(
for key, value in file.items():
if key not in ["name", "size", "size-r", "size-rd", "sha1", "crc"]:
insert_filechecksum(file, key, conn)
- return fileset_id
+ return (fileset_id, existing)
def log_matched_fileset(src, fileset_last, fileset_id, state, user, conn):
Commit: 91bb26cf74775a50733cddf24d3bf16405b00ccd
https://github.com/scummvm/scummvm-sites/commit/91bb26cf74775a50733cddf24d3bf16405b00ccd
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
Skip certain logs while processing set.dat with skiplog flag
Changed paths:
dat_parser.py
db_functions.py
diff --git a/dat_parser.py b/dat_parser.py
index 5fefbce..b3ce12e 100644
--- a/dat_parser.py
+++ b/dat_parser.py
@@ -155,7 +155,7 @@ def main():
if args.match:
for filepath in args.match:
# print(parse_dat(filepath)[2])
- match_fileset(parse_dat(filepath), args.user)
+ match_fileset(parse_dat(filepath), args.user, args.skiplog)
if __name__ == "__main__":
diff --git a/db_functions.py b/db_functions.py
index f97edb5..15208d0 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -796,7 +796,7 @@ def populate_matching_games():
print("Updating matched games failed")
-def match_fileset(data_arr, username=None):
+def match_fileset(data_arr, username=None, skiplog=False):
"""
data_arr -> tuple : (header, game_data, resources, filepath).
header -> dict : Information like author, version, description, etc.
@@ -851,6 +851,7 @@ def match_fileset(data_arr, username=None):
version,
source_status,
user,
+ skiplog,
)
else:
for fileset in game_data:
@@ -884,6 +885,7 @@ def set_process(
version,
source_status,
user,
+ skiplog,
):
"""
Entry point for processing set.dat.
@@ -897,8 +899,8 @@ def set_process(
auto_merged_filesets = 0
manual_merged_filesets = 0
mismatch_filesets = 0
- dropped_early_filesets = 0
-
+ dropped_early_no_candidate = 0
+ dropped_early_single_candidate_multiple_sets = 0
# A mapping from set filesets to candidate filesets list
set_to_candidate_dict = defaultdict(list)
id_to_fileset_dict = defaultdict(dict)
@@ -911,7 +913,16 @@ def set_process(
log_text = f"State {source_status}."
(fileset_id, existing) = insert_new_fileset(
- fileset, conn, detection, src, key, megakey, transaction_id, log_text, user
+ fileset,
+ conn,
+ detection,
+ src,
+ key,
+ megakey,
+ transaction_id,
+ log_text,
+ user,
+ skiplog=skiplog,
)
if existing:
continue
@@ -922,7 +933,7 @@ def set_process(
# Mac files in set.dat are not represented properly and they won't find a candidate fileset for a match, so we can drop them.
if len(candidate_filesets) == 0:
- category_text = "Drop set fileset"
+ category_text = "Drop set fileset - A"
fileset_name = fileset["name"] if "name" in fileset else ""
fileset_description = (
fileset["description"] if "description" in fileset else ""
@@ -931,7 +942,7 @@ def set_process(
create_log(
escape_string(category_text), user, escape_string(log_text), conn
)
- dropped_early_filesets += 1
+ dropped_early_no_candidate += 1
delete_original_fileset(fileset_id, conn)
id_to_fileset_dict[fileset_id] = fileset
@@ -945,7 +956,17 @@ def set_process(
for candidate, set_filesets in value_to_keys.items():
if len(set_filesets) > 1:
for set_fileset in set_filesets:
- dropped_early_filesets += 1
+ fileset = id_to_fileset_dict[set_fileset]
+ category_text = "Drop set fileset - B"
+ fileset_name = fileset["name"] if "name" in fileset else ""
+ fileset_description = (
+ fileset["description"] if "description" in fileset else ""
+ )
+ log_text = f"Drop fileset, multiple filesets mapping to single detection. Name: {fileset_name}, Description: {fileset_description}"
+ create_log(
+ escape_string(category_text), user, escape_string(log_text), conn
+ )
+ dropped_early_single_candidate_multiple_sets += 1
delete_original_fileset(set_fileset, conn)
del set_to_candidate_dict[set_fileset]
del id_to_fileset_dict[set_fileset]
@@ -969,6 +990,7 @@ def set_process(
manual_merged_filesets,
mismatch_filesets,
conn,
+ skiplog,
)
# Final log
@@ -981,7 +1003,7 @@ def set_process(
log_text = f"Completed loading DAT file, filename {filepath}, size {os.path.getsize(filepath)}. State {source_status}. Number of filesets: {fileset_insertion_count}. Transaction: {transaction_id}"
create_log(escape_string(category_text), user, escape_string(log_text), conn)
category_text = "Upload information"
- log_text = f"Number of filesets: {fileset_insertion_count}. Filesets automatically merged: {auto_merged_filesets}. Filesets dropped early(no candidate/ extra variant) - {dropped_early_filesets}. Filesets requiring manual merge: {manual_merged_filesets}. Partial/Full filesets already present: {fully_matched_filesets}. Partial/Full filesets with mismatch {mismatch_filesets}."
+ log_text = f"Number of filesets: {fileset_insertion_count}. Filesets automatically merged: {auto_merged_filesets}. Filesets dropped early (no candidate) - {dropped_early_no_candidate}. Filesets dropped early (mapping to single detection) - {dropped_early_single_candidate_multiple_sets}. Filesets requiring manual merge: {manual_merged_filesets}. Partial/Full filesets already present: {fully_matched_filesets}. Partial/Full filesets with mismatch {mismatch_filesets}."
create_log(escape_string(category_text), user, escape_string(log_text), conn)
@@ -997,6 +1019,7 @@ def set_perform_match(
manual_merged_filesets,
mismatch_filesets,
conn,
+ skiplog,
):
"""
TODO
@@ -1004,6 +1027,7 @@ def set_perform_match(
with conn.cursor() as cursor:
if len(candidate_filesets) == 1:
matched_fileset_id = candidate_filesets[0]
+
cursor.execute(
"SELECT status FROM fileset WHERE id = %s", (matched_fileset_id,)
)
@@ -1012,14 +1036,15 @@ def set_perform_match(
update_fileset_status(cursor, matched_fileset_id, "partial")
set_populate_file(fileset, matched_fileset_id, conn, detection)
auto_merged_filesets += 1
- log_matched_fileset(
- src,
- fileset_id,
- matched_fileset_id,
- "partial",
- user,
- conn,
- )
+ if not skiplog:
+ log_matched_fileset(
+ src,
+ fileset_id,
+ matched_fileset_id,
+ "partial",
+ user,
+ conn,
+ )
delete_original_fileset(fileset_id, conn)
elif status == "partial" or status == "full":
(is_match, unmatched_files) = is_full_checksum_match(
@@ -1061,14 +1086,15 @@ def set_perform_match(
update_fileset_status(cursor, candidate_fileset, "partial")
set_populate_file(fileset, candidate_fileset, conn, detection)
auto_merged_filesets += 1
- log_matched_fileset(
- src,
- fileset_id,
- candidate_fileset,
- "partial",
- user,
- conn,
- )
+ if not skiplog:
+ log_matched_fileset(
+ src,
+ fileset_id,
+ candidate_fileset,
+ "partial",
+ user,
+ conn,
+ )
delete_original_fileset(fileset_id, conn)
found_match = True
break
@@ -1584,7 +1610,17 @@ def set_populate_file(fileset, fileset_id, conn, detection):
def insert_new_fileset(
- fileset, conn, detection, src, key, megakey, transaction_id, log_text, user, ip=""
+ fileset,
+ conn,
+ detection,
+ src,
+ key,
+ megakey,
+ transaction_id,
+ log_text,
+ user,
+ ip="",
+ skiplog=False,
):
(fileset_id, existing) = insert_fileset(
src,
@@ -1596,6 +1632,7 @@ def insert_new_fileset(
conn,
username=user,
ip=ip,
+ skiplog=skiplog,
)
if fileset_id:
for file in fileset["rom"]:
Commit: d923ae6a101ad2f8514d23a907035adf64630965
https://github.com/scummvm/scummvm-sites/commit/d923ae6a101ad2f8514d23a907035adf64630965
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Replace detection duplicate check by filename, size and checksum instead of megakey
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 15208d0..7abb82c 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -429,14 +429,12 @@ def convert_log_text_to_links(log_text):
def calc_key(fileset):
key_string = ""
- for key, value in fileset.items():
- if key in ["engineid", "gameid", "rom"]:
- continue
- key_string += ":" + str(value)
-
files = fileset["rom"]
+ files.sort(key=lambda x: x["name"].lower())
for file in files:
for key, value in file.items():
+ if key == "name":
+ value = value.lower()
key_string += ":" + str(value)
key_string = key_string.strip(":")
@@ -445,12 +443,13 @@ def calc_key(fileset):
def calc_megakey(fileset):
key_string = f":{fileset['platform']}:{fileset['language']}"
- # print(fileset.keys())
if "rom" in fileset.keys():
files = fileset["rom"]
- files.sort(key=lambda x: x["name"])
+ files.sort(key=lambda x: x["name"].lower())
for file in fileset["rom"]:
for key, value in file.items():
+ if key == "name":
+ value = value.lower()
key_string += ":" + str(value)
elif "files" in fileset.keys():
for file in fileset["files"]:
@@ -458,7 +457,6 @@ def calc_megakey(fileset):
key_string += ":" + str(value)
key_string = key_string.strip(":")
- # print(key_string)
return hashlib.md5(key_string.encode()).hexdigest()
@@ -481,10 +479,6 @@ def db_insert(data_arr, username=None, skiplog=False):
print(f"Missing key in header: {e}")
return
- cursor = conn.cursor()
- cursor.execute("SELECT COUNT(*) FROM file;")
- is_db_empty = list(cursor.fetchone().values())[0] == 0
-
src = "dat" if author not in ["scan", "scummvm"] else author
detection = src == "scummvm"
@@ -506,8 +500,8 @@ def db_insert(data_arr, username=None, skiplog=False):
create_log(escape_string(category_text), user, escape_string(log_text), conn)
for fileset in game_data:
- key = calc_key(fileset) if not detection else ""
- megakey = calc_megakey(fileset) if detection else ""
+ key = calc_key(fileset)
+ megakey = calc_megakey(fileset)
if detection:
engine_name = fileset["engine"]
@@ -518,17 +512,19 @@ def db_insert(data_arr, username=None, skiplog=False):
platform = fileset["platform"]
lang = fileset["language"]
- if is_db_empty:
- with conn.cursor() as cursor:
- cursor.execute(
- "SELECT id FROM fileset WHERE megakey = %s", (megakey,)
- )
- existing_entry = cursor.fetchone()
- if existing_entry is not None:
- log_text = f"Skipping Entry as megakey already exsits in Fileset:{existing_entry['id']} : engineid = {engineid}, gameid = {gameid}, platform = {platform}, language = {lang}"
- create_log("Warning", user, escape_string(log_text), conn)
- print(log_text)
- continue
+ with conn.cursor() as cursor:
+ query = """
+ SELECT id
+ FROM fileset
+ WHERE `key` = %s
+ """
+ cursor.execute(query, (key,))
+ existing_entry = cursor.fetchone()
+ if existing_entry is not None:
+ log_text = f"Skipping Entry as similar entry already exists - Fileset:{existing_entry['id']}. Skipped entry details - engineid = {engineid}, gameid = {gameid}, platform = {platform}, language = {lang}"
+ create_log("Warning", user, escape_string(log_text), conn)
+ print(log_text)
+ continue
insert_game(
engine_name, engineid, title, gameid, extra, platform, lang, conn
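With this commit the duplicate check keys on the file list alone: rom entries are sorted by lower-cased name, every field of every file is concatenated (names lower-cased), and the md5 of that string becomes the fileset key. A small sketch of the same construction on a made-up fileset:

import hashlib

def calc_key_sketch(fileset):
    # Mirrors the reworked calc_key above: only the rom list contributes.
    key_string = ""
    for f in sorted(fileset["rom"], key=lambda x: x["name"].lower()):
        for field, value in f.items():
            if field == "name":
                value = value.lower()
            key_string += ":" + str(value)
    return hashlib.md5(key_string.strip(":").encode()).hexdigest()

fs = {"rom": [{"name": "GAME.EXE", "size": 1000, "md5": "aaa"},
              {"name": "data.001", "size": 2000, "md5": "bbb"}]}
print(calc_key_sketch(fs))  # stable digest regardless of rom order or name case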
Commit: ca894b7af4d77b4b3e6d9011ca07dc24deee7e5d
https://github.com/scummvm/scummvm-sites/commit/ca894b7af4d77b4b3e6d9011ca07dc24deee7e5d
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Improve filtering and navigation in fileset search
Changed paths:
fileset.py
pagination.py
diff --git a/fileset.py b/fileset.py
index 3012408..688ee3d 100644
--- a/fileset.py
+++ b/fileset.py
@@ -1045,23 +1045,29 @@ def fileset_search():
filename = "fileset_search"
records_table = "fileset"
select_query = """
- SELECT extra, platform, language, game.gameid, megakey,
- status, fileset.id as fileset
+ SELECT fileset.id as fileset, extra, platform, language, game.gameid, megakey,
+ status, transaction, engineid
FROM fileset
LEFT JOIN game ON game.id = fileset.game
+ LEFT JOIN engine ON engine.id = game.engine
+ JOIN transactions ON fileset.id = transactions.fileset
"""
order = "ORDER BY fileset.id"
filters = {
- "id": "fileset",
+ "fileset": "fileset",
"gameid": "game",
"extra": "game",
"platform": "game",
"language": "game",
"megakey": "fileset",
"status": "fileset",
+ "transaction": "transactions",
+ "engineid": "engine",
}
mapping = {
"game.id": "fileset.game",
+ "engine.id": "game.engine",
+ "fileset.id": "transactions.fileset",
}
return render_template_string(
create_page(filename, 25, records_table, select_query, order, filters, mapping)
diff --git a/pagination.py b/pagination.py
index 04fbfb4..cb8ba3d 100644
--- a/pagination.py
+++ b/pagination.py
@@ -56,17 +56,17 @@ def create_page(
if set(request.args.keys()).difference({"page", "sort"}):
condition = "WHERE "
- tables = []
+ tables = set()
for key, value in request.args.items():
if key in ["page", "sort"] or value == "":
continue
- tables.append(filters[key])
+ tables.add(filters[key])
if value == "":
value = ".*"
condition += (
- f" AND {filters[key]}.{key} REGEXP '{value}'"
+ f" AND {filters[key]}.{'id' if key == 'fileset' else key} REGEXP '{value}'"
if condition != "WHERE "
- else f"{filters[key]}.{key} REGEXP '{value}'"
+ else f"{filters[key]}.{'id' if key == 'fileset' else key} REGEXP '{value}'"
)
if condition == "WHERE ":
@@ -74,11 +74,18 @@ def create_page(
# Handle multiple tables
from_query = records_table
- if len(tables) > 1 or (tables and tables[0] != records_table):
- for table in tables:
+ tables_list = list(tables)
+ if records_table not in tables_list or len(tables_list) > 1:
+ for table in tables_list:
if table == records_table:
continue
- from_query += f" JOIN {table} ON {get_join_columns(records_table, table, mapping)}"
+ if table == "engine":
+ if "game" in tables:
+ from_query += " JOIN engine ON engine.id = game.engine"
+ else:
+ from_query += " JOIN game ON game.id = fileset.game JOIN engine ON engine.id = game.engine"
+ else:
+ from_query += f" JOIN {table} ON {get_join_columns(records_table, table, mapping)}"
cursor.execute(
f"SELECT COUNT({records_table}.id) AS count FROM {from_query} {condition}"
)
@@ -112,9 +119,9 @@ def create_page(
if value == "":
value = ".*"
condition += (
- f" AND {filters[key]}.{key} REGEXP '{value}'"
+ f" AND {filters[key]}.{'id' if key == 'fileset' else key} REGEXP '{value}'"
if condition != "WHERE "
- else f"{filters[key]}.{key} REGEXP '{value}'"
+ else f"{filters[key]}.{'id' if key == 'fileset' else key} REGEXP '{value}'"
)
if condition == "WHERE ":
@@ -141,18 +148,24 @@ def create_page(
return "No results for given filters"
if results:
if filters:
- html += "<tr class='filter'><td></td>"
+ if records_table != "log":
+ html += "<tr class='filter'><td></td><td></td>"
+ else:
+ html += "<tr class='filter'><td></td>"
+
for key in results[0].keys():
if key not in filters:
html += "<td class='filter'></td>"
continue
filter_value = request.args.get(key, "")
html += f"<td class='filter'><input type='text' class='filter' placeholder='{key}' name='{key}' value='{filter_value}'/></td>"
- html += "</tr><tr class='filter'><td></td><td class='filter'><input type='submit' value='Submit'></td></tr>"
+ html += "</tr><tr class='filter'><td></td><td></td><td class='filter'><input type='submit' value='Submit'></td></tr>"
- html += "<th></th>"
+ html += "<th>#</th>"
+ if records_table != "log":
+ html += "<th>Fileset ID</th>"
for key in results[0].keys():
- if key == "fileset":
+ if key in ["fileset", "fileset_id"]:
continue
vars = "&".join(
[f"{k}={v}" for k, v in request.args.items() if k != "sort"]
@@ -168,7 +181,7 @@ def create_page(
for row in results:
if counter == offset + 1: # If it is the first run of the loop
if filters:
- html += "<tr class='filter'><td></td>"
+ html += "<tr class='filter'><td></td><td></td>"
for key in row.keys():
if key not in filters:
html += "<td class='filter'></td>"
@@ -177,16 +190,18 @@ def create_page(
# Filter textbox
filter_value = request.args.get(key, "")
+ fileset_id = row.get("fileset_id", row.get("fileset"))
if records_table != "log":
- fileset_id = row["fileset"]
html += f"<tr class='games_list' onclick='hyperlink(\"fileset?id={fileset_id}\")'>\n"
- html += f"<td><a href='fileset?id={fileset_id}'>{counter}.</a></td>\n"
+ html += f"<td>{counter}.</td>\n"
+ html += f"<td><a href='fileset?id={fileset_id}'>{fileset_id}</a></td>\n"
else:
html += "<tr>\n"
html += f"<td>{counter}.</td>\n"
+ # html += f"<td>{fileset_id}</td>\n"
for key, value in row.items():
- if key == "fileset":
+ if key in ["fileset", "fileset_id"]:
continue
# Add links to fileset in logs table
@@ -194,7 +209,6 @@ def create_page(
matches = re.findall(r"Fileset:(\d+)", value)
for fileset_id in matches:
fileset_text = f"Fileset:{fileset_id}"
-
with conn.cursor() as cursor:
cursor.execute(
"SELECT fileset FROM history WHERE oldfileset = %s AND oldfileset != fileset",
@@ -203,7 +217,6 @@ def create_page(
row = cursor.fetchone()
if row:
fileset_id = row["fileset"]
-
value = value.replace(
fileset_text,
f"<a href='fileset?id={fileset_id}'>{fileset_text}</a>",
@@ -211,7 +224,6 @@ def create_page(
html += f"<td>{value}</td>\n"
html += "</tr>\n"
-
counter += 1
html += "</table></form>"
Commit: 394c098b7a3464ffdd505211af1753d1e502ee57
https://github.com/scummvm/scummvm-sites/commit/394c098b7a3464ffdd505211af1753d1e502ee57
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Remove punycode encoding while loading to database, further convert \ to / in filepaths for filesystem independent parsing
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 7abb82c..7aa32fd 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -175,6 +175,14 @@ def insert_fileset(
return (fileset_id, False)
+def normalised_path(name):
+ """
+ Converts \ to / in filepaths, so that filepath parsing is filesystem independent.
+ """
+ path_list = name.split("\\")
+ return "/".join(path_list)
+
+
def insert_file(file, detection, src, conn):
# Find full md5, or else use first checksum value
checksum = ""
@@ -196,11 +204,7 @@ def insert_file(file, detection, src, conn):
f"{checktype}-{checksize}" if checktype != "None" else f"{checktype}"
)
- name = (
- encode_punycode(file["name"])
- if punycode_need_encode(file["name"])
- else file["name"]
- )
+ name = normalised_path(file["name"])
values = [name]
@@ -1129,11 +1133,7 @@ def is_full_checksum_match(candidate_fileset, fileset, conn):
set_checksums = set()
for file in fileset["rom"]:
if "md5" in file:
- name = (
- encode_punycode(file["name"])
- if punycode_need_encode(file["name"])
- else file["name"]
- )
+ name = normalised_path(file["name"])
set_checksums.add((name.lower(), file["md5"]))
for fname, fid in candidate_files.items():
@@ -1446,11 +1446,7 @@ def populate_file(fileset, fileset_id, conn, detection):
extended_file_size = True if "size-r" in file else False
- name = (
- encode_punycode(file["name"])
- if punycode_need_encode(file["name"])
- else file["name"]
- )
+ name = normalised_path(file["name"])
escaped_name = escape_string(name)
columns = ["name", "size"]
@@ -1556,12 +1552,7 @@ def set_populate_file(fileset, fileset_id, conn, detection):
checksize, checktype, checksum = get_checksum_props("md5", file["md5"])
if file["name"].lower() not in candidate_files:
- name = (
- encode_punycode(file["name"])
- if punycode_need_encode(file["name"])
- else file["name"]
- )
-
+ name = normalised_path(file["name"])
values = [name]
values.append(file["size"] if "size" in file else "0")
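normalised_path replaces the punycode step at insertion time: backslash-separated paths coming from Windows-generated DAT files are stored with forward slashes, so later lookups behave the same on any filesystem. For example:

def normalised_path(name):
    # Same transformation as above: backslashes become forward slashes.
    return "/".join(name.split("\\"))

print(normalised_path("DATA\\VIDEO\\INTRO.SMK"))   # DATA/VIDEO/INTRO.SMK
print(normalised_path("already/posix/path.txt"))   # unchanged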
Commit: d335d91a550dc33915b0ecd7c0efd4956c03c1cf
https://github.com/scummvm/scummvm-sites/commit/d335d91a550dc33915b0ecd7c0efd4956c03c1cf
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Iteratively look for extra files if romof or cloneof field is present in the set.dat metadata. Filtering update.
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 7aa32fd..a45aaf5 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -467,7 +467,6 @@ def calc_megakey(fileset):
def db_insert(data_arr, username=None, skiplog=False):
header = data_arr[0]
game_data = data_arr[1]
- resources = data_arr[2]
filepath = data_arr[3]
try:
@@ -533,9 +532,6 @@ def db_insert(data_arr, username=None, skiplog=False):
insert_game(
engine_name, engineid, title, gameid, extra, platform, lang, conn
)
- elif src == "dat":
- if "romof" in fileset and fileset["romof"] in resources:
- fileset["rom"] = fileset["rom"] + resources[fileset["romof"]]["rom"]
log_text = f"size {os.path.getsize(filepath)}, author {author}, version {version}. State {status}."
@@ -854,6 +850,7 @@ def match_fileset(data_arr, username=None, skiplog=False):
skiplog,
)
else:
+ game_data_lookup = {fs["name"]: fs for fs in game_data}
for fileset in game_data:
process_fileset(
fileset,
@@ -867,6 +864,7 @@ def match_fileset(data_arr, username=None, skiplog=False):
version,
source_status,
user,
+ game_data_lookup,
)
finalize_fileset_insertion(
conn, transaction_id, src, filepath, author, version, source_status, user
@@ -905,9 +903,25 @@ def set_process(
set_to_candidate_dict = defaultdict(list)
id_to_fileset_dict = defaultdict(dict)
+ game_data_lookup = {fs["name"]: fs for fs in game_data}
+
for fileset in game_data:
- if "romof" in fileset and fileset["romof"] in resources:
- fileset["rom"] += resources[fileset["romof"]]["rom"]
+ # Ideally romof should be enough, but cloneof is also checked to cover edge cases
+ current_name = fileset.get("romof") or fileset.get("cloneof")
+
+ # Iteratively check for extra files if linked to multiple filesets
+ while current_name:
+ if current_name in resources:
+ fileset["rom"] += resources[current_name]["rom"]
+ break
+
+ elif current_name in game_data_lookup:
+ linked = game_data_lookup[current_name]
+ fileset["rom"] += linked.get("rom", [])
+ current_name = linked.get("romof") or linked.get("cloneof")
+ else:
+ break
+
key = calc_key(fileset)
megakey = ""
log_text = f"State {source_status}."
@@ -938,7 +952,7 @@ def set_process(
fileset_description = (
fileset["description"] if "description" in fileset else ""
)
- log_text = f"Drop fileset as no matching candidates. Name: {fileset_name}, Description: {fileset_description}"
+ log_text = f"Drop fileset as no matching candidates. Name: {fileset_name}, Description: {fileset_description}."
create_log(
escape_string(category_text), user, escape_string(log_text), conn
)
@@ -955,6 +969,23 @@ def set_process(
value_to_keys[candidates[0]].append(set_fileset)
for candidate, set_filesets in value_to_keys.items():
if len(set_filesets) > 1:
+ query = """
+ SELECT e.engineid, g.gameid, g.platform, g.language
+ FROM fileset fs
+ JOIN game g ON fs.game = g.id
+ JOIN engine e ON e.id = g.engine
+ WHERE fs.id = %s
+ """
+ result = None
+ with conn.cursor() as cursor:
+ cursor.execute(query, (candidate,))
+ result = cursor.fetchone()
+
+ engine = result["engineid"]
+ gameid = result["gameid"]
+ platform = result["platform"]
+ language = result["language"]
+
for set_fileset in set_filesets:
fileset = id_to_fileset_dict[set_fileset]
category_text = "Drop set fileset - B"
@@ -962,7 +993,7 @@ def set_process(
fileset_description = (
fileset["description"] if "description" in fileset else ""
)
- log_text = f"Drop fileset, multiple filesets mapping to single detection. Name: {fileset_name}, Description: {fileset_description}"
+ log_text = f"Drop fileset, multiple filesets mapping to single detection. Name: {fileset_name}, Description: {fileset_description}. Clashed with Fileset:{candidate} ({engine}:{gameid}-{platform}-{language})"
create_log(
escape_string(category_text), user, escape_string(log_text), conn
)
@@ -996,7 +1027,8 @@ def set_process(
# Final log
with conn.cursor() as cursor:
cursor.execute(
- f"SELECT COUNT(fileset) from transactions WHERE `transaction` = {transaction_id}"
+ "SELECT COUNT(fileset) from transactions WHERE `transaction` = %s",
+ (transaction_id,),
)
fileset_insertion_count = cursor.fetchone()["COUNT(fileset)"]
category_text = f"Uploaded from {src}"
@@ -1037,7 +1069,7 @@ def set_perform_match(
set_populate_file(fileset, matched_fileset_id, conn, detection)
auto_merged_filesets += 1
if not skiplog:
- log_matched_fileset(
+ set_log_matched_fileset(
src,
fileset_id,
matched_fileset_id,
@@ -1087,7 +1119,7 @@ def set_perform_match(
set_populate_file(fileset, candidate_fileset, conn, detection)
auto_merged_filesets += 1
if not skiplog:
- log_matched_fileset(
+ set_log_matched_fileset(
src,
fileset_id,
candidate_fileset,
@@ -1185,17 +1217,28 @@ def set_filter_candidate_filesets(fileset_id, fileset, transaction_id, conn):
FROM candidate_fileset cf
JOIN set_fileset sf ON cf.name = sf.name AND (cf.size = sf.size OR cf.size = -1)
GROUP BY cf.fileset_id
- )
- SELECT mdf.fileset_id
+ ),
+ valid_matched_detection_files AS (
+ SELECT mdf.fileset_id, mdf.match_files_count AS valid_match_files_count
FROM matched_detection_files mdf
- JOIN total_detection_files tdf ON mdf.fileset_id = tdf.fileset_id
- WHERE mdf.match_files_count = tdf.detection_files_found
- ORDER BY mdf.match_files_count DESC;
+ JOIN total_detection_files tdf ON tdf.fileset_id = mdf.fileset_id
+ WHERE tdf.detection_files_found = mdf.match_files_count
+ ),
+ max_match_count AS (
+ SELECT MAX(valid_match_files_count) AS max_count FROM valid_matched_detection_files
+ )
+ SELECT vmdf.fileset_id
+ FROM valid_matched_detection_files vmdf
+ JOIN total_detection_files tdf ON vmdf.fileset_id = tdf.fileset_id
+ JOIN max_match_count mmc ON vmdf.valid_match_files_count = mmc.max_count
+ WHERE vmdf.valid_match_files_count = tdf.detection_files_found;
"""
+
cursor.execute(
query, (fileset_id, fileset["sourcefile"], transaction_id, fileset_id)
)
rows = cursor.fetchall()
+
candidates = []
if rows:
for row in rows:
@@ -1216,11 +1259,26 @@ def process_fileset(
version,
source_status,
user,
+ game_data_lookup,
):
if detection:
insert_game_data(fileset, conn)
- elif src == "dat" and "romof" in fileset and fileset["romof"] in resources:
- fileset["rom"] += resources[fileset["romof"]]["rom"]
+
+ # Ideally romof should be enough, but cloneof is also checked to cover edge cases
+ current_name = fileset.get("romof") or fileset.get("cloneof")
+
+ # Iteratively check for extra files if linked to multiple filesets
+ while current_name:
+ if current_name in resources:
+ fileset["rom"] += resources[current_name]["rom"]
+ break
+
+ elif current_name in game_data_lookup:
+ linked = game_data_lookup[current_name]
+ fileset["rom"] += linked.get("rom", [])
+ current_name = linked.get("romof") or linked.get("cloneof")
+ else:
+ break
key = calc_key(fileset) if not detection else ""
megakey = calc_megakey(fileset) if detection else ""
@@ -1639,6 +1697,17 @@ def log_matched_fileset(src, fileset_last, fileset_id, state, user, conn):
update_history(fileset_last, fileset_id, conn, log_last)
+def set_log_matched_fileset(src, fileset_last, fileset_id, state, user, conn):
+ category_text = f"Matched from {src}"
+ log_text = (
+ f"Matched Fileset:{fileset_last} with Fileset:{fileset_id}. State {state}."
+ )
+ log_last = create_log(
+ escape_string(category_text), user, escape_string(log_text), conn
+ )
+ update_history(fileset_last, fileset_id, conn, log_last)
+
+
def finalize_fileset_insertion(
conn, transaction_id, src, filepath, author, version, source_status, user
):
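The loop introduced above walks romof/cloneof links until it either reaches the shared resources section or runs out of parents, accumulating each parent's rom list along the way. A toy walk of such a chain (entries are invented):

resources = {"base": {"rom": [{"name": "common.dat"}]}}
game_data_lookup = {
    "variant_b": {"name": "variant_b", "rom": [{"name": "b.exe"}], "romof": "base"},
}

fileset = {"name": "variant_a", "rom": [{"name": "a.exe"}], "cloneof": "variant_b"}

current_name = fileset.get("romof") or fileset.get("cloneof")
while current_name:
    if current_name in resources:
        fileset["rom"] += resources[current_name]["rom"]
        break
    elif current_name in game_data_lookup:
        linked = game_data_lookup[current_name]
        fileset["rom"] += linked.get("rom", [])
        current_name = linked.get("romof") or linked.get("cloneof")
    else:
        break

print([f["name"] for f in fileset["rom"]])  # ['a.exe', 'b.exe', 'common.dat']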
Commit: 22d913150fcf111e01c5df4400a621a45bf5a61e
https://github.com/scummvm/scummvm-sites/commit/22d913150fcf111e01c5df4400a621a45bf5a61e
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Add possible merge button inside filesets dashboard
Changed paths:
db_functions.py
fileset.py
schema.py
diff --git a/db_functions.py b/db_functions.py
index a45aaf5..fb233fb 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -947,7 +947,7 @@ def set_process(
# Mac files in set.dat are not represented properly and they won't find a candidate fileset for a match, so we can drop them.
if len(candidate_filesets) == 0:
- category_text = "Drop set fileset - A"
+ category_text = "Drop fileset - No Candidates"
fileset_name = fileset["name"] if "name" in fileset else ""
fileset_description = (
fileset["description"] if "description" in fileset else ""
@@ -988,7 +988,7 @@ def set_process(
for set_fileset in set_filesets:
fileset = id_to_fileset_dict[set_fileset]
- category_text = "Drop set fileset - B"
+ category_text = "Drop fileset - Duplicates"
fileset_name = fileset["name"] if "name" in fileset else ""
fileset_description = (
fileset["description"] if "description" in fileset else ""
@@ -1098,15 +1098,15 @@ def set_perform_match(
else:
category_text = "Mismatch"
log_text = f"Fileset:{fileset_id} mismatched with Fileset:{matched_fileset_id} with status:{status}. Try manual merge."
- print(
- f"Merge Fileset:{fileset_id} manually with Fileset:{matched_fileset_id}. Unmatched files: {len(unmatched_files)}."
- )
+ print_text = f"Merge Fileset:{fileset_id} manually with Fileset:{matched_fileset_id}. Unmatched files: {len(unmatched_files)}."
mismatch_filesets += 1
- # print(f"Merge Fileset:{fileset_id} manually with Fileset:{matched_fileset_id}. Unmatched files: {', '.join(filename for filename in unmatched_files)}.")
- create_log(
- escape_string(category_text),
+ add_manual_merge(
+ [matched_fileset_id],
+ fileset_id,
+ category_text,
+ log_text,
+ print_text,
user,
- escape_string(log_text),
conn,
)
@@ -1134,11 +1134,16 @@ def set_perform_match(
if not found_match:
category_text = "Manual Merge Required"
log_text = f"Merge Fileset:{fileset_id} manually. Possible matches are: {', '.join(f'Fileset:{id}' for id in candidate_filesets)}."
- print(log_text)
- create_log(
- escape_string(category_text), user, escape_string(log_text), conn
- )
manual_merged_filesets += 1
+ add_manual_merge(
+ candidate_filesets,
+ fileset_id,
+ category_text,
+ log_text,
+ log_text,
+ user,
+ conn,
+ )
return (
fully_matched_filesets,
@@ -1148,6 +1153,26 @@ def set_perform_match(
)
+def add_manual_merge(
+ child_filesets, parent_fileset, category_text, log_text, print_text, user, conn
+):
+ """
+ Adds the manual merge entries to a table called possible_merges.
+ """
+ with conn.cursor() as cursor:
+ for child_fileset in child_filesets:
+ query = """
+ INSERT INTO possible_merges
+ (child_fileset, parent_fileset)
+ VALUES
+ (%s, %s)
+ """
+ cursor.execute(query, (child_fileset, parent_fileset))
+
+ create_log(escape_string(category_text), user, escape_string(log_text), conn)
+ print(print_text)
+
+
def is_full_checksum_match(candidate_fileset, fileset, conn):
"""
Return type - (Boolean, List of unmatched files)
diff --git a/fileset.py b/fileset.py
index 688ee3d..9b9dc93 100644
--- a/fileset.py
+++ b/fileset.py
@@ -164,6 +164,7 @@ def fileset():
"""
html += f"<button type='button' onclick=\"location.href='/fileset/{id}/merge'\">Manual Merge</button>"
html += f"<button type='button' onclick=\"location.href='/fileset/{id}/match'\">Match and Merge</button>"
+ html += f"<button type='button' onclick=\"location.href='/fileset/{id}/possible_merge'\">Possible Merges</button>"
html += f"""
<form action="/fileset/{id}/mark_full" method="post" style="display:inline;">
<button type='submit'>Mark as full</button>
@@ -603,6 +604,75 @@ def merge_fileset(id):
"""
+ at app.route("/fileset/<int:id>/possible_merge", methods=["GET", "POST"])
+def possible_merge_filesets(id):
+ base_dir = os.path.dirname(os.path.abspath(__file__))
+ config_path = os.path.join(base_dir, "mysql_config.json")
+ with open(config_path) as f:
+ mysql_cred = json.load(f)
+
+ connection = pymysql.connect(
+ host=mysql_cred["servername"],
+ user=mysql_cred["username"],
+ password=mysql_cred["password"],
+ db=mysql_cred["dbname"],
+ charset="utf8mb4",
+ cursorclass=pymysql.cursors.DictCursor,
+ )
+
+ try:
+ with connection.cursor() as cursor:
+ query = """
+ SELECT
+ fs.*,
+ g.name AS game_name,
+ g.engine AS game_engine,
+ g.platform AS game_platform,
+ g.language AS game_language,
+ g.extra AS extra
+ FROM
+ fileset fs
+ LEFT JOIN
+ game g ON fs.game = g.id
+ JOIN
+ possible_merges pm ON pm.child_fileset = fs.id
+ WHERE pm.parent_fileset = %s
+ """
+ cursor.execute(query, (id,))
+ results = cursor.fetchall()
+
+ html = f"""
+ <!DOCTYPE html>
+ <html>
+ <head>
+ <link rel="stylesheet" type="text/css" href="{{{{ url_for('static', filename='style.css') }}}}">
+ </head>
+ <body>
+ <h2>Possible Merges for fileset-'{id}'</h2>
+ <table>
+ <tr><th>ID</th><th>Game Name</th><th>Platform</th><th>Language</th><th>Extra</th><th>Details</th><th>Action</th></tr>
+ """
+ for result in results:
+ html += f"""
+ <tr>
+ <td>{result["id"]}</td>
+ <td>{result["game_name"]}</td>
+ <td>{result["game_platform"]}</td>
+ <td>{result["game_language"]}</td>
+ <td>{result["extra"]}</td>
+ <td><a href="/fileset?id={result["id"]}">View Details</a></td>
+ <td><a href="/fileset/{id}/merge/confirm?target_id={result["id"]}">Select</a></td>
+ </tr>
+ """
+ html += "</table>\n"
+ html += "</body>\n</html>"
+
+ return render_template_string(html)
+
+ finally:
+ connection.close()
+
+
@app.route("/fileset/<int:id>/merge/confirm", methods=["GET", "POST"])
def confirm_merge(id):
target_id = (
diff --git a/schema.py b/schema.py
index 00cb975..09cbe7f 100644
--- a/schema.py
+++ b/schema.py
@@ -132,6 +132,15 @@ tables = {
fileset INT NOT NULL
)
""",
+ "possible_merges": """
+ CREATE TABLE IF NOT EXISTS possible_merges (
+ id INT AUTO_INCREMENT PRIMARY KEY,
+ child_fileset INT,
+ parent_fileset INT,
+ FOREIGN KEY (child_fileset) REFERENCES fileset(id) ON DELETE CASCADE,
+ FOREIGN KEY (parent_fileset) REFERENCES fileset(id) ON DELETE CASCADE
+ )
+ """,
}
for table, definition in tables.items():
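
Taken together, add_manual_merge in db_functions.py, the possible_merges table in schema.py and the /fileset/<int:id>/possible_merge route form the review flow for ambiguous matches. A stripped-down sketch of the lookup side, assuming a pymysql connection opened with DictCursor as in the route above; list_possible_merges is a hypothetical helper without the game metadata joins used in fileset.py.

def list_possible_merges(conn, parent_fileset_id):
    # Child fileset ids recorded by add_manual_merge for this parent.
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT child_fileset FROM possible_merges WHERE parent_fileset = %s",
            (parent_fileset_id,),
        )
        return [row["child_fileset"] for row in cursor.fetchall()]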
Commit: 365b4f210ae965e90e0dd1b35ebd5a196b273eb5
https://github.com/scummvm/scummvm-sites/commit/365b4f210ae965e90e0dd1b35ebd5a196b273eb5
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Filter files by filename instead of the entire path, as detections do not necessarily store the full filepath.
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index fb233fb..7955306 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -1240,14 +1240,18 @@ def set_filter_candidate_filesets(fileset_id, fileset, transaction_id, conn):
matched_detection_files AS (
SELECT cf.fileset_id, COUNT(*) AS match_files_count
FROM candidate_fileset cf
- JOIN set_fileset sf ON cf.name = sf.name AND (cf.size = sf.size OR cf.size = -1)
+ JOIN set_fileset sf ON ( (
+ cf.name = sf.name
+ OR
+ REGEXP_REPLACE(cf.name, '^.*[\\\\/]', '') = REGEXP_REPLACE(sf.name, '^.*[\\\\/]', '')
+ ) AND (cf.size = sf.size OR cf.size = -1) )
GROUP BY cf.fileset_id
),
valid_matched_detection_files AS (
SELECT mdf.fileset_id, mdf.match_files_count AS valid_match_files_count
FROM matched_detection_files mdf
JOIN total_detection_files tdf ON tdf.fileset_id = mdf.fileset_id
- WHERE tdf.detection_files_found = mdf.match_files_count
+ WHERE tdf.detection_files_found <= mdf.match_files_count
),
max_match_count AS (
SELECT MAX(valid_match_files_count) AS max_count FROM valid_matched_detection_files
@@ -1256,7 +1260,6 @@ def set_filter_candidate_filesets(fileset_id, fileset, transaction_id, conn):
FROM valid_matched_detection_files vmdf
JOIN total_detection_files tdf ON vmdf.fileset_id = tdf.fileset_id
JOIN max_match_count mmc ON vmdf.valid_match_files_count = mmc.max_count
- WHERE vmdf.valid_match_files_count = tdf.detection_files_found;
"""
cursor.execute(
@@ -1619,13 +1622,16 @@ def populate_file(fileset, fileset_id, conn, detection):
def set_populate_file(fileset, fileset_id, conn, detection):
"""
- TODO
+ Updates the old fileset in case of a match. Further deletes the newly created fileset which is not needed anymore.
"""
with conn.cursor() as cursor:
- cursor.execute(f"SELECT id, name FROM file WHERE fileset = {fileset_id}")
+ # Extracting the filename from the filepath.
+ cursor.execute(
+ f"SELECT id, REGEXP_REPLACE(name, '^.*[\\\\/]', '') AS name, size FROM file WHERE fileset = {fileset_id}"
+ )
target_files = cursor.fetchall()
candidate_files = {
- target_file["name"].lower(): target_file["id"]
+ target_file["name"].lower(): [target_file["id"], target_file["size"]]
for target_file in target_files
}
@@ -1634,7 +1640,15 @@ def set_populate_file(fileset, fileset_id, conn, detection):
continue
checksize, checktype, checksum = get_checksum_props("md5", file["md5"])
- if file["name"].lower() not in candidate_files:
+ filename = os.path.basename(normalised_path(file["name"]))
+
+ if filename.lower() not in candidate_files or (
+ filename.lower() in candidate_files
+ and (
+ candidate_files[filename.lower()][1] != -1
+ and candidate_files[filename.lower()][1] != file["size"]
+ )
+ ):
name = normalised_path(file["name"])
values = [name]
@@ -1658,11 +1672,18 @@ def set_populate_file(fileset, fileset_id, conn, detection):
else:
query = """
UPDATE file
- SET size = %s
+ SET size = %s,
+ name = %s
WHERE id = %s
"""
+ # Filtering was by filename, but we are still updating the file with the original filepath.
cursor.execute(
- query, (file["size"], candidate_files[file["name"].lower()])
+ query,
+ (
+ file["size"],
+ normalised_path(file["name"]),
+ candidate_files[filename.lower()][0],
+ ),
)
query = """
INSERT INTO filechecksum (file, checksize, checktype, checksum)
@@ -1671,7 +1692,7 @@ def set_populate_file(fileset, fileset_id, conn, detection):
cursor.execute(
query,
(
- candidate_files[file["name"].lower()],
+ candidate_files[filename.lower()][0],
checksize,
checktype,
checksum,
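
Both the new join condition and the rewritten set_populate_file compare basenames rather than full paths. A hypothetical Python-side restatement of that test, shown only to make the REGEXP_REPLACE behaviour concrete; the SQL itself is case-sensitive, and the lower-casing here follows the candidate_files handling in set_populate_file.

import os

def names_match(candidate_name, candidate_size, set_name, set_size):
    # Reduce both names to their basename, treating both \ and / as
    # separators, and let a candidate size of -1 act as a wildcard,
    # the same effect as (cf.size = sf.size OR cf.size = -1) above.
    base_a = os.path.basename(candidate_name.replace("\\", "/")).lower()
    base_b = os.path.basename(set_name.replace("\\", "/")).lower()
    return base_a == base_b and (candidate_size == set_size or candidate_size == -1)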
Commit: 4a3626afe54eb976fd7b1431d024d83f573f7bae
https://github.com/scummvm/scummvm-sites/commit/4a3626afe54eb976fd7b1431d024d83f573f7bae
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Avoid adding duplicate files from detection entries
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 7955306..bf1e2dc 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -546,7 +546,16 @@ def db_insert(data_arr, username=None, skiplog=False):
username=username,
skiplog=skiplog,
):
- for file in fileset["rom"]:
+ # Some detection entries contain duplicate files.
+ unique_files = []
+ seen = set()
+ for file_dict in fileset["rom"]:
+ dict_tuple = tuple(sorted(file_dict.items()))
+ if dict_tuple not in seen:
+ seen.add(dict_tuple)
+ unique_files.append(file_dict)
+
+ for file in unique_files:
insert_file(file, detection, src, conn)
for key, value in file.items():
if key not in ["name", "size", "size-r", "size-rd", "sha1", "crc"]:
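
The deduplication above keys each file dictionary by its sorted items. A small sketch of the same idea as a reusable helper; dedupe_rom_entries is a hypothetical name, and it assumes all values in the file dicts are hashable, which holds for the strings and integers used here.

def dedupe_rom_entries(rom_entries):
    # Keep only the first occurrence of each exact duplicate file dict.
    seen = set()
    unique_files = []
    for file_dict in rom_entries:
        dict_tuple = tuple(sorted(file_dict.items()))
        if dict_tuple not in seen:
            seen.add(dict_tuple)
            unique_files.append(file_dict)
    return unique_files

# e.g. dedupe_rom_entries([{"name": "a", "size": 1}, {"name": "a", "size": 1}])
# returns a single entry.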
Commit: 961e678cdccf333ae9ae0ec5ee755d9e940ccf26
https://github.com/scummvm/scummvm-sites/commit/961e678cdccf333ae9ae0ec5ee755d9e940ccf26
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Avoid overriding a detection file on a match when a similar file exists in a different directory. Add it as a normal non-detection file.

Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index bf1e2dc..30896cf 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -1644,6 +1644,8 @@ def set_populate_file(fileset, fileset_id, conn, detection):
for target_file in target_files
}
+ seen_detection_files = set()
+
for file in fileset["rom"]:
if "md5" not in file:
continue
@@ -1651,11 +1653,14 @@ def set_populate_file(fileset, fileset_id, conn, detection):
filename = os.path.basename(normalised_path(file["name"]))
- if filename.lower() not in candidate_files or (
- filename.lower() in candidate_files
- and (
- candidate_files[filename.lower()][1] != -1
- and candidate_files[filename.lower()][1] != file["size"]
+ if ((filename.lower(), file["size"]) in seen_detection_files) or (
+ filename.lower() not in candidate_files
+ or (
+ filename.lower() in candidate_files
+ and (
+ candidate_files[filename.lower()][1] != -1
+ and candidate_files[filename.lower()][1] != file["size"]
+ )
)
):
name = normalised_path(file["name"])
@@ -1707,6 +1712,7 @@ def set_populate_file(fileset, fileset_id, conn, detection):
checksum,
),
)
+ seen_detection_files.add((filename.lower(), file["size"]))
def insert_new_fileset(
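
The widened condition above decides when a set.dat file is inserted as a new non-detection file instead of updating the matched detection file. A hypothetical factored-out restatement of the same test; should_insert_as_new_file is an illustrative name, not part of the codebase.

def should_insert_as_new_file(filename, size, candidate_files, seen_detection_files):
    # Insert as a new file when the detection slot for this (filename,
    # size) pair was already consumed, when no detection file with this
    # basename exists, or when the sizes disagree and the stored size is
    # not the -1 wildcard.
    key = filename.lower()
    if (key, size) in seen_detection_files:
        return True
    if key not in candidate_files:
        return True
    stored_size = candidate_files[key][1]
    return stored_size != -1 and stored_size != size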
Commit: 10a017e1decd951d170037e35ac6ae726c3f39c7
https://github.com/scummvm/scummvm-sites/commit/10a017e1decd951d170037e35ac6ae726c3f39c7
Author: ShivangNagta (shivangnag at gmail.com)
Date: 2025-07-02T23:43:33+02:00
Commit Message:
INTEGRITY: Create a copy of the game data during lookup map creation to avoid issues due to the mutability of Python dictionaries.
Changed paths:
db_functions.py
diff --git a/db_functions.py b/db_functions.py
index 30896cf..45adc5d 100644
--- a/db_functions.py
+++ b/db_functions.py
@@ -8,6 +8,7 @@ import os
from pymysql.converters import escape_string
from collections import defaultdict
import re
+import copy
SPECIAL_SYMBOLS = '/":*|\\?%<>\x7f'
@@ -912,7 +913,8 @@ def set_process(
set_to_candidate_dict = defaultdict(list)
id_to_fileset_dict = defaultdict(dict)
- game_data_lookup = {fs["name"]: fs for fs in game_data}
+ # Deep copy to avoid changes in game_data in the loop affecting the lookup map.
+ game_data_lookup = {fs["name"]: copy.deepcopy(fs) for fs in game_data}
for fileset in game_data:
# Ideally romof should be enough, but adding in case of an edge case
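
A minimal illustration of the aliasing this commit avoids, using made-up data: with a shallow lookup both structures share the same inner dictionaries, so extending fileset["rom"] during processing also mutates the lookup map, while a deep copy keeps the lookup as an untouched snapshot.

import copy

game_data = [{"name": "gamea", "rom": [{"name": "a.dat", "size": 10}]}]

shallow_lookup = {fs["name"]: fs for fs in game_data}
deep_lookup = {fs["name"]: copy.deepcopy(fs) for fs in game_data}

# Simulate the romof loop extending a fileset's rom list in place.
game_data[0]["rom"].append({"name": "extra.dat", "size": 5})

assert len(shallow_lookup["gamea"]["rom"]) == 2  # aliased entry mutated
assert len(deep_lookup["gamea"]["rom"]) == 1     # deep-copied snapshot preserved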