Updated get_file_info to handle files with multiple periods in their … by bdgregg · Pull Request #7 · ulsdevteam/scan-batch-dir

bdgregg · 2026-06-03T19:54:24Z

…name.

We had a case where a partner submitted files named like "a.mp3.vtt" instead of "a.vtt". To handle this, the PID is taken to be everything before the first period and the extension everything after the last period.


ojas-uls-dev · 2026-06-03T20:03:38Z

The function os.path.splitext handles this gracefully, without the need for custom string logic:

>>> import os.path as path
>>> path.splitext('a/b/c.vtt')
('a/b/c', '.vtt')
>>> path.splitext('a/b/c.mp3.vtt')
('a/b/c.mp3', '.vtt')
>>>

ctgraham · 2026-06-03T20:12:57Z

We have plenty of historic PIDs with dots in them.

"...everything before the first period and the Extension is everything after the last period."

This assumption would fail with historic data.

bdgregg · 2026-06-03T20:24:48Z

@ctgraham Thanks for the reminder. Something like AIS.51423-7.5.0.mp3.vtt would indeed fail. Any other suggestions for weeding out the embedded file extension (".mp3" in this example)? My only thought is to parse the string against a known list of extensions, but it could be time-consuming to identify every extension people might create. The current workaround is to remove the ".mp3" from the filename manually.
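To make the failure concrete, here is a minimal sketch of the first-period/last-period rule next to os.path.splitext. The helper name split_first_last is hypothetical and is not the actual get_file_info code from this PR.

```python
import os.path


def split_first_last(filename: str) -> tuple[str, str]:
    """Hypothetical helper mirroring the rule described in this PR."""
    pid = filename.split(".", 1)[0]          # everything before the first period
    ext = "." + filename.rsplit(".", 1)[-1]  # everything after the last period
    return pid, ext


# Works for the partner's double-extension file...
print(split_first_last("a.mp3.vtt"))            # ('a', '.vtt')
# ...but truncates a historic PID that contains dots.
print(split_first_last("AIS.51423-7.5.0.vtt"))  # ('AIS', '.vtt')
# os.path.splitext keeps the dotted PID intact.
print(os.path.splitext("AIS.51423-7.5.0.vtt"))  # ('AIS.51423-7.5.0', '.vtt')
```

Neither rule alone covers both inputs: splitext preserves dotted PIDs but leaves ".mp3" attached to the PID in "a.mp3.vtt".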

bdgregg · 2026-06-03T21:03:55Z

I guess I could look at integrating a list of known extensions from MIME types, such as the following, and use it to determine which parts of the filename are embedded file extensions that can be stripped.

import mimetypes

def get_known_extensions() -> set:
    """Build a set of known extensions from Python's mimetypes registry."""
    mimetypes.init()
    return {ext.lower() for ext in mimetypes.types_map.keys()}

There's also a mime-db JSON dataset we could use, available at https://cdn.jsdelivr.net/npm/mime-db@latest/db.json, but fetching it would require internet access.

Claude suggested both of these working together to provide the best/standard list of extensions.
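As a rough sketch of how such a list might be applied: take the final extension with os.path.splitext, then peel off any further trailing components that appear in the known set. The function name split_pid_and_ext and its known parameter are assumptions made for illustration, not part of the script.

```python
import mimetypes
import os.path
from typing import Optional


def get_known_extensions() -> set:
    """Build a set of known extensions from Python's mimetypes registry."""
    mimetypes.init()
    return {ext.lower() for ext in mimetypes.types_map}


def split_pid_and_ext(filename: str, known: Optional[set] = None) -> tuple[str, str]:
    """Hypothetical helper: return (pid, ext), dropping embedded known extensions."""
    if known is None:
        known = get_known_extensions()
    stem, ext = os.path.splitext(filename)
    while True:
        inner_stem, inner_ext = os.path.splitext(stem)
        # Stop at the first trailing component that is not a known extension.
        if inner_stem and inner_ext.lower() in known:
            stem = inner_stem
        else:
            break
    return stem, ext


print(split_pid_and_ext("AIS.51423-7.5.0.mp3.vtt", {".mp3", ".mp4", ".vtt"}))
```

The registry contents vary by platform and Python version, and a PID that happens to end in a registered extension (for example ".doc") would be shortened by mistake, so a broad list trades one class of errors for another.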

ctgraham · 2026-06-04T13:53:05Z

I think the solution should be a bit of both: handling the common use cases automatically, and requiring manual workarounds for non-standard cases.

We already have an enumeration of known file extensions:

scan-batch-dir/scan-batch-dir

Lines 852 to 892 in 241bc44

    # Process any .tif files.
    if (file_ext.lower() == ".tif"):
        logger.info("File is type: TIFF")
        # Create .jp2 file from .tif and set .jp2 file as the file.
        # Override pid, file_path to point to jp2 file.
        pid,file_path = process_tiff(file_path,level)
        # Override Row Data.
        row_data['id'] = pid
        row_data['file'] = file_path
    # Process any .jp2 files.
    elif (file_ext.lower() == ".jp2"):
        logger.info(f"File is type: JP2")
    # Process any audio files.
    elif (file_ext.lower() == ".mp3"):
        logger.info(f"File is type: Audio")
    # Process any video files.
    elif (file_ext.lower() == ".mkv" or file_ext.lower() == ".mp4"):
        logger.info(f"File is type: Video")
    # Process any transcription files.
    elif (file_ext.lower() == ".srt" or file_ext.lower() == ".vtt"):
        logger.info(f"File is type: Transcript/Caption")
        # Override Row Data.
        # For transcription files just update the transcript column.
        row_data = {
            'transcript': file_path
        }
    # Process any PDF files.
    elif (file_ext.lower() == ".pdf"):
        logger.info(f"File is type: PDF")
    # Process any simple image files.
    elif (file_ext.lower() == ".png" or file_ext.lower() == ".jpg"):
        logger.info(f"File is type: Simple Image")

If there are standard pairings, these can be recognized (e.g., .mp3.vtt, .mp4.vtt). But if we expect the partner to provide data in the format PID.ext, then either they or we need to clean up the input when it doesn't match known patterns.
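One way to read that suggestion, sketched under assumptions: reuse the media and transcript extensions from the excerpt above, and strip an embedded media extension only when it sits directly before a transcript extension. The names MEDIA_EXTS, TRANSCRIPT_EXTS, and split_known_pairing are hypothetical, and extending the pairings beyond .mp3.vtt and .mp4.vtt to .mkv and .srt is a guess.

```python
import os.path

# Extensions the script already branches on (taken from the excerpt above).
MEDIA_EXTS = {".mp3", ".mkv", ".mp4"}
TRANSCRIPT_EXTS = {".srt", ".vtt"}


def split_known_pairing(filename: str) -> tuple[str, str]:
    """Hypothetical helper: return (pid, ext), collapsing media+transcript pairs."""
    stem, ext = os.path.splitext(filename)
    if ext.lower() in TRANSCRIPT_EXTS:
        inner_stem, inner_ext = os.path.splitext(stem)
        # Only a recognized pairing such as ".mp3.vtt" is collapsed.
        if inner_stem and inner_ext.lower() in MEDIA_EXTS:
            return inner_stem, ext
    # Anything else keeps plain splitext behavior, dots in the PID included.
    return stem, ext


print(split_known_pairing("AIS.51423-7.5.0.mp3.vtt"))  # ('AIS.51423-7.5.0', '.vtt')
print(split_known_pairing("AIS.51423-7.5.0.vtt"))      # ('AIS.51423-7.5.0', '.vtt')
```

Filenames outside the recognized pairings pass through unchanged, which leaves non-standard input to be cleaned up manually as described.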

Updated get_file_info to handle files with multiple periods in their …

29d7548

…name.

bdgregg requested review from ctgraham and ojas-uls-dev June 3, 2026 19:54

bdgregg linked an issue Jun 3, 2026 that may be closed by this pull request

Handle files with multiple periods in the filename. #4

Open

bdgregg wants to merge 1 commit into master from 4-multiple-periods-in-filename

3 participants
