Updated get_file_info to handle files with multiple periods in their …#7

Open
bdgregg wants to merge 1 commit into master from 4-multiple-periods-in-filename

@bdgregg
Contributor

@bdgregg bdgregg commented Jun 3, 2026

…name.

Had a case where a partner submitted files like "a.mp3.vtt" instead of "a.vtt". To handle this properly, the PID is taken to be everything before the first period and the extension everything after the last period.
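
For reference, the rule described above could be sketched roughly as follows. This is only an illustration: `split_first_last` is a hypothetical name, not the actual `get_file_info` implementation in this PR.

```python
import os


def split_first_last(file_path: str) -> tuple[str, str]:
    """Return (pid, ext) for a file path.

    pid is the text before the FIRST period of the file name;
    ext is the text after the LAST period, with a leading dot.
    Hypothetical helper for illustration only.
    """
    name = os.path.basename(file_path)
    if "." not in name:
        # No period at all: the whole name is the PID, no extension.
        return name, ""
    pid = name.split(".", 1)[0]
    ext = "." + name.rsplit(".", 1)[1]
    return pid, ext


print(split_first_last("a.vtt"))      # ('a', '.vtt')
print(split_first_last("a.mp3.vtt"))  # ('a', '.vtt')
```

The rule relies on the PID itself containing no periods.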

@bdgregg bdgregg requested review from ctgraham and ojas-uls-dev June 3, 2026 19:54
@bdgregg bdgregg linked an issue Jun 3, 2026 that may be closed by this pull request
@ojas-uls-dev

The function os.path.splitext handles this gracefully, without the need for custom string logic:

>>> import os.path as path
>>> path.splitext('a/b/c.vtt')
('a/b/c', '.vtt')
>>> path.splitext('a/b/c.mp3.vtt')
('a/b/c.mp3', '.vtt')
>>>
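
Note that in the second call the root still ends in `.mp3`, so using the root directly as the PID would give `c.mp3`. To see every trailing suffix at once, the standard library's `pathlib` can be used alongside `os.path.splitext`. The sketch below is illustrative only; the helper name `describe` is made up.

```python
import os.path
from pathlib import PurePosixPath


def describe(file_path: str) -> dict:
    """Show what splitext and pathlib each report for a path (illustrative)."""
    root, ext = os.path.splitext(file_path)
    return {
        "root": os.path.basename(root),                  # file name minus the last extension
        "ext": ext,                                      # last extension only
        "suffixes": PurePosixPath(file_path).suffixes,   # every dot-separated suffix
    }


print(describe("a/b/c.mp3.vtt"))
# {'root': 'c.mp3', 'ext': '.vtt', 'suffixes': ['.mp3', '.vtt']}
```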

@ctgraham
Member

ctgraham commented Jun 3, 2026

We have plenty of historic PIDs with dots in them.

> everything before the first period and the Extension is everything after the last period.

This assumption would fail with historic data.

@bdgregg
Contributor Author

bdgregg commented Jun 3, 2026

@ctgraham Thanks for the reminder. Something like AIS.51423-7.5.0.mp3.vtt would indeed fail. Any other suggestions for weeding out the embedded file extension (".mp3" in this example)? My only thought is to parse the filename against a list of known extensions, but it could be time-consuming to identify every extension people might use. The current workaround is to remove the ".mp3" from the filename manually.
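
One way the known-list idea could work is to peel recognized extensions off the end of the name and stop at the first segment that is not recognized, so that dots inside the PID are left alone. A minimal sketch, assuming a small hand-maintained set (`KNOWN_EXTENSIONS` and `split_pid` are hypothetical names, not part of the current code):

```python
import os

# Assumed list for illustration; the real pipeline may recognize others.
KNOWN_EXTENSIONS = {
    ".tif", ".jp2", ".mp3", ".mkv", ".mp4",
    ".srt", ".vtt", ".pdf", ".png", ".jpg",
}


def split_pid(file_path: str, known: set = KNOWN_EXTENSIONS) -> tuple[str, str]:
    """Return (pid, ext), dropping known extensions embedded before the last one."""
    name = os.path.basename(file_path)
    pid, ext = os.path.splitext(name)
    while True:
        root, inner = os.path.splitext(pid)
        # Stop at the first trailing segment that is not a known extension.
        if not root or inner.lower() not in known:
            break
        pid = root
    return pid, ext


print(split_pid("AIS.51423-7.5.0.mp3.vtt"))  # ('AIS.51423-7.5.0', '.vtt')
print(split_pid("a.vtt"))                    # ('a', '.vtt')
```

A PID that legitimately ends in a recognized extension would still be truncated by this approach, so it narrows the problem without removing it.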

@bdgregg
Contributor Author

bdgregg commented Jun 3, 2026

I guess I could look at integrating a list of known extensions from MIME types, such as the following, and using it to determine which parts of the filename are file extensions that could be dropped.

import mimetypes

def get_known_extensions() -> set:
    """Build a set of known extensions from Python's mimetypes registry."""
    mimetypes.init()
    return {ext.lower() for ext in mimetypes.types_map.keys()}

There's also a mime-db JSON dataset available at https://cdn.jsdelivr.net/npm/mime-db@latest/db.json that we could use, but it would obviously require internet access.

Claude suggested both of these working together to provide the best/standard list of extensions.
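
A quick way to evaluate the registry idea is to check which of the extensions the pipeline cares about are actually present in `mimetypes`. The contents of `mimetypes.types_map` vary with the Python version and the platform's MIME files, so the sketch below reports gaps rather than assuming coverage. `PIPELINE_EXTENSIONS` and `missing_from_registry` are illustrative names, and the extension list is an assumption.

```python
import mimetypes

# Assumed set of extensions the pipeline handles (illustrative only).
PIPELINE_EXTENSIONS = {
    ".tif", ".jp2", ".mp3", ".mkv", ".mp4",
    ".srt", ".vtt", ".pdf", ".png", ".jpg",
}


def get_known_extensions() -> set:
    """Build a set of known extensions from Python's mimetypes registry."""
    mimetypes.init()
    return {ext.lower() for ext in mimetypes.types_map}


def missing_from_registry(wanted: set) -> set:
    """Return the extensions in `wanted` that the registry does not list."""
    known = get_known_extensions()
    return {ext for ext in wanted if ext.lower() not in known}


# Output depends on the local Python version and platform.
print(sorted(missing_from_registry(PIPELINE_EXTENSIONS)))
```

Keep in mind that the broader the list, the greater the chance that a segment of a dotted PID is mistaken for an extension.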

@ctgraham
Member

ctgraham commented Jun 4, 2026

I think the solution should be a bit of both: handling the common use cases and requiring manual workarounds for non-standard ones.

We already have an enumeration of known file extensions:

# Process any .tif files.
if (file_ext.lower() == ".tif"):
    logger.info("File is type: TIFF")
    # Create .jp2 file from .tif and set .jp2 file as the file.
    # Override pid, file_path to point to jp2 file.
    pid,file_path = process_tiff(file_path,level)
    # Override Row Data.
    row_data['id'] = pid
    row_data['file'] = file_path
# Process any .jp2 files.
elif (file_ext.lower() == ".jp2"):
    logger.info(f"File is type: JP2")
# Process any audio files.
elif (file_ext.lower() == ".mp3"):
    logger.info(f"File is type: Audio")
# Process any video files.
elif (file_ext.lower() == ".mkv" or file_ext.lower() == ".mp4"):
    logger.info(f"File is type: Video")
# Process any transcription files.
elif (file_ext.lower() == ".srt" or file_ext.lower() == ".vtt"):
    logger.info(f"File is type: Transcript/Caption")
    # Override Row Data.
    # For transcription files just update the transcript column.
    row_data = {
        'transcript': file_path
    }
# Process any PDF files.
elif (file_ext.lower() == ".pdf"):
    logger.info(f"File is type: PDF")
# Process any simple image files.
elif (file_ext.lower() == ".png" or file_ext.lower() == ".jpg"):
    logger.info(f"File is type: Simple Image")

If there are standard pairings, these can be recognized (e.g. .mp3.vtt, .mp4.vtt). But if we expect the partner to provide data in the format PID.ext, then either they or we need to clean up the input when it doesn't match known patterns.
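
As a rough sketch of that split between recognized pairings and manual cleanup: a final extension preceded by its paired media extension is treated as a known pairing, any other name is split at the last period, and an unrecognized final extension is rejected so it can be fixed by hand. All names here are hypothetical, and the pairing list contains only the two examples given above.

```python
import os

# Assumed to mirror the extensions the pipeline already branches on.
KNOWN_EXTENSIONS = {
    ".tif", ".jp2", ".mp3", ".mkv", ".mp4",
    ".srt", ".vtt", ".pdf", ".png", ".jpg",
}
# (embedded extension, final extension) pairs that are safe to collapse.
KNOWN_PAIRINGS = {(".mp3", ".vtt"), (".mp4", ".vtt")}


def get_pid_and_ext(file_path: str) -> tuple[str, str]:
    """Return (pid, lowercased ext); raise ValueError for unknown extensions."""
    name = os.path.basename(file_path)
    pid, ext = os.path.splitext(name)
    ext = ext.lower()
    if ext not in KNOWN_EXTENSIONS:
        # Non-standard input: leave it for manual cleanup.
        raise ValueError(f"Unrecognized extension in {name!r}; expected PID.ext")
    inner_root, inner_ext = os.path.splitext(pid)
    if inner_root and (inner_ext.lower(), ext) in KNOWN_PAIRINGS:
        # Standard pairing such as .mp3.vtt: drop the embedded media extension.
        pid = inner_root
    return pid, ext


print(get_pid_and_ext("AIS.51423-7.5.0.mp3.vtt"))  # ('AIS.51423-7.5.0', '.vtt')
print(get_pid_and_ext("AIS.51423-7.5.0.tif"))      # ('AIS.51423-7.5.0', '.tif')
```

Anything outside the pairing list, such as a hypothetical `a.wav.vtt`, keeps its dots in the PID and would still need to be renamed manually if `.wav` was not meant to be part of it.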

Development

Successfully merging this pull request may close these issues.

Handle files with multiple periods in the filename.

3 participants