Updated get_file_info to handle files with multiple periods in their …#7
bdgregg wants to merge 1 commit into
Conversation
|
The function |
|
We have plenty of historic PIDs with dots in them.
This assumption would fail with historic data. |
|
@ctgraham Thanks for the reminder about filenames like AIS.51423-7.5.0.mp3.vtt. Yes, that would fail badly. Any other suggestions for weeding out the embedded file extension (".mp3" in this example)? My only thought is to parse the string against a known list of extensions, but it could be time-consuming to identify every extension people might use. The current workaround is to remove the ".mp3" from the filename manually. |
|
I guess I could look at integrating a list of known extensions derived from MIME types and using it to determine which parts of the filename are file extensions that could be stripped. There's also a mime-db JSON dataset we could use, available at https://cdn.jsdelivr.net/npm/mime-db@latest/db.json, but fetching it would obviously require internet access. Claude suggested using both together to provide the best/standard list of extensions. |
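
As a rough illustration of the known-extension idea in this comment, here is a minimal sketch that uses Python's standard-library `mimetypes.types_map` as the extension list (no network access needed). The function names `load_known_extensions` and `split_pid_and_extension` are hypothetical and are not taken from the repository; the mime-db dataset is not used here. The rule it assumes: the last suffix is always the extension, and any further trailing suffixes are stripped only while they appear in the known list.

```python
import mimetypes
import os


def load_known_extensions():
    """Return dotted, lower-cased extensions known to the stdlib (e.g. '.mp3').

    Assumption: the stdlib table is a good enough stand-in for a curated list.
    """
    return {ext.lower() for ext in mimetypes.types_map}


def split_pid_and_extension(filename, known_extensions):
    """Hypothetical helper: split a filename into (pid, extension).

    The final suffix is taken as the extension. Embedded suffixes to its left
    are stripped from the PID only while they are in known_extensions, so
    dotted PIDs such as 'AIS.51423-7.5.0' are left intact.
    """
    pid, extension = os.path.splitext(filename)
    while True:
        stem, embedded = os.path.splitext(pid)
        if not embedded or embedded.lower() not in known_extensions:
            break
        pid = stem
    return pid, extension.lstrip(".")
```

Called as `split_pid_and_extension("AIS.51423-7.5.0.mp3.vtt", load_known_extensions())`, this would strip ".mp3" only if the platform's table lists it, which is one reason an explicit, project-owned list may be preferable.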
|
I think the solution should be a bit of both: handling the common use cases, and requiring manual workarounds for non-standard ones. We already have an enumeration of known file extensions (lines 852 to 892 in 241bc44). If there are standard pairings, these can be recognized (e.g. .mp3.vtt, .mp4.vtt); but if we expect the partner to provide data in the format PID.ext, then either they or we need to clean up the input when it doesn't match known patterns. |
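
The pairing approach described in this comment could look roughly like the sketch below. `KNOWN_PAIRINGS` and `split_with_pairings` are hypothetical names; the two pairings are just the examples given in the comment, not the repository's actual enumeration. Anything that does not end in a recognized pairing falls back to a plain split on the last period, so non-standard input still needs manual cleanup.

```python
# Hypothetical list of standard pairings, taken from the examples above.
KNOWN_PAIRINGS = (".mp3.vtt", ".mp4.vtt")


def split_with_pairings(filename, pairings=KNOWN_PAIRINGS):
    """Hypothetical helper: return (pid, extension) for a PID.ext filename.

    If the name ends with a recognized pairing, the whole pairing is removed
    from the PID and the last part of the pairing is the extension.
    Otherwise the name is split on the last period only.
    """
    lowered = filename.lower()
    for pairing in pairings:
        if lowered.endswith(pairing):
            pid = filename[: -len(pairing)]
            extension = pairing.rsplit(".", 1)[1]
            return pid, extension
    pid, dot, extension = filename.rpartition(".")
    if not dot:
        # No period at all: treat the whole name as the PID.
        return filename, ""
    return pid, extension
```

With this sketch, a dotted historic PID such as AIS.51423-7.5.0 survives intact, while an unrecognized pairing like "a.wav.vtt" yields the PID "a.wav" and would have to be fixed by hand.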
…name.
Had a case where a partner submitted files like "a.mp3.vtt" instead of "a.vtt". To handle this properly, the PID is everything before the first period and the extension is everything after the last period.
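
For reference, the rule stated in this description can be sketched as follows. This is not the actual `get_file_info` code from the PR; `first_last_split` is a hypothetical stand-in that only illustrates "PID before the first period, extension after the last period".

```python
def first_last_split(filename):
    """Hypothetical illustration of the rule in the commit description.

    PID = everything before the first period.
    Extension = everything after the last period.
    Anything in between is discarded.
    """
    pid, _, rest = filename.partition(".")
    extension = rest.rpartition(".")[2]
    return pid, extension
```

For "a.mp3.vtt" this gives the PID "a" and the extension "vtt", as intended. For a dotted PID such as AIS.51423-7.5.0.mp3.vtt it gives the PID "AIS", which is the failure raised in the conversation.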