It all started with my friend complaining about the lack of space on his drives and I was thinking there has to be a way to fix that problem. At first, I did a comparison program, but it only saved a few GB out of his 12 TB and mapping the drive noticed a lot of videos and they were all over the place. The end goal was running a NAS server with all his files which required a certain standardization to the files. Inadvertently during the process, I found a way to compress the files by an insane rate.
Overview
The program starts at a directory and traverses that directory into sub directories taking the video files and through some checks the folder as potentially the series name. From those we come up with a standardized format naming the files and exporting to a excel document for manual user review. Once the user reviews and makes modifications they change the 2nd to last field to true and switch over to other python file to do changes. For optimal results we decided to upgrade to libx265 for format and AAC for audio. The program then runs subcommands for each ffmpeg which depending on the CPU determines the amount of time it takes to rewrite the file. If the program finishes or stop it requires the running cleaning the database and optimally run the delete format or manually doing it to update the database to not rerun the files already rewritten files.
Step 1: Setup and Retrieving Files
The setup was a little bit of a pain because I had to conversate with the customer of important data to them.
class fileObject:
source = ""
resolution = -1 # Check metadata
newFilename = ""
year = 0000
season = -1
episode = -1
endEpisode = -1
series = ""
episodeName = ""
def __init__(self, filepath, subtitles):
self.filepath = filepath
self.fullpath = os.path.abspath(filepath)
self.filename = os.path.basename(filepath)
self.parentFold = os.path.dirname(filepath)
self.extension = os.path.splitext(filepath)[1].strip() # Ex: .mkv
self.subtitles = subtitles
# Extra feature to remove files that start with ._ In addition to listing all files
def cleaningUp():
acceptedExt = [".mkv", ".mov", ".mp4", '.wmv', '.m4v', '.avi', '.flv', '.srt', '.ass', '.ssa', '.sub']
print("Starting Function to remove all files with a ._ Starting")
fileList = []
for root, dirs, files in os.walk(containingDir):
for name in files:
for x in acceptedExt:
if x in name:
fileList.append(os.path.join(root, name))
if name[0:2] == "._":
os.remove(os.path.join(root, name))
print("Removing file "+ name)
return fileList
def main():
os.chdir(containingDir)
fileList = cleaningUp()
for f in fileList:
absCurFilePath = os.path.abspath(f)
if isVideoFile(f):
myCurFile = extraction(absCurFilePath)
fileRecords.append(myCurFile)
else:
pass
Step 2: Creating new Filename
This one was simple in theory, just extract the data how hard could that be.. Apparently very hard. So many setbacks from the naming convention people used. Taking it one function at a time gave a very long program. Below is the code of those functions and what they do.
# Removes parenthesis and brackets
def removePB(str):
str = re.sub(r'\([^)]*\)', '', str)
str = re.sub(r'\[[^]]*\]', '' , str)
return str
# Removes blacklisted words
def removeBlacklistWords(phrase):
blackListWords = ['x264', 'WEB', 'Dual-Audio', 'HQ', 'BrRip','BDRip', 'Rip', 'BluRay', 'x265', 'AAC',
'DVD', 'RCVR', '10bit', 'Blu-ray', 'FLAC','Dual Audio','HEVC', 'MULTI-AUDIO', 'Subbed', '10-bit', 'HDTV',
'DTS-HD','Multi-Subs', 'CtrlHD']
for a in blackListWords: # Removing BlacklistedWords (Upper and lowercase)
phrase = smartReplace(a, phrase)
fullClean(phrase)
return phrase
# Replaces the Dots with spaces
def dotFix(filename):
newFilename = ""
split = filename.split(".")
for x in split:
x = x.strip()
if x != "":
newFilename += x + " "
return newFilename.strip()
def smartReplace(a, b): # Removing A from b if a is present
a = a.lower()
c = b.lower()
if a in c:
startIndex = c.index(a)
endIndex = startIndex + len(a)
return b[0:startIndex] + b[endIndex:]
else:
return b
I am most proud of smartReplace function finding all the things in different capitalization of words. Next step was the fun part of extraction based on various elements.
Step 3: Information Extraction
Feel free to refer to the class format or the writting to database to see what information is being extracted. We start out first with creating the file object
def createFileObject(subtitles, tags, absPath):
fileObj = fileObject(absPath, subtitles)
myPath3 = re.compile('S[0-9]+[ ]*E[0-9]+[-]*E[0-9]+',re.IGNORECASE).findall(absPath) # Season and Episode
myPath4 = re.compile('S[0-9]+[ ]*E[0-9]+',re.IGNORECASE).findall(absPath) # Season and Episode Alternate Version
if len(myPath3) == 1:
se = []
if "e" in myPath3[0]:
se = myPath3[0].split("e")
if "E" in myPath3[0]:
se = myPath3[0].split("E")
fileObj.season = int(se[0][1:])
fileObj.episode = int(se[1].strip("-"))
fileObj.endEpisode = int(se[2])
if len(myPath4) == 1:
se = []
if "e" in myPath4[0]:
se = myPath4[0].split("e")
elif "E" in myPath4[0]:
se = myPath4[0].split("E")
fileObj.season = int(se[0][1:])
fileObj.episode = int(se[1])
resolution = re.compile('[0-9]+p', re.IGNORECASE).findall(fileObj.filename) # Resolution
if len(resolution) == 1:
fileObj.resolution = resolution[0]
for x in tags: # Assigning the Tags
possibleYears = re.compile(r'([1-2][0-9]{3})').findall(x) # Year
known = len(resolution) == 1 or len(possibleYears) == 1 or len(myPath3) == 1 or len(myPath4) == 1
if len(possibleYears) >= 1:
totalPossible = re.compile(r'([1-2][0-9]{3})').findall(fileObj.filename)
if len(totalPossible) == len(resolution) + len(possibleYears):
if len(resolution) == 1 and resolution[0][0:-1] != possibleYears[0]:
fileObj.year = possibleYears[0]
else:
if len(resolution) == 1:
if resolution[0][0:-1] != possibleYears[0]:
for years in totalPossible:
if not(years == resolution[0][0:-1] or years == possibleYears[0]):
fileObj.year = years
else:
for years in totalPossible:
if years != possibleYears[0]:
fileObj.year = years
if not known: # Check for season
if x in absPath and x in os.path.dirname(absPath): # Season name
fileObj.seasonName = x
elif fileObj.source == "":
fileObj.source = x
return fileObj
We then move to the function to extract information needed from the name and parent folder name
def extraction(curFilePath):
curSubtitles = findSubTitles(curFilePath)
similarTags = getTags(os.path.dirname(curFilePath))
# Add spaces here for the periods/
myCurFile = createFileObject(curSubtitles, similarTags, curFilePath)
print("Going to do the process now but modifying")
myCurFile.getNewName()
myCurFile.writeToCsv()
return myCurFile
def isPossibleSubNames(filename, subtitleName):
fileBaseName = os.path.relpath(filename).split(".")[0]
subtitleNameParts = subtitleName.split(".")
if len(subtitleNameParts) == 3: # filename.ENG.srt
if len(subtitleNameParts[1]) == 3: # valid Subtitle name
return True
elif fileBaseName == subtitleNameParts[0]: # Exactly Same
return True
else: # Came up empty
return False
Then we move on to subfunctions to extract tags from the names.
# Gets the absolute tags with no exception
def getTags(parentDir):
files = os.listdir(parentDir)
filename1 = ""
filename2 = ""
for x in files:
if isVideoFile(x):
if filename1 == "":
filename1 = x
elif filename2 == "":
filename2 = x
elif int(random.random() * 10) == 5:
if random.random() < .5:
filename1 = x
else:
filename2 = x
possibleTags = []
actualTags = []
split1 = filename1.split(".")
split2 = filename1.split("[")
split3 = filename1.split("(")
if len(split1) > 2: # Not just filename and extension
for x in split1:
possibleTags.append(x)
if len(split2) > 0: # Contains []
for x in split2:
if "]" in x:
endInd = x.index("]")
possibleTags.append(x[0:endInd])
if len(split3) > 0: # Contains ()
for x in split3:
if ")" in x:
endInd = x.index(")")
possibleTags.append(x[0:endInd])
# Adding the season name
parentName = smartReplace("season", parentDir.split("\\")[-1])
parentName = removePB(parentName)
seriesMaybe = longest_Substring(parentName, filename1)
if seriesMaybe in filename2:
actualTags.append(seriesMaybe)
for x in possibleTags: # Verify they can be tags
if x in filename2:
tag = removeBlacklistWords(x)
actualTags.append(tag)
return actualTags
# Finds seperate subtitle files
def findSubTitles(filename):
filename = os.path.abspath(filename)
parentFold = os.path.dirname(filename)
curFold = os.listdir(parentFold)
subsList = []
for f in curFold:
if os.path.isdir(f):
newDir = os.listdir(f)
for x in newDir: # SubfolderFiles
if os.path.isdir(x):
loggingFuckups.write("Nested Loops with folder " + os.path.abspath(x) + "\n")
loggingFuckups.flush()
elif isPossibleSubNames(filename, os.path.relpath(x)):
subsList.append(x)
elif isVideoFile(filename):
pass
elif isPossibleSubNames(filename, os.path.relpath(f)):
subsList.append(f)
else:
print("Not a folder. Not a video. Not a valid subtitle. What are you then. Just invalid I guess")
return subsList
Infering information based on other factors
Finds longest similarity between 2 strings
def longest_Substring(s1,s2):
seq_match = SequenceMatcher(None,s1,s2)
match = seq_match.find_longest_match(0, len(s1), 0, len(s2))
# return the longest substring
if (match.size!=0):
return (s1[match.a: match.a + match.size])
else:
return ('Longest common sub-string not present')
Constructing new name from information gotten1
def getNewName(self):
newFilename = ""
modFilename = re.sub("\(.*?\)","",self.filename) # () Content
modFilename = re.sub("\[.*?\]", "", modFilename) # [] Content
modFilename = re.sub("S[0-9]+[ ]*E[0-9]+[-]*E[0-9]+","", modFilename, re.IGNORECASE) # Season and Episode
modFilename = re.sub("S[0-9]+[ ]*E[0-9]+", "", modFilename,re.IGNORECASE) # Season and Episode Alternate Version
modFilename = removeBlacklistWords(modFilename)
modFilename = modFilename.replace(str(self.resolution) + "p", "").replace(str(self.resolution) + "P", "") # Resolution
if not (self.series == ""):
modFilename = modFilename.replace(self.series, "")
modFilename = modFilename.strip("-")
newFilename = (newFilename + modFilename)
if self.year != -1:
newFilename = newFilename.replace(str(self.year), "")
newFilename = newFilename.replace(self.extension, "")
if not (self.season == -1 or self.episode == -1): # Has season or episode Number
preName = f'S{self.season:02d}E{self.episode:02d}'
if not self.endEpisode == -1:
preName += f'-E{self.endEpisode:02d}'
newFilename = preName + " - " + newFilename
newFilename = dotFix(newFilename)
newFilename = dashFix(newFilename)
self.newFilename = newFilename + self.extension
return self.newFilename
Step 4: Clean up
Small things got weird in formatting as so to account for that I included a few functions to deal with them. Feel free to take a dive on what they do more specifically.
# Clean up any other small things
def fullClean(line):
split = re.split(' +', line)
newLine = split[0] +" "
for i in range(1, len(split)):
addPart = True
if split[i] == '-':
if split[i-1] == '-':
addPart = False
elif split[i] == '[]' or split[i] == '()':
addPart = False
if i + 1 == len(split) and split[i] == '-': # Remove trailing '-'
addPart = False
if addPart:
newLine += split[i] + " "
return newLine.strip()
Step 5: Writting to database
Since we didn't have an actual database and required that the person verify manually that the entries were correct due to human difference, we decided to write to a csv file as our source of truth.
The hardest part was reading through all of the fffmpeg documentation and finding what was the most optimal setting for videos with the maximum amount of benefit.
# Main class
def runProgram():
f = open(containDir + 'database.csv', 'r')
lines = f.readlines()
info = input("Are all " + str(lines) + " lines in the excel correct (Y or N)")
if info.lower() == 'n':
print('That is a problem. Rerun later')
return
remux = Remux()
for l in lines:
split = l.split(',')
verified = split[13]
replaced = split[14]
if replaced.lower() != "true":
if verified.lower() == "true":
cmd = remux.build_fmpg_cmd(split, [])
if cmd != None:
subprocess.run(cmd)
else:
cmd['ssa'] = 'copy'
subprocess.run(cmd)
So after a lot of research upgrading to libx265 was the most optimal and variable framerate as we prioritized size over computing speed it would take. So I created my own remux for each file
class Remux:
def __init__(self):
self.FFMPEG_PATH = './ffmpeg.exe'
self.LANGUAGE_DEFAULT = "eng"
my_os = sys.platform
if my_os == 'linux':
self.FFMPEG_PATH = '/usr/bin/ffmpeg'
self.DEST_FILE_FORMAT = "mine1.mkv" # Format of the name Not
self.dest_fold = './'
def build_fmpg_cmd(self,split, subtitleList):
fullpath = split[0].replace("\"", "")
oldname = split[1]
newName = split[2]
ext = split[3]
source = split[4]
year = split[5]
season = split[6]
episode = split[7]
seriesName = split[8]
episodeName = split[9]
resolution = split[10]
tune = split[11]
customArguments = ['-vcodec', 'libx264','-present', 'fast', '-vsync','vfr']
if tune != '':
customArguements.add('-tune')
customArguements.add(tune)
mapsArgument = ['-map','0:v?', '-map', '0:a?', '-map', '0:s?']
copyArguments = ['-c:a', 'aac', '-c:s', 'ssa', '-c:t', 'copy']
outputFileArguments = ['-y']
if seriesName == '':
pass
else:
if not os.path.exists('./' + seriesName):
os.mkdir('./' + seriesName)
if ext != '.mkv': # Conversion
tmp = os.path.splitext(newName)
outputFileArguments.append(seriesName + "\\" + tmp[0] + ".mkv")
else:
outputFileArguments.append(seriesName + "\\" + newName)
outputFileArguments = ['-y', newName]
command = [self.FFMPEG_PATH, '-i', fullpath]
subsArgument = []
metaArgument = []
defaultTrack = []
if year != -1:
metaArgument.append('-metadata')
metaArgument.append("year=" + str(year))
if tune != '':
customArguments.append('-tune')
customArguments.append(tune)
if resolution != -1:
customArguments.append('-vf')
customArguments.append("scale=-1:" + resolution.replace("p","").replace("P", ""))
for index, subtitle in enumerate(subtitleList):
mapsArgument += ['-map', str(index + 1) + ':0'] # Is this the old arguements?
subsArgument += ['-i', subtitle]
# check if subtitle has language before extension
subNameSplitted = subtitle.split('.')
subNameSplitted.pop(-1)
language = subNameSplitted[-1]
metaArgument += ['-metadata:s:s:' + str(index), 'language=' + language]
if (self.LANGUAGE_DEFAULT in language):
defaultTrack = ['-disposition:s:' + str(index), 'default']
command += customArguments + subsArgument + mapsArgument + metaArgument + copyArguments + defaultTrack + outputFileArguments
return command
Close runner up command was changing the buffer and max frame rates. Also a notable fix was instead of assuming the first stream was video so changed it to read video.
Step 7: Clean up
Once they are created another copy we need to verify that it was correct as some gave me a < 10 byte file which was not right and I would manually check 1 from each season of a series.
# After running this you will need to rename the files to match the new file
def cleanDatabase(dest_fold):
print("Going through database.csv and writting an updated version to database1.csv")
files = os.listdir(dest_fold)
g = open(containDir + 'database.csv','r')
lines = g.readlines()
rewritten = []
for f in files:
if os.path.getsize(f) > 100: # Replaced sucessfully
print(os.path.basename(f))
rewritten.append(os.path.basename(f))
for o in range(0, len(lines)):
l = lines[o]
split = l.split(',')
name = split[2]
ind = -1
for i in range(0, len(rewritten)):
if rewritten[i] == name:
ind = i
# print(split[0:-1],"TRUE")
if ind != -1:
lines[o] = lines[o].replace("FALSE", "TRUE")
print(rewritten)
h = open(containDir + 'database1.csv', 'w')
h.writelines(lines)
h.flush()
Then the optional of getting rid of the old version of the file
def deleteRewritten(): # Then the database file will be called left so renaming is needed
f = open(containDir + "database.csv", "r")
g = open(containDir + 'left.csv', 'w')
lines = f.readlines()
print("Rewritting the files from database.csv to left.csv")
for l in lines:
split = l.split(',')
# print(split[-1])
if split[-1].strip().lower() == "true":
print("Remove File " , split[0])
os.remove(split[0])
else:
g.write(l)
g.flush()
Final Version
Thanks to alpha tester Arturo I was able to finish this project for many more edge cases.
Notes
Have to have series names for consideration because that is the folder it is placed into