It all started with my friend complaining about the lack of space on his drives, and I thought there had to be a way to fix that problem. At first, I wrote a comparison program, but it only saved a few GB out of his 12 TB. While mapping the drive, I noticed a lot of videos, and they were all over the place. The end goal was running a NAS server with all his files, which required a certain standardization of the files. Inadvertently, during the process, I found a way to compress the files at an insane rate.
Overview
The program starts at a directory and traverses it and its subdirectories, collecting the video files and, through some checks, treating the folder name as the potential series name. From those we come up with a standardized naming format for the files and export the results to an Excel document for manual user review. Once the user has reviewed the entries and made modifications, they change the second-to-last field to true and switch over to the other Python file to apply the changes. For optimal results we decided to upgrade to libx265 for the video format and AAC for audio. The program then runs an ffmpeg subcommand for each file; how long each rewrite takes depends on the CPU. Whether the program finishes or is stopped, you need to run the database cleanup and, optionally, run the delete function (or do it manually), so that the database is updated and files that were already rewritten are not rerun.
Step 1: Setup and Retrieving Files
The setup was a little bit of a pain because I had to talk with the customer about which data was important to them.
import os

# containingDir, fileRecords, isVideoFile and extraction are defined elsewhere in the program.


class fileObject:
    source = ""
    resolution = -1  # Check metadata
    newFilename = ""
    year = 0
    season = -1
    episode = -1
    endEpisode = -1
    series = ""
    episodeName = ""

    def __init__(self, filepath, subtitles):
        self.filepath = filepath
        self.fullpath = os.path.abspath(filepath)
        self.filename = os.path.basename(filepath)
        self.parentFold = os.path.dirname(filepath)
        self.extension = os.path.splitext(filepath)[1].strip()  # Ex: .mkv
        self.subtitles = subtitles


# Extra feature to remove files that start with ._ in addition to listing all files
def cleaningUp():
    acceptedExt = (".mkv", ".mov", ".mp4", ".wmv", ".m4v", ".avi", ".flv", ".srt", ".ass", ".ssa", ".sub")
    print("Starting function to remove all files that start with ._")
    fileList = []
    for root, dirs, files in os.walk(containingDir):
        for name in files:
            path = os.path.join(root, name)
            if name.startswith("._"):
                # Delete ._ files and keep them out of the returned list
                os.remove(path)
                print("Removing file " + name)
            elif name.lower().endswith(acceptedExt):
                # Match on the real extension so each file is listed once
                fileList.append(path)
    return fileList


def main():
    os.chdir(containingDir)
    fileList = cleaningUp()
    for f in fileList:
        absCurFilePath = os.path.abspath(f)
        if isVideoFile(f):
            myCurFile = extraction(absCurFilePath)
            fileRecords.append(myCurFile)
Step 2: Creating new Filename
This one was simple in theory: just extract the data, how hard could that be? Apparently very hard. There were so many setbacks from the naming conventions people used. Taking it one function at a time gave a very long program. Below is the code of those functions and what they do.
import re


# Removes parentheses and brackets (and whatever is inside them)
def removePB(text):
    text = re.sub(r'\([^)]*\)', '', text)
    text = re.sub(r'\[[^]]*\]', '', text)
    return text


# Removes blacklisted words
def removeBlacklistWords(phrase):
    blackListWords = ['x264', 'WEB', 'Dual-Audio', 'HQ', 'BrRip', 'BDRip', 'Rip', 'BluRay', 'x265', 'AAC',
                      'DVD', 'RCVR', '10bit', 'Blu-ray', 'FLAC', 'Dual Audio', 'HEVC', 'MULTI-AUDIO',
                      'Subbed', '10-bit', 'HDTV', 'DTS-HD', 'Multi-Subs', 'CtrlHD']
    for a in blackListWords:  # Removing blacklisted words (upper and lowercase)
        phrase = smartReplace(a, phrase)
    # Keep the cleaned result; the original call discarded it
    return fullClean(phrase)


# Replaces the dots with spaces
def dotFix(filename):
    newFilename = ""
    split = filename.split(".")
    for x in split:
        x = x.strip()
        if x != "":
            newFilename += x + " "
    return newFilename.strip()


def smartReplace(a, b):
    # Removes the first occurrence of a from b, ignoring case
    a = a.lower()
    c = b.lower()
    if a in c:
        startIndex = c.index(a)
        endIndex = startIndex + len(a)
        return b[0:startIndex] + b[endIndex:]
    else:
        return b
I am most proud of the smartReplace function, which finds a word no matter how it is capitalized. The next step was the fun part: extraction based on various elements.
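As written, smartReplace removes only the first match it finds. The snippet below is a small alternative sketch, not the code used in the project, that removes every occurrence regardless of capitalization with the standard `re` module. The name `remove_all_ci` is made up for this example.

```python
import re


def remove_all_ci(word, text):
    """Remove every occurrence of `word` from `text`, ignoring case."""
    # re.escape keeps characters such as '-' or '.' in the word literal.
    return re.sub(re.escape(word), "", text, flags=re.IGNORECASE)


cleaned = remove_all_ci("x264", "Show.X264.1080p.x264")
```

The leftover dots would still need a pass through something like dotFix afterwards.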
Step 3: Information Extraction
Feel free to refer to the class format or the section on writing to the database to see what information is being extracted. We start out by creating the file object.
def createFileObject(subtitles, tags, absPath):
    fileObj = fileObject(absPath, subtitles)
    myPath3 = re.compile('S[0-9]+[ ]*E[0-9]+[-]*E[0-9]+', re.IGNORECASE).findall(absPath)  # Season and Episode
    myPath4 = re.compile('S[0-9]+[ ]*E[0-9]+', re.IGNORECASE).findall(absPath)  # Season and Episode Alternate Version
    if len(myPath3) == 1:
        # Split on 'e' or 'E' in one go so mixed-case names also work
        se = re.split('e', myPath3[0], flags=re.IGNORECASE)
        fileObj.season = int(se[0][1:])
        fileObj.episode = int(se[1].strip("-"))
        fileObj.endEpisode = int(se[2])
    if len(myPath4) == 1:
        se = re.split('e', myPath4[0], flags=re.IGNORECASE)
        fileObj.season = int(se[0][1:])
        fileObj.episode = int(se[1])
    resolution = re.compile('[0-9]+p', re.IGNORECASE).findall(fileObj.filename)  # Resolution
    if len(resolution) == 1:
        fileObj.resolution = resolution[0]
    for x in tags:  # Assigning the Tags
        possibleYears = re.compile(r'([1-2][0-9]{3})').findall(x)  # Year
        known = len(resolution) == 1 or len(possibleYears) == 1 or len(myPath3) == 1 or len(myPath4) == 1
        if len(possibleYears) >= 1:
            totalPossible = re.compile(r'([1-2][0-9]{3})').findall(fileObj.filename)
            if len(totalPossible) == len(resolution) + len(possibleYears):
                if len(resolution) == 1 and resolution[0][0:-1] != possibleYears[0]:
                    fileObj.year = possibleYears[0]
            else:
                if len(resolution) == 1:
                    if resolution[0][0:-1] != possibleYears[0]:
                        for years in totalPossible:
                            if not (years == resolution[0][0:-1] or years == possibleYears[0]):
                                fileObj.year = years
                else:
                    for years in totalPossible:
                        if years != possibleYears[0]:
                            fileObj.year = years
        if not known:  # Check for season
            if x in absPath and x in os.path.dirname(absPath):  # Season name
                fileObj.seasonName = x
            elif fileObj.source == "":
                fileObj.source = x
    return fileObj
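The season and episode parsing can also be expressed with a single pattern and named groups. This is only a sketch of the same idea, not the project's code: the names `SEASON_EPISODE` and `parse_season_episode` are hypothetical, and it assumes the same naming styles the patterns above target (S01E02, s01 e02, S01E02-E03).

```python
import re

# Hypothetical pattern covering S01E02, "s01 e02" and ranges such as S01E02-E03.
SEASON_EPISODE = re.compile(
    r"S(?P<season>\d+)\s*E(?P<episode>\d+)(?:-*E(?P<end>\d+))?",
    re.IGNORECASE,
)


def parse_season_episode(path):
    """Return (season, episode, end_episode); parts that are missing are -1."""
    match = SEASON_EPISODE.search(path)
    if match is None:
        return (-1, -1, -1)
    end = match.group("end")
    return (
        int(match.group("season")),
        int(match.group("episode")),
        int(end) if end is not None else -1,
    )
```

Using -1 for missing parts mirrors the defaults in the fileObject class, so the result could be assigned straight onto a file object.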
We then move on to the function that extracts the information we need from the file name and the parent folder name.
def extraction(curFilePath):
    curSubtitles = findSubTitles(curFilePath)
    similarTags = getTags(os.path.dirname(curFilePath))  # Add spaces here for the periods/
    myCurFile = createFileObject(curSubtitles, similarTags, curFilePath)
    print("Going to do the process now but modifying")
    # getNewName and writeToCsv are fileObject methods that are not shown in this post
    myCurFile.getNewName()
    myCurFile.writeToCsv()
    return myCurFile


def isPossibleSubNames(filename, subtitleName):
    fileBaseName = os.path.relpath(filename).split(".")[0]
    subtitleNameParts = subtitleName.split(".")
    if len(subtitleNameParts) == 3:  # filename.ENG.srt
        if len(subtitleNameParts[1]) == 3:  # valid Subtitle name
            return True
    elif fileBaseName == subtitleNameParts[0]:  # Exactly Same
        return True
    # Came up empty
    return False
Then we move on to subfunctions to extract tags from the names.
import random


# Gets the absolute tags with no exception
def getTags(parentDir):
    files = os.listdir(parentDir)
    filename1 = ""
    filename2 = ""
    for x in files:
        if isVideoFile(x):
            if filename1 == "":
                filename1 = x
            elif filename2 == "":
                filename2 = x
            elif int(random.random() * 10) == 5:
                # Occasionally swap in another file as a comparison sample
                if random.random() < .5:
                    filename1 = x
                else:
                    filename2 = x
    possibleTags = []
    actualTags = []
    split1 = filename1.split(".")
    split2 = filename1.split("[")
    split3 = filename1.split("(")
    if len(split1) > 2:  # Not just filename and extension
        for x in split1:
            possibleTags.append(x)
    if len(split2) > 0:  # Contains []
        for x in split2:
            if "]" in x:
                endInd = x.index("]")
                possibleTags.append(x[0:endInd])
    if len(split3) > 0:  # Contains ()
        for x in split3:
            if ")" in x:
                endInd = x.index(")")
                possibleTags.append(x[0:endInd])
    # Adding the season name (basename works with both \ and / separators)
    parentName = smartReplace("season", os.path.basename(os.path.normpath(parentDir)))
    parentName = removePB(parentName)
    seriesMaybe = longest_Substring(parentName, filename1)
    if seriesMaybe in filename2:
        actualTags.append(seriesMaybe)
    for x in possibleTags:  # Verify they can be tags
        if x in filename2:
            tag = removeBlacklistWords(x)
            actualTags.append(tag)
    return actualTags


# Finds separate subtitle files
def findSubTitles(filename):
    filename = os.path.abspath(filename)
    parentFold = os.path.dirname(filename)
    curFold = os.listdir(parentFold)
    subsList = []
    for f in curFold:
        fPath = os.path.join(parentFold, f)  # listdir returns bare names
        if os.path.isdir(fPath):
            newDir = os.listdir(fPath)
            for x in newDir:  # SubfolderFiles
                if os.path.isdir(os.path.join(fPath, x)):
                    loggingFuckups.write("Nested Loops with folder " + os.path.join(fPath, x) + "\n")
                    loggingFuckups.flush()
                elif isPossibleSubNames(filename, os.path.relpath(x)):
                    subsList.append(x)
        elif isVideoFile(f):  # Check the listed entry, not the video we started from
            pass
        elif isPossibleSubNames(filename, os.path.relpath(f)):
            subsList.append(f)
        else:
            print("Not a folder. Not a video. Not a valid subtitle. What are you then. Just invalid I guess")
    return subsList
Inferring information based on other factors
from difflib import SequenceMatcher


# Finds longest similarity between 2 strings
def longest_Substring(s1, s2):
    seq_match = SequenceMatcher(None, s1, s2)
    match = seq_match.find_longest_match(0, len(s1), 0, len(s2))
    # return the longest substring
    if match.size != 0:
        return s1[match.a: match.a + match.size]
    else:
        return 'Longest common sub-string not present'
Small things got weird in the formatting, so to account for that I included a few functions to deal with them. Feel free to take a dive into what they do more specifically.
# Clean up any other small things
def fullClean(line):
    split = re.split(' +', line)
    newLine = split[0] + " "
    for i in range(1, len(split)):
        addPart = True
        if split[i] == '-':
            if split[i - 1] == '-':
                addPart = False
        elif split[i] == '[]' or split[i] == '()':
            addPart = False
        if i + 1 == len(split) and split[i] == '-':  # Remove trailing '-'
            addPart = False
        if addPart:
            newLine += split[i] + " "
    return newLine.strip()
Step 5: Writing to database
Since we didn't have an actual database, and since the person had to verify manually that the entries were correct (people name things differently), we decided to write to a CSV file as our source of truth.
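One thing to watch with a CSV source of truth is that file names can contain commas, which breaks a plain split(','). The sketch below shows how Python's standard `csv` module handles quoting on both the write and the read side. The column names in `FIELDS` and the helper names are assumptions made for this example; the real database.csv has more fields.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
FIELDS = ["filepath", "newFilename", "verified", "replaced"]


def write_rows(handle, rows):
    """Write a header plus one line per row, quoting fields where needed."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def pending_rows(handle):
    """Yield rows the user verified that have not been rewritten yet."""
    for row in csv.DictReader(handle):
        verified = row["verified"].strip().lower() == "true"
        replaced = row["replaced"].strip().lower() == "true"
        if verified and not replaced:
            yield row


# In-memory demo; a real run would open database.csv with newline="".
buffer = io.StringIO()
write_rows(buffer, [
    {"filepath": "a, b.mkv", "newFilename": "Show - S01E01.mkv", "verified": "TRUE", "replaced": "FALSE"},
    {"filepath": "c.mkv", "newFilename": "Show - S01E02.mkv", "verified": "FALSE", "replaced": "FALSE"},
])
buffer.seek(0)
todo = list(pending_rows(buffer))
```

Here `todo` holds only the first row, and its file path keeps the comma intact.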
The hardest part was reading through all of the FFmpeg documentation and finding the settings that gave videos the most benefit.
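import subprocess


# Main entry point (Remux and containDir are defined elsewhere in the program)
def runProgram():
    with open(containDir + 'database.csv', 'r') as f:
        lines = f.readlines()
    # Show the number of lines, not the whole list
    info = input("Are all " + str(len(lines)) + " lines in the excel correct (Y or N)")
    if info.lower() == 'n':
        print('That is a problem. Rerun later')
        return
    remux = Remux()
    for l in lines:
        split = l.split(',')
        verified = split[13].strip()
        replaced = split[14].strip()  # last field carries the trailing newline
        if replaced.lower() != "true":
            if verified.lower() == "true":
                cmd = remux.build_fmpg_cmd(split, [])
                if cmd is not None:
                    subprocess.run(cmd)
                else:
                    # The original tried cmd['ssa'] = 'copy' here, which cannot work when cmd is None
                    print("No command could be built for " + split[0])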
After a lot of research, upgrading to libx265 with a variable frame rate turned out to be the best option, since we prioritized file size over the computing time it would take. So I created my own remux for each file.
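The Remux class itself is not shown here, so the following is only a rough sketch of what building such a command could look like, not the project's actual settings. The function name `build_x265_cmd` and the default `crf` and `preset` values are assumptions; the frame rate and buffer options mentioned in the text are left out on purpose.

```python
def build_x265_cmd(src, dst, crf=28, preset="medium"):
    """Build an ffmpeg argument list that re-encodes video to H.265 and audio to AAC."""
    return [
        "ffmpeg",
        "-i", src,
        "-map", "0:v:0",    # first video stream, wherever it sits in the container
        "-map", "0:a?",     # every audio stream; the '?' tolerates files without audio
        "-c:v", "libx265",
        "-crf", str(crf),   # assumed quality target; higher values give smaller files
        "-preset", preset,  # slower presets trade CPU time for size
        "-c:a", "aac",
        dst,
    ]


cmd = build_x265_cmd("input.mkv", "output.mkv")
```

The list can be passed straight to subprocess.run(cmd, check=True). Passing a list instead of a single string avoids quoting problems with spaces in file names.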
A close runner-up was a command that changed the buffer and max frame rate settings. Another notable fix: instead of assuming that the first stream was the video stream, I changed the command to look for the video stream explicitly.
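One way to find the video stream instead of assuming it is stream 0 is to ask ffprobe for the stream list as JSON and pick the first entry whose codec_type is video. This is a sketch under that assumption, not the code from the project; `ffprobe_cmd` and `first_video_index` are hypothetical helper names.

```python
import json


def ffprobe_cmd(path):
    """Argument list asking ffprobe for each stream's index and type as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "stream=index,codec_type",
        "-of", "json",
        path,
    ]


def first_video_index(ffprobe_json):
    """Return the index of the first video stream, or None if there is none."""
    streams = json.loads(ffprobe_json).get("streams", [])
    for stream in streams:
        if stream.get("codec_type") == "video":
            return stream["index"]
    return None


# Hand-written sample of the JSON shape, with the audio stream listed first.
sample = '{"streams": [{"index": 0, "codec_type": "audio"}, {"index": 1, "codec_type": "video"}]}'
video_index = first_video_index(sample)
```

In a real run the JSON would come from subprocess.run(ffprobe_cmd(path), capture_output=True, text=True).stdout.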
Step 7: Clean up
Once the new copies are created, we need to verify that each one is correct, because some runs gave me a file of less than 10 bytes, which was clearly not right. On top of that, I would manually check one file from each season of a series.
# After running this you will need to rename the files to match the new file
def cleanDatabase(dest_fold):
    print("Going through database.csv and writing an updated version to database1.csv")
    files = os.listdir(dest_fold)
    with open(containDir + 'database.csv', 'r') as g:
        lines = g.readlines()
    rewritten = []
    for f in files:
        # listdir returns bare names, so join with the folder before checking size
        if os.path.getsize(os.path.join(dest_fold, f)) > 100:  # Replaced successfully
            print(os.path.basename(f))
            rewritten.append(os.path.basename(f))
    for o in range(0, len(lines)):
        split = lines[o].split(',')
        name = split[2]
        if name in rewritten:
            lines[o] = lines[o].replace("FALSE", "TRUE")
    print(rewritten)
    with open(containDir + 'database1.csv', 'w') as h:
        h.writelines(lines)
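The same size check can be turned around to list the outputs that look broken, so they can be rerun before anything is deleted. This is a minimal sketch: the function name `suspicious_outputs` is made up, and the 100 byte threshold is taken from the check in cleanDatabase rather than from any property of the video format.

```python
import os


def suspicious_outputs(dest_fold, min_bytes=100):
    """Return the names of files in dest_fold that are min_bytes or smaller."""
    bad = []
    for name in sorted(os.listdir(dest_fold)):
        path = os.path.join(dest_fold, name)
        # Only regular files are checked; subfolders are skipped.
        if os.path.isfile(path) and os.path.getsize(path) <= min_bytes:
            bad.append(name)
    return bad
```

A size check only catches outputs that are nearly empty, so the manual spot check of one file per season is still worth doing.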
Then comes the optional step of getting rid of the old version of each file.
def deleteRewritten():
    # Then the database file will be called left so renaming is needed
    with open(containDir + "database.csv", "r") as f:
        lines = f.readlines()
    print("Rewriting the files from database.csv to left.csv")
    with open(containDir + 'left.csv', 'w') as g:
        for l in lines:
            split = l.split(',')
            # print(split[-1])
            if split[-1].strip().lower() == "true":
                print("Remove File ", split[0])
                os.remove(split[0])
            else:
                g.write(l)
Final Version
Thanks to alpha tester Arturo, I was able to make this project handle many more edge cases.
Notes
Series names have to be taken into consideration, because the series name is the folder each file is placed into.