Video Standardization and Compression

Origin

It all started with a friend complaining about running out of space on his drives, and I figured there had to be a way to fix that. At first I wrote a comparison program, but it only saved a few GB out of his 12 TB. While mapping the drive I noticed a lot of videos, and they were all over the place. The end goal was a NAS server holding all of his files, which required the files to follow a certain standard. Along the way I inadvertently found a way to compress the files by an insane rate.

Overview

The program starts at a directory and walks through it and its subdirectories, collecting the video files and, after some checks, treating the containing folder as a potential series name. From that information it builds a standardized name for each file and exports everything to a CSV file that can be opened in Excel for manual review. Once the user has reviewed the entries and made any modifications, they change the second-to-last field to true and switch over to the other Python file to apply the changes. For optimal results we decided to convert video to libx265 and audio to AAC. The program then runs an ffmpeg subprocess for each file; how long each rewrite takes depends on the CPU. Whether the program finishes or is stopped partway, the database then needs to be cleaned, and ideally the delete step should be run (or the deletion done manually), so that the database is up to date and already rewritten files are not processed again.

Step 1: Setup and Retrieving Files

The setup was a bit of a pain because I had to talk with the customer about which data was important to them.

import os

class fileObject:
	source = ""
	resolution = -1 # Check metadata
	newFilename = ""
	year = -1 # -1 means unknown, like the other numeric fields
	season = -1
	episode = -1
	endEpisode = -1
	series = ""
	episodeName = ""
	def __init__(self, filepath, subtitles):
		self.filepath = filepath
		self.fullpath = os.path.abspath(filepath)
		self.filename = os.path.basename(filepath)
		self.parentFold = os.path.dirname(filepath)
		self.extension = os.path.splitext(filepath)[1].strip() # Ex: .mkv
		self.subtitles = subtitles

# Removes files that start with ._ and returns a list of all video and subtitle files
def cleaningUp():
	acceptedExt = (".mkv", ".mov", ".mp4", ".wmv", ".m4v", ".avi", ".flv", ".srt", ".ass", ".ssa", ".sub")
	print("Removing all files that start with ._")
	fileList = []
	for root, dirs, files in os.walk(containingDir):
		for name in files:
			if name.startswith("._"): # Delete the file once and keep it out of the list
				os.remove(os.path.join(root, name))
				print("Removing file " + name)
			elif name.lower().endswith(acceptedExt): # Match the extension only, not any substring
				fileList.append(os.path.join(root, name))
	return fileList

def main():
	os.chdir(containingDir)
	fileList = cleaningUp()
	for f in fileList:
		absCurFilePath = os.path.abspath(f)
		if isVideoFile(f):
			myCurFile = extraction(absCurFilePath)
			fileRecords.append(myCurFile)
		else:
			pass
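
The `isVideoFile` helper that `main` calls is not shown in this write-up. A minimal sketch of what it could look like, assuming it only checks the extension against the video formats in `acceptedExt` (the subtitle formats left out); the `VIDEO_EXTENSIONS` name is my own:

```python
import os

# Assumption: the video formats are the entries of acceptedExt minus the subtitle formats.
VIDEO_EXTENSIONS = {".mkv", ".mov", ".mp4", ".wmv", ".m4v", ".avi", ".flv"}

def isVideoFile(path):
    """Return True when the path ends with a known video extension."""
    ext = os.path.splitext(path)[1].lower()
    return ext in VIDEO_EXTENSIONS

print(isVideoFile("Show.S01E01.MKV"))
print(isVideoFile("Show.S01E01.eng.srt"))
```

Comparing the extension rather than searching the whole name keeps a file such as `mkv_notes.txt` from being treated as a video.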

Step 2: Creating new Filename

This one was simple in theory: just extract the data, how hard could that be? Apparently very hard. The naming conventions people used caused many setbacks. Taking it one function at a time made for a very long program. Below are those functions and what they do.

# Removes parentheses and brackets along with their contents
def removePB(text): # Renamed from str so the builtin is not shadowed
	text = re.sub(r'\([^)]*\)', '', text)
	text = re.sub(r'\[[^]]*\]', '', text)
	return text
# Removes blacklisted words
def removeBlacklistWords(phrase):
	blackListWords = ['x264', 'WEB', 'Dual-Audio', 'HQ', 'BrRip','BDRip', 'Rip', 'BluRay', 'x265', 'AAC', 
				   'DVD', 'RCVR', '10bit', 'Blu-ray', 'FLAC','Dual Audio','HEVC', 'MULTI-AUDIO', 'Subbed', '10-bit', 'HDTV', 
				   'DTS-HD','Multi-Subs', 'CtrlHD']
	for a in blackListWords: # Removing BlacklistedWords (Upper and lowercase)
		phrase = smartReplace(a, phrase)
	return fullClean(phrase) # Return the cleaned result instead of discarding it

# Replaces the Dots with spaces
def dotFix(filename):
	newFilename = ""
	split = filename.split(".")
	for x in split:
		x = x.strip()
		if x != "":
			newFilename += x + " "
	return newFilename.strip()

def smartReplace(a, b): # Removes the first occurrence of a from b, ignoring case
	a = a.lower()
	c = b.lower()
	if a in c:
		startIndex = c.index(a)
		endIndex = startIndex + len(a)
		return b[0:startIndex] + b[endIndex:]
	else:
		return b

I am most proud of the smartReplace function, which finds a word regardless of how it is capitalized. The next step was the fun part: extracting information based on various elements.
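
As written, smartReplace removes only the first match it finds. If a tag can appear more than once in a name, a regular expression can remove every occurrence in one pass. This is a small sketch of that alternative, not the code the project uses; the function name is hypothetical:

```python
import re

def remove_all_ignore_case(word, phrase):
    """Remove every occurrence of word from phrase, ignoring case."""
    # re.escape keeps characters such as '-' or '.' in the word from acting as regex syntax.
    return re.sub(re.escape(word), "", phrase, flags=re.IGNORECASE)

print(remove_all_ignore_case("x264", "Show.X264.1080p.x264.mkv"))
```

The leftover double dots would still need a pass through something like dotFix afterwards.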

Step 3: Information Extraction

Feel free to refer to the class definition or the section on writing to the database to see what information is being extracted. We start by creating the file object.

def createFileObject(subtitles, tags, absPath):
	fileObj = fileObject(absPath, subtitles) 
	# Capture groups return the numbers directly, so there is no splitting on "e"/"E" (which broke on mixed case)
	myPath3 = re.compile(r'S([0-9]+)[ ]*E([0-9]+)[-]*E([0-9]+)', re.IGNORECASE).findall(absPath) # Season and Episode range
	myPath4 = re.compile(r'S([0-9]+)[ ]*E([0-9]+)', re.IGNORECASE).findall(absPath) # Season and Episode Alternate Version
	if len(myPath3) == 1:
		fileObj.season = int(myPath3[0][0])
		fileObj.episode = int(myPath3[0][1])
		fileObj.endEpisode = int(myPath3[0][2])
	if len(myPath4) == 1:
		fileObj.season = int(myPath4[0][0])
		fileObj.episode = int(myPath4[0][1])
	resolution = re.compile('[0-9]+p', re.IGNORECASE).findall(fileObj.filename) # Resolution
	if len(resolution) == 1:
		fileObj.resolution = resolution[0]
	for x in tags: # Assigning the Tags
		possibleYears = re.compile(r'([1-2][0-9]{3})').findall(x) # Year
		known = len(resolution) == 1 or len(possibleYears) == 1 or len(myPath3) == 1 or len(myPath4) == 1
		if len(possibleYears) >= 1:
			totalPossible = re.compile(r'([1-2][0-9]{3})').findall(fileObj.filename)
			if len(totalPossible) == len(resolution) + len(possibleYears):
				if len(resolution) == 1 and resolution[0][0:-1] != possibleYears[0]:
					fileObj.year = possibleYears[0]
			else:
				if len(resolution) == 1:
					if resolution[0][0:-1] != possibleYears[0]:
						for years in totalPossible:
							if not(years == resolution[0][0:-1] or years == possibleYears[0]):
								fileObj.year = years
				else:
					for years in totalPossible:
						if years != possibleYears[0]:
							fileObj.year = years
		if not known: # Check for season
			if x in absPath and x in os.path.dirname(absPath): # Season name
				fileObj.seasonName = x
			elif fileObj.source == "":
				fileObj.source = x
	return fileObj
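
To make the season and episode matching easier to follow in isolation, here is a small standalone sketch of the same idea. The `parse_episode` name and the combined pattern are my own, and the sample file names are made up:

```python
import re

# Matches S01E02 and S01E02-E03 in any capitalization, with optional spaces before the E.
EPISODE_PATTERN = re.compile(r"S(\d+)\s*E(\d+)(?:-*E(\d+))?", re.IGNORECASE)

def parse_episode(text):
    """Return (season, episode, end_episode), or None when nothing matches.

    end_episode is -1 when the name covers a single episode.
    """
    match = EPISODE_PATTERN.search(text)
    if match is None:
        return None
    season, episode, end = match.groups()
    return (int(season), int(episode), int(end) if end else -1)

print(parse_episode("Show s01e02-e03 720p.mkv"))
print(parse_episode("Show.S10E05.mkv"))
print(parse_episode("Movie 2010.mkv"))
```

A single pattern with an optional third group covers both the range form and the single-episode form, so the two cases cannot disagree with each other.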

We then move on to the function that extracts the needed information from the file name and the parent folder name.

def extraction(curFilePath):
	curSubtitles = findSubTitles(curFilePath)
	similarTags = getTags(os.path.dirname(curFilePath))
	# Add spaces here for the periods/
	myCurFile = createFileObject(curSubtitles, similarTags, curFilePath)
	print("Going to do the process now but modifying")
	myCurFile.getNewName()
	myCurFile.writeToCsv()
	return myCurFile
	
def isPossibleSubNames(filename, subtitleName):
	fileBaseName = os.path.basename(filename).split(".")[0] # Compare names only, not folders
	subtitleNameParts = os.path.basename(subtitleName).split(".")
	if len(subtitleNameParts) == 3 and len(subtitleNameParts[1]) == 3: # filename.ENG.srt with a valid language code
		return True
	if fileBaseName == subtitleNameParts[0]: # Exactly Same
		return True
	return False # Came up empty (always a bool now, never None)

Then we move on to subfunctions to extract tags from the names.

# Gets the absolute tags with no exception
def getTags(parentDir):
	files = os.listdir(parentDir)
	filename1 = ""
	filename2 = ""
	for x in files:
		if isVideoFile(x):
			if filename1 == "":
				filename1 = x
			elif filename2 == "":
				filename2 = x
			elif int(random.random() * 10) == 5:
				if random.random() < .5:
					filename1 = x
				else:
					filename2 = x
	possibleTags = []
	actualTags = []
	split1 = filename1.split(".")
	split2 = filename1.split("[")
	split3 = filename1.split("(")
	if len(split1) > 2: # Not just filename and extension
		for x in split1:
			possibleTags.append(x)
	if len(split2) > 0: # Contains []
		for x in split2:
			if "]" in x:
				endInd = x.index("]")
				possibleTags.append(x[0:endInd])
	if len(split3) > 0: # Contains ()
		for x in split3:
			if ")" in x:
				endInd = x.index(")")
				possibleTags.append(x[0:endInd])
	# Adding the season name
	parentName = smartReplace("season", os.path.basename(os.path.normpath(parentDir))) # Last folder name, using the platform's separator
	parentName = removePB(parentName)
	seriesMaybe = longest_Substring(parentName, filename1)
	
	if seriesMaybe in filename2:
		actualTags.append(seriesMaybe)
	for x in possibleTags: # Verify they can be tags
		if x in filename2:
			tag = removeBlacklistWords(x)
			actualTags.append(tag)
	return actualTags
	
# Finds separate subtitle files
def findSubTitles(filename):
	filename = os.path.abspath(filename)
	parentFold = os.path.dirname(filename)
	subsList = []
	for f in os.listdir(parentFold):
		fPath = os.path.join(parentFold, f) # listdir returns bare names, so build the full path
		if os.path.isdir(fPath):
			for x in os.listdir(fPath): # SubfolderFiles
				xPath = os.path.join(fPath, x)
				if os.path.isdir(xPath):
					loggingFuckups.write("Nested Loops with folder " + xPath + "\n")
					loggingFuckups.flush()
				elif isPossibleSubNames(filename, x):
					subsList.append(xPath)
		elif isVideoFile(f): # Check the listed file, not the video we are searching for
			pass
		elif isPossibleSubNames(filename, f):
			subsList.append(fPath)
		else:
			print("Not a folder. Not a video. Not a valid subtitle. What are you then. Just invalid I guess")
	return subsList

Inferring information based on other factors

from difflib import SequenceMatcher

# Finds the longest common substring of 2 strings
def longest_Substring(s1, s2):
	seq_match = SequenceMatcher(None, s1, s2)
	match = seq_match.find_longest_match(0, len(s1), 0, len(s2))
	# return the longest substring
	if match.size != 0:
		return s1[match.a: match.a + match.size]
	else:
		return 'Longest common sub-string not present'
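
To show what this comparison gives in practice, here is a small standalone variant with a usage example. The folder and file names are made up, and unlike the function above this sketch returns an empty string when nothing matches, which is my own choice:

```python
from difflib import SequenceMatcher

def longest_common_substring(s1, s2):
    """Return the longest block of characters shared by s1 and s2 ('' if none)."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return s1[match.a:match.a + match.size]

folder = "Space Show"
filename = "[Grp] Space Show - 01 (1080p).mkv"
print(longest_common_substring(folder, filename))
```

Returning an empty string makes a later `in` or truthiness check simpler than comparing against a message string.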

Constructing the new name from the extracted information

def getNewName(self):
	newFilename = ""
	modFilename = re.sub(r"\(.*?\)", "", self.filename) # () Content
	modFilename = re.sub(r"\[.*?\]", "", modFilename) # [] Content
	# flags must be passed by keyword: the 4th positional argument of re.sub is count
	modFilename = re.sub(r"S[0-9]+[ ]*E[0-9]+[-]*E[0-9]+", "", modFilename, flags=re.IGNORECASE) # Season and Episode
	modFilename = re.sub(r"S[0-9]+[ ]*E[0-9]+", "", modFilename, flags=re.IGNORECASE) # Season and Episode Alternate Version
	modFilename = removeBlacklistWords(modFilename)
	if self.resolution != -1: # resolution already includes its suffix, e.g. "1080p"
		modFilename = re.sub(re.escape(str(self.resolution)), "", modFilename, flags=re.IGNORECASE) # Resolution
	if not (self.series == ""):
		modFilename = modFilename.replace(self.series, "")
	modFilename = modFilename.strip("-")
	newFilename = (newFilename + modFilename)
	if self.year not in (-1, 0): # Skip the "unknown" defaults so stray digits are not stripped
		newFilename = newFilename.replace(str(self.year), "")
	newFilename = newFilename.replace(self.extension, "")
	if not (self.season == -1 or self.episode == -1): # Has season or episode Number
		preName = f'S{self.season:02d}E{self.episode:02d}'
		if not self.endEpisode == -1:
			preName += f'-E{self.endEpisode:02d}'
		newFilename = preName + " - " + newFilename
	newFilename = dotFix(newFilename)
	newFilename = dashFix(newFilename)
	self.newFilename = newFilename + self.extension
	return self.newFilename
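
The target format itself is easier to see without the cleanup steps around it. A minimal sketch, assuming the season, episode and title have already been extracted; the function name and sample titles are made up:

```python
def format_episode_name(season, episode, title, extension, end_episode=-1):
    """Build 'SxxEyy[-Ezz] - Title.ext' from already extracted fields."""
    prefix = f"S{season:02d}E{episode:02d}"
    if end_episode != -1:  # Multi-episode file, e.g. S03E10-E11
        prefix += f"-E{end_episode:02d}"
    return f"{prefix} - {title.strip()}{extension}"

print(format_episode_name(1, 2, " Pilot ", ".mkv"))
print(format_episode_name(3, 10, "Finale", ".mkv", end_episode=11))
```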

Step 4: Clean up

A few small things got weird in the formatting, so I included a few functions to deal with them. Feel free to dive into what they do more specifically.

# Clean up any other small things
def fullClean(line):
	split = re.split(' +', line)
	newLine = split[0] +" "
	for i in range(1, len(split)):
		addPart = True
		if split[i] == '-':
			if split[i-1] == '-':
				addPart = False
		elif split[i] == '[]' or split[i] == '()':
			addPart = False
		if i + 1 == len(split) and split[i] == '-': # Remove trailing '-'
			addPart = False
		if addPart:
			newLine += split[i] + " "
	return newLine.strip()

Step 5: Writing to database

Since we didn't have an actual database, and a person had to verify manually that the entries were correct because of how differently people name their files, we decided to write to a CSV file as our source of truth.

def writeToCsv(self):
	dumpData = True
	# 15 columns, matching the row written below; the conversion step reads column 11 as tune
	header = "fullPath,oldname,newname,extension,source,year,season,episode,seriesName,episodeName,resolution,tune,toReplace,verified,replaced\n"
	newFilename = self.newFilename
	extension = self.extension
	if newFilename == "":
		newFilename = self.getNewName()
	print(newFilename)
	lineWrite = "\"" + self.fullpath+ "\"," 
	lineWrite += "\""+ self.filename + "\",\"" + newFilename + "\",\"" + extension + "\"," + self.source + "," + str(self.year) + "," + str(self.season) + "," + str(self.episode) + ",\"" + self.series + "\",\"" + self.episodeName + "\"," + str(self.resolution) + ",,True,False,False\n"
	try:
		loggingChanges.write(lineWrite)
		loggingChanges.flush()
	except:
		pass
	if dumpData:
		try:
			metaDump = FFPROBE_PATH + " -i \"" + self.filename + "\" -print_format json -hide_banner"
			dumps.write(self.filename + ",\"" + metaDump + "\"\n")
			dumps.flush()
		except:
			pass

Step 6: Video Conversion

The hardest part was reading through all of the ffmpeg documentation and finding the settings that gave the most benefit for these videos.

import csv
import subprocess

# Main function
def runProgram():
	with open(containDir + 'database.csv', 'r', newline='') as f:
		rows = list(csv.reader(f)) # csv.reader handles quoted fields that contain commas
	info = input("Are all " + str(len(rows)) + " lines in the excel correct (Y or N)")
	if info.lower() == 'n':
		print('That is a problem. Rerun later')
		return
	remux = Remux()
	for split in rows:
		if len(split) < 15: # Skip blank or incomplete rows
			continue
		verified = split[13].strip()
		replaced = split[14].strip()
		if replaced.lower() != "true" and verified.lower() == "true":
			cmd = remux.build_fmpg_cmd(split, [])
			if cmd is not None: # The old else branch indexed into None and would have crashed
				subprocess.run(cmd)

After a lot of research, upgrading to libx265 with a variable frame rate turned out to be the best option, since we prioritized file size over the computing time it would take. So I created my own remux step for each file.

import os
import sys

class Remux:
	def __init__(self):
		self.FFMPEG_PATH = './ffmpeg.exe'
		self.LANGUAGE_DEFAULT = "eng"
		if sys.platform.startswith('linux'):
			self.FFMPEG_PATH = '/usr/bin/ffmpeg'
		self.DEST_FILE_FORMAT = "mine1.mkv" # Format of the name Not
		self.dest_fold = './'

	def build_fmpg_cmd(self, split, subtitleList):
		# Quotes are stripped in case the row was split by hand rather than parsed as csv
		fullpath = split[0].replace("\"", "")
		oldname = split[1]
		newName = split[2].replace("\"", "")
		ext = split[3].replace("\"", "")
		source = split[4]
		year = split[5]
		season = split[6]
		episode = split[7]
		seriesName = split[8].replace("\"", "")
		episodeName = split[9]
		resolution = split[10]
		tune = split[11]
		# libx265 as described in the text; the option is -preset, not -present
		customArguments = ['-vcodec', 'libx265', '-preset', 'fast', '-vsync', 'vfr']
		mapsArgument = ['-map', '0:v?', '-map', '0:a?', '-map', '0:s?']
		copyArguments = ['-c:a', 'aac', '-c:s', 'ssa', '-c:t', 'copy']
		if ext != '.mkv': # Conversion: the output is written as .mkv
			newName = os.path.splitext(newName)[0] + ".mkv"
		if seriesName != '': # Place the file in a folder named after the series
			if not os.path.exists('./' + seriesName):
				os.mkdir('./' + seriesName)
			newName = os.path.join(seriesName, newName)
		outputFileArguments = ['-y', newName] # Built once; it used to be overwritten after the folder logic
		command = [self.FFMPEG_PATH, '-i', fullpath]
		subsArgument = []
		metaArgument = []
		defaultTrack = []
		if year not in ('', '-1', '0'): # Fields read from the csv are strings
			metaArgument.append('-metadata')
			metaArgument.append("year=" + str(year))
		if tune != '':
			customArguments.append('-tune')
			customArguments.append(tune)
		if resolution not in ('', '-1'):
			customArguments.append('-vf')
			# -2 keeps the aspect ratio and rounds the width to an even number, which the encoder needs
			customArguments.append("scale=-2:" + resolution.replace("p","").replace("P", ""))
		for index, subtitle in enumerate(subtitleList):
			mapsArgument += ['-map', str(index + 1) + ':0'] # Is this the old arguements?
			subsArgument += ['-i', subtitle]
			# check if subtitle has language before extension
			subNameSplitted = subtitle.split('.')
			subNameSplitted.pop(-1)
			language = subNameSplitted[-1]
			metaArgument += ['-metadata:s:s:' + str(index), 'language=' + language]
			if (self.LANGUAGE_DEFAULT in language):
				defaultTrack = ['-disposition:s:' + str(index), 'default']
		# Extra inputs (-i) have to come before the output options
		command += subsArgument + customArguments + mapsArgument + metaArgument + copyArguments + defaultTrack + outputFileArguments
		return command

A close runner-up was changing the buffer and maximum frame rate settings. Another notable fix: instead of assuming the first stream is the video stream, the command now selects the video streams explicitly.

Step 7: Clean up

Once the new copies are created, we need to verify that they are correct, because some conversions gave me a file smaller than 10 bytes, which was clearly not right. I would also manually check one file from each season of a series.

# After running this you will need to rename the files to match the new file
def cleanDatabase(dest_fold):
	print("Going through database.csv and writing an updated version to database1.csv")
	with open(containDir + 'database.csv', 'r') as g:
		lines = g.readlines()
	rewritten = []
	for f in os.listdir(dest_fold):
		if os.path.getsize(os.path.join(dest_fold, f)) > 100: # Replaced successfully
			print(f)
			rewritten.append(f)
	for o in range(0, len(lines)):
		split = lines[o].split(',')
		if len(split) < 3: # Skip blank or incomplete lines
			continue
		name = split[2].strip('"') # Names are quoted in the csv
		if name in rewritten:
			# Only flip the last column (replaced) and leave the other flags untouched
			lines[o] = ",".join(split[:-1]) + ",TRUE\n"
	print(rewritten)
	with open(containDir + 'database1.csv', 'w') as h:
		h.writelines(lines)
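
The size check can also be run on its own to list the outputs that need a second look before anything is marked as replaced. A minimal sketch, assuming a flat output folder; the function name, the 100-byte threshold as a parameter, and the temporary demo files are my own:

```python
import os
import tempfile

def find_suspicious_outputs(folder, min_bytes=100):
    """Return the names of regular files in folder that are smaller than min_bytes."""
    suspicious = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)  # getsize needs the full path, not the bare name
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            suspicious.append(name)
    return suspicious

# Demo with two throwaway files: one nearly empty, one large enough to pass.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "broken.mkv"), "wb") as handle:
        handle.write(b"\x00" * 5)
    with open(os.path.join(folder, "ok.mkv"), "wb") as handle:
        handle.write(b"\x00" * 500)
    result = find_suspicious_outputs(folder)
print(result)
```

A size threshold only catches conversions that failed outright, so the manual spot check per season is still worthwhile.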
	
	

Then there is the optional step of getting rid of the old version of each file.

def deleteRewritten(): # Then the database file will be called left so renaming is needed
	print("Rewriting the entries from database.csv to left.csv")
	with open(containDir + "database.csv", "r") as f, open(containDir + 'left.csv', 'w') as g:
		for l in f:
			split = l.split(',')
			if split[-1].strip().lower() == "true":
				path = split[0].strip('"') # Paths are quoted in the csv
				print("Remove File ", path)
				os.remove(path)
			else:
				g.write(l)

Final Version

Thanks to alpha tester Arturo, I was able to finish this project and handle many more edge cases.

Notes

  • Series names are required, because the series name is the folder each file is placed into
