The idea came from constantly hearing people annoyed by calls from spam callers. That, and the car extended warranty meme. The first thought was to get the database. Then I would create my own local version, motivated by how long it took online services to look up a number. It also seemed that some entries in the database were clearly flags or incorrectly entered information.
Overview
I love that this project was straightforward. I was essentially grabbing from an API and creating my own database. That simple. Everything else came from my determination to optimize it further. I will say this is one of my proudest projects so far.
Step 1: Outlining
It was easy to outline what needed to be accomplished in the project. That is where I created the main class, and the following classes just fell into place.
Main Code
The main driving user interface code
def main():
    print("Welcome to the Do Not Call Database Builder")
    running = True
    search = False
    while running:
        print("Here is what we can do")
        print("0. Help MEEE.")
        print("1. Create a database")
        print("2. Update the Database")
        print("3. More data in the database")
        print("4. Crush Database")
        print("5. Clean Database")
        print("6. Mega Good-Bad ReBase")
        print("7. Prepare Searchable Database")
        print("8. Search Number")
        print("9. Quit")
        selection = 3
        try:
            if waitBeforeStart:
                time.sleep(600)
            selection = int(input("What are we doing: "))
        except ValueError:
            print("That doesn't seem to be an int")
        if selection == 0:
            print("Help Selection")
            print("Please select which one you need help explaining")
            helpMenu()  # Do the dialog here of what each one is for
        if selection == 1:
            print("Creating the database")
            print(initalizeDatabase())
        elif selection == 2:
            print("Updating the database")
            print(updateDatabase(line.strip()))
        elif selection == 3:
            print("Adding more data")
            print(FullDayData())
        elif selection == 4:
            print("Crushing Database")
            fileLog.write("[" + getTimeNow() + "] Selection: Crush Database\n")
            fileLog.flush()
            crushDatabase()
        elif selection == 5:
            print("Cleaning Database")
            dupCleanDatabase()  # Duplicates in toRead file
        elif selection == 6:
            print("Merging all done files")
            rebase()
        elif selection == 7:
            print("Preparing Database")
            prepDatabase()
        elif selection == 8:
            print("Done Building")
            print("Search Number")
        elif selection == 9:
            print("Quitting")
            running = False
            search = False
        else:
            print("That is not an option")
Class Names
The separate classes that I used were the following:
Create.py
Clean.py
Crush.py
Recal.py
Richer.py
Update.py
Precautions
The issue that comes up is the API. Because we depend on an API, we have to verify that responses are correct and that requests are going out at an acceptable rate. This API in particular blocks a key after about 50 requests. For that reason I am not using 1 key but rotating through 50 keys, which takes much longer to catch. Even if every key eventually gets blocked, with a 1-second break between requests that still means roughly 50 keys * 50 requests * 50 entries, about 50x more than a single key would get.
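To make the rotation idea concrete, here is a minimal sketch rather than the project's exact code; the file name DncApiKey.txt matches what initalizeDatabase reads later, but callRotated, the per-request pause, and the URL shape are illustrative assumptions.

import time
import requests

# Illustrative sketch of the key rotation: read the keys from DncApiKey.txt
# and spread the requests across them instead of burning one key out.
def callRotated(urls):
    with open("DncApiKey.txt", "r") as f:
        keys = [k.strip() for k in f if k.strip()]
    results = []
    for i, url in enumerate(urls):
        key = keys[i % len(keys)]                    # rotate through the keys
        response = requests.get(url + "&api_key=" + key)
        if response.status_code == 200:
            results.append(response.json())
        time.sleep(1)                                # 1-second break between requests
    return results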
This is my most professional and complex project so far, so the code handles a lot of cases that may come up. This is due to how little control you have once the run is going. With this much data you want to be able to fix things midway instead of being stuck starting all over.
API response
The verification based on the response status code
def validResponse(statusCode):
    if statusCode == 200:
        return True
    elif statusCode == 429:
        print("Rate has been exceeded")
    elif statusCode == 404:
        print("API server cannot be reached")
    elif statusCode == 403:
        print("Api key missing or invalid")
    elif statusCode == 400:
        print("URL does not use HTTPS")
    return False
Suspended
What is done when a key is suspended: I just wait and remove the key.
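There is no dedicated snippet for that step in the write-up, so this is only a hypothetical sketch of the wait-and-remove idea.

import time

# Hypothetical helper: drop a suspended key from the rotation, wait a while,
# then carry on with the remaining keys.
def handleSuspendedKey(keys, suspendedKey, waitSeconds=600):
    if suspendedKey in keys:
        keys.remove(suspendedKey)
    time.sleep(waitSeconds)
    return keys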
Step 2: Building the Database
Here I am building the initial database, so we assume the database doesn't exist or could not be found. There could also be a case where the computer did not finish the building phase because a key expired, was blocked, or power was lost. Many things could happen that we have to account for.
Initializing Database
Initially starting to write to the database
def initalizeDatabase():
    f = open("DncApiKey.txt", "r")
    lines = f.readlines()
    print(lines)
    offset = 1  # 1 to skip today because that is not published
    if os.path.exists("Done/toRead.txt"):
        # Read last line to know where we left off
        with open('Done/toRead.txt', 'rb') as f:
            f.seek(-2, os.SEEK_END)
            while f.read(1) != b'\n':
                f.seek(-2, os.SEEK_CUR)
            last_line = f.readline().decode()
        currentDay = datetime.today()
        offset = dateOffset(last_line.split(" ")[0], currentDay.strftime("%Y-%m-%d"))  # Could do without split but just want data neater
        # Add a part to continue where left off in the toRead file
    for line in lines:
        line = line.strip()
        offset = createDatabase(line, offset)
        createdDatabase = offset == -1
        if not createdDatabase:
            print("Fuck that didn't work")
            offset += 1
        else:
            print("That was successful and we created the database")
        time.sleep(5)  # Probably wait a little time and write where stopped
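initalizeDatabase calls a dateOffset helper that isn't shown in these snippets; here is a minimal sketch of what it presumably does, assuming it returns the number of days between the last recorded date and today in the %Y-%m-%d format used everywhere else.

from datetime import datetime

# Assumed behaviour: how many days lie between the last date written to
# toRead.txt and today, both formatted as %Y-%m-%d.
def dateOffset(lastDate, today):
    last = datetime.strptime(lastDate, "%Y-%m-%d")
    now = datetime.strptime(today, "%Y-%m-%d")
    return (now - last).days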
Create Folder
This function is used a lot: it creates a folder if it does not exist, which is a common error to overlook.
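The function itself isn't included in the snippets, so here is a minimal sketch of what it presumably does.

import os

# Assumed implementation: create the folder only if it doesn't already exist.
def createFolder(path):
    if not os.path.exists(path):
        os.makedirs(path)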
This continues from whatever point we left off, which can be the start, and moves toward the current day. After a few attempts, I found that the records start on 2020-02-14. It then writes that day's data to a file in the 'Done' folder.
def createDatabase(dncApiKey, offset):
    baseUrl = "https://api.ftc.gov/v0/dnc-complaints?api_key="
    createFolder("Done")
    print("We are grabbing the days all the way back to February 14, 2020")
    currentDay = datetime.today()
    curDay = open("Done/lastDate.txt", "w")  # Last date updated
    curDay.write(currentDay.strftime("%Y-%m-%d"))
    curDay.flush()
    curDay.close()
    moreDays = True
    subDays = open("Done/toRead.txt", "a")  # For the sub parts
    while moreDays:
        d = currentDay - timedelta(days=offset)
        form = d.strftime("%Y-%m-%d")
        print(form)
        if os.path.exists(str("Base\\" + form + ".json")):
            print(form + " already exists")
            offset += 1
            continue  # Don't want to recall a file that already exists
        response = requests.get(baseUrl + dncApiKey + "&created_date=\"" + form + "\"")
        if form == "2020-02-14":
            moreDays = False
        if not validResponse(response.status_code):
            subDays.close()
            print("Exiting at " + form)
            return offset  # If done
Cleaning Data
Assuming we got successful data, it is a lot of data that I don't need because I am building a simple map. I parse out just the information that is needed and get a record count. The record count is critical later on.
def cleanJson(data):
    del data['meta']
    del data['links']
    for element in data['data']:
        del element['type']
        del element['relationships']
        del element['meta']
        del element['links']
        element['number'] = element['attributes']['company-phone-number']
        element['area-code'] = element['attributes']['consumer-area-code']
        del element['attributes']
    return data
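cleanJson throws the meta block away, so the record count mentioned above has to be read before cleaning. This is a hedged sketch of that, assuming the count lives somewhere under the response's meta; the 'records-this-page' key is a guess, with a fallback to simply counting the entries.

# Hypothetical: grab the record count from the raw response before cleanJson()
# deletes its meta block. The 'records-this-page' field name is a guess.
def recordCount(data):
    try:
        return int(data['meta']['records-this-page'])
    except (KeyError, TypeError, ValueError):
        return len(data.get('data', []))  # fall back to counting the entries directly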
Step 3: Cleaning and Validating
During the process we use a few methods over and over. This keeps things as organized as possible so that, in the end, we don't have a lot of extra files. We do have a lot during the run, but these methods strip extra information out of the records and validate them to get rid of invalid entries.
Clean toRead
When I process another day I overwrite the old toRead file: the previous toRead.txt is renamed to .old (replacing any older .old copy), and a deduplicated version is written in its place.
f =open("Done/toRead.txt", "r")lines = f.readlines()f.close()if os.path.exists("Done/toRead.txt.old"): send2trash.send2trash("Done/toRead.txt.old")os.rename("Done/toRead.txt", "Done/toRead.txt.old")mySet =set([])# Get rid of duplicatesfor line in lines: mySet.add(line)# Write cleaned Readmef2 =open("Done/toRead.txt", "w")for s in mySet: f2.write(s)f2.flush()f2.close()
Clean JSON
There are a lot of JSON files, and once they are crushed we don't need them anymore, so I get rid of the whole folder that contained them. This is part of the process of compressing the folders into one file.
for file in doneJson:
    if ".json" in file:
        dateString = file.replace(".json", "")
        if os.path.exists("Base/" + file):
            send2trash.send2trash("Base/" + file)
        if os.path.exists("Base/" + dateString):
            send2trash.send2trash("Base/" + dateString)
    else:
        # Not a json file. Ignore
        continue
Clean Records
This cleans up each record to make sure there are no invalid numbers in the final records. Invalid numbers are mostly user error. I could suggest that someone fix the input on their end, but that most likely won't change.
def clean(phoneNumber):
    # I could have just done non digits. Wow
    phoneNumber = phoneNumber.replace("+", "").replace("-", "").replace("(", "").replace(")", "")
    if len(phoneNumber) == 10:
        return str(phoneNumber)
    elif len(phoneNumber) == 7:
        return "area missing"
    elif len(phoneNumber) < 10:
        return "too short"
    else:
        # Could potentially be a double hit on a number
        return "too long"
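The comment in clean() alludes to just stripping non-digits; that alternative is a one-liner with a regular expression.

import re

# Alternative to the chained replace() calls: drop everything that isn't a digit.
def stripNonDigits(phoneNumber):
    return re.sub(r"\D", "", phoneNumber)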
Rewinding
Going back and writing a file for each day. Technically this is both building and validating, as there can be invalid days.
while moreDays:
    d = datetime.today() - timedelta(days=offset)
    form = d.strftime("%Y-%m-%d")
    if os.path.exists(str("Base\\" + form + ".json")):
        print(form + " already exists")
        offset -= 1
        continue  # Don't want to recall a file that already exists
    if offset <= 1:  # Last possible day
        moreDays = False
    response = requests.get(baseUrl + dncApiKey + "&created_date=\"" + form + "\"")
    if not validResponse(response.status_code):
        subDays.close()
        print("Exiting at " + form)
        # Add to the log file information about where we left off
        return "Please check Log File for more detail"  # If done
Compare Cleaning
On the rare occasion that compare.csv has been created and you want to clean it afterward, I created a method for that. It is not part of the main flow, but it may help if someone forgot to clean at an earlier stage.
for line in lines:
    splits = line.strip().split(",")
    newLine = splits[0] + "," + splits[1]  # Removing area
    if splits[0].isdigit():  # Check if valid number
        curNumb = int(splits[0])
        if curNumb <= 2000000000:
            if curNumb == 0:
                zeros += 1
            else:
                rem += 1
        else:
            # Valid Number
            newArray.append(newLine)
            sortingArray.append(int(splits[0]))
Step 4: Enriching
The problem is that I needed more information. Right now I have only grabbed each day's first page and nothing beyond it. This means I need to enrich the data, but that happens after updating everything. This is where it gets a little confusing. The algorithm could be simplified, but that would only be practical with unlimited space, and I am trying to keep the whole project under 1 GB, which is tough.
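The enrichment code itself isn't shown here, but since each response carries a links block (deleted later by cleanJson), one way to get past the first page is to keep following the next link. This is only a sketch under the assumption that the API pages that way; the 'next' key is not confirmed by the project code.

import requests

# Hypothetical paging loop: keep following links['next'] until there is no next page.
# Assumes JSON:API-style paging, which the presence of the 'links' block suggests.
def fetchAllPages(firstUrl):
    pages = []
    url = firstUrl
    while url:
        response = requests.get(url)
        if response.status_code != 200:
            break
        data = response.json()
        pages.append(data)
        url = (data.get('links') or {}).get('next')  # None ends the loop
    return pages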
Finding folder
Finding the folder we left off on, because we might not have crushed the database yet.
for line in lines:  # Check to see if the file already exists so I don't waste time
    splits = line.split(" ")
    offsetCount = 0
    currentDir = "Base/" + splits[0]
    if os.path.exists(currentDir):  # Check if everything is there
        if os.path.exists(currentDir + "/" + splits[1] + ".json"):  # Last file exists
            print("Skip this folder because it is already done")
            continue
        else:
            # Continue where left off
            print("Continue at the biggest number file in that folder")
            subFiles = os.listdir(currentDir)
            offsetCount = maximum(subFiles)
Finding File
We don't want to overwrite work and waste time, so once we have verified it is the right folder we need to find the last file.
def maximum(fileList):
    numbers = []
    if len(fileList) <= 1:
        # Might be an issue if only 1 file was written
        return 0  # Just rewrite the file
    for name in fileList:
        if "-" in name and ".json" in name:
            name = name.replace(".json", "").split("-")[1]
            numbers.append(int(name))
        else:
            print("We have a problem with file " + name)
            input()
    return max(numbers)
Step 5: Updating Database
This part runs whenever we have to update the database. It is the most troublesome part because we have to reload all the information. I know it is inefficient, but it was the best option at the time. Adding in a rebase made the file smaller and faster to access.
This updates the good and bad number files with their corresponding numbers.
def reBase():
    baseDir = "Done/Completed/"
    mergeFiles = os.listdir(baseDir)
    goodNumbers = []  # Good number dict
    badNumbers = []   # Bad number dict
    goodNumb = []     # Same area code
    badNumb = []      # Different area code
    for fi in mergeFiles:
        # Load numbers
        try:
            with open(baseDir + fi, "r") as read_file:
                developer = json.load(read_file)
            for x in developer:
                n = Node(x)
                # print("this")
                if n.good:
                    goodNumbers.append(x)  # Technically can delete area-code
                    goodNumb.append(n)
                else:
                    badNumbers.append(x)
                    badNumb.append(n)
        except:
            print("the file is having an issue " + fi)
            input("Input something to continue ")
    # Go through the numbers and sort
    print(str(len(goodNumb)))
    print(str(len(badNumb)))
    # Turning dict to file
    with open('badNumbers.json', 'w') as fp:
        json.dump(badNumbers, fp)
    with open('goodNumbers.json', 'w') as fp:
        json.dump(goodNumbers, fp)
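reBase leans on a Node class with a good flag that isn't shown in the snippets. Here is a sketch of what it might look like, under the assumption that good means the complaint's phone number shares the consumer's area code, which the goodNumb / badNumb comments hint at.

# Hypothetical reconstruction of the Node class used by reBase(). The rule for
# 'good' (the number starts with the consumer's area code) is an assumption.
class Node:
    def __init__(self, record):
        self.number = str(record['number'])
        self.areaCode = str(record['area-code'])
        self.good = self.number.startswith(self.areaCode)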
Step 6: Transferring
Sometimes people want a compressed version, and I liked the CSV format for viewing and sorting. It did get cut off most of the time, but that was okay. This part changed a lot, but for the most part I liked having a CSV to compare both good and bad numbers.
JSON to CSV
Preparing a combined CSV database
def prepDatabase():
    compsF = open("compare.csv", "w")
    compsF.write("Number,Id,Area\n")
    compsF.flush()
    filename = "goodNumbers.json"
    try:
        with open(filename, "r") as read_file:
            developer = json.load(read_file)
        for x in developer:
            newStr = x['number'] + "," + x['id'] + "," + x['area-code']
            compsF.write(newStr + "\n")
        filename = "badNumbers.json"
        with open(filename, "r") as read_file:
            developer = json.load(read_file)
        for x in developer:
            newStr = x['number'] + "," + x['id'] + "," + x['area-code']
            compsF.write(newStr + "\n")
    except:
        print(filename + " is having an issue")
    compsF.flush()
    compsF.close()
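The search step later reads compareSort.csv, which none of the snippets above actually produce; this is a hypothetical sketch of sorting compare.csv into that file by phone number.

# Hypothetical: sort compare.csv by the number column into compareSort.csv,
# the file searchNumber() expects to find.
def sortCompare():
    with open("compare.csv", "r") as f:
        header = f.readline()
        lines = f.readlines()
    lines.sort(key=lambda ln: int(ln.split(",")[0]) if ln.split(",")[0].isdigit() else 0)
    with open("compareSort.csv", "w") as f:
        f.write(header)
        f.writelines(lines)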
Step 7: Crushing
I really like to call this step crushing because it takes everything and compresses it together. You may ask why not just call it compressing; that is a fair point, but I already have compression algorithms in the project and don't like the confusion. This is where we take everything and crush it down. It can follow any step because of its nature; most of the time it follows the enrichment, but not always.
Merging
Merging the files together in a folder.
for a in ar:  # All the files being merged
    try:
        with open(("Base\\" + foldername + "\\" + a), "r") as read_file:
            developer = json.load(read_file)
        for x in developer['data']:
            dex.append(x)
    except:
        return "Leaving the file alone because an issue occurred"
Double-check
Just want to double-check that everything is right in case we didn't call cleaning earlier
for x in developer['data']:
    number = clean(x['number'])
    if number == "too short" or number == "too long":
        print(x)
        continue
    elif number == "area missing":
        x['number'] = x['area-code'] + number
    dex.append(x)
Step 8: Recall
What the hell would I be doing if I forgot to recall the phone numbers? That is the whole point of the algorithm/project.
Queueing Entries
Loading up what the user wants into the objective numbers
def searchNumber():
    phoneNumber = "000-000-0000"
    done = False
    allNumbers = []
    filename = "compareSort.csv"
    searchAll = "Y"  # Including bad numbers; Implement later on
    if os.path.exists(filename):
        print("Reading in " + filename)
    else:
        print("That file doesn't exist. Check and run the function again")
    if os.path.exists("goodNumbers.json"):
        print("good file exists")
    else:
        print("good file doesn't exist")
    if os.path.exists("badNumbers.json"):
        print("bad file exists")
    else:
        print("bad file doesn't exist")
    if os.path.exists("goodNumbers.json") and os.path.exists("badNumbers.json"):
        print("Combining files to hopefully solve the issue")
        # Call prep database here
    listNumbers = []
    dupCount = 0
    with open(filename, "r") as read_file:
        lines = read_file.readlines()
    lines.remove(lines[0])  # Remove the header
    for line in lines:
        splits = line.strip().split(",")
        listNumbers.append(line)
        allNumbers.append(splits)
    print("Loaded " + str(len(listNumbers)) + " entries")
    print("Duplicate Count " + str(dupCount))
Finding entries
I could have written this with a binary search but a linear search was fast enough for me. Most of the time was spent on building the database.
if len(idList) > 0:
    print("Found " + str(len(idList)) + " entries for the number " + phoneNumber)
    extendedInfo = "Y"
    extended = input("Would you like all the information (Y/N)? ")
    if extended.lower() == "y":
        for i in range(0, len(idList)):
            print("Id " + str(i) + " : " + idList[i])
            getRef(idList[i])
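The linear pass that fills idList isn't shown above; here is a minimal sketch of what it could look like over the allNumbers rows loaded by searchNumber, assuming the Number,Id,Area column order from compare.csv.

# Hypothetical linear search: collect the complaint ids whose number column
# matches the cleaned phone number.
def findIds(allNumbers, phoneNumber):
    target = clean(phoneNumber)      # reuse the record cleaner from the cleaning step
    idList = []
    for splits in allNumbers:        # each row is [number, id, area]
        if splits and splits[0] == target:
            idList.append(splits[1])
    return idList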
Finally
In the end, this project took at least two weeks because the API keys kept getting paused, and I made a mistake halfway through that deleted the whole database. Also, every action performed while running the program is logged. That is how I kept track of where an issue occurred and at what step. I did not remove the logging, since I am the only one testing it, and I barely tested it because two weeks is already a long time for a project.
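The fileLog handle and getTimeNow helper used in the main menu aren't defined in any snippet; this is a minimal sketch of that logging setup, as an assumption of how it might look.

from datetime import datetime

# Assumed logging setup matching how main() uses fileLog and getTimeNow().
fileLog = open("log.txt", "a")

def getTimeNow():
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")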
Files
I have included the files so people don't have to go all the way back through the process themselves.