Do Not Call Database

Grabbing the FTC Do Not Call complaint database to detect spam callers

Origin

The idea came from people around me constantly getting annoyed by spam callers. That, and the car extended warranty meme. My first thought was to get the database itself. Then I would create my own local version, mainly because of how long it took online services to look up a number. It also seemed that some entries in the database were clearly flagged or incorrectly entered information.

Overview

I love that this project was straightforward: I was essentially grabbing data from an API and building my own database. That simple. Everything else came from my determination to optimize it further. I will say this is one of my proudest projects so far.

Step 1: Outlining

It was easy to outline what needed to be accomplished in the project. That is where I created the main class, and the following classes just fell into place.

Main Code

The main driving user interface code

def main():
    print("Welcome to the Do Not Call Database Builder")
    running = True
    search = False
    while running:
        print("Here is what we can do")
        print("0. Help MEEE.")
        print("1. Create a database")
        print("2. Update the Database")
        print("3. More data in the database")
        print("4. Crush Database")
        print("5. Clean Database")
        print("6. Mega Good-Bad ReBase")
        print("7. Prepare Searchable Database")
        print("8. Search Number")
        print("9. Quit")

        selection = -1  # Default to an invalid option so bad input doesn't trigger a menu action
        try:
            if waitBeforeStart:  # waitBeforeStart is defined elsewhere in the script
                time.sleep(600)
            selection = int(input("What are we doing: "))
        except ValueError:
            print("That doesn't seem to be an int")
        if selection == 0:
            print("Help Selection")
            print("Please select which one you need help explaining")
            helpMenu()
            # Do the dialog here of what each one is for
        elif selection == 1:
            print("Creating the database")
            print(initalizeDatabase())
        elif selection == 2:
            print("Updating the database")
            print(updateDatabase(line.strip()))  # 'line' presumably comes from the API key file read elsewhere
        elif selection == 3:
            print("Adding more data")
            print(FullDayData())
        elif selection == 4:
            print("Crushing Database")
            fileLog.write("[" + getTimeNow() + "] Selection: Crush Database\n")
            fileLog.flush()
            crushDatabase()
        elif selection == 5:
            print("Cleaning Database")
            dupCleanDatabase() # Duplicates in toRead file
        elif selection == 6:
            print("Merging all done files")
            rebase()
        elif selection == 7:
            print("Preparing Database")
            prepDatabase()
        elif selection == 8:
            print("Done Building")
            print("seach Number")
        elif selection == 9:
            print("Quitting")
            running = False
            search = False
        else:
            print("That is not an option")

Class Names

The separate classes I used were Create.py, Clean.py, Crush.py, Recal.py, Richer.py, and Update.py.

Precautions

The main issue is the API itself. Because we depend on an API, we have to verify that responses are correct and that requests go out at an acceptable rate. This API in particular blocks a key after roughly 50 requests. For that reason, I am not using 1 key but rotating through 50 keys, so any single key takes much longer to get caught. Even with a 1-second break between requests, 50 keys x 50 requests x 50 entries per response comes out to about 125,000 entries per pass, roughly 50x the throughput of a single key.
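
To make the rotation concrete, here is a minimal sketch (fetchDays is a hypothetical helper, not a function from the project; the keys are assumed to live one per line in DncApiKey.txt, as in the initialization code below, and the 1-second pause is illustrative):

import itertools
import time
import requests

BASE_URL = "https://api.ftc.gov/v0/dnc-complaints?api_key="  # Same base URL the project uses

def fetchDays(days, keys, pause=1.0):
    # Round-robin through the keys so no single key absorbs every request
    # and trips the 50-request limit on its own.
    keyCycle = itertools.cycle(keys)
    results = []
    for day in days:  # Each day is a "YYYY-MM-DD" string
        key = next(keyCycle)
        url = BASE_URL + key + "&created_date=\"" + day + "\""
        results.append(requests.get(url))
        time.sleep(pause)  # Pacing between calls; the exact delay is a guess
    return results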

This is my most professional and complex project, so the code handles a lot of cases that may come up. That is because the user has so little control over what the API returns. With this much data there is little room for error, so if something goes wrong you want to be able to fix it midway instead of being stuck starting all over.

API response

The verification based on the response status code

def validResponse(statusCode):
    if statusCode == 200:
        return True
    elif statusCode == 429:
        print("Rate limit has been exceeded")
    elif statusCode == 404:
        print("Requested URL not found on the API server")
    elif statusCode == 403:
        print("API key missing or invalid")
    elif statusCode == 400:
        print("URL does not use HTTPS")
    return False

Suspended

What happens when a key is suspended: I wait briefly and then remove the key from the rotation.

while (offsetCount < lastEntry):
    if len(dncApiKeys) == 0:
        return "Did not finish. Please ReRun"
    curIndex = count % len(dncApiKeys)
    dncApiKey = dncApiKeys[curIndex].strip()
    count += 1
    response = requests.get(baseUrl + dncApiKey + "&created_date=\"" + splits[0] + "\"&offset=" + str(offsetCount))
    time.sleep(.85)
    if not validResponse(response.status_code):
        time.sleep(10)
        print(dncApiKeys)
        print(dncApiKey)
        try:
            dncApiKeys.remove(dncApiKey)
        except ValueError:
            try:
                dncApiKeys.pop(curIndex)  # Fall back to removing by index
            except IndexError:
                print("Could not remove the suspended key")
                return "Could not remove the suspended key. Please check the key list"
        print("Removed Key " + dncApiKey + ". " + str(len(dncApiKeys)) + " keys left")

Step 2: Building Database

Here I am building the initial database, assuming it doesn't exist or could not be found. There is also the case where the computer did not finish the building phase because a key expired, a key was blocked, or power was lost. Many things can happen, and we have to account for them.

Initializing Database

Initially starting to write to the database

def initalizeDatabase():
    f = open("DncApiKey.txt", "r")
    lines = f.readlines()
    f.close()
    print(lines)
    offset = 1 # 1 to skip today because that is not published
    if os.path.exists("Done/toRead.txt"): # Read last line to know where left off
        with open('Done/toRead.txt', 'rb') as f:
            f.seek(-2, os.SEEK_END)
            while f.read(1) != b'\n':
                f.seek(-2, os.SEEK_CUR)
            last_line = f.readline().decode()
            currentDay = datetime.today()
            offset = dateOffset(last_line.split(" ")[0], currentDay.strftime("%Y-%m-%d"))
            # Could do without split but just want data neater

    # Add a part to continue where left off in the toRead File
    for line in lines:
        line = line.strip()
        offset = createDatabase(line, offset)
        createdDatabase = offset == -1
        if not createdDatabase:
            print("That key didn't get all the way through; moving to the next one")
            offset += 1
        else:
            print("That was successful and we created the database")
        time.sleep(5) # Probably wait a little time and write where stopped
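
dateOffset is called above but not shown in this excerpt; based on how it is used, it presumably returns the number of days between two YYYY-MM-DD dates. A minimal sketch of that assumption:

from datetime import datetime

def dateOffset(olderDate, newerDate):
    # Days between two "YYYY-MM-DD" strings,
    # e.g. dateOffset("2021-01-01", "2021-01-11") -> 10
    older = datetime.strptime(olderDate, "%Y-%m-%d")
    newer = datetime.strptime(newerDate, "%Y-%m-%d")
    return (newer - older).days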

Create Folder

This function is used a lot: it creates a folder if it does not already exist, which is an easy error to overlook.

def createFolder(folderpath):
    if os.path.exists(folderpath):
        print(folderpath + " already Exists")
    else:
        os.makedirs(folderpath)
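
For what it is worth, the standard library can collapse this into a single call; a sketch of the equivalent:

import os

def createFolder(folderpath):
    # exist_ok=True makes this a no-op when the folder already exists
    os.makedirs(folderpath, exist_ok=True)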

Create Database

Continues from wherever it left off, which can be the very start, and works day by day until it reaches the earliest records. After a few attempts, I found that the records start on 2020-02-14. It then writes an entry for that day to a file in the 'Done' folder.

def createDatabase(dncApiKey, offset):
    baseUrl = "https://api.ftc.gov/v0/dnc-complaints?api_key="
    createFolder("Done")
    print("We are grabbing the days all the way back to February 14, 2020")
    currentDay = datetime.today()
    curDay = open("Done/lastDate.txt", "w") # Last date updated
    curDay.write(currentDay.strftime("%Y-%m-%d"))
    curDay.flush()
    curDay.close()
    moreDays = True
    subDays = open("Done/toRead.txt", "a") # For the sub parts
    while moreDays:
        d = currentDay - timedelta(days=offset)
        form = d.strftime("%Y-%m-%d")
        print(form)
        if os.path.exists(str("Base\\" + form+".json")):
            print(form + " already exists")
            offset += 1
            continue # Don't want to recall a file that already exists
        response = requests.get(baseUrl + dncApiKey + "&created_date=\"" + form + "\"")
        if form == "2020-02-14":
            moreDays = False
        if not validResponse(response.status_code):
            subDays.close()
            print("Exiting at " + form)
            return offset # If done
        

Cleaning Data

Assuming the request succeeded, the response contains a lot of data I don't need, because I am only building a simple map of numbers. I parse out the information that is needed and grab a record count. The record count is critical later on.

def cleanJson(data):
    del data['meta']
    del data['links']
    for element in data['data']:
        del element['type']
        del element['relationships']
        del element['meta']
        del element['links']
        element['number'] = element['attributes']['company-phone-number']
        element['area-code'] = element['attributes']['consumer-area-code']
        del element['attributes']
    return data
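
To make the trimming concrete, here is a hypothetical record shaped like the fields cleanJson touches (real responses carry more fields than this):

# Hypothetical input, limited to the fields cleanJson uses
raw = {
    "meta": {},
    "links": {},
    "data": [{
        "id": "12345",
        "type": "dnc-complaints",
        "attributes": {
            "company-phone-number": "5551234567",
            "consumer-area-code": "555",
        },
        "relationships": {},
        "meta": {},
        "links": {},
    }],
}

# After cleanJson(raw), each element keeps only the id plus the two flattened fields:
# {"id": "12345", "number": "5551234567", "area-code": "555"}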

Step 3: Cleaning and Validating

During the process, a few methods get used over and over. This keeps things as organized as possible so that, in the end, we are not left with a lot of extra files. We do have a lot of them during the run, but these methods strip extra information from records and validate them to get rid of invalid entries.

Clean to Read

When I process another day, I rotate the toRead file: the current toRead.txt is renamed to toRead.txt.old (replacing any previous .old copy) and a deduplicated version is written in its place.

f = open("Done/toRead.txt", "r")
lines = f.readlines()
f.close()
if os.path.exists("Done/toRead.txt.old"):
   send2trash.send2trash("Done/toRead.txt.old")
os.rename("Done/toRead.txt", "Done/toRead.txt.old")
mySet = set([]) # Get rid of duplicates
for line in lines:
    mySet.add(line)
    # Write cleaned Readme
f2 = open("Done/toRead.txt", "w")
for s in mySet:
    f2.write(s)
f2.flush()
f2.close()

Clean JSON

There are a lot of JSON files, and once they are crushed we don't need them anymore, so I get rid of the whole folder that contained them. This is part of the process of compressing the folders into one file.

for file in doneJson:
    if ".json" in file:
        dateString = file.replace(".json","")
        if os.path.exists("Base/" + file):
            send2trash.send2trash("Base/" + file)
        if os.path.exists("Base/" + dateString):
            send2trash.send2trash("Base/" + dateString)
    else: # Not a json file. Ignore
        continue

Clean Records

This cleans up each record so the final records contain no invalid numbers. Invalid entries are mostly user error. I would suggest someone fix it at the source, and I have put in that suggestion, but that most likely won't change.

def clean(phoneNumber):
    # Could have just stripped all non-digit characters instead (see the sketch below)
    phoneNumber = phoneNumber.replace("+", "").replace("-", "").replace("(", "").replace(")", "")
    if len(phoneNumber) == 10:
        return str(phoneNumber)
    elif len(phoneNumber) == 7:
        return "area missing"
    elif len(phoneNumber) < 10:
        return "too short"
    else: # Could potentially be a double hit on a number
        return "too long"

Rewinding

Going back and writing a file for each day. Technically this is both building and validating, since there can be invalid days.

while moreDays:
    d = datetime.today() - timedelta(days=offset)
    form = d.strftime("%Y-%m-%d")
    
    if os.path.exists(str("Base\\" + form+".json")):
        print(form + " already exists")
        offset -= 1
        continue # Don't want to recall a file that already exists
    if offset <= 1: # Last possible day
        moreDays = False
    response = requests.get(baseUrl + dncApiKey + "&created_date=\"" + form + "\"")
    if not validResponse(response.status_code):
        subDays.close()
        print("Exiting at " + form)
#            ("Add to the log file information about where left off
        return "Please check Log File for more detail" # If done

Compare Cleaning

For the rare occasion that compare.csv has already been created and needs cleaning afterward, I created a method for that. It is not part of the main flow, but it may help if someone forgot to clean at an earlier stage.

for line in lines:
    splits = line.strip().split(",")
    newLine = splits[0] + "," + splits[1] # Keep the number and id, drop the area column
    if splits[0].isdigit(): # Check if valid number
        curNumb = int(splits[0])
        if curNumb <= 2000000000: # Area codes start at 2, so anything this small cannot be a real 10-digit number
            if curNumb == 0:
                zeros += 1
            else:
                rem += 1
        else: # Valid number
            newArray.append(newLine)
            sortingArray.append(int(splits[0]))

Step 4: Enriching

The problem is that I needed more information: so far I have only grabbed the first page of results for each day. This means I need to enrich the data, but only after updating everything. This is where things get a little confusing. The algorithm could be simplified, but that would only work with unlimited space, and I am trying to keep the whole project under 1 GB, which is tough.
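
Enrichment boils down to re-requesting each day with an increasing offset, the same parameter used in the Suspended snippet above. A stripped-down sketch of paging a single day (enrichDay is hypothetical, and the page size of 50 is an assumption based on the rate-limit discussion):

import requests

def enrichDay(day, apiKey, lastEntry, pageSize=50):
    # lastEntry is the record count captured when the day was first fetched
    baseUrl = "https://api.ftc.gov/v0/dnc-complaints?api_key="
    pages = []
    offsetCount = 0
    while offsetCount < lastEntry:
        url = baseUrl + apiKey + "&created_date=\"" + day + "\"&offset=" + str(offsetCount)
        response = requests.get(url)
        if response.status_code != 200:
            break  # Leave the rest for a later run, as the real code does
        pages.append(response.json())
        offsetCount += pageSize
    return pages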

Finding folder

Finding the folder we left off on, since we might not have crushed the database yet.

for line in lines:
    # Check to see if file already exist so I don't waste time
    splits = line.split(" ")
    offsetCount = 0
    currentDir = "Base/" + splits[0]
    if os.path.exists(currentDir): # Check if everything there
        if os.path.exists(currentDir + "/" + splits[1] +".json"): # Last file exists
            print("Skip this folder because it is already done")
            continue
        else: # Continue where left off
            print("Continue at the biggest number file in that folder")
            subFiles = os.listdir(currentDir)
            offsetCount = maximum(subFiles)

Finding File

We don't want to overwrite work and waste time, so once we have verified the folder, we need to find the last file written in it.

def maximum(fileList):
    numbers = []
    if len(fileList) <= 1: # Might be issue if only 1 file was written
        return 0 # Just rewrite the file
    for name in fileList:
        if "-" in name and ".json" in name:
            name = name.replace(".json", "").split("-")[1]
            numbers.append(int(name))
        else:
            print("We have a problem with file " + name)
            input()
    return max(numbers)

Step 5: Updating Database

This part runs whenever we have to update the database. It is the most troublesome because we have to reload all the information. I know it is inefficient, but it was the best approach at the time. Adding in a rebase made the file smaller and faster to access.

Node Class

Creating a node class for each number

class Node:
    def __init__(self, value, nextNode=None):
        self.id = value['id']
        self.numb = value['number']
        self.area = value['area-code']
        self.good = value['number'][0:3] == value['area-code']
        self.nextNode = nextNode

    def getId(self):
        return self.id

    def getNumb(self):
        return self.numb

    def getArea(self):
        return self.area
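
A quick usage example with a cleaned record (the values are made up):

record = {"id": "12345", "number": "5551234567", "area-code": "555"}
n = Node(record)
print(n.getNumb())  # 5551234567
print(n.good)       # True, because the number's first three digits match the reported area code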

Good and Bad

Updates the good and bad number files with their corresponding numbers.

def reBase():
    baseDir = "Done/Completed/"
    mergeFiles = os.listdir(baseDir)
    goodNumbers = [] # Raw records whose number matches the reported area code
    badNumbers = [] # Raw records whose number does not match
    goodNumb = [] # Node objects, same area code
    badNumb = [] # Node objects, different area code
    for fi in mergeFiles: # Load Numbers
        try:
            with open(baseDir + fi, "r") as read_file:
                developer = json.load(read_file)
                for x in developer:
                    n = Node(x)
                   # print("this")
                    if n.good:
                        goodNumbers.append(x) # Technically can delete area-code
                        goodNumb.append(n)
                    else:
                        badNumbers.append(x)
                        badNumb.append(n)
        except Exception:
            print("the file is having an issue " + fi)
            input("Input something to continue ")
    # Go through the numbers and sort
    print(str(len(goodNumb)))
    print(str(len(badNumb)))
    # Turning dict to file
    with open('badNumbers.json', 'w') as fp:
        json.dump(badNumbers, fp)

    with open('goodNumbers.json', 'w') as fp:
        json.dump(goodNumbers, fp)

Step 6: Transferring

Sometimes people want a more compact version, and I liked the CSV format for viewing and sorting. The data did get cut off most of the time when viewing, but that was OK. This part changed a lot, but for the most part I liked having a CSV to compare both good and bad numbers.

JSON to CSV

Preparing a combined CSV database

def prepDatabase():
    compsF = open("compare.csv", "w")
    compsF.write("Number,Id,Area\n")
    compsF.flush()
    filename = "goodNumbers.json"
    try:
        with open(filename, "r") as read_file:
            developer = json.load(read_file)
            for x in developer:
                newStr = x['number'] + "," + x['id'] + "," + x['area-code']
                compsF.write(newStr + "\n")
        filename = "badNumbers.json"
        with open(filename, "r") as read_file:
            developer = json.load(read_file)
            for x in developer:
                newStr = x['number'] + "," + x['id'] + "," + x['area-code']
                compsF.write(newStr + "\n")

    except Exception:
        print(filename + " is having an issue")
        compsF.close()
        return

    compsF.flush()
    compsF.close()

Step 7: Crushing

I really like calling this crushing because it takes everything and squeezes it together. You may ask why not call it compressing. Good point: it is because I already have compression algorithms elsewhere and don't like the confusion. This is where we take everything and crush it into one place. Because of its nature, this can follow any step. Most of the time it follows enrichment, but not always.

Merging

Merging the files together in a folder.

for a in ar: # All the files merging
    try:
        with open(("Base\\" + foldername + "\\" + a), "r") as read_file:
            developer = json.load(read_file)
            for x in developer['data']:
                dex.append(x)
    except Exception:
        return "Leaving the file alone because an issue occurred"

Double-check

Just want to double-check that everything is right, in case we didn't call cleaning earlier.

for x in developer['data']:
    number = clean(x['number'])
    if number == "too short" or number == "too long":
        print(x)
        continue
    elif number == "area missing":
        # clean() returns a label here rather than the digits, so strip the
        # separators again before prepending the area code
        digits = x['number'].replace("+", "").replace("-", "").replace("(", "").replace(")", "")
        x['number'] = x['area-code'] + digits
    dex.append(x)

Step 8: Recall

What the hell would I be doing if I forgot to recall the phone numbers? That is the whole point of the algorithm/project.

Queueing Entries

Loading up the entries we will search against for the numbers the user wants.

def searchNumber():
    phoneNumber = "000-000-0000"
    done = False
    allNumbers = []
    filename = "compareSort.csv"
    searchAll= "Y" # Including bad numbers; Implement later on

    if os.path.exists(filename):
        print("Reading in " + filename)
    else:
        print("That file doesn't exist. Check and run the function again")
        if os.path.exists("goodNumbers.json"):
            print("good file exists")
        else:
            print("good file doesn't exist")
        if os.path.exists("badNumbers.json"):
            print("bad file exists")
        else:
            print("bad file doesn't exist")
        if os.path.exists("goodNumbers.json") and os.path.exists("badNumbers.json"):
            print("Combining files to hopefully solve the issue")
            # Call prepDatabase() here
        return # Bail out so the missing file isn't opened below
    listNumbers = []
    dupCount = 0
    with open(filename, "r") as read_file:
        lines = read_file.readlines()
        lines.pop(0) # Remove the header row
        for line in lines:
            splits = line.strip().split(",")
            listNumbers.append(line)
            allNumbers.append(splits)

    print("Loaded ", str(len(listNumbers)) + " entries")
    print("Duplicate Count", str(dupCount))

Finding entries

I could have written this with a binary search but a linear search was fast enough for me. Most of the time was spent on building the database.

if len(idList) > 0:
    print("Found " + str(len(idList)) + " entries for the number " + phoneNumber)
    extended = input("Would you like all the information (Y/N)? ")
    if extended.lower() == "y":
        for i in range(0, len(idList)):
            print("Id " + str(i) + " : " + idList[i])
            getRef(idList[i])
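
If the linear scan ever became the bottleneck, a binary search over the sorted numbers would be a small change. A sketch using the standard bisect module, assuming allNumbers holds the [number, id, area] rows loaded above and is sorted by the number column (as the compareSort.csv name suggests):

import bisect

def findNumber(phoneNumber, allNumbers):
    numbers = [row[0] for row in allNumbers]  # The number column only
    left = bisect.bisect_left(numbers, phoneNumber)
    right = bisect.bisect_right(numbers, phoneNumber)
    return [row[1] for row in allNumbers[left:right]]  # The matching ids

# idList = findNumber("5551234567", allNumbers)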

Finally

In the end, this project took at least 2 weeks because the API keys kept getting suspended and I made a mistake halfway through that deleted the whole database. Also, when the program runs, every action performed is logged. That is how I kept track of whether an issue occurred and at what step. I did not remove the logging since I am the only one testing it, and I barely tested it because 2 weeks is a long time for a project.
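
The fileLog and getTimeNow used by the main menu are not shown in these excerpts; a minimal sketch of what that setup likely looks like (the log file name is a guess):

from datetime import datetime

fileLog = open("log.txt", "a")  # File name is a guess; kept open for the whole run

def getTimeNow():
    # Timestamp written in front of every logged action
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")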

Files

I included the files so people don't have to go back through all of this themselves.
