
Tools and environment

Debug a python program

As I mentioned in Homework assignment 0, I am using neovim as my editor. Before this assignment I had never bothered installing a debugger in neovim, since I usually write less complex pieces of software. Lately, however, I have been writing more and more complex software (relatively speaking), and in some cases I could have benefited from using one.

Therefore I installed dap and dap-python in my neovim environment according to this setup: nvim-dap-python. This did not work out of the box for me: since I am using conda/mamba as an environment handler, I had to tweak the setup a little, because debugpy has to be installed for dap to work in neovim. The configuration I used assumed that debugpy was installed globally, but I want debugpy to work from my environment.

This meant that I had to change the path from which neovim reads the python binary:

local python_install_path = vim.fn.exepath('python')
dap_python.setup(python_install_path)

That change ensured that neovim used the python binary from my active environment when I started neovim. Using the debugger was quite easy; below are some of the keyboard shortcuts that I use:

  • Space + db → Toggle breakpoint
  • Space + dc → Start the debugger
  • Space + di → Step into
  • Space + do → Step over
  • Space + dO → Step out
  • Space + dq → Terminate
  • Space + du → Toggle UI

I tested the debugger on the interleave assignment and it worked great.

Inventory of tools

Aerospace

Aerospace is a tiling window manager made for macOS, and I have it configured similarly to how my i3 setup works on my personal Linux system. I usually keep dedicated applications on single workspaces. I have two separate terminal windows in workspaces 1 and 2: one local terminal and one terminal connected to a remote server via ssh. On workspace 3 I usually have a browser open, so I can easily browse, for example, code documentation. Workspace 5 is dedicated to communication, so there I have my email application and Slack open, and on workspace 6 I usually have Spotify open. See below how I switch between the workspaces:

  • super+1 → Local terminal
  • super+2 → Remote terminal
  • super+3 → Browser
  • super+5 → Communication
  • super+6 → Spotify

Tmux

Tmux is a terminal multiplexer, meaning that you can have several terminal instances within one single terminal, in different panes and windows. I am probably not using the full power of this awesome tool, but the way I work with it suits the work I do very well. I usually try to keep it as clean as possible, with maybe two active windows and usually just one pane within each window. Tmux has a multitude of available plugins to choose from, and the one I use for my workflow is Vim-Navigation, which enables me to navigate between the panes I have within tmux using vim movements, without the need for the original prefix Ctrl+b. Below is the majority of the shortcuts I use:

  • Ctrl + b + " → Create a new horizontal pane
  • Ctrl + b + % → Create a new vertical pane
  • Ctrl + l → Move to the pane right of the active one
  • Ctrl + h → Move to the pane left of the active one
  • Ctrl + k → Move to the pane above of the active one
  • Ctrl + j → Move to the pane below of the active one
  • Ctrl + b + c → Create a new window
  • Ctrl + b + , → Rename window
  • Ctrl + b + w → Get a list of windows which I can navigate and choose from
  • Ctrl + b + l → Select previous window
  • Ctrl + b + $ → Prompt to rename the current session

One really nice feature of tmux is that you can detach your tmux session, meaning that whatever process you have running in that session keeps running inside tmux. I use this when running really time-consuming scripts, for example when I run my retrievals. I usually do this over ssh on a dedicated server I use for these calculations. I can start the retrieval within a tmux session on the server, detach that session, and then log out from the ssh connection. This ensures that the process continues running on the server within tmux, and I no longer have to be connected to it from my local system. I detach and attach a session with the shortcuts and commands below:

  • Ctrl + b + d → Detach session
  • tmux ls → List tmux sessions
  • tmux attach-session -t session_name → Attach the session named session_name

vim movement

I know that some of the basics are still not in my repertoire when it comes to vim motions, especially moving within a line. I am still spamming hjkl way too much, when I could instead use some built-in functionality. What I learned from watching some of the vim videos in the handout is that I should incorporate w, b and f much more. With w you jump forward a word and with b you jump backward. f is used together with the character you want to jump forward to.

Find duplicates

In this exercise Daniel wrote a script that fills a folder with text files containing randomized content. The code can be seen below:

from pathlib import Path
import random

TARGET = Path(".") / "a_few_files"
FILES_N = 69
random.seed(323847)

content = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
    "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.",
    "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.",
    "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
]

file_content = [
    "\n".join(random.sample(content, k=2))
    for ind in range(FILES_N)
]
filenames = [
    f"{ind}_{ind % 3}.txt"
    for ind in range(FILES_N)
]

if __name__ == "__main__":
    if TARGET.is_file():
        raise FileExistsError(f"{TARGET=} already exists as a file")
    TARGET.mkdir(exist_ok=True)

    for fname, content in zip(filenames, file_content):
        file = TARGET / fname
        with open(file, "w") as fh:
            fh.write(content)
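A quick aside (my own back-of-the-envelope check, not part of the assignment): random.sample(content, k=2) draws an ordered pair of distinct lines from a pool of four, so there are only 4 × 3 = 12 possible file contents. With 69 files, the pigeonhole principle guarantees that some files are duplicates:

```python
from math import perm

# random.sample(content, k=2) yields an ordered pair of distinct lines,
# so a 4-line pool gives perm(4, 2) = 12 possible file contents.
possible_contents = perm(4, 2)
print(possible_contents)  # 12

# 69 files drawn from only 12 possible contents: duplicates are guaranteed.
FILES_N = 69
assert FILES_N > possible_contents
```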

I solved this problem using a pretty common approach for problems of this character: using maps (dictionaries) to identify the duplicates. I first used a naive approach that compares the content of the files directly, shown below in the function duplicates, together with the helper functions find_files and save_duplicates.

My second approach was to hash the content of the files using various hashing algorithms and then compare the hashes with each other to find duplicate files. This is a fast and memory-friendly way to solve the problem, since only the fixed-size digests need to be kept around instead of the full file contents. I wrote a wrapper function called duplicate_with_hashing, and I call this wrapper from an easy CLI seen below:

import argparse
import time

# duplicates, duplicate_with_hashing and the hash functions referenced here
# are defined in the files module (src/course_package/files.py) shown below
PATH = "a_few_files"
PATTERN = "*.txt"
DESC_MAP = {
    "nohash": "Find duplicates without hashing content",
    "sha256_fd": "Find duplicates using hashed content with sha256 and digest file directly",
    "sha256_chunks": "Find duplicates using hashed content with sha256 in 4kB chunks",
    "sha256": "Find duplicates using hashed content with sha256 from Path object",
    "md5": "Find duplicates using hashed content with md5 from Path object",
    "blake2b": "Find duplicates using hashed content with blake2b from Path object",
    "blake2s": "Find duplicates using hashed content with blake2s from Path object",
}
FUNCTION_MAP = {
    "nohash": duplicates,
    "sha256_fd": lambda save: duplicate_with_hashing(
        PATH, PATTERN, hash_func=sha256_filedigest, save=save
    ),
    "sha256_chunks": lambda save: duplicate_with_hashing(
        PATH, PATTERN, hash_func=sha256_chunks, save=save
    ),
    "sha256": lambda save: duplicate_with_hashing(
        PATH, PATTERN, hash_func=sha256_path, save=save
    ),
    "md5": lambda save: duplicate_with_hashing(
        PATH, PATTERN, hash_func=md5_path, save=save
    ),
    "blake2b": lambda save: duplicate_with_hashing(
        PATH, PATTERN, hash_func=blake2b_path, save=save
    ),
    "blake2s": lambda save: duplicate_with_hashing(
        PATH, PATTERN, hash_func=blake2s_path, save=save
    ),
}


def main():
    save_desc = "Saves the files found with name of the first argument"
    time_desc = "Times the execution of the function"
    print_desc = "Prints the duplicate files to stdout"
    description = "Program to find files with duplicate content"
    parser = argparse.ArgumentParser(add_help=True, description=description)
    subparsers = parser.add_subparsers(
        dest="method", required=True, help="Methods available"
    )

    for method in FUNCTION_MAP.keys():
        description = DESC_MAP.get(method, "")
        subparser = subparsers.add_parser(
            method, help=description, description=description
        )
        subparser.add_argument("-s", "--save", action="store_true", help=save_desc)
        subparser.add_argument("-t", "--timeit", action="store_true", help=time_desc)
        subparser.add_argument("-p", "--printit", action="store_true", help=print_desc)

    args = parser.parse_args()
    func = FUNCTION_MAP[args.method]

    if args.method == "nohash":
        st_ex = time.time()
        dupes = func(PATH, PATTERN, args.save)
        execution_time = time.time() - st_ex
    else:
        st_ex = time.time()
        dupes = func(save=args.save)
        execution_time = time.time() - st_ex

    if args.timeit:
        print(f"Execution time: {execution_time} s")

    if args.printit:
        for files in dupes:
            print(files)


if __name__ == "__main__":
    main()

All the source code for the functions mentioned here can be seen in the files module.

blake2b_path(path)

Parameters:

Name Type Description Default
path Path

Path to file

required

Returns:

Type Description
str

Digest of file

Source code in src/course_package/files.py
def blake2b_path(path: Path) -> str:
    """blake2b hash

    Args:
        path: Path to file

    Returns:
        Digest of file
    """
    hasher = hashlib.blake2b()
    bstr = path.read_text().encode("utf-8")
    hasher.update(bstr)
    return hasher.hexdigest()

blake2s_path(path)

Parameters:

Name Type Description Default
path Path

Path to file

required

Returns:

Type Description
str

Digest of file

Source code in src/course_package/files.py
def blake2s_path(path: Path) -> str:
    """blake2s hash

    Args:
        path: Path to file

    Returns:
        Digest of file
    """
    hasher = hashlib.blake2s()
    bstr = path.read_text().encode("utf-8")
    hasher.update(bstr)
    return hasher.hexdigest()

duplicate_with_hashing(base_dir, pattern, hash_func, save=False)

Parameters:

Name Type Description Default
base_dir str

base directory

required
pattern str

pattern to look for

required
hash_func Callable

hash function

required
save

bool if saving files

False

Returns:

Type Description
list

Files

Source code in src/course_package/files.py
def duplicate_with_hashing(
    base_dir: str, pattern: str, hash_func: Callable, save: bool = False
) -> list:
    """Wrapper to apply hashing to find duplicates

    Args:
        base_dir: base directory
        pattern: pattern to look for
        hash_func: hash function
        save : bool if saving files

    Returns:
        Files
    """
    files = find_files(base_dir, pattern)
    dictionary = defaultdict(list)

    for path in files:
        content_hash = hash_func(path)
        dictionary[content_hash].append(path.name)

    files = [val for val in dictionary.values() if len(val) > 1]

    if save:
        filename = hash_func.__name__ + ".txt"
        save_duplicates(files, filename)
    return files

duplicates(base_dir, pattern, save=False)

Parameters:

Name Type Description Default
base_dir str

base directory to search from

required
pattern str

pattern to look for in the directory

required
save bool

boolean if it should be saved

False

Returns:

Type Description
list

List of duplicate files

Source code in src/course_package/files.py
def duplicates(base_dir: str, pattern: str, save: bool = False) -> list:
    """Find duplicate files from the full content

    Args:
        base_dir: base directory to search from
        pattern: pattern to look for in the directory
        save: boolean if it should be saved

    Returns:
        List of duplicate files
    """
    files = find_files(base_dir, pattern)
    dictionary = defaultdict(list)
    for path in files:
        dictionary[path.read_text(encoding="utf-8")].append(path.name)

    files = [val for val in dictionary.values() if len(val) > 1]
    if save:
        filename = "nohash.txt"
        save_duplicates(files, filename)
    return files
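To sanity-check the map-based grouping, here is a minimal standalone sketch (a hypothetical naive_duplicates helper of my own, not the course function) run against a throwaway directory:

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def naive_duplicates(paths):
    """Group paths by their full text content; any bucket with more
    than one entry holds duplicate files."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[path.read_text(encoding="utf-8")].append(path.name)
    return [names for names in buckets.values() if len(names) > 1]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("same text")
    (root / "b.txt").write_text("same text")
    (root / "c.txt").write_text("different text")
    print(naive_duplicates(sorted(root.glob("*.txt"))))  # [['a.txt', 'b.txt']]
```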

find_files(base_dir, pattern)

Parameters:

Name Type Description Default
base_dir str

Base directory

required
pattern str

Pattern to look for

required

Returns:

Type Description
Generator

Path to files

Source code in src/course_package/files.py
def find_files(base_dir: str, pattern: str) -> Generator:
    """Find files

    Args:
        base_dir: Base directory
        pattern: Pattern to look for

    Returns:
        Path to files
    """
    return Path(base_dir).rglob(pattern)

md5_path(path)

Parameters:

Name Type Description Default
path Path

Path to file

required

Returns:

Type Description
str

Digest of file

Source code in src/course_package/files.py
def md5_path(path: Path) -> str:
    """md5 hash

    Args:
        path: Path to file

    Returns:
        Digest of file
    """
    hasher = hashlib.md5()
    bstr = path.read_text().encode("utf-8")
    hasher.update(bstr)
    return hasher.hexdigest()

save_duplicates(files_list, filename)

Parameters:

Name Type Description Default
files_list Generator

List of duplicate file groups

required
filename str

File name

required
Source code in src/course_package/files.py
def save_duplicates(files_list: Generator, filename: str) -> None:
    """Save duplicate files

    Args:
        files_list: List of duplicate file groups
        filename: File name
    """
    if not os.path.exists("duplicates"):
        os.makedirs("duplicates")
    path = Path("duplicates") / filename

    with open(path, "w") as fh:
        for files in files_list:
            fh.writelines(str(files) + "\n")

sha256_chunks(path)

Parameters:

Name Type Description Default
path Path

Path to file

required

Returns:

Type Description
str

Digest from hash

Source code in src/course_package/files.py
def sha256_chunks(path: Path) -> str:
    """sha256 hash in 4kB chunks

    Args:
        path: Path to file

    Returns:
        Digest from hash
    """
    hasher = hashlib.sha256()
    with open(path, "rb") as file:
        chunk = file.read(4096)
        while chunk:
            hasher.update(chunk)
            chunk = file.read(4096)
        return hasher.hexdigest()
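As a quick cross-check (my own, not part of the course code), hashing in 4 kB chunks must produce exactly the same digest as hashing the whole byte string in one go:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_in_chunks(path, chunk_size=4096):
    # Same chunked pattern as above, written standalone for the check.
    hasher = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()

data = b"x" * 10_000  # larger than a single 4 kB chunk
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "big.bin"
    path.write_bytes(data)
    chunked = sha256_in_chunks(path)

print(chunked == hashlib.sha256(data).hexdigest())  # True
```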

sha256_filedigest(path)

Parameters:

Name Type Description Default
path Path

Path to file

required

Returns:

Type Description
str

Digest from file

Source code in src/course_package/files.py
def sha256_filedigest(path: Path) -> str:
    """sha256 hash

    Args:
        path: Path to file

    Returns:
        Digest from file
    """
    with open(path, "rb") as file:
        return hashlib.file_digest(file, "sha256").hexdigest()
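One caveat worth noting (not mentioned in the course material): hashlib.file_digest was only added in Python 3.11, so this variant needs a reasonably recent interpreter; on older versions the chunked reader is the fallback. A guarded sketch:

```python
import hashlib
import sys
import tempfile

matches = True
if sys.version_info >= (3, 11):  # hashlib.file_digest exists from 3.11 on
    with tempfile.TemporaryFile() as fh:  # opened in binary mode ("w+b")
        fh.write(b"hello")
        fh.seek(0)
        digest = hashlib.file_digest(fh, "sha256").hexdigest()
    matches = digest == hashlib.sha256(b"hello").hexdigest()
print(matches)  # True
```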

sha256_path(path)

Parameters:

Name Type Description Default
path Path

Path to file

required

Returns:

Type Description
str

Digest from file

Source code in src/course_package/files.py
def sha256_path(path: Path) -> str:
    """sha256 hash

    Args:
        path: Path to file

    Returns:
        Digest from file
    """
    hasher = hashlib.sha256()
    bstr = path.read_text().encode("utf-8")
    hasher.update(bstr)
    return hasher.hexdigest()
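A subtlety worth flagging (my observation, not from the assignment): the *_path variants hash path.read_text().encode("utf-8"), and text mode translates Windows \r\n line endings to \n, so for such files the digest differs from the raw-byte digest produced by sha256_chunks and sha256_filedigest. For the generated UTF-8 files with \n endings all variants agree, but in general:

```python
import hashlib
import tempfile
from pathlib import Path

raw = b"line one\r\nline two\n"
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "crlf.txt"
    path.write_bytes(raw)

    # Text mode translates \r\n to \n before encoding, so the text-based
    # digest differs from the digest of the raw bytes on disk.
    text_digest = hashlib.sha256(path.read_text().encode("utf-8")).hexdigest()
    byte_digest = hashlib.sha256(path.read_bytes()).hexdigest()

print(text_digest != byte_digest)  # True
```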