Practical Tools for Computational Linguistics

Git basics

Git is an invaluable tool for programmers and computational linguists alike. Version control makes collaboration within large groups of people much easier and lets you rectify regressions promptly. This is not an intricate tutorial on how git works under the hood (though there is a small bit of discussion about this). The primary focus is on how to use git effectively.

Cloning a Repository

This chapter discusses how to clone a repository. This is most likely the first thing you will do with a git repo. With modern git platforms like GitHub, Bitbucket, Gitea and GitLab, you typically do not initialize repositories yourself. Instead, the repo is initialized on the server managed by one of these platforms and then cloned onto your local machine.

To clone a repository

There are two protocols used to clone a repository: ssh and https. Both are encrypted in transit and both are rather quick. I prefer ssh because it improves my git workflow. With the lengthy passphrases required by Indiana University, the IU GitHub instance is a pain to pull from and push to over https. By using ssh, my key can have whatever password I want, or no password at all, while still maintaining a high level of security.

Here are some of the primary differences between the two protocols with regard to how one interacts with them in git. You can switch the protocol used after the initial clone. However, it is simpler to clone using the protocol that you prefer right from the start.

SSH

More initial setup is required to use ssh with git. The urls for cloning using ssh typically look something like this:

    git@bitbucket.org:ksteimel/test_project.git

Let's break this down really quickly. The beginning git@ part says that the ssh connection is made as the git user on the server.
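To make the pieces of such a url concrete, here is a small Python sketch that splits the scp-style form (user@host:path, with a colon between host and path) into its parts. The helper name split_ssh_url and the regular expression are my own illustration, not something git exposes, and real git accepts more url forms than this handles.

```python
import re
from typing import NamedTuple


class ScpStyleUrl(NamedTuple):
    user: str  # account used for the ssh login (usually "git")
    host: str  # server hosting the repository
    path: str  # repository path on that server


# Hypothetical pattern for illustration only: user@host:path
_SCP_STYLE = re.compile(r"^(?P<user>[^@/:]+)@(?P<host>[^:/]+):(?P<path>.+)$")


def split_ssh_url(url: str) -> ScpStyleUrl:
    """Split an scp-style ssh clone url into user, host and path."""
    match = _SCP_STYLE.match(url)
    if match is None:
        raise ValueError(f"not an scp-style ssh url: {url!r}")
    return ScpStyleUrl(match["user"], match["host"], match["path"])


parts = split_ssh_url("git@bitbucket.org:ksteimel/test_project.git")
print(parts.user)
print(parts.host)
print(parts.path)
```

The user part is almost always git on hosted platforms; your own identity comes from the ssh key you present, not from the username in the url.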
In order for this to work, the server needs to have a preshared key from our local machine. To generate this key and put it on the server, follow these instructions on a mac or linux machine:

1. Check to see if there is a file in ~/.ssh/ that ends in .pub. The default file is normally called id_rsa.pub on most modern systems.

2. If this file does exist, and you know the password associated with this key, simply open the file in your favorite text editor, copy the content to your clipboard and paste it into a new key in the web interface of the git server. Creating a new key is usually done by accessing your user settings (by clicking on your user icon and then clicking 'Settings') and then going to the submenu dealing with SSH & GPG keys.

3. If this file does not exist, create it. Run the command ssh-keygen on your local machine as the user you would like to use for this repo. This program will prompt you for the location where you would like to store the keys as well as the password for the keys. Typically, I leave both at their defaults on OpenSuse, which are ~/.ssh/id_rsa for the location (the public key is written next to it as ~/.ssh/id_rsa.pub) and nothing for the password.

If you have a different key location/name, you will need to add some content to your ssh config file. In ~/.ssh/config put the following:

    Host github.com
        IdentityFile
        User git

Never upload your private key. This will not work and it will also leave your system exposed. Private keys are to be guarded closely; public keys are to be shared around.

After the public key file has been created, add this public key to the git server's web management system using the instructions in step 2.

HTTPS

To use https, simply select the https url option when cloning. There is no additional setup required. However, every time you push or pull, you will have to enter your username and password. The credential helper lets you avoid this annoyance. To use it, run the following:

    git config credential.helper store
You should only be prompted for your username and password one more time and then git should remember. If you need to change your credentials for any reason, just rerun that command and you'll be prompted for credentials the next time you pull.

Git hist

This is an excellent alias, git hist, to add to your .gitconfig file:

    git log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short

This page has an excellent discussion of how to add this alias to your .gitconfig file. This website is also the source of this handy alias. The result is a description of the repository's history with colored graphs showing the different branches. It's all done in the terminal too! This is an example of what the hist command produces.

Introduction

The Guided Git Tutorial is an invaluable resource for learning how to use git effectively. This chapter is similar to that tutorial and draws from it in many ways. However, the focus here is to provide a more step-by-step explanation of the pieces of git that are important for computational linguists to know. Yue Chen and I have also prepared other materials to assist with learning git, namely the presentation in the sidebar of this page. In addition, I have a git repository that walks you through many of the topics discussed in this chapter.

Checkout

One of the most powerful functionalities provided by git is checkout. This command can be used to do a number of different things, including:

- Creating a new branch
- Switching to a different branch
- Updating the working tree
- Moving to a different commit on the current branch
- Creating a new branch at a previous commit point

I'll go over each of these and provide a use case for each as well.

Creating a new branch

This can be useful if there are multiple people working on a project where subgroups are working on different parameter settings or different feature extraction methods.
Rather than create two different directories where the files live, you can create two separate branches and use the checkout command to switch between them. To create a new branch use the following syntax:

    git checkout -b

Switching to a different branch

Now that this new branch is created, how do you quickly switch back and forth between branches? The answer is essentially to leave out the -b flag.

    git checkout

Updating the working tree

The most common case where I use this functionality is to get back a file that I accidentally deleted from my file system. Running git checkout will bring the latest committed version of that file back into your working tree, even if the file has been deleted by mistake.

Moving to a different commit on the current branch

If there's a previous commit that you would like to roll back to, for example, if you need to examine how some system used to run, you can check out an individual commit from the past. To do this, you will need to obtain the shortened hexadecimal commit hash. One way to obtain this is using the git log command. I recommend using the git hist alias discussed elsewhere in this chapter. Another way is to look at the repository in the git web interface on the server where your repository is hosted. There will typically be a place to view the commit history on these web interfaces. Then you can copy the commit hash and paste it into the appropriate spot in your command line. For example,

    git checkout 79b47a4

where 79b47a4 is an example commit hash.

Creating a new branch at the previous commit point

However, the previous command will put your local repository into a detached HEAD state. You can do most git things, but you are not currently at the tip of a branch, so some operations like merging will not work. The solution is to create a new branch when you rewind to a previous commit. This essentially combines the methods from two of the previous sections.
Simply run

    git checkout -b

Supercomputer tools

This section describes the supercomputer tools that are available at IU and how to use them.

Containers on IU supercomputers

Karst, Carbonate and Big Red II have the singularity package available, which allows users to run docker or singularity images on the supercomputers. This can be a great way to run programs that are not installed on the supercomputers. To use singularity, load the module:

    module add singularity

Then pull down an image. You can pull an image from dockerhub or singularity's hub.

    singularity pull docker://julia:latest

Unlike docker, singularity creates a file that contains the image. To run the container, use the image file generated by your pull command.

    # singularity exec
    singularity exec julia-latest.simg julia

For more information, please see the singularity documentation.

Supercomputer info

Carbonate node: 710.223 GFLOPS on intel mkl linpack

If you need to get 32GB of VRAM on Carbonate-dl

From a previous help ticket: "dl[11,12] are in fact v100-16GB parts. If the user needs v100-32GB they’ll need to add a “-w “, where is dl1 or dl2."

Notes on Screen

It is a good idea to use screen sessions

On supercomputers like those at IU, it is essential to use a screen session for submitting interactive jobs. Screen basically emulates a user sitting at the computer screen. It accepts output and can give input. However, you can detach from screen sessions and log out from the computer without causing running jobs to terminate. Here are some notes provided by IU's Knowledge Base.

When you can't re-attach to your screen session after a lost connection

In some cases, your previous screen session may not have detached properly when you lost your connection. If this happens, you can detach your session manually. To see your existing screen sessions, enter:

    screen -list

This will display a list of your current screen sessions.
For instance, if you had one attached and one dead screen, you would see:

    There are screens on:
        25542.pts-28.hostname (Dead ???)
        1636.pts-21.hostname (Attached)
    Remove dead screens with 'screen -wipe'.
    2 Sockets in /tmp/screens/S-username.

To detach an attached screen, enter:

    screen -D

If you have more than one attached screen, you can specify a particular screen to detach. For example, to detach the screen in the above example, you would enter:

    screen -D 1636.pts-21.hostname

Once you've done this, you can resume the screen by entering the screen -r command. (In the above example, the dead screen isn't causing problems, but you should probably enter the screen -wipe command to get rid of it.)

AMD optimized crfsuite

Problem

The bundled binary from python-crfsuite and sklearn-crfsuite performs rather badly on amd processors. For example, in a trial run training a small pos tagger on an AMD Epyc 7601, each iteration in the hyperparameter search took about 1 minute 15 seconds. This is with only a small training set of about 400 sentences. With the full 6,000 sentences available it takes upwards of 4 days to finish a single run. Rough.

However, on a dual intel E5-2680 system (16 cores at 3.2 GHz all core turbo), performance is much better. On that same small dataset of 400 sentences, each iteration takes about 30 seconds, and the entire hyperparameter search over 50 combinations takes 2 minutes. This appears to be because the binaries that ship with sklearn-crfsuite were compiled on an intel platform.

Solution

To fix this, I created a fork of python-crfsuite that uses the avx2 instructions available on amd's zen processors (this should also help with newer intel processors that have avx extensions). This fork is available at https://github.com/ksteimel/python-crfsuite.git

To use this, clone the repo:

    git clone --recurse-submodules https://github.com/ksteimel/python-crfsuite.git

Then change into the new directory.
    cd python-crfsuite

It's a good idea to create a virtual environment so that if you decide you don't want to use this version, you don't have to.

    virtualenv -p
    source /bin/activate

Then, we need to build the package.

    python setup.py build
    python setup.py install

Now you should have an optimized build of python-crfsuite. You can then install sklearn-crfsuite.

    pip install sklearn-crfsuite
    pip install scikit-learn

To get out of your virtual environment, run

    deactivate

How did we do?

The whole point of this was to speed up crfsuite on amd processors. If we go back to our small training dataset, we see a substantial boost in performance. We've now gone from one minute 15 seconds per iteration to only 30ish seconds per iteration. It may seem somewhat shocking that the amd processor is only about as fast per iteration as the intel processor. However, the amd processor has double the number of cores (32 instead of 16) at a lower clock speed (2.2 GHz for the epyc processor compared to 3.2 GHz for the intel processor). If we look at the entire grid search across 50 parameter combinations, we see the core-count advantage of the epyc processor emerge: the grid search finished in only 1.1 minutes on the AMD processor while it took 2.0 minutes on the pair of intel processors.
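To put those timings side by side, here is a quick back-of-the-envelope calculation in Python. The numbers are the ones quoted above (75 seconds and roughly 30 seconds per iteration, 2.0 and 1.1 minutes for the grid search); the speedup helper is just my own shorthand, and since the 30 second figure is approximate the first ratio is approximate too.

```python
def speedup(before: float, after: float) -> float:
    """How many times faster `after` is than `before` (same units for both)."""
    return before / after


# Per-iteration time on the Epyc 7601: stock wheel vs. avx2 build, in seconds.
per_iteration = speedup(75.0, 30.0)

# Whole grid search: dual intel E5-2680 vs. Epyc with the avx2 build, in minutes.
grid_search = speedup(2.0, 1.1)

print(f"per-iteration speedup on the Epyc: {per_iteration:.2f}x")
print(f"grid search, Epyc vs. dual intel:  {grid_search:.2f}x")
```

That works out to roughly 2.5x per iteration on the same machine and roughly 1.8x for the full grid search compared to the intel pair.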
Not too shabby for 2 minutes of work.

Organizational Information

ClingDing Spring 2019

January 23rd: Coffee @ Pourhouse
January 30th: NACLO grading party
February 6th: Coffee
February 13th: git tutorial (Yue Chen & Ken Steimel)
February 20th: Noon -- 1: Mel Andresen, 4 -- 5: Coffee
February 27th: Job/Internship hunting
March 6th: Coffee
March 20th: Yue Chen, Ken Steimel :: Cross Language Tagging in Luyia
March 27th: Coffee
April 3rd: Yue, Ken, Leah, Noor, Zhouyu :: Abusive Language Detection, Hai Hu
April 10th: Coffee
April 17th: Noor Abomokh
April 24th: Automated testing in python (Ken Steimel)
May 1st: Coffee

CLINGDING Fall 2019

9-4: Coffee (crumble 10th and college)
9-11: Internship: noor, josephine, ken, hai
9-18: Coffee
9-25: Yue
10-02: Coffee
10-9: Nastia
10-16: Coffee
10-23:
10-30: Coffee
11-6:
11-13: Coffee
11-20:
11-27: Thanksgiving
12-4: Coffee
12-11: Becca
12-18: Coffee

julia

Graphing

The GR package seems to get to a first plot much more quickly than PyPlot.jl, Gadfly.jl or other packages. However, I do like the way Gadfly looks better. VegaLite also seems to be a rather quick package.

Python notes

Docstring example

    class Albatross(object):
        """A bird with a flight speed exceeding that of an unladen swallow.

        Attributes:
            flight_speed     The maximum speed that such a bird can attain.
            nesting_grounds  The locale where these birds congregate to reproduce.
        """
        flight_speed = 691
        nesting_grounds = "Throatwarbler Man Grove"

CRF-suite on AVX2 cpus

Using sklearn-crfsuite or python-crfsuite on an AMD system can be very slow due to optimizations in the precompiled wheel files that are specific to intel processors. I have a branch of python-crfsuite that has flags for avx2 instructions. To see if your cpu supports avx2 instructions, examine the output of lscpu | grep avx2. If anything is returned, then your cpu supports the avx2 instruction set.
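If you would rather do this check from Python, the sketch below looks for the flag in /proc/cpuinfo, which is where Linux lists cpu flags. This is only an illustration under that assumption: the function name has_cpu_flag and the sample text are made up, and on non-Linux systems the file does not exist, so that part is skipped.

```python
from pathlib import Path


def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists `flag`."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match whole flag names so that "avx" does not count as "avx2".
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


# Made-up sample in the format Linux uses, so the sketch runs anywhere.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2\n"
print(has_cpu_flag(sample, "avx2"))

cpuinfo = Path("/proc/cpuinfo")  # Linux only
if cpuinfo.exists():
    print(has_cpu_flag(cpuinfo.read_text(), "avx2"))
```

Unlike grep, this matches complete flag names only, which avoids a false positive from a longer flag that merely contains the string you searched for.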
To use this fork:

1. Create a virtual environment for this version of python-crfsuite

    virtualenv ~/venvs/crfsuite
    source ~/venvs/crfsuite/bin/activate

2. Clone my fork of python-crfsuite

    git clone --recurse-submodules git@github.com:ksteimel/python-crfsuite.git

3. Build python-crfsuite (from inside the cloned python-crfsuite directory)

    python setup.py build
    python setup.py install

4. Install additional dependencies

    pip install sklearn-crfsuite
    pip install scikit-learn

How much does this help?

Even on intel cpus that support avx2 instructions, the time taken to complete a grid search is reduced. For example, a 5 fold grid search with 10 parameter combinations (i.e. 50 total runs) takes 4 minutes to complete using the version on pypi (on a POS tagging problem in a Turkic language). The avx2 compiled version completes in 3 minutes. This becomes more pronounced as the size of the tagset increases.

Using avx only (i.e. changing -mavx2 to -mavx in the setup.py script) still results in improvements to performance. On a pair of intel e5-2680, here are the runtimes for this same benchmark script in minutes.

    n_jobs   avx off   avx on
    8        3.5       2.7
    16       2.8       2.5

Note that the avx on setting with 16 jobs has trouble even keeping the cpus loaded because the grid search runs finish too fast. This is an issue when the number of tags is small, as each task in the grid search finishes very quickly and most of the time is spent on task overhead. With longer running tasks, the difference is more noticeable.

AllenNLP Notes

If you're running pytest and you have your own modules, sometimes it's necessary to run pytest as a python module.

    python -m pytest

This automatically adds the current directory to sys.path, so modules in it can be imported. Just using pytest by itself can cause issues even with python_paths = ./ in your pytest.ini.
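The difference comes from how python sets up sys.path: with python -m the current directory goes at the front of the import path, while running a script puts the script's own directory there instead. The sketch below shows this with plain modules rather than pytest, so it runs without pytest installed. The file names helper_mod.py and checks/use_helper.py are hypothetical stand-ins for your own module and a test file.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run the same file with `python -m` and as a plain script from the project root."""
    # Drop variables that would change sys.path and muddy the comparison.
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # A module at the project root, standing in for "your own modules".
        (root / "helper_mod.py").write_text("VALUE = 42\n")
        pkg = root / "checks"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "use_helper.py").write_text(
            "import helper_mod\nprint(helper_mod.VALUE)\n"
        )
        # -m: the current directory (root) is placed first on sys.path.
        as_module = subprocess.run(
            [sys.executable, "-m", "checks.use_helper"],
            cwd=root, env=env, capture_output=True, text=True,
        )
        # Script path: sys.path starts with checks/, so helper_mod is not found.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "use_helper.py")],
            cwd=root, env=env, capture_output=True, text=True,
        )
    return as_module, as_script


as_module, as_script = run_both_ways()
print("python -m exit code:", as_module.returncode)
print("script exit code:   ", as_script.returncode)
```

The bare pytest command behaves like the script case, because the pytest executable lives in your environment's bin directory rather than in your project.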