Introduction
The Tokyo Institute of Technology has created and maintains a collection of exercises on NLP called "NLP 100 Exercise".
In this article, I will give sample answers to "Chapter 2: UNIX Commands".
Environment settings
The file popular-names.txt stores names of babies born in US with their genders, numbers of births, and years of births in tab-separated format. Create a program with the specifications below. Run the program with popular-names.txt as an input. Furthermore, confirm that the same (similar) result can be obtained by running a UNIX command.
$ wget https://nlp100.github.io/data/popular-names.txt
10. Line count
Count the number of lines of the file. Confirm the result by using wc command.
$ wc -l popular-names.txt
2780 popular-names.txt
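As a cross-check, the count can also be taken in a second way. The following is a minimal sketch, assuming a small hypothetical sample file (sample.txt, created in a temporary directory) in place of popular-names.txt so that it runs without the download:

```shell
# Hypothetical 3-line sample in the same tab-separated layout (not the real data).
workdir=$(mktemp -d)
printf 'Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n' > "$workdir/sample.txt"

# Reading from stdin makes wc print only the number, without the file name.
wc_count=$(wc -l < "$workdir/sample.txt" | tr -d ' ')

# awk's NR is the number of records read so far; in END it is the total.
awk_count=$(awk 'END { print NR }' "$workdir/sample.txt")

echo "wc: $wc_count, awk: $awk_count"
```

Both counts agree here. Note that wc -l counts newline characters, so the two can differ by one if the last line of a file has no trailing newline.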
11. Replace tabs into spaces
Replace every occurrence of a tab character into a space. Confirm the result by using sed, tr, or expand command.
$ head -n 5 popular-names.txt
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
$ sed -e 's/\t/ /g' popular-names.txt | head -n 5
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
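The task also names tr and expand. The sketch below tries both on a small hypothetical sample file (sample.txt is my own stand-in, not the real data) and checks that they produce the same bytes. This can be useful because the `\t` escape in the sed pattern above is understood by GNU sed but may not be by other sed implementations.

```shell
# Hypothetical 2-line sample with real tab characters.
workdir=$(mktemp -d)
printf 'Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n' > "$workdir/sample.txt"

# tr maps every tab character to a single space.
tr '\t' ' ' < "$workdir/sample.txt" > "$workdir/tr.txt"

# expand -t 1 puts a tab stop at every column, so each tab becomes exactly one space.
expand -t 1 "$workdir/sample.txt" > "$workdir/expand.txt"

# cmp -s exits with status 0 only if the two files are byte-for-byte identical.
if cmp -s "$workdir/tr.txt" "$workdir/expand.txt"; then
    same=yes
else
    same=no
fi
echo "identical: $same"
```

Without the `-t 1` option, expand pads to the default tab stops instead, so the output would contain runs of several spaces.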
12. col1.txt from the first column, col2.txt from the second column
Extract the value of the first column of each line, and store the output into col1.txt. Extract the value of the second column of each line, and store the output into col2.txt. Confirm the result by using cut command.
$ cut -f 1 popular-names.txt > col1.txt
$ cut -f 2 popular-names.txt > col2.txt
13. Merging col1.txt and col2.txt
Join the contents of col1.txt and col2.txt, and create a text file whose each line contains the values of the first and second columns (separated by tab character) of the original file. Confirm the result by using paste command.
$ paste col1.txt col2.txt | head -n 5
Mary F
Anna F
Emma F
Elizabeth F
Minnie F
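One way to confirm that the merged output really matches the first two columns of the original file is to compare it with `cut -f 1,2`. This is a minimal sketch on a hypothetical two-line sample (the file names under the temporary directory are my own choice):

```shell
# Hypothetical 2-line sample in the same tab-separated layout.
workdir=$(mktemp -d)
printf 'Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n' > "$workdir/sample.txt"

cut -f 1 "$workdir/sample.txt" > "$workdir/col1.txt"
cut -f 2 "$workdir/sample.txt" > "$workdir/col2.txt"

# paste joins corresponding lines of its inputs with a tab by default.
paste "$workdir/col1.txt" "$workdir/col2.txt" > "$workdir/merged.txt"

# cut -f 1,2 takes both columns straight from the source, also tab-separated.
cut -f 1,2 "$workdir/sample.txt" > "$workdir/expected.txt"

if cmp -s "$workdir/merged.txt" "$workdir/expected.txt"; then
    merged_ok=yes
else
    merged_ok=no
fi
echo "merged matches original columns: $merged_ok"
```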
14. First N lines
Receive a natural number N from a command-line argument, and output the first N lines of the file. Confirm the result by using head command.
$ head -n 5 popular-names.txt
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
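Since the task asks for N to come from an argument, the command above can be wrapped in a small function. This is only a sketch: first_lines is a hypothetical helper name, and names.txt is a made-up five-line sample.

```shell
# Hypothetical 5-line sample.
workdir=$(mktemp -d)
printf 'Mary\nAnna\nEmma\nElizabeth\nMinnie\n' > "$workdir/names.txt"

# first_lines N FILE: print the first N lines of FILE.
# Rejects an empty N or an N that contains anything other than digits.
first_lines() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: first_lines N FILE" >&2; return 1 ;;
    esac
    head -n "$1" "$2"
}

first_two=$(first_lines 2 "$workdir/names.txt")
echo "$first_two"
```

Called with 2, this prints Mary and Anna. A non-numeric argument such as `first_lines abc names.txt` prints the usage message and returns a non-zero status.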
15. Last N lines
Receive a natural number N from a command-line argument, and output the last N lines of the file. Confirm the result by using tail command.
$ tail -n 5 popular-names.txt
Benjamin M 13381 2018
Elijah M 12886 2018
Lucas M 12585 2018
Mason M 12435 2018
Logan M 12352 2018
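One detail that is easy to mix up with tail is the difference between `-n N` and `-n +N`. The sketch below shows both on a hypothetical five-line sample (names.txt is my own stand-in):

```shell
# Hypothetical 5-line sample.
workdir=$(mktemp -d)
printf 'Mary\nAnna\nEmma\nElizabeth\nMinnie\n' > "$workdir/names.txt"

# tail -n 2 prints the last 2 lines.
last_two=$(tail -n 2 "$workdir/names.txt")

# tail -n +2 prints everything from line 2 onward (here: 4 lines).
from_second=$(tail -n +2 "$workdir/names.txt" | wc -l | tr -d ' ')

echo "$last_two"
echo "lines from line 2 onward: $from_second"
```

For this exercise the plain `-n N` form is the one that matches "the last N lines".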
16. Split a file into N pieces
Receive a natural number N from a command-line argument, and split the input file into N pieces at line boundaries. Confirm the result by using split command.
$ split -l 200 popular-names.txt
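The command above fixes the number of lines per piece (200) rather than the number of pieces; with 2780 lines that comes to 14 files. To start from N instead, one option is to derive the lines per piece with a ceiling division. The sketch below assumes a hypothetical 10-line input and the output prefix piece_, both of which are my own choices.

```shell
# Hypothetical 10-line input.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/numbers.txt"

n=4    # desired number of pieces
total=$(wc -l < "$workdir/numbers.txt" | tr -d ' ')

# Ceiling division: lines per piece such that at most n pieces are produced.
lines_per_piece=$(( (total + n - 1) / n ))

# Writes piece_aa, piece_ab, ... into the temporary directory.
split -l "$lines_per_piece" "$workdir/numbers.txt" "$workdir/piece_"

pieces=$(ls "$workdir"/piece_* | wc -l | tr -d ' ')
echo "$total lines -> $pieces pieces of up to $lines_per_piece lines"
```

This prints `10 lines -> 4 pieces of up to 3 lines`. The approach does not always give exactly N files: 9 lines with N = 4 would yield only 3 pieces of 3 lines each. With GNU coreutils, `split -n l/N` is an alternative that produces N pieces without breaking lines.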
17. Distinct strings in the first column
Find distinct strings (a set of strings) of the first column of the file. Confirm the result by using cut, sort, and uniq commands.
$ cut -f 1 popular-names.txt | sort -s | uniq
Abigail
Aiden
Alexander
.
.
.
Virginia
Walter
William
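The same set can be obtained with `sort -u`, which sorts and removes duplicates in one step. This is a minimal sketch on a hypothetical sample with repeated names; the `-s` (stable) flag used above should not change which names appear.

```shell
# Hypothetical sample with repeated names in the first column.
workdir=$(mktemp -d)
printf 'Mary\tF\nAnna\tF\nMary\tF\nEmma\tF\nAnna\tF\n' > "$workdir/sample.txt"

# uniq only collapses adjacent duplicates, so the input must be sorted first.
via_uniq=$(cut -f 1 "$workdir/sample.txt" | sort | uniq)

# sort -u sorts and drops duplicates in one step.
via_sort_u=$(cut -f 1 "$workdir/sample.txt" | sort -u)

distinct_count=$(printf '%s\n' "$via_sort_u" | wc -l | tr -d ' ')
echo "$via_sort_u"
echo "distinct names: $distinct_count"
```

Piping the result to `wc -l`, as done here, is a quick way to see how many distinct names there are.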
18. Sort lines in descending order of the third column
Sort the lines in descending numeric order of the third column (sort lines without changing the content of each line). Confirm the result by using sort command.
$ sort -nrsk 3 ./popular-names.txt | head -n 5
Linda F 99689 1947
Linda F 96211 1948
James M 94757 1947
Michael M 92704 1957
Robert M 91640 1947
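In the command above, `-k 3` defines a key that runs from the third field to the end of the line, and fields are split on blanks by default. A slightly stricter variant, sketched below on a hypothetical three-line sample, sets the separator to a tab and limits the key to the third field only:

```shell
# Hypothetical 3-line sample in the same tab-separated layout.
workdir=$(mktemp -d)
printf 'Anna\tF\t2604\t1880\nMary\tF\t7065\t1880\nEmma\tF\t2003\t1880\n' > "$workdir/sample.txt"

# A literal tab character, obtained portably via printf.
tab=$(printf '\t')

# -t sets the field separator to a tab; -k 3,3nr sorts on field 3 only,
# numerically (n) and in reverse order (r).
sorted=$(sort -t "$tab" -k 3,3nr "$workdir/sample.txt")

top_name=$(printf '%s\n' "$sorted" | head -n 1 | cut -f 1)
echo "$sorted"
```

On this sample the order is Mary (7065), Anna (2604), Emma (2003). For popular-names.txt I would expect the same result as the command above, since the third column is purely numeric.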
19. Frequency of a string in the first column in descending order
Find the frequency of a string in the first column, and sort the strings by descending order of their frequencies. Confirm the result by using cut, uniq, and sort commands.
$ cut -f 1 ./popular-names.txt | sort | uniq -c | sort -rn
118 James
111 William
108 Robert
.
.
.
1 Julie
1 Crystal
1 Carolyn
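As an alternative to `sort | uniq -c`, the counting can be done in awk with an associative array, which does not need the input to be sorted first. This is a sketch on a hypothetical six-line sample (the array and variable names are my own):

```shell
# Hypothetical sample: James x3, Mary x2, Anna x1.
workdir=$(mktemp -d)
printf 'James\tM\nMary\tF\nJames\tM\nJames\tM\nMary\tF\nAnna\tF\n' > "$workdir/sample.txt"

# count[name] accumulates occurrences of each first-column value; END prints
# "count name" pairs, which sort -rn then orders by descending count.
freq=$(awk -F '\t' '{ count[$1]++ } END { for (name in count) print count[name], name }' \
    "$workdir/sample.txt" | sort -rn)

echo "$freq"
```

This prints `3 James`, `2 Mary`, `1 Anna`. Unlike `uniq -c`, the counts are not padded with leading spaces.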