2023-02-03

NLP 100 Exercise ch2: UNIX Commands

Introduction

The Tokyo Institute of Technology has created and maintains a collection of exercises on NLP called "NLP 100 Exercise".

https://nlp100.github.io/en/ch02.html

In this article, I present sample answers to "Chapter 2: UNIX Commands".

Environment settings

The file popular-names.txt stores the names of babies born in the US, along with their gender, number of births, and year of birth, in tab-separated format. Create a program that meets each specification below and run it with popular-names.txt as input. Then confirm that a UNIX command produces the same (or a similar) result.

$ wget https://nlp100.github.io/data/popular-names.txt

10. Line count

Count the number of lines in the file. Confirm the result by using the wc command.

$ wc -l popular-names.txt

2780 popular-names.txt
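
The exercise also asks for a program that produces the same number. The following is a minimal sketch in shell, assuming a hypothetical helper named `count_lines` (it is not part of the exercise material) that reads standard input and lets awk do the counting.

```shell
# Hypothetical helper: count the lines arriving on standard input.
# awk increments NR for every record it reads, so NR in the END block
# holds the total number of lines.
count_lines() {
    awk 'END { print NR }'
}

# Demo on a three-line sample rather than popular-names.txt.
printf 'Mary\nAnna\nEmma\n' | count_lines   # prints 3
```

Running `count_lines < popular-names.txt` should agree with the `wc -l` result above.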

11. Replace tabs into spaces

Replace every tab character with a space. Confirm the result by using the sed, tr, or expand command.

$ head -n 5 popular-names.txt

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880

$ sed -e 's/\t/ /g' popular-names.txt | head -n 5

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
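
Recognizing `\t` inside a sed pattern is a GNU extension, so the command above may behave differently with other sed implementations. As a more portable alternative, the sketch below uses tr, which accepts the `\t` escape under POSIX. The function name `tabs_to_spaces` is an assumption made for illustration.

```shell
# Hypothetical helper: turn every tab on standard input into one space.
tabs_to_spaces() {
    tr '\t' ' '
}

# Demo on a single sample record.
printf 'Mary\tF\t7065\t1880\n' | tabs_to_spaces   # prints: Mary F 7065 1880
```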

12. col1.txt from the first column, col2.txt from the second column

Extract the value of the first column of each line and store the output in col1.txt. Extract the value of the second column of each line and store the output in col2.txt. Confirm the result by using the cut command.

$ cut -f 1 popular-names.txt > col1.txt
$ cut -f 2 popular-names.txt > col2.txt
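
For the "program" half of the task, awk can pick a field by number. This is only a sketch: `nth_column` is a hypothetical name, and the column number is passed as the first argument.

```shell
# Hypothetical helper: print field N of tab-separated standard input.
# -F '\t' sets the field separator to a tab; -v n=... passes the column number.
nth_column() {
    awk -F '\t' -v n="$1" '{ print $n }'
}

# Demo: the second column of two sample records.
printf 'Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n' | nth_column 2   # prints F twice
```

With the real data, `nth_column 1 < popular-names.txt > col1.txt` would play the same role as `cut -f 1`.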

13. Merging col1.txt and col2.txt

Join the contents of col1.txt and col2.txt to create a text file in which each line contains the first and second columns of the original file, separated by a tab character. Confirm the result by using the paste command.

$ paste col1.txt col2.txt | head -n 5

Mary	F
Anna	F
Emma	F
Elizabeth	F
Minnie	F
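
One way to check the merge is to compare it with the first two columns cut directly from the original file. The sketch below assumes a hypothetical helper `columns_match` and runs it on a small temporary sample rather than the real data.

```shell
# Hypothetical helper: succeed when pasting the two column files
# reproduces the first two columns of the original file.
# $1: original TSV, $2: first-column file, $3: second-column file
columns_match() {
    [ "$(paste "$2" "$3")" = "$(cut -f 1,2 "$1")" ]
}

# Demo in a scratch directory.
tmp=$(mktemp -d)
printf 'Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n' > "$tmp/names.txt"
cut -f 1 "$tmp/names.txt" > "$tmp/col1.txt"
cut -f 2 "$tmp/names.txt" > "$tmp/col2.txt"
columns_match "$tmp/names.txt" "$tmp/col1.txt" "$tmp/col2.txt" && echo "columns match"
rm -r "$tmp"
```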

14. First N lines

Receive a natural number $N$ from a command-line argument, and output the first $N$ lines of the file. Confirm the result by using the head command.

$ head -n 5 popular-names.txt

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
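
A program version that takes N as an argument could look like the following sketch. `first_n` is a hypothetical name; the awk pattern `NR <= n` prints a line only while the record number is at most N.

```shell
# Hypothetical helper: print the first N lines of standard input.
# $1: N, a natural number
first_n() {
    awk -v n="$1" 'NR <= n'
}

# Demo: the first two of four lines.
printf '1\n2\n3\n4\n' | first_n 2   # prints 1 and 2
```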

15. Last N lines

Receive a natural number $N$ from a command-line argument, and output the last $N$ lines of the file. Confirm the result by using the tail command.

$ tail -n 5 popular-names.txt

Benjamin	M	13381	2018
Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018
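
Printing the last N lines needs a little state, because the total length is unknown until the input ends. The sketch below keeps the most recent N lines in a ring buffer inside awk. `last_n` is a hypothetical name, and N is assumed to be at least 1.

```shell
# Hypothetical helper: print the last N lines of standard input.
# $1: N, a natural number (N >= 1)
last_n() {
    awk -v n="$1" '
        { buf[NR % n] = $0 }                 # overwrite the oldest slot
        END {
            for (i = NR - n + 1; i <= NR; i++)
                if (i > 0) print buf[i % n]  # skip indices before line 1
        }'
}

# Demo: the last two of four lines.
printf '1\n2\n3\n4\n' | last_n 2   # prints 3 and 4
```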

16. Split a file into N pieces

Receive a natural number $N$ from a command-line argument, and split the input file into $N$ pieces at line boundaries. Confirm the result by using the split command.

$ split -l 200 popular-names.txt
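
Note that `split -l 200` fixes the number of lines per piece (14 files for 2780 lines) rather than the number of pieces. To start from N instead, one option is to derive the lines per piece by ceiling division, as in the sketch below. The function name `split_into_n` and the output prefix `part_` are assumptions made for illustration.

```shell
# Hypothetical helper: split a file into at most N pieces at line boundaries.
# $1: input file, $2: N (natural number), $3: output prefix (default: x)
split_into_n() {
    file=$1
    pieces=$2
    prefix=${3:-x}
    total=$(awk 'END { print NR }' "$file")          # number of lines
    per_piece=$(( (total + pieces - 1) / pieces ))   # ceiling of total / N
    split -l "$per_piece" "$file" "$prefix"
}

# Demo: 10 lines into 3 pieces gives files of 4, 4, and 2 lines.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$tmp/in.txt"
split_into_n "$tmp/in.txt" 3 "$tmp/part_"
wc -l "$tmp"/part_*
rm -r "$tmp"
```

For popular-names.txt and N = 5 this works out to 556 lines per piece. Ceiling division can produce fewer than N pieces for some inputs (10 lines with N = 6 gives 5 pieces). GNU split also offers `-n l/N`, which produces N chunks without breaking lines, though it balances the chunks by size in bytes rather than by line count.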

17. Distinct strings in the first column

Find the distinct strings (a set of strings) in the first column of the file. Confirm the result by using the cut, sort, and uniq commands.

$ cut -f 1 popular-names.txt | sort -s | uniq

Abigail
Aiden
Alexander
.
.
.
Virginia
Walter
William
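
The same set can be built without uniq by remembering which names have already been seen. The sketch below uses an awk associative array for that; `distinct_first_column` is a hypothetical name, and the final sort is only there to make the order match the output above.

```shell
# Hypothetical helper: print each distinct first-column value once, sorted.
# seen[$1]++ is 0 (false) the first time a name appears, so the negated
# pattern matches on the first occurrence only.
distinct_first_column() {
    awk -F '\t' '!seen[$1]++ { print $1 }' | sort
}

# Demo: Mary appears twice in the input but once in the output.
printf 'Mary\tF\nAnna\tF\nMary\tF\n' | distinct_first_column   # prints Anna, then Mary
```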

18. Sort lines in descending order of the third column

Sort the lines in descending numeric order of the third column (sort lines without changing the content of each line). Confirm the result by using the sort command.

$ sort -nrsk 3 ./popular-names.txt | head -n 5

Linda	F	99689	1947
Linda	F	96211	1948
James	M	94757	1947
Michael	M	92704	1957
Robert	M	91640	1947
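
In the command above, `-k 3` defines a key that runs from the third field to the end of the line. It still works here because `-n` reads only the leading number. A more explicit variant restricts the key to the third field and names the tab as the separator, as in this sketch. `sort_by_third_desc` is a hypothetical name.

```shell
# Hypothetical helper: sort tab-separated lines by the third field,
# numerically and in descending order.
# -t sets the separator to a tab, -k 3,3nr limits the key to field 3,
# and -s keeps the input order among lines with equal keys.
sort_by_third_desc() {
    sort -s -t "$(printf '\t')" -k 3,3nr
}

# Demo: counts 5, 10, 7 come out as 10, 7, 5.
printf 'a\tF\t5\t1880\nb\tF\t10\t1880\nc\tF\t7\t1880\n' | sort_by_third_desc
```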

19. Frequency of a string in the first column in descending order

Find the frequency of each string in the first column, and sort the strings in descending order of frequency. Confirm the result by using the cut, uniq, and sort commands.

$ cut -f 1 ./popular-names.txt | sort | uniq -c | sort -rn

 118 James
 111 William
 108 Robert
 .
 .
 .
   1 Julie
   1 Crystal
   1 Carolyn
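
The counting step can also be done in a single pass with an awk associative array, which removes the need to sort before counting. This is a sketch under that assumption; `name_frequencies` is a hypothetical name, and the output is not padded the way `uniq -c` pads it.

```shell
# Hypothetical helper: count first-column values and list them by
# descending frequency as "<count> <name>".
name_frequencies() {
    awk -F '\t' '
        { count[$1]++ }
        END { for (name in count) print count[name], name }' | sort -rn
}

# Demo: James appears three times, Mary twice, Anna once.
printf 'James\tM\nMary\tF\nJames\tM\nJames\tM\nMary\tF\nAnna\tF\n' | name_frequencies
# prints "3 James", "2 Mary", "1 Anna"
```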

References

https://nlp100.github.io/en/about.html
https://nlp100.github.io/en/ch02.html

Ryusei Kakujo
