2023-02-03

NLP 100 Exercise ch3: Regular Expression

Introduction

The Tokyo Institute of Technology created and maintains a collection of NLP exercises called "NLP 100 Exercise".

https://nlp100.github.io/en/ch03.html

In this article, I will present sample answers to "Chapter 3: Regular Expression".

Environment settings

The file enwiki-country.json.gz stores Wikipedia articles in the format:

  • Each line stores a Wikipedia article in JSON format
  • Each JSON document has key-value pairs:
    • Title of the article as the value for the title key
    • Body of the article as the value for the text key
  • The entire file is compressed by gzip

Write code that performs the following jobs. First, download the dataset:

$ wget https://nlp100.github.io/data/enwiki-country.json.gz
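The file format above can also be handled without pandas. The sketch below builds a tiny in-memory file in the same one-JSON-document-per-line, gzip-compressed layout (fake articles with placeholder bodies) and reads it back using only the standard library:

```python
import gzip
import io
import json

# The dataset stores one JSON document per line, gzip-compressed.
# Build a small in-memory file in that format, then read it back.
articles_in = [
    {"title": "United Kingdom", "text": "..."},
    {"title": "Japan", "text": "..."},
]
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    for article in articles_in:
        f.write(json.dumps(article) + "\n")

buf.seek(0)
with gzip.open(buf, "rt", encoding="utf-8") as f:
    articles = [json.loads(line) for line in f]

print([a["title"] for a in articles])  # ['United Kingdom', 'Japan']
```

pandas does the same work in one call (`pd.read_json(..., lines=True)` infers the gzip compression from the file extension), which is why the answers below use it.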

20. Read JSON documents

Read the JSON documents and output the body of the article about the United Kingdom. Reuse the output in problems 21-29.

import pandas as pd

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
print(wiki_uk)
{{reflist|group=note}}

==References==
{{reflist|colwidth=30em}}

==External links==
{{Sister project links|n=Category:United Kingdom|voy=United Kingdom|d=Q145}}
; Government
.
.
.

21. Lines with category names

Extract lines that define the categories of the article.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'^(.*\[\[Category:.*\]\].*)$'
result = '\n'.join(re.findall(pattern, wiki_uk, re.MULTILINE))
print(result)
[[Category:United Kingdom| ]]
[[Category:British Islands]]
[[Category:Countries in Europe]]
[[Category:English-speaking countries and territories]]
[[Category:G7 nations]]
[[Category:Group of Eight nations]]
[[Category:G20 nations]]
[[Category:Island countries]]
[[Category:Northern European countries]]
[[Category:Former member states of the European Union]]
[[Category:Member states of NATO]]
[[Category:Member states of the Commonwealth of Nations]]
[[Category:Member states of the Council of Europe]]
[[Category:Member states of the Union for the Mediterranean]]
[[Category:Member states of the United Nations]]
[[Category:Priority articles for attention after Brexit]]
[[Category:Western European countries]]

22. Category names

Extract the category names of the article.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'^.*\[\[Category:(.*?)(?:\|.*)?\]\].*$'
result = '\n'.join(re.findall(pattern, wiki_uk, re.MULTILINE))
print(result)
United Kingdom
British Islands
Countries in Europe
English-speaking countries and territories
G7 nations
Group of Eight nations
G20 nations
Island countries
Northern European countries
Former member states of the European Union
Member states of NATO
Member states of the Commonwealth of Nations
Member states of the Council of Europe
Member states of the Union for the Mediterranean
Member states of the United Nations
Priority articles for attention after Brexit
Western European countries
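The non-capturing group `(?:\|.*)?` is what drops sort keys: a category link may carry one after `|`, as in the first line of the output above. A quick check on two sample lines:

```python
import re

# Category links may carry a sort key after '|' ([[Category:Name|sortkey]]);
# the non-capturing (?:\|.*)? group matches and discards it.
pattern = r'\[\[Category:(.*?)(?:\|.*)?\]\]'
sample = '[[Category:United Kingdom| ]]\n[[Category:G7 nations]]'
names = re.findall(pattern, sample)
print(names)  # ['United Kingdom', 'G7 nations']
```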

23. Section structure

Extract section names in the article with their levels. For example, the level of the section is 1 for the MediaWiki markup "== Section name ==".

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'^(\={2,})\s*(.+?)\s*(\={2,}).*$'
result = '\n'.join(i[1] + ':' + str(len(i[0]) - 1) for i in re.findall(pattern, wiki_uk, re.MULTILINE))
print(result)
Etymology and terminology:1
History:1
Background:2
.
.
.
Notes:1
References:1
External links:1
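The level arithmetic can be verified on a small sample: the level is the number of `=` signs minus one, so `==` marks level 1 and `===` marks level 2:

```python
import re

# The section level is the number of '=' signs minus one:
# '== History ==' is level 1, '=== Background ===' is level 2.
sample = "== History ==\n=== Background ===\n"
pattern = r'^(={2,})\s*(.+?)\s*(={2,}).*$'
levels = [f"{name}:{len(marks) - 1}"
          for marks, name, _ in re.findall(pattern, sample, re.MULTILINE)]
print(levels)  # ['History:1', 'Background:2']
```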

24. Media references

Extract references to media files linked from the article.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'\[\[File:(.+?)\|'
result = '\n'.join(re.findall(pattern, wiki_uk))
print(result)
Royal Coat of Arms of the United Kingdom.svg
Royal Coat of Arms of the United Kingdom (Scotland).svg
Europe-UK (orthographic projection).svg
.
.
.
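One caveat: the pattern above stops at the first `|`, so it only finds file links that carry parameters. A variant that also accepts a bare `[[File:x.png]]` (a sketch; nested markup inside captions would still defeat it):

```python
import re

# Accept both [[File:x.png]] and [[File:x.png|params]]; the optional
# lazy (?:\|.*?)? group consumes the parameter list when present.
pattern = r'\[\[File:(.+?)(?:\|.*?)?\]\]'
sample = '[[File:Flag.svg|thumb|caption]] and [[File:Map.png]]'
files = re.findall(pattern, sample)
print(files)  # ['Flag.svg', 'Map.png']
```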

25. Infobox

Extract field names and their values in the Infobox “country”, and store them in a dictionary object.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

result = dict(result)
for k, v in result.items():
    print(k + ':' + v)
common_name:United Kingdom
linking_name:the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
conventional_long_name:United Kingdom of Great Britain and Northern Ireland
image_flag:Flag of the United Kingdom.svg
.
.
.

26. Remove emphasis markups

In addition to the process of problem 25, remove the MediaWiki emphasis markup from the values. See Help:Cheatsheet.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]: pattern.sub('', i[1]) for i in result}

for k, v in result.items():
    print(k + ':' + v)
common_name:United Kingdom
 linking_name:the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
 conventional_long_name:United Kingdom of Great Britain and Northern Ireland
 image_flag:Flag of the United Kingdom.svg
.
.
.
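The emphasis rule itself is simple: MediaWiki marks italics with two apostrophes, bold with three, and bold italics with five, so any run of two to five apostrophes can be deleted outright:

```python
import re

# ''italic'', '''bold''', '''''bold italic''''' are all runs of
# 2-5 apostrophes, so one substitution strips every emphasis marker.
sample = "''italic'' and '''bold''' and '''''both'''''"
cleaned = re.sub(r"'{2,5}", '', sample)
print(cleaned)  # italic and bold and both
```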

27. Remove internal links

In addition to the process of problem 26, remove internal links from the values. See Help:Cheatsheet.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]:pattern.sub('', i[1]) for i in result}

# remove inner link
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
result = {k: re.sub(pattern, r'\1', v) for k, v in result.items()}

for k, v in result.items():
    print(k + ':' + v)
linking_name:the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
 conventional_long_name:United Kingdom of Great Britain and Northern Ireland
 image_flag:Flag of the United Kingdom.svg
.
.
.
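Internal links come in two shapes, `[[Target]]` and `[[Target|display text]]`. The lazy optional group `(?:[^|]*?\|)??` consumes `Target|` only when a display text follows, so group 1 is always the visible text:

```python
import re

# [[United Kingdom]] keeps its own text; [[London|its capital]]
# is replaced by the display text after the pipe.
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
sample = 'the [[United Kingdom]] and [[London|its capital]]'
cleaned = re.sub(pattern, r'\1', sample)
print(cleaned)  # the United Kingdom and its capital
```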

28. Remove MediaWiki markups

In addition to the process of problem 27, remove MediaWiki markups from the values as much as you can, and obtain the basic information of the country in plain text format.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]:pattern.sub('', i[1]) for i in result}

# remove inner link
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
result = {k: re.sub(pattern, r'\1', v) for k, v in result.items()}

# remove outer link
pattern = r'https?://[\w!?/\+\-_~=;\.,*&@#$%\(\)\'\[\]]+'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

# remove html tag
pattern = r'<.+?>'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

for k, v in result.items():
    print(k + ':' + v)
common_name:United Kingdom
 linking_name:the United Kingdom
 conventional_long_name:United Kingdom of Great Britain and Northern Ireland
 image_flag:Flag of the United Kingdom.svg
.
.
.

29. Country flag

Obtain the URL of the country flag by using the analysis result of Infobox. (Hint: convert a file reference to a URL by calling imageinfo in MediaWiki API)

import pandas as pd
import re
import requests

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]:pattern.sub('', i[1]) for i in result}

# remove inner link
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
result = {k: re.sub(pattern, r'\1', v) for k, v in result.items()}

# remove outer link
pattern = r'https?://[\w!?/\+\-_~=;\.,*&@#$%\(\)\'\[\]]+'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

# remove html tag
pattern = r'<.+?>'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

# get url
url_file = result[' image_flag'].replace(' ', '_')  # the key keeps the leading space captured right after '|'
url = 'https://commons.wikimedia.org/w/api.php?action=query&titles=File:' + url_file + '&prop=imageinfo&iiprop=url&format=json'
data = requests.get(url)
print(re.search(r'"url":"(.+?)"', data.text).group(1))
https://upload.wikimedia.org/wikipedia/commons/8/83/Flag_of_the_United_Kingdom_%283-5%29.svg
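Instead of regex-matching the raw response body, the JSON can be parsed structurally: the imageinfo response nests the URL under query → pages → page id → imageinfo. A sketch on an abridged response (the page id "12345" here is made up):

```python
import json

# Abridged shape of an imageinfo API response; page id is hypothetical.
sample_response = json.loads('''
{"query": {"pages": {"12345": {"imageinfo": [
  {"url": "https://upload.wikimedia.org/wikipedia/commons/8/83/Flag_of_the_United_Kingdom_%283-5%29.svg"}
]}}}}
''')
# The page id is not known in advance, so take the first (only) page.
page = next(iter(sample_response['query']['pages'].values()))
flag_url = page['imageinfo'][0]['url']
print(flag_url)
```

With `requests`, the same parsing applies to `requests.get(url).json()` in place of the regex search above.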

References

https://nlp100.github.io/en/about.html
https://nlp100.github.io/en/ch03.html

Ryusei Kakujo
