Traffine I/O


2023-02-03

NLP 100 Exercise Chapter 3: Regular Expression

Introduction

Tokyo Institute of Technology has created and maintains a collection of NLP exercises called "NLP 100 Exercise".

https://nlp100.github.io/en/ch03.html

In this article, I will provide sample answers to "Chapter 3: Regular Expression".

Environment setup

The file enwiki-country.json.gz stores Wikipedia articles in the following format:

  • Each line stores one Wikipedia article in JSON format
  • Each JSON document has key-value pairs:
    • The article title as the value for the title key
    • The article body as the value for the text key
  • The entire file is compressed with gzip

Write code that performs the following jobs. First, download the dataset:

$ wget https://nlp100.github.io/data/enwiki-country.json.gz
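Since each line of the archive is an independent JSON document, the file can also be streamed with the standard library instead of loading everything through pandas. A minimal sketch, using a tiny made-up stand-in file (sample-country.json.gz and its two documents are illustrative, not part of the dataset):

```python
import gzip
import json

# Build a tiny stand-in file with the same line-per-article layout
# as enwiki-country.json.gz (contents here are made up).
sample = [
    {"title": "United Kingdom", "text": "...article body..."},
    {"title": "Japan", "text": "...article body..."},
]
with gzip.open("sample-country.json.gz", "wt", encoding="utf-8") as f:
    for doc in sample:
        f.write(json.dumps(doc) + "\n")

def article_text(path, title):
    """Scan a gzipped JSON Lines file and return the text of one article."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if doc["title"] == title:
                return doc["text"]
    return None

print(article_text("sample-country.json.gz", "United Kingdom"))
```

This avoids holding every article in memory at once, at the cost of re-scanning the file per lookup.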

20. Read JSON documents

Read the JSON documents and output the body of the article about the United Kingdom. Reuse the output in problems 21-29.

import pandas as pd

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
print(wiki_uk)
{{reflist|group=note}}

==References==
{{reflist|colwidth=30em}}

==External links==
{{Sister project links|n=Category:United Kingdom|voy=United Kingdom|d=Q145}}
; Government
.
.
.

21. Lines with category names

Extract lines that define the categories of the article.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'^(.*\[\[Category:.*\]\].*)$'
result = '\n'.join(re.findall(pattern, wiki_uk, re.MULTILINE))
print(result)
[[Category:United Kingdom| ]]
[[Category:British Islands]]
[[Category:Countries in Europe]]
[[Category:English-speaking countries and territories]]
[[Category:G7 nations]]
[[Category:Group of Eight nations]]
[[Category:G20 nations]]
[[Category:Island countries]]
[[Category:Northern European countries]]
[[Category:Former member states of the European Union]]
[[Category:Member states of NATO]]
[[Category:Member states of the Commonwealth of Nations]]
[[Category:Member states of the Council of Europe]]
[[Category:Member states of the Union for the Mediterranean]]
[[Category:Member states of the United Nations]]
[[Category:Priority articles for attention after Brexit]]
[[Category:Western European countries]]

22. Category names

Extract the category names of the article.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'^.*\[\[Category:(.*?)(?:\|.*)?\]\].*$'
result = '\n'.join(re.findall(pattern, wiki_uk, re.MULTILINE))
print(result)
United Kingdom
British Islands
Countries in Europe
English-speaking countries and territories
G7 nations
Group of Eight nations
G20 nations
Island countries
Northern European countries
Former member states of the European Union
Member states of NATO
Member states of the Commonwealth of Nations
Member states of the Council of Europe
Member states of the Union for the Mediterranean
Member states of the United Nations
Priority articles for attention after Brexit
Western European countries
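To see how the problem 22 pattern behaves, here it is on two toy lines: the non-greedy group captures the category name, and the optional `(?:\|.*)?` swallows a sort key such as the trailing "| " in [[Category:United Kingdom| ]].

```python
import re

# Toy check of the category-name pattern from problem 22.
sample = "[[Category:United Kingdom| ]]\n[[Category:Island countries]]"
pattern = r'^.*\[\[Category:(.*?)(?:\|.*)?\]\].*$'
print(re.findall(pattern, sample, re.MULTILINE))
# → ['United Kingdom', 'Island countries']
```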

23. Section structure

Extract the section names in the article with their levels. For example, the level of a section is 1 for the MediaWiki markup "== Section name ==".

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'^(\={2,})\s*(.+?)\s*(\={2,}).*$'
result = '\n'.join(i[1] + ':' + str(len(i[0]) - 1) for i in re.findall(pattern, wiki_uk, re.MULTILINE))
print(result)
Etymology and terminology:1
History:1
Background:2
.
.
.
Notes:1
References:1
External links:1
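The third group in the pattern above matches the closing run of '=' independently of the opening one. A small variant (not the answer above, just a sketch) uses a backreference \1 instead, which guarantees both sides have the same length, so a malformed heading like "=== Title ==" is not silently accepted:

```python
import re

# Heading pattern with a backreference: the closing '=' run must
# equal the opening run captured in group 1.
sample = "==History==\n===Background===\n== External links =="
pattern = r'^(={2,})\s*(.+?)\s*\1\s*$'
for marks, name in re.findall(pattern, sample, re.MULTILINE):
    print(f"{name}:{len(marks) - 1}")
```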

24. Media references

Extract references to media files linked from the article.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]
pattern = r'\[\[File:(.+?)\|'
result = '\n'.join(re.findall(pattern, wiki_uk))
print(result)
Royal Coat of Arms of the United Kingdom.svg
Royal Coat of Arms of the United Kingdom (Scotland).svg
Europe-UK (orthographic projection).svg
.
.
.
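Note that the pattern captures the file name up to the first "|", so it assumes every [[File:...]] link carries at least one option; a bare link with no options would be missed, as this toy string (made up for illustration) shows:

```python
import re

# The problem 24 pattern only matches File links that have a "|" option.
pattern = r'\[\[File:(.+?)\|'
sample = "[[File:Uk topo en.jpg|thumb|Topography]] [[File:Bare.svg]]"
print(re.findall(pattern, sample))
# → ['Uk topo en.jpg']
```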

25. Infobox

Extract the field names and their values in the "country" Infobox, and store them in a dictionary object.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

result = dict(result)
for k, v in result.items():
    print(k + ':' + v)
common_name:United Kingdom
linking_name:the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
conventional_long_name:United Kingdom of Great Britain and Northern Ireland
image_flag:Flag of the United Kingdom.svg
.
.
.
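The two lookaheads are what terminate each value: with DOTALL a value runs lazily across newlines until either the next line that starts a new "|" field or the final newline of the template. A toy template (made up here) shows this, and also shows that the captured keys keep the leading space after "|", which is why problem 29 later indexes `result[' image_flag']`:

```python
import re

# Field pattern from problem 25 on a two-field toy template.
toy = "| common_name = United Kingdom\n| capital = London\n"
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
print(dict(re.findall(pattern, toy, re.MULTILINE + re.DOTALL)))
# → {' common_name': 'United Kingdom', ' capital': 'London'}
```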

26. Remove emphasis markups

In addition to the process of problem 25, remove MediaWiki emphasis markup from the values. See Help:Cheatsheet.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]: pattern.sub('', i[1]) for i in result}

for k, v in result.items():
    print(k + ':' + v)
common_name:United Kingdom
 linking_name:the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
 conventional_long_name:United Kingdom of Great Britain and Northern Ireland
 image_flag:Flag of the United Kingdom.svg
.
.
.
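MediaWiki emphasis is written as runs of apostrophes: ''italic'' is two, '''bold''' is three, and '''''bold italic''''' is five, which is exactly the range '{2,5} strips:

```python
import re

# Stripping emphasis quotes from a toy value string.
pattern = re.compile(r"\'{2,5}")
print(pattern.sub('', "''italic'' and '''bold''' text"))
# → italic and bold text
```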

27. Remove internal links

In addition to the process of problem 26, remove internal links from the values. See Help:Cheatsheet.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]:pattern.sub('', i[1]) for i in result}

# remove inner link
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
result = {k: re.sub(pattern, r'\1', v) for k, v in result.items()}

for k, v in result.items():
    print(k + ':' + v)
linking_name:the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
 conventional_long_name:United Kingdom of Great Britain and Northern Ireland
 image_flag:Flag of the United Kingdom.svg
.
.
.
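The internal-link pattern handles both link forms: [[target]] keeps the target, while [[target|label]] keeps only the label, because the lazy optional group (?:[^|]*?\|)?? discards everything up to the pipe. On a toy value:

```python
import re

# Internal-link removal from problem 27 on both link forms.
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
print(re.sub(pattern, r'\1', "[[London]] and [[England|English]] law"))
# → London and English law
```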

28. Remove MediaWiki markups

In addition to the process of problem 27, remove MediaWiki markup from the values, and obtain the basic information of the country in plain text format.

import pandas as pd
import re

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]:pattern.sub('', i[1]) for i in result}

# remove inner link
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
result = {k: re.sub(pattern, r'\1', v) for k, v in result.items()}

# remove outer link
pattern = r'https?://[\w!?/\+\-_~=;\.,*&@#$%\(\)\'\[\]]+'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

# remove html tag
pattern = r'<.+?>'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

for k, v in result.items():
    print(k + ':' + v)
common_name:United Kingdom
 linking_name:the United Kingdom
 conventional_long_name:United Kingdom of Great Britain and Northern Ireland
 image_flag:Flag of the United Kingdom.svg
.
.
.

29. Country flag

Obtain the URL of the country flag by using the result of the Infobox analysis. (Hint: convert the file reference to a URL by calling imageinfo in the MediaWiki API.)

import pandas as pd
import re
import requests

df = pd.read_json('enwiki-country.json.gz', lines=True)
wiki_uk = df.query('title == "United Kingdom"')['text'].values[0]

# extract template
pattern = r'^\{\{Infobox.*?$(.*?)^\}\}'
template = re.findall(pattern, wiki_uk, re.MULTILINE + re.DOTALL)

# get field name and value
pattern = r'^\|(.+?)\s*=\s*(.+?)(?:(?=\n\|)|(?=\n$))'
result = re.findall(pattern, template[0], re.MULTILINE + re.DOTALL)

# remove emphasis
pattern = re.compile(r'\'{2,5}', re.MULTILINE + re.S)
result = {i[0]:pattern.sub('', i[1]) for i in result}

# remove inner link
pattern = r'\[\[(?:[^|]*?\|)??([^|]*?)\]\]'
result = {k: re.sub(pattern, r'\1', v) for k, v in result.items()}

# remove outer link
pattern = r'https?://[\w!?/\+\-_~=;\.,*&@#$%\(\)\'\[\]]+'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

# remove html tag
pattern = r'<.+?>'
result = {k: re.sub(pattern, '', v) for k, v in result.items()}

# get url
url_file = result[' image_flag'].replace(' ', '_')
url = 'https://commons.wikimedia.org/w/api.php?action=query&titles=File:' + url_file + '&prop=imageinfo&iiprop=url&format=json'
data = requests.get(url)
print(re.search(r'"url":"(.+?)"', data.text).group(1))
https://upload.wikimedia.org/wikipedia/commons/8/83/Flag_of_the_United_Kingdom_%283-5%29.svg
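Instead of running a regex over the raw response text, the imageinfo response can be parsed as JSON: the pages object is keyed by a page id that is not known in advance, so we take the first page and read its imageinfo list. The canned response below only mimics the API's JSON shape (the page id shown is illustrative):

```python
import json

# Canned stand-in for the MediaWiki imageinfo response (shape only;
# the page id "23473560" is illustrative).
canned = json.dumps({
    "query": {
        "pages": {
            "23473560": {
                "imageinfo": [
                    {"url": "https://upload.wikimedia.org/wikipedia/commons/8/83/Flag_of_the_United_Kingdom_%283-5%29.svg"}
                ]
            }
        }
    }
})
data = json.loads(canned)
# Pages are keyed by page id, so grab the first (and only) page.
page = next(iter(data["query"]["pages"].values()))
print(page["imageinfo"][0]["url"])
```

With a real request, the same traversal would run on `requests.get(url).json()` rather than a regex over `data.text`.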

References

https://nlp100.github.io/en/about.html
https://nlp100.github.io/en/ch03.html

Ryusei Kakujo


Focusing on data science for mobility
