Commit @93fca0ad3d864d80ed211cd6c40c249104de33bc - yjyoon/구미시-네이버-블로그-스크레퍼

윤영준 2023-12-12

Fixed a critical bug where the title was not properly gathered.

e754d84

93fca0a

naver_blog_info_gatherer.py

--- naver_blog_info_gatherer.py

+++ naver_blog_info_gatherer.py
@@ -23,6 +23,12 @@


 HEADER = {"User-Agent": "Mozilla/119.0 (Windows NT 10.0; Win64; x64) Chrome/98.0.4758.102"}
 
 
+def remove_html(strr):
+    # print(strr)
+    cleaning = regex.sub(r'<.*?>', '', strr.strip().replace("\n",' '))
+    cleaning = regex.sub(r' +', ' ', cleaning)
+    return cleaning
+
 # Function to remove tags
 def remove_tags(html):
     # parse html content
@@ -118,12 +124,13 @@

         href = regex.search(
             r"https:\/\/blog\.naver\.com\/[\uAC00-\uD7AFa-zA-Z0-9_-]+\/[\uAC00-\uD7AFa-zA-Z0-9_-]+",
             content, re.DOTALL)
-        title = regex.search(r'<strong class="title_post">(.*?)<\/strong>', content)
+        # title is using bit different approach since the search engine highlights for search keywords
+        title = regex.search(r'<strong class="title_post">(.*?)<\/span>', content)
         text = regex.search(r'<!-- ngIf: post.contents -->(.*?)<\/a>', content)
         author_name = regex.search(r'<em class="name_author">(.*?)<\/em>', content)
         post_date = regex.search(r'<span class="date">(.*?)<\/span>', content)
         href = href.group(0)
-        title = remove_tags(title.group(1))
+        title = remove_html(title.group(1))
         text = remove_tags(text.group(1))
         author_name = remove_tags(author_name.group(1))
         post_date = remove_tags((post_date.group(1)))
@@ -228,4 +235,5 @@

 
 
 if __name__ == "__main__":
-    naver_blog_scrapper("구미 송정동", "2022-01-01", "2023-10-31", 7, 50, 1, 12)
+    #TODO start_page_num must be not working as intended
+    naver_blog_scrapper("도개면", "2022-01-01", "2023-10-31", 100, 50, 1, 12)
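
The effect of the title change can be illustrated with a small sketch. The markup in SAMPLE is an assumption about how the search results page wraps a highlighted keyword, not a capture of the real page, and the standard-library `re` module stands in for the `regex` module used in the commit. Under that assumption, the old lazy pattern stops at the first `</strong>`, which closes the keyword highlight rather than the title, while the new pattern runs to `</span>` and leaves tag removal to `remove_html`.

```python
import re


def remove_html(text):
    """Strip anything that looks like an HTML tag, then collapse runs of spaces.

    Mirrors the helper added in this commit, using `re` instead of `regex`.
    """
    cleaned = re.sub(r"<.*?>", "", text.strip().replace("\n", " "))
    return re.sub(r" +", " ", cleaned)


# Hypothetical markup (an assumption): the search page wraps the matched
# keyword in its own <strong> element inside the title.
SAMPLE = (
    '<strong class="title_post"><span>'
    '<strong class="search_keyword">도개면</strong> 나들이 후기'
    "</span></strong>"
)

OLD_PATTERN = r'<strong class="title_post">(.*?)</strong>'
NEW_PATTERN = r'<strong class="title_post">(.*?)</span>'

# The old pattern ends at the keyword's closing tag, so the title is cut short.
old_title = remove_html(re.search(OLD_PATTERN, SAMPLE).group(1))

# The new pattern captures the whole span; remove_html drops the nested tags.
new_title = remove_html(re.search(NEW_PATTERN, SAMPLE).group(1))

print(repr(old_title))
print(repr(new_title))
```

If the real page nests the highlight differently, the patterns would need to be adjusted; parsing the fragment with an HTML parser would be less sensitive to such changes than a lazy regular expression.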

