Question

A Big-Data Processing Task: We need to find the 15 most frequently used words on a set of Wikipedia pages. Problem 2: Write a Python program to finish the above big-data processing task. Use the urllib or urllib2 module to download a pa...

The URLs are part of the assignment; the program is a web scraper for Wikipedia pages.

Answer #1
A screenshot of the Python program was included to show the indentation; it duplicates the code given below, where every line is commented.
Below is the output of the program (count, then word):
229 a
201 0
131 the
120 of
118 in
92 ext
80 aa
66 disambiguation
61 edit
60 to
59 and
51 ready
43 ac
40 mW
40 value
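The `0` entry in the output is probably explained by the character list in the answer's code, which contains the digits 1 to 9 but not 0. The following is a minimal sketch to check this on a small string; the sample text is a made-up assumption, and only the character list and the regex are taken from the answer.

```python
import re

# Character list copied from the answer's code: digits 1-9 are present, 0 is not.
punc = '''!()-[]{};:'"\\, <>./?@#$%^&*_~123456789'''

# Hypothetical sample text, not taken from any Wikipedia page.
sample = "In 2010, 100 pages (0 errors)."

cleaned = sample
for ch in punc:
    # Replace every listed character with a space, as the answer does.
    cleaned = cleaned.replace(ch, " ")

# Same word pattern as the answer, applied to the lowercased text.
tokens = re.findall(r'\b\S+\b', cleaned.lower())
print(tokens)
```

Any zero that survives the replacement is counted as a word of its own, which would make `0` rank high on pages that contain many numbers.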

Below is the code to copy:
#CODE STARTS HERE----------------
import re
import urllib.request
from bs4 import BeautifulSoup

#Add the list of urls here
urls = ['https://en.wikipedia.org/wiki/AA', 'https://en.wikipedia.org/wiki/AB',
      'https://en.wikipedia.org/wiki/AC','https://en.wikipedia.org/wiki/ZY']

web_text = "" #Used to store all the text
for url in urls:
   res  = urllib.request.urlopen(url) #Make a request to the url
   soup = BeautifulSoup(res.read(),'html.parser') #Parse the html into a soup object
   web_text = web_text+soup.text #Keep only the text and leave out all the html tags

#Remove punctuation and the digits 1-9 (0 is not in the list, so "0" still counts as a word)
punc = '''!()-[]{};:'"\\, <>./?@#$%^&*_~123456789'''
for ele in punc: #Loop through each character to remove
   web_text = web_text.replace(ele, " ") #Replace every occurrence with " "

counts = dict() #Dictionary for counter
for word in re.findall(r'\b\S+\b', web_text): #Find words using regex
   word = word.lower() #Make the words into lowercase
   counts[word] = counts.get(word, 0) + 1 #Count the words

#Sort the words and print the 15 most common words
for k, v in sorted(counts.items(), key=lambda item: item[1], reverse=True)[:15]:
   print(v,k)
#CODE ENDS HERE------------------
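The counting and sorting steps above can also be written with `collections.Counter` from the standard library. This is a minimal sketch, not the answer's original method: the function name `top_words` and the sample string are hypothetical, and the character-replacement loop is swapped for a regex that keeps only runs of ASCII letters. As a result it drops tokens such as `0`, splits words at apostrophes, and ignores non-ASCII letters.

```python
import re
from collections import Counter


def top_words(text, n=15):
    """Return the n most common alphabetic words in text as (word, count) pairs."""
    # [a-z]+ keeps runs of ASCII letters only, so digits and punctuation
    # never become tokens.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text standing in for the downloaded page text.
sample = "The cat and the hat. The end, 0 of 10!"
for word, count in top_words(sample, n=3):
    print(count, word)
```

To apply it to the downloaded pages, pass `web_text` in place of the sample string.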