Today the World Wide Web is flooded with billions of static and dynamic web pages built with technologies such as HTML, PHP and ASP. The web is a great source of information and offers a rich playground for data mining. Because the data stored on the web comes in various formats and is dynamic in nature, it is a significant challenge to search, process and present the unstructured information available there.
The complexity of a web page far exceeds that of any conventional text document. Web pages lack uniformity and standardization, while traditional books and text documents are much more consistent. Further, search engines have limited capacity and cannot index every web page, which makes data mining extremely inefficient.
Moreover, the internet is a highly dynamic knowledge resource that grows at a rapid pace. Sports, news, finance and corporate sites update their pages on an hourly or daily basis. Today the web reaches millions of users with different profiles, interests and usage purposes. Every one of them needs good information, but many do not know how to retrieve relevant data efficiently and with the least effort.
It is important to note that only a small portion of the web holds really useful information. There are three usual methods that a user adopts when accessing information stored on the internet:
• Random surfing, i.e. following the many hyperlinks available on a web page.
• Query-based search on search engines, i.e. using Google or Yahoo to find relevant documents by entering specific keyword queries in the search box.
• Deep query searches, i.e. fetching results from a searchable database such as eBay.com's product search engine or Business.com's service directory.
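As a rough illustration of the first method, the hyperlinks a surfer (or a crawler) would follow can be collected from a page's HTML. This is only a minimal sketch using Python's standard library; the sample HTML, the example.com URLs and the names `LinkCollector` and `extract_links` are assumptions made up for the example, not part of the quoted article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all hyperlinks in `html` as absolute URLs, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content and URL, used only for illustration.
sample_html = (
    '<p><a href="/news">News</a> and '
    '<a href="http://example.org/sports">Sports</a></p>'
)
links = extract_links(sample_html, "http://example.com/index.html")
print(links)
```

A real crawler would also need to fetch each page, avoid revisiting links it has already seen, and respect each site's access rules. None of that is shown here.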
To use the web as an effective resource for knowledge discovery, researchers have developed efficient data mining techniques to extract relevant data easily, smoothly and cost-effectively.
Source: http://ezinearticles.com/?Basics-of-Web-Data-Mining-and-Challenges-in-Web-Data-Mining-Process&id=4937441
Challenges facing data scraping
It is very important to note that getting data through data scraping is not easy. It runs into quite a number of problems, including but not limited to the following:
Metadata: only a few datasets are explained thoroughly enough for a person to easily understand what they mean. It can therefore be very difficult for the web scraper to know what the web designer meant by some statements.
Scale: differences in the units of measure used to represent data can be a big challenge during data scraping. The sheer volume of the data, which can run to terabytes, can also be a problem for some file systems.
Complexity of the source: the web user needs an exact answer to a specific question, so if the source to be scraped is complicated and hard to comprehend, the data scraping process may fail because proper and accurate information cannot be extracted.
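To make the units-of-measure problem under "Scale" more concrete, one common remedy is to convert every scraped value to a single base unit before comparing or storing it. The sketch below does this for data-size strings. The function name `to_bytes`, the sample inputs and the choice of decimal multipliers (1 KB = 1000 bytes) are assumptions for illustration; some sources use binary multipliers instead.

```python
import re

# Decimal (SI) multipliers. This is an assumption: some sites mean
# binary units (1 KB = 1024 bytes) when they write the same labels.
UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

# A number (optionally with a decimal part) followed by a unit label.
SIZE_PATTERN = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*$")


def to_bytes(text):
    """Convert a scraped size string such as '1.5 GB' to a byte count."""
    match = SIZE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    factor = UNIT_FACTORS.get(unit.upper())
    if factor is None:
        raise ValueError(f"unknown unit: {unit!r}")
    return round(float(number) * factor)


# Hypothetical values as they might appear on different pages.
scraped = ["1.5 GB", "512 mb", "2TB"]
print([to_bytes(value) for value in scraped])
```

Raising an error on unknown input, rather than guessing, keeps badly described values from slipping silently into the dataset.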
Reference from: http://www.loginworks.com/blogs/web-scraping-blogs/data-scraping-considerations-challenges-benefits/
Thanks for the nice information. We at Web Parsing also provide web scraping and data mining services. If you have any requirement, visit: www.web-parsing.com