Friday, 30 December 2016

Data Mining and Financial Data Analysis

Introduction:

Most marketers understand the value of collecting financial data, but also recognize how hard it is to turn that knowledge into intelligent, proactive pathways back to the customer. Data mining, the technologies and techniques for recognizing and tracking patterns within data, helps businesses sift through layers of seemingly unrelated data for meaningful relationships, so that they can anticipate, rather than simply react to, customer needs as well as financial needs. This accessible introduction provides a business and technological overview of data mining and outlines how, along with sound business processes and complementary technologies, data mining can reinforce and redefine financial analysis.

Objective:

1. Discuss how customized data mining tools should be developed for financial data analysis.

2. Categorize usage patterns by purpose, according to the needs of financial analysis.

3. Develop a tool for financial analysis using data mining techniques.

Data mining:

Data mining is the process of extracting, or mining, knowledge from large quantities of data. It is also described as "knowledge mining from data" and is often treated as a synonym for Knowledge Discovery in Databases (KDD). In this broad sense, data mining covers data collection, database creation, data management, data analysis and understanding.

The process of knowledge discovery in databases consists of the following steps:

1. Data cleaning. (To remove noise and inconsistent data.)

2. Data integration. (Where multiple data sources may be combined.)

3. Data selection. (Where data relevant to the analysis task are retrieved from the database.)

4. Data transformation. (Where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)

5. Data mining. (An essential process where intelligent methods are applied in order to extract data patterns.)

6. Pattern evaluation. (To identify the truly interesting patterns representing knowledge, based on some interestingness measures.)

7. Knowledge presentation. (Where visualization and knowledge representation techniques are used to present the mined knowledge to the user.)
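
The first five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two branch data sets, the field names and the trivial "pattern" (balances above the mean) are all hypothetical and were chosen to keep the example short.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
branch_a = [{"customer": "c1", "balance": 1200.0},
            {"customer": "c2", "balance": None}]
branch_b = [{"customer": "c3", "balance": 300.0},
            {"customer": "c4", "balance": 9800.0}]


def clean(records):
    """Step 1: drop records with missing values."""
    return [r for r in records if r["balance"] is not None]


def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records):
    """Step 3: keep only the field relevant to the analysis."""
    return [r["balance"] for r in records]


def transform(balances):
    """Step 4: consolidate the data into summary form."""
    return {"count": len(balances), "mean": mean(balances), "max": max(balances)}


def mine(balances, summary):
    """Step 5: a deliberately trivial pattern, balances above the mean."""
    return [b for b in balances if b > summary["mean"]]


balances = select(integrate(clean(branch_a), clean(branch_b)))
summary = transform(balances)
pattern = mine(balances, summary)
print(summary, pattern)
```

Pattern evaluation and knowledge presentation (steps 6 and 7) would then decide whether such a pattern is interesting and how to show it to the user.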

Data Warehouse:

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema and which usually resides at a single site.

Text:

Most banks and financial institutions offer a wide variety of banking services, such as checking, savings, business and individual customer transactions, and credit and investment services like mutual funds. Some also offer insurance services and stock investment services.

Many types of analysis are available; here we focus on one of them, known as "evolution analysis".

Data evolution analysis is used for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, evolution analysis is carried out mainly through time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
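
One very simple form of time-series analysis is a moving average, which smooths a series so that its trend is easier to see. The sketch below is an assumption-laden illustration: the closing prices are made up, and the window length of 3 is an arbitrary choice rather than anything the article prescribes.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily closing prices for one stock.
closing_prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]

trend = moving_average(closing_prices, 3)
is_rising = trend[-1] > trend[0]  # crude direction check on the smoothed series
print(trend, is_rising)
```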

Data collected from the banking and financial sectors are often relatively complete, reliable and of high quality, which facilitates analysis and data mining. Here we discuss a few cases:

Example 1. Suppose we have stock market data for the last few years and would like to invest in the shares of the best companies. A data mining study of stock exchange data may identify stock evolution regularities for stocks overall and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to our decision making regarding stock investments.

Example 2. One may like to view debt and revenue changes by month, by region and by other factors, along with minimum, maximum, total, average and other statistical information. Data warehouses facilitate comparative analysis and outlier analysis, both of which play important roles in financial data analysis and mining.

Example 3. Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly influence loan payment performance and customer credit rating. Data mining may help identify the important factors and eliminate the irrelevant ones.

Factors related to the risk of loan payments include the term of the loan, debt ratio, payment-to-income ratio, credit history and many more. The bank can then decide which applicants' profiles show relatively low risk according to the critical factor analysis.
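
Two of the factors named above, the payment-to-income ratio and the debt ratio, are simple quotients. The following sketch shows how they might feed a first screening step. The applicant figures and both thresholds (0.35 and 0.5) are hypothetical values chosen for illustration; a real bank would derive its cut-offs from its own critical factor analysis.

```python
def payment_to_income(monthly_payment, monthly_income):
    """Share of monthly income consumed by the loan payment."""
    return monthly_payment / monthly_income


def debt_ratio(total_debt, total_assets):
    """Total debt as a fraction of total assets."""
    return total_debt / total_assets


def is_low_risk(applicant, max_pti=0.35, max_debt_ratio=0.5):
    """Screen an applicant against two illustrative (assumed) thresholds."""
    pti = payment_to_income(applicant["payment"], applicant["income"])
    dr = debt_ratio(applicant["debt"], applicant["assets"])
    return pti <= max_pti and dr <= max_debt_ratio


# Hypothetical applicants.
applicant_a = {"payment": 500.0, "income": 2000.0, "debt": 20000.0, "assets": 100000.0}
applicant_b = {"payment": 900.0, "income": 2000.0, "debt": 20000.0, "assets": 100000.0}

print(is_low_risk(applicant_a), is_low_risk(applicant_b))
```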

With financial analysis software, we can perform the task faster and create a more sophisticated presentation. These products condense complex data analyses into easy-to-understand graphic presentations. And there's a bonus: such software can vault our practice to a more advanced business consulting level and help us attract new clients.

To help us find a program that best fits our needs and our budget, we examined some of the leading packages that represent, by vendors' estimates, more than 90% of the market. Although all the packages are marketed as financial analysis software, they don't all perform every function needed for full-spectrum analyses. The right package should allow us to provide a unique service to clients.

The Products:

ACCPAC CFO (Comprehensive Financial Optimizer) is designed for small and medium-size enterprises and can help make business-planning decisions by modeling the impact of various options. This is accomplished by demonstrating the what-if outcomes of small changes. A roll forward feature prepares budgets or forecast reports in minutes. The program also generates a financial scorecard of key financial information and indicators.

Customized Financial Analysis by BizBench provides financial benchmarking to determine how a company compares to others in its industry by using the Risk Management Association (RMA) database. It also highlights key ratios that need improvement and year-to-year trend analysis. A unique function, Back Calculation, calculates the profit targets or the appropriate asset base to support existing sales and profitability. Its DuPont Model Analysis demonstrates how each ratio affects return on equity.

Financial Analysis CS reviews and compares a client's financial position with business peers or industry standards. It also can compare multiple locations of a single business to determine which are most profitable. Users who subscribe to the RMA option can integrate with Financial Analysis CS, which then lets them provide aggregated financial indicators of peers or industry standards, showing clients how their businesses compare.

iLumen regularly collects a client's financial information to provide ongoing analysis. It also provides benchmarking information, comparing the client's financial performance with industry peers. The system is Web-based and can monitor a client's performance on a monthly, quarterly and annual basis. The network can upload a trial balance file directly from any accounting software program and provide charts, graphs and ratios that demonstrate a company's performance for the period. Analysis tools are viewed through customized dashboards.

PlanGuru by New Horizon Technologies can generate client-ready integrated balance sheets, income statements and cash-flow statements. The program includes tools for analyzing data, making projections, forecasting and budgeting. It also supports multiple resulting scenarios. The system can calculate up to 21 financial ratios as well as the breakeven point. PlanGuru uses a spreadsheet-style interface and wizards that guide users through data entry. It can import from Excel, QuickBooks, Peachtree and plain text files. It comes in professional and consultant editions. An add-on, called the Business Analyzer, calculates benchmarks.

ProfitCents by Sageworks is Web-based, so it requires no software installation or updates. It integrates with QuickBooks, CCH, Caseware, Creative Solutions and Best Software applications. It also provides a wide variety of business analyses for nonprofits and sole proprietorships. The company offers free consulting, training and customer support. It's also available in Spanish.

ProfitSystem fx Profit Driver by CCH Tax and Accounting provides a wide range of financial diagnostics and analytics. It provides data in spreadsheet form and can calculate benchmarking against industry standards. The program can track up to 40 periods.
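
The DuPont analysis mentioned above for BizBench rests on a well-known identity: return on equity equals net profit margin times asset turnover times the equity multiplier. The sketch below works through that standard three-factor decomposition with made-up figures; it does not show how any of the listed products implements it.

```python
def dupont(net_income, revenue, total_assets, equity):
    """Decompose return on equity into the three standard DuPont factors."""
    margin = net_income / revenue        # net profit margin
    turnover = revenue / total_assets    # asset turnover
    multiplier = total_assets / equity   # equity multiplier (leverage)
    return {
        "margin": margin,
        "turnover": turnover,
        "multiplier": multiplier,
        "roe": margin * turnover * multiplier,
    }


# Hypothetical company figures, all in the same currency unit.
result = dupont(net_income=50.0, revenue=500.0, total_assets=250.0, equity=125.0)
print(result)
```

Because the three factors multiply out to net income divided by equity, a change in any single ratio shows directly how it moves return on equity.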

Source : http://ezinearticles.com/?Data-Mining-and-Financial-Data-Analysis&id=2752017

Saturday, 24 December 2016

Importance of Data Mining Services in Business

Data mining uses algorithms to uncover information hidden in data. It helps extract useful information from the data, which can then support practical interpretations for decision making.
Technically, it can be defined as the automated extraction of hidden information from large databases for predictive analysis. In other words, it is the retrieval of useful information from large masses of data, presented in an analyzed form for specific decision-making. Although data mining is a relatively new term, the technology is not. It is also known as knowledge discovery in databases, since it involves searching for implicit information in large databases.
It is used today primarily by companies with a strong customer focus: retail, financial, communication and marketing organizations. Its importance comes from its wide applicability. It is increasingly used in business applications for understanding and then predicting valuable information, such as consumer buying behavior and tendencies, customer profiles and industry analysis. It is used in applications such as market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, e-commerce, customer relationship management and financial services.

With some advanced technologies, it also serves as a decision-making tool. It is used in market research, industry research and competitor analysis. It has applications in major industries such as direct marketing, e-commerce, customer relationship management, scientific testing, genetics, financial services and utilities.

Data mining consists of five major elements:

    Extract and load operational data onto the data store system.
    Store and manage the data in a multidimensional database system.
    Provide data access to business analysts and information technology professionals.
    Analyze the data with application software.
    Present the data in a useful format, such as a graph or table.

Using data mining in business makes data more relevant to the application at hand. There are several kinds of data mining: text mining, web mining, relational database mining, graphic data mining, audio mining and video mining, all of which are used in business intelligence applications. Data mining software is used to analyze consumer data and trends in banking as well as in many other industries.

Outsourcing Web Research offers complete data mining services and solutions to quickly collect data and information from multiple Internet sources for your business needs in a cost-efficient manner.

Source : http://ezinearticles.com/?Importance-of-Data-Mining-Services-in-Business&id=2601221

Wednesday, 14 December 2016

Data Extraction Services For Better Outputs in Your Business

Data extraction can be defined as the process of retrieving data from an unstructured source in order to process it further or store it. It is very useful for large organizations that deal with large amounts of data on a daily basis, data that need to be processed into meaningful information and stored for later use. Data extraction is a systematic way to extract and structure data from scattered and semi-structured electronic documents, as found on the web and in various data warehouses.

In today's highly competitive business world, vital business information such as customer statistics, competitors' operational figures and inter-company sales figures plays an important role in making strategic decisions. By signing on with a data extraction service provider, you get access to critical data from various sources such as websites, databases, images and documents.

It can help you make strategic business decisions that shape your business goals. Whether you need customer information, insights into your competitors' operations or a picture of your own organization's performance, it is critical to have data at your fingertips as and when you want it. Your company may be overwhelmed with tons of data, and controlling it and converting it into useful information may prove a headache. Data extraction services enable you to get data quickly and in the right format.

A few areas where data extraction can help you are:

    Capturing financial data
    Generating better sales leads
    Conducting market research, survey and analysis
    Conducting product research and analysis
    Tracking, extracting and harvesting product pricing data
    Searching for specific job postings
    Duplicating an online database
    Acquiring real estate data
    Processing auction information
    Searching online newspapers for latest pricing information
    Extracting and summarizing news stories from online news sources

Outsourcing companies provide data extraction services tailored to the client's requirements. The different types of data extraction services are:

    Web extraction
    Database extraction

Outsourcing is a beneficial option for large organizations seeking to manage large amounts of information. Outsourcing these services helps businesses manage their data effectively, which in turn enables them to increase profits. By outsourcing, you can certainly increase your competitive edge and save costs too!

This article is courtesy of Web Scraping Expert, an executive at Outsourcing Web Research, which offers a high-quality, time-bound and comprehensive range of data extraction services at affordable rates. For more information, please visit us at: http://www.webscrapingexpert.com/ or send your requirements directly to: info@webscrapingexpert.com

Source:http://ezinearticles.com/?Data-Extraction-Services-For-Better-Outputs-in-Your-Business&id=2760257

Friday, 9 December 2016

Increasing Accessibility by Scraping Information From PDF

You may have heard of data scraping, a method by which computer programs extract data from output produced by another program. Put simply, it is the automatic sorting of information found in different resources, including the internet, whether in an HTML file, a PDF or any other document, together with the collection of the pertinent information. This information is then stored in databases or spreadsheets so that users can retrieve it later.

Most websites today have text that can be accessed easily in the source code. However, many businesses now choose to use Adobe PDF (Portable Document Format) files. This type of file can be viewed with the free Adobe Acrobat Reader software, which is available for almost every operating system. PDF files have many advantages. Among them is that a document looks exactly the same on whatever computer you view it, which makes the format ideal for business documents and specification sheets. Of course, there are disadvantages as well. One is that the text in the file may be stored as an image, in which case you will often have trouble copying and pasting it.

This is why some people turn to scraping information from PDFs. PDF scraping is just like data scraping, except that the information comes from your PDF files. To begin, you must choose a tool that is specifically designed for this process. However, it is not easy to find a tool that performs PDF scraping effectively, because most tools today have trouble extracting exactly the data you want without customization.

Nevertheless, if you search well enough, you will find the program you are looking for. Such programs do not require any programming knowledge: you specify your preferences and the software does the rest. There are also companies you can contact that will perform the task for you, since they have the right tools. Doing things manually is tedious and complicated, whereas professionals can finish the job in far less time. Scraping information from PDF is a process of collecting information that can be found on the internet, and the collection itself does not infringe copyright laws.

Source:http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Monday, 5 December 2016

Collecting Data With Web Scrapers

There is a large amount of data available only through websites. However, as many people have found out, trying to copy data into a usable database or spreadsheet directly out of a website can be a tiring process. Data entry from internet sources can quickly become cost prohibitive as the required hours add up. Clearly, an automated method for collating information from HTML-based sites can offer huge management cost savings.

Web scrapers are programs that are able to aggregate information from the internet. They are capable of navigating the web, assessing the contents of a site, and then pulling data points and placing them into a structured, working database or spreadsheet. Many companies and services use web scraping programs for tasks such as comparing prices, performing online research, or tracking changes to online content.

Let's take a look at how web scrapers can aid data collection and management for a variety of purposes.

Improving On Manual Entry Methods

Using a computer's copy and paste function or simply typing text from a site is extremely inefficient and costly. Web scrapers are able to navigate through a series of websites, make decisions on what is important data, and then copy the info into a structured database, spreadsheet, or other program. Software packages include the ability to record macros by having a user perform a routine once and then have the computer remember and automate those actions. Every user can effectively act as their own programmer to expand the capabilities to process websites. These applications can also interface with databases in order to automatically manage information as it is pulled from a website.
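
As a rough illustration of pulling data points from a page into a structured form, the sketch below uses only Python's standard html.parser module to turn an HTML table into a list of rows. The page content is a hypothetical stand-in for a downloaded catalogue page, and the TableScraper class is written for this example; it is not part of any scraping package mentioned in the article.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content.
page = """<table>
<tr><th>Product</th><th>Price</th></tr>
<tr><td>Widget</td><td>$9.99</td></tr>
<tr><td>Gadget</td><td>$24.50</td></tr>
</table>"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```

The resulting rows could then be written to a spreadsheet or database with the usual tools, for example the csv module.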

Aggregating Information

There are a number of instances where material stored in websites can be manipulated and stored. For example, a clothing company that is looking to bring their line of apparel to retailers can go online for the contact information of retailers in their area and then present that information to sales personnel to generate leads. Many businesses can perform market research on prices and product availability by analyzing online catalogues.

Data Management

Managing figures and numbers is best done through spreadsheets and databases; however, information on a website formatted with HTML is not readily accessible for such purposes. While websites are excellent for displaying facts and figures, they fall short when the data need to be analyzed, sorted, or otherwise manipulated. Ultimately, web scrapers are able to take the output that is intended for display to a person and change it to numbers that can be used by a computer. Furthermore, by automating this process with software applications and macros, entry costs are greatly reduced.

This type of data management is also effective at merging different information sources. If a company were to purchase research or statistical information, it could be scraped in order to format the information into a database. This is also highly effective at taking a legacy system's contents and incorporating them into today's systems.
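
The conversion from display text to usable numbers described in this section can be as small as the sketch below. It assumes amounts formatted with a currency symbol, comma thousands separators and a decimal point; other locales would need different handling, and the sample prices are made up.

```python
import re


def parse_amount(text):
    """Turn a displayed amount such as '$1,234.50' into a float."""
    cleaned = re.sub(r"[^\d.\-]", "", text)  # keep digits, dot and minus sign
    return float(cleaned)


# Hypothetical scraped price strings.
displayed = ["$24.50", "$9.99", "$1,234.50"]

amounts = [parse_amount(p) for p in displayed]
sorted_prices = sorted(displayed, key=parse_amount)  # numeric, not alphabetical
print(amounts, sorted_prices)
```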

Overall, a web scraper is a cost effective user tool for data manipulation and management.

source: http://ezinearticles.com/?Collecting-Data-With-Web-Scrapers&id=4223877

Wednesday, 30 November 2016

How Web Data Extraction Services Will Save Your Time and Money by Automatic Data Collection

Data scraping is the process of extracting data from the web using a software program, from proven websites only. Anyone can use the extracted data for any purpose they wish in various industries, as the web holds much of the world's important data. We provide some of the best web data extraction software. We have expertise and one-of-a-kind knowledge in web data extraction, image scraping, screen scraping, email extraction services, data mining and web grabbing.

Who can use Data Scraping Services?

Data scraping and extraction services can be used by any organization, company or firm that would like data from a particular industry, data on targeted customers or a particular company, or anything else available on the web, such as email IDs, website names or search terms. Most often, a marketing company uses data scraping and data extraction services to market a particular product in a certain industry and to reach targeted customers. For example, if company X would like to contact restaurants in a California city, our software can extract the data on restaurants in that city, and a marketing company can use this data to market its restaurant-related product. MLM and network marketing companies also use data extraction and data scraping services to find new customers: they extract data on prospective customers and contact them by telephone, postcard or email marketing, and in this way build a large network and a large group for their own product and company.

We have helped many companies find the particular data they need, for example in the following areas.

Web Data Extraction

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, toolkits that scrape web content were created. A web scraper is an API for extracting data from a web site. We help you create this kind of API so that you can scrape data according to your needs. We provide quality, affordable web data extraction applications.

Data Collection

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all. That's why the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end-user.

Email Extractor

A tool that automatically extracts email IDs from any reliable source is called an email extractor. It serves to collect business contacts from various web pages, HTML files, text files or any other format, without duplicate email IDs.
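
A minimal sketch of such an extractor, assuming plain text input and a deliberately simple address pattern (real-world address syntax is broader than this regular expression covers). The sample text and addresses are hypothetical.

```python
import re

# Simplified pattern: local part, "@", domain, dot, top-level domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the addresses found in `text`, lower-cased, without duplicates."""
    seen = set()
    result = []
    for match in EMAIL_PATTERN.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)  # keep first-seen order
    return result


# Hypothetical page text.
text = "Contact sales@example.com or Support@Example.com. Again: sales@example.com"
emails = extract_emails(text)
print(emails)
```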

Screen scraping

Screen scraping refers to the practice of reading text information from a computer display terminal's screen and collecting visual data from a source, instead of parsing data as in web scraping.

Data Mining Services

Data mining is the process of extracting patterns from information. It is becoming an increasingly important tool for transforming data into information. We deliver results in any format, including MS Excel, CSV, HTML and many others, according to your requirements.

Web spider

A Web spider is a computer program that browses the World Wide Web in a methodical, automated manner. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.
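
The methodical browsing a spider performs is usually a breadth-first walk over links. The sketch below shows that idea on a small in-memory "site" so that it runs without network access; the SITE dictionary, the paths and the fetch function are all hypothetical stand-ins for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site: path -> HTML content.
SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/">home</a>',
    "/b": '<a href="/c">C</a>',
    "/c": "no links here",
}


def crawl(start, fetch):
    """Visit every page reachable from `start`, breadth first, once each."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        visited.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/", lambda url: SITE.get(url, ""))
print(order)
```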

Web Grabber

Web grabber is just another name for data scraping or data extraction.

Web Bot

Web Bot is a software program that is claimed to be able to predict future events by tracking keywords entered on the Internet. Web bot software is the best program to pull out articles, blogs, relevant website content and many other kinds of website-related data. We have worked with many clients on data extraction, data scraping and data mining, and they are really happy with our services. We provide high-quality services and make your data work easy and automatic.

Source: http://ezinearticles.com/?How-Web-Data-Extraction-Services-Will-Save-Your-Time-and-Money-by-Automatic-Data-Collection&id=5159023

Saturday, 26 November 2016

How Xpath Plays Vital Role In Web Scraping Part 2

Here is a piece of content on XPaths, which is the follow-up to How Xpath Plays Vital Role In Web Scraping.

Let’s dive into a real-world example: scraping the Amazon website for information about the deals of the day. Deals of the day on Amazon can be found at this URL. Navigate to the Amazon deals of the day page in Firefox and find the XPath selectors. Right-click on the deal you like and select “Inspect Element with Firebug”:

If you look closely at the inspected element, you will find the source of the deal’s image and the name of the deal in the src and alt attributes respectively.

So now let’s write a generic XPath which gathers the name and image source of the product(deal).

  //img[@role="img"]/@src  ## for image source
  //img[@role="img"]/@alt  ## for product name

If you have an interest in Python and web scraping, you may have already played with the nice requests library to get the content of pages from the Web. Maybe you have toyed around with Scrapy selectors or lxml to make content extraction easier. In this post, I’ll show you some tips we found valuable when using XPath in the trenches, using both lxml and Scrapy selectors for HTML parsing.

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.

Here is why: the expression .//text() yields a collection of text elements, a node-set (collection of nodes). When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first element only.

from scrapy import Selector
html_code = """<a href="#">Click here to go to the <strong>Next Page</strong></a>"""
sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()   # Let's type this only once
print(xp('//a//text()'))                # Take a peek at the node-set
# [u'Click here to go to the ', u'Next Page']
print(xp('string(//a//text())'))        # Convert it to a string
# [u'Click here to go to the ']

Let’s do the same with lxml. You can run XPath with either lxml or Scrapy selectors, as the XPath expression is the same for both.

lxml code:

from lxml import html
html_code = """<a href="#">Click here to go to the <strong>Next Page</strong></a>"""
parsed_body = html.fromstring(html_code)          # Parse the text into a tree
print(parsed_body.xpath('//a//text()'))           # Take a peek at the node-set
# ['Click here to go to the ', 'Next Page']
print(parsed_body.xpath('string(//a//text())'))   # Convert it to a string
# Click here to go to the     (lxml returns a single string here, not a list)

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> xp('//a[1]')  # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

>>> xp('string(//a[1])')  # converts it to string
[u'Click here to go to the Next Page']
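
The same distinction can be seen without Scrapy or lxml. The sketch below swaps in the standard library's xml.etree.ElementTree (an assumption made only so the example has no third-party dependency): the element's text attribute holds just the first text node, while joining itertext() gives the text of the element plus all its descendants, which is what string() returns for a node.

```python
import xml.etree.ElementTree as ET

anchor = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
)

first_text_node = anchor.text                # only the first text node
string_value = "".join(anchor.itertext())    # the node plus all descendants

print(repr(first_text_node))
print(repr(string_value))
```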

Beware of the difference between //node[1] and (//node)[1]. //node[1] selects all the nodes occurring first under their respective parents, whereas (//node)[1] selects all the nodes in the document and then gets only the first of them.

from scrapy import Selector

html_code = """<ul class="list">
<li>1</li>
<li>2</li>
<li>3</li>
</ul>

<ul class="list">
<li>4</li>
<li>5</li>
<li>6</li>
</ul>"""

sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()

xp("//li[1]") # get all first LI elements under whatever it is its parent

[u'<li>1</li>', u'<li>4</li>']

xp("(//li)[1]") # get the first LI element in the whole document

[u'<li>1</li>']

xp("//ul/li[1]")  # get all first LI elements under an UL parent

[u'<li>1</li>', u'<li>4</li>']

xp("(//ul/li)[1]") # get the first LI element under an UL parent in the document

[u'<li>1</li>']

Also,

//a[starts-with(@href, '#')][1] gets a collection of the local anchors that occur first under their respective parents and (//a[starts-with(@href, '#')])[1] gets the first local anchor in the document.
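
For readers without Scrapy installed, the per-parent behaviour of a positional predicate can also be checked with the standard library. This sketch assumes xml.etree.ElementTree, which supports only a subset of XPath: it accepts .//li[1] but has no (…)[1] grouping, so the "first in the whole document" case is written as ordinary list indexing instead.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<div>"
    "<ul><li>1</li><li>2</li><li>3</li></ul>"
    "<ul><li>4</li><li>5</li><li>6</li></ul>"
    "</div>"
)

# Like //li[1]: every li that is the first li under its own parent.
first_under_each_parent = [li.text for li in root.findall(".//li[1]")]

# Like (//li)[1]: collect all li elements, then take the first one.
first_in_document = root.findall(".//li")[0].text

print(first_under_each_parent, first_in_document)
```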

When selecting by class, be as specific as necessary.

If you want to select elements by a CSS class, the XPath way to do the same job is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let’s cook up some examples:

>>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')

>>> xp = lambda x: sel.xpath(x).extract()

BAD: because there are multiple classes in the attribute

>>> xp("//*[@class='content']")

[]

BAD: gets more content than we need

>>> xp("//*[contains(@class,'content')]")

[u'<p class="content-author">Someone</p>',
 u'<p class="content text-wrap">Some content</p>']

GOOD:

>>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
[u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']

>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']
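
To see why the verbose expression works, here is the same logic written out in plain Python. The has_class helper is hypothetical and exists only for this illustration: it normalizes whitespace, pads the attribute value with a space on each side, and then looks for the class name surrounded by spaces, so only whole class tokens match.

```python
def has_class(class_attr, name):
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    normalized = " ".join(class_attr.split())   # what normalize-space() does
    padded = " " + normalized + " "             # what concat(' ', ..., ' ') does
    return (" " + name + " ") in padded         # whole-token containment test


print(has_class("content-author", "content"))     # substring only, not a token
print(has_class("content text-wrap", "content"))  # a real class token
```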

Learn to use all the different axes.

It is handy to know how to use the axes; you can follow through these examples.

In particular, you should note that following and following-sibling are not the same thing; this is a common source of confusion. The same goes for preceding and preceding-sibling, and also for ancestor and parent.

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents: 

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content of the script and style tags and also skips whitespace-only text nodes.
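
A rough standard-library counterpart of this trick is sketched below, assuming Python's html.parser rather than an XPath engine. The VisibleText class and the sample page are made up for the example; the class ignores anything inside script and style elements and drops whitespace-only text.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping script/style content and blank text."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than zero while inside script or style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data)


# Hypothetical page.
page = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello</p>\n  <p>World</p></body></html>"
)

extractor = VisibleText()
extractor.feed(page)
print(extractor.chunks)
```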

Tools & Libraries Used:

Firefox
Firefox inspect element with firebug
Scrapy : 1.1.1
Python : 2.7.12
Requests : 2.11.0

 Have questions? Comment below. Please share if you found this helpful.

Source: http://blog.datahut.co/how-xpath-plays-vital-role-in-web-scraping-part-2/

Monday, 24 October 2016

What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, use screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can put more load on the servers than they can handle and degrade the website's performance. Malicious hackers use this tactic in what's known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Thursday, 13 October 2016

How to use Web Content Extractor(WCE) as Email Scraper?

Web Content Extractor is a great web scraping tool developed by the Newprosoft team. The software has an easy-to-use project wizard for creating a scraping configuration and scraping data from websites.

One day I came across Visual Email Extractor, which is also a Newprosoft product and is similar to Web Content Extractor, but its primary use is to scrape email addresses by crawling the websites you feed to the scraper. I noticed that with a little modification to the Web Content Extractor project configuration, you can use it the same way as Visual Email Extractor to extract email addresses.

In this post I will show you the configuration that makes Web Content Extractor extract email addresses. I still recommend Visual Email Extractor, as it has a lot more features than extracting emails using WCE.

Here is the configuration that makes WCE extract emails.

Step 1: Open Web Content Extractor, create a new project and click Next.

Step 2: Under Crawling Rules -> Advanced Rules tab, apply the following settings.

Crawling Level 1 Settings

In the 'Follow links if link text equals' text box, enter the following values:

*contact*; *feedback*; *support*; *about*

In the 'Follow links if URL contains' text box, enter the following values:

contact; feedback; support; about

In the 'Do not follow links if URL contains' text box, enter the following values:

google.; yahoo.; bing; msn.; altavista.; myspace.com; youtube.com; googleusercontent.com; =http; .jpg; .gif; .png; .bmp; .exe; .zip; .pdf;

Set 'Maximum Crawling Depth' to 2.

Set 'Crawling Order' to Depth First Crawling.

Tick the check box below:

-> Follow all internal links

Crawling Level 2 Settings

Set the 'Follow links if link text equals' text box to the value below:

*contact*; *feedback*; *support*; *about*

Set the 'Follow links if URL contains' text box to the value below:

contact; feedback; support; about

Set the 'Do not follow links if URL contains' text box to the value below:

=http
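
The Level 1 rules above amount to a simple link filter. The sketch below expresses them in Python 3 to make the logic explicit. It is not how Web Content Extractor is implemented: the function name is made up, and the way the rules combine (block list first, then link text OR URL keyword) is an assumption.

```python
from fnmatch import fnmatch

TEXT_PATTERNS = ["*contact*", "*feedback*", "*support*", "*about*"]
URL_KEYWORDS = ["contact", "feedback", "support", "about"]
URL_BLOCKLIST = ["google.", "yahoo.", "bing", "msn.", "altavista.",
                 "myspace.com", "youtube.com", "googleusercontent.com",
                 "=http", ".jpg", ".gif", ".png", ".bmp", ".exe", ".zip", ".pdf"]


def should_follow(link_text, url):
    """Decide whether a crawler should follow a link (assumed rule order)."""
    text, url = link_text.lower(), url.lower()
    # 'Do not follow links if URL contains' wins over everything else.
    if any(blocked in url for blocked in URL_BLOCKLIST):
        return False
    # Follow when the link text matches a wildcard pattern
    # or the URL contains one of the keywords.
    return (any(fnmatch(text, pattern) for pattern in TEXT_PATTERNS)
            or any(keyword in url for keyword in URL_KEYWORDS))


print(should_follow("Contact Us", "http://example.com/contact.html"))  # True
print(should_follow("About", "http://example.com/about.pdf"))          # False
```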

Step 3: After applying the above settings, click Next. In the Extraction Pattern window, click Define, and in Web Page Address (URL) enter the URL of any page that contains an email address. Then click the + sign to the right of Data Fields to define the scraping pattern.

Now, inside HTML Structure, select the HTML check box or the Body check box, which means that for each page the whole page content will be parsed.

The last setting extracts emails from the page using a regular-expression-based email extraction function. Open the Predefined Script window, select 'Extract_Email_Addresses' and click OK. If the page you used contains an email address, you will see the harvested email in 'Script Result'.
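
For readers curious what a regular-expression-based email extraction looks like, here is a minimal Python 3 sketch. The pattern is a common simplified one and is an assumption; the actual expression used by 'Extract_Email_Addresses' is not documented here. The sample HTML is made up.

```python
import re

# Simplified pattern: local part, "@", domain, dot, top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(page_html):
    """Return unique email addresses in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(page_html):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen


sample = ('<body><a href="mailto:Sales@example.com">sales@example.com</a> '
          'or support@example.org.</body>')
print(extract_emails(sample))  # ['sales@example.com', 'support@example.org']
```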

I hope this helps you use Web Content Extractor as an email scraper. Share your views in the comments.

Wednesday, 21 September 2016

Things to take care while doing Web Scraping!!!

Web scraping has become one of the most popular terms in data science. Basically, web scraping is extracting information from websites using pre-written programs and web scraping scripts. Many organizations have successfully used website scraping to build relevant and useful databases that they use on a daily basis to advance their business interests. This is the age of Big Data, and web scraping is one of the trending techniques in data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have come across some great experiences we can learn from.  In this post, I’m going to discuss some of the approaches to take and approaches to avoid while executing web scraping.

Use Proxies: Anonymously scraping data from websites

One should not scrape a website from a single IP address. When you repeatedly request web pages for scraping, there is a chance that the remote web server will block your IP address, preventing further requests. To overcome this, scrape websites with the help of proxy servers (anonymous scraping). This minimizes the risk of getting trapped and blacklisted by a website. Proxies hide your identity (network details) from remote web servers while you scrape data. You may also use a VPN instead of proxies to scrape websites anonymously.

Take maximum data and store it.

Do not process each web page as it arrives from the remote server. Instead, take all the information and store it on disk. This approach is useful when your scraping algorithm breaks midway: you don't have to start scraping again. Never download the same content more than once, as that just wastes bandwidth. Try to download all content to disk in one go and then do the processing.
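
A minimal Python 3 sketch of this "download once, process later" idea follows. The function name, the cache folder name and the idea of passing in a fetch callable are all assumptions made for the example; any HTTP client can be plugged in.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="raw_pages"):
    """Return the raw body for url, downloading it at most once."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # File name derived from the URL so the same page maps to the same file.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = folder / name
    if path.exists():
        return path.read_bytes()     # already on disk: no second download
    body = fetch(url)                # fetch is any callable returning bytes
    path.write_bytes(body)
    return body
```

With the standard library, fetch could be something like `lambda u: urllib.request.urlopen(u).read()`. The parsing step then reads from disk and can be re-run as often as needed without touching the remote server.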

Follow strict rules in parsing:

Check various rules while parsing information from the website. For example, if you expect a value to be a date, check that it really is a date. This may greatly improve the quality of the information. When you get unexpected data, the algorithm needs to be changed accordingly.
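
As a small illustration of the date check, here is a Python 3 sketch. The function name and the expected format (YYYY-MM-DD) are assumptions; use whatever format the target site really uses.

```python
from datetime import datetime


def parse_date(value, fmt="%Y-%m-%d"):
    """Return a date if value matches fmt, otherwise None."""
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except ValueError:
        return None


print(parse_date("2016-10-24"))  # 2016-10-24
print(parse_date("N/A"))         # None
```

Rows where the result is None can be logged and reviewed instead of silently ending up in the dataset.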

Respect Robots.txt

Robots.txt specifies the set of rules that web crawlers and robots should follow. I strongly advise you to adjust your crawler to fully respect robots.txt. Robots.txt contains instructions on which pages you are allowed to crawl, which user agents the rules apply to, and the required intervals between page requests. Following these instructions minimizes the chance of being blacklisted or banned by the website owner.
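
Python 3 ships a robots.txt parser in the standard library (urllib.robotparser). The sketch below parses an inline, made-up robots.txt so it runs without network access; the user agent name "my-crawler" is also made up. In a real crawler you would call set_url() and read() on the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined for the example.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed_home = rp.can_fetch("my-crawler", "http://example.com/index.html")
allowed_private = rp.can_fetch("my-crawler", "http://example.com/private/a")
delay = rp.crawl_delay("my-crawler")   # seconds to wait between requests

print(allowed_home, allowed_private, delay)  # True False 5
```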

Use XPath Smartly

XPath is a nice option for selecting elements of an HTML document more flexibly than with CSS selectors. Be careful about HTML structure changing from page to page: an XPath that works on one page may fail to extract data on another because of differences in HTML structure.

Obey Website T&C:

Some websites make it absolutely clear in their terms and conditions that they are against web scraping of their content. Scraping such sites can expose you to ethical and legal consequences.

Test with a sample scrape and verify the data against the actual scrape

Once you are done setting up a web scraping project, you need to test it for some time. Check the extracted data. If something is not right, find the cause, make changes accordingly and repeat until the project works reliably.

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Sunday, 11 September 2016

How to Use Microsoft Excel as a Web Scraping Tool

Microsoft Excel is undoubtedly one of the most powerful tools to manage information in a structured form. The immense popularity of Excel is not without reasons. It is like the Swiss army knife of data with its great features and capabilities. Here is how Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will be using Excel web queries to make this happen.

Web queries are an Excel feature used to fetch data from a web page into an Excel worksheet easily. A web query can automatically find tables on the web page and lets you pick the particular table you need data from. Apart from extracting data from web pages, web queries can also be handy in situations where an ODBC connection is impossible to maintain. Let's see how web queries work and how you can scrape HTML tables off the web using them.

Getting started

We'll start with a simple web query to scrape data from the Yahoo! Finance page. This page is particularly easy to scrape and hence is a good fit for learning the method. The page is also pretty straightforward and doesn't have important information in the form of links or images. Here is the URL we will be using for the tutorial:

http://finance.yahoo.com/q/hp?s=GOOG

To create a new Web query:

1. Select the cell in which you want the data to appear.
2. Click on Data -> From Web.
3. The New Web Query box will pop up.
4. Enter the URL of the web page you need to extract data from in the Address bar and hit the Go button.
5. Click on the yellow-black buttons next to the table you need to extract data from.
6. After selecting the required tables, click on the Import button and you're done. Excel will now start downloading the content of the selected tables into your worksheet.
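
For readers who want to see what "finding tables on a page" involves outside Excel, here is a rough Python 3 sketch using only the standard library. It is not how Excel web queries work internally; the class name is made up, and the sample table and its values are placeholders, not real Yahoo! Finance data.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None    # cells of the row being read, or None
        self.cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row, self.cell = None, None


page = ("<table><tr><th>Date</th><th>Close</th></tr>"
        "<tr><td>2016-09-09</td><td>100.00</td></tr></table>")
table = TableRows()
table.feed(page)
table.close()
print(table.rows)  # [['Date', 'Close'], ['2016-09-09', '100.00']]
```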

Once you have the data scraped into your Excel worksheet, you can do a host of things like creating charts, sorting, formatting etc. to better understand or present the data in a simpler way.

Customizing the query

Once you have created a web query, you have the option to customize it according to your requirements. To do this, open the web query properties by right-clicking on a cell with the extracted data. The page you were querying appears again; click on the Options button to the right of the address bar. A new pop-up box will be displayed where you can customize how the web query interacts with the target page. The options here let you change some of the basic things related to web pages, like formatting and redirections.

Apart from this, you can also alter the data range options by right clicking on a random cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the required changes. You might want to rename the data range to something you can easily recognize like ‘Stock Prices’.

Auto refresh

Auto-refresh is a feature of web queries worth mentioning, and one which makes our Excel web scraper truly powerful. You can set the extracted data to auto-refresh so that your Excel worksheet re-fetches it at a regular interval and picks up changes on the source website. You can set how often the data should be updated from the source web page in the data range options menu. The auto-refresh feature can be enabled by ticking the box beside 'Refresh every' and setting your preferred time interval for updating the data.

Web scraping at scale

Although extracting data using Excel can be a great way to scrape HTML tables from the web, it is nowhere close to a real web scraping solution. It can prove useful if you are collecting data for your college research paper or you are a hobbyist looking for a cheap way to get your hands on some data. If data for business is your need, you will definitely have to depend on a web scraping provider with expertise in dealing with web scraping at scale. Outsourcing the complicated process that web scraping is will also give you more room to deal with other things that need extra attention, such as marketing your business.

Source: https://www.promptcloud.com/blog/how-to-use-excel-to-scrape-websites

Wednesday, 31 August 2016

Why is a Web scraping service better than Scraping tools

Web scraping has been making ripples across various industries in the last few years. Newer businesses can employ web scraping to gain quick market insights and equip themselves to take on their competitors. This works like clockwork if you know how to do the analysis right. Before we jump into that, there is the technical aspect of web scraping. Should your company use a scraping tool to get the required data from the web? Although this sounds like an easy solution, there is more to it than what meets the eye. We explain why it’s better to go with a dedicated web scraping service to cover your data acquisition needs rather than going by the scraping tool route.

Cost is lowered

Although this might come as a surprise, the cost of getting data by employing a data scraping tool along with IT personnel who can run it would exceed the cost of a good subscription-based web scraping service. Not every company has the resources needed to run web scraping in-house. By depending on a data service provider, you will save the cost of the software, resources and labour required to run web crawling in the firm. Besides, you will also end up having more time and fewer worries. More of your time and effort can therefore go into the analysis part, which is crucial to you as a business owner.

Accessibility is high with a service

Multifaceted websites make it difficult for scraping tools to extract data. A good web scraping service, on the other hand, can easily deal with bottlenecks in the scraping process when they arise. Websites to be scraped often undergo changes in their structure, which calls for modifying the crawler accordingly. Unlike a scraping tool, a dedicated service will be able to extract data from complex sites that use Ajax, JavaScript and the like. By going with a subscription-based service, you are doing yourself the favour of not being involved in this constant headache.

Accuracy in results

A DIY scraping tool might be able to get you data, but the accuracy and relevance of the acquired data will vary. You might be able to get it right with a particular website, but that might not be the case with another. This gives uncertainty to the results of your data acquisition and could even be disastrous for your business. On the other hand, a good scraping service will give you highly refined data which is in a ready to consume form.

Outcomes are instant with a service

Considering the high resource requirements of the web scraping process, your scraping tool is likely to be much slower than a reputed service that has got the right infrastructure and resources to scrape data from the web efficiently. It might not be feasible for your firm to acquire and manage the same setup since that could affect the focus of your business.

Tidying up of Data is an exhausting process

Web scrapers collect data into a dump file which would be huge in size. You will have to do a lot of tidying up in this to get data in a usable format. With the scraping tools route, you would be looking for more tools to clean up the data collected. This is a waste of time and effort that you could use in much better aspects of your business. Whereas with a web scraping service, you won’t have to worry about cleaning up of the data as it comes with the service. You get the data in a plug and use format which gives you more time to do better things.

Many sites have policies for data scraping

Sometimes, websites that you want to scrape data from might have policies discouraging the act. You wouldn’t want to act against their policies being ignorant of their existence and get into legal trouble. With a web scraping service, you don’t have to worry about these. A well-established data scraping provider will definitely follow the rules and policies set by the website. This would mean you can be relieved of such worries and go ahead with finding trends and ideas from the data that they provide.

More time to analyse the data

This is by far the best advantage of going with a scraping service rather than a tool. Since everything related to data acquisition is dealt with by the scraping service provider, you would have more time for analysing the data and deriving useful business decisions from it. As the business owner, analysing the data with care should be your highest priority. Since using a scraping tool to acquire data will cost you more time and effort, the analysis part is definitely going to suffer, which defeats your whole purpose.

Bottom line

It is up to you to choose between a web scraping tool and a dedicated scraping service. As the business owner, it is much better for you to stay away from the technical aspects of web scraping and focus on deriving a better business strategy from the data. When you have made up your mind to go with a data scraping service, it is important to choose the right web scraping service for maximum benefits.

Source: https://www.promptcloud.com/blog/web-scraping-services-better-than-scraping-tools

Wednesday, 24 August 2016

Business Intelligence & Data Warehousing in a Business Perspective

Business Intelligence

Business Intelligence has become a very important activity in the business arena irrespective of the domain due to the fact that managers need to analyze comprehensively in order to face the challenges.

Data sourcing, data analysing, extracting the correct information for a given criteria, assessing the risks and finally supporting the decision making process are the main components of BI.

In a business perspective, core stakeholders need to be well aware of all the above stages and be crystal clear on expectations. The person, who is being assigned with the role of Business Analyst (BA) for the BI initiative either from the BI solution providers' side or the company itself, needs to take the full responsibility on assuring that all the above steps are correctly being carried out, in a way that it would ultimately give the business the expected leverage. The management, who will be the users of the BI solution, and the business stakeholders, need to communicate with the BA correctly and elaborately on their expectations and help him throughout the process.

Data sourcing is an initial yet crucial step that has a direct impact on the system, because information has to be extracted from multiple sources of data. The data may be in text documents such as memos, reports and email messages; in formats such as photographs, images and sounds; or in more computer-oriented sources like databases, formatted tables, web pages and URL lists. The key to data sourcing is to obtain the information in electronic form. Therefore, scanners, digital cameras, database queries, web searches, computer file access etc. typically play significant roles. From a business perspective, emphasis should be placed on identifying the correct, relevant data sources, the granularity of the data to be extracted, whether the data can actually be extracted from the identified sources, and the confirmation that only correct and accurate data is extracted and passed on to the data analysis stage of the BI process.

Business-oriented stakeholders guided by the BA need to put a lot of thought into the analysis stage as well, which is the second phase. Synthesizing useful knowledge from collections of data should be done in an analytical way using in-depth business knowledge whilst estimating current trends, integrating and summarizing disparate information, validating models of understanding, and predicting missing information or future trends. This process of data analysis is also called data mining or knowledge discovery. Probability theory, statistical analysis methods, operational research and artificial intelligence are the tools to be used within this stage. Business-oriented stakeholders (including the BA) are not expected to be experts in all of these theoretical concepts and application methodologies, but they need to be able to guide the relevant resources in order to achieve the ultimate expectations of BI, which they know best.

Identifying the relevant criteria, conditions and parameters of report generation is solely based on business requirements, which need to be well communicated by the users and correctly captured by the BA. Ultimately, correct decision support will be facilitated through the BI initiative. It aims to provide warnings on important events, such as takeovers, market changes and poor staff performance, so that preventative steps can be taken. It seeks to help analyze and make better business decisions, to improve sales or customer satisfaction or staff morale. It presents the information that managers need, as and when they need it.

In a business sense, BI should go several steps forward bypassing the mere conventional reporting, which should explain "what has happened?" through baseline metrics. The value addition will be higher if it can produce descriptive metrics, which will explain "why has it happened?" and the value added to the business will be much higher if predictive metrics could be provided to explain "what will happen?" Therefore, when providing a BI solution, it is important to think in these additional value adding lines.

Data warehousing

In the context of BI, data warehousing (DW) is also a critical resource to be implemented to maximize the effectiveness of the BI process. BI and DW are two terms that go hand in hand. It has come to a level where a true BI system is ineffective without a powerful DW. To understand the reality behind this statement, it's important to have an insight into what DW really is.

A data warehouse is one large data store for the business concerned, holding an integrated, time-variant, non-volatile collection of data in support of management's decision-making process. It will mainly hold transactional data, which facilitates effective querying, analysis and report generation, which in turn gives the management the level of information required for decision making.

The reasons to have BI together with DW

At this point, it should be made clear why a BI tool is more effective with a powerful DW. To query, analyze and generate worthy reports, the systems should have information available. Importantly, transactional information such as sales data, human resources data etc. are available normally in different applications of the enterprise, which would obviously be physically held in different databases. Therefore, data is not at one particular place, hence making it very difficult to generate intelligent information.

The level of reports expected today, are not merely independent for each department, but managers today want to analyze data and relationships across the enterprise so that their BI process is effective. Therefore, having data coming from all the sources to one location in the form of a data warehouse is crucial for the success of the BI initiative. In a business viewpoint, this message should be passed and sold to the managements of enterprises so that they understand the value of the investment. Once invested, its gains could be achieved over several years, in turn marking a high ROI.

Investment costs for a DW in the short term may look quite high, but it's important to re-iterate that the gains are much higher and it will span over many years to come. It also reduces future development cost since with the DW any requested report or view could be easily facilitated. However, it is important to find the right business sponsor for the project. He or she needs to communicate regularly with executives to ensure that they understand the value of what's being built. Business sponsors need to be decisive, take an enterprise-wide perspective and have the authority to enforce their decisions.

Process

Implementation of a DW itself overlaps with some phases of the BI process explained above, and it's important to note that, from a process standpoint, DW falls into the first few phases of the entire BI initiative. Gaining highly valuable information out of the DW is the latter part of the BI process. This can be done in many ways. The DW can be used as the data repository of application servers that run decision support systems, management information systems, expert systems etc., and intelligent information can be obtained through them.

But one of the latest strategies is to build cubes out of the DW and allow users to analyze data in multiple dimensions, with powerful analytical support such as drilling down into granular levels of information. A cube is a concept that differs from the traditional relational two-dimensional tabular view: it has multiple dimensions, allowing a manager to analyze data based on multiple factors and not just two. It also allows the user to select whichever dimensions he or she wishes to use for analysis rather than being limited to one fixed view of the data, which is called slice & dice in DW terminology.
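
To make the cube idea concrete, here is a toy Python 3 sketch. The fact rows, dimension names and function names are all hypothetical and the figures are invented; real cubes are built by OLAP tooling on top of the DW, not by hand like this.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, month, store, sales amount).
facts = [
    ("Laptop", "2016-07", "North", 1200),
    ("Laptop", "2016-08", "North", 900),
    ("Laptop", "2016-07", "South", 700),
    ("Phone", "2016-07", "North", 400),
    ("Phone", "2016-08", "South", 650),
]
DIMENSIONS = ("product", "month", "store")


def roll_up(rows, by):
    """Total sales grouped by the chosen dimensions (the 'dice' part)."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)


def slice_rows(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value (the 'slice')."""
    position = DIMENSIONS.index(dimension)
    return [row for row in rows if row[position] == value]


print(roll_up(facts, by=("product",)))
# {('Laptop',): 2800, ('Phone',): 1050}
print(roll_up(slice_rows(facts, "month", "2016-07"), by=("store",)))
# {('North',): 1600, ('South',): 700}
```

The same fact rows answer different questions depending on which dimensions are chosen, which is the point of not being tied to one fixed view.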

BI for a serious enterprise is not just a phase of a computerization process, but it is one of the major strategies behind the entire organizational drivers. Therefore management should sit down and build up a BI strategy for the company and identify the information they require in each business direction within the enterprise. Given this, BA needs to analyze the organizational data sources in order to build up the most effective DW which would help the strategized BI process.

High level Ideas on Implementation

At the heart of the data warehousing process is the extract, transform, and load (ETL) process. Implementing it is a technical concern, but it is a business concern to make sure it is designed in such a way that it ultimately helps to satisfy the business requirements. This process is responsible for connecting to and extracting data from one or more transactional systems (source systems), transforming it according to the business rules defined through the business objectives, and loading it into the all-important data model. It is at this point that data quality should be established. Of the many responsibilities of the data warehouse, the ETL process represents a significant portion of all the moving parts of the warehousing process.

Creation of a powerful DW depends on the correctness of data modeling, which is the responsibility of the database architect of the project, but BA needs to play a pivotal role providing him with correct data sources, data requirements and most importantly business dimensions. Business Dimensional modeling is a special method used for DW projects and this normally should be carried out by the BA and from there onwards technical experts should take up the work. Dimensions are perspectives specific to a business that could be used for analysis purposes. As an example, for a sales database, the dimensions could include Product, Time, Store, etc. Obviously these dimensions differ from one business to another and hence for each DW initiative those dimensions should be correctly identified and that could be very well done by a person who has experience in the DW domain and understands the business as well, making it apparent that DW BA is the person responsible.

Each of the identified dimensions is turned into a dimension table at the implementation phase, and the objective of the ETL process explained above is to fill up these dimension tables, which in turn are taken to the level of the DW after some more database activities based on a strong underlying data model. Implementation details are not important for a business stakeholder, but being aware of the high-level process to this extent is important so that they are on the same page as the developers and can confirm that the developers are actually doing what they are supposed to do and will ultimately deliver what they are supposed to deliver.
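
As a rough illustration of ETL filling dimension tables, here is a small Python 3 sketch using the standard-library sqlite3 module with an in-memory database. The table names, column names, source rows and the clean-up rule are all hypothetical; a real warehouse would use dedicated ETL tooling and would normally model Time as its own dimension rather than a plain column.

```python
import sqlite3

# Hypothetical rows as they might arrive from a transactional sales system.
source_rows = [
    {"product": " laptop ", "store": "North", "sold_on": "2016-08-01", "amount": "1200.00"},
    {"product": "Phone", "store": "north", "sold_on": "2016-08-01", "amount": "400.50"},
    {"product": "Laptop", "store": "South", "sold_on": "2016-08-02", "amount": "700.00"},
]

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE dim_store   (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER,
                              sold_on TEXT, amount REAL);
""")


def dimension_id(table, name):
    """Insert the value into a dimension table if new and return its key."""
    db.execute("INSERT OR IGNORE INTO {} (name) VALUES (?)".format(table), (name,))
    row = db.execute("SELECT id FROM {} WHERE name = ?".format(table), (name,))
    return row.fetchone()[0]


for row in source_rows:                         # extract
    product = row["product"].strip().title()    # transform: normalise names
    store = row["store"].strip().title()
    amount = float(row["amount"])
    db.execute("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",   # load
               (dimension_id("dim_product", product),
                dimension_id("dim_store", store),
                row["sold_on"], amount))

report = db.execute("""
    SELECT p.name, s.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.id = f.product_id
    JOIN dim_store s ON s.id = f.store_id
    GROUP BY p.name, s.name
    ORDER BY p.name, s.name
""").fetchall()
print(report)
# [('Laptop', 'North', 1200.0), ('Laptop', 'South', 700.0), ('Phone', 'North', 400.5)]
```

Note how the transform step is where data quality is gained: " laptop " and "north" end up as the same dimension members as "Laptop" and "North".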

Security is also vital in this regard, since this entire effort deals with highly sensitive information. Which people have access rights to which information should be correctly identified and captured at the requirements analysis stage.

Advantages

A BI system has many advantages. More analytics can be presented directly to the customer or supply chain partner. Customer scores, customer campaigns and new product bundles can all be produced from analytic structures, resulting in high customer retention and the creation of unique products. Effective BI also brings more collaboration around information. Rather than middle managers getting great reports and making their own areas look good, information will be conveyed to other functions and rapidly shared to create collaborative decisions, increasing efficiency and accuracy. The return on human capital will be greatly increased.

Managers at all levels will save time on data analysis, and hence save money for the enterprise, as managers' time equals money from a financial perspective. Since powerful BI enables the internal processes of the enterprise to be monitored more closely and made more efficient, the overall success of the organization would grow. All of this helps to derive a high ROI on BI together with a strong DW. It is common to see very high ROI figures on such implementations, and it is also important to note that there are many non-measurable gains, whilst only the measurable gains are usually considered in the ROI calculation. However, at the stage where management buy-in for the BI initiative is being sought, it's important to convert the non-measurable gains into monetary values as much as possible; for example, the saving of a manager's time can be converted into a monetary value using his or her compensation.

The author has knowledge of both business and IT, having started a career as a software engineer before moving into the business analysis area of a premier US-based software company.

Source: http://ezinearticles.com/?Business-Intelligence-and-Data-Warehousing-in-a-Business-Perspective&id=35640

Friday, 12 August 2016

How to Scrape a Website into Excel without programming

This web scraping tutorial will teach you visually, step by step, how to scrape (extract or pull) data from websites into Excel using import.io (a free tool), without programming skills.

Personally, I use web scraping for analysing my competitors’ best-performing blog posts or content, such as which posts received the most comments or social media shares.

In this tutorial, we will scrape the following data from a blog:

    All blog post URLs.
    Author names for each post.
    Blog post titles.
    The number of social media shares each post received.

Then we will use the extracted data to determine which blog posts and authors are popular, and which posts received the most engagement from users through social media shares and on-page comments.

Let’s get started.

Step 1: Install the import.io app

The first step is to install the import.io app, a free web scraping tool and one of the best pieces of web scraping software. It is available for Windows, Mac and Linux. Import.io offers advanced data extraction features without coding by allowing you to create custom APIs or crawl entire websites.

After installation, you will need to sign up for an account. It is completely free, so don’t worry. I will not cover the installation process. Once everything is set up correctly, you will see something similar to the window below after your first login.

Step 2: Choose how to scrape data using the import.io extractor

With import.io you can extract data by creating custom APIs or by crawling entire websites. It comes equipped with different tools for data extraction, such as magic, extractor, crawler and connector.

In this tutorial, I will use a tool called “extractor” to create a custom API for our data extraction process.

To get started, click the red “new” button at the top right of the page and then click the “Start Extractor” button in the pop-up window.

After you click “Start Extractor”, the import.io app’s internal browser window will open as shown below.

Step 3: Data scraping process

Once the import.io browser is open, navigate to the blog URL you want to scrape data from, then turn on extraction. In this tutorial, I will use the blog bongo5.com for data extraction.

You can see from the window below that I have already navigated to www.bongo5.com but the extraction switch is still off.

Turn the extraction switch “ON” as shown in the window below and move to the next step.

Step 4: Training the “columns” or specifying the data we want to scrape

In this step, I will specify exactly what kind of data I want to scrape from the blog. In the import.io app, specifying the data you want to scrape is referred to as “training the columns”. Columns represent the data set I want to scrape (post titles, authors’ names and post URLs).

In order to understand this step, you need to know the difference between a blog page and a blog post. A page might have a single post or multiple posts, depending on the blog configuration.

A blog might have several blog posts, even hundreds or thousands. But I need only one session to train the “extractor” on the data I want to extract. I will do so using the import.io visual highlighter. Once data extraction is turned on, the highlighter appears by default.

I will do the training session for a single post on a blog page with multiple posts; the extractor will then extract the data automatically for the remaining posts on the “same” blog page.

Step 4a: Creating the “post_title” column

I will start by renaming “my_column” to the name of the data I want to scrape. Our goal in this tutorial is to scrape the blog post titles, post URLs and authors’ names, and to get social statistics later, so I will create columns for post titles, post URLs and authors’ names. Later on, I will teach you how to get social statistics for the post URLs.

After renaming “my_column” to “post_title”, point the mouse cursor at any of the post titles on the same blog page and the visual highlighter will automatically appear. Using the highlighter, I can select the data I want to extract.

You can see below that I selected one of the blog post titles on the page. The rectangular box with the orange border is the visual highlighter.

The app will ask you how the data is arranged on the page. Since there is more than one post on a single page, you have rows of repeating data. This blog has 25 posts per page, so you will select “many rows”. Sometimes you might have a single post on a page; in that case you need to select “Just one row”.

Source: http://nocodewebscraping.com/web-scraping-for-dummies-tutorial-with-import-io-without-coding/

Friday, 5 August 2016

Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.
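As a rough sketch of the more involved kind of data discovery described above, the Python fragment below prepares a login POST and an opener that keeps cookies between requests, using only the standard library. The URLs and the form field names (`username`, `password`) are hypothetical; a real site will define its own.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Build an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Prepare (but do not send) a form-encoded POST for a login page.

    The field names are assumptions made for this example.
    """
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("ascii"))


# Usage (not run here, since it needs a real site):
#   opener, jar = make_session()
#   opener.open(build_login_request("http://example.com/login", "alice", "secret"))
#   results_page = opener.open("http://example.com/search?q=widgets").read()
```

Because the same opener is reused for every request, cookies set at login are sent automatically when the search and details pages are fetched.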

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.
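To make the regular-expression approach concrete, here is a minimal Python sketch that pulls URLs and link titles out of a snippet of HTML. The sample markup and the pattern are illustrative assumptions; real pages are messier, and a pattern this simple will miss links that use single quotes or unusual attribute layouts.

```python
import re

# Matches <a ... href="URL" ...>TITLE</a>; deliberately simple, for illustration only.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*?href="(?P<url>[^"]+)"[^>]*>(?P<title>.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return (url, title) pairs for every anchor tag the pattern finds."""
    links = []
    for match in LINK_PATTERN.finditer(html):
        # Drop any nested tags (such as <b>) from the link title.
        title = re.sub(r"<[^>]+>", "", match.group("title")).strip()
        links.append((match.group("url"), title))
    return links


# Made-up sample page used only to exercise the function.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><a class="story" href="http://example.com/news/2"><b>Second</b> headline</a></li>
</ul>
"""

for url, title in extract_links(SAMPLE_HTML):
    print(url, "|", title)
```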

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.
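For the CSV case, a minimal Python sketch using the standard `csv` module might look like the following. The column names and the sample row are assumptions for the example; writing to a file instead of a string only requires passing an open file object to `csv.DictWriter`.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialise a list of dicts (one per scraped record) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped record; the comma in the title shows why a CSV
# library is safer than joining strings by hand.
scraped = [{"title": "A, B", "url": "http://example.com/1"}]
print(rows_to_csv(scraped, ["title", "url"]))
```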

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Tuesday, 2 August 2016

Tips for scraping business directories

Are you looking to scrape business directories to generate leads?

Here are a few tips for scraping business directories.

Web scraping is not rocket science, but there are good, bad and worse ways of doing it.

Generating sales-qualified leads is always a headache. The old-school way is to buy a list from sites like Data.com, but such lists are quite expensive.

Scraping business directories can help generate sales-qualified leads. The following tips can help you scrape data from business directories efficiently.

1) Choose a good framework for writing the web scrapers. This can save a lot of time and trouble. Python Scrapy is our favourite, but there are non-Python frameworks too.

2) Business directories may have anti-scraping mechanisms. In that case you may have to use an IP rotation service, which lets you crawl from multiple, changing IP addresses and so cover your tracks.

3) Some sites really don’t want you to scrape and will block the bot. In these cases, you may need to disguise your web scraper as a human being. Browser automation tools like Selenium can help you do this.

4) Websites update their data quite often, and the scraper bot should be able to update its data to reflect the changes. This is a hard task, and you may need professional services to do it.

One of the easiest ways to generate leads is to scrape business directories and then enrich the results. We made Leadintel for lead research and enrichment.

Source: http://blog.datahut.co/tips-for-scraping-business-directories/

Tuesday, 12 July 2016

Web Scraping Best Practices

Extracting data from the World Wide Web poses several challenges, as more webmasters work day and night to reduce scraping and crawling of their data in order to survive in a competitive world. There are various other problems you may face when web scraping, and most of them can be avoided by adopting the web scraping best practices discussed in this article.

Have knowledge of the scraping tools

By acquiring adequate knowledge of the hurdles that may be encountered during web scraping, you will be able to have a smooth web scraping experience and stay on the safe side of the law. Research thoroughly the types of tools you will use for scraping and crawling. Firsthand knowledge of these tools will help you find the data you need without being blocked.

Proper proxy software that acts as a middle party works well when you know your way around HTTP and HTML. Use tools that can vary crawling patterns, URLs and the data retrieved even when you are crawling a single domain. This will help you abide by the rules and regulations that come with web scraping activities and avoid legal issues.

Conduct your scraping activities during off-peak hours

You may opt to extract data at times when fewer people are using the site, for instance over weekends, late at night or on public holidays. Visiting a website several times to retrieve the same data is a waste of bandwidth. It is always advisable to download the entire site content to your computer so that you can access it whenever the need arises.

Hide your scraping activities

There is a thin line between ethical and unethical crawling, so you should avoid appearing on the top-user list of any particular website. Cover your tracks as best you can by making use of proxy IPs to avoid any legal problems. You may also use multiple IP addresses or VPN services to conceal your scraping activities and lower the chances of landing on a website’s blacklist.

Website owners today are very protective of their data and any other information existing under their unique URL. Read carefully through the terms and conditions stated by websites, as they may consider crawling an infringement of their privacy. Simple etiquette goes a long way. Your web scraping efforts will be fruitful if the site owner supports the idea of sharing data.

Keep a record of your activities

Web scraping involves large amounts of data. Because of this, you may not always remember every piece of information you have acquired; gathering statistics will help you monitor your activities.

Load data in phases

Web scraping demands a lot of patience when using crawlers to get the information you need. Take the process slowly by loading data one piece at a time. Several parallel requests to the same domain can crash the entire site or allow the scraping attempts to be traced back to your local machine.

Loading data in small bits will save you the hassle of scraping afresh if your activity is interrupted, because you will already have stored part of the data required. You can reduce the load on an individual domain through techniques such as caching the pages you have already scraped, which avoids redundant requests. Use auto-throttling mechanisms to limit the amount of traffic sent to the website, and pause between requests to avoid getting banned.
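The caching and pausing ideas above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the class name `PoliteFetcher`, the one-second default delay and the injected `fetch` function are all assumptions made for the example.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function with an in-memory cache and a pause between requests."""

    def __init__(self, fetch, delay_seconds=1.0, sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch          # callable taking a URL and returning the page body
        self._delay = delay_seconds  # minimum gap between two real requests
        self._sleep = sleep
        self._clock = clock
        self._cache = {}
        self._last_request = None

    def get(self, url):
        # Serve pages we already have without touching the website again.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until the minimum gap since the last request has passed.
        if self._last_request is not None:
            wait = self._delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

Passing the fetch function, sleep function and clock in as parameters keeps the sketch independent of any particular HTTP library and makes it easy to try out without hitting a real site.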

Conclusion

With these web scraping best practices you will be able to work with websites and gather the data your clients request without major hurdles along the way. The ultimate goal of every web scraper is to access vital information while remaining on the good side of the law.

Source URL : http://nocodewebscraping.com/web-scraping-best-practices/

Monday, 11 July 2016

How to Avoid the Most Common Traps in Web Scraping?

Many industries successfully use web scraping to build massive banks of relevant, actionable data that can be used on an everyday basis to further business interests and offer superior services to customers. However, web scraping has its own roadblocks and problems.

With automated scraping, you could face many common problems. Web scraping spiders or programs present a distinctive pattern of behavior to the websites they target, and websites use this behavior to distinguish human users from web scraping spiders. Based on those details, a website can employ certain web scraping traps to stop your efforts. Here are some of the most common traps:

How Can You Avoid These Traps?

Some measures you can use to make sure that you avoid common web scraping traps include:

• Begin by caching pages you have already crawled, so that you do not have to load them again.
• Find out whether the website you are trying to scrape has any particular objections to web scraping tools.
• Handle scraping in moderate phases, and take only the content required.
• Take things slowly and do not flood the website with many parallel requests, which put a strain on its resources.
• Try to minimize the load on every single website you visit to scrape.
• Use a superior web scraping tool that can save and test data, patterns and URLs.
• Use several IP addresses for your scraping efforts, or take advantage of VPN services and proxy servers. This helps decrease the danger of being trapped or blacklisted by a website.
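One common place a website states its preferences about automated tools is its robots.txt file. Treating robots.txt as the signal is an assumption here, since the list above does not name a mechanism. The Python sketch below checks a URL against robots.txt rules using the standard `urllib.robotparser` module; the sample rules and the user agent name are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content; a real scraper would download it from the site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(SAMPLE_ROBOTS, "my-scraper", "http://example.com/blog/post-1"))
print(allowed_by_robots(SAMPLE_ROBOTS, "my-scraper", "http://example.com/private/data"))
```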

Source URL :http://www.3idatascraping.com/category/web-data-scraping

Friday, 8 July 2016

Scraping the Royal Society membership list

To a data scientist any data is fair game. Through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file that I had made from the original PDF using an online converting service. Looking back at the code, it is fiendishly complicated and cluttered by the boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function, so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below. Each member (or Fellow) record spans two lines, with the member name in the leftmost column on the first line and, on the second line, their dates of birth and death, the class of their Fellowship and their election date.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an XML file in which each piece of text is located on the page using spatial coordinates. An individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward: you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.
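A minimal Python sketch of this column selection, using the standard `xml.etree.ElementTree` module, could look like the following. It is not the original scraper's code: the function name, the accepted range of 130 to 140 for the name column, and every sample line other than the Abbot record quoted above are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def texts_in_column(xml_string, left_min, left_max):
    """Return the text of every <text> element whose 'left' lies in the given range."""
    root = ET.fromstring(xml_string)
    return [
        "".join(element.itertext()).strip()
        for element in root.iter("text")
        if left_min <= int(element.get("left", "-1")) <= left_max
    ]


# The first <text> line comes from the article; the other two are placeholders.
SAMPLE_PAGE = """
<page number="1">
  <text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>
  <text top="259" left="170" width="180" height="14" font="2">(second-line details placeholder)</text>
  <text top="275" left="137" width="150" height="14" font="2">Placeholder, Name </text>
</page>
"""

# Accepting 130..140 tolerates the small shifts in column position.
print(texts_in_column(SAMPLE_PAGE, 130, 140))
```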

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197; a bit of Googling reveals that the real date is 16th March 1978. Another Fellow is classed as a “Felllow”. And whilst most of the dates of birth and death are separated by a dash, some are separated by an en dash, which as far as the code is concerned is something completely different, and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered as 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately, so someone may have a birth date described as “circa 1782” or “c 1782”; even more vaguely, they may be described as having “flourished 1663-1778” or “fl. 1663-1778”. Python’s default datetime module does not capture this subtlety, and if it did, the database used to store dates would need to support it too to be useful. I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than as text strings, means we can query the database using date-based queries.
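One possible way to handle such vague dates is sketched below in Python. This is not the scraper's actual code: the function name, the returned dictionary layout and the regular expressions are assumptions. It follows the approach described above of rendering a year as 1st January while keeping the original text, and it accepts either a hyphen or an en dash between years.

```python
import re
from datetime import date

CIRCA = re.compile(r"^(?:circa|c\.?)\s*(\d{4})$", re.IGNORECASE)
FLOURISHED = re.compile(
    r"^(?:flourished|fl\.?)\s*(\d{4})\s*[-\u2013]\s*(\d{4})$", re.IGNORECASE
)


def parse_vague_date(text):
    """Classify a vague date string and keep the original text alongside it."""
    cleaned = text.strip()
    match = CIRCA.match(cleaned)
    if match:
        # An approximate year is rendered as 1st January of that year.
        return {"kind": "circa", "start": date(int(match.group(1)), 1, 1),
                "end": None, "original": text}
    match = FLOURISHED.match(cleaned)
    if match:
        return {"kind": "flourished",
                "start": date(int(match.group(1)), 1, 1),
                "end": date(int(match.group(2)), 1, 1),
                "original": text}
    return {"kind": "unknown", "start": None, "end": None, "original": text}


print(parse_vague_date("circa 1782"))
print(parse_vague_date("fl. 1663-1778"))
```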

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you that the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 who “flourished” and 450 with birth dates “circa” some year.

I can count the number of classes of fellows:

select class, count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.
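To try queries like these locally, a small in-memory SQLite database can stand in for the ScraperWiki datastore. The Python sketch below is an assumption-laden illustration: only the table name and the `class` and `StartPresident` columns come from the queries above, while the `Name` column, the class values and the three placeholder rows are invented for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a few placeholder rows (not real Fellows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE RoyalSocietyFellows (Name TEXT, class TEXT, StartPresident TEXT)"
    )
    conn.executemany(
        "INSERT INTO RoyalSocietyFellows VALUES (?, ?, ?)",
        [
            ("Placeholder A", "Fellow", None),
            ("Placeholder B", "Fellow", "1703-01-01"),
            ("Placeholder C", "Royal", None),
        ],
    )
    return conn


def count_by_class(conn):
    """Count records in each class of Fellowship."""
    return conn.execute(
        "SELECT class, COUNT(*) FROM RoyalSocietyFellows GROUP BY class ORDER BY class"
    ).fetchall()


def presidents(conn):
    """List records that have a presidential start date, most recent first."""
    return conn.execute(
        "SELECT Name, StartPresident FROM RoyalSocietyFellows "
        "WHERE StartPresident IS NOT NULL ORDER BY StartPresident DESC"
    ).fetchall()


conn = build_demo_db()
print(count_by_class(conn))
print(presidents(conn))
```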

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source URL : http://yellowpagesdatascraping.blogspot.in/2015/06/scraping-royal-society-membership-list.html

Thursday, 30 June 2016

Web Data Extraction Services and Data Collection Form Website Pages

For any business, market research and surveys play a crucial role in strategic decision making. Web scraping and data extraction techniques help you find relevant information and data for your business or personal use. Most of the time, professionals manually copy and paste data from web pages or download a whole website, wasting time and effort.

Instead, consider using web scraping techniques that crawl through thousands of website pages to extract specific information and simultaneously save it into a database, CSV file, XML file or any other custom format for future reference.

Examples of web data extraction process include:
• Spider a government portal, extracting names of citizens for a survey
• Crawl competitor websites for product pricing and feature data
• Use web scraping to download images from a stock photography site for website design

Automated Data Collection
Web scraping also allows you to monitor changes in website data over a stipulated period and collect the data automatically on a scheduled basis. Automated data collection helps you discover market trends, determine user behavior and predict how data will change in the near future.

Examples of automated data collection include:
• Monitor price information for select stocks on an hourly basis
• Collect mortgage rates from various financial firms on a daily basis
• Check weather reports on a constant basis, as and when required
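Monitoring data for changes between scheduled runs can be sketched with a simple fingerprint comparison. The Python example below is only an illustration: the class name `ChangeMonitor`, the key names and the sample values are assumptions, and the scheduling itself (hourly, daily) would be handled by whatever job scheduler you already use.

```python
import hashlib


def fingerprint(content):
    """Return a stable hash of the collected text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class ChangeMonitor:
    """Remember the last fingerprint seen for each monitored item."""

    def __init__(self):
        self._seen = {}

    def record(self, key, content):
        """Store the new fingerprint; return True if it differs from the previous one.

        The first observation of a key counts as a change.
        """
        new = fingerprint(content)
        changed = self._seen.get(key) != new
        self._seen[key] = new
        return changed


# Hypothetical values collected on three successive scheduled runs.
monitor = ChangeMonitor()
for value in ["3.5%", "3.5%", "3.6%"]:
    print(value, monitor.record("mortgage_rate", value))
```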

Using web data extraction services, you can mine any data related to your business objective and download it into a spreadsheet so that it can be analyzed and compared with ease.

In this way you get accurate and quicker results, saving hundreds of man-hours and money!

With web data extraction services you can easily fetch product pricing information, sales leads, mailing databases, competitor data, profile data and much more on a consistent basis.

Source URL : http://ezinearticles.com/?Web-Data-Extraction-Services-and-Data-Collection-Form-Website-Pages&id=4860417

Thursday, 12 May 2016

Web scraping in under 60 seconds: the magic of import.io

This post was written by Rubén Moya, School of Data fellow in Mexico, and originally posted on Escuela de Datos.

Import.io is a very powerful and easy-to-use tool for data extraction that has the aim of getting data from any website in a structured way.
It is meant for non-programmers that need data (and for programmers who don’t want to overcomplicate their lives).

I almost forgot!! Apart from everything, it is also a free tool (o_O)

The purpose of this post is to teach you how to scrape a website and make a dataset and/or API in under 60 seconds. Are you ready?

It’s very simple. You just have to go to http://magic.import.io; post the URL of the site you want to scrape, and push the “GET DATA” button.
Yes! It is that simple! No plugins, downloads, previous knowledge or registration are necessary. You can do this from any browser; it even
works on tablets and smartphones.

For example: if we want to have a table with the information on all items related to Chewbacca on MercadoLibre (a Latin American version
of eBay), we just need to go to that site and make a search – then copy and paste the link (http://listado.mercadolibre.com.mx/chewbacca)
on Import.io, and push the “GET DATA” button.

You’ll notice that now you have all the information on a table, and all you need to do is remove the columns you don’t need. To do this, just
place the mouse pointer on top of the column you want to delete, and an “X” will appear.

Finally, it’s enough for you to click on “download” to get it in a csv file.
In our example, we have 373 pages with 48 articles each. So this option will be very useful for us.

Good news for those of us who are a bit more technically-oriented! There is a button that says “GET API” and this one is good to, well,
generate an API that will update the data on each request. For this you need to create an account (which is also free of cost).

As you saw, we can scrape any website in under 60 seconds, even if it includes tons of results pages. This truly is magic, no? For more
complex things that require logins, entering subwebs, automated searches, et cetera, there is downloadable import.io software… But I’ll
explain that in a different post.

Source : http://schoolofdata.org/2014/12/09/web-scraping-in-under-60-seconds-the-magic-of-import-io/

Friday, 29 April 2016

Exploring Web Data Extraction And Its Different Techniques

Web scraping, or web data extraction, is a software-based process for extracting information from different websites. Most business organizations depend on web resources for collecting crucial information for decision making. By analysing such data, they can identify existing market trends, details, prices and product specifications. Given how time-consuming manual data extraction is, automated data extraction techniques are growing in prominence.

Different data scraping techniques

Several data extraction techniques are available for businesses to extract useful information for successful operations. Some of them include:

    Logical extraction: This comprises logical extraction of data from the source system, either complete or incremental.
    Physical extraction: This technique involves two different mechanisms for web scraping, online and offline.
    HTTP programming: You can also extract data from both dynamic and static websites by applying socket programming, which allows you to post HTTP requests to remote web servers.
    Web scraping software: Several software tools are available in the market that serve your individual data extraction needs with ease. Such software automatically attempts to recognize the structure of the data on a page and extracts the content for further analysis.
    Web scraping tools: Besides reliable software, numerous user-friendly web scraping tools also help simplify the entire web scraping process.
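To illustrate the HTTP programming technique from the list above, here is a minimal Python sketch that builds a raw HTTP POST request and sends it over a plain socket. The host, path and form fields are hypothetical, the sketch covers plain HTTP on port 80 only (no TLS), and in practice a higher-level HTTP library is usually the more convenient choice.

```python
import socket
from urllib.parse import urlencode


def build_post_request(host, path, form_fields):
    """Return the raw bytes of a form-encoded HTTP/1.1 POST request."""
    body = urlencode(form_fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def send_request(host, request_bytes, port=80, timeout=10.0):
    """Send the request over a TCP socket and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request_bytes)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Building the request needs no network access; sending it does.
request = build_post_request("example.com", "/search", {"q": "data mining"})
print(request.decode("ascii"))
```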

Hire a website scraper

Hiring a suitable website scraper that offers website data extraction services for all your business requirements is an ideal option among these techniques. It provides you with filtered and reliable data for analysis according to your needs. Some of the major advantages of using website scraping services include:

    Automation of data.
    It can retrieve web pages of both static as well as dynamic websites.
    It is also capable of transforming the content into useful information.
    Provides reliable and accurate data.
    It also recognizes several semantic annotations.

Scraping service versus tools

Web scraping services are often preferred over tools and software. The basic reason for this preference is that service providers are comparatively cheaper than the tools, and they also maintain better accuracy and reliability of data.

Summary: It is advisable to look for suitable web data extraction services instead of tools or software. This helps in acquiring customized and structured data for your business in a legal manner.

Source : http://www.web-parsing.com/blog/exploring-web-data-extraction-and-its-different-techniques/

Thursday, 28 April 2016

Extensive Benefits of Data Mining Services to Marketing – Retail and Outreach Sectors…!!!

There is a vast ocean out there: an ocean of information on the internet, massive and brimming with data. It is constantly being updated, and its volume increases with each passing day. In fact, it is believed that around 90% of the total information generated in the last two years is now available on the internet.

Picking the right set of information from this heap of data is like searching for a needle in a haystack. It is next to impossible to search it manually. You need a powerful magnet in the form of a data mining service provider…!!!

Data mining services work like a magnet: they help you find the right kind of information in the huge databases available in the digital world. And with databases growing more mammoth every minute, the importance of partnering with a professional and reliable data mining company cannot be overlooked. Though loaded with a lot of negative connotations, data mining still reigns like a king! In fact, to truly appreciate the concept behind data mining, one needs to know it in its entirety.

Every coin has two sides: if there is a brighter side, there tends to be a dark side as well. Though the advantages of web extraction outweigh the disadvantages, it is always the dark underbelly that is highlighted and shown to the world. However, as wise men say, focus on the positive side. Let’s see what amazing advantages it can offer your business and how much you can gain from hiring a professional data mining service.

Upside or Advantage of Data Extraction Services:

While data mining is used primarily in business, it is interesting to know that the benefits of data mining go beyond and across boundaries; it helps various industries as well.

Marketing/Retailing

Data mining can prove extremely helpful to marketers and retailers who are looking for potential clients and aspire to maintain consumer satisfaction. It is one of the methods that allow businesses to know their potential clients better by acquiring their personal information and preferences.

Beyond that, data extraction helps in determining trends in goods and services by presenting an overview of online data. With adequate information, you can improve your goods and services, and change to or choose the ones that are more in demand. Consequently, data mining has made success in business quicker and easier these days.

Streamline Outreach

Outreach forms an integral part of any business, and to carry out outreach activities effectively one needs a huge database that can help marketers learn how to approach a particular set of customers. Information such as relevant e-mail addresses, mailing addresses or social media pages needs to be streamlined before any mailers go out to get the best results.

Data extraction makes this easier, since it gathers all the updated information and in the process saves you time and money.

As the saying goes, “the lotus flower grows in mud, but makes our world fragrant”. Data mining services are marred by criticism and controversy; however, their extensive advantages outweigh this negativity to a great extent.

Source : http://www.habiledata.com/blog/extensive-benefits-of-data-mining-services-to-marketing-retail-and-outreach-sectors/