While learning Talend, I decided to write a blog to help others who are also new to the tool.
I hope this post is a useful start.

In this post I show how I scraped a web page, saved the result as a CSV file, and loaded it into a database table.

Step 1) Writing a Python script to scrape the web page
url: http://finance.yahoo.com/q/cp?s=^DJI
I wanted to load the details from the above page into a Google SQL table, so I wrote a Python script that scrapes the page and saves the details to a CSV file.

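from bs4 import BeautifulSoup
from urllib.request import urlopen
import csv

# Download the page and parse it (Python 3; Python 2 used urllib.urlopen instead).
html = urlopen("http://finance.yahoo.com/q/cp?s=^DJI").read()
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'id': 'yfncsumtab'})
rows = table.find_all('tr')

# Skip the first five rows, then write one fully quoted CSV line per table row.
with open("F:/data/yahoofinance.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for tr in rows[5:]:
        cells = [td.get_text() for td in tr.find_all('td', attrs={'class': 'yfnc_tabledata1'})]
        if cells:  # ignore rows that contain no data cells
            writer.writerow(cells)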
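
The CSV-writing half of the script can be tried without touching the website. The sketch below uses made-up cell values (not real Yahoo data) and a hypothetical helper name, rows_to_csv, to show how Python's csv module wraps every field in double quotes and escapes any double quote that appears inside a cell.

```python
import csv
import io


def rows_to_csv(rows):
    """Return the rows as CSV text with every field wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Made-up sample cells, only for illustration.
sample = [["AAA", "Example Co.", "10.50"], ["BBB", 'Say "Hi" Inc.', "7.25"]]
print(rows_to_csv(sample))
```

An embedded double quote is written as two double quotes, which is the convention most CSV readers expect when the text enclosure is a double quote.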

Step 2) Running the above script from Talend Studio
To run the Python script, use the tSystem component.

[Screenshot: tsystem]

Specify the location of the Python script:
[Screenshot: script location]

Save this job as RunPythonScript.
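
Before wiring the script into tSystem, it can help to confirm that it runs cleanly as an external command, since the component launches it in much the same way. This is a minimal sketch, assuming Python 3.7 or later; run_script is a hypothetical helper name and is not part of Talend.

```python
import subprocess
import sys


def run_script(script_path):
    """Run a Python script as a child process.

    Returns a tuple of (exit_code, stdout_text, stderr_text).
    """
    result = subprocess.run(
        [sys.executable, script_path],  # same interpreter that runs this file
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr
```

Calling run_script with the path to the scraping script (for example a placeholder such as "F:/scripts/scrape.py") should give an exit code of 0 when the script succeeds; a non-zero code together with the stderr text usually points to the problem before Talend is involved at all.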

Step 3) Reading the CSV file and loading it into the database table
To load the CSV file into the table, first create the schema of the CSV file. Go to Metadata >> File delimited, create a new delimited file, and follow the wizard steps:

[Screenshots: csv schema, step1, step2, step3]
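
When filling in the delimited-file wizard, it is useful to know how many columns the file really has. The following sketch, with a hypothetical helper name describe_csv, counts rows and columns for a given separator and text enclosure; it assumes a comma separator and a double-quote enclosure, matching the file written in step 1.

```python
import csv
import io


def describe_csv(text, delimiter=",", quotechar='"'):
    """Return (row_count, sorted list of distinct column counts) for CSV text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    rows = [row for row in reader if row]  # blank lines parse as empty rows
    return len(rows), sorted({len(row) for row in rows})


# Made-up sample with a trailing comma on each line.
print(describe_csv('"a","b",\n"c","d",\n'))
```

A trailing comma at the end of a line shows up as one extra, empty column, so the schema in Talend needs a column for it or the file needs to be written without it. More than one distinct column count in the result suggests that some rows are malformed.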

Database:

[Screenshots: db_schema_step1, db_schema_step2, db_schema_step3]

Database schema:

[Screenshots: db_retrive_schema_1, db_retrive_schema_2, db_retrive_schema_3]
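
For readers who want to see what the load amounts to outside Talend, here is a rough sketch in Python. It is not the Talend job itself: it uses an in-memory SQLite database as a stand-in for the Google SQL table, and the table name and column names are hypothetical, chosen only for illustration.

```python
import csv
import io
import sqlite3


def load_csv(conn, csv_text, table="quotes", columns=("symbol", "name", "last_trade")):
    """Create the table if needed, insert the CSV rows, and return the row count.

    Table and column names are hypothetical. Rows with fewer fields than
    columns are skipped; extra fields are ignored.
    """
    col_defs = ", ".join(f"{c} TEXT" for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")
    rows = [
        row[:len(columns)]
        for row in csv.reader(io.StringIO(csv_text))
        if len(row) >= len(columns)
    ]
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


# Made-up data, loaded into a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, '"AAA","Example Co.","10.50"\n"BBB","Other Co.","7.25"\n')
print(loaded, conn.execute("SELECT * FROM quotes").fetchall())
```

In the Talend job the same two pieces exist as metadata: the delimited-file schema describes the CSV columns, and the retrieved database schema describes the target table.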
