This article introduces the process of converting a web page from one format to another. It discusses how to convert HTML to GRML. Two examples are provided. The first demonstrates how to extract hyperlinks from an HTML web page and convert them to GRML. The second demonstrates how to do this with images. These examples require server-side processing. Here, IIS, Active Server Pages (ASP), and Perl are used.
Background
Some experience with ASP and Perl is recommended. Perl's regular expression support is used to extract the hyperlinks and images from the web page. Any server-side scripting environment can do this, including .NET, CGI, or PHP; this article uses Perl and ASP.
While Perl is required, the server-side scripting language actually used is PerlScript. To use PerlScript, install a Perl interpreter. ActivePerl is one that works with IIS.
If not done already, read Introducing GRML and Using GRML. These articles provide explanations of what GRML is and how it is used.
In general, a converter is an object that converts the original interface of a component to another, or target, interface.
For the purposes of this article, a converter is server-side processing or scripting that converts one format to another.
This definition describes converting HTML to GRML using ASP. The converter object is the ASP scripting, the original interface is HTML, and the target interface is GRML.
Whatever is being converted, a converter needs two parts: an interface reader for the original interface and an interface writer for the target interface. An HTML to GRML converter therefore requires an HTML reader and a GRML writer.
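As a rough illustration of this reader/writer split, the sketch below composes a converter from two code references. The names `convert`, `$html_reader`, and `$grml_writer` are hypothetical and do not come from the article's scripts, and the reader is a deliberately simple regular expression that only handles double-quoted href attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: a converter is a reader composed with a writer.
sub convert {
    my ($input, $reader, $writer) = @_;
    my @items = $reader->($input);   # original interface -> plain items
    return $writer->(@items);        # plain items -> target interface
}

# Toy HTML reader (assumption): collects every double-quoted href value.
my $html_reader = sub {
    my ($html) = @_;
    return $html =~ /href="([^"]*)"/gi;
};

# Toy GRML writer (assumption): one <link> line per item.
my $grml_writer = sub {
    return join '', map { "<link>$_\n" } @_;
};

my $grml = convert('<a href="http://example.com/">Example</a>',
                   $html_reader, $grml_writer);
print $grml;
```

Keeping the two halves separate means the same reader could feed a different writer, or the same writer could accept items from a different reader.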
HTML does not describe much of its content. For example, there is no way to tell the attributes of one text block from those of another. However, not all HTML content lacks description: HTML has specific tags for hyperlinks and images.
Using a specific tag to identify content makes it possible to create a script that reads only those tags. When found, the unwanted tag elements are removed, leaving only the content. The script then writes the content in the new format or markup language. This is how to convert HTML hyperlinks or images to GRML.
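The read-and-strip step described above might look like the following for hyperlinks. This is a minimal sketch and not the article's script: `extract_hyperlinks` is a hypothetical helper, and a single regular expression like this copes only with simple, well-formed anchor tags, so a real page may call for a proper HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one { link => ..., title => ... } hash
# for each <a href=...>...</a> element found in $html.
sub extract_hyperlinks {
    my ($html) = @_;
    my @links;
    # Accepts quoted or unquoted href values, in any letter case.
    while ($html =~ m{<a\b[^>]*?\bhref\s*=\s*(["']?)([^"'\s>]+)\1[^>]*>(.*?)</a>}gis) {
        my ($link, $title) = ($2, $3);
        $title =~ s/<[^>]+>//g;      # remove nested tags such as <b> or <img>
        $title =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @links, { link => $link, title => $title };
    }
    return @links;
}

my $html = '<a href="http://example.com/a.html">First <b>link</b></a> '
         . '<A HREF=page2.html>Second</A>';
for my $item (extract_hyperlinks($html)) {
    print "$item->{link} => $item->{title}\n";
}
```

Only the link target and the visible text survive; everything else in the tag is discarded before the content is handed to a writer.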
The Hyperlink Converter
The <a href= tag allows an HTML web browser to identify which text is a hyperlink. This tag is the basis for the hyperlink converter, which extracts all hyperlinks from an HTML web page and converts them to GRML.
Below is an example of an HTML to GRML hyperlink converter written in PerlScript.
The code above creates an HTML form that takes a URL and extracts all the hyperlinks from that web page. The hyperlinks (and their titles) are formatted using GRML. Viewing GRML requires a GRML web browser, such as Pioneer Report MDI.
Most of the server-side scripting serves as the HTML reader. Only the following lines serve as the GRML writer:
$Response->Write("<column>\n");
$Response->Write("<Title>\n");
$Response->Write("<Request>\n");
$Response->Write("<link>\n");
$Response->Write("</column>\n");
$Response->Write("<result>\n");
$Response->Write("<link>$temp_link\n");
$Response->Write("<title>$temp_item\n");
$Response->Write("<request>$url\n");
$Response->Write("</result>\n");
Only the last three lines format the hyperlinks using GRML. The first two lines create the form in the browser window of a GRML web browser and do not use the converted HTML hyperlinks.
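For illustration, the writer lines listed above can be gathered into one function that returns the GRML as a string. This is a sketch under assumptions: `write_hyperlinks_grml` is a hypothetical name, the tag layout is copied from the listing, and emitting one `<result>` block per hyperlink is a guess about how the full script loops, which the listing alone does not confirm.

```perl
use strict;
use warnings;

# Hypothetical helper: builds GRML for a list of hyperlinks.
# $url is the page the links came from; each item is { link, title }.
sub write_hyperlinks_grml {
    my ($url, @links) = @_;
    # Column header block, copied from the article's writer lines.
    my $grml = "<column>\n<Title>\n<Request>\n<link>\n</column>\n";
    for my $item (@links) {
        # Assumption: one <result> block per hyperlink.
        $grml .= "<result>\n"
               . "<link>$item->{link}\n"
               . "<title>$item->{title}\n"
               . "<request>$url\n"
               . "</result>\n";
    }
    return $grml;
}

print write_hyperlinks_grml(
    'http://example.com/',
    { link => 'http://example.com/a.html', title => 'First link' },
);
```

Returning a string rather than calling `$Response->Write` directly keeps the writer usable outside ASP; inside an ASP page the result could be passed to `$Response->Write` in one call.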
To see the hyperlink converter in action, go to GRMLBrowser.com Hyperlink Converter. Alternatively, copy the converter script to a file and host it from a local web server. Once the web page is displayed, enter a URL and press the 'Submit' button. The page then displays all the hyperlinks extracted from the HTML web page, formatted in GRML.
After converting hyperlinks from HTML to GRML, this is how it appears in a GRML web browser (using Pioneer Report MDI).
The Image Converter
Using the <img src= tag, a script is able to find and extract images from HTML. By reading this tag and removing unwanted tag elements, the HTML images are converted to GRML. The following script demonstrates this.
<%@ Language=PerlScript%>
<center>
<form action=translate.asp>
URL to extract images: <input type=text name=url1 size=60>
<input type=submit>
</form>
</center>
<!--
<grml>
<edit url1>
<title>Enter URL:
</edit>
<%
use HTML::LinkExtor;
use URI::URL;
use LWP;
my ($url, $html);
# Parsing the Request
$url = $Request->QueryString("url1")->Item();
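if (!defined $url || $url eq "")
{
# No URL was submitted: close the GRML block and the HTML comment, then stop
$Response->Write("</GRML>\n-->\n");
$Response->End();
}
elsif ($url !~ m{^https?://}i)
{
# Default to http:// when the URL has no scheme
$url = "http://" . $url;
}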
$Response->Write("### URL ###\n\n");
$Response->Write("The url is: $url\n\n");
# Constructing the Request
# Retrieving the Response/Results
# - Filtering the Results (optional)
my $ua = LWP::UserAgent->new(agent => "my agent V1.00");
my $request = HTTP::Request->new('GET', $url);
my $response = $ua->request($request);
unless ($response->is_success)
{
# Report the failure through the ASP response, then stop processing
$Response->Write($response->error_as_HTML . "\n");
$Response->End();
}
my $res = $response->content(); # content without HTTP header
my @imgs = ();
my @hrefs = ();
# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
my $p = HTML::LinkExtor->new(\&callback);
$p->parse($res);
$Response->Write("<column>\n");
$Response->Write("<link>\n");
$Response->Write("<image>\n");
$Response->Write("</column>\n");
$Response->Write("<result>\n");
# Expand all image URLs to absolute ones
my $base = $response->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;
foreach (@imgs)
{
$Response->Write("<image>$_\n");
}
$Response->Write("\nLinks:\n");
foreach (@hrefs)
{
my $temp = $_;
if ($temp !~ /\Q$url\E/ && $temp !~ m{//})
{
$temp = $url . "\/" . $temp;
}
$Response->Write("<link>$temp\n");
}
sub callback
{
my($tag, %attr) = @_;
push(@imgs , values %attr) if $tag eq 'img';
push(@hrefs, values %attr) if $tag eq 'a';
}
%>
</result>
</GRML>
-->
The above script is used as an HTML reader, except for the following lines. These lines are the GRML writer.
$Response->Write("<column>\n");
$Response->Write("<link>\n");
$Response->Write("<image>\n");
$Response->Write("</column>\n");
$Response->Write("<result>\n");
$Response->Write("<link>$temp\n");
$Response->Write("<image>$_\n");
$Response->Write("</result>\n");
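The script above relies on HTML::LinkExtor to find the images. As a simpler illustration of the approach described at the start of this section, the sketch below swaps in a regular expression that looks for the <img src= tag directly, then wraps the results in the column and result layout from the writer lines. Both `extract_image_sources` and `write_images_grml` are hypothetical names. The sketch does not expand relative URLs to absolute ones and leaves links out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the src value of every <img> tag in $html.
sub extract_image_sources {
    my ($html) = @_;
    my @sources;
    # Accepts quoted or unquoted src values, in any letter case.
    while ($html =~ m{<img\b[^>]*?\bsrc\s*=\s*(["']?)([^"'\s>]+)\1}gi) {
        push @sources, $2;
    }
    return @sources;
}

# Hypothetical helper: wraps image URLs in one <result> block,
# following the layout of the article's writer lines.
sub write_images_grml {
    my (@sources) = @_;
    my $grml = "<column>\n<image>\n</column>\n<result>\n";
    $grml .= "<image>$_\n" for @sources;
    return $grml . "</result>\n";
}

my $page = '<img src="logo.gif" alt="Logo"><IMG alt=x SRC=/pics/a.png>';
print write_images_grml(extract_image_sources($page));
```

A parser module such as HTML::LinkExtor remains the more robust choice for real pages; the regular expression is shown only because it makes the read-and-strip step easy to follow.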
Once the image content has been converted to GRML, this is how it looks in a GRML web browser (using Pioneer Report MDI).
HTML was converted to GRML using server-side scripting. Only content with identifiable tags is convertible from one markup language to another. In the case of HTML, there are tags to identify hyperlinks and images.
The examples described show how to convert HTML hyperlinks or images to GRML. The converter consists of an HTML reader and a GRML writer. Using this converter, a web page viewed with an HTML web browser is viewable using a GRML web browser.
01/08/05. Updated formatting.
11/06/04. Changed adapt to convert.
10/26/04. Changed title from Adapting GRML to Adapting HTML.
10/08/04. Changed article code from GRML 1.2 to GRML 2.0.
10/08/04. Pioneer Report MDI 3.64 is no longer supported by this article.
09/02/04. Updated article text and images to support Pioneer Report MDI 3.64.