Get Web Page HTML Code and Convert it to PDF

Getting the HTML code of a web page is useful when you need to convert the page to PDF in a particular context or state. For example, you may already be authenticated in an ASP.NET application and want to convert a page that is accessible only to authenticated users, or you may want to convert an ASP.NET page after some values were filled in a form. In these situations, a possible solution is to get the HTML code as it is sent to the browser and convert it to PDF, optionally providing a base URL used to resolve images, CSS and script files.

This section presents three practical methods of getting the HTML code of a web page: using the HttpWebRequest class, overriding the Render method of an ASP.NET page, and calling the Server.Execute method from ASP.NET.
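
The role of the base URL can be illustrated without the converter. The sketch below uses only the standard System.Uri class to show how a relative reference found in the HTML code would be resolved against a base URL. The BaseUrlDemo class and the example.com addresses are hypothetical names used for illustration, and the converter's internal resolution is assumed to follow the same standard URI rules.

```csharp
using System;

public static class BaseUrlDemo
{
    // Resolves a relative resource reference (image, CSS or script file)
    // against an absolute base URL using the standard URI resolution rules.
    public static string Resolve(string baseUrl, string relativeReference)
    {
        Uri baseUri = new Uri(baseUrl, UriKind.Absolute);
        return new Uri(baseUri, relativeReference).AbsoluteUri;
    }
}
```

With the base URL http://www.example.com/app/Page.aspx, the reference images/logo.png resolves to http://www.example.com/app/images/logo.png, while the root-relative reference /styles/site.css resolves to http://www.example.com/styles/site.css.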

Using HttpWebRequest .NET Class to Get the HTML Code of a Web Page

The System.Net.HttpWebRequest class can be used to retrieve the HTML code of a web page. HTTP cookies and headers, authentication credentials, a proxy and other options can be set before accessing the web page. Below is a simple example of getting the HTML code of a web page and converting it to PDF.

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Web;
using HiQPdf;

protected void buttonGetHtmlCode_Click(object sender, EventArgs e)
{
    // the URL of the web page from where to retrieve the HTML code
    string url = textBoxUrl.Text;

    // create the HTTP request
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

    // use the default credentials of the current security context for this request
    request.Credentials = CredentialCache.DefaultCredentials;

    string htmlCode;

    // get the response and read the HTML code of the web page; the using
    // statements close the response and the reader even if reading fails
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream receiveStream = response.GetResponseStream())
    using (StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8))
    {
        // this example assumes the web page is UTF-8 encoded
        htmlCode = readStream.ReadToEnd();
    }

    // convert the HTML code to PDF

    // create the HTML to PDF converter
    HtmlToPdf htmlToPdfConverter = new HtmlToPdf();

    // the base URL used to resolve images, CSS and script files
    string baseUrl = url;

    // convert HTML code to a PDF memory buffer
    byte[] pdfBuffer = htmlToPdfConverter.ConvertHtmlToMemory(htmlCode, baseUrl);

    // inform the browser about the binary data format
    HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf");

    // let the browser know how to open the PDF document, attachment or inline, and the file name
    HttpContext.Current.Response.AddHeader("Content-Disposition", 
        String.Format("attachment; filename=HtmlToPdf.pdf; size={0}", pdfBuffer.Length.ToString()));

    // write the PDF buffer to HTTP response
    HttpContext.Current.Response.BinaryWrite(pdfBuffer);

    // call End() method of HTTP response to stop ASP.NET page processing
    HttpContext.Current.Response.End();
}
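
The example above reads the response as UTF-8. When the page may use another encoding, one option is to take the encoding from the charset parameter of the Content-Type response header. The following is a minimal sketch of that idea, not part of the HiQPdf API: the ResponseEncodingHelper class and its FromContentType method are hypothetical names, and the fallback to UTF-8 is an assumption.

```csharp
using System;
using System.Text;

public static class ResponseEncodingHelper
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // header value, or UTF-8 when the parameter is missing or unknown.
    public static Encoding FromContentType(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string trimmed = part.Trim();
                if (trimmed.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    // strip optional quotes around the charset name
                    string name = trimmed.Substring("charset=".Length).Trim('"', ' ');
                    try
                    {
                        return Encoding.GetEncoding(name);
                    }
                    catch (ArgumentException)
                    {
                        // unknown charset name, fall through to the default
                    }
                }
            }
        }
        return Encoding.UTF8;
    }
}
```

The result could then be passed to the StreamReader constructor in place of Encoding.UTF8, for example as ResponseEncodingHelper.FromContentType(response.ContentType).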

Overriding the Render Method to Get the HTML Code of the Current ASP.NET Page

The Page.Render(HtmlTextWriter) method of an ASP.NET page can be overridden to get the HTML code of the page as it would be sent to the browser. With this method it is even possible to capture the values entered in a web form and posted back to the ASP.NET page when a button in the page is pressed. Below is a simple example of getting the HTML code of the current ASP.NET page and converting it to PDF if a 'Convert to PDF' button was pressed.

using System;
using System.IO;
using System.Text;
using System.Web;
using System.Web.UI;
using HiQPdf;

namespace WebApplication
{
    public partial class GetHtmlCode : System.Web.UI.Page
    {
        bool convertToPdf = false;

        protected override void Render(HtmlTextWriter writer)
        {
            if (convertToPdf)
            {
                // setup a TextWriter to capture the current page HTML code
                TextWriter tw = new StringWriter();
                HtmlTextWriter htw = new HtmlTextWriter(tw);

                // render the HTML markup into the TextWriter
                base.Render(htw);

                // get the current page HTML code
                string htmlCode = tw.ToString();

                // convert the HTML code to PDF

                // create the HTML to PDF converter
                HtmlToPdf htmlToPdfConverter = new HtmlToPdf();

                // the base URL used to resolve images, CSS and script files
                string currentPageUrl = HttpContext.Current.Request.Url.AbsoluteUri;

                // convert HTML code to a PDF memory buffer
                byte[] pdfBuffer = htmlToPdfConverter.ConvertHtmlToMemory(htmlCode, currentPageUrl);

                // inform the browser about the binary data format
                HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf");

                // let the browser know how to open the PDF document, attachment or inline, and the file name
                HttpContext.Current.Response.AddHeader("Content-Disposition",
                    String.Format("attachment; filename=HtmlToPdf.pdf; size={0}", pdfBuffer.Length.ToString()));

                // write the PDF buffer to HTTP response
                HttpContext.Current.Response.BinaryWrite(pdfBuffer);

                // call End() method of HTTP response to stop ASP.NET page processing
                HttpContext.Current.Response.End();
            }
            else
            {
                base.Render(writer);
            }
        }

        protected void buttonConvertCurrentPageToPdf_Click(object sender, EventArgs e)
        {
            convertToPdf = true;
        }
    }
}

Calling the Server.Execute Method to Get the HTML Code of an ASP.NET Page

The HttpServerUtility.Execute(String, TextWriter) method can be called from an ASP.NET page to get the HTML code of another ASP.NET page in the same application. The page whose HTML code is retrieved is executed in the session of the calling ASP.NET page. Below is a simple example of getting the HTML code of an ASP.NET page and converting it to PDF when a 'Convert to PDF' button in the current page is pressed.

using System;
using System.IO;
using System.Web;
using HiQPdf;

protected void buttonConvertToPdf_Click(object sender, EventArgs e)
{
    // setup a TextWriter to capture the HTML code of the page to convert
    TextWriter tw = new StringWriter();
    // execute the 'AspNetPage.aspx' page in the same application and capture the HTML code
    Server.Execute("AspNetPage.aspx", tw);
    // get the HTML code from writer
    string htmlCode = tw.ToString();

    // convert the HTML code to PDF

    // create the HTML to PDF converter
    HtmlToPdf htmlToPdfConverter = new HtmlToPdf();

    // the base URL used to resolve images, CSS and script files
    string baseUrl = HttpContext.Current.Request.Url.AbsoluteUri;

    // convert HTML code to a PDF memory buffer
    byte[] pdfBuffer = htmlToPdfConverter.ConvertHtmlToMemory(htmlCode, baseUrl);

    // inform the browser about the binary data format
    HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf");

    // let the browser know how to open the PDF document, attachment or inline, and the file name
    HttpContext.Current.Response.AddHeader("Content-Disposition",
        String.Format("attachment; filename=HtmlToPdf.pdf; size={0}", pdfBuffer.Length.ToString()));

    // write the PDF buffer to HTTP response
    HttpContext.Current.Response.BinaryWrite(pdfBuffer);

    // call End() method of HTTP response to stop ASP.NET page processing
    HttpContext.Current.Response.End();
}
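
All three examples send the PDF as an attachment, which normally makes the browser offer it for download. As the code comments note, the document can also be opened inline. The sketch below builds the Content-Disposition header value for either case in the same format the examples use. The PdfResponseHeaders class is a hypothetical helper, not part of HiQPdf, and it assumes a file name without spaces or special characters that would require quoting.

```csharp
using System;

public static class PdfResponseHeaders
{
    // Builds a Content-Disposition header value in the format used by the
    // examples above; openInline selects "inline" instead of "attachment".
    public static string ContentDisposition(string fileName, int size, bool openInline)
    {
        string dispositionType = openInline ? "inline" : "attachment";
        return String.Format("{0}; filename={1}; size={2}", dispositionType, fileName, size);
    }
}
```

For example, PdfResponseHeaders.ContentDisposition("HtmlToPdf.pdf", pdfBuffer.Length, true) could be passed as the header value to ask the browser to display the PDF in its viewer instead of downloading it.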