Error reading page html with Html Agility Pack

Question :

I’m reading an HTML page using the Html Agility Pack. When I run the code on my notebook it works perfectly. The problem appears when I run the same code on Windows Phone 7.1.

Accented characters (such as ç) come back encoded. The strangest part is that the same code downloads two pages, both containing accented words, yet only one of them fails to return the text as it appears on the page.

Code to load the file

    CustomEncoding enc = new CustomEncoding();
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.OptionDefaultStreamEncoding = enc; //CustomEncoding.Unicode;

Code to run download

    HtmlWeb web = new HtmlWeb();
    //CustomEncoding is "iso-8859-2"
    web.LoadCompleted += web_LoadCompleted;
    web.LoadAsync(_filme.Detalhes, enc);// GetEncoding("iso-8859-1"));

I use the InnerHtml property to retrieve the text.

    void web_LoadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
        HtmlDocument document = new HtmlDocument();
        document.OptionDefaultStreamEncoding = CustomEncoding.Unicode;// System.Text.Encoding.UTF8;
        HtmlNode html = GetNodeByName(e.Document.DocumentNode, "html");
        HtmlNode body = GetNodeByName(html, "body");
        HtmlNode allOut = GetNodeById(body, "all-out");
        HtmlNode allIn = GetNodeById(allOut, "all-in");
        HtmlNode content = GetNodeById(allIn, "content");
        HtmlNode lojas = GetNodeById(content, "lojas");
        HtmlNode leftSideMovie = GetNodeById(lojas, "left-side-movie");
        HtmlNode infoLoja = GetNodeById(leftSideMovie, "info-loja");

        HtmlNode censuraNode = GetNodeById(infoLoja, "censura-3d-leg-dub");
        HtmlNode sinopseNode = GetNodeById(infoLoja, "sinopse");
        HtmlNode marmota = GetNodeByNameAndClass(sinopseNode, "div", "margin_20b");
        HtmlNode preSinopseNode = marmota;

        // walk down through the children until the last (innermost) marmota is found
        while (marmota != null)
        {
            preSinopseNode = marmota;
            marmota = GetNodeByNameAndClass(marmota, "div", "margin_20b");
        }

        string sinopse;
        try
        {
            //TODO: remove the try and refactor so that each method call is stored in a variable
            // try the span first
            _filme.Descricao = GetNodeByName(GetNodeByName(preSinopseNode, "p"), "span").InnerHtml;
        }
        catch (Exception ex)
        {
            // the span lookup failed: fall back to the paragraph's own InnerHtml
            _filme.Descricao = GetNodeByName(preSinopseNode, "p").InnerHtml;
        }
    }

Page used

I will add just one of the methods, as they are all very similar.

    private HtmlNode GetNodeByName(HtmlNode root, string node)
    {
        // search only the direct children of root for the first node with the given tag name
        foreach (HtmlNode link in root.ChildNodes)
        {
            if (link.Name.Equals(node))
                return link;
        }
        return null;
    }
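
The other helpers are not shown here. Purely as an assumption about what "very similar" means, a `GetNodeById` could follow the same pattern, comparing the `id` attribute instead of the tag name. The class name `NodeLookup` and the null check are illustrative additions, not the original code:

```csharp
using HtmlAgilityPack;

// Hypothetical sketch: the real GetNodeById is not shown in the question.
static class NodeLookup
{
    // Returns the first direct child of root whose id attribute equals id,
    // or null when root is null or no child matches.
    public static HtmlNode GetNodeById(HtmlNode root, string id)
    {
        if (root == null)
            return null;

        foreach (HtmlNode child in root.ChildNodes)
        {
            // GetAttributeValue returns the supplied default when the attribute is absent
            if (child.GetAttributeValue("id", string.Empty).Equals(id))
                return child;
        }
        return null;
    }
}
```

Like `GetNodeByName`, this looks only at direct children, so each level of the page has to be walked explicitly.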

One of the links that talks about CustomEncoding


Answer :

Once again my theory is correct.
When you cannot find a solution, it is probably right in front of your nose.

This was the solution, without having to configure the encoding.



I’ve been writing a simple HTML and CSV reader, where I ran into the same problem.
I solved it by replacing UTF8 with iso-8859-1, so I suggest changing your encoding as follows:

Code to load the file

    HtmlDocument document = new HtmlDocument();
    document.OptionDefaultStreamEncoding = Encoding.GetEncoding("iso-8859-1");

Code to download

    HtmlWeb web = new HtmlWeb();
    web.LoadCompleted += web_LoadCompleted;
    web.LoadAsync(_filme.Detalhes, Encoding.GetEncoding("iso-8859-1"));
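
To see why the choice of encoding matters, here is a minimal sketch. It assumes the page is served as ISO-8859-1, which the question does not confirm. `EncodingDemo` and its method names are made up for illustration; only `System.Text.Encoding` is used.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Html Agility Pack.
static class EncodingDemo
{
    // Encode text into the bytes a server would send for the given charset.
    public static byte[] Encode(string text, string charset)
    {
        return Encoding.GetEncoding(charset).GetBytes(text);
    }

    // Decode received bytes using the charset the client assumes.
    public static string Decode(byte[] bytes, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(bytes);
    }
}
```

For example, `EncodingDemo.Decode(EncodingDemo.Encode("ação", "iso-8859-1"), "iso-8859-1")` gives back the original string. Decoding the same bytes as "utf-8" does not, because the byte 0xE7 (ç in ISO-8859-1) is not a valid UTF-8 sequence on its own.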

More information about Encoding can be found at the link.

