Using Go for Data Science: A Fresh Perspective

Feb 27

In the ever-evolving landscape of data science, Python has long been the reigning champion, largely due to its simplicity and the vast array of libraries available. However, the tide is slowly turning, and other languages are making their mark in the data science realm. One such contender is Go. Go is gaining popularity for its efficiency, performance, and ease of use. In this blog post, we'll explore how Go can be utilized in data science, specifically focusing on a common task: cleaning HTML pages of unwanted tags and elements.

Why Go for Data Science?

Before diving into the example, let's discuss why one might consider Go for data science tasks. Go offers several advantages:

Performance: Go compiles to native machine code, so programs run without an interpreter. For processing logic written in the language itself, this often means faster execution than an interpreted language like Python, although Python libraries that delegate to compiled code narrow the gap.
Concurrency: Go was designed with concurrency in mind, built around goroutines and channels. This makes it well suited to spreading work on large datasets across CPU cores, a common requirement in data science.
Simplicity: Go's syntax is clean and concise, making it easy to learn and write. This simplicity reduces the cognitive load on the programmer, allowing them to focus on solving data science problems rather than wrestling with complex syntax.
Robust Standard Library: Go's standard library is extensive, covering a wide range of needs, including HTTP server and client, JSON encoding and decoding, and more. For tasks not covered by the standard library, there's a growing ecosystem of third-party libraries.
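
To make the concurrency point concrete, here is a minimal sketch that uses only the standard library. The countWords helper and the sample strings are hypothetical stand-ins for a real per-record processing step; each document is handled in its own goroutine and the counts are collected over a channel.

```go
package main

import (
    "fmt"
    "strings"
    "sync"
)

// countWords is a hypothetical stand-in for any per-record processing step.
func countWords(text string) int {
    return len(strings.Fields(text))
}

// totalWords counts the words of every document concurrently and sums the results.
func totalWords(docs []string) int {
    // Buffered with one slot per document so no sender ever blocks.
    counts := make(chan int, len(docs))
    var wg sync.WaitGroup

    for _, doc := range docs {
        wg.Add(1)
        go func(d string) {
            defer wg.Done()
            counts <- countWords(d)
        }(doc)
    }

    // All sends are done once Wait returns, so closing is safe.
    wg.Wait()
    close(counts)

    total := 0
    for c := range counts {
        total += c
    }
    return total
}

func main() {
    docs := []string{"go is compiled", "goroutines are cheap", "channels carry values"}
    fmt.Println(totalWords(docs)) // prints 9
}
```

Summing is order-independent, so it does not matter which goroutine finishes first.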

Cleaning HTML with Go

Now, let's tackle a practical data science task with Go: cleaning an HTML page of unwanted elements. This is particularly useful when scraping web data for analysis, because it lets us keep only the content we're interested in.

For this example, we'll use the goquery package, which allows Go to work with HTML documents in a manner similar to jQuery, making it easier to manipulate and query HTML elements.

Step 1: Installing goquery

First, install the goquery package. From inside a Go module (run go mod init first if your project does not have a go.mod file yet), run:

go get github.com/PuerkitoBio/goquery

Step 2: Writing the Code

Here's a simple program that takes a couple of in-memory HTML strings, removes unwanted elements (here, the <script> and <style> tags) from each one concurrently, and prints the cleaned contents of each <body>.

package main

import (
    "bytes"
    "fmt"
    "log"
    "strings"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

// cleanHTML removes <script> and <style> tags from the given HTML string.
func cleanHTML(html string, wg *sync.WaitGroup, cleanedHtmls chan<- string) {
    defer wg.Done()
    
    // Load the HTML document
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    // Remove <script> and <style> elements
    doc.Find("script, style").Remove()

    // Render the contents of <body> back to HTML
    var buf bytes.Buffer
    doc.Find("body").Each(func(i int, s *goquery.Selection) {
        inner, err := s.Html()
        if err != nil {
            log.Fatal(err)
        }
        buf.WriteString(inner)
    })

    // Send the cleaned HTML to the channel
    cleanedHtmls <- buf.String()
}

func main() {
    var wg sync.WaitGroup
    htmlStrings := []string{
        `<html><head><style>body {background-color: #fff;}</style></head><body><h1>Document 1</h1><script>alert('Hello, World!');</script></body></html>`,
        `<html><head><style>body {background-color: #eee;}</style></head><body><h1>Document 2</h1><script>alert('Hello, World!');</script></body></html>`,
    }

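    // Buffered with one slot per input so every goroutine can send
    // its result without waiting for a receiver
    cleanedHtmls := make(chan string, len(htmlStrings))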

    // Dispatch a goroutine for each HTML string
    for _, html := range htmlStrings {
        wg.Add(1)
        go cleanHTML(html, &wg, cleanedHtmls)
    }

    // Wait for all goroutines to finish
    wg.Wait()
    close(cleanedHtmls)

    // Print the cleaned HTMLs
    for cleanedHtml := range cleanedHtmls {
        fmt.Println(cleanedHtml)
        fmt.Println("------")
    }
}

How It Works

Function Definition: The cleanHTML function takes an HTML string, a pointer to a sync.WaitGroup, and a send-only channel of strings. It removes the <script> and <style> elements, then sends the remaining contents of <body> to the channel.
Goroutines and Channels: For each HTML string in the htmlStrings slice, we start a goroutine that executes cleanHTML, so the strings are processed concurrently (and in parallel when more than one CPU core is available).
Concurrency Management: The sync.WaitGroup makes the main goroutine wait for every processing goroutine to finish before it closes the cleanedHtmls channel; closing it earlier would make any later send panic. The channel is buffered with room for one result per input, so each goroutine can send without a receiver being ready. With an unbuffered channel, the goroutines would block on send and wg.Wait() would never return.
Output: After all goroutines have finished and the channel is closed, the main goroutine iterates over the channel and prints the cleaned HTML strings. They arrive in the order the goroutines finished, which is not necessarily the input order.

This example demonstrates Go's concurrency model applied to a batch of independent tasks, such as cleaning HTML documents. With goroutines and channels, such tasks run concurrently, which can shorten the total run time when there are many documents and several CPU cores.
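
Starting one goroutine per document works for a handful of inputs. For larger batches, a common alternative is a fixed-size worker pool. The following is a minimal standard-library sketch, not part of the program above: the clean function, the result type, and processAll are hypothetical names, and clean merely trims whitespace as a stand-in for the real HTML cleaning. It also returns errors to the caller instead of calling log.Fatal, which would end the whole program from inside a goroutine.

```go
package main

import (
    "errors"
    "fmt"
    "strings"
    "sync"
)

// result pairs one input with its output or the error it produced.
type result struct {
    input  string
    output string
    err    error
}

// clean is a hypothetical stand-in for the real per-document work.
// It trims whitespace and rejects empty input.
func clean(doc string) (string, error) {
    trimmed := strings.TrimSpace(doc)
    if trimmed == "" {
        return "", errors.New("empty document")
    }
    return trimmed, nil
}

// processAll runs clean over docs using at most `workers` goroutines.
// Results come back in completion order, not input order.
func processAll(docs []string, workers int) []result {
    if workers < 1 {
        workers = 1
    }
    jobs := make(chan string)
    results := make(chan result, len(docs)) // room for every result
    var wg sync.WaitGroup

    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for doc := range jobs {
                out, err := clean(doc)
                results <- result{input: doc, output: out, err: err}
            }
        }()
    }

    // Feed the workers, then signal that no more jobs are coming.
    for _, doc := range docs {
        jobs <- doc
    }
    close(jobs)

    wg.Wait()
    close(results)

    collected := make([]result, 0, len(docs))
    for r := range results {
        collected = append(collected, r)
    }
    return collected
}

func main() {
    docs := []string{"  first  ", "", "second"}
    for _, r := range processAll(docs, 2) {
        if r.err != nil {
            fmt.Printf("skipped %q: %v\n", r.input, r.err)
            continue
        }
        fmt.Println(r.output)
    }
}
```

The number of workers caps how many documents are in flight at once, which keeps memory use predictable when the input list is long.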

Step 3: Running the Program

After writing your program, save it as html_clean.go and run it with the go command:

go run html_clean.go

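This prints the cleaned HTML for each document: only the contents of the <body> tag, with all <script> and <style> elements removed, followed by a ------ separator. Because the goroutines finish in no fixed order, the two documents may appear in either order.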
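
Results reach the channel in whatever order the goroutines finish. If the output must follow the input order, one option is to let each goroutine write into its own slot of a pre-sized slice instead of sending on a channel. This is a minimal sketch under that assumption; process and processInOrder are hypothetical names, and process simply upper-cases its input as a stand-in for the cleaning step.

```go
package main

import (
    "fmt"
    "strings"
    "sync"
)

// process is a hypothetical stand-in for the real cleaning step.
func process(doc string) string {
    return strings.ToUpper(doc)
}

// processInOrder runs process on every input concurrently and returns
// the outputs in the same order as the inputs.
func processInOrder(docs []string) []string {
    out := make([]string, len(docs))
    var wg sync.WaitGroup

    for i, doc := range docs {
        wg.Add(1)
        go func(i int, doc string) {
            defer wg.Done()
            // Each goroutine writes only to its own index, so no lock is needed.
            out[i] = process(doc)
        }(i, doc)
    }

    // Wait before reading the slice so every write has completed.
    wg.Wait()
    return out
}

func main() {
    for _, s := range processInOrder([]string{"document 1", "document 2"}) {
        fmt.Println(s)
    }
}
```

Here the index carries the ordering, so the goroutines can still finish in any order without affecting the result.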

Conclusion

While Go may not replace Python as the de facto language for data science anytime soon, it offers compelling features that make it suitable for certain data science tasks, especially those requiring high performance and concurrency. By leveraging Go's strengths and the growing ecosystem of libraries, data scientists can tackle a wide range of problems efficiently. The example provided illustrates just one of the many ways Go can be utilized in the field of data science, opening the door for further exploration and innovation.


Noah Parker