How to start web scraping with Go

May 16, 2024

You can write web scraping scripts in Go (also called Golang) and run them concurrently, which improves overall scraping speed. Go's built-in concurrency features, such as goroutines and channels, make this possible. Moreover, Go compiles for popular platforms, including Linux, macOS, and Windows, so you can write a web scraping script once and run it on multiple platforms.
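As a rough illustration of the concurrency point, here is a minimal sketch that downloads several pages at once using goroutines and a sync.WaitGroup. The helper names fetchAll and httpGet are made up for this example, and a local test server stands in for real websites so the code runs without touching anyone's site.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// fetchAll calls fetch for every URL in its own goroutine and collects the
// results in a map keyed by URL. Failed fetches are skipped.
func fetchAll(urls []string, fetch func(string) (string, error)) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			body, err := fetch(u)
			if err != nil {
				return
			}
			mu.Lock() // the map is shared, so guard each write
			results[u] = body
			mu.Unlock()
		}(u)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

// httpGet downloads a page and returns its body as a string.
func httpGet(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	// A local test server stands in for real websites (an assumption of this sketch).
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "page %s", r.URL.Path)
	}))
	defer srv.Close()

	urls := []string{srv.URL + "/a", srv.URL + "/b", srv.URL + "/c"}
	for u, body := range fetchAll(urls, httpGet) {
		fmt.Println(u, "->", body)
	}
}
```

Passing the fetch function as a parameter is a design choice for this sketch: it keeps the concurrency logic separate from the HTTP code, so either part can be swapped or tested on its own.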

Besides concurrency features and cross-platform compilation, Go has several advantages in web scraping. So, how do you start web scraping with Go? Keep exploring to find out!

Understand the building blocks of a website

Before you begin scraping data for various purposes, including monitoring product prices, building research datasets, or analyzing trends, you must understand a website’s structure. This means learning and understanding the basics of HTML (Hypertext Markup Language) and CSS (Cascading Style Sheets).

HTML defines a web page’s structure and content, while CSS defines its aesthetic. Learn the purpose of particular HTML tags and elements. Common HTML elements for web scraping include links, tables, images, and text content like headings and paragraphs. Also, explore CSS selectors to discover how you can use them to target and scrape specific HTML elements. 

It also helps to learn how to use your browser's developer tools to inspect the elements of a web page. With them, you can run CSS or XPath selector queries to locate the elements containing the data you want to scrape. You can also do more with these tools, like testing selectors or debugging a web scraping script.

Set up a Golang programming environment

To scrape a website effectively with Go, you need a Go programming environment and, for some targets, proxy servers configured in your Go code. Why proxy servers? Proxies can help you bypass anti-scraping mechanisms, such as IP-based blocking, put in place by some websites.

Setting up both the Go environment and your proxies before you write any scraping code reduces the chances of encountering errors or unexpected IP blocks.
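One way to route requests through a proxy with only the standard library is sketched below. The helper name newProxyClient and the proxy address http://127.0.0.1:8080 are placeholders invented for this example; substitute the address of a proxy you are allowed to use.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newProxyClient returns an HTTP client that sends every request through
// the proxy at proxyAddr.
func newProxyClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   15 * time.Second, // avoid hanging on a dead proxy
	}, nil
}

func main() {
	// Hypothetical proxy address; replace it with a proxy you control.
	client, err := newProxyClient("http://127.0.0.1:8080")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	// Calls such as client.Get(targetURL) would now go through the proxy.
	fmt.Printf("client ready, timeout %s\n", client.Timeout)
}
```

The sketch only builds the client and does not send a request, so it runs even when no proxy is listening at the placeholder address.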

Now, to set up a Go development environment, start by installing Go on your machine. Visit the official Go website and download the appropriate installer for your operating system (Linux, Windows, macOS). Then, install Go and type this command into the terminal to verify whether you’ve set up Go correctly. 

Command: go version 

The command returns details about the installed Go version. For example: go version go1.18.4 darwin/amd64

Once that’s done, choose and install an IDE (Integrated Development Environment). An IDE gives you access to tools for writing, debugging, and managing Go code, streamlining development tasks, and enhancing productivity. 

For this step-by-step process, we’ll focus on using Visual Studio Code (VS Code). VS Code is a lightweight and popular IDE. To use it, visit its website, download, and install it. Next, navigate to VS Code’s extension marketplace and install the “Go” extension.

VS Code will also prompt you to install Go tools such as gopls, the Go language server. Accept the prompt and install the tools to enable features such as code formatting and completion. After setting up VS Code to write, run, and manage Go code, proceed to set up a project.

Set up a Go project

Create a main project folder (my-web-scraper-project) for shared resources, including settings and dependencies. VS Code reads workspace settings from a .vscode folder, so within the main project folder, create a .vscode/settings.json file and add these settings:

{
  "go.useLanguageServer": true,
  "go.toolsEnvVars": {"GO111MODULE": "on"}
}

The go.useLanguageServer setting allows you to disable or enable the Go language server (gopls) within VS Code. Setting it to true ensures you can use gopls features, including code formatting, diagnostics, and completion.

The go.toolsEnvVars setting lets you set environment variables for Go tools like gopls. GO111MODULE is one such environment variable. Setting it to the value "on" tells Go to use modules to manage dependencies.

After adding the settings to the settings.json file, check that the Go workspace (GOPATH) is set up correctly. Run this command in the terminal to confirm this:

Command: go env GOPATH 

If the Go workspace is not set, add it to the system environment variables. If all is well, create a new folder (for example, my-go-web-scraper) within the main project folder to host your web scraper scripts. Then, open your terminal, navigate to the folder, and initialize Go modules. Use this command:

Command: go mod init mygowebscraper

This command adds a go.mod file to manage project dependencies. Once that’s done, you are ready to write your first Go web scraping script. 

Write and run your first Go web scraping script

As mentioned, using Go for web scraping is advantageous due to its features. You won't have to write the whole script from scratch at this phase. The Go ecosystem offers third-party web scraping libraries and frameworks you can use to speed up the scripting process. Some popular options include Colly, GoQuery, and Selenium bindings for Go.

For this tutorial, we’ll use Colly to demonstrate how you can prepare a simple Go web scraping script. 

So, head to your terminal and key in this command to install Colly. 

Command: go get github.com/gocolly/colly/v2

Navigate to the folder you’d prepared to host your web scraper and create a new Go file like first-scraper.go and add the following code to the file:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // Set the target URL
    targetURL := "https://example.com"

    // Register a callback that runs for every <h1> element on the page
    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    // Fetch the page; the callback above fires while the response is parsed
    err := c.Visit(targetURL)
    if err != nil {
        fmt.Println("Error:", err)
    }
}

Replace the value of targetURL with the URL of the website you want to scrape. The script above prints the text within each <h1> tag of the web page.

After writing the script, save it and run it, for example with go run first-scraper.go. The heading of the web page should be printed on the console, signaling success.

As you write Go web scraping scripts with greater capabilities, always be ethical. Respect the target website’s terms of use and the scraping limits defined in the robots.txt files. Also, avoid overloading website servers with numerous requests, as doing so may lead to legal issues.
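To make the point about not overloading servers concrete, the sketch below spaces requests out with a fixed pause. The helper name visitPolitely, the 200 ms delay, and the example.com paths are illustrative assumptions; a suitable delay depends on the target site, and the sketch only prints instead of sending real requests.

```go
package main

import (
	"fmt"
	"time"
)

// visitPolitely calls visit for each URL in order, pausing for delay
// between consecutive calls so the target server is not flooded.
func visitPolitely(urls []string, delay time.Duration, visit func(string)) {
	for i, u := range urls {
		if i > 0 {
			time.Sleep(delay) // no pause is needed before the first request
		}
		visit(u)
	}
}

func main() {
	// Placeholder URLs; nothing is actually requested in this sketch.
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}
	start := time.Now()
	visitPolitely(urls, 200*time.Millisecond, func(u string) {
		// A real scraper would fetch the page here, for example with c.Visit(u).
		fmt.Printf("%v visiting %s\n", time.Since(start).Round(100*time.Millisecond), u)
	})
}
```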

Conclusion

Golang has a robust standard library with packages for handling HTTP requests, parsing XML, and working with regular expressions, and the supplementary golang.org/x/net/html package parses HTML. You also have access to third-party web scraping libraries and frameworks such as Colly, which means you don't have to write web scraping scripts from scratch.
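For comparison with the Colly script, here is a small sketch that relies only on the standard library: net/http fetches a page and regexp pulls out the <h1> text. The helper name extractH1 and the sample HTML are invented for this example, and a local test server stands in for a real site. Regular expressions are fragile on real-world HTML, so treat this as a demonstration of the packages rather than a recommended parsing approach.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
)

// h1Pattern matches the contents of simple <h1> elements, ignoring case
// and allowing the content to span multiple lines.
var h1Pattern = regexp.MustCompile(`(?is)<h1[^>]*>(.*?)</h1>`)

// extractH1 returns the trimmed text of every <h1> element found in html.
func extractH1(html string) []string {
	var titles []string
	for _, m := range h1Pattern.FindAllStringSubmatch(html, -1) {
		titles = append(titles, strings.TrimSpace(m[1]))
	}
	return titles
}

func main() {
	// A local test server stands in for a real website (an assumption of this sketch).
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><h1>Example heading</h1><p>Some text</p></body></html>")
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	for _, title := range extractH1(string(body)) {
		fmt.Println("Title:", title)
	}
}
```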
