Python and Spark

Hard Prerequisites
  • PROJECTS: Data Wrangling
  • PROJECTS: RabbitMQ
  • PROJECTS: SQL

    As a Data Engineer, you will often need to process large datasets. In a fast-paced world, the speed of that processing matters, and as a result there exist various tools that help Data Engineers process large datasets quickly. Apache Spark is one of them: an open-source, general-purpose distributed processing system used for big data workloads.

    Apache Spark is written in Scala and provides APIs for Scala, Java, Python, R, and SQL.

    But what does that mean? Do you have to use Scala or Java?

    The answer is a simple no. Fortunately for us, PySpark lets us work with Spark from Python. PySpark is the Python API for Apache Spark.
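
    To give a feel for what this looks like, here is a minimal sketch of the same word count written twice: once in plain Python and once with PySpark's DataFrame API. The function names, the sample input, and the local[*] master setting are assumptions chosen for illustration, not part of this project. The PySpark version needs the pyspark package and a Java runtime installed.

```python
from collections import Counter


def count_words_plain(lines):
    """Single-machine baseline: count words with plain Python."""
    words = (word for line in lines for word in line.lower().split())
    return dict(Counter(words))


def count_words_spark(lines):
    """The same count expressed with PySpark's DataFrame API (illustrative)."""
    # Imported here so the plain-Python baseline works without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # "local[*]" runs Spark on this machine using all cores (an assumption;
    # a real cluster would be configured differently).
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("word-count-sketch")
        .getOrCreate()
    )
    try:
        # One row per input line, in a single string column called "line".
        df = spark.createDataFrame([(line,) for line in lines], schema="line string")
        counts = (
            # Lowercase each line, split on whitespace, one row per word.
            df.select(F.explode(F.split(F.lower(F.col("line")), r"\s+")).alias("word"))
            # Drop empty strings left by leading or trailing whitespace.
            .where(F.col("word") != "")
            .groupBy("word")
            .count()
        )
        return {row["word"]: row["count"] for row in counts.collect()}
    finally:
        spark.stop()
```

    Calling count_words_plain(["Spark is fast", "spark is distributed"]) counts "spark" and "is" twice each. count_words_spark is written to produce the same counts, but Spark can spread the work across many machines when the data no longer fits on one.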

    Go forth and learn.

    Resources

    This is a good tutorial to get you started with PySpark. It’ll take you from zero to hero.

    There is also a YouTube tutorial on Apache Spark.

    Pick whichever one you’re more comfortable with, or work through both if you can.

