Posts

Map Dataframe

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

object SparkProject {
  def main(args: Array[String]): Unit = {
    // Set log levels
    org.apache.log4j.LogManager.getLogger("org").setLevel(org.apache.log4j.Level.ERROR)
    org.apache.log4j.LogManager.getLogger("akka").setLevel(org.apache.log4j.Level.ERROR)

    // Create a Spark session
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    // Define the schema for the DataFrame
    val schema = StructType(Seq(
      StructField("name", StringType, true),
      StructField("songs", MapType(StringType, StringType, true), true)
    ))

    // Create a Seq of Rows representing the data
    val data = Seq(
      Row("sublime", Map("good_song" -> "santeria",
    // (the snippet is cut off at this point in the original post)
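The snippet above is truncated in the original post. As a minimal sketch of how it could be finished (the map entries and the second row below are illustrative placeholders, not the original data), the DataFrame can be built from the schema and the map column queried like this:

    // Illustrative rows; the original data is truncated in the post
    val data = Seq(
      Row("sublime", Map("good_song" -> "santeria", "bad_song" -> "doesnt exist")),
      Row("prince_royce", Map("good_song" -> "darte un beso"))
    )

    // Build the DataFrame from the Rows and the schema defined above
    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
    df.printSchema()
    df.show(false)

    // Look up a single key inside the map column
    df.select(df("name"), df("songs").getItem("good_song").alias("good_song")).show(false)

    spark.stop()
  }
}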

Jupyter Notebook setup

1) Don't use the root user except for installing Python.
2) Install Jupyter (don't run this command as root):
python3 -m pip install --user jupyterlab
3) Start Jupyter with the command below to make it accessible from a remote browser (on my system I used port 8888; the example below uses 8080):
jupyter notebook --no-browser --port=8080 --ip=0.0.0.0

When there is a configuration error, this link contains the solution

http://www.alternatestack.com/development/intellij-add-new-scala-class-option-not-available/

IntelliJ Class not Found

While running my first Spark program with Scala, I started getting an error message that blocked my learning. I spent the whole night figuring out the problem, and fortunately I was able to resolve the issue. Below is the solution:
1) Close the project.
2) Close IntelliJ.
3) Check your environment variables. (I was missing HADOOP_HOME.)
4) Restart IntelliJ.
5) Open the project.
6) Wait for the sbt configuration to complete. (This was the other thing I was missing.)
7) The project will now run.
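For reference, a minimal build.sbt for a Spark-with-Scala project might look like the sketch below; the project name and the Scala and Spark versions are assumptions, so use whatever matches your setup.

name := "SparkProject"

scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",  // assumed Spark version
  "org.apache.spark" %% "spark-sql"  % "3.5.0"
)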

Running a PySpark Program from a File

If you have created a file containing a PySpark program and need to run it, you can do so through Spark's spark-submit utility, which is located at ./spark/bin/spark-submit <FileName.py>
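For example, assuming the script is saved as my_job.py (a placeholder name) and you want to run it on all local cores, the invocation might look like:
./spark/bin/spark-submit --master local[*] my_job.py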

Create File URL

A file URL can be created by prefixing the absolute path of the file with the file scheme: file:///PATH_TO_FILE
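As a quick sketch (the path below is just a placeholder), such a URL can be passed straight to Spark's readers:

val df = spark.read.text("file:///home/user/data.txt")
df.show(false)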

Running SQLContext

val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val cataDF = sqlcontext.read.format("jdbc")
  .option("url", "jdbc:vertica://172.16.67.241:5433/vertica_db")
  .option("driver", "com.vertica.jdbc.Driver")
  .option("dbtable", "DT1_0_8_OOB.Char1_Table")
  .option("user", "release")
  .option("password", "gl")
  .load()
cataDF.show()
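In Spark 2.x and later, SQLContext is superseded by SparkSession, so the same JDBC read can be done directly from the session. A sketch, reusing the connection options above:

val cataDF = spark.read.format("jdbc")
  .option("url", "jdbc:vertica://172.16.67.241:5433/vertica_db")
  .option("driver", "com.vertica.jdbc.Driver")
  .option("dbtable", "DT1_0_8_OOB.Char1_Table")
  .option("user", "release")
  .option("password", "gl")
  .load()
cataDF.show()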