Project information

  • Category: Big Data Skillset
  • Description: Course final exam project
  • Project date: December 2022
  • Project URL: GitHub

Design a data schema and perform knowledge discovery in databases (KDD) on the Simaster UGM academic system

Simaster UGM is the student academic system of Universitas Gadjah Mada.

Big Data Skillset:
  1. Analyze business processes and data requirements.
  2. Design a database star schema.
  3. Perform data mining for knowledge discovery.
Steps:
  1. Analyze business processes and data requirements

    Figure: Simaster UGM menu display


    Business Process: University graduates have a social and intellectual responsibility to become members of society with high competencies, including both hard skills and soft skills. Therefore, a university needs to help its graduates with good career planning. Simaster UGM has Academic Affairs and Career Alumni menus that strongly support students in career planning. The process chosen for designing the star database schema is the career planning process. The design represents the Symphony submenu, together with additional data needed such as study results data.

    Data Requirements
    Profile menu: name, NIM, study program, level, and class.
    Study results submenu: semester and year period, course code, course name, credits, and grade.
    Curriculum Vitae submenu: ID number, the name of the experience (e.g., the name of a training attended or an internship position), a descriptive explanation, and the time period of the experience.

  2. Design database star schema

    Tools: Draw.io
    The grain of the fact table is a single student activity inside or outside lectures.
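    As a rough sketch (not the actual Draw.io design), the star schema assumed by the queries in step 3 can be written out in Python; the names fact_aktivitas, dim_mahasiswa, dim_matkul, id_periode, nilai_matkul, sks_matkul, judul_pengalaman, and desk_pengalaman come from those queries, while the remaining fields are illustrative assumptions based on the data requirements above:

    import pandas as pd

    # Dimension tables (assumed columns, following the profile and study results data)
    dim_mahasiswa = pd.DataFrame(columns=["id_mahasiswa", "nama_mahasiswa", "program_studi", "jenjang", "kelas"])
    dim_matkul = pd.DataFrame(columns=["id_matkul", "nama_matkul", "sks_matkul"])
    dim_periode = pd.DataFrame(columns=["id_periode", "semester", "tahun"])

    # Fact table: one row per student activity (a course taken, an internship, a training, ...)
    fact_aktivitas = pd.DataFrame(columns=[
        "id_mahasiswa",      # FK -> dim_mahasiswa
        "id_matkul",         # FK -> dim_matkul (empty for non-course activities)
        "id_periode",        # FK -> dim_periode
        "nilai_matkul",      # course grade, when the activity is a course
        "judul_pengalaman",  # experience title, when the activity is a CV entry
        "desk_pengalaman",   # experience description
    ])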

  3. Perform data mining for knowledge discovery

    Tools: SQL, Python, big data tools (Apache, etc.)

    Classification for predicting students' CGPA
    SELECT fa.id_mahasiswa AS "NIU", dm.nama_mahasiswa AS "Nama", SUM(dmk.sks_matkul) AS "sks", SUM(fa.nilai_matkul * dmk.sks_matkul) AS "total_nilai"
    FROM fact_aktivitas fa
      JOIN dim_mahasiswa dm ON fa.id_mahasiswa = dm.id_mahasiswa JOIN dim_matkul dmk ON fa.id_matkul = dmk.id_matkul  -- assumed join keys
    GROUP BY fa.id_mahasiswa, dm.nama_mahasiswa;

    Based on the stages of data selection, pre-processing, data cleaning, and transformation, the dataset needed to classify students into CGPA predicate groups is as follows:
    The dataset, containing ID, name, and GPA (IP) per semester, is split into a training set and a test set. An algorithm commonly used for classification is K-Nearest Neighbour (KNN). In this algorithm, the classes already present in the training data are used to predict unclassified data: the algorithm finds the K training samples closest to a test sample and assigns the majority class label among them. The best k value is chosen by testing which k gives the highest accuracy on the test data, comparing the classification results against the already classified data. The following is an example of the process of finding the k value with the highest accuracy.
    Predictions are carried out at the end of each semester to forecast the CGPA. The forecast helps students prepare a strategy for achieving their desired CGPA target, so they are motivated to improve their grades in the next semester.
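    As a minimal sketch of this step with scikit-learn (the file name, the per-semester GPA columns, and the predicate label are hypothetical; only the KNN approach and the search for the best k come from the text above):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Hypothetical dataset: one GPA (IP) column per semester plus a CGPA predicate label
    df = pd.read_csv("dataset_ipk.csv")  # columns: NIU, Nama, ip_smt1..ip_smt6, predikat
    X = df[[c for c in df.columns if c.startswith("ip_smt")]]
    y = df["predikat"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Try several k values and keep the one with the highest accuracy on the test data
    best_k, best_acc = None, 0.0
    for k in range(1, 11):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc > best_acc:
            best_k, best_acc = k, acc
    print("best k:", best_k, "accuracy:", best_acc)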


    Clustering of career path courses
    SELECT id_matkul, COUNT(id_mahasiswa) AS "jumlah_peserta"
    FROM fact_aktivitas GROUP BY id_matkul;

    An algorithm commonly used for clustering is K-Means. This algorithm begins by determining the desired number of clusters for the existing dataset. In this experiment, the value k = 3 was chosen, so 3 clusters are formed: C1, C2, and C3. Courses are grouped into the categories unfavorite, moderate, and favorite based on the number of participants. The initial centroids are determined randomly, for example C1 = 12 students, C2 = 30 students, and C3 = 80 students. Then, using the Euclidean distance formula, the distance between each data point and each centroid is calculated.

    Each course is assigned to the centroid with the minimum distance to its number of students. The K-Means algorithm works iteratively, updating the centroids until the cluster members no longer change. The centroid value for the next iteration is calculated as the average of the cluster members from the previous iteration.

    Clustering in the first iteration
    Unfavorite: TIF215112, TIF215122, TIF215124, TIF215134
    Moderate: TIF215111, TIF215114, TIF215121, TIF215131, TIF215132, TIF215133
    Favorite: TIF215113, TIF215123
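    A minimal sketch of this clustering in Python (the per-course participant counts are hypothetical; k = 3 and the initial centroids 12, 30, and 80 are the example values from the text):

    import numpy as np
    from sklearn.cluster import KMeans

    # jumlah_peserta per course, e.g. from the query above (hypothetical counts)
    courses = ["TIF215111", "TIF215112", "TIF215113", "TIF215123"]
    counts = np.array([[35.0], [10.0], [78.0], [82.0]])

    # k = 3 clusters (unfavorite, moderate, favorite) with the example initial centroids
    init_centroids = np.array([[12.0], [30.0], [80.0]])
    km = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(counts)

    for course, label in zip(courses, km.labels_):
        print(course, ["C1", "C2", "C3"][label])
    print("final centroids:", km.cluster_centers_.ravel())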

    Forecast of GPA
    SELECT fa.id_mahasiswa AS "NIU", SUM(dmk.sks_matkul) AS "sks", SUM(fa.nilai_matkul * dmk.sks_matkul) AS "total_nilai"
    FROM fact_aktivitas fa JOIN dim_matkul dmk ON fa.id_matkul = dmk.id_matkul  -- assumed join key
    WHERE fa.id_mahasiswa = 440305 AND fa.id_periode = 20191
    GROUP BY fa.id_mahasiswa;

    The CGPA cannot be raised instantly; a stable and satisfactory GPA is needed every semester. GPA is a variable that responds to time in semester periods, so it can be forecast with a time series method. The GPA series does not form a strong trend but changes smoothly, so the appropriate method is exponential smoothing. In exponential smoothing, the values used to forecast the GPA for the next semester are the previous GPA and the previous forecast. This minimizes the effect of other factors that influence the GPA in a certain period (outliers) on the forecast calculation. The GPA of the past period is given a weight α, where 0 ≤ α ≤ 1.
    Forecast(t) = Forecast(t-1) + α * (Actual(t-1) - Forecast(t-1))
    The following is an example of a forecast calculation with a weight of α = 0.2:

    Based on the exponential smoothing forecast, the predicted Semester 7 GPA is 3.46. The forecast results for the previous semesters are close to the actual data, so the forecast can be said to represent the GPA of the student with NIU 440305.
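    A minimal sketch of this update rule in Python (the per-semester GPA values are hypothetical; only the weight α = 0.2 and the update rule come from the text above):

    def exponential_smoothing(actuals, alpha=0.2):
        """One-step-ahead forecasts; the first forecast is initialised to the first actual value."""
        forecasts = [actuals[0]]
        for t in range(1, len(actuals) + 1):
            forecasts.append(forecasts[-1] + alpha * (actuals[t - 1] - forecasts[-1]))
        return forecasts  # the last element is the forecast for the next (unseen) semester

    # Hypothetical GPA (IP) per semester for one student
    ip_per_semester = [3.20, 3.35, 3.40, 3.50, 3.45, 3.55]
    print("next-semester forecast:", round(exponential_smoothing(ip_per_semester)[-1], 2))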

    Association rules between courses
    SELECT id_mahasiswa AS "NIU", id_matkul
    FROM fact_aktivitas ORDER BY id_mahasiswa;

    To run the association rule mining, the courses taken by each student need to be converted into a cross-tabulated form (one row per student, one column per course):


    Many students take courses TIF215123 and TIF215132 at the same time, while the course taken by almost all students is TIF215132.
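    A minimal sketch of this cross-tabulation and of counting how often course pairs are taken together (the enrolment rows are hypothetical stand-ins for the result of the query above):

    import pandas as pd
    from itertools import combinations

    # Hypothetical NIU / id_matkul pairs, as returned by the query above
    enrol = pd.DataFrame({
        "NIU":       [440301, 440301, 440302, 440302, 440303, 440303],
        "id_matkul": ["TIF215123", "TIF215132", "TIF215123", "TIF215132", "TIF215132", "TIF215113"],
    })

    # Cross-tabulate: one row per student, one 0/1 column per course
    basket = pd.crosstab(enrol["NIU"], enrol["id_matkul"]).clip(upper=1)

    # Support of each course pair = share of students taking both courses together
    for a, b in combinations(basket.columns, 2):
        support = ((basket[a] == 1) & (basket[b] == 1)).mean()
        print(a, b, round(support, 2))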

    Text mining of student skills
    The judul_pengalaman and desk_pengalaman fields can be used for text mining. Text mining can be done to look for the skills or expertise that students have, based on experiences such as internships, projects, research, etc.
    The Term Frequency-Inverse Document Frequency (TF-IDF) method is a way of weighting the relationship of a word (term) to a document. It combines two weighting concepts: an expertise term that often appears in the titles or descriptions of a student's experiences indicates that student's expertise, while an expertise term that appears in many other students' experiences receives a small weight. So, the weight of each word is: w = TF * IDF

    Based on the calculation above, the order of the student's expertise terms from greatest weight is: technology, present, design, lead, and software.
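    A minimal sketch of this weighting with scikit-learn, treating each student's combined judul_pengalaman and desk_pengalaman text as one document (the example texts are hypothetical):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # One document per student: judul_pengalaman + desk_pengalaman combined (hypothetical texts)
    documents = [
        "software design lead technology internship present design technology",
        "research data analysis laboratory assistant research",
        "marketing event committee public speaking",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)

    # Rank the first student's terms by TF-IDF weight, highest first
    terms = vectorizer.get_feature_names_out()
    weights = tfidf[0].toarray().ravel()
    ranking = sorted(zip(terms, weights), key=lambda pair: -pair[1])
    print([term for term, weight in ranking if weight > 0])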